
Learn rust by building a simple load test tool

2023-06-15

After rebuilding my recipe website in rust, I thought I knew enough to use rust to build more stuff, or at least would not spend too much time on it. So I tried to build a load test tool, and I found rust is way more complex than I thought: even though I have already finished the tool, some of the code is still not very clear to me.

About the rust code structure for the load test tool

I use the clap crate to build the CLI tool so it can parse the arguments for me. Below are the options:

Usage: fun-bomb [OPTIONS] [URLS]...

Arguments:
  [URLS]...  Urls for load testing, for example: url1 url2

Options:
  -c, --connections <CONNECTIONS>      Number of concurrent connections for sending the requests [default: 10]
  -r, --requests <REQUESTS>            Number of total bomb requests to be sent
  -d, --duration <DURATION>            Duration in seconds for sending the bomb requests. When requests and duration are both provided, only duration is used; when neither is provided, a duration of 10 seconds is used
  -w, --warm-requests <WARM_REQUESTS>  Number of requests to be sent for warmup [default: 100]
      --delay <DELAY>                  Delay in seconds between bombing different urls [default: 10]
      --status-code <STATUS_CODE>      Expected status code [default: 200]
  -m, --method <METHOD>                Method for the request: GET, PUT etc [default: GET]
      --header <HEADER>                Headers for the request: --header key=value
      --body <BODY>                    Body in string for the request
      --verbose                        Verbose information
  -h, --help                           Print help
  -V, --version                        Print version
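
These options map onto a clap derive struct; roughly like this (a simplified sketch, the real field names and types in fun-bomb may differ):

use clap::Parser;

// A simplified sketch of the options struct, reconstructed from the help text.
#[derive(Parser)]
#[command(name = "fun-bomb", version)]
struct Options {
    /// Urls for load testing, for example: url1 url2
    urls: Vec<String>,

    /// Number of concurrent connections for sending the requests
    #[arg(short, long, default_value_t = 10)]
    connections: u16,

    /// Number of total bomb requests to be sent
    #[arg(short, long)]
    requests: Option<u64>,

    /// Duration in seconds for sending the bomb requests
    #[arg(short, long)]
    duration: Option<u64>,

    /// Number of requests to be sent for warmup
    #[arg(short, long, default_value_t = 100)]
    warm_requests: u32,

    /// Delay in seconds between bombing different urls
    #[arg(long, default_value_t = 10)]
    delay: u64,

    /// Expected status code
    #[arg(long, default_value_t = 200)]
    status_code: u16,

    /// Method for the request: GET, PUT etc
    #[arg(short, long, default_value = "GET")]
    method: String,

    /// Headers for the request: --header key=value
    #[arg(long)]
    header: Vec<String>,

    /// Body in string for the request
    #[arg(long)]
    body: Option<String>,

    /// Verbose information
    #[arg(long)]
    verbose: bool,
}

fn main() {
    let options = Options::parse();
    println!("Bombing {} url(s)", options.urls.len());
}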

The cfg attribute used to control which features get compiled is pretty simple and cool; I already know this kind of conditional compilation concept from dotnet. I use it so I can build different requestors based on different http-client crates. I tried two, reqwest and surf, and it turns out that reqwest is faster. The code below creates a requestor according to the feature I enable; by default it is reqwest.

let requestor = match requestor::new(&options, &url) {
    Ok(x) => x,
    Err(e) => {
        print_warning(format!("Create requestor failed: {}", e));
        println!();
        continue;
    }
};
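
Roughly, the feature gating works like this (a simplified sketch; the real module layout may differ):

// A simplified sketch of cfg-based gating; in Cargo.toml, assume features
// like: default = ["use-reqwest"], use-reqwest, use-surf.
// Exactly one of these modules is compiled, so the rest of the code can
// just refer to requestor::Requestor.

#[cfg(feature = "use-reqwest")]
mod requestor {
    pub struct Requestor; // would wrap a reqwest::Client in the real code

    pub fn describe() -> &'static str {
        "reqwest-based requestor"
    }
}

#[cfg(feature = "use-surf")]
mod requestor {
    pub struct Requestor; // would wrap a surf::Client in the real code

    pub fn describe() -> &'static str {
        "surf-based requestor"
    }
}

fn main() {
    println!("using: {}", requestor::describe());
}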

Then I will warm the bomb and start bombing:

println!("Warming bomb for {}", &url);
match bomb::warmup(options.warm_requests, &requestor).await {
    Ok(_) => {
        println!("Bomb is triggered. Waiting for completion ...");

        let metric = match bomb_type {
            BombType::Duration(duration) => {
                bomb::trigger_dy_duration(duration, options.connections, requestor.into()).await
            }
            BombType::TotalRequests(requests) => {
                bomb::trigger_by_total(requests, options.connections, requestor.into()).await
            }
        };

        match metric {
            Ok(metric) => {
                print_metric(&bomb_type, &metric);
                metrics.push((url.to_owned(), metric));
            }
            Err(e) => print_warning(format!("Error happened: {e}")),
        }
    }
    Err(e) => print_error(format!("Error happened: {e}")),
}
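
bomb::warmup itself is not shown in this post; roughly it does something like this (a simplified sketch, using the SendRequest trait that I will show below):

// A simplified sketch of bomb::warmup: fire a fixed number of requests up
// front so connection pools and caches are warm before measuring.
pub async fn warmup<Requestor>(requests: u16, requestor: &Requestor) -> Result<(), String>
where
    Requestor: SendRequest,
{
    for _ in 0..requests {
        // skip the status-code check during warmup; network errors still propagate
        requestor.send(false).await?;
    }
    Ok(())
}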

Below is the code for bomb::trigger_dy_duration, because bombing by duration is simpler:

pub async fn trigger_dy_duration<Requestor>(
    duration: Duration,
    concurrent: u16,
    requestor: Arc<Requestor>,
) -> Result<Metric, String>
where
    Requestor: SendRequest + Send + Sync + 'static,
{
    let is_done = Arc::new(Mutex::new(false));

    let mut tasks = Vec::new();
    let total_timer = Instant::now();

    for _ in 0..concurrent {
        let requestor = requestor.clone();
        let is_done = is_done.clone();

        tasks.push(task::spawn(async move {
            let mut success_results = Vec::<SuccessResult>::new();
            loop {
                // check the shared is_done flag in its own scope so the lock is released as soon as possible
                {
                    let is_done = is_done.lock().await;
                    if *is_done {
                        return success_results;
                    }
                }

                let timer = Instant::now();
                if requestor.send(true).await.is_ok() {
                    success_results.push(SuccessResult {
                        time_nanos: timer.elapsed().as_nanos(),
                    });
                }
            }
        }));
    }

    async_std::task::sleep(duration).await;
    *is_done.lock().await = true;
    let task_results = futures::future::join_all(tasks).await;
    let total_time_sec = total_timer.elapsed().as_secs_f64();

    let success_results: Vec<_> = task_results.iter().flatten().collect();
    if !success_results.is_empty() {
        Ok(calculate_metric(&success_results, total_time_sec))
    } else {
        Err("All requests failed".to_owned())
    }
}
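
calculate_metric is not shown here; roughly it does something like this (a simplified sketch, the Metric field names are my guess based on the printed output, and the types are repeated so the sketch is self-contained):

pub struct SuccessResult {
    pub time_nanos: u128,
}

pub struct Metric {
    pub success_count: usize,
    pub rps: f64,
    pub avg_ms: f64,
    pub p99_ms: f64,
    pub min_ms: f64,
    pub max_ms: f64,
}

// Derive RPS and latency stats from the per-request timings.
// The caller guarantees `results` is non-empty.
fn calculate_metric(results: &[&SuccessResult], total_time_sec: f64) -> Metric {
    let mut nanos: Vec<u128> = results.iter().map(|r| r.time_nanos).collect();
    nanos.sort_unstable();

    let to_ms = |n: u128| n as f64 / 1_000_000.0;
    // index of the 99th percentile sample in the sorted timings
    let p99_index = ((nanos.len() as f64 * 0.99) as usize).min(nanos.len() - 1);

    Metric {
        success_count: nanos.len(),
        rps: nanos.len() as f64 / total_time_sec,
        avg_ms: to_ms(nanos.iter().sum::<u128>() / nanos.len() as u128),
        p99_ms: to_ms(nanos[p99_index]),
        min_ms: to_ms(nanos[0]),
        max_ms: to_ms(*nanos.last().unwrap()),
    }
}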

At first, I wanted to use a simple closure to wrap the requestor, something like below:

pub async fn trigger_dy_duration(
    duration: Duration,
    concurrent: u16,
    requestor: impl Fn() -> dyn Future<Output = Result<(), String>>,
) -> Result<Metric, String>

But the compiler was always unhappy when I called the requestor. I asked GPT, I googled, I tried Box and Pin, and in the end I still did not know how to achieve it, especially when calling it inside tokio::spawn.
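
For reference, the shape that can make the closure version compile is a closure returning a boxed, pinned future; a minimal sketch (assuming tokio and the futures crate, not the code I ended up using):

use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;

// `dyn Future` is unsized, so the closure cannot return it directly;
// boxing and pinning the future erases the concrete type.
type BoxedFuture = Pin<Box<dyn Future<Output = Result<(), String>> + Send>>;

async fn run<F>(concurrent: u16, requestor: Arc<F>)
where
    F: Fn() -> BoxedFuture + Send + Sync + 'static,
{
    let mut tasks = Vec::new();
    for _ in 0..concurrent {
        let requestor = requestor.clone();
        tasks.push(tokio::spawn(async move {
            // each call produces a fresh boxed future that can be awaited
            let _ = requestor().await;
        }));
    }
    futures::future::join_all(tasks).await;
}

#[tokio::main]
async fn main() {
    let requestor = Arc::new(|| -> BoxedFuture { Box::pin(async { Ok(()) }) });
    run(2, requestor).await;
}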

In the end, I defined a trait SendRequest to make it work:

pub async fn trigger_dy_duration<Requestor>(
    duration: Duration,
    concurrent: u16,
    requestor: Arc<Requestor>,
) -> Result<Metric, String>
where
    // Send + Sync + 'static are needed because the requestor is shared across
    // spawned tasks, which may run on other threads and must not borrow short-lived data.
    // I still do not totally understand this constraint: after reading the docs
    // I understand it, but when I want to use it somewhere else, I am confused again.
    Requestor: SendRequest + Send + Sync + 'static,

Below is what the SendRequest trait looks like:

#[async_trait]
pub trait SendRequest {
    async fn send(&self, check_status_code: bool) -> Result<(), String>;
}

Below is the code implementing the trait using reqwest:

#[async_trait]
impl SendRequest for Requestor {
    async fn send(&self, check_status_code: bool) -> Result<(), String> {
        // try_clone returns None only when the request body is a stream,
        // so unwrap is fine for the simple string bodies used here
        match self.client.execute(self.request.try_clone().unwrap()).await {
            Ok(response) => {
                if !check_status_code || response.status().as_u16() == self.expected_status {
                    Ok(())
                } else {
                    Err(format!(
                        "Response status code is not: {}",
                        self.expected_status
                    ))
                }
            }
            Err(e) => Err(format!("Request failed: {e}")),
        }
    }
}

I also spent a lot of time implementing SendRequest for other http-client crates. With reqwest it was not hard, but when I tried other crates like awc, ureq etc., the compiler errors really fucked my brain, which led me to think that those crates just cannot be used here (what can I do πŸ€·β€β™‚οΈ).

Start the load test with 50 concurrent connections

Finally, I could compile my code and run some tests. But the results were not as good as I expected.

I have a simple rust axum server which serves a /hi route like below:

route("/hi", get(|| async { "world" }))

Tested by wrk:

./wrk -c50 -t10 http://localhost:3000/hi
Running 10s test @ http://localhost:3000/hi
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   176.54us  157.09us   9.98ms   94.02%
    Req/Sec    29.64k     5.46k   39.18k    59.07%
  2976333 requests in 10.10s, 343.45MB read
Requests/sec: 294697.32
Transfer/sec:     34.01MB

Tested by my fun-bomb with tokio runtime

cargo run -r -- http://localhost:3000/hi -c 50
Warming bomb for http://localhost:3000/hi
Bomb is triggered. Waiting for completion ...
Duration: 10.00sec, Success count: 722033, RPS: 72201/s
AVG: 0.11ms, P99: 0.10ms, Min: 0.01ms, Max: 18.81ms

Tested by my fun-bomb with async-std runtime

cargo run -r -- http://localhost:3000/hi -c 50
Warming bomb for http://localhost:3000/hi
Bomb is triggered. Waiting for completion ...
Duration: 10.00sec, Success count: 1532503, RPS: 153241/s
AVG: 0.33ms, P99: 0.32ms, Min: 0.03ms, Max: 15.73ms

Tested by drill, with the plan below:

base: "http://localhost:3000"
concurrency: 50
iterations: 100000
rampup: 10

plan:
  - name: "hi"
    request:
      url: "/hi"

Time taken for tests      3.9 seconds
Total requests            100000
Successful requests       100000
Failed requests           0
Requests per second       25947.54 [#/sec]

Evaluation summary

Because the server endpoint is so simple, when running the load test tool on the same machine, the tool itself becomes the bottleneck, because it is fighting for CPU.

I also tried an endpoint like /hi-lazy which sleeps for some time; then the results from all the different load tools look similar, because in that case the server is the bottleneck.

They are all bombing with 50 concurrent connections:

Tool                          RPS
wrk                           294_697
my fun-bomb with tokio        72_201
my fun-bomb with async-std    153_241
drill                         25_947
my csharp implementation      213_216

From the results, wrk is the fastest, because it is so efficient and consumes the least CPU. My csharp implementation gives me hope for dotnet 🀣.

One thing to notice is that with 50 concurrent connections, the async-std runtime is faster than tokio.

Below is the summary for 10 concurrent connections:

Tool                          RPS
wrk                           112_923    2.6
my fun-bomb with tokio        80_351     0.9
my fun-bomb with async-std    49_717     3.1
drill                         4_525      5.7
my csharp implementation      78_679     2.7

In this case, tokio is faster than async-std. What the fuck... I really do not know what is going on. Maybe there is some configuration that needs to be set.

2023-06-26: I found the reason: I misused tokio's Mutex, which slowed tokio down. At first, going by the namespace, I thought tokio's Mutex was optimized for use in a tokio context, but it turns out I was wrong.
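
A cheaper way to signal completion is an atomic flag instead of an async Mutex; a minimal sketch of the pattern (assuming tokio):

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

// A minimal sketch: the done flag as an AtomicBool instead of a Mutex,
// so the hot loop only pays for a relaxed load instead of a lock.
#[tokio::main]
async fn main() {
    let is_done = Arc::new(AtomicBool::new(false));

    let mut tasks = Vec::new();
    for _ in 0..4 {
        let is_done = is_done.clone();
        tasks.push(tokio::spawn(async move {
            let mut iterations = 0u64;
            while !is_done.load(Ordering::Relaxed) {
                tokio::task::yield_now().await; // stands in for sending a request
                iterations += 1;
            }
            iterations
        }));
    }

    tokio::time::sleep(Duration::from_secs(1)).await;
    is_done.store(true, Ordering::Relaxed);

    for task in tasks {
        println!("iterations: {}", task.await.unwrap());
    }
}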

wrk is insanely fast, and my csharp implementation performs as I expected. Drill, an HTTP load testing application written in Rust, is disappointing to me. My fun-bomb rust implementation is confusing to me 😒.

Load test for hi-lazy with 10 concurrent connections

.route(
    "/hi-lazy",
    get(|| async {
        tokio::time::sleep(Duration::from_millis(10)).await;
        "world"
    }),
)

Tool                          RPS
wrk                           855
my fun-bomb with tokio        833
my fun-bomb with async-std    829
drill                         732
my csharp implementation      813

From this result, it looks like the load tool itself is kind of OK: as long as the server is the bottleneck, the load tool performs efficiently and gets the correct result.

2023-06-26: I created a repo for comparing csharp and rust http client performance: https://github.com/albertwoo/fun-load. The results still show that csharp is faster, but I will keep updating them as I learn more about csharp and rust.

Final summary

Building this simple http load tool in rust has been a funny, confusing and long learning journey for me. I tasted some of the good parts of rust, like traits, compile-time features, the neat module system, powerful pattern matching, the great type system and immutability by default, etc. (A lot of this good stuff also exists in other ML languages, like fsharp.) What impressed me most is the low memory usage, the small cross-platform bundle size, the macro concept, and of course no GC.

But concurrency + ownership make the syntax so complex and hard to read. Yeah, of course, it is safer at runtime that way.

The async runtimes like tokio and async-std are really not that impressive to me at all. Maybe they just need some time...

In my previous blog post, I wanted to write everything in rust. Now I am cooling down; I will see what app I build next, then decide whether I should use dotnet or rust.

And with dotnet 8 AOT, dotnet can cover even more use cases, like building this load testing tool 😁
