How to Benchmark An LLM with vLLM in 10 Minutes

July 21, 2025
5 Min Read
By Team Vast

You wouldn't buy a car without taking it for a test drive, so why deploy a large language model (LLM) without benchmarking it first?

When you're running LLMs at scale, you need more than just theoretical performance data. You're going to want to know how they actually perform under real conditions – throughput, latency, and hardware efficiency all matter. Skipping this step can result in higher costs and wasted resources.

Benchmarking early helps you avoid inefficiencies down the line, and it shouldn't take hours to get started. That's where vLLM comes in.

A Smarter Way to Serve and Benchmark LLMs

vLLM is an open-source library that optimizes LLM inference and serving, and it simplifies benchmarking with an efficient architecture built around PagedAttention, an attention algorithm that reduces memory waste and improves throughput by storing the attention key-value cache in non-contiguous, fixed-size blocks.

When fast and efficient LLM inference is crucial, understanding your model's performance starts with benchmarking. With vLLM, you can spin up a high-performance inference server and start collecting those benchmarks in minutes.

To make things even easier, you don't need to spin up your own infrastructure from scratch. Vast.ai provides an ideal platform to leverage vLLM, offering high-performance GPUs at a fraction of the cost of traditional cloud providers. With Vast, it's easy to find the right hardware for your model, and with simple Docker integration, you can get up and running with vLLM quickly.

This guide demonstrates how to benchmark any LLM using vLLM on Vast.ai, in just a few minutes.

Getting Started: Benchmarking LLMs with vLLM

This guide walks you through how to serve a large language model with vLLM and run a benchmark test to evaluate its performance. It uses an H100 GPU server from Vast.ai with the "PyTorch" template. In this example, we’ll use the model meta-llama/Llama-3.1-8B-Instruct, but you can use any Hugging Face model that vLLM supports.

1. Install required Python packages

First, install the vLLM library along with Pandas and Hugging Face’s datasets library. For example, you can use pip in your Python environment:

pip install vllm pandas datasets

This installs vLLM and the other libraries for data handling.
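
To confirm the installation before moving on, you can print the installed vLLM version (a quick sanity check; the exact version you see depends on when you install):

python3 -c "import vllm; print(vllm.__version__)"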

2. Log in to Hugging Face

To access the Meta Llama 3.1 model, you need a Hugging Face access token, and because the model is gated you also need to accept Meta’s license on the model page. Once you have a token, run:

huggingface-cli login

This will prompt you to paste your Hugging Face access token (from your Hugging Face account settings). After entering the token, the CLI saves it so vLLM and other tools can download the model.
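
If you prefer a non-interactive setup (for example, in a startup script on a rented instance), you can instead export the token as an environment variable that the Hugging Face libraries pick up. Replace the placeholder with your own token:

export HF_TOKEN=<your_hf_token>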

3. Start the vLLM server

Now start the vLLM server hosting the Llama 3.1 8B Instruct model. vLLM’s OpenAI-compatible server listens on localhost port 8000 by default. Use the vllm serve command with nohup to run it in the background and redirect output to a log file. For example:

nohup vllm serve meta-llama/Llama-3.1-8B-Instruct > vllm.log 2>&1 &

This command launches the server with the specified model and writes all output (including any Uvicorn logs) to vllm.log, and the server keeps running in the background. (The vLLM docs give a similar example: vllm serve <model-name>.) By default, the model is served on http://localhost:8000.
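
The model weights can take a few minutes to download and load, so it’s worth confirming the server is ready before sending requests. One way to do that (assuming the default port 8000) is to watch the log and then query the OpenAI-compatible /v1/models endpoint:

# follow the log until the server reports it is up and running (Ctrl+C to stop following)
tail -f vllm.log

# once it's up, the API lists the loaded model
curl http://localhost:8000/v1/models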

4. Test the server with a curl request

Once the server is running, you can test it by sending a chat completion request to the API. For example, use curl to call the chat endpoint (/v1/chat/completions) with a simple prompt:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.1-8B-Instruct",
          "messages": [
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "What is the capital of France?"}
          ]
        }'

This sends a JSON payload to the server. The model field names our loaded model, and messages contains a chat-style conversation. The server responds with a JSON completion. (The official vLLM docs show a similar curl example for chat completions.)

You should see a JSON response in the terminal with the model’s answer.
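
You can also include standard OpenAI-style sampling parameters in the same request body. For example, here’s a variation of the call above that caps the response length and sets the temperature (the values are just illustrative):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.1-8B-Instruct",
          "messages": [
              {"role": "user", "content": "Give me one sentence about Paris."}
          ],
          "max_tokens": 64,
          "temperature": 0.7
        }'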

5. Clone the vLLM repository

To access the benchmark script, clone the vLLM source repository:

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks

This pulls the latest vLLM code. Now you have the benchmark_serving.py script and related files in the benchmarks folder.
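
Because the benchmark script evolves alongside vLLM, you may want to check out the release tag matching your installed version so the script and server agree. vLLM tags releases as v<version>, so something like the following should work (adjust if your version string includes extra suffixes):

# still inside the cloned repo
git checkout "v$(python3 -c 'import vllm; print(vllm.__version__)')"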

6. Run the benchmark script

Finally, run the built-in benchmark script to measure serving performance. The benchmark_serving.py tool sends many requests to the running server and reports throughput and latency. From the benchmarks folder, you might run:

python3 benchmark_serving.py \
    --backend vllm \
    --base-url http://127.0.0.1:8000 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tokenizer meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 128 \
    --num-prompts 80 \
    --max-concurrency 4 \
    --temperature 0.7

  • --backend vllm: Use the vLLM backend to send benchmark requests.

  • --base-url http://127.0.0.1:8000: Address of your running vLLM server.

  • --model meta-llama/Llama-3.1-8B-Instruct: Name of the model being benchmarked.

  • --tokenizer meta-llama/Llama-3.1-8B-Instruct: Tokenizer used for encoding/decoding; usually the same as the model.

  • --dataset-name random: Use synthetic/randomly generated prompts instead of a real dataset.

  • --random-input-len 128: Length (in tokens) of each generated input prompt.

  • --random-output-len 128: Desired length (in tokens) of the model's output.

  • --num-prompts 80: Total number of prompts to send during the test.

  • --max-concurrency 4: Maximum number of requests in flight at the same time (roughly simulates batch size).

  • --temperature 0.7: Controls randomness in generation; higher = more creative.

This example command (modeled on published vLLM benchmark examples) sends 80 requests to your server with a maximum concurrency of 4. When it finishes, the script prints metrics such as request throughput, output token throughput, and latency statistics.
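
To see how throughput and latency change as load increases, you can repeat the same benchmark at several concurrency levels. Here’s a minimal sketch, run from the benchmarks folder, with log file names chosen purely for illustration:

for c in 1 2 4 8 16; do
    python3 benchmark_serving.py \
        --backend vllm \
        --base-url http://127.0.0.1:8000 \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --tokenizer meta-llama/Llama-3.1-8B-Instruct \
        --dataset-name random \
        --random-input-len 128 \
        --random-output-len 128 \
        --num-prompts 80 \
        --max-concurrency "$c" | tee "benchmark_c${c}.log"
done

Comparing the reported numbers across runs makes it easy to see where adding concurrency stops paying off on your hardware.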

By following these steps, which are based on vLLM’s documentation and examples, you can serve Llama-3.1-8B-Instruct with vLLM and evaluate its performance. Adjust the parameters (concurrency, number of prompts, token lengths, etc.) in the commands above as needed for your environment.

Conclusion

And that's it! A fast, easy-to-use LLM benchmarking workflow powered by vLLM. In just ten minutes, you can gather meaningful performance data that will help you make smarter decisions about which models (and hardware) are right for your specific project and workload.

Need powerful GPUs to run your tests? Vast.ai offers on-demand access to H100s, RTX 5090s, and other high-performance options at a fraction of the typical price – so you can save 5-6X on GPU compute. Get started today and benchmark on your terms!
