You wouldn't buy a car without taking it for a test drive, so why deploy a large language model (LLM) without benchmarking it first?
When you're running LLMs at scale, you need more than just theoretical performance data. You're going to want to know how they actually perform under real conditions – throughput, latency, and hardware efficiency all matter. Skipping this step can result in higher costs and wasted resources.
Benchmarking early helps you avoid inefficiencies down the line, and it shouldn't take hours to get started. That's where vLLM comes in.
vLLM is an open-source library that optimizes the inference and serving of LLMs, and it simplifies the benchmarking process with an efficient architecture built around the PagedAttention algorithm. This mechanism reduces memory waste and improves throughput by storing each sequence's attention key-value (KV) cache in fixed-size blocks that don't need to be contiguous in GPU memory.
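To build intuition for what that means, here is a toy Python sketch of the paging idea (an illustration only, not vLLM's actual implementation): each sequence's KV cache grows in fixed-size blocks drawn from a shared pool, so no large contiguous region ever has to be reserved up front.

# Toy illustration of paged KV-cache bookkeeping (not vLLM's real code):
# each sequence maps logical token positions to fixed-size physical blocks
# allocated from a shared pool, so its cache need not be contiguous.
BLOCK_SIZE = 16

free_blocks = list(range(1024))   # ids of physical cache blocks in the pool
block_tables = {}                 # sequence id -> list of physical block ids

def on_new_token(seq_id: int, num_tokens: int) -> None:
    """Allocate another block only when the previous one is full."""
    table = block_tables.setdefault(seq_id, [])
    if num_tokens > len(table) * BLOCK_SIZE:
        table.append(free_blocks.pop())   # any free block will do

# Two sequences grow independently without reserving contiguous memory.
for t in range(1, 40):
    on_new_token(seq_id=0, num_tokens=t)
on_new_token(seq_id=1, num_tokens=1)
print(block_tables)   # e.g. {0: [1023, 1022, 1021], 1: [1020]}

Because blocks are only allocated as sequences actually grow, far less memory sits idle, which is what lets vLLM batch more requests onto the same GPU.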
When fast and efficient LLM inference is crucial, understanding your model's performance starts with benchmarking. With vLLM, you can spin up a high-performance inference server and start collecting those benchmarks in minutes.
To make things even easier, you don't need to spin up your own infrastructure from scratch. Vast.ai provides an ideal platform to leverage vLLM, offering high-performance GPUs at a fraction of the cost of traditional cloud providers. With Vast, it's easy to find the right hardware for your model, and with simple Docker integration, you can get up and running with vLLM quickly.
This guide demonstrates how to benchmark any LLM using vLLM on Vast.ai in just a few minutes. It walks through serving a model with vLLM and running a benchmark test to evaluate its performance, using an H100 GPU server from Vast.ai with the "PyTorch" template. The example uses meta-llama/Llama-3.1-8B-Instruct, but you can substitute any Hugging Face model that vLLM supports.
1. Install required Python packages
First, install the vLLM library along with Pandas and Hugging Face’s datasets library. For example, you can use pip in your Python environment:
pip install vllm pandas datasets
This installs vLLM and the other libraries for data handling.
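To confirm the installation, you can print the versions of the newly installed packages from Python (a quick sanity check; the exact versions will depend on when you run the install):

# Quick sanity check that the packages installed correctly.
import datasets
import pandas
import vllm

print("vllm:", vllm.__version__)
print("pandas:", pandas.__version__)
print("datasets:", datasets.__version__)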
2. Log in to Hugging Face
To download the Meta Llama 3.1 model, you need a Hugging Face access token (and approved access to the gated model on Hugging Face). Once you have one, run:
huggingface-cli login
This will prompt you to paste your Hugging Face access token (from your Hugging Face account settings). After entering the token, the CLI saves it so vLLM and other tools can download the model.
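If you prefer to authenticate from Python rather than the CLI (for example, inside a script or notebook), the huggingface_hub library provides an equivalent login call; the token below is a placeholder you would replace with your own:

from huggingface_hub import login

# Equivalent to `huggingface-cli login`; replace the placeholder with your
# real token from https://huggingface.co/settings/tokens.
login(token="hf_your_token_here")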
3. Start the vLLM server

Now start the vLLM server hosting the Llama 3.1 8B Instruct model. vLLM's OpenAI-compatible server listens on localhost port 8000 by default. Use the vllm serve command with nohup to run it in the background and redirect output to a log file. For example:
nohup vllm serve meta-llama/Llama-3.1-8B-Instruct > vllm.log 2>&1 &
This command launches the server with the specified model and writes all output (including any Uvicorn logs) to vllm.log. The server keeps running in the background. (The vLLM docs give a similar example: vllm serve <model-name>.) By default, the model is served at http://localhost:8000.
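Loading the model weights can take a minute or two, so it helps to wait until the server is actually ready before sending requests. One simple approach (a sketch that assumes the requests package is available in your environment) is to poll the OpenAI-compatible /v1/models endpoint until it answers:

import time

import requests

# Poll vLLM's OpenAI-compatible /v1/models endpoint until the server responds.
url = "http://localhost:8000/v1/models"
while True:
    try:
        reply = requests.get(url, timeout=2)
        if reply.ok:
            print("Server is ready:", [m["id"] for m in reply.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass  # server not up yet
    time.sleep(5)

You can also follow the startup progress with tail -f vllm.log.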
4. Send a test request

Once the server is running, you can test it by sending a chat completion request to the API. For example, use curl to call the chat endpoint (/v1/chat/completions) with a simple prompt:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
This sends a JSON payload to the server. The model field names the loaded model, and messages contains a chat-style conversation. (The official vLLM docs show a similar curl example for chat completions.) You should see a JSON response in the terminal with the model's answer.
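If you would rather test from Python than curl, the same request works through the official openai client, since vLLM exposes an OpenAI-compatible API (a sketch that assumes the openai package is installed; the API key is a dummy value because the server as started above doesn't require one):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is a dummy
# value because the server started above does not check one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)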
5. Clone the vLLM repository

To access the benchmark script, clone the vLLM source repository:
git clone https://github.com/vllm-project/vllm.git
This pulls the latest vLLM code. The benchmark_serving.py script and its related files live in the vllm/benchmarks folder.
6. Run the benchmark

Finally, run the built-in benchmark script to measure serving performance. The benchmark_serving.py tool sends many requests to the running server and reports throughput and latency. For example, you might run:
python3 vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--base-url http://127.0.0.1:8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--random-input-len 128 \
--random-output-len 128 \
--num-prompts 80 \
--max-concurrency 4 \
--temperature 0.7
--backend vllm: Use the vLLM backend to send benchmark requests.
--base-url http://127.0.0.1:8000: Address of your running vLLM server.
--model meta-llama/Llama-3.1-8B-Instruct: Name of the model being benchmarked.
--tokenizer meta-llama/Llama-3.1-8B-Instruct: Tokenizer used for encoding/decoding; usually the same as the model.
--dataset-name random: Use synthetic, randomly generated prompts instead of a real dataset.
--random-input-len 128: Length (in tokens) of each generated input prompt.
--random-output-len 128: Desired length (in tokens) of the model's output.
--num-prompts 80: Total number of prompts to send during the test.
--max-concurrency 4: Number of prompts sent in parallel (simulates batch size).
--temperature 0.7: Controls randomness in generation; higher values are more creative.
This example command (modeled on published vLLM benchmarks) sends 80 requests to your server at a concurrency of 4. When it finishes, the script prints statistics such as request throughput, token throughput, and latency.
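If you want to keep results around for comparison, recent versions of benchmark_serving.py accept a --save-result flag that writes the metrics to a JSON file (check python3 vllm/benchmarks/benchmark_serving.py --help on your checkout, since flags vary between versions). The saved files can then be loaded into a pandas DataFrame, which is why pandas was installed earlier; a minimal sketch:

import glob
import json

import pandas as pd

# Load every saved benchmark result JSON in the current directory into one
# table so different runs (models, GPUs, concurrency levels) can be compared.
# Assumes benchmark_serving.py was run with --save-result; the exact field
# names in the JSON depend on your vLLM version.
rows = []
for path in glob.glob("*.json"):
    with open(path) as f:
        rows.append(json.load(f))

df = pd.json_normalize(rows)
print(df.head())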
By following these steps, which draw on vLLM's documentation and examples, you can serve Llama-3.1-8B-Instruct with vLLM and evaluate its performance. Adjust the parameters above (concurrency, number of prompts, token lengths, and so on) as needed for your environment.
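For example, to see how throughput scales with load, you could sweep over a few concurrency levels from a small Python driver (a sketch based on the same command as above; adjust paths and values to your setup):

import subprocess

# Re-run the benchmark at several concurrency levels using the same settings
# as the command above. Assumes the vLLM repo was cloned into ./vllm and the
# server started earlier is still running on port 8000.
for concurrency in (1, 2, 4, 8):
    subprocess.run(
        [
            "python3", "vllm/benchmarks/benchmark_serving.py",
            "--backend", "vllm",
            "--base-url", "http://127.0.0.1:8000",
            "--model", "meta-llama/Llama-3.1-8B-Instruct",
            "--dataset-name", "random",
            "--random-input-len", "128",
            "--random-output-len", "128",
            "--num-prompts", "80",
            "--max-concurrency", str(concurrency),
        ],
        check=True,
    )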
And that's it! A fast, easy-to-use LLM benchmarking workflow powered by vLLM. In just ten minutes, you can gather meaningful performance data that will help you make smarter decisions about which models (and hardware) are right for your specific project and workload.
Need powerful GPUs to run your tests? Vast.ai offers on-demand access to H100s, RTX 5090s, and other high-performance options at a fraction of the typical price – so you can save 5-6X on GPU compute. Get started today and benchmark on your terms!