The AI inference landscape is rapidly evolving, with new frameworks promising faster, more efficient model serving. Modular MAX is one such framework: an OpenAI-API-compatible serving stack from Modular that promises lower latency and higher throughput. But how does it compare to established solutions like vLLM in real-world scenarios?
In this comprehensive benchmark, we'll deploy both Modular MAX and vLLM on Vast.ai infrastructure, running identical tests with Llama 3.1 8B Instruct to measure the metrics that matter most: Time to First Token (TTFT), response latency, throughput, and batch processing efficiency. By the end, you'll have data-driven insights to choose the right inference framework for your applications.
Modular MAX aims to stand out in the inference landscape through a purpose-built runtime and serving stack developed by Modular, rather than building on an existing inference engine.
Vast.ai is a global cloud computing marketplace that connects compute providers with users, offering flexible GPU rental options for AI and machine learning workloads. The platform democratizes access to high-performance computing by providing a diverse ecosystem of GPU resources.
Key advantages of Vast.ai include competitive marketplace pricing, a wide selection of GPU types, and on-demand instances that can be created and destroyed in minutes.
This notebook demonstrates how to deploy and compare Modular MAX and vLLM on Vast.ai infrastructure. We'll deploy both inference servers with the same model (Llama 3.1 8B Instruct) and run performance benchmarks comparing Time to First Token (TTFT), response latency, per-request throughput, and batch processing efficiency.
By the end of this tutorial, you'll have hands-on experience with both frameworks and data-driven insights to choose the best inference solution for your specific needs.
Before starting, ensure you have a Vast.ai account with an API key, a Hugging Face account with an access token, and approved access to the gated meta-llama/Llama-3.1-8B-Instruct repository.
First, we'll install the required packages and configure our environment. We need the Vast.ai CLI for managing GPU instances and the OpenAI client library for interfacing with both MAX and vLLM endpoints.
# Install required packages
pip install --upgrade vastai
pip install --upgrade openai
# Set your Vast.ai API key
export VAST_API_KEY="" # Your key here
vastai set api-key $VAST_API_KEY
Next, we'll search for suitable GPU instances on Vast.ai to run both MAX and vLLM. For a fair performance comparison, we need a GPU with compute capability 8.0 or higher and at least 80 GB of VRAM, a static IP with at least two directly mapped ports, 100 GB or more of disk space for the Docker image and model weights, and a verified, rentable machine (US-based in this example). The search below targets these specifications while prioritizing cost-effectiveness:
# Search for suitable GPU instances
vastai search offers "compute_cap >= 800 \
geolocation=US \
gpu_ram >= 80 \
num_gpus >= 1 \
static_ip = true \
direct_port_count >= 2 \
verified = true \
disk_space >= 100 \
rentable = true"
Now we'll deploy our first inference server using Modular MAX. MAX requires specific Docker images and environment configurations to unlock its performance optimizations.
Key deployment parameters:
- Image: docker.modular.com/modular/max-nvidia-full:latest - the official MAX runtime with NVIDIA GPU support
- Model: meta-llama/Llama-3.1-8B-Instruct - our benchmark model for consistent comparison
- Max length: 40960 tokens - enables long-context generation for comprehensive testing

Important: You need to accept the terms at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct before deployment.
The deployment typically takes 5-7 minutes as the instance downloads the Docker image and model weights.
# Deploy MAX instance
export MAX_INSTANCE_ID= # Insert instance ID from search results
export HUGGING_FACE_HUB_TOKEN="" # Your HF token
vastai create instance $MAX_INSTANCE_ID \
--image docker.modular.com/modular/max-nvidia-full:latest \
--disk 100 \
--env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN='$HUGGING_FACE_HUB_TOKEN' -e MAX_SERVE_LOGS_CONSOLE_LEVEL=DEBUG' \
--args --model-path meta-llama/Llama-3.1-8B-Instruct --max-length 40960
Each server needs time to fully initialize before we can begin benchmarking.
💡 Pro Tip: A server is ready when you see "Application startup complete" in its logs and the /health endpoint responds successfully.
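If you prefer not to watch the logs, you can poll the health endpoint from Python until the server answers. Here is a minimal sketch; the wait_until_ready helper and the IP/port placeholder are illustrative, not part of the deployment above:

# Poll the server's /health route until it returns HTTP 200, or give up after a timeout
import time
import urllib.request

def wait_until_ready(base_url: str, timeout_s: int = 900, interval_s: int = 15) -> bool:
    """Return True once GET {base_url}/health responds with 200, False on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except Exception:
            pass  # server still starting up; keep polling
        time.sleep(interval_s)
    return False

# Example with a placeholder address:
# wait_until_ready("http://YOUR_INSTANCE_IP:YOUR_PORT")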
Now we'll build a benchmarking harness to measure and compare the performance of both inference frameworks. It captures the metrics that matter most for real-world applications: Time to First Token (TTFT), total response latency, per-request tokens per second, and aggregate batch throughput.
import time
import statistics
import json
from typing import List, Dict
from openai import OpenAI
from datetime import datetime
class SequentialLLMBenchmarker:
def __init__(self, base_url: str, model_name: str = "meta-llama/Llama-3.1-8B-Instruct"):
self.client = OpenAI(
base_url=base_url,
api_key="EMPTY",
)
self.model_name = model_name
def create_test_prompts(self) -> List[str]:
"""Create a variety of test prompts for benchmarking"""
return [
"Explain quantum computing in simple terms.",
"Write a short story about a robot learning to love.",
"What are the key differences between Python and JavaScript?",
"Describe the process of photosynthesis step by step.",
"Create a recipe for chocolate chip cookies.",
"Explain the theory of relativity to a 10-year-old.",
"What are the benefits and drawbacks of renewable energy?",
"Write a professional email declining a job offer.",
"Describe the major events of World War II.",
"How does machine learning work?"
]
def measure_single_request(self, prompt: str) -> Dict:
"""Measure metrics for a single request"""
start_time = time.time()
try:
completion = self.client.chat.completions.create(
model=self.model_name,
messages=[{"role": "user", "content": prompt}],
stream=True # Enable streaming to measure time to first token
)
first_token_time = None
full_response = ""
token_count = 0
for chunk in completion:
current_time = time.time()
if chunk.choices[0].delta.content is not None:
if first_token_time is None:
first_token_time = current_time - start_time
content = chunk.choices[0].delta.content
full_response += content
# Rough token estimation (actual tokenization would be more accurate)
token_count += len(content.split())
end_time = time.time()
total_latency = end_time - start_time
return {
"success": True,
"time_to_first_token": first_token_time,
"total_latency": total_latency,
"token_count": token_count,
"tokens_per_second": token_count / total_latency if total_latency > 0 else 0,
"response_length": len(full_response),
"prompt": prompt[:50] + "..." if len(prompt) > 50 else prompt,
"full_response": full_response
}
except Exception as e:
return {
"success": False,
"error": str(e),
"prompt": prompt[:50] + "..." if len(prompt) > 50 else prompt
}
def run_benchmark(self, num_prompts: int = 5) -> Dict:
"""Run benchmark for this service"""
print(f"Model: {self.model_name}")
print(f"Base URL: {self.client.base_url}")
print(f"Test prompts: {num_prompts}")
prompts = self.create_test_prompts()[:num_prompts]
results = []
for i, prompt in enumerate(prompts):
print(f"\n 📝 Request {i+1}/{len(prompts)}: {prompt[:40]}...")
result = self.measure_single_request(prompt)
results.append(result)
if not result["success"]:
print(f" ❌ Failed: {result['error']}")
else:
print(f" ✅ TTFT: {result['time_to_first_token']:.3f}s")
print(f" ⏱️ Total: {result['total_latency']:.3f}s")
print(f" 🚀 TPS: {result['tokens_per_second']:.1f}")
print(f" 📊 Tokens: {result['token_count']}")
# Calculate aggregate metrics
successful_results = [r for r in results if r["success"]]
if not successful_results:
return {
"success_rate": 0,
"results": results,
"timestamp": datetime.now().isoformat()
}
total_tokens = sum([r["token_count"] for r in successful_results])
total_time = sum([r["total_latency"] for r in successful_results])
metrics = {
"base_url": str(self.client.base_url),
"model": self.model_name,
"timestamp": datetime.now().isoformat(),
"success_rate": len(successful_results) / len(results),
"total_requests": len(results),
"successful_requests": len(successful_results),
# Time to first token metrics
"avg_time_to_first_token": statistics.mean([r["time_to_first_token"] for r in successful_results]),
"median_time_to_first_token": statistics.median([r["time_to_first_token"] for r in successful_results]),
"min_time_to_first_token": min([r["time_to_first_token"] for r in successful_results]),
"max_time_to_first_token": max([r["time_to_first_token"] for r in successful_results]),
# Total latency metrics
"avg_total_latency": statistics.mean([r["total_latency"] for r in successful_results]),
"median_total_latency": statistics.median([r["total_latency"] for r in successful_results]),
"min_total_latency": min([r["total_latency"] for r in successful_results]),
"max_total_latency": max([r["total_latency"] for r in successful_results]),
# Tokens per second metrics
"avg_tokens_per_second": statistics.mean([r["tokens_per_second"] for r in successful_results]),
"median_tokens_per_second": statistics.median([r["tokens_per_second"] for r in successful_results]),
"min_tokens_per_second": min([r["tokens_per_second"] for r in successful_results]),
"max_tokens_per_second": max([r["tokens_per_second"] for r in successful_results]),
# Batch metrics
"total_tokens": total_tokens,
"total_time": total_time,
"batch_tokens_per_second": total_tokens / total_time if total_time > 0 else 0,
"individual_results": results
}
return metrics
def print_summary(self, results: Dict):
"""Print a formatted summary of results"""
print(f"\n{'='*60}")
print(f"📊 BENCHMARK SUMMARY")
print(f"{'='*60}")
if results["success_rate"] == 0:
print("❌ All requests failed!")
return
print(f"✅ Success Rate: {results['success_rate']:.1%} ({results['successful_requests']}/{results['total_requests']})")
print(f"\n⚡ TIME TO FIRST TOKEN:")
print(f" Average: {results['avg_time_to_first_token']:.3f}s")
print(f" Median: {results['median_time_to_first_token']:.3f}s")
print(f" Range: {results['min_time_to_first_token']:.3f}s - {results['max_time_to_first_token']:.3f}s")
print(f"\n⏱️ TOTAL LATENCY:")
print(f" Average: {results['avg_total_latency']:.3f}s")
print(f" Median: {results['median_total_latency']:.3f}s")
print(f" Range: {results['min_total_latency']:.3f}s - {results['max_total_latency']:.3f}s")
print(f"\n🚀 TOKENS PER SECOND:")
print(f" Average: {results['avg_tokens_per_second']:.1f} TPS")
print(f" Median: {results['median_tokens_per_second']:.1f} TPS")
print(f" Range: {results['min_tokens_per_second']:.1f} - {results['max_tokens_per_second']:.1f} TPS")
print(f"\n📦 BATCH THROUGHPUT:")
print(f" Total Tokens: {results['total_tokens']}")
print(f" Total Time: {results['total_time']:.1f}s")
print(f" Batch TPS: {results['batch_tokens_per_second']:.1f}")
print(f"\n🕐 Completed at: {results['timestamp']}")
print(f"{'='*60}")
# Global variables to store results for comparison
max_results = None
vllm_results = None
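The benchmarker above estimates token counts by splitting on whitespace, as its inline comment notes. If you want tokens-per-second figures that match the model's actual tokenization, you could count with the Llama tokenizer instead. A minimal sketch, assuming transformers is installed and your Hugging Face token has access to the gated Llama 3.1 repo (the count_tokens helper is illustrative, not used by the class above):

# Count tokens with the served model's own tokenizer instead of whitespace splitting
from transformers import AutoTokenizer

llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def count_tokens(text: str) -> int:
    """Number of Llama 3.1 tokens in `text` (excluding special tokens)."""
    return len(llama_tokenizer.encode(text, add_special_tokens=False))

# Example: count_tokens("Explain quantum computing in simple terms.")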
With our MAX instance deployed and initialized, we'll now run our comprehensive performance benchmark.
Before running the benchmark:
- Find your MAX instance's public IP address and the external port mapped to container port 8000 in the Vast.ai console
- Update the MAX_IP_ADDRESS and MAX_PORT variables below

The benchmark will run 5 diverse test prompts and measure all key performance metrics. Results will be saved automatically for later comparison.
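Optionally, you can verify connectivity before committing to the full run by listing the models the endpoint serves. This is a quick sanity-check sketch; it assumes the server exposes the OpenAI-compatible /v1/models route (vLLM does, and MAX's OpenAI-compatible server should as well), and the IP/port placeholder is illustrative:

# Quick sanity check: list the models the endpoint is serving
from openai import OpenAI

probe = OpenAI(base_url="http://YOUR_INSTANCE_IP:YOUR_PORT/v1", api_key="EMPTY")
print([m.id for m in probe.models.list().data])  # should include meta-llama/Llama-3.1-8B-Instruct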
# Set your MAX instance details
MAX_IP_ADDRESS = "" # Your MAX instance IP
MAX_PORT = "" # Your MAX instance port
print("🔥 Testing MAX Performance...")
max_benchmarker = SequentialLLMBenchmarker(f"http://{MAX_IP_ADDRESS}:{MAX_PORT}/v1")
max_results = max_benchmarker.run_benchmark(num_prompts=5) # Adjust number as needed
max_benchmarker.print_summary(max_results)
# Save results
with open("max_benchmark_results.json", "w") as f:
json.dump(max_results, f, indent=2)
print("\n💾 MAX results saved to max_benchmark_results.json")
🔥 Testing MAX Performance...
Model: meta-llama/Llama-3.1-8B-Instruct
Base URL: http://128.24.60.121:5084/v1/
Test prompts: 5
📝 Request 1/5: Explain quantum computing in simple term...
✅ TTFT: 0.184s
⏱️ Total: 5.753s
🚀 TPS: 91.6
📊 Tokens: 527
📝 Request 2/5: Write a short story about a robot learni...
✅ TTFT: 0.081s
⏱️ Total: 7.286s
🚀 TPS: 92.9
📊 Tokens: 677
📝 Request 3/5: What are the key differences between Pyt...
✅ TTFT: 0.123s
⏱️ Total: 7.061s
🚀 TPS: 85.5
📊 Tokens: 604
📝 Request 4/5: Describe the process of photosynthesis s...
✅ TTFT: 0.079s
⏱️ Total: 8.045s
🚀 TPS: 91.6
📊 Tokens: 737
📝 Request 5/5: Create a recipe for chocolate chip cooki...
✅ TTFT: 0.086s
⏱️ Total: 5.348s
🚀 TPS: 87.9
📊 Tokens: 470
============================================================
📊 BENCHMARK SUMMARY
============================================================
✅ Success Rate: 100.0% (5/5)
⚡ TIME TO FIRST TOKEN:
Average: 0.111s
Median: 0.086s
Range: 0.079s - 0.184s
⏱️ TOTAL LATENCY:
Average: 6.699s
Median: 7.061s
Range: 5.348s - 8.045s
🚀 TOKENS PER SECOND:
Average: 89.9 TPS
Median: 91.6 TPS
Range: 85.5 - 92.9 TPS
📦 BATCH THROUGHPUT:
Total Tokens: 3015
Total Time: 33.5s
Batch TPS: 90.0
🕐 Completed at: 2025-07-18T18:08:18.235635
============================================================
💾 MAX results saved to max_benchmark_results.json
To conduct a fair performance comparison, we'll now deploy vLLM with identical model and hardware configurations. First, destroy your MAX instance and, ideally, rent the same machine from your earlier search so that any performance differences are attributable to the inference framework rather than hardware variations.
Deployment Configuration:
- Image: vllm/vllm-openai:latest - the official vLLM OpenAI-compatible server
- Model: meta-llama/Llama-3.1-8B-Instruct for direct comparison
- Max model length: 40960 tokens to match the MAX configuration

# Deploy vLLM instance
export VLLM_INSTANCE_ID= # Insert instance ID from search results
export HUGGING_FACE_HUB_TOKEN="" # Your HF token
vastai create instance $VLLM_INSTANCE_ID \
--image vllm/vllm-openai:latest \
--disk 100 \
--env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN='$HUGGING_FACE_HUB_TOKEN \
--args --model meta-llama/Llama-3.1-8B-Instruct --max-model-len 40960
Now we'll run the identical benchmark suite against our vLLM instance. By using the exact same hardware instance (after destroying the MAX deployment), we ensure any performance differences are purely due to the inference framework rather than hardware variations.
Before running the vLLM benchmark:
- Find your vLLM instance's public IP address and the external port mapped to container port 8000 in the Vast.ai console
- Update the VLLM_IP_ADDRESS and VLLM_PORT variables below

The benchmark will use the same 5 test prompts as the MAX evaluation, enabling direct metric comparison on identical hardware.
# Set your vLLM instance details
VLLM_IP_ADDRESS = "" # Your vLLM instance IP
VLLM_PORT = "" # Your vLLM instance port
print("🔥 Testing vLLM Performance...")
vllm_benchmarker = SequentialLLMBenchmarker(f"http://{VLLM_IP_ADDRESS}:{VLLM_PORT}/v1")
vllm_results = vllm_benchmarker.run_benchmark(num_prompts=5) # Use same number as MAX
vllm_benchmarker.print_summary(vllm_results)
# Save results
with open("vllm_benchmark_results.json", "w") as f:
json.dump(vllm_results, f, indent=2)
print("\n💾 vLLM results saved to vllm_benchmark_results.json")
🔥 Testing vLLM Performance...
Model: meta-llama/Llama-3.1-8B-Instruct
Base URL: http://128.24.60.121:2622/v1/
Test prompts: 5
📝 Request 1/5: Explain quantum computing in simple term...
✅ TTFT: 0.698s
⏱️ Total: 6.777s
🚀 TPS: 71.7
📊 Tokens: 486
📝 Request 2/5: Write a short story about a robot learni...
✅ TTFT: 0.081s
⏱️ Total: 8.106s
🚀 TPS: 80.1
📊 Tokens: 649
📝 Request 3/5: What are the key differences between Pyt...
✅ TTFT: 0.085s
⏱️ Total: 8.162s
🚀 TPS: 74.5
📊 Tokens: 608
📝 Request 4/5: Describe the process of photosynthesis s...
✅ TTFT: 0.081s
⏱️ Total: 8.253s
🚀 TPS: 78.6
📊 Tokens: 649
📝 Request 5/5: Create a recipe for chocolate chip cooki...
✅ TTFT: 0.128s
⏱️ Total: 5.751s
🚀 TPS: 74.4
📊 Tokens: 428
============================================================
📊 BENCHMARK SUMMARY
============================================================
✅ Success Rate: 100.0% (5/5)
⚡ TIME TO FIRST TOKEN:
Average: 0.215s
Median: 0.085s
Range: 0.081s - 0.698s
⏱️ TOTAL LATENCY:
Average: 7.410s
Median: 8.106s
Range: 5.751s - 8.253s
🚀 TOKENS PER SECOND:
Average: 75.9 TPS
Median: 74.5 TPS
Range: 71.7 - 80.1 TPS
📦 BATCH THROUGHPUT:
Total Tokens: 2820
Total Time: 37.0s
Batch TPS: 76.1
🕐 Completed at: 2025-07-18T18:16:13.377743
============================================================
💾 vLLM results saved to vllm_benchmark_results.json
With both benchmarks complete, we can now perform a detailed performance comparison between Modular MAX and vLLM. Our analysis will reveal which framework excels in different scenarios and help you make informed decisions for your specific use cases.
What the comparison reveals: which framework delivers the faster Time to First Token, lower average response latency, higher per-request tokens per second, and higher batch throughput, and by what margin in each case.
def compare_results(max_res, vllm_res):
"""Compare results from both services"""
if max_res is None or vllm_res is None:
print("❌ Need to run both MAX and vLLM benchmarks first!")
return
if max_res["success_rate"] == 0 or vllm_res["success_rate"] == 0:
print("❌ One or both services had no successful requests!")
return
print(f"\n{'='*80}")
print(f"🏆 MAX vs vLLM PERFORMANCE COMPARISON")
print(f"{'='*80}")
def get_winner_and_improvement(max_val, vllm_val, lower_is_better=True):
if lower_is_better:
winner = "MAX" if max_val < vllm_val else "vLLM"
if vllm_val != 0:
improvement = abs((max_val - vllm_val) / vllm_val * 100)
else:
improvement = 0
else:
winner = "MAX" if max_val > vllm_val else "vLLM"
if vllm_val != 0:
improvement = abs((max_val - vllm_val) / vllm_val * 100)
else:
improvement = 0
return winner, improvement
# Success rates
print(f"\n📊 SUCCESS RATES:")
print(f" MAX: {max_res['success_rate']:.1%}")
print(f" vLLM: {vllm_res['success_rate']:.1%}")
# Time to first token
winner, improvement = get_winner_and_improvement(
max_res['avg_time_to_first_token'],
vllm_res['avg_time_to_first_token']
)
print(f"\n⚡ TIME TO FIRST TOKEN (Average):")
print(f" MAX: {max_res['avg_time_to_first_token']:.3f}s")
print(f" vLLM: {vllm_res['avg_time_to_first_token']:.3f}s")
print(f" 🏆 Winner: {winner} ({improvement:.1f}% faster)")
# Total latency
winner, improvement = get_winner_and_improvement(
max_res['avg_total_latency'],
vllm_res['avg_total_latency']
)
print(f"\n⏱️ AVERAGE RESPONSE LATENCY:")
print(f" MAX: {max_res['avg_total_latency']:.3f}s")
print(f" vLLM: {vllm_res['avg_total_latency']:.3f}s")
print(f" 🏆 Winner: {winner} ({improvement:.1f}% faster)")
# Tokens per second (individual)
winner, improvement = get_winner_and_improvement(
max_res['avg_tokens_per_second'],
vllm_res['avg_tokens_per_second'],
lower_is_better=False
)
print(f"\n🚀 TOKENS PER SECOND (Individual Average):")
print(f" MAX: {max_res['avg_tokens_per_second']:.1f} TPS")
print(f" vLLM: {vllm_res['avg_tokens_per_second']:.1f} TPS")
print(f" 🏆 Winner: {winner} ({improvement:.1f}% faster)")
# Batch throughput
winner, improvement = get_winner_and_improvement(
max_res['batch_tokens_per_second'],
vllm_res['batch_tokens_per_second'],
lower_is_better=False
)
print(f"\n📦 BATCH THROUGHPUT:")
print(f" MAX: {max_res['batch_tokens_per_second']:.1f} TPS")
print(f" vLLM: {vllm_res['batch_tokens_per_second']:.1f} TPS")
print(f" 🏆 Winner: {winner} ({improvement:.1f}% faster)")
# Summary
print(f"\n📈 DETAILED BREAKDOWN:")
print(f" MAX - TTFT: {max_res['avg_time_to_first_token']:.3f}s, Latency: {max_res['avg_total_latency']:.3f}s, TPS: {max_res['avg_tokens_per_second']:.1f}")
print(f" vLLM - TTFT: {vllm_res['avg_time_to_first_token']:.3f}s, Latency: {vllm_res['avg_total_latency']:.3f}s, TPS: {vllm_res['avg_tokens_per_second']:.1f}")
print(f"\n{'='*80}")
# Save comparison
comparison = {
"timestamp": datetime.now().isoformat(),
"max_results": max_res,
"vllm_results": vllm_res,
"summary": {
"ttft_winner": get_winner_and_improvement(max_res['avg_time_to_first_token'], vllm_res['avg_time_to_first_token'])[0],
"latency_winner": get_winner_and_improvement(max_res['avg_total_latency'], vllm_res['avg_total_latency'])[0],
"tps_winner": get_winner_and_improvement(max_res['avg_tokens_per_second'], vllm_res['avg_tokens_per_second'], False)[0],
"batch_winner": get_winner_and_improvement(max_res['batch_tokens_per_second'], vllm_res['batch_tokens_per_second'], False)[0]
}
}
with open("comparison_results.json", "w") as f:
json.dump(comparison, f, indent=2)
print("💾 Comparison saved to comparison_results.json")
# Run the comparison
compare_results(max_results, vllm_results)
================================================================================
🏆 MAX vs vLLM PERFORMANCE COMPARISON
================================================================================
📊 SUCCESS RATES:
MAX: 100.0%
vLLM: 100.0%
⚡ TIME TO FIRST TOKEN (Average):
MAX: 0.111s
vLLM: 0.215s
🏆 Winner: MAX (48.5% faster)
⏱️ AVERAGE RESPONSE LATENCY:
MAX: 6.699s
vLLM: 7.410s
🏆 Winner: MAX (9.6% faster)
🚀 TOKENS PER SECOND (Individual Average):
MAX: 89.9 TPS
vLLM: 75.9 TPS
🏆 Winner: MAX (18.5% faster)
📦 BATCH THROUGHPUT:
MAX: 90.0 TPS
vLLM: 76.1 TPS
🏆 Winner: MAX (18.3% faster)
📈 DETAILED BREAKDOWN:
MAX - TTFT: 0.111s, Latency: 6.699s, TPS: 89.9
vLLM - TTFT: 0.215s, Latency: 7.410s, TPS: 75.9
================================================================================
💾 Comparison saved to comparison_results.json
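If you ran the two benchmarks in separate sessions, the in-memory max_results and vllm_results variables will be empty; you can reload them from the saved JSON files before comparing. A small convenience sketch:

# Reload saved results and re-run the comparison from disk
import json

with open("max_benchmark_results.json") as f:
    max_results = json.load(f)
with open("vllm_benchmark_results.json") as f:
    vllm_results = json.load(f)

compare_results(max_results, vllm_results)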
Our benchmarking demonstrates that Modular MAX can achieve better performance than vLLM across multiple metrics. The results show how MAX's optimizations translate into measurable improvements in real-world deployment scenarios.
In our benchmark testing, Modular MAX showed performance improvements across all measured metrics:
- Time to First Token: 0.111s vs 0.215s average (48.5% faster)
- Average response latency: 6.699s vs 7.410s (9.6% faster)
- Per-request throughput: 89.9 vs 75.9 tokens per second (18.5% higher)
- Batch throughput: 90.0 vs 76.1 tokens per second (18.3% higher)
These results were consistent across our test prompts, from simple questions to longer creative writing tasks.
The combination of Modular MAX and Vast.ai provides a viable option for deploying AI inference workloads. In our tests, Modular MAX demonstrated it can achieve better performance than vLLM, with improvements ranging from roughly 10% to 50% depending on the metric. These results suggest that MAX can be a good choice for applications where inference speed is a priority, particularly when deployed on Vast.ai's cost-effective GPU infrastructure.