Modular MAX vs vLLM Performance Comparison on Vast.ai

July 30, 2025
15 Min Read
By Team Vast

The AI inference landscape is evolving rapidly, with new frameworks promising faster, more efficient model serving. Modular MAX is one such framework, built around graph-level compiler optimizations that aim to extract more performance from the same hardware. But how does it compare to established solutions like vLLM in real-world scenarios?

In this comprehensive benchmark, we'll deploy both Modular MAX and vLLM on Vast.ai infrastructure, running identical tests with Llama 3.1 8B Instruct to measure the metrics that matter most: Time to First Token (TTFT), response latency, throughput, and batch processing efficiency. By the end, you'll have data-driven insights to choose the right inference framework for your applications.

Why Modular MAX?

Modular MAX stands out in the inference landscape through several key innovations:

  • Optimized Performance: Uses MAX Graph optimization for performance and portability across any architecture
  • Hardware Portability: Runs efficiently across AMD and NVIDIA GPUs without vendor lock-in
  • 500+ Model Support: Extensive repository of optimized open-source models
  • OpenAI API Compatibility: Drop-in replacement for existing OpenAI API integrations (see the short example after this list)
  • Enterprise Ready: Professional platform with extensibility and custom operations support
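
To illustrate the OpenAI API compatibility, here is a minimal sketch using the official openai Python package: only the base URL changes compared to calling OpenAI's hosted API. The IP address and port are placeholders for the instance details you'll obtain later in this guide.

# Point the standard OpenAI client at a MAX (or vLLM) endpoint
from openai import OpenAI

# The server does not validate the API key, so any non-empty string works.
client = OpenAI(base_url="http://<instance-ip>:<port>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)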

Why Vast.ai?

Vast.ai is a global cloud computing marketplace that connects compute providers with users, offering flexible GPU rental options for AI and machine learning workloads. The platform democratizes access to high-performance computing by providing a diverse ecosystem of GPU resources.

Key advantages of Vast.ai:

  • Cost Efficiency: Up to 80% savings compared to traditional cloud providers with transparent pricing
  • Hardware Selection: Over 10,000 GPU options from individual providers to tier 4 datacenters
  • Flexible Security Tiers: Choose between secure datacenter and community servers based on your requirements

What You'll Learn

This notebook demonstrates how to deploy and compare Modular MAX and vLLM on Vast.ai infrastructure. We'll deploy both inference servers with the same model (Llama 3.1 8B Instruct) and run performance benchmarks to compare:

  • Time to First Token (TTFT) - Critical for user experience and response initiation
  • Total response latency - End-to-end request completion time
  • Tokens per second (TPS) - Sustained generation performance
  • Batch throughput - Overall efficiency for processing multiple requests

By the end of this tutorial, you'll have hands-on experience with both frameworks and data-driven insights to choose the best inference solution for your specific needs.

Prerequisites

Before starting, ensure you have:

  1. Vast.ai account with an API key
  2. Hugging Face account with access to the Llama models (you'll need to accept the model's license terms)
  3. Python environment with Jupyter notebook support and required packages

Step 1: Setup and Installation

First, we'll install the required packages and configure our environment. We need the Vast.ai CLI for managing GPU instances and the OpenAI client library for interfacing with both MAX and vLLM endpoints.

# Install required packages
pip install --upgrade vastai
pip install --upgrade openai
# Set your Vast.ai API key
export VAST_API_KEY="" # Your key here
vastai set api-key $VAST_API_KEY

Step 2: Find the Right GPU Hardware

Next, we'll search for suitable GPU instances on Vast.ai to run both MAX and vLLM. For optimal performance comparison, we need:

  • High-performance GPUs (A100 or H100) with compute capability ≥ 8.0 for modern tensor operations
  • At least 80GB VRAM - the Llama 3.1 8B weights need only ~16GB in bf16, but the headroom allows a large KV cache at the 40,960-token context length we configure later
  • Static IP address for stable API endpoints during our benchmarking session
  • Multiple direct ports to run both inference servers simultaneously
  • Sufficient disk space (100GB+) for Docker images and model downloads

The search below targets these specifications while prioritizing cost-effectiveness:

# Search for suitable GPU instances
vastai search offers "compute_cap >= 800 \
geolocation=US \
gpu_ram >= 80 \
num_gpus >= 1 \
static_ip = true \
direct_port_count >= 2 \
verified = true \
disk_space >= 100 \
rentable = true"

Step 3: Deploy Modular MAX Instance

Now we'll deploy our first inference server using Modular MAX. MAX requires specific Docker images and environment configurations to unlock its performance optimizations.

Key deployment parameters:

  • Image: docker.modular.com/modular/max-nvidia-full:latest - The official MAX runtime with NVIDIA GPU support
  • Model: meta-llama/Llama-3.1-8B-Instruct - Our benchmark model for consistent comparison
  • Max Length: 40960 tokens - Enables long-context generation for comprehensive testing
  • HuggingFace Token: Required for downloading the Llama model weights

Important: You need to accept the terms at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct before deployment.

The deployment typically takes 5-7 minutes as the instance downloads the Docker image and model weights.
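
Before launching, you can optionally confirm that your Hugging Face token really has access to the gated repository, so the instance doesn't sit idle on an authorization error. A minimal sketch, assuming the huggingface_hub package is available (it is not installed in Step 1):

# Optional: confirm your token can access the gated Llama repo before deploying
import os
from huggingface_hub import model_info

token = os.environ.get("HUGGING_FACE_HUB_TOKEN")
try:
    info = model_info("meta-llama/Llama-3.1-8B-Instruct", token=token)
    print(f"Access OK: {info.id}")
except Exception as err:  # gated-repo / 401 errors land here if the terms weren't accepted
    print(f"No access yet: {err}")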

# Deploy MAX instance
export MAX_INSTANCE_ID= # Insert the offer ID from the search results above
export HUGGING_FACE_HUB_TOKEN="" # Your HF token

vastai create instance $MAX_INSTANCE_ID \
    --image docker.modular.com/modular/max-nvidia-full:latest \
    --disk 100 \
    --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN='$HUGGING_FACE_HUB_TOKEN' -e MAX_SERVE_LOGS_CONSOLE_LEVEL=DEBUG' \
    --args --model-path meta-llama/Llama-3.1-8B-Instruct --max-length 40960

Monitoring Instance Initialization

Each instance needs time to fully initialize before we can begin benchmarking (you'll repeat these checks when you deploy vLLM in Step 6):

Initialization Process (5-10 minutes total):

  1. Download Docker images - MAX and vLLM runtime environments
  2. Download model weights - Llama 3.1 8B parameters from Hugging Face (~15GB)
  3. Initialize inference servers - Load model into GPU memory and start API endpoints

How to Monitor Progress:

  1. Navigate to the Vast.ai Instances tab
  2. Click on each instance to view real-time logs
  3. Look for startup completion messages in the console output
  4. Note the external IP addresses and port mappings

💡 Pro Tip: Both servers are ready when you see "Application startup complete" in the logs and can successfully ping the /health endpoint.
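
If you'd rather poll from code than watch the console, a small standard-library sketch like this waits for the /health endpoint to come up (replace the placeholder IP and port with your instance's values):

# Poll a server's /health endpoint until it responds (standard library only)
import time
import urllib.request

def wait_until_healthy(base_url: str, timeout_s: int = 900, interval_s: int = 15) -> bool:
    """Return True once GET {base_url}/health answers with HTTP 200, or False on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    print("✅ Server is ready")
                    return True
        except Exception:
            pass  # server not reachable yet; keep waiting
        time.sleep(interval_s)
    print("❌ Timed out waiting for the server")
    return False

wait_until_healthy("http://<instance-ip>:<port>")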

Step 4: Create Comprehensive Benchmark Framework

Now we'll build a sophisticated benchmarking system to measure and compare the performance of both inference frameworks. Our benchmark captures the metrics that matter most for real-world applications:

Key Metrics We'll Measure:

  • Time to First Token (TTFT): How quickly the model starts generating - critical for user experience
  • Total Response Latency: Complete request-to-completion time for full responses
  • Tokens Per Second (TPS): Generation speed during active output - measures sustained performance
  • Batch Throughput: Overall efficiency when processing multiple requests

Testing Methodology:

  • Diverse Prompts: Various complexity levels from simple questions to creative writing
  • Streaming Responses: Real-time token measurement for accurate TTFT capture
  • Statistical Analysis: Mean, median, min/max values for robust performance characterization
  • Error Handling: Graceful handling of failures with detailed logging

import time
import statistics
import json
from typing import List, Dict
from openai import OpenAI
from datetime import datetime

class SequentialLLMBenchmarker:
    def __init__(self, base_url: str, model_name: str = "meta-llama/Llama-3.1-8B-Instruct"):
        self.client = OpenAI(
            base_url=base_url,
            api_key="EMPTY",
        )
        self.model_name = model_name

    def create_test_prompts(self) -> List[str]:
        """Create a variety of test prompts for benchmarking"""
        return [
            "Explain quantum computing in simple terms.",
            "Write a short story about a robot learning to love.",
            "What are the key differences between Python and JavaScript?",
            "Describe the process of photosynthesis step by step.",
            "Create a recipe for chocolate chip cookies.",
            "Explain the theory of relativity to a 10-year-old.",
            "What are the benefits and drawbacks of renewable energy?",
            "Write a professional email declining a job offer.",
            "Describe the major events of World War II.",
            "How does machine learning work?"
        ]

    def measure_single_request(self, prompt: str) -> Dict:
        """Measure metrics for a single request"""
        start_time = time.time()

        try:
            completion = self.client.chat.completions.create(
                model=self.model_name,
                messages=[{"role": "user", "content": prompt}],
                stream=True  # Enable streaming to measure time to first token
            )

            first_token_time = None
            full_response = ""
            token_count = 0

            for chunk in completion:
                current_time = time.time()

                if chunk.choices[0].delta.content is not None:
                    if first_token_time is None:
                        first_token_time = current_time - start_time

                    content = chunk.choices[0].delta.content
                    full_response += content
                    # Rough token estimation (actual tokenization would be more accurate)
                    token_count += len(content.split())

            end_time = time.time()
            total_latency = end_time - start_time

            return {
                "success": True,
                "time_to_first_token": first_token_time,
                "total_latency": total_latency,
                "token_count": token_count,
                "tokens_per_second": token_count / total_latency if total_latency > 0 else 0,
                "response_length": len(full_response),
                "prompt": prompt[:50] + "..." if len(prompt) > 50 else prompt,
                "full_response": full_response
            }

        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "prompt": prompt[:50] + "..." if len(prompt) > 50 else prompt
            }

    def run_benchmark(self, num_prompts: int = 5) -> Dict:
        """Run benchmark for this service"""
        print(f"Model: {self.model_name}")
        print(f"Base URL: {self.client.base_url}")
        print(f"Test prompts: {num_prompts}")

        prompts = self.create_test_prompts()[:num_prompts]
        results = []

        for i, prompt in enumerate(prompts):
            print(f"\n  📝 Request {i+1}/{len(prompts)}: {prompt[:40]}...")
            result = self.measure_single_request(prompt)
            results.append(result)

            if not result["success"]:
                print(f"    ❌ Failed: {result['error']}")
            else:
                print(f"    ✅ TTFT: {result['time_to_first_token']:.3f}s")
                print(f"    ⏱️  Total: {result['total_latency']:.3f}s")
                print(f"    🚀 TPS: {result['tokens_per_second']:.1f}")
                print(f"    📊 Tokens: {result['token_count']}")

        # Calculate aggregate metrics
        successful_results = [r for r in results if r["success"]]

        if not successful_results:
            return {
                "success_rate": 0,
                "results": results,
                "timestamp": datetime.now().isoformat()
            }

        total_tokens = sum([r["token_count"] for r in successful_results])
        total_time = sum([r["total_latency"] for r in successful_results])

        metrics = {
            "base_url": str(self.client.base_url),
            "model": self.model_name,
            "timestamp": datetime.now().isoformat(),
            "success_rate": len(successful_results) / len(results),
            "total_requests": len(results),
            "successful_requests": len(successful_results),

            # Time to first token metrics
            "avg_time_to_first_token": statistics.mean([r["time_to_first_token"] for r in successful_results]),
            "median_time_to_first_token": statistics.median([r["time_to_first_token"] for r in successful_results]),
            "min_time_to_first_token": min([r["time_to_first_token"] for r in successful_results]),
            "max_time_to_first_token": max([r["time_to_first_token"] for r in successful_results]),

            # Total latency metrics
            "avg_total_latency": statistics.mean([r["total_latency"] for r in successful_results]),
            "median_total_latency": statistics.median([r["total_latency"] for r in successful_results]),
            "min_total_latency": min([r["total_latency"] for r in successful_results]),
            "max_total_latency": max([r["total_latency"] for r in successful_results]),

            # Tokens per second metrics
            "avg_tokens_per_second": statistics.mean([r["tokens_per_second"] for r in successful_results]),
            "median_tokens_per_second": statistics.median([r["tokens_per_second"] for r in successful_results]),
            "min_tokens_per_second": min([r["tokens_per_second"] for r in successful_results]),
            "max_tokens_per_second": max([r["tokens_per_second"] for r in successful_results]),

            # Batch metrics
            "total_tokens": total_tokens,
            "total_time": total_time,
            "batch_tokens_per_second": total_tokens / total_time if total_time > 0 else 0,

            "individual_results": results
        }

        return metrics

    def print_summary(self, results: Dict):
        """Print a formatted summary of results"""
        print(f"\n{'='*60}")
        print(f"📊 BENCHMARK SUMMARY")
        print(f"{'='*60}")

        if results["success_rate"] == 0:
            print("❌ All requests failed!")
            return

        print(f"✅ Success Rate: {results['success_rate']:.1%} ({results['successful_requests']}/{results['total_requests']})")

        print(f"\n⚡ TIME TO FIRST TOKEN:")
        print(f"   Average: {results['avg_time_to_first_token']:.3f}s")
        print(f"   Median:  {results['median_time_to_first_token']:.3f}s")
        print(f"   Range:   {results['min_time_to_first_token']:.3f}s - {results['max_time_to_first_token']:.3f}s")

        print(f"\n⏱️  TOTAL LATENCY:")
        print(f"   Average: {results['avg_total_latency']:.3f}s")
        print(f"   Median:  {results['median_total_latency']:.3f}s")
        print(f"   Range:   {results['min_total_latency']:.3f}s - {results['max_total_latency']:.3f}s")

        print(f"\n🚀 TOKENS PER SECOND:")
        print(f"   Average: {results['avg_tokens_per_second']:.1f} TPS")
        print(f"   Median:  {results['median_tokens_per_second']:.1f} TPS")
        print(f"   Range:   {results['min_tokens_per_second']:.1f} - {results['max_tokens_per_second']:.1f} TPS")

        print(f"\n📦 BATCH THROUGHPUT:")
        print(f"   Total Tokens: {results['total_tokens']}")
        print(f"   Total Time:   {results['total_time']:.1f}s")
        print(f"   Batch TPS:    {results['batch_tokens_per_second']:.1f}")

        print(f"\n🕐 Completed at: {results['timestamp']}")
        print(f"{'='*60}")

# Global variables to store results for comparison
max_results = None
vllm_results = None
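
One caveat about the harness above: it approximates token counts with a whitespace split (as the inline comment notes), which undercounts relative to the model's real tokenizer, so absolute TPS figures are best read as comparable between frameworks rather than exact. If you want exact counts, one option is to re-tokenize each response with the model's own tokenizer; a sketch, assuming the transformers package and gated-model access via your Hugging Face token:

# Optional: exact token counts using the model's tokenizer instead of a whitespace split
# Requires: pip install transformers (and HF token access to the gated repo)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def count_tokens(text: str) -> int:
    """Count tokens the way the model does."""
    return len(tokenizer.encode(text, add_special_tokens=False))

# Example: recompute tokens-per-second for a single benchmark result
# tokens = count_tokens(result["full_response"])
# tps = tokens / result["total_latency"]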

Step 5: Benchmark Modular MAX Performance

With our MAX instance deployed and initialized, we'll now run our comprehensive performance benchmark.

Before running the benchmark:

  1. Ensure your MAX instance shows "running" status in the Vast.ai console
  2. Get the external IP address and port from the instance details page
  3. Update the MAX_IP_ADDRESS and MAX_PORT variables below

The benchmark will run 5 diverse test prompts and measure all key performance metrics. Results will be saved automatically for later comparison.

# Set your MAX instance details
MAX_IP_ADDRESS = "" # Your MAX instance IP
MAX_PORT = "" # Your MAX instance port

print("🔥 Testing MAX Performance...")
max_benchmarker = SequentialLLMBenchmarker(f"http://{MAX_IP_ADDRESS}:{MAX_PORT}/v1")
max_results = max_benchmarker.run_benchmark(num_prompts=5)  # Adjust number as needed
max_benchmarker.print_summary(max_results)

# Save results
with open("max_benchmark_results.json", "w") as f:
    json.dump(max_results, f, indent=2)
print("\n💾 MAX results saved to max_benchmark_results.json")
🔥 Testing MAX Performance...

Model: meta-llama/Llama-3.1-8B-Instruct
Base URL: http://128.24.60.121:5084/v1/
Test prompts: 5

  📝 Request 1/5: Explain quantum computing in simple term...
    ✅ TTFT: 0.184s
    ⏱️  Total: 5.753s
    🚀 TPS: 91.6
    📊 Tokens: 527

  📝 Request 2/5: Write a short story about a robot learni...
    ✅ TTFT: 0.081s
    ⏱️  Total: 7.286s
    🚀 TPS: 92.9
    📊 Tokens: 677

  📝 Request 3/5: What are the key differences between Pyt...
    ✅ TTFT: 0.123s
    ⏱️  Total: 7.061s
    🚀 TPS: 85.5
    📊 Tokens: 604

  📝 Request 4/5: Describe the process of photosynthesis s...
    ✅ TTFT: 0.079s
    ⏱️  Total: 8.045s
    🚀 TPS: 91.6
    📊 Tokens: 737

  📝 Request 5/5: Create a recipe for chocolate chip cooki...
    ✅ TTFT: 0.086s
    ⏱️  Total: 5.348s
    🚀 TPS: 87.9
    📊 Tokens: 470

============================================================
📊 BENCHMARK SUMMARY
============================================================
✅ Success Rate: 100.0% (5/5)

⚡ TIME TO FIRST TOKEN:
   Average: 0.111s
   Median:  0.086s
   Range:   0.079s - 0.184s

⏱️  TOTAL LATENCY:
   Average: 6.699s
   Median:  7.061s
   Range:   5.348s - 8.045s

🚀 TOKENS PER SECOND:
   Average: 89.9 TPS
   Median:  91.6 TPS
   Range:   85.5 - 92.9 TPS

📦 BATCH THROUGHPUT:
   Total Tokens: 3015
   Total Time:   33.5s
   Batch TPS:    90.0

🕐 Completed at: 2025-07-18T18:08:18.235635
============================================================

💾 MAX results saved to max_benchmark_results.json

Step 6: Deploy vLLM Comparison Instance

To conduct a fair performance comparison, we'll now deploy vLLM with an identical model and hardware configuration. First, destroy your MAX instance, then re-rent the same offer from your earlier search so that any performance differences are attributable to the inference framework rather than hardware variation.

Deployment Configuration:

  • Same Model: meta-llama/Llama-3.1-8B-Instruct for direct comparison
  • Same Hardware: Re-rent the same offer from your earlier search for identical hardware specifications
  • Same Context Length: 40960 tokens to match MAX configuration
  • Same Environment: Identical HuggingFace token and networking setup

# Deploy vLLM instance
export VLLM_INSTANCE_ID= # Insert the offer ID from the search results above
export HUGGING_FACE_HUB_TOKEN="" # Your HF token

vastai create instance $VLLM_INSTANCE_ID \
    --image vllm/vllm-openai:latest \
    --disk 100 \
    --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN='$HUGGING_FACE_HUB_TOKEN \
    --args --model meta-llama/Llama-3.1-8B-Instruct --max-model-len 40960

Step 7: Benchmark vLLM Performance

Now we'll run the identical benchmark suite against our vLLM instance. By using the exact same hardware instance (after destroying the MAX deployment), we ensure any performance differences are purely due to the inference framework rather than hardware variations.

Before running the vLLM benchmark:

  1. Wait for the vLLM instance to fully initialize (check logs for "Application startup complete")
  2. Get the external IP address and port from the Vast.ai console
  3. Update the VLLM_IP_ADDRESS and VLLM_PORT variables below

The benchmark will use the same 5 test prompts as the MAX evaluation, enabling direct metric comparison on identical hardware.

# Set your vLLM instance details
VLLM_IP_ADDRESS = "" # Your vLLM instance IP
VLLM_PORT = "" # Your vLLM instance port

print("🔥 Testing vLLM Performance...")
vllm_benchmarker = SequentialLLMBenchmarker(f"http://{VLLM_IP_ADDRESS}:{VLLM_PORT}/v1")
vllm_results = vllm_benchmarker.run_benchmark(num_prompts=5)  # Use same number as MAX
vllm_benchmarker.print_summary(vllm_results)

# Save results
with open("vllm_benchmark_results.json", "w") as f:
    json.dump(vllm_results, f, indent=2)
print("\n💾 vLLM results saved to vllm_benchmark_results.json")
🔥 Testing vLLM Performance...

Model: meta-llama/Llama-3.1-8B-Instruct
Base URL: http://128.24.60.121:2622/v1/
Test prompts: 5

  📝 Request 1/5: Explain quantum computing in simple term...
    ✅ TTFT: 0.698s
    ⏱️  Total: 6.777s
    🚀 TPS: 71.7
    📊 Tokens: 486

  📝 Request 2/5: Write a short story about a robot learni...
    ✅ TTFT: 0.081s
    ⏱️  Total: 8.106s
    🚀 TPS: 80.1
    📊 Tokens: 649

  📝 Request 3/5: What are the key differences between Pyt...
    ✅ TTFT: 0.085s
    ⏱️  Total: 8.162s
    🚀 TPS: 74.5
    📊 Tokens: 608

  📝 Request 4/5: Describe the process of photosynthesis s...
    ✅ TTFT: 0.081s
    ⏱️  Total: 8.253s
    🚀 TPS: 78.6
    📊 Tokens: 649

  📝 Request 5/5: Create a recipe for chocolate chip cooki...
    ✅ TTFT: 0.128s
    ⏱️  Total: 5.751s
    🚀 TPS: 74.4
    📊 Tokens: 428

============================================================
📊 BENCHMARK SUMMARY
============================================================
✅ Success Rate: 100.0% (5/5)

⚡ TIME TO FIRST TOKEN:
   Average: 0.215s
   Median:  0.085s
   Range:   0.081s - 0.698s

⏱️  TOTAL LATENCY:
   Average: 7.410s
   Median:  8.106s
   Range:   5.751s - 8.253s

🚀 TOKENS PER SECOND:
   Average: 75.9 TPS
   Median:  74.5 TPS
   Range:   71.7 - 80.1 TPS

📦 BATCH THROUGHPUT:
   Total Tokens: 2820
   Total Time:   37.0s
   Batch TPS:    76.1

🕐 Completed at: 2025-07-18T18:16:13.377743
============================================================

💾 vLLM results saved to vllm_benchmark_results.json

Step 8: Comprehensive Performance Analysis

With both benchmarks complete, we can now perform a detailed performance comparison between Modular MAX and vLLM. Our analysis will reveal which framework excels in different scenarios and help you make informed decisions for your specific use cases.

What the comparison reveals:

  • Time to First Token: Which framework provides better user experience with faster response initiation
  • Overall Latency: Which framework completes requests faster end-to-end
  • Throughput Performance: Which framework processes more tokens per second
  • Consistency: Which framework provides more stable and predictable performance
  • Cost Efficiency: Combined with Vast.ai's hourly rates, how much performance each dollar of GPU time buys

def compare_results(max_res, vllm_res):
    """Compare results from both services"""
    if max_res is None or vllm_res is None:
        print("❌ Need to run both MAX and vLLM benchmarks first!")
        return

    if max_res["success_rate"] == 0 or vllm_res["success_rate"] == 0:
        print("❌ One or both services had no successful requests!")
        return

    print(f"\n{'='*80}")
    print(f"🏆 MAX vs vLLM PERFORMANCE COMPARISON")
    print(f"{'='*80}")

    def get_winner_and_improvement(max_val, vllm_val, lower_is_better=True):
        """Return the winning framework and the relative difference vs vLLM (%)."""
        if lower_is_better:
            winner = "MAX" if max_val < vllm_val else "vLLM"
        else:
            winner = "MAX" if max_val > vllm_val else "vLLM"
        improvement = abs((max_val - vllm_val) / vllm_val * 100) if vllm_val != 0 else 0
        return winner, improvement

    # Success rates
    print(f"\n📊 SUCCESS RATES:")
    print(f"   MAX:  {max_res['success_rate']:.1%}")
    print(f"   vLLM: {vllm_res['success_rate']:.1%}")

    # Time to first token
    winner, improvement = get_winner_and_improvement(
        max_res['avg_time_to_first_token'],
        vllm_res['avg_time_to_first_token']
    )
    print(f"\n⚡ TIME TO FIRST TOKEN (Average):")
    print(f"   MAX:  {max_res['avg_time_to_first_token']:.3f}s")
    print(f"   vLLM: {vllm_res['avg_time_to_first_token']:.3f}s")
    print(f"   🏆 Winner: {winner} ({improvement:.1f}% faster)")

    # Total latency
    winner, improvement = get_winner_and_improvement(
        max_res['avg_total_latency'],
        vllm_res['avg_total_latency']
    )
    print(f"\n⏱️  AVERAGE RESPONSE LATENCY:")
    print(f"   MAX:  {max_res['avg_total_latency']:.3f}s")
    print(f"   vLLM: {vllm_res['avg_total_latency']:.3f}s")
    print(f"   🏆 Winner: {winner} ({improvement:.1f}% faster)")

    # Tokens per second (individual)
    winner, improvement = get_winner_and_improvement(
        max_res['avg_tokens_per_second'],
        vllm_res['avg_tokens_per_second'],
        lower_is_better=False
    )
    print(f"\n🚀 TOKENS PER SECOND (Individual Average):")
    print(f"   MAX:  {max_res['avg_tokens_per_second']:.1f} TPS")
    print(f"   vLLM: {vllm_res['avg_tokens_per_second']:.1f} TPS")
    print(f"   🏆 Winner: {winner} ({improvement:.1f}% faster)")

    # Batch throughput
    winner, improvement = get_winner_and_improvement(
        max_res['batch_tokens_per_second'],
        vllm_res['batch_tokens_per_second'],
        lower_is_better=False
    )
    print(f"\n📦 BATCH THROUGHPUT:")
    print(f"   MAX:  {max_res['batch_tokens_per_second']:.1f} TPS")
    print(f"   vLLM: {vllm_res['batch_tokens_per_second']:.1f} TPS")
    print(f"   🏆 Winner: {winner} ({improvement:.1f}% faster)")

    # Summary
    print(f"\n📈 DETAILED BREAKDOWN:")
    print(f"   MAX  - TTFT: {max_res['avg_time_to_first_token']:.3f}s, Latency: {max_res['avg_total_latency']:.3f}s, TPS: {max_res['avg_tokens_per_second']:.1f}")
    print(f"   vLLM - TTFT: {vllm_res['avg_time_to_first_token']:.3f}s, Latency: {vllm_res['avg_total_latency']:.3f}s, TPS: {vllm_res['avg_tokens_per_second']:.1f}")

    print(f"\n{'='*80}")

    # Save comparison
    comparison = {
        "timestamp": datetime.now().isoformat(),
        "max_results": max_res,
        "vllm_results": vllm_res,
        "summary": {
            "ttft_winner": get_winner_and_improvement(max_res['avg_time_to_first_token'], vllm_res['avg_time_to_first_token'])[0],
            "latency_winner": get_winner_and_improvement(max_res['avg_total_latency'], vllm_res['avg_total_latency'])[0],
            "tps_winner": get_winner_and_improvement(max_res['avg_tokens_per_second'], vllm_res['avg_tokens_per_second'], False)[0],
            "batch_winner": get_winner_and_improvement(max_res['batch_tokens_per_second'], vllm_res['batch_tokens_per_second'], False)[0]
        }
    }

    with open("comparison_results.json", "w") as f:
        json.dump(comparison, f, indent=2)
    print("💾 Comparison saved to comparison_results.json")
# Run the comparison
compare_results(max_results, vllm_results)

================================================================================
🏆 MAX vs vLLM PERFORMANCE COMPARISON
================================================================================

📊 SUCCESS RATES:
   MAX:  100.0%
   vLLM: 100.0%

⚡ TIME TO FIRST TOKEN (Average):
   MAX:  0.111s
   vLLM: 0.215s
   🏆 Winner: MAX (48.5% faster)

⏱️  AVERAGE RESPONSE LATENCY:
   MAX:  6.699s
   vLLM: 7.410s
   🏆 Winner: MAX (9.6% faster)

🚀 TOKENS PER SECOND (Individual Average):
   MAX:  89.9 TPS
   vLLM: 75.9 TPS
   🏆 Winner: MAX (18.5% faster)

📦 BATCH THROUGHPUT:
   MAX:  90.0 TPS
   vLLM: 76.1 TPS
   🏆 Winner: MAX (18.3% faster)

📈 DETAILED BREAKDOWN:
   MAX  - TTFT: 0.111s, Latency: 6.699s, TPS: 89.9
   vLLM - TTFT: 0.215s, Latency: 7.410s, TPS: 75.9

================================================================================
💾 Comparison saved to comparison_results.json

Performance Comparison Results

In this single-GPU, single-model test, Modular MAX outperformed vLLM on every metric we measured. The results show how MAX's optimizations translate into measurable improvements in a real-world deployment scenario.

Key Findings

In our benchmark testing, Modular MAX showed performance improvements across all measured metrics:

  • Time to First Token: 0.111s vs 0.215s - Users see the start of responses more quickly
  • Overall response latency: 6.699s vs 7.410s - Complete responses are delivered faster
  • Tokens per second: 89.9 vs 75.9 TPS - Higher sustained generation speed
  • Batch throughput: 90.0 vs 76.1 TPS - Better efficiency when handling multiple requests

These results were consistent across our test prompts, from simple questions to longer creative writing tasks.
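
Because both runs write their metrics to JSON, you can reload the saved files at any time to revisit the numbers without redeploying anything:

# Reload the saved benchmark results for later inspection
import json

with open("max_benchmark_results.json") as f:
    max_saved = json.load(f)
with open("vllm_benchmark_results.json") as f:
    vllm_saved = json.load(f)

for name, res in [("MAX", max_saved), ("vLLM", vllm_saved)]:
    print(f"{name}: TTFT {res['avg_time_to_first_token']:.3f}s, "
          f"latency {res['avg_total_latency']:.3f}s, "
          f"TPS {res['avg_tokens_per_second']:.1f}, "
          f"batch TPS {res['batch_tokens_per_second']:.1f}")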

Conclusion

The combination of Modular MAX and Vast.ai provides a viable option for deploying AI inference workloads. In our tests, Modular MAX demonstrated it can achieve better performance than vLLM, with improvements ranging from roughly 10% to 50% depending on the metric. These results suggest that MAX can be a good choice for applications where inference speed is a priority, particularly when deployed on Vast.ai's cost-effective GPU infrastructure.
