
Running Llama 4 Models on Vast.ai

- Team Vast

May 5, 2025 · Llama 4 · NVIDIA · Cloud Computing · vLLM · Inference · Meta

Introduction

Meta's Llama 4 is a breakthrough family of AI models combining state-of-the-art multimodal capabilities with the computational efficiency of a mixture-of-experts (MoE) architecture. These models can process both text and images through an early fusion design that seamlessly integrates the two modalities. Perhaps most impressively, Llama 4 Scout supports a context window of up to 10 million tokens - dramatically expanding what's possible with large language models.

The Llama 4 family includes:

  • Llama 4 Scout: 17B active parameters with 16 experts - efficient enough to run on modest GPU setups.

  • Llama 4 Maverick: 17B active parameters but with 128 experts - offering enhanced capabilities while maintaining efficiency.

  • Llama 4 Behemoth: 288B active parameters with 16 experts - Meta's most powerful model (not yet publicly released).

What makes these models special:

  • Mixture-of-Experts (MoE): Unlike traditional dense models that activate all parameters for every token, MoE models selectively activate only a fraction - dramatically improving computational efficiency. A toy sketch follows this list.
  • Early Fusion: Vision and text processing are integrated directly into the model backbone, enabling true multimodal reasoning.
  • 10M Token Context Window: Scout can process entire books, codebases, or document collections in a single prompt - enabling comprehensive analysis impossible with earlier models.
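
To make the MoE idea concrete, here's a toy sketch of top-k expert routing (illustrative only - not Llama 4's actual router): a router scores every expert per token, but only the highest-scoring experts run, so compute grows with k rather than with the total expert count.

import numpy as np

# Toy top-k expert routing (illustrative only - not Llama 4's real router).
num_experts, top_k, hidden = 16, 1, 8
experts = [np.random.randn(hidden, hidden) for _ in range(num_experts)]
router = np.random.randn(hidden, num_experts)

def moe_forward(token):
    scores = token @ router                  # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]     # only the top-k experts run
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

print(moe_forward(np.random.randn(hidden)).shape)  # (8,)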

In this guide, we'll deploy Llama 4 models on Vast.ai using three practical hardware configurations:

  1. Llama 4 Scout on 8× H100 GPUs
  2. Llama 4 Scout on 4× H100 GPUs
  3. Llama 4 Maverick on 8× H200 GPUs

You'll learn exactly how to set up each configuration, deploy the models with the vLLM server, and interact with them through an OpenAI-compatible API. By the end, you'll have hands-on experience running some of the most advanced AI models available today.

Installing the Vast CLI

Let's start by installing the Vast.ai command-line tools and setting up our API key:

pip install --upgrade vastai
# Set your Vast.ai API key
export VAST_API_KEY=""  # your key here
vastai set api-key $VAST_API_KEY
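
If the key is registered correctly, a command like vastai show user should print your account details rather than an authentication error.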

Running Llama 4 Scout on 8× H100s

Choosing the Right Hardware

First, we need to search for GPUs on Vast.ai to run the Llama 4 Scout model on 8× H100 GPUs with a 200k token context window. This model requires specific hardware capabilities to run efficiently with vLLM's optimizations. Here are our requirements:

  1. 8× H100 GPUs (80+ GB of VRAM each) to accommodate:

    • Llama 4 Scout model weights (17B active parameters with 16 experts)
    • KV Cache for handling the 200k token context window
  2. A static IP address for:

    • Stable API endpoint hosting
    • Consistent client connections
  3. At least one direct port that we can forward for:

    • vLLM's OpenAI-compatible API server
    • External access to the model endpoint
    • Secure request routing
  4. At least 600GB of disk space to hold the model weights (roughly 220GB in bf16 for Scout's ~109B total parameters) and other dependencies

Let's search for machines that meet these requirements:

vastai search offers " \
gpu_name = H100_NVL \
num_gpus = 8 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 600 \
rentable = true"
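
The ID column at the left of the results is the offer ID; that's the value we'll use to set INSTANCE_ID when creating the instance in the next step.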

Deploying the Server via Vast

We'll choose a machine from our search results and paste its ID below to set INSTANCE_ID.

We use vastai create instance to create an instance that:

  1. Uses the vllm/vllm-openai:latest Docker image, which gives us an OpenAI-compatible server.
  2. Forwards port 8000, the OpenAI server's default port, out of the container.
  3. Uses --model meta-llama/Llama-4-Scout-17B-16E-Instruct to serve the Llama 4 Scout model.
  4. Uses Llama 4-specific parameters:
    • --tensor-parallel-size 8 to distribute inference across multiple GPUs.
    • --max-model-len 200000 to set a 200k token context window.
    • --override-generation-config='{"attn_temperature_tuning": true}' for optimal text generation quality.
  5. Uses --disk 600 to ensure that we have 600GB of disk space for model weights and dependencies.

Note: Make sure you fill in your Hugging Face token (HUGGING_FACE_HUB_TOKEN) so the instance can download the model. You'll also need to accept the Llama 4 license terms on the model's Hugging Face page: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

export INSTANCE_ID=19319856 #insert instance ID
vastai create instance $INSTANCE_ID \
    --image vllm/vllm-openai:latest \
    --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_HUB_TOKEN' \
    --disk 600 \
    --args --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
        --tensor-parallel-size 8 \
        --max-model-len 200000 \
        --override-generation-config='{"attn_temperature_tuning": true}'

Get Instance IP Address and Port

Now, we need to get our IP address and port to call our model. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done.
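
You can also tail the startup logs from the command line with vastai logs <instance id>; once vLLM reports that its API server is listening on port 8000, the model is ready to serve requests.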

Next, go to the Instances tab in the Vast AI Console and find the instance you just created.

At the top of the instance card, there is a button showing an IP address. Click it, and a panel will appear with the IP address and the forwarded ports. You should see something like:

Open Ports
XX.XX.XXX.XX:YYYY -> 8000/tcp

You will need the IP address (XX.XX.XXX.XX) and the port (YYYY) for the next step.
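
Before wiring up the SDK, it's worth a quick sanity check that the server is up. Since vLLM exposes the standard OpenAI-compatible routes, a GET against /v1/models should list our model once loading has finished (fill in the IP and port you just noted):

import requests

VAST_IP_ADDRESS = ""  # e.g. XX.XX.XXX.XX
VAST_PORT = ""        # e.g. YYYY

# /v1/models returns the models the server is currently serving;
# a 200 response means vLLM has finished loading and is ready.
resp = requests.get(f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1/models")
print(resp.status_code)
print(resp.json())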

Call our Model

Install OpenAI

To call our model, we'll install the OpenAI SDK, since vLLM exposes an OpenAI-compatible API.

pip install --upgrade openai

Download our Data

To show the power of our large context window, we will download The Great Gatsby from Project Gutenberg and ask Llama 4 to summarize it.

Here we download the text for the book:

import requests

# Download the text file to a variable
url = "https://www.gutenberg.org/ebooks/64317.txt.utf-8"
text = requests.get(url).text

# Print the first 200 characters to confirm it worked
print(text[:200])
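
As a rough sanity check before sending the prompt, we can estimate how much of the context window the novel will occupy. This reuses the text variable from above and the common heuristic of roughly four characters per token - an approximation, not the model's real tokenizer:

# Rough estimate: ~4 characters per token for English text.
approx_tokens = len(text) // 4
print(f"~{approx_tokens:,} tokens - well inside the 200k window")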

Call Our Model

We will then use the OpenAI SDK. To do this, we need to set our VAST_IP_ADDRESS and VAST_PORT that we found above.

from openai import OpenAI

# Your Vast.ai instance details
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize the client with your server URL
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require an actual API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

prompt = "Please summarize this book in two sentences:\n\n" + text

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "user", "content": prompt}
    ],
)
print(response.choices[0].message.content)
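
For long documents like this, it can be nicer to stream the response as it's generated rather than waiting for the full completion; the OpenAI SDK supports this against vLLM with stream=True. (You can also inspect response.usage.prompt_tokens on the non-streaming call above to see how many tokens the book actually consumed.) A minimal streaming variant:

# Stream the summary token-by-token instead of waiting for the full response.
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)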

Model Response

We see that our model was able to ingest the full text of the novel and produce a concise summary.

Here is a two-sentence summary of The Great Gatsby:

Set in the roaring twenties, The Great Gatsby is a classic novel by F. Scott Fitzgerald that revolves around the mysterious millionaire Jay Gatsby and his obsession with winning back his lost love, Daisy Buchanan. Through the eyes of narrator Nick Carraway, the novel explores themes of wealth, class, love, and the corrupting influence of materialism in the excesses of the Jazz Age.

Running Llama 4 Scout on 4× H100s

After successfully deploying Scout on 8× H100s, we'll now try a more cost-effective setup using 4× H100s. Halving the aggregate VRAM leaves less room for the KV cache, so we'll also halve the context window to 100k tokens in exchange for lower GPU costs.

We'll start by searching for an instance that meets our specifications:

vastai search offers " \
num_gpus = 4 \
gpu_name = H100_SXM \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 600 \
rentable = true"

We'll choose a machine from our search and create our instance with 4 GPUs and a 100k token context window using --tensor-parallel-size 4 and --max-model-len 100000:

export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID \
    --image vllm/vllm-openai:latest \
    --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_HUB_TOKEN' \
    --disk 600 \
    --args --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
        --tensor-parallel-size 4 \
        --max-model-len 100000 \
        --override-generation-config='{"attn_temperature_tuning": true}'

Once our instance is running, we'll set the new VAST_IP_ADDRESS and VAST_PORT and call our instance like we did earlier:

from openai import OpenAI

# Your Vast.ai instance details
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize the client with your server URL
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require an actual API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

prompt = "Please summarize this book in two sentences:\n\n" + text

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "user", "content": prompt}
    ],
)
print(response.choices[0].message.content)

Model Response

Here is a two-sentence summary of The Great Gatsby:

Set in the roaring twenties, The Great Gatsby is a classic novel by F. Scott Fitzgerald that revolves around the mysterious millionaire Jay Gatsby and his obsession with winning back his lost love, Daisy Buchanan. Through the eyes of narrator Nick Carraway, the novel explores themes of love, greed, class, and the corrupting influence of wealth, ultimately leading to a tragic confrontation that exposes the dark underbelly of the American Dream.

Running Llama 4 Maverick on 8× H200s

Finally, we'll upgrade to Llama 4's larger Maverick model with a 100k token context window on 8× H200 GPUs. With 128 experts instead of Scout's 16, Maverick delivers stronger capabilities at similar per-token compute.

To start, we'll search for an H200 machine with 8 GPUs:

vastai search offers " \
num_gpus = 8 \
gpu_name = H200 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 1000 \
rentable = true"

We'll pick a machine from our search and create an instance running Llama 4 Maverick with 8 GPUs and a 100k token context window by setting --model meta-llama/Llama-4-Maverick-17B-128E-Instruct, --tensor-parallel-size 8, and --max-model-len 100000. Because Maverick's 128 experts bring its total parameter count to roughly 400B, we also increase the disk allocation with --disk 1000:

export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID \
    --image vllm/vllm-openai:latest \
    --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_HUB_TOKEN' \
    --disk 1000 \
    --args --model meta-llama/Llama-4-Maverick-17B-128E-Instruct \
        --tensor-parallel-size 8 \
        --max-model-len 100000 \
        --override-generation-config='{"attn_temperature_tuning": true}'

Finally, we'll set the new VAST_IP_ADDRESS and VAST_PORT, change our model to "meta-llama/Llama-4-Maverick-17B-128E-Instruct", and call our instance:

from openai import OpenAI

# Your Vast.ai instance details
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize the client with your server URL
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require an actual API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

prompt = "Please summarize this book in two sentences:\n\n" + text

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[
        {"role": "user", "content": prompt}
    ],
)
print(response.choices[0].message.content)

Model Response

Here is a two-sentence summary of The Great Gatsby:

The novel is narrated by Nick Carraway, who becomes fascinated with his wealthy neighbor Jay Gatsby and becomes entangled in Gatsby's quest to win back his lost love, Daisy Buchanan, through a complex web of relationships and tragic events. Ultimately, Gatsby's dream is destroyed, and he is murdered by George Wilson, the husband of Myrtle Wilson, who was having an affair with Tom Buchanan, Daisy's husband, highlighting the corrupting influence of wealth and the elusiveness of the American Dream.

Conclusion

In this post, we've demonstrated how to deploy Meta's Llama 4 models on Vast.ai using three specific hardware configurations: Llama 4 Scout on 8× H100 GPUs with a 200k token context window, Scout on 4× H100 GPUs with a 100k context window, and Llama 4 Maverick on 8× H200 GPUs with a 100k context window. We've shown that with the right GPU setup, you can easily deploy and interact with these models through an OpenAI-compatible API.

Now that you have a basic deployment working, here are some next steps you might consider:

  • Expand the context window: While we demonstrated context windows of 100k-200k tokens, Llama 4 Scout supports up to 10M. Try increasing --max-model-len to experiment with longer documents, books, or codebases. Note: the KV cache grows with context length, so you'll also need more (or larger) GPUs.

  • Explore multimodal capabilities: This tutorial focused on text, but Llama 4 is natively multimodal. Try adapting the deployment to accept image inputs using vLLM's multimodal support and the early fusion architecture - see the sketch below.
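
As a starting point, here's a minimal sketch of what a multimodal request could look like against the same endpoint, assuming the served model accepts OpenAI-style image_url content parts (the image URL below is a placeholder):

from openai import OpenAI

VAST_IP_ADDRESS = ""  # your instance IP
VAST_PORT = ""        # your forwarded port

client = OpenAI(
    api_key="EMPTY",
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1",
)

# Hypothetical multimodal request: one user message mixing a text part and
# an image part, following the OpenAI vision-style message format.
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)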

By leveraging Vast.ai's flexible GPU infrastructure, you can cost-effectively experiment with these cutting-edge models and build applications that take advantage of their sophisticated reasoning capabilities.
