May 5, 2025 · Llama 4 · NVIDIA · Cloud Computing · vLLM · Inference · Meta
Meta's Llama 4 is a breakthrough family of AI models combining state-of-the-art multimodal capabilities with the computational efficiency of a mixture-of-experts (MoE) architecture. These models can process both text and images through an early fusion design that seamlessly integrates the two modalities. Perhaps most impressively, Llama 4 Scout supports context windows of up to 10 million tokens, dramatically expanding what's possible with large language models.
The Llama 4 family includes:
Llama 4 Scout: 17B active parameters with 16 experts - efficient enough to run on modest GPU setups.
Llama 4 Maverick: 17B active parameters but with 128 experts - offering enhanced capabilities while maintaining efficiency.
Llama 4 Behemoth: 288B active parameters with 16 experts - Meta's most powerful model (not yet publicly released).
What makes these models special:
Native multimodality: text and images are handled by a single early fusion architecture.
MoE efficiency: only a fraction of the total parameters are active per token, keeping inference costs down.
Long context: context windows of up to 10 million tokens on Scout.
In this guide, we'll deploy Llama 4 models on Vast.ai using three practical hardware configurations:
Llama 4 Scout on 8× H100 GPUs with a 200k token context window
Llama 4 Scout on 4× H100 GPUs with a 100k token context window
Llama 4 Maverick on 8× H200 GPUs with a 100k token context window
You'll learn exactly how to set up each configuration, deploy the models with the vLLM server, and interact with them through an OpenAI-compatible API. By the end, you'll have hands-on experience running some of the most advanced AI models available today.
Let's start by installing the Vast.ai command-line tools and setting up our API key:
pip install --upgrade vastai
# Here we will set our api key
export VAST_API_KEY="" #Your key here
vastai set api-key $VAST_API_KEY
First, we need to search for GPUs on Vast.ai to run the Llama 4 Scout model on 8× H100 GPUs with a 200k token context window. This model requires specific hardware capabilities to run efficiently with vLLM's optimizations. Here are our requirements:
8× H100 GPUs (80GB+ of VRAM each) to accommodate the model weights and the KV cache for a 200k token context window
A static IP address for a stable, reachable API endpoint
At least one direct port that we can forward so we can reach the vLLM server from outside the container
At least 600GB of disk space to hold the model weights and other dependencies
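As a quick sanity check on these numbers, here's a back-of-envelope estimate. This is just a sketch: it assumes Scout's roughly 109B total parameters served in bf16 and ignores activation and framework overhead.
# Back-of-envelope VRAM check (assumption: ~109B total params for Scout, served in bf16)
total_params = 109e9                 # Scout: 17B active, ~109B total across 16 experts
weight_gb = total_params * 2 / 1e9   # 2 bytes per parameter in bf16 -> ~218GB of weights
vram_gb = 8 * 80                     # 8x H100 with 80GB each -> 640GB
print(f"weights ~{weight_gb:.0f}GB, VRAM {vram_gb}GB, "
      f"~{vram_gb - weight_gb:.0f}GB left for KV cache and overhead")
The remaining headroom is what the long-context KV cache lives in, which is why the 4-GPU setup later in this guide has to shrink the context window.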
Let's search for machines that meet these requirements:
vastai search offers " \
gpu_name = H100_NVL \
num_gpus = 8 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 600 \
rentable = true"
We'll choose a machine from our search results and copy and paste its id below to set INSTANCE_ID.
We use vastai create instance to create an instance that:
Uses the vllm/vllm-openai:latest docker image. This gives us an OpenAI-compatible server.
Forwards port 8000 to the outside of the container, which is the default OpenAI server port.
Uses --model meta-llama/Llama-4-Scout-17B-16E-Instruct to serve the Llama 4 Scout model.
Uses --tensor-parallel-size 8 to distribute inference across multiple GPUs.
Uses --max-model-len 200000 to set a 200k token context window.
Uses --override-generation-config='{"attn_temperature_tuning": true}' for optimal text generation quality.
Uses --disk 600 to ensure that we have 600GB of disk space for model weights and dependencies.
Note: Ensure that you fill in your Hugging Face token for HUGGING_FACE_HUB_TOKEN to access the model. You'll need to accept the Llama 4 license terms on the model's Hugging Face page: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
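Optionally, before renting a GPU you can confirm that your token has actually been granted access to the gated repo. A minimal sketch using the huggingface_hub library; model_info raises an error if your token can't see the model:
from huggingface_hub import HfApi

api = HfApi(token="")  # your Hugging Face token here
# Raises an error (e.g. a gated-repo error) if the license hasn't been accepted
api.model_info("meta-llama/Llama-4-Scout-17B-16E-Instruct")
print("Token has access to Llama 4 Scout")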
export INSTANCE_ID=19319856 #insert instance ID
vastai create instance $INSTANCE_ID --image vllm/vllm-openai:latest --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_HUB_TOKEN' --disk 600 --args --model meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 8 --max-model-len 200000 --override-generation-config='{"attn_temperature_tuning": true}'
Now, we need to get our IP address and port to call our model. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done.
Next, go to the Instances tab in the Vast AI Console and find the instance you just created.
At the top of the instance, there is a button with an IP address in it. Click it and a panel will appear showing the IP address and the forwarded ports. You should see something like:
Open Ports
XX.XX.XXX.XX:YYYY -> 8000/tcp
You will need the IP address (XX.XX.XXX.XX) and the port (YYYY) for the next step.
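You can also check readiness programmatically instead of watching the logs. Here's a minimal sketch that polls the /v1/models endpoint exposed by vLLM's OpenAI-compatible server until the model is loaded; fill in the IP and port you just noted:
import time
import requests

VAST_IP_ADDRESS = ""  # XX.XX.XXX.XX from the panel
VAST_PORT = ""        # YYYY from the panel

url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1/models"
while True:
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code == 200:
            print("Server is ready, serving:", resp.json()["data"][0]["id"])
            break
    except requests.exceptions.RequestException:
        pass  # server not up yet
    print("Not ready yet, retrying in 30 seconds...")
    time.sleep(30)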
To call our model we will install the OpenAI SDK.
pip install --upgrade openai
To show the power of our large context window, we will download The Great Gatsby from Project Gutenberg and ask Llama 4 to summarize it.
Here we download the text for the book:
import requests
# Download the text file to a variable
url = "https://www.gutenberg.org/ebooks/64317.txt.utf-8"
text = requests.get(url).text
# Print the first 200 characters to confirm it worked
print(text[:200])
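Before sending the whole book, it's worth a rough check that it fits comfortably in the context window. A common approximation (an assumption, not an exact count: English prose averages around 4 characters per token) is:
# Rough token estimate: English text averages ~4 characters per token
approx_tokens = len(text) / 4
print(f"Approximately {approx_tokens:,.0f} tokens")
# The Great Gatsby comes out well under 100k tokens, comfortably inside our 200k window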
We will then use the OpenAI SDK. To do this, we need to set the VAST_IP_ADDRESS and VAST_PORT that we found above.
from openai import OpenAI

# Your Vast.ai instance details
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize the client with your server URL
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require an actual API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

prompt = "please summarize this book in two sentences:" + text

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "user", "content": prompt}
    ],
)

print(response.choices[0].message.content)
We see that our model was able to ingest a large amount of text and summarize our book.
Here is a two-sentence summary of The Great Gatsby:
Set in the roaring twenties, The Great Gatsby is a classic novel by F. Scott Fitzgerald that revolves around the mysterious millionaire Jay Gatsby and his obsession with winning back his lost love, Daisy Buchanan. Through the eyes of narrator Nick Carraway, the novel explores themes of wealth, class, love, and the corrupting influence of materialism in the excesses of the Jazz Age.
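Because the endpoint is OpenAI-compatible, streaming also works out of the box. Here's a minimal sketch that reuses the client and prompt above and prints tokens as they arrive:
# Stream the response token-by-token instead of waiting for the full completion
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": prompt}],
    stream=True,  # receive chunks as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)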
After successfully deploying Scout on 8× H100s, we'll now try a more cost-effective setup using 4× H100s. This will reduce our context window capacity but allow us to save on GPU costs.
We'll start by searching for an instance that meets our specifications:
vastai search offers " \
num_gpus = 4 \
gpu_name = H100_SXM \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 600 \
rentable = true"
We'll choose a machine from our search and create our instance with 4 GPUs and a 100k token context window using --tensor-parallel-size 4 and --max-model-len 100000:
export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID --image vllm/vllm-openai:latest --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_HUB_TOKEN' --disk 600 --args --model meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 4 --max-model-len 100000 --override-generation-config='{"attn_temperature_tuning": true}'
Once our instance is running, we'll set the new VAST_IP_ADDRESS and VAST_PORT and call our instance like we did earlier:
from openai import OpenAI

# Your Vast.ai instance details
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize the client with your server URL
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require an actual API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

prompt = "please summarize this book in two sentences:" + text

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "user", "content": prompt}
    ],
)

print(response.choices[0].message.content)
Here is a two-sentence summary of The Great Gatsby:
Set in the roaring twenties, The Great Gatsby is a classic novel by F. Scott Fitzgerald that revolves around the mysterious millionaire Jay Gatsby and his obsession with winning back his lost love, Daisy Buchanan. Through the eyes of narrator Nick Carraway, the novel explores themes of love, greed, class, and the corrupting influence of wealth, ultimately leading to a tragic confrontation that exposes the dark underbelly of the American Dream.
Finally, we'll upgrade to Llama 4's larger Maverick model with a 100k token context window on 8× H200 GPUs, a more powerful configuration that delivers enhanced capabilities.
To start, we'll search for an H200 machine with 8 GPUs:
vastai search offers " \
num_gpus = 8 \
gpu_name = H200 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 1000 \
rentable = true"
We'll find a machine from our search and create an instance with Llama 4 Maverick using 8 GPUs and a 100k token context window by setting --disk 1000, --model meta-llama/Llama-4-Maverick-17B-128E-Instruct, --tensor-parallel-size 8, and --max-model-len 100000:
export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID --image vllm/vllm-openai:latest --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_HUB_TOKEN' --disk 1000 --args --model meta-llama/Llama-4-Maverick-17B-128E-Instruct --tensor-parallel-size 8 --max-model-len 100000 --override-generation-config='{"attn_temperature_tuning": true}'
Finally, we'll set the new VAST_IP_ADDRESS and VAST_PORT, change our model to "meta-llama/Llama-4-Maverick-17B-128E-Instruct", and call our instance:
from openai import OpenAI

# Your Vast.ai instance details
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize the client with your server URL
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require an actual API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

prompt = "please summarize this book in two sentences:" + text

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[
        {"role": "user", "content": prompt}
    ],
)

print(response.choices[0].message.content)
Here is a two-sentence summary of The Great Gatsby:
The novel is narrated by Nick Carraway, who becomes fascinated with his wealthy neighbor Jay Gatsby and becomes entangled in Gatsby's quest to win back his lost love, Daisy Buchanan, through a complex web of relationships and tragic events. Ultimately, Gatsby's dream is destroyed, and he is murdered by George Wilson, the husband of Myrtle Wilson, who was having an affair with Tom Buchanan, Daisy's husband, highlighting the corrupting influence of wealth and the elusiveness of the American Dream.
In this post, we've demonstrated how to deploy Meta's Llama 4 models on Vast.ai using three specific hardware configurations: Llama 4 Scout on 8× H100 GPUs with a 200k token context window, Scout on 4× H100 GPUs with a 100k context window, and Llama 4 Maverick on 8× H200 GPUs with a 100k context window. We've shown that with the right GPU setup, you can easily deploy and interact with these models through an OpenAI-compatible API.
Now that you have a basic deployment working, here are some next steps you might consider:
Expand the context window: While we demonstrated context windows of 100k-200k tokens, Llama 4 Scout supports up to 10M tokens. Try increasing --max-model-len to experiment with longer documents, books, or codebases. Note: You will also need to increase the number of GPUs.
Explore multimodal capabilities: The tutorial focused on text processing, but Llama 4 is natively multimodal. Try adapting the deployment to accept image inputs using vLLM's multimodal capabilities and the early fusion architecture.
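For instance, the OpenAI chat format accepts image_url content parts, which vLLM supports for multimodal models. A minimal sketch, reusing the client from the Maverick deployment above, with a placeholder image URL:
# Send an image alongside a text question (multimodal chat completion)
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
)
print(response.choices[0].message.content)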
By leveraging Vast.ai's flexible GPU infrastructure, you can cost-effectively experiment with these cutting-edge models and build applications that take advantage of their sophisticated reasoning capabilities.