Running OpenAI's GPT-OSS on Vast.ai

August 6, 2025
7 Min Read
By Team Vast

Introduction

OpenAI has just released the GPT-OSS family of models, marking their return to open-weight model releases. The family includes GPT-OSS-120B (120 billion parameters) and GPT-OSS-20B (20 billion parameters), offering developers access to state-of-the-art language capabilities previously locked behind proprietary APIs.

GPT-OSS offers configurable reasoning effort across three levels - "low" for quick responses, "medium" for balanced output, and "high" for complex problem-solving. The models use OpenAI's harmony encoding system, which structures conversations into roles and channels and enables fine-grained control over the model's thinking process.

This breakthrough means you can now:

  • Deploy OpenAI-quality models on your own infrastructure
  • Scale inference according to your exact needs

Vast.ai provides the perfect platform for running GPT-OSS models. With its marketplace of GPUs, you can choose the right hardware for your needs - whether running the efficient 20B model or the powerful 120B variant.

In this guide, we'll show you how to deploy both GPT-OSS models on Vast.ai using vLLM for optimized inference, with a focus on the 120B model. You'll learn how to interact with these models using the harmony encoding system for different reasoning levels.

Setting Up the Environment

Before we can deploy our model, we need to set up our Vast.ai environment. First, install the Vast.ai CLI along with the OpenAI client and harmony libraries:

# Install required packages
pip install --upgrade vastai
pip install --upgrade openai
pip install --upgrade openai-harmony

Set up your Vast API key (available from your Account Page):

# Set your Vast.ai API key
export VAST_API_KEY="" # Your key here
vastai set api-key $VAST_API_KEY

Choosing the Right Hardware

The GPT-OSS models have different hardware requirements:

  • GPT-OSS-120B: Requires an H100 GPU with 80GB VRAM
  • GPT-OSS-20B: Requires 16GB+ VRAM

For this guide, we'll demonstrate with the 120B model on an H100. Let's search for suitable instances:

# Search for suitable GPU instances
vastai search offers " \
gpu_name = H100_SXM \
geolocation=US \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 120 \
rentable = true"

Deploying GPT-OSS with vLLM

Now let's deploy our GPT-OSS instance using vLLM's optimized inference server. We'll start from the vLLM OpenAI-compatible image and install a GPT-OSS-enabled vLLM build at startup; the pip install commands come from the model's Hugging Face pages (120B, 20B).

Deploying GPT-OSS-120B (H100 recommended)

# Deploy vLLM instance for 120B model
export INSTANCE_ID= # Insert instance ID from search results

vastai create instance $INSTANCE_ID \
  --image vllm/vllm-openai:latest \
  --env '-p 8000:8000' \
  --disk 160 \
  --onstart-cmd 'uv pip install --system --upgrade transformers kernels torch openai; uv pip install --system --pre vllm==0.10.1+gptoss --extra-index-url https://wheels.vllm.ai/gpt-oss/ --extra-index-url https://download.pytorch.org/whl/nightly/cu128; vllm serve openai/gpt-oss-120b --max-model-len 80000'

Alternative: Deploying GPT-OSS-20B on A100

Here's how to deploy the 20B model on an A100 GPU:

# Deploy vLLM instance for 20B model
export INSTANCE_ID= # Insert instance ID from search results

vastai create instance $INSTANCE_ID \
  --image vllm/vllm-openai:latest \
  --env '-p 8000:8000' \
  --disk 80 \
  --onstart-cmd 'uv pip install --system --upgrade transformers kernels torch openai; uv pip install --system --pre vllm==0.10.1+gptoss --extra-index-url https://wheels.vllm.ai/gpt-oss/ --extra-index-url https://download.pytorch.org/whl/nightly/cu128; vllm serve openai/gpt-oss-20b'

After deployment, wait for the model to download and start serving. You can monitor the instance logs to see when it's ready. Once running, find your instance's IP address and port from the Instances tab in the Vast.ai Console.
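
If you'd rather script the wait than watch the logs, here's a minimal sketch that polls the server's OpenAI-compatible /v1/models endpoint until the model is loaded (fill in the IP address and port you copied from the Instances tab):

# Poll the server until the model is loaded and serving
import time
from openai import OpenAI

VAST_IP_ADDRESS = ""  # from the Instances tab
VAST_PORT = ""        # from the Instances tab

client = OpenAI(
    api_key="EMPTY",
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1",
)

while True:
    try:
        # vLLM answers /v1/models once the weights are loaded
        print("Ready, serving:", [m.id for m in client.models.list().data])
        break
    except Exception:
        print("Server not ready yet, retrying in 30 seconds...")
        time.sleep(30)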

Interacting with GPT-OSS Using Harmony Encoding

GPT-OSS uses OpenAI's harmony encoding system, which provides structured conversation formatting and enables different reasoning levels. Let's set up our client:

from openai import OpenAI
from openai_harmony import (
    load_harmony_encoding,
    HarmonyEncodingName,
    Role,
    Message,
    Conversation,
    SystemContent,
    DeveloperContent,
)

# Your server details
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize client
client = OpenAI(
    api_key="EMPTY",
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

# Load harmony encoding
enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

Simple Chat Function

def chat_gpt_oss(prompt, reasoning="medium"):
    """
    Simple chat with GPT-OSS
    reasoning: "low", "medium", or "high"
    """
    # Create conversation with harmony format
    conversation = Conversation.from_messages([
        Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
        Message.from_role_and_content(
            Role.DEVELOPER,
            DeveloperContent.new().with_instructions(f"Reasoning: {reasoning}")
        ),
        Message.from_role_and_content(Role.USER, prompt),
    ])

    # Get formatted prompt
    formatted_prompt = enc.render_conversation_for_completion(conversation, Role.ASSISTANT)

    # Make API call
    response = client.completions.create(
        model="openai/gpt-oss-120b",
        prompt=formatted_prompt,
        max_tokens=2000,
        temperature=0.7,
    )

    return response.choices[0].text


prompt = "Explain quantum computing to a 4th grader. Keep it concise."

Comparing Reasoning Levels

One of GPT-OSS's most interesting features is its configurable reasoning levels. Let's see how the model responds differently to the same prompt with low and high reasoning:

Low Reasoning Response

result = chat_gpt_oss(prompt, "low")
print("----LOW REASONING----")
print(result)

Output:

----LOW REASONING----
analysisWe need to explain quantum computing to a 4th grader, concise. Use simple analogies, like bits as light switches, qubits as coins that can be heads, tails, or both. Keep short.assistantfinal**Quantum Computing in a Nutshell (for a 4th‑grader)**

1. **Regular computers = light switches**
   - A normal computer works with bits that are either **0** (off) or **1** (on), just like a light switch that's either down or up.

2. **Quantum computers = magic coins**
   - A quantum computer uses **qubits**.
   - Think of a qubit like a coin that can be **heads (0)**, **tails (1)**, **or both at the same time** while it's spinning in the air. That "both at once" is called **superposition**.

3. **Working together**
   - When you have many regular bits, you have to try every possible combination one after another.
   - With many qubits, they can be in many combinations **all at once** because they can be linked together in a special way called **entanglement**. It's like a group of magic coins that always know what the others are doing, even if they're far apart.

4. **Why it matters**
   - Because of superposition and entanglement, a quantum computer can solve some puzzles (like cracking codes or simulating molecules) **much faster** than a regular computer.

**Bottom line:** A regular computer is like a row of on/off switches; a quantum computer is like a bunch of spinning magic coins that can be many things at once, letting it try many solutions at the same time.

High Reasoning Response

result = chat_gpt_oss(prompt, "high")
print("----HIGH REASONING----")
print(result)

Output:

----HIGH REASONING----
analysisWe need to explain quantum computing to a 4th grader, concise. Use simple analogies, avoid jargon, but keep accurate. Should be short. Provide analogies like coins, dice, superposition, entanglement. Keep concise.assistantfinal**What is a quantum computer?**

Imagine you have a super‑smart robot that can solve puzzles much faster than a regular computer. A quantum computer is that robot, but instead of using normal "on/off" switches (bits) like a regular computer, it uses tiny, magical switches called **qubits**.

### Bits vs. Qubits
- **Bit (regular computer):** Like a tiny light that can be **off (0)** or **on (1)**.
- **Qubit (quantum computer):** Like a tiny spinning top that can be **0**, **1**, **or both at the same time** (this is called *superposition*).

### Why is that cool?
If you have 10 bits, they can show only one of 2¹⁰ = 1,024 possible patterns at a time.
If you have 10 qubits, they can be in all 2¹⁰ = 1,024 patterns **at once**! That lets a quantum computer try many solutions together.

### A Simple Analogy
Think of a maze puzzle:

- **Regular computer:** Sends a single mouse down the maze, step by step, until it finds the exit.
- **Quantum computer:** Sends a whole swarm of mice that can explore every possible path at the same time and tell you the exit almost instantly.

### Entanglement (Teamwork)
Sometimes two qubits become "best friends" and link together, no matter how far apart they are. Changing one instantly changes the other. This is called **entanglement**, and it lets quantum computers coordinate their work in a super‑efficient way.

### Bottom Line
A quantum computer uses qubits that can be 0, 1, or both at once, and they can be entangled with each other. This lets the machine look at many possibilities at the same time, solving certain problems way faster than ordinary computers.

So, it's like a super‑fast, super‑clever puzzle‑solver that can try lots of answers all at once!

Comparing Low vs High Reasoning Outputs

The quantum computing explanations demonstrate the distinct capabilities of GPT-OSS-120B's reasoning levels. With low reasoning, the model produced a concise, structured explanation using simple analogies like "magic coins" and "light switches" - delivering clear concepts in a compact format suitable for quick understanding.

The high reasoning response revealed significantly more sophisticated thinking. The model provided detailed analogies like the maze puzzle with mice, explained mathematical concepts (2¹⁰ patterns), and structured the explanation with clear sections and formatting. Most notably, it offered multiple perspectives on the same concepts - from the spinning top analogy to the teamwork explanation of entanglement.

This difference in reasoning depth represents a significant advancement in model control. Rather than simply adjusting temperature or other sampling parameters, GPT-OSS allows direct specification of how thoroughly the model should think through a problem. The low reasoning approach prioritized efficiency while maintaining accuracy; the high reasoning approach favored comprehensive understanding, with rich examples and detailed explanations.
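
One practical note: as the raw outputs above show, the completion text includes the model's analysis channel followed by an "assistantfinal" marker before the polished answer. If your application only needs the final answer, a simple heuristic sketch (assuming the marker appears exactly as in the outputs above) is to split on it:

def extract_final(raw_text):
    """Heuristic: drop the analysis channel, keep only the final answer."""
    # The raw completions above look like "analysis<thinking>assistantfinal<answer>",
    # so we split on the "assistantfinal" marker and keep what follows.
    marker = "assistantfinal"
    if marker in raw_text:
        return raw_text.split(marker, 1)[1].strip()
    return raw_text.strip()  # fall back if the marker is absent

print(extract_final(chat_gpt_oss(prompt, "high")))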

Conclusion

OpenAI's return to open-weight releases thus introduces not just access to advanced language capabilities, but a new paradigm for controlling model reasoning depth - enabling applications to adjust dynamically between quick responses and thorough analysis based on context and requirements.
