OpenAI's GPT-OSS-20B model brings transparency to AI reasoning: through the Harmony SDK, you get direct visibility into the model's thought process. Unlike traditional chatbots that hide their decision-making, GPT-OSS models expose their analytical reasoning, function calling logic, and structured responses across multiple communication channels.
This multi-channel approach changes how we interact with AI. The model separates its internal reasoning (analysis channel), function execution (commentary channel), and user-facing responses (final channel), giving developers insight into the AI's decision-making process. This transparency is essential for building AI applications where understanding "why" is as important as getting the right answer.
Running GPT-OSS on Vast.ai combines these open-weight models with affordable GPU infrastructure. Instead of relying on opaque API calls, you control your own deployment, customize its behavior, and gain visibility into the model's reasoning, all on GPUs rented through Vast.ai's marketplace.
In this guide, you'll deploy OpenAI's GPT-OSS-20B model on Vast.ai and build a weather assistant that demonstrates Harmony SDK's multi-channel reasoning system.
The first step is installing the necessary tools to interact with Vast.ai's GPU marketplace and OpenAI's Harmony SDK. These tools provide everything needed to rent GPUs, deploy models, and build applications with structured AI reasoning.
pip install --quiet vastai openai openai-harmony requests
The vastai CLI enables programmatic GPU rental and instance management. The openai library provides the client interface for model interaction, while openai-harmony adds support for the multi-channel reasoning format. The requests library will power our weather data fetching.
Next, configure your Vast.ai API credentials. This key authenticates your account and enables GPU rental through the CLI:
export VAST_API_KEY="<your-api-key>" # Get from https://cloud.vast.ai/account/
vastai set api-key $VAST_API_KEY
Your API key is available in your Vast.ai account settings. This authentication persists across sessions, so you only need to set it once per environment.
Next, find a GPU with enough memory and disk space for GPT-OSS-20B. The model requires an H100 or newer architecture. Search for offers that meet these hardware requirements:
vastai search offers " \
gpu_name in [H100_SXM, H100_NVL] \
gpu_ram >= 40 \
geolocation=US \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 60 \
rentable = true"
This search command filters for H100 GPUs with the necessary specifications. The verified = true flag ensures you're renting from reliable providers, while static_ip and direct_port_count enable external API access. Geographic filtering (geolocation=US) can reduce latency for US-based users.
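If you'd rather pick an offer programmatically, the same search can be scripted from Python. This is a minimal sketch: it assumes your vastai CLI version supports the --raw flag for JSON output and that offers expose id and dph_total (price per hour) fields, so verify both against your installed version.
import json
import subprocess

# Same filters as the CLI search above, written as one query string
QUERY = ("gpu_name in [H100_SXM, H100_NVL] gpu_ram >= 40 geolocation=US "
         "static_ip = true direct_port_count >= 1 verified = true "
         "disk_space >= 60 rentable = true")

# --raw asks the CLI for machine-readable JSON instead of a formatted table (assumed flag)
result = subprocess.run(
    ["vastai", "search", "offers", QUERY, "--raw"],
    capture_output=True, text=True, check=True
)
offers = json.loads(result.stdout)

# Pick the cheapest matching offer by hourly price
cheapest = min(offers, key=lambda o: o["dph_total"])
print(f"Offer {cheapest['id']} at ${cheapest['dph_total']:.3f}/hr")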
With suitable hardware identified, deploy the model using vLLM's optimized inference server. vLLM provides an OpenAI-compatible API endpoint with performance optimizations through PagedAttention and continuous batching.
export INSTANCE_ID=<instance-id> # From your search results
vastai create instance $INSTANCE_ID \
--image vllm/vllm-openai:gptoss \
--env '-p 8000:8000' \
--disk 60 \
--args --model openai/gpt-oss-20b
Deployment parameters explained:
- --image vllm/vllm-openai:gptoss: vLLM's official Docker image specifically built for GPT-OSS models
- --env '-p 8000:8000': Maps container port 8000 to the host, enabling API access
- --disk 60: Allocates 60GB storage for model weights and cache
- --args --model openai/gpt-oss-20b: Specifies the exact model variant to load
The deployment process downloads model weights and initializes the inference server. vLLM handles model sharding, memory allocation, and optimization setup.
Once deployment completes, retrieve your instance's connection details through the Vast.ai console. The platform assigns a public IP address and port mapping for API access.
To connect to your deployed model, note the public IP and port mapping shown in the console. The connection information appears in the format:
XX.XX.XXX.XX:YYYY -> 8000/tcp
Where XX.XX.XXX.XX is your public IP and YYYY is the external port mapped to the container's port 8000.
Test your connection with a completion request:
from openai import OpenAI
# Your instance details from Vast.ai
VAST_IP = "<your-instance-ip>"
VAST_PORT = "<your-port>"
client = OpenAI(
api_key="EMPTY", # vLLM doesn't require authentication
base_url=f"http://{VAST_IP}:{VAST_PORT}/v1"
)
# Quick connection test
try:
response = client.completions.create(
model="openai/gpt-oss-20b",
prompt="Hello, I am"
)
print("IT WORKS! GPT-OSS is running!")
print(f"Test response: {response.choices[0].text}")
except Exception as e:
print(f"Not ready yet. Error: {e}")
print("Wait for the model to load.")
A successful response confirms your model is operational and ready for Harmony SDK integration.
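If the call fails because the weights are still downloading, you can poll the endpoint until it responds instead of retrying by hand. A small sketch, assuming vLLM's OpenAI-compatible model listing is available (the attempt count and delay are arbitrary):
import time

def wait_for_model(client, attempts=30, delay=30):
    """Poll the vLLM endpoint until it answers, up to attempts * delay seconds."""
    for i in range(attempts):
        try:
            models = client.models.list()  # OpenAI-compatible /v1/models listing
            print(f"Ready! Serving: {[m.id for m in models.data]}")
            return True
        except Exception:
            print(f"Attempt {i + 1}/{attempts}: not ready yet, retrying in {delay}s...")
            time.sleep(delay)
    return False

wait_for_model(client)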
OpenAI's Harmony is a response format designed for GPT-OSS models to structure conversations, generate reasoning output, and handle function calls. Unlike traditional single-stream outputs, Harmony enables models to separate their thoughts, actions, and responses into distinct channels.
The three channels are:
- analysis - the model's internal reasoning and chain of thought
- commentary - function calls and tool-use activity
- final - the user-facing response
This format provides transparency into AI decision-making. Developers can observe how the model analyzes problems, why it chooses specific functions, and how it constructs its final response. This visibility helps with debugging, improving prompts, and building trust in AI systems.
Let's initialize the Harmony SDK and explore its capabilities:
from openai_harmony import (
load_harmony_encoding,
HarmonyEncodingName,
Role,
Message,
Conversation,
SystemContent,
DeveloperContent
)
# Load the Harmony encoding for GPT-OSS models
enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
print("Harmony SDK loaded successfully!")
The Harmony encoding handles the special token format that enables multi-channel communication.
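To see what the encoding actually handles, you can round-trip a short string containing Harmony's special tokens. A quick sanity check using only calls shown in this guide (the sample string is illustrative, not real model output):
# Round-trip a tagged string through the encoding to confirm special tokens survive
sample = "<|start|>assistant<|channel|>final<|message|>Hello there!<|return|>"
tokens = enc.encode(sample, allowed_special='all')
print(f"Encoded into {len(tokens)} tokens")
print(enc.decode(tokens))  # should print the same tagged string back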
Before constructing our AI assistant, we need a weather data source. The wttr.in API provides free, anonymous weather data:
import requests
import json
def get_weather(city):
"""Get weather data using the free wttr.in API"""
try:
url = f"https://wttr.in/{city}?format=j1"
response = requests.get(url)
data = response.json()
current = data['current_condition'][0]
return {
"city": city,
"temperature_c": current['temp_C'],
"temperature_f": current['temp_F'],
"description": current['weatherDesc'][0]['value'],
"humidity": current['humidity'],
"wind_speed_kmh": current['windspeedKmph']
}
except Exception as e:
return {"error": f"Could not get weather for {city}"}
# Test the weather function
test_weather = get_weather("London")
print("Weather function test:")
print(json.dumps(test_weather, indent=2))
This function fetches weather data with error handling. It returns structured data that the AI can interpret and present to users.
Harmony conversations follow a hierarchical structure with three message types, each serving a distinct purpose in guiding model behavior:
def create_weather_conversation(user_query):
return Conversation.from_messages([
# 1. SYSTEM (highest priority) - Core behavior and identity
Message.from_role_and_content(
Role.SYSTEM,
SystemContent(
model_identity="You are WeatherBot, a helpful weather assistant. Always show your reasoning in the analysis channel. Valid channels: analysis, commentary, final."
)
),
# 2. DEVELOPER - Custom instructions and constraints
Message.from_role_and_content(
Role.DEVELOPER,
DeveloperContent(
instructions="Use metric units by default. If a city is ambiguous, ask for clarification. Always show your reasoning process. You have access to a get_weather function that takes a city parameter."
)
),
# 3. USER - The actual query
Message.from_role_and_content(Role.USER, user_query)
])
print("Conversation builder ready!")
The SYSTEM message establishes the assistant's identity and fundamental behavior. It defines available channels and ensures the model always explains its reasoning. This message has the highest priority and cannot be overridden by user input.
The DEVELOPER message adds application-specific logic and constraints. Here we specify metric units as default, ambiguity handling rules, and available functions. These instructions shape how the assistant interprets and responds to queries.
The USER message contains the actual query. By structuring conversations this way, we maintain consistent behavior while allowing flexible user interaction.
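If you're curious how this hierarchy is serialized, render a sample conversation and decode the prompt tokens back to text. A short sketch reusing the builder above (the exact token layout you see depends on the SDK version):
# Inspect the Harmony-formatted prompt produced by the role hierarchy
convo = create_weather_conversation("What's the weather like in Tokyo?")
prompt_tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
print(f"Prompt is {len(prompt_tokens)} tokens")
print(enc.decode(prompt_tokens))  # system, developer, and user messages with channel markers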
vLLM's inference server strips special tokens that Harmony uses for channel separation. We need a custom parser to reconstruct the multi-channel structure and extract function calls:
def parse_harmony_response(response_tokens):
"""Parse the model's response and execute any function calls
Why we need this: vLLM strips the special tokens that Harmony uses for multi-channel output.
We need to parse the response to extract function calls and separate channels.
"""
# Parse the response into structured messages
parsed = enc.parse_messages_from_completion_tokens(
response_tokens,
role=Role.ASSISTANT
)
channels = {
"analysis": [],
"commentary": [],
"final": []
}
for message in parsed:
# Get channel designation
channel = getattr(message, 'channel', 'final')
# message.content is a LIST of TextContent objects (not a single object)
# We must iterate through each item to extract the actual text
if hasattr(message, 'content') and isinstance(message.content, list):
for content_item in message.content:
# Extract text from TextContent objects
if hasattr(content_item, 'text'):
content_text = content_item.text
elif isinstance(content_item, str):
content_text = content_item
else:
continue
# Check if this is a function call in the commentary channel
if channel == "commentary" and "to=functions.get_weather" in content_text:
# Extract JSON arguments - they come after "json"
if "json" in content_text:
try:
import json
# Find where "json" appears and take everything after it
json_start = content_text.find("json") + 4
json_str = content_text[json_start:].strip()
# Parse the JSON arguments
func_args = json.loads(json_str)
city = func_args.get("city")
if city:
# Call the weather function
result = get_weather(city)
channels["commentary"].append({
"type": "function_call",
"function": "get_weather",
"args": func_args,
"result": result
})
else:
channels[channel].append({"type": "text", "content": content_text})
except Exception as e:
print(f"Failed to parse function call: {e}")
channels[channel].append({"type": "text", "content": content_text})
else:
channels[channel].append({"type": "text", "content": content_text})
else:
# Regular content for all channels
if channel in channels:
channels[channel].append({
"type": "text",
"content": content_text
})
return channels
print("Response parser ready!")
This parser performs several functions:
- Converts the raw completion tokens back into structured Harmony messages
- Groups each message's content under its channel (analysis, commentary, or final)
- Detects get_weather calls in the commentary channel, extracts their JSON arguments, and executes them
- Attaches each function result alongside the call so the display step can show both
The parser also handles edge cases like malformed JSON, missing arguments, and unexpected content formats.
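Before wiring the parser into the assistant, you can sanity-check it with a hand-written completion that mimics the reconstructed format used in the next step (the text below is illustrative, not real model output):
# Hand-crafted completion in the same tagged format the assistant reconstructs later
sample_completion = (
    "<|start|>assistant<|channel|>analysis<|message|>"
    "The user wants the weather in Tokyo, so I should call get_weather.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>"
    "Let me check Tokyo's weather for you.<|return|>"
)
sample_tokens = enc.encode(sample_completion, allowed_special='all')
for channel, items in parse_harmony_response(sample_tokens).items():
    print(channel, items)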
Now we combine all components into a weather assistant that uses Harmony's multi-channel capabilities:
def weather_assistant(user_query, model_client):
"""Main weather assistant function with proper Harmony support"""
# Create the conversation with proper role hierarchy
conversation = create_weather_conversation(user_query)
# Render the conversation into Harmony format tokens
prompt_tokens = enc.render_conversation_for_completion(conversation, Role.ASSISTANT)
prompt_text = enc.decode(prompt_tokens)
# Send to model for completion
response = model_client.completions.create(
model="openai/gpt-oss-20b",
prompt=prompt_text,
temperature=0.7,
stop=["<|return|>", "<|call|>", "<|end|>"]
)
response_text = response.choices[0].text
# Reconstruct proper Harmony format (vLLM strips special tokens, so we add them back)
formatted = response_text
if formatted.startswith("analysis"):
formatted = "<|start|>assistant<|channel|>" + formatted
formatted = formatted.replace("analysis", "analysis<|message|>", 1)
# Handle channel transitions
formatted = formatted.replace("assistantfinal", "<|end|><|start|>assistant<|channel|>final<|message|>")
formatted = formatted.replace("assistantanalysis", "<|end|><|start|>assistant<|channel|>analysis<|message|>")
formatted = formatted.replace("assistantcommentary", "<|end|><|start|>assistant<|channel|>commentary<|message|>")
# Ensure proper termination
if not formatted.endswith(("<|return|>", "<|call|>", "<|end|>")):
formatted += "<|return|>"
# Encode with special tokens allowed
response_tokens = enc.encode(formatted, allowed_special='all')
# Parse the harmony response
try:
channels = parse_harmony_response(response_tokens)
except Exception as e:
return {
"error": str(e),
"raw_response": response_text
}
return channels
print("Weather assistant ready!")
The assistant orchestrates the interaction flow:
- Builds the conversation with the system/developer/user hierarchy
- Renders it into Harmony-format tokens and decodes them into a prompt for vLLM
- Sends the prompt to the completions endpoint with channel-aware stop tokens
- Reconstructs the special tokens that vLLM strips from the response
- Parses the result into analysis, commentary, and final channels
The temperature setting balances creativity with consistency, while the stop tokens prevent the model from generating beyond logical endpoints.
To make the multi-channel output human-readable, we need a display function that clearly presents each channel's content:
def display_harmony_result(result):
"""Display the Harmony multi-channel response in a formatted way"""
if "error" in result:
print(f"Error: {result['error']}")
return
# Show the reasoning (analysis channel)
if result.get("analysis"):
print("\n📊 AI REASONING (analysis channel):")
for item in result["analysis"]:
if item["type"] == "text":
text = item['content']
if hasattr(text, 'text'):
text = text.text
print(f" {text}")
# Show function calls (commentary channel)
if result.get("commentary"):
print("\n🔧 FUNCTION CALLS (commentary channel):")
for item in result["commentary"]:
if item["type"] == "function_call":
print(f" Calling: {item['function']}({item['args']})")
print(f"\n Weather Data Retrieved:")
weather = item['result']
if 'error' not in weather:
print(f" • Temperature: {weather['temperature_c']}°C ({weather['temperature_f']}°F)")
print(f" • Conditions: {weather['description']}")
print(f" • Humidity: {weather['humidity']}%")
print(f" • Wind: {weather['wind_speed_kmh']} km/h")
else:
print(f" • Error: {weather['error']}")
# Show final response (final channel)
if result.get("final"):
print("\n💬 FINAL RESPONSE (final channel):")
for item in result["final"]:
if item["type"] == "text":
text = item['content']
if hasattr(text, 'text'):
text = text.text
print(f" {text}")
# If no final response, generate one from the weather data
elif result.get("commentary"):
print("\n💬 FINAL RESPONSE (generated):")
for item in result["commentary"]:
if item["type"] == "function_call" and item.get("result"):
weather = item['result']
if 'error' not in weather:
print(f" The current weather in {weather['city']} is {weather['description'].lower()}.")
print(f" It's {weather['temperature_c']}°C ({weather['temperature_f']}°F) with {weather['humidity']}% humidity")
print(f" and winds at {weather['wind_speed_kmh']} km/h.")
This display function transforms raw channel data into organized output. Each channel gets its own section with formatting, making it easy to understand the AI's thought process from reasoning through execution to final response.
Let's see the weather assistant in action with different queries to demonstrate its capabilities:
# Test the weather assistant
query = "What's the weather like in Tokyo?"
print(f"\nQuery: {query}")
print("-" * 60)
result = weather_assistant(query, client)
display_harmony_result(result)
Output:
Query: What's the weather like in Tokyo?
------------------------------------------------------------
📊 AI REASONING (analysis channel):
The user asks: "What's the weather like in Tokyo?" It's a straightforward request for weather in Tokyo. There's no ambiguity. We can use the get_weather function. The user wants the weather. We should call get_weather with city="Tokyo". And then display the result. We should also show reasoning. The answer should be in metric units. The developer instructions say to use metric units by default. That may refer to temperature in Celsius. The get_weather function presumably returns data in metric units. We'll call it.
We need to show reasoning in the analysis channel. Then produce final answer with the weather.
🔧 FUNCTION CALLS (commentary channel):
Calling: get_weather({'city': 'Tokyo'})
Weather Data Retrieved:
• Temperature: 27°C (80°F)
• Conditions: Clear
• Humidity: 85%
• Wind: 23 km/h
💬 FINAL RESPONSE (generated):
The current weather in Tokyo is clear.
It's 27°C (80°F) with 85% humidity
and winds at 23 km/h.
Notice how the analysis channel reveals the model's complete reasoning process. It identifies the request type, confirms there's no ambiguity, recalls the developer instructions about metric units, and plans its response strategy. This transparency helps with understanding and debugging AI behavior.
# Test the weather assistant
query = "What's the weather like in Paris?"
print(f"\nQuery: {query}")
print("-" * 60)
result = weather_assistant(query, client)
display_harmony_result(result)
Output:
Query: What's the weather like in Paris?
------------------------------------------------------------
📊 AI REASONING (analysis channel):
We need to answer weather for Paris. But Paris is ambiguous? Possibly Paris, France or Paris, Texas? Typically default to Paris, France. But could ask clarification. The instruction says if city ambiguous, ask for clarification. But the user says "What's the weather like in Paris?" Without specifying country. We can assume Paris, France. But ambiguous: there are many Paris. Usually default is Paris, France. But the instruction says if ambiguous, ask for clarification. But "Paris" is ambiguous by definition. However guidelines say "If a city is ambiguous, ask for clarification." So we should ask clarifying question. But maybe we can interpret that user likely means Paris, France. But better to ask for clarification to be safe. So we ask: "Which Paris? Paris, France or Paris, Texas, etc." Let's ask: "Do you mean Paris, France or another Paris?" Then we can proceed.
💬 FINAL RESPONSE (final channel):
Sure! Paris can refer to several places. Could you let me know which one you're interested in? For example, Paris, France or Paris, Texas (or another city named Paris).
The model demonstrates reasoning about ambiguity. It recognizes multiple cities share the name "Paris," weighs the likelihood of different interpretations, recalls the developer instruction to ask for clarification, and decides to request more information rather than making assumptions.
# Test the weather assistant
query = "What's the weather like in Paris, France?"
print(f"\nQuery: {query}")
print("-" * 60)
result = weather_assistant(query, client)
display_harmony_result(result)
Output:
Query: What's the weather like in Paris, France?
------------------------------------------------------------
📊 AI REASONING (analysis channel):
The user asks for weather in Paris, France. That is a city. It's ambiguous? "Paris" could be Paris, France or Paris, Texas etc. The user added "France" to clarify. So it's clear: Paris, France. Should use get_weather function with city parameter "Paris, France". Ensure metric units? The function may return data. We need to call get_weather.
🔧 FUNCTION CALLS (commentary channel):
Calling: get_weather({'city': 'Paris, France'})
Weather Data Retrieved:
• Temperature: 22°C (72°F)
• Conditions: Partly cloudy
• Humidity: 46%
• Wind: 6 km/h
💬 FINAL RESPONSE (generated):
The current weather in Paris, France is partly cloudy.
It's 22°C (72°F) with 46% humidity
and winds at 6 km/h.
With clarification provided, the model proceeds confidently. The analysis channel shows it recognizes the disambiguation, confirming "Paris, France" removes ambiguity. It then executes the weather function and presents results.
This guide demonstrated deploying OpenAI's GPT-OSS-20B model on Vast.ai with the Harmony SDK for multi-channel reasoning. The weather assistant shows how the model separates its internal reasoning, function calls, and user responses into distinct channels, providing full transparency into its decision-making process.
The combination of Vast.ai's GPU infrastructure and Harmony's structured format enables building AI applications where understanding the model's reasoning is as important as getting the right answer.