The landscape of AI inference is rapidly evolving, with organizations seeking cost-effective solutions that don't compromise on performance or control. While cloud providers offer convenient managed services, they often come with premium pricing and limited customization options. On the other hand, on-premises deployments provide control but require significant hardware investments.
A hybrid approach offers the best of both worlds: maintaining local control over your inference pipeline while leveraging cost-effective remote GPU resources. This blog post demonstrates how to implement such a system using LiteLLM as a local proxy and Vast.ai for remote GPU hosting.
This architecture combines local control with cost-effective cloud GPU access through two key components:
LiteLLM is a Python SDK and proxy server that provides a unified OpenAI-compatible interface for 100+ LLM APIs from different providers. It includes features like cost tracking, rate limiting, and logging capabilities when used as a proxy server.
Vast.ai is a GPU marketplace that offers access to over 10,000 GPUs from secure datacenters. According to their website, users can save up to 80% compared to traditional cloud services with their pay-as-you-go pricing model.
In this blog post, we'll deploy a complete hybrid inference pipeline:
- Rent a GPU instance on Vast.ai and serve the deepseek-ai/DeepSeek-R1-0528-Qwen3-8B model with vLLM
- Run LiteLLM locally as an OpenAI-compatible proxy in front of that remote endpoint
- Test the setup with the OpenAI Python client, including the model's reasoning output
The result is a flexible, cost-effective inference setup that you control locally while leveraging remote GPU resources.
Let's start by installing and configuring the Vast AI API client.
Get your API key from the Account Page in the Vast Console.
pip install --upgrade vastai
# Set your Vast.ai API key
export VAST_API_KEY=""  # your key here
vastai set api-key $VAST_API_KEY
For our LiteLLM + vLLM pipeline, we need a GPU with at least 24 GB of GPU RAM to hold the DeepSeek-R1-0528-Qwen3-8B weights and KV cache, plus a static IP address with at least one direct port for a stable connection.
Let's search for suitable instances:
vastai search offers "compute_cap >= 750 \
gpu_ram >= 24 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 80 \
rentable = true"
Choose an instance ID from the search results above and deploy using the Docker command below.
This deployment uses the vllm/vllm-openai:latest image to serve the deepseek-ai/DeepSeek-R1-0528-Qwen3-8B model with reasoning capabilities on port 8000:
export INSTANCE_ID="" #insert instance ID
vastai create instance $INSTANCE_ID \
--image vllm/vllm-openai:latest \
--env '-p 8000:8000' \
--disk 60 \
--args \
--model deepseek-ai/DeepSeek-R1-0528-Qwen3-8B \
--served-model-name deepseek \
--max-model-len 4096 \
--reasoning-parser qwen3
After deployment, go to the Instances Tab in the Vast AI Console and find your instance.
Click the IP address button at the top of the instance. A panel will show the IP address and forwarded ports:
Open Ports
XX.XX.XXX.XX:YYYY -> 8000/tcp
Note down VAST_IP (XX.XX.XXX.XX) and VAST_PORT (YYYY) for the next steps.
Now let's install LiteLLM locally to proxy requests to our Vast.ai server:
pip install litellm
pip install 'litellm[proxy]'
Set your Vast.ai instance details from the previous step:
# Configure your Vast.ai instance details
VAST_IP = ""
VAST_PORT = ""
MODEL_NAME = "deepseek"
# Configure LiteLLM settings
LITELLM_PORT = 4000 # Change this if port 4000 is in use
# Create the API base URL
API_BASE = f"http://{VAST_IP}:{VAST_PORT}/v1"
print(f"Vast.ai API Base URL: {API_BASE}")
print(f"LiteLLM will run on port: {LITELLM_PORT}")
# Write config file with variables
config_content = f"""model_list:
  - model_name: {MODEL_NAME}
    litellm_params:
      model: openai/{MODEL_NAME}
      api_base: {API_BASE}
      api_key: fake-key
general_settings:
  master_key: fake-key
"""
with open('litellm_config.yaml', 'w') as f:
    f.write(config_content)
print("Config file created with:")
print(f"- Model: {MODEL_NAME}")
print(f"- API Base: {API_BASE}")
import subprocess
import time
# Kill any existing LiteLLM processes
subprocess.run("pkill -f litellm || true", shell=True)
time.sleep(1)
# Start LiteLLM
cmd = f"nohup litellm --config litellm_config.yaml --port {LITELLM_PORT} --host 0.0.0.0 > litellm.log 2>&1 &"
subprocess.run(cmd, shell=True)
time.sleep(5)
print(f"✅ LiteLLM running on port: {LITELLM_PORT}")
print(f"URL: http://localhost:{LITELLM_PORT}/v1")
# Save port for other cells
with open('.litellm_port', 'w') as f:
    f.write(str(LITELLM_PORT))
Output:
✅ LiteLLM running on port: 4000
URL: http://localhost:4000/v1
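Before moving on, you can optionally verify that the proxy is reachable and has registered the model. This sketch assumes the requests package is installed and reuses the master_key from litellm_config.yaml as the Bearer token:
import requests  # assumed to be installed: pip install requests

# List the models the local proxy exposes; "deepseek" should appear in the response.
resp = requests.get(
    f"http://localhost:{LITELLM_PORT}/v1/models",
    headers={"Authorization": "Bearer fake-key"},  # the master_key from the config
    timeout=10,
)
print(resp.status_code, resp.json())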
Now install the OpenAI client to test our setup:
pip install openai
Let's test our complete LiteLLM + Vast.ai setup with a basic API call:
from openai import OpenAI
client = OpenAI(
    api_key="fake-key",
    base_url=f"http://localhost:{LITELLM_PORT}/v1"
)
# Basic test
try:
    response = client.chat.completions.create(
        model="deepseek",
        messages=[
            {"role": "user", "content": "Hello! How are you?"}
        ],
        max_tokens=1000
    )
    print("Response:", response.choices[0].message.content)
except Exception as e:
    print(f"❌ Error: {e}")
Output:
Response:
Hello! I'm just a friendly AI here, so I don't get tired or have feelings—but I'm fully ready and happy to help you! How are you doing today? Let me know if there's anything you'd like to chat about or need assistance with.
Perfect! Now let's test the reasoning capabilities of the DeepSeek model:
try:
    response_1 = client.chat.completions.create(
        model="deepseek",
        messages=[
            {"role": "user", "content": "What is 8 * 7? Give me your answer with step-by-step reasoning."}
        ],
        max_tokens=3000
    )
    print("Response:", response_1.choices[0].message.content)
except Exception as e:
    print(f"❌ Error: {e}")
Output:
Response:
### Step-by-Step Reasoning for 8 * 7
Multiplication is a basic arithmetic operation that can be thought of as repeated addition. To find the product of 8 and 7, you can add the number 8 to itself 7 times (or equivalently, add the number 7 to itself 8 times). I'll use the repeated addition method with the number 8 to make it clear.
Start with the first 8:
- Add the second 8: 8 + 8 = 16
- Add the third 8: 16 + 8 = 24
- Add the fourth 8: 24 + 8 = 32
- Add the fifth 8: 32 + 8 = 40
- Add the sixth 8: 40 + 8 = 48
- Add the seventh 8: 48 + 8 = 56
As shown, after adding 8 seven times, the result is 56.
Alternatively, you can verify this with another method, such as multiplying 7 eight times or subtracting 8 from 8 * 8 (since 8 * 7 = (8 * 8) - 8 = 64 - 8 = 56). Both methods confirm the result.
**Final Answer:** 56
The pipeline is working correctly. You now have a complete LiteLLM + Vast.ai setup with local proxy control and remote GPU inference.
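Streaming also works through the proxy. Here is a minimal sketch that reuses the same client and prints tokens as they arrive:
# Stream tokens from the remote GPU through the local proxy.
try:
    stream = client.chat.completions.create(
        model="deepseek",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=200,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()
except Exception as e:
    print(f"❌ Error: {e}")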
Note: For production deployments, consider using Docker containers for LiteLLM rather than the command-line approach shown here.
This hybrid inference architecture demonstrates how modern AI deployment can balance cost, control, and performance. By combining LiteLLM's local proxy capabilities with Vast.ai's GPU marketplace, we've created a system that offers several key advantages:
Cost Considerations: Vast.ai's marketplace model and pay-as-you-go pricing can provide cost savings compared to traditional cloud providers, while running LiteLLM locally eliminates additional hosting costs for the proxy layer.
Local Control: Running LiteLLM locally provides control over request routing, logging, and configuration. The proxy server includes built-in features for cost tracking, rate limiting, and observability integrations.
API Compatibility: LiteLLM's OpenAI-compatible interface allows existing applications to integrate with minimal changes, while supporting multiple LLM providers through a unified API.
Flexible Architecture: This setup allows you to experiment with different models and providers while maintaining a consistent interface for your applications.
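For example, routing to an additional hosted provider only requires another entry in model_list. The sketch below is hypothetical (the second model name is illustrative, and it assumes an OPENAI_API_KEY environment variable); it reuses the MODEL_NAME and API_BASE variables from earlier and LiteLLM's os.environ/ convention for reading provider keys:
# Hypothetical config: the same proxy serving both the Vast.ai-hosted model
# and a hosted provider model, selected per request via the "model" field.
multi_provider_config = f"""model_list:
  - model_name: {MODEL_NAME}
    litellm_params:
      model: openai/{MODEL_NAME}
      api_base: {API_BASE}
      api_key: fake-key
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
general_settings:
  master_key: fake-key
"""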
Whether you're building AI-powered applications, conducting research, or optimizing existing ML pipelines, this hybrid approach provides a practical path to cost-effective, controllable AI inference that doesn't sacrifice capability for economics.