Hybrid AI Inference: Local LiteLLM Proxy with Remote Vast.ai GPU

July 31, 2025
6 Min Read
By Team Vast

The landscape of AI inference is rapidly evolving, with organizations seeking cost-effective solutions that don't compromise on performance or control. While cloud providers offer convenient managed services, they often come with premium pricing and limited customization options. On the other hand, on-premises deployments provide control but require significant hardware investments.

A hybrid approach offers the best of both worlds: maintaining local control over your inference pipeline while leveraging cost-effective remote GPU resources. This blog post demonstrates how to implement such a system using LiteLLM as a local proxy and Vast.ai for remote GPU hosting.

The Hybrid Architecture Advantage

This architecture combines local control with cost-effective cloud GPU access through two key components:

LiteLLM is a Python SDK and proxy server that provides a unified OpenAI-compatible interface for 100+ LLM APIs from different providers. It includes features like cost tracking, rate limiting, and logging capabilities when used as a proxy server.
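
As a quick illustration of that unified interface, the sketch below calls the SDK directly against two different providers through the same function. This is a minimal, hypothetical example (the model names are placeholders and require the corresponding API keys); the rest of this post uses LiteLLM's proxy server instead.

# Minimal sketch of LiteLLM's unified SDK interface (placeholder model names).
# One OpenAI-style call signature works across providers.
from litellm import completion

messages = [{"role": "user", "content": "Hello!"}]

# Same function, different providers (assumes OPENAI_API_KEY / ANTHROPIC_API_KEY are set)
openai_response = completion(model="openai/gpt-4o-mini", messages=messages)
anthropic_response = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)

print(openai_response.choices[0].message.content)
print(anthropic_response.choices[0].message.content)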

Vast.ai is a GPU marketplace that offers access to over 10,000 GPUs from secure datacenters. According to their website, users can save up to 80% compared to traditional cloud services with their pay-as-you-go pricing model.

What You'll Learn

In this blog post, we'll deploy a complete hybrid inference pipeline:

  1. Deploy a vLLM server with the DeepSeek-R1 model on Vast.ai
  2. Configure LiteLLM locally to proxy requests to the remote server
  3. Test the complete pipeline using OpenAI client libraries
  4. Demonstrate advanced reasoning capabilities

The result is a flexible, cost-effective inference setup that you control locally while leveraging remote GPU resources.

Set Up the LiteLLM + vLLM Pipeline

Let's start by installing and configuring the Vast AI API client.

Install Vast AI Client

Get your API key from the Account Page in the Vast Console.

pip install --upgrade vastai

# Set your API key
export VAST_API_KEY=""  # your key here
vastai set api-key $VAST_API_KEY
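
To confirm the key is registered before renting hardware, you can query your account details (assuming the vastai CLI's show user subcommand; the output format may vary by CLI version):

# Sanity check: should print your account info if the API key is valid
vastai show user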

Hardware Requirements

For our LiteLLM + vLLM pipeline, we need a GPU with at least 24 GB of VRAM to hold the DeepSeek-R1-0528-Qwen3-8B model weights and KV cache, plus a static IP address with direct ports for stable connections.

Let's search for suitable instances:

vastai search offers "compute_cap >= 750 \
gpu_ram >= 24 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 80 \
rentable = true"

Deploy vLLM on Vast.ai

Choose an instance ID from the search results above and deploy using the Docker command below.

This deployment uses the vllm/vllm-openai:latest image to serve the deepseek-ai/DeepSeek-R1-0528-Qwen3-8B model with reasoning capabilities on port 8000:

export INSTANCE_ID="" #insert instance ID

vastai create instance $INSTANCE_ID \
  --image vllm/vllm-openai:latest \
  --env '-p 8000:8000' \
  --disk 60 \
  --args \
    --model deepseek-ai/DeepSeek-R1-0528-Qwen3-8B \
    --served-model-name deepseek \
    --max-model-len 4096 \
    --reasoning-parser qwen3
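
Provisioning and model download can take a few minutes. One way to watch the instance come up is from the CLI (a sketch assuming the vastai show instance and logs subcommands; you can equally monitor progress in the Vast Console):

# Check instance status (wait until it reports running)
vastai show instance $INSTANCE_ID

# Optionally tail the container logs to see vLLM finish loading the model
vastai logs $INSTANCE_ID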

Get Vast IP and Port

After deployment, go to the Instances Tab in the Vast AI Console and find your instance.

Click the IP address button at the top of the instance. A panel will show the IP address and forwarded ports:

Open Ports
XX.XX.XXX.XX:YYYY -> 8000/tcp

Note down VAST_IP (XX.XX.XXX.XX) and VAST_PORT (YYYY) for the next steps.
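
Before wiring up the proxy, it is worth confirming the vLLM server is reachable directly. vLLM exposes an OpenAI-compatible API, so listing its models should return the served model name we set above (replace the placeholders with your instance's IP and port):

# Replace XX.XX.XXX.XX and YYYY with your VAST_IP and VAST_PORT
curl http://XX.XX.XXX.XX:YYYY/v1/models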

Installing LiteLLM and Dependencies

Now let's install LiteLLM locally to proxy requests to our Vast.ai server:

pip install litellm
pip install 'litellm[proxy]'

Configure LiteLLM

Set your Vast.ai instance details from the previous step:

# Configure your Vast.ai instance details
VAST_IP = ""
VAST_PORT = ""
MODEL_NAME = "deepseek"

# Configure LiteLLM settings
LITELLM_PORT = 4000  # Change this if port 4000 is in use

# Create the API base URL
API_BASE = f"http://{VAST_IP}:{VAST_PORT}/v1"
print(f"Vast.ai API Base URL: {API_BASE}")
print(f"LiteLLM will run on port: {LITELLM_PORT}")
# Write config file with variables
config_content = f"""model_list:
  - model_name: {MODEL_NAME}
    litellm_params:
      model: openai/{MODEL_NAME}
      api_base: {API_BASE}
      api_key: fake-key
general_settings:
  master_key: fake-key
"""

with open('litellm_config.yaml', 'w') as f:
    f.write(config_content)
    
print("Config file created with:")
print(f"- Model: {MODEL_NAME}")
print(f"- API Base: {API_BASE}")
import subprocess
import time

# Kill any existing LiteLLM processes
subprocess.run("pkill -f litellm || true", shell=True)
time.sleep(1)

# Start LiteLLM
cmd = f"nohup litellm --config litellm_config.yaml --port {LITELLM_PORT} --host 0.0.0.0 > litellm.log 2>&1 &"
subprocess.run(cmd, shell=True)
time.sleep(5)

print(f"✅ LiteLLM running on port: {LITELLM_PORT}")
print(f"URL: http://localhost:{LITELLM_PORT}/v1")

# Save port for other cells
with open('.litellm_port', 'w') as f:
    f.write(str(LITELLM_PORT))

Output:

✅ LiteLLM running on port: 4000
URL: http://localhost:4000/v1
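
You can also confirm the proxy picked up the config by listing its models through the OpenAI-compatible endpoint, authenticating with the master_key from litellm_config.yaml:

# The proxy should report the "deepseek" model from litellm_config.yaml
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer fake-key"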

Install OpenAI Python Client

Now install the OpenAI client to test our setup:

pip install openai

Test the Pipeline

Let's test our complete LiteLLM + Vast.ai setup with a basic API call:

from openai import OpenAI

client = OpenAI(
    api_key="fake-key",
    base_url=f"http://localhost:{LITELLM_PORT}/v1"
)

# Basic test
try:
    response = client.chat.completions.create(
        model="deepseek",
        messages=[
            {"role": "user", "content": "Hello! How are you?"}
        ],
        max_tokens=1000
    )
    print("Response:", response.choices[0].message.content)
    
except Exception as e:
    print(f"❌ Error: {e}")

Output:

Response: 
Hello! I'm just a friendly AI here, so I don't get tired or have feelings—but I'm fully ready and happy to help you! How are you doing today? Let me know if there's anything you'd like to chat about or need assistance with.

Perfect! Now let's test the reasoning capabilities of the DeepSeek model:

try:
    response_1 = client.chat.completions.create(
        model="deepseek",
        messages=[
            {"role": "user", "content": "what is 8 * 7?, Give me your answer in step by step reasoning."}
        ],
        max_tokens=3000
    )
    print("Response:", response_1.choices[0].message.content)
    
except Exception as e:
    print(f"❌ Error: {e}")

Output:

Response: 
### Step-by-Step Reasoning for 8 * 7

Multiplication is a basic arithmetic operation that can be thought of as repeated addition. To find the product of 8 and 7, you can add the number 8 to itself 7 times (or equivalently, add the number 7 to itself 8 times). I'll use the repeated addition method with the number 8 to make it clear.

Start with the first 8:
- Add the second 8: 8 + 8 = 16
- Add the third 8: 16 + 8 = 24
- Add the fourth 8: 24 + 8 = 32
- Add the fifth 8: 32 + 8 = 40
- Add the sixth 8: 40 + 8 = 48
- Add the seventh 8: 48 + 8 = 56

As shown, after adding 8 seven times, the result is 56.

Alternatively, you can verify this with another method, such as multiplying 7 eight times or subtracting 8 from 8 * 8 (since 8 * 7 = (8 * 8) - 8 = 64 - 8 = 56). Both methods confirm the result.

**Final Answer:** 56

The pipeline is working correctly. You now have a complete LiteLLM + Vast.ai setup with local proxy control and remote GPU inference.
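
Because the vLLM deployment enabled --reasoning-parser qwen3, the model's reasoning trace is separated from the final answer on the server side. Depending on your LiteLLM and client versions, that trace may be forwarded as a reasoning_content field next to the message content; a hedged way to check is to dump the raw message and look for it (a sketch, not guaranteed for every version combination):

# Probe the raw response for the reasoning trace produced by vLLM's reasoning parser.
# Whether "reasoning_content" survives the proxy depends on your LiteLLM version.
message = response_1.choices[0].message.model_dump()
reasoning = message.get("reasoning_content") or "<not forwarded>"
print("Answer:", (message.get("content") or "")[:200])
print("Reasoning trace:", reasoning[:200])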

Note: For production deployments, consider using Docker containers for LiteLLM rather than the command-line approach shown here.
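
As an illustration, a containerized proxy might look like the following (the image tag and config mount path are assumptions based on LiteLLM's published Docker image; check the LiteLLM documentation for the current recommendation):

# Assumed image name/tag and mount path; verify against the LiteLLM docs
docker run -d \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml --port 4000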

Conclusion

This hybrid inference architecture demonstrates how modern AI deployment can balance cost, control, and performance. By combining LiteLLM's local proxy capabilities with Vast.ai's GPU marketplace, we've created a system that offers several key advantages:

Cost Considerations: Vast.ai's marketplace model and pay-as-you-go pricing can provide cost savings compared to traditional cloud providers, while running LiteLLM locally eliminates additional hosting costs for the proxy layer.

Local Control: Running LiteLLM locally provides control over request routing, logging, and configuration. The proxy server includes built-in features for cost tracking, rate limiting, and observability integrations.

API Compatibility: LiteLLM's OpenAI-compatible interface allows existing applications to integrate with minimal changes, while supporting multiple LLM providers through a unified API.

Flexible Architecture: This setup allows you to experiment with different models and providers while maintaining a consistent interface for your applications.

Whether you're building AI-powered applications, conducting research, or optimizing existing ML pipelines, this hybrid approach provides a practical path to cost-effective, controllable AI inference that doesn't sacrifice capability for economics.
