
Serving DeepSeek Models on Vast.ai with vLLM and Langchain!

- Team Vast

February 27, 2025 · Deepseek · Vast.ai · vLLM · Langchain

Introduction

With the release of the o1 family of models, reasoning has taken the AI world by storm. Instead of immediately answering a prompt, a reasoning model is trained to first output its "thinking" before giving its answer. This improves performance on today's most difficult tasks, and because the behavior is trained into the model, it works better than simply asking the model to record its "chain of thought" via a prompt.

DeepSeek-R1 takes this even further. We now have open-source models for this type of task, which means:

  1. We can retain the "thinking" token outputs, which we couldn't do before. This lets us create datasets for fine-tuning.
  2. We can easily fine-tune these models on our own tasks.
  3. We can run them on more affordable compute to drastically reduce costs!

This implementation combines three key components: a distilled DeepSeek model for reasoning transparency, a Vast template running vLLM's optimized inference server for efficient deployment, and Langchain for separating reasoning tokens from normal output tokens. Together, they create a production-ready system that can handle both the technical demands of serving a large language model and the practical needs of processing its unique output format.

Vast.ai provides an ideal platform for this setup, offering the necessary GPU resources at a fraction of traditional cloud costs. Its marketplace model and simple Docker integration make it particularly well-suited for deploying and scaling language models, allowing developers to focus on building applications rather than managing infrastructure. The Templates feature from Vast allows for a repeatable deployment with little additional configuration.

This guide demonstrates how to deploy deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with Vast's templates and integrate it with Langchain for advanced processing capabilities. We'll show you how to leverage vLLM's optimized inference server and create custom parsers to handle DeepSeek's distinctive output format. Feel free to follow along in the companion notebook.

Setting Up the Environment

Before we can deploy our model, we need to set up our Vast.ai environment. First, install the Vast SDK:

%%bash
pip install --upgrade vastai

Set up your Vast API key (available from your Account Page):

%%bash
# Here we will set our api key
export VAST_API_KEY="<your-key-here>"
vastai set api-key $VAST_API_KEY
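To confirm the key was registered correctly, you can query your account details (this assumes the show user subcommand is available in your CLI version):

%%bash
# Should print your account details rather than an authentication error
vastai show user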

Choosing the Right Hardware

Now we are going to search for GPUs on Vast.ai to run the DeepSeek-R1-Distill-Qwen-32B model. This model requires specific hardware capabilities to run efficiently with vLLM's optimizations. Here are our requirements:

  1. A minimum of 80GB GPU RAM to accommodate:

    • DeepSeek model weights (32B Parameters)
    • KV cache for handling extra-long output token lengths
  2. A single-GPU configuration, as DeepSeek-R1-Distill-Qwen-32B can be served efficiently on one GPU. Note: multi-GPU configurations are supported if higher throughput is needed.

  3. A static IP address for:

    • Stable API endpoint hosting
    • Consistent client connections
    • Reliable Langchain integration
  4. At least one direct port that we can forward for:

    • vLLM's OpenAI-compatible API server
    • External access to the model endpoint
    • Secure request routing
  5. At least 120GB of disk space to hold the model weights and anything else we might want to download

Here's how to search for suitable instances on Vast.ai:

%%bash
vastai search offers "compute_cap >= 750 \
gpu_ram >= 80 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 120 \
rentable = true"
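This prints a table of matching offers; note the ID of the machine you want to rent. If you'd rather pick one programmatically, the CLI's raw JSON output can be filtered with jq. A sketch, assuming the --raw flag and the dph_total (price per hour) field are available in your CLI version:

%%bash
# Pick the cheapest matching offer (requires jq)
vastai search offers "compute_cap >= 750 gpu_ram >= 80 num_gpus = 1 \
static_ip = true direct_port_count >= 1 verified = true \
disk_space >= 120 rentable = true" --raw \
  | jq 'sort_by(.dph_total) | .[0] | {id, gpu_name, dph_total}'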

Deploying the Server via Vast Template

We'll use vLLM's OpenAI-compatible server to deploy the DeepSeek model. This setup provides an OpenAI-compatible API endpoint that works seamlessly with existing tools and libraries.

We will do this with a template that:

  1. Uses vllm/vllm-openai:latest docker image. This gives us an OpenAI-compatible server.
  2. Forwards port 8000 to the outside of the container, which is the default OpenAI server port
  3. Forwards --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --max-model-len 8192 --enforce-eager on to the default entrypoint (the server itself)
  4. Uses --tensor-parallel-size 1 by default.
  5. Uses --gpu-memory-utilization 0.90 by default.
  6. Ensures that we have 120 GB of disk space.

Create the instance with the template hash:

%%bash
export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID --disk 120 --template_hash eda062b3e0c9c36f09d9d9a294405ded
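For reference, the template is roughly equivalent to launching the vLLM image yourself with the arguments listed above. A sketch, not the exact template internals (the IPC setting and cache mount are assumptions):

docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --max-model-len 8192 --enforce-eager \
  --tensor-parallel-size 1 --gpu-memory-utilization 0.90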

Verify Setup

After deployment, verify that your server is running correctly with a simple curl test:

%%bash
export VAST_IP_ADDRESS="<your-ip-here>"
export VAST_PORT="<your-port-here>"
curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
           "prompt": "Hello, how are you?",
           "max_tokens": 50
         }'

Note: You can find your instance's IP address and port in the Instances tab of the Vast AI Console.
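You can also confirm which model the server is exposing by listing its models endpoint (reusing the IP and port variables from above):

%%bash
curl http://$VAST_IP_ADDRESS:$VAST_PORT/v1/models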

Implementing Custom Output Parsing

Install Required Dependencies

%%bash
pip install --upgrade langchain langchain-openai openai

Creating a DeepSeek Output Parser

DeepSeek models output responses in a unique format with separate thinking and response sections. The thinking section, wrapped in <think> tags, contains the model's reasoning process, while everything after the closing tag represents the final response. This separation is valuable for understanding how the model reaches its conclusions.

Let's create a custom parser that can handle this format:

from typing import Optional, Tuple
from langchain.schema import BaseOutputParser

class R1OutputParser(BaseOutputParser[Tuple[Optional[str], str]]):
    """Parser for DeepSeek R1 model output that includes thinking and response sections."""

    def parse(self, text: str) -> Tuple[Optional[str], str]:
        """Parse the model output into thinking and response sections.

        Args:
            text: Raw text output from the model

        Returns:
            Tuple containing (thinking_text, response_text)
            - thinking_text will be None if no thinking section is found
        """
        if "</think>" in text:
            # Split on </think> tag
            parts = text.split("</think>")
            # Extract thinking text (remove <think> tag)
            thinking_text = parts[0].replace("<think>", "").strip()
            # Get response text
            response_text = parts[1].strip()
            return thinking_text, response_text

        # If no thinking tags found, return None for thinking and full text as response
        return None, text.strip()

    @property
    def _type(self) -> str:
        """Return type key for serialization."""
        return "r1_output_parser"

Our parser does several important things:

  1. It looks for the </think> tag to identify the boundary between thinking and response
  2. It extracts and cleans up the thinking section by removing the <think> tags
  3. It handles cases where no thinking section is present by returning None
  4. It ensures both sections are properly stripped of extra whitespace
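Here's a quick standalone check of the parser with a hypothetical model output, no server call required:

# Hypothetical output string just to exercise the parser
sample = "<think>The user wants a one-line answer, so keep it short.</think>Paris is the capital of France."
thinking, response = R1OutputParser().parse(sample)
print(thinking)   # The user wants a one-line answer, so keep it short.
print(response)   # Paris is the capital of France.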

Setting Up the Model Chain

Now that we have our parser, we need to set up a processing chain that connects our model to the parser. This chain will:

  • Send requests to our deployed model
  • Process the responses through our custom parser
  • Return the separated thinking and response sections

Here's how we set it up:

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough

# Initialize the model with your Vast instance details
VAST_IP_ADDRESS="<your-ip-here>"
VAST_PORT="<your-port-here>"

openai_api_key = "EMPTY"
openai_api_base = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"

model = ChatOpenAI(
    base_url=openai_api_base,
    api_key=openai_api_key,
    model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    max_tokens=8000,
    temperature=0.7
)

# Create prompt template
prompt = ChatPromptTemplate.from_messages([
    ("user", "{input}")
])

# Create parser
parser = R1OutputParser()

# Create chain
chain = (
    {"input": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

Let's break down what each component does:

  • The ChatOpenAI instance connects to our deployed model using the OpenAI-compatible API
  • The ChatPromptTemplate formats our input messages
  • The RunnablePassthrough ensures our inputs flow through the chain correctly
  • Finally, our custom parser processes the model's output
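The chain returns the parsed tuple in a single call. If you'd like to watch tokens as they're generated, you can stream from the model and parse once the output is complete, since the parser needs the full text to find the </think> boundary. A minimal sketch reusing the objects defined above:

# Stream tokens as they arrive, then parse the accumulated text at the end
streaming_chain = {"input": RunnablePassthrough()} | prompt | model

full_text = ""
for chunk in streaming_chain.stream("What is 17 * 24?"):
    print(chunk.content, end="", flush=True)
    full_text += chunk.content

thinking, response = parser.parse(full_text)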

Example Usage

Let's test our setup with a prompt that can demonstrate the model's reasoning capabilities:

prompt_text = "Explain quantum computing to a curious 10-year-old who loves video games."

thinking, response = chain.invoke(prompt_text)
print("\nTHINKING:\n")
print(thinking)
print("\nRESPONSE:\n")
print(response)

Output:

THINKING:

Okay, so I need to explain quantum computing to a 10-year-old who loves video games. Hmm, let's break this down. First, I should think about what a 10-year-old knows. They understand basic concepts like computers, games, maybe some video game terms like pixels or bits. They also enjoy stories or analogies that relate to their interests, so using video games as a context makes sense.

Quantum computing is a complex topic, so I need to simplify it. I remember that classical computers use bits, which are 0s and 1s. Quantum computers use qubits, which can be both 0 and 1 at the same time. That's a key difference. Maybe I can compare it to something in a video game. Maybe like a character that can be in two places at once, or a power-up that gives multiple abilities simultaneously.

I should also mention superposition and entanglement. Superposition is when a qubit can be in multiple states at once, and entanglement is when qubits are connected, so the state of one affects the others. Maybe relate that to something like a power-up that affects multiple things at once or a team of characters working together.

Another important point is that quantum computers can solve certain problems much faster than classical computers. Maybe compare it to a game where you can unlock a level or defeat a boss much quicker with a special power-up. But I should also note that not all problems are better solved by quantum computers, just specific ones.

Putting it all together, the explanation should start with what a classical computer does, then introduce qubits as special bits, explain their unique properties using video game analogies, and then talk about the advantages and limitations. I should keep the language simple, use terms they're familiar with, and make it engaging.

Let me think of an example. Maybe a game where you have to find a key in a maze. A classical computer would check each path one by one, while a quantum computer could check all paths at the same time, making it faster. Or maybe a power-up that lets you see all possible paths at once.

I also want to make it clear that it's not just faster, but it's a different way of computing. Maybe use the idea of magic or special abilities in a game that allow for unique capabilities.

I should avoid technical jargon and keep sentences short. Maybe use a narrative style, like telling a story about a game where quantum computing helps the player win. That could make it more relatable and fun.

So, to sum up, I'll start by comparing classical computers to video game rules, then introduce qubits as special bits that can do multiple things at once, explain superposition and entanglement with game-related examples, and discuss the speed advantages and specific applications. I'll wrap it up by connecting it back to video games, showing how quantum computing could make games more complex or fun in the future.

I think that's a solid approach. Now, I'll structure this into a clear, concise explanation using these points and analogies.

RESPONSE:

Quantum computing is like having a supercharged video game character! Imagine you're playing a game where you need to find a hidden treasure in a maze. A regular computer (like a normal game character) would check each path one by one, which could take a long time. But a quantum computer is like a special character that can check all paths at the same time, making it super fast to find the treasure!

In video games, you sometimes get power-ups that give you special abilities. Quantum computers use something called qubits, which are like those power-ups. Unlike regular bits in a computer that can only be 0 or 1, qubits can be both at the same time. This is called superposition. It's like having a character that can be in two places at once!

Also, qubits can work together in a magical way called entanglement. If one qubit changes, the others change too, no matter how far apart they are. It's like having a team of characters where each member's actions affect the others instantly.

Quantum computers are especially good at solving tricky puzzles or problems that would take a regular computer forever. But they're not better at everything—just certain tasks. So, in the game, your special character would shine in specific levels, making them super fun and challenging!

One day, quantum computing might help create even more exciting games, with puzzles that seem impossible to solve without those special powers. So, think of quantum computing as the ultimate power-up for solving complex problems in the blink of an eye!

This example shows how the model:

  1. First works through its reasoning in the thinking section
  2. Then provides a clear, final response
  3. Keeps these two aspects separate for better analysis and control
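Because the thinking tokens are retained (point 1 from the introduction), you can log them alongside the prompt and final response, for example to start building a fine-tuning dataset. A minimal sketch with a hypothetical helper and file name:

import json

def log_example(path, prompt_text, thinking, response):
    # Append one (prompt, thinking, response) record per line in JSONL format
    with open(path, "a") as f:
        f.write(json.dumps({
            "prompt": prompt_text,
            "thinking": thinking,
            "response": response,
        }) + "\n")

log_example("r1_traces.jsonl", prompt_text, thinking, response)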

Key Features

  • Separated Thinking and Response: DeepSeek's unique output format provides insight into the model's reasoning process
  • Custom Output Parsing: Langchain integration enables structured parsing of the model's unique output format

Conclusion

This guide has demonstrated how to deploy and serve the DeepSeek-R1-Distill-Qwen-32B model on Vast.ai. We've shown how to:

  • Set up a cost-effective GPU instance with the right specifications
  • Easily deploy the model using Vast's Templates with vLLM
  • Create a custom Langchain parser to handle DeepSeek's unique thinking/response format
  • Integrate everything through an OpenAI-compatible API

With this setup on Vast.ai's GPU marketplace, you can now build powerful AI applications with DeepSeek while minimizing both infrastructure complexity and compute costs. The platform's combination of flexible GPU options and simple Docker-based deployment makes it an ideal foundation for serving and scaling language models without the expensive cost model of traditional cloud providers.
