
Structured Outputs with vLLM and Outlines on Vast.ai

- Team Vast

January 28, 2025 - vLLM, Vast.ai


Introduction

In the world of AI applications, getting consistent, well-formatted responses from language models is crucial. While LLMs are powerful, their free-form outputs can be unpredictable and hard to parse programmatically. This often keeps LLMs out of existing applications and limits the control developers have over their business logic. This is where structured outputs come in: they let us enforce specific response formats, making it easier to build reliable AI applications and to fit LLM responses into existing paradigms like Pydantic models and JSON schemas.

vLLM, combined with the Outlines library as a guided-decoding backend, provides an elegant solution for generating structured outputs. By enforcing response schemas through Pydantic models or JSON schemas, we can ensure our LLM outputs follow exact specifications. Running this setup on Vast.ai makes it cost-effective and scalable, giving you access to powerful GPUs without the overhead of managing infrastructure.

Setting Up Your Environment

Install Vast

Install the Vast SDK

%%bash
pip install --upgrade vastai

Set up Vast API Key

%%bash
# Set your API key (replace the placeholder with your actual key)
export VAST_API_KEY="<your-key-here>"
vastai set api-key $VAST_API_KEY

Choosing the Right Hardware

For optimal performance with vLLM, you'll need:

  • GPUs with Turing architecture or newer (compute capability ≥ 7.5)
  • At least 32GB GPU RAM for comfortable operation
  • A static IP address for stable API access
  • At least one direct port that we can forward for the API server

Vast.ai makes it easy to find machines meeting these requirements. Here's how to search for suitable instances:

%%bash
vastai search offers 'compute_cap >= 750 gpu_ram >= 32 num_gpus = 1 static_ip=true direct_port_count >= 1'

Deploying the Server

We'll use vLLM's OpenAI-compatible server, which allows us to use the familiar OpenAI API format and the OpenAI SDK while leveraging vLLM's optimizations.

For this example, the meta-llama/Meta-Llama-3.1-8B-Instruct model requires you to accept the Llama 3.1 terms of use on Hugging Face and to supply a Hugging Face access token. The setup process is otherwise straightforward:

%%bash
# <instance-id> is the ID of an offer returned by the search above
vastai create instance <instance-id> \
    --image vllm/vllm-openai:latest \
    --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=<your-token>' \
    --disk 40 \
    --args --model meta-llama/Meta-Llama-3.1-8B-Instruct --guided-decoding-backend outlines
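
The instance takes a few minutes to pull the image and download the model weights. As a quick readiness check (a minimal sketch, assuming the requests library is installed and that you substitute your instance's IP and mapped port), you can poll the server's OpenAI-compatible /v1/models endpoint until it responds:

import time
import requests  # assumed available: pip install requests

# Hypothetical placeholders: use your instance's public IP and the
# external port that Vast mapped to the container's port 8000.
VAST_IP_ADDRESS = "<your-instance-ip>"
VAST_PORT = "<your-mapped-port>"

url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1/models"
while True:
    try:
        # The OpenAI-compatible server lists its loaded models once ready.
        print(requests.get(url, timeout=5).json())
        break
    except requests.exceptions.RequestException:
        print("Server not ready yet, retrying in 30s...")
        time.sleep(30)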

Implementing Structured Outputs

Install OpenAI and Pydantic

%%bash

pip install --upgrade openai
pip install --upgrade pydantic

Calendar Event Example

Let's look at a practical example: extracting calendar event information from text. First, we set up a connection to the server running on our Vast instance. Then, using Pydantic, we define exactly which fields we expect.

import json

from pydantic import BaseModel
from openai import OpenAI
from typing import List


# Fill in your instance's public IP and the external port mapped to 8000
VAST_IP_ADDRESS = ""
VAST_PORT = ""


# The vLLM server doesn't require an API key by default, so any placeholder works
openai_api_key = "EMPTY"
openai_api_base = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: List[str]

The model will now format its responses to match this structure exactly, making the output easy to process programmatically. Here we pass the Pydantic model's JSON schema as the guided_json constraint on the request:

completion = client.beta.chat.completions.parse(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "The Science Fair is on Friday. Alice and Bob are going."},
    ],
    extra_body={
        "guided_json": CalendarEvent.model_json_schema()
    }
)

event = json.loads(completion.choices[0].message.content)

print(event)

In this example output, the model successfully extracts all required information from the input text:

  • The event name ("Science Fair")
  • The date mentioned ("Friday")
  • Both participants as an array (["Alice", "Bob"])
#output
{'name': 'Science Fair', 'date': 'Friday', 'participants': ['Alice', 'Bob']}
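
Because the schema came from a Pydantic model, you can also validate the raw response back into that model instead of loading it as a plain dict. A minimal sketch, assuming Pydantic v2 (which provides model_validate_json):

# Validate the response into the Pydantic model itself; malformed output
# raises pydantic.ValidationError instead of slipping through as a dict.
event = CalendarEvent.model_validate_json(completion.choices[0].message.content)
print(event.name, event.date, event.participants)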

Customer Service Response Schema

For more complex use cases, we can use JSON schemas to define response structures. This is particularly useful for customer service applications where we need consistent response formats:

def generate_structured_response(user_message):
    response_schema = {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": ["Order Issue", "Product Inquiry", "Payment Issue", "General Query"]
            },
            "response": {
                "type": "string"
            },
            "next_steps": {
                "type": "array",
                "items": {
                    "type": "string"
                }
            },
            "follow_up_required": {
                "type": "boolean"
            }
        },
        "required": ["category", "response", "next_steps", "follow_up_required"]
    }

    prompt = f"""
    Given the following user message, generate a structured response in JSON format.

    - Category: Choose from ['Order Issue', 'Product Inquiry', 'Payment Issue', 'General Query']
    - Response: Craft a helpful response
    - Next Steps: Suggest steps the user should take or the support team will take
    - Follow-up Required: Yes or No

    User Message:
    {user_message}

    Respond in JSON format.
    """

    completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful customer service bot that provides structured JSON responses."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        extra_body={
            "guided_json": response_schema
        }
    )

    # Parse the JSON response
    return json.loads(completion.choices[0].message.content)

# User message to be processed
user_message = "I received my order, but a few items were missing. How can I get the missing items?"

# Generate structured response
structured_output = generate_structured_response(user_message)
print(json.dumps(structured_output, indent=2))

The response demonstrates the structured output capabilities:

  • The issue is correctly categorized as an "Order Issue"
  • A professional, detailed response is generated with contact information
  • Clear next steps are provided in an array format
  • A boolean flag indicates follow-up is required
#output
{
  "category": "Order Issue",
  "response": "We apologize for the inconvenience. To resolve the issue, please contact our customer service team via email at [support@email.com](mailto:support@email.com) or phone at 1-800-SUPPORT. We will assist you in processing a replacement order for the missing items. Please have your order number ready when you reach out to us. We appreciate your patience and apologize again for the missing items.",
  "next_steps": [
    "Contact customer service team via email or phone",
    "Have order number ready for assistance"
  ],
  "follow_up_required": true
}
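
Because the output is guaranteed to be valid JSON matching the schema, downstream code can branch on its fields directly. A small illustrative sketch (the queue-naming convention here is an assumption, not part of the schema):

# Route the ticket using the schema-guaranteed fields.
if structured_output["follow_up_required"]:
    # Hypothetical convention: derive a queue name from the category.
    queue = structured_output["category"].lower().replace(" ", "-")
    print(f"Escalating to the '{queue}' queue:")
    for step in structured_output["next_steps"]:
        print(" -", step)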

Key Features

  • Structured Output Control: vLLM with Outlines enforces strict output formats through JSON schemas and Pydantic models, ensuring consistent and reliable responses from LLMs.
  • OpenAI Compatibility: Uses the familiar OpenAI API specification, so you can keep using the OpenAI SDK. This makes it easy to integrate into existing applications or to migrate from OpenAI's hosted services.

Future Possibilities

  • Leverage Outputs in Langchain: Leverage these structured responses in libraries like Langchain or others to build out complex AI systems.
  • Automated Data Processing: Create pipelines that automatically extract and structure data from unstructured text, emails, or documents using custom schemas. This setup is well suited to background batch processing with vLLM's automatic batching; see the sketch below.
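
As a minimal sketch of such a pipeline, reusing the client and CalendarEvent model defined earlier (the extract_event helper and the thread-pool fan-out are illustrative assumptions, not part of the setup above), we can send several documents concurrently and let vLLM batch the in-flight requests on the server:

from concurrent.futures import ThreadPoolExecutor

documents = [
    "The Science Fair is on Friday. Alice and Bob are going.",
    "Team offsite next Tuesday with Carol, Dan, and Eve.",
]

def extract_event(text):
    # One guided-JSON request per document, reusing the client and
    # CalendarEvent model defined earlier.
    completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Extract the event information."},
            {"role": "user", "content": text},
        ],
        extra_body={"guided_json": CalendarEvent.model_json_schema()},
    )
    return json.loads(completion.choices[0].message.content)

# Send requests concurrently; vLLM batches in-flight requests on the GPU.
with ThreadPoolExecutor(max_workers=8) as pool:
    events = list(pool.map(extract_event, documents))

print(json.dumps(events, indent=2))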

Conclusion

Structured outputs with vLLM and Outlines provide a powerful foundation for building reliable AI applications. By running on Vast.ai, you get the benefits of cost-effective GPU access while maintaining full control over your deployment. Whether you're building a customer service bot, data extraction system, or complex AI workflow, this setup gives you the tools you need for consistent, reliable AI outputs.

Start experimenting with structured outputs and see how they can improve your AI applications' reliability and usability.
