In the world of AI applications, getting consistent, well-formatted responses from language models is crucial. While LLMs are powerful, their free-form outputs can be unpredictable and hard to parse programmatically, which makes it difficult to integrate LLMs into existing applications or to keep control over business logic. This is where structured outputs come in: they let us enforce specific response formats, making it easier to build reliable AI applications and to fit LLM responses into existing paradigms like Pydantic models and JSON schemas.
vLLM, using the Outlines library as its guided decoding backend, provides an elegant solution for generating structured outputs. By enforcing response schemas through Pydantic models or JSON schemas, we can ensure our LLM outputs follow exact specifications. Running this setup on Vast.ai makes it cost-effective and scalable, giving you access to powerful GPUs without the overhead of managing infrastructure.
Install the Vast SDK
%%bash
pip install --upgrade vastai
Set up Vast API Key
%%bash
# Here we will set our api key
export VAST_API_KEY= #Your key here
vastai set api-key $VAST_API_KEY
For optimal performance with vLLM, you'll need:
- A GPU with compute capability 7.5 or higher (Turing-generation cards or newer)
- At least 32 GB of GPU RAM for the 8B-parameter model used here; a single GPU is sufficient
- A static IP address with at least one direct port for serving requests
Vast.ai makes it easy to find machines meeting these requirements. Here's how to search for suitable instances:
%%bash
vastai search offers 'compute_cap >= 750 gpu_ram >= 32 num_gpus = 1 static_ip=true direct_port_count >= 1'
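The search prints a table of matching offers with their IDs and hourly prices. If you'd rather select an offer programmatically, here is a minimal sketch that shells out to the CLI. It assumes your vastai version supports the --raw flag for JSON output and that offers expose a dph_total price field, so double-check against vastai search offers --help:
import json
import subprocess

# Re-run the same search, asking the CLI for machine-readable output
# (--raw and the dph_total field are assumptions; verify with your CLI version)
result = subprocess.run(
    [
        "vastai", "search", "offers",
        "compute_cap >= 750 gpu_ram >= 32 num_gpus = 1 "
        "static_ip=true direct_port_count >= 1",
        "--raw",
    ],
    capture_output=True,
    text=True,
    check=True,
)

offers = json.loads(result.stdout)
cheapest = min(offers, key=lambda o: o["dph_total"])
print(f"Offer {cheapest['id']}: ${cheapest['dph_total']:.3f}/hr")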
We'll use vLLM's OpenAI-compatible server, which allows us to use the familiar OpenAI API format and the OpenAI SDK while leveraging vLLM's optimizations.
For this example, the meta-llama/Meta-Llama-3.1-8B-Instruct model requires you to accept Meta's terms of use on Hugging Face and to authenticate with a Hugging Face token. The setup process is otherwise straightforward:
%%bash
# Use an offer ID from the search results above in place of <instance-id>
vastai create instance <instance-id> \
--image vllm/vllm-openai:latest \
--env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=<your-token>' \
--disk 40 \
--args --model meta-llama/Meta-Llama-3.1-8B-Instruct --guided-decoding-backend outlines
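Model download and server startup can take several minutes after the instance boots. Before moving on, you can poll the server's OpenAI-compatible /v1/models route to know when it is ready. This sketch uses only the Python standard library, so it works before we install the client packages; the IP and port placeholders are the public address Vast.ai assigns your instance and the external port mapped to 8000:
import json
import time
import urllib.request

# Public IP and external port mapped to container port 8000,
# both shown in the Vast.ai console for your instance
BASE_URL = "http://<instance-ip>:<port>/v1"

while True:
    try:
        with urllib.request.urlopen(f"{BASE_URL}/models", timeout=5) as resp:
            models = json.load(resp)
        print("Server is up, serving:", [m["id"] for m in models["data"]])
        break
    except Exception:
        print("Server not ready yet, retrying in 30s...")
        time.sleep(30)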
Once the instance is running, install the client-side libraries we'll use to talk to it:
%%bash
pip install --upgrade openai
pip install --upgrade pydantic
Let's look at a practical example - extracting calendar event information from text. First, we'll set up our connection to the vLLM server running on our Vast instance. Then, using Pydantic, we can define exactly what fields we expect.
import json
from pydantic import BaseModel
from openai import OpenAI
from typing import List
# Fill in the public IP address of your instance and the external
# port that Vast.ai mapped to container port 8000
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# vLLM's server doesn't check API keys unless configured to,
# but the OpenAI client requires a non-empty value
openai_api_key = "EMPTY"
openai_api_base = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: List[str]
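Before calling the server, it's worth inspecting the JSON schema that Pydantic generates from this class, since that is exactly what gets sent to the server for guided decoding:
# Print the JSON schema that will constrain the model's output
print(json.dumps(CalendarEvent.model_json_schema(), indent=2))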
The model will now format its responses to match this structure exactly, making it easy to process the output programmatically. We pass the schema to the server through vLLM's guided_json field in extra_body:
# guided_json is a vLLM extension to the OpenAI API, passed via extra_body;
# it constrains decoding to match the CalendarEvent schema
completion = client.beta.chat.completions.parse(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "The Science Fair is on Friday. Alice and Bob are going."},
    ],
    extra_body={
        "guided_json": CalendarEvent.model_json_schema()
    },
)
event = json.loads(completion.choices[0].message.content)
print(event)
In this example output, the model successfully extracts all required information from the input text:
#output
{'name': 'Science Fair', 'date': 'Friday', 'participants': ['Alice', 'Bob']}
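Since the output is guaranteed to match the schema, you can also skip the raw dict and validate the response string directly back into the Pydantic model, which gives you attribute access and type checking for free:
# Parse the JSON string straight into the Pydantic model
parsed_event = CalendarEvent.model_validate_json(
    completion.choices[0].message.content
)
print(parsed_event.name)          # 'Science Fair'
print(parsed_event.participants)  # ['Alice', 'Bob']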
For more complex use cases, we can use JSON schemas to define response structures. This is particularly useful for customer service applications where we need consistent response formats:
def generate_structured_response(user_message):
    response_schema = {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": ["Order Issue", "Product Inquiry", "Payment Issue", "General Query"]
            },
            "response": {
                "type": "string"
            },
            "next_steps": {
                "type": "array",
                "items": {
                    "type": "string"
                }
            },
            "follow_up_required": {
                "type": "boolean"
            }
        },
        "required": ["category", "response", "next_steps", "follow_up_required"]
    }

    prompt = f"""
    Given the following user message, generate a structured response in JSON format.
    - Category: Choose from ['Order Issue', 'Product Inquiry', 'Payment Issue', 'General Query']
    - Response: Craft a helpful response
    - Next Steps: Suggest steps the user should take or the support team will take
    - Follow-up Required: Yes or No

    User Message:
    {user_message}

    Respond in JSON format.
    """

    completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful customer service bot that provides structured JSON responses."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        extra_body={
            "guided_json": response_schema
        }
    )

    # Parse the JSON response
    return json.loads(completion.choices[0].message.content)
# User message to be processed
user_message = "I received my order, but a few items were missing. How can I get the missing items?"
# Generate structured response
structured_output = generate_structured_response(user_message)
print(json.dumps(structured_output, indent=2))
The response demonstrates the structured output capabilities:
#output
{
  "category": "Order Issue",
  "response": "We apologize for the inconvenience. To resolve the issue, please contact our customer service team via email at support@email.com or phone at 1-800-SUPPORT. We will assist you in processing a replacement order for the missing items. Please have your order number ready when you reach out to us. We appreciate your patience and apologize again for the missing items.",
  "next_steps": [
    "Contact customer service team via email or phone",
    "Have order number ready for assistance"
  ],
  "follow_up_required": true
}
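Even with guided decoding enforcing the schema server-side, it's good practice to validate responses before handing them to downstream systems. Here is a minimal sketch using the jsonschema package (an extra dependency, installable with pip install jsonschema); it assumes response_schema has been hoisted out of generate_structured_response to module scope so it can be reused:
from jsonschema import ValidationError, validate

try:
    # response_schema is the same dict defined above, assumed hoisted
    # to module scope for reuse
    validate(instance=structured_output, schema=response_schema)
    print("Response matches the schema.")
except ValidationError as e:
    # Should not trigger with guided decoding, but guards against server
    # misconfiguration (e.g., guided decoding not actually enabled)
    print(f"Schema violation: {e.message}")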
Structured outputs with vLLM and Outlines provide a powerful foundation for building reliable AI applications. By running on Vast.ai, you get the benefits of cost-effective GPU access while maintaining full control over your deployment. Whether you're building a customer service bot, data extraction system, or complex AI workflow, this setup gives you the tools you need for consistent, reliable AI outputs.
Start experimenting with structured outputs and see how they can improve your AI applications' reliability and usability.