Efficiently Serving Multiple Machine Learning Models with Lorax on Vast.ai

June 19, 2025
6 Min Read
By Team Vast

Serving multiple machine learning models simultaneously is a key challenge in production AI systems. The increasing size of foundational models and constraints on GPU memory often mean that deploying separate models for each task results in high RAM usage, underutilized hardware, and latency problems caused by repeatedly swapping entire models in and out of GPU memory. This not only increases cost but also complicates scaling.

However, by leveraging Lorax, a framework for dynamically loading LoRA adapters, together with Vast.ai's flexible GPU marketplace, developers can efficiently serve several specialized models on a single base-model deployment. This results in significantly better hardware utilization, reduced infrastructure costs, and low-latency inference for diverse AI workloads, making it a game-changer for enterprises deploying their own AI infrastructure.

In this blog post, we'll explore how to deploy and run multiple LoRA adapters on a shared base model using Lorax on Vast.ai's cloud platform. This setup enables hosting thousands of fine-tuned models simultaneously, loading task-specific adapters on demand, and seamlessly switching between tasks such as math problem solving and customer support classification with minimal overhead.


What Are LoRA Adapters and Lorax?

  • LoRAs (Low-Rank Adaptation): Lightweight, task-specific parameter adjustments that adapt a large base model to new tasks with far fewer parameters than full fine-tuning. LoRAs can be swapped in and out without reloading the entire large model, saving significant RAM and time (a toy sketch of the idea follows this list).

  • Lorax (LoRA eXchange): An efficient serving framework that enables running thousands of LoRA adapters on a single GPU by dynamically loading these adapters into a base model at inference time. Lorax maintains high throughput and low latency while drastically reducing the deployment cost of hosting multiple fine-tuned variants.
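
To make the idea concrete, here is a minimal toy sketch in NumPy (an illustration of the math only, not Lorax's actual implementation): a frozen base weight W is shared across all tasks, and each adapter contributes a small low-rank update (B times A) that can be swapped per request.

import numpy as np

# Toy illustration of the LoRA idea (not Lorax's implementation):
# the frozen base weight W is shared, and each "adapter" is just a pair of
# small low-rank matrices (A, B) added on top of it.
d, r = 4096, 8                        # hidden size and LoRA rank (r << d)
W = np.random.randn(d, d) * 0.01      # frozen base weight, loaded once

def make_adapter():
    A = np.random.randn(r, d) * 0.01  # r x d
    B = np.zeros((d, r))              # d x r (zero-initialized, as in LoRA)
    return A, B

adapters = {"math": make_adapter(), "customer_support": make_adapter()}

def forward(x, adapter_id):
    A, B = adapters[adapter_id]       # switching tasks swaps only the small (A, B)
    return W @ x + B @ (A @ x)        # the large base weight W is never reloaded

x = np.random.randn(d)
print(forward(x, "math").shape, forward(x, "customer_support").shape)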


Deploying Multiple Models Using Lorax and Vast.ai

The following example uses the mistralai/Mistral-7B-v0.1 base model hosted on Vast.ai, along with two LoRA adapters:

  • predibase/gsm8k: Designed for solving math problems like the GSM8K benchmark.
  • predibase/customer_support: Specializes in customer service query classification.

We deploy the base model once and dynamically switch between these specialized LoRA adapters, enabling efficient multi-model serving on the same underlying GPU instance.


Step 1: Set Up Vast.ai Environment

  1. Install the Vast CLI and configure your API key:
pip install vastai==0.2.6

export VAST_API_KEY="your_vast_api_key"
vastai set api-key $VAST_API_KEY
  2. Search for a suitable GPU instance:

The mistralai/Mistral-7B-v0.1 model requires at least 16GB VRAM, but it's safer to select a larger instance (e.g., 32GB VRAM) for LoRA adapter loading.

vastai search offers "compute_cap >= 750 \
geolocation=US \
gpu_ram >= 32 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 80 \
rentable = true"
  3. Deploy the instance:

Make sure you have accepted the usage terms for Mistral-7B-v0.1 on Huggingface before proceeding, then pick an offer from the search results and use its ID:

export INSTANCE_ID=your_instance_id

vastai create instance $INSTANCE_ID --image ghcr.io/predibase/lorax:main \
--env '-p 8080:80 --shm-size 1g -e HUGGING_FACE_HUB_TOKEN="your_hf_token"' \
--disk 80 --args --model-id mistralai/Mistral-7B-v0.1

This command runs the Lorax server that hosts the base model and dynamically loads adapters on demand.

  4. Get the instance IP address and port

Now we need the instance's IP address and port so we can call our model. First, wait for the machine to download the image and the model and start serving; this takes a few minutes, and the logs will show you when it's done.
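
You can also check progress from the command line. The Vast CLI can list your running instances and print a container's logs, for example (using the ID of the newly created instance, as shown by the console or by the first command below, not the offer ID used earlier):

vastai show instances
vastai logs your_new_instance_id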

Next, go to the Instances tab in the Vast AI Console and find the instance you just created.

At the top of the instance card there is a button showing an IP address. Click it and a panel will appear listing the IP address and the forwarded ports. You should see something like:

Open Ports
XX.XX.XXX.XX:YYYY -> 8080/tcp

You will need the IP address (XX.XX.XXX.XX) and the port (YYYY) for the next step.
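
As a quick sanity check, you can hit the /generate endpoint (the same one used in the next step), replacing the placeholder address and port with your values; once the model has finished loading, the server responds with a JSON object containing a generated_text field:

curl -X POST http://XX.XX.XXX.XX:YYYY/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 10}}'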


Step 2: Query LoRA Adapters via HTTP Requests

Lorax supports on-demand loading of LoRAs from the Huggingface Hub, minimizing memory overhead versus deploying multiple large models simultaneously.

First, set your Huggingface token and your instance's IP address and port (from the previous step) as variables in Python:

HF_TOKEN = "your_hf_token"
VAST_IP_ADDRESS = "your_instance_ip"
VAST_PORT = "your_vast_port"

Example: Math Problem Solving with predibase/gsm8k

Prepare a prompt:

question = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
math_prompt = f"Please answer the following question: {question}\nAnswer"

Make a POST request to the /generate endpoint specifying the LoRA adapter ID:

import requests

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {HF_TOKEN}"
}
data = {
    "inputs": math_prompt,
    "parameters": {
        "max_new_tokens": 100,
        "adapter_id": "predibase/gsm8k",
        "adapter_source": "hub"
    }
}

url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/generate"
response = requests.post(url, headers=headers, json=data)
print("Response:", response.json()["generated_text"])

Response:

He runs 3*3=<<3*3=9>>9 sprints a week
So he runs 9*60=<<9*60=540>>540 meters a week
#### 540

Example: Customer Support Classification with predibase/customer_support

Prepare a customer support transcript prompt:

transcript = "Hi I am having trouble with my account. It says I need to reset my password to log in but I already reset my password."

customer_support_prompt = f"""Consider the case of a customer contacting the support center.
The term task type refers to the reason for why the customer contacted support.

### The possible task types are: account issue, billing issue, product issue, none of the above

Summarize the issue/question/reason that drove the customer to contact support:

Transcript: {transcript}

Task Type:
"""

Make the API call:

data = {
    "inputs": customer_support_prompt,
    "parameters": {
        "max_new_tokens": 100,
        "adapter_id": "predibase/customer_support",
        "adapter_source": "hub"
    }
}

response = requests.post(url, headers=headers, json=data)
print("Response:", response.json()["generated_text"])

Response:

account issue

Step 3: Access Models Using the OpenAI SDK

Lorax provides compatibility with the OpenAI API, allowing easy integration into existing apps with minimal code changes.

  1. Install OpenAI SDK
pip install --upgrade openai
  2. Call the math adapter:
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1",
    default_headers={"Authorization": f"Bearer {HF_TOKEN}"}
)

resp = client.completions.create(
    model="predibase/gsm8k",
    prompt=math_prompt,
    max_tokens=100,
    temperature=0.7
)

print("Response:", resp.choices[0].text)

Output:

James runs 3 x 3 = <<3*3=9>>9 sprints in a week.
He runs 9 x 60 = <<9*60=540>>540 meters in a week.
#### 540
  3. Call the customer support adapter:
resp = client.completions.create(
    model="predibase/customer_support",
    prompt=customer_support_prompt,
    max_tokens=100,
    temperature=0.7
)
print("Response:", resp.choices[0].text)

Output:

account issue

Why This Approach Matters

Serving multiple specialized AI models traditionally means deploying each fine-tuned model independently, which consumes large amounts of GPU memory and results in higher costs and slower inference when switching tasks.

By contrast, Lorax’s approach uses a single shared base model and dynamically loads lightweight LoRA adapters tailored for different tasks. This yields several benefits:

  • Reduced RAM and GPU usage: Only the small adapter layers are loaded on demand, not the entire large model.
  • Lower infrastructure cost: A single GPU instance can serve thousands of specialized models.
  • Faster context switching: Switching between tasks is near-instant as the full model is never reloaded (a small timing sketch follows this list).
  • Flexibility: New adapters can be added or updated independently.
  • Ease of integration: OpenAI-compatible APIs simplify adoption.
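
As a rough illustration of adapter switching, the sketch below (reusing the url, headers, and prompts defined in Step 2) sends back-to-back requests that alternate between the two adapters and prints the wall-clock time of each call; only the small adapter weights are fetched the first time an adapter is requested, while the base model stays resident on the GPU.

import time
import requests

for adapter_id, prompt in [
    ("predibase/gsm8k", math_prompt),
    ("predibase/customer_support", customer_support_prompt),
    ("predibase/gsm8k", math_prompt),  # switching back does not reload the base model
]:
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 20, "adapter_id": adapter_id, "adapter_source": "hub"},
    }
    start = time.time()
    requests.post(url, headers=headers, json=payload)
    print(f"{adapter_id}: {time.time() - start:.2f}s")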

Combined with Vast.ai’s cost-effective, flexible GPU rental options, this method provides a scalable, production-ready solution for deploying multi-model AI services with a minimal footprint.


Conclusion

Efficiently serving multiple machine learning models is vital for scaling AI-powered products. The combination of Lorax’s dynamic LoRA adapter serving and Vast.ai’s flexible GPU infrastructure offers a powerful, cost-effective, and scalable way to support many specialized models on a single deployment.

Whether solving math problems, classifying customer inquiries, or other tasks, this approach reduces overhead and unlocks practical multi-model serving for modern AI applications.

Try setting up your own multi-LoRA serving environment today to experience the benefits firsthand!


Happy modeling! 🚀
