Serving multiple machine learning models simultaneously is a key challenge in production AI systems. The increasing size of foundation models and the constraints of GPU memory often mean that deploying a separate model for each task results in high memory usage, underutilized hardware, and latency problems caused by repeatedly swapping entire models in and out of GPU memory. This not only increases cost but also complicates scaling.
By leveraging Lorax, a framework for dynamically loading LoRA adapters, together with Vast.ai's flexible GPU marketplace, developers can efficiently serve several specialized models on a single base model deployment. This results in significantly better hardware utilization, reduced infrastructure costs, and low-latency inference for diverse AI workloads, making it a game-changer for enterprises deploying their own AI infrastructure.
In this blog post, we'll explore how to deploy and run multiple LoRA adapters on a shared base model using Lorax on Vast.ai's cloud platform. This setup enables hosting thousands of fine-tuned models simultaneously, loading task-specific adapters on demand, and seamlessly switching between tasks such as math problem solving and customer support classification with minimal overhead.
LoRAs (Low-Rank Adaptation): Lightweight, task-specific parameter adjustments that adapt a large base model to new tasks with far fewer parameters than full fine-tuning. LoRAs can be swapped in and out without needing to reload the entire large model, saving significant RAM and time.
Lorax (LoRA eXchange): An efficient serving framework that enables running thousands of LoRA adapters on a single GPU by dynamically loading these adapters into a base model at inference time. Lorax maintains high throughput and low latency while drastically reducing the deployment cost of hosting multiple fine-tuned variants.
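To make the parameter savings concrete, here is a minimal NumPy sketch of how a LoRA delta composes with a frozen base weight. The sizes, rank, and scaling below are illustrative only and are not tied to Mistral-7B:

import numpy as np

d_model, r = 1024, 16                      # hidden size (illustrative) and LoRA rank
alpha = 32                                 # LoRA scaling factor

W0 = np.random.randn(d_model, d_model)     # frozen base weight, shared by every task
A = np.random.randn(r, d_model) * 0.01     # task-specific LoRA factor A
B = np.zeros((d_model, r))                 # task-specific LoRA factor B (starts at zero)

# The adapted weight is W0 + (alpha / r) * B @ A.
# Only A and B need to be stored and swapped per task, so switching tasks
# means loading a few megabytes of adapter weights instead of a new base model.
W_adapted = W0 + (alpha / r) * (B @ A)

print(f"base params: {W0.size:,}  vs  adapter params: {A.size + B.size:,}")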
The following example uses the mistralai/Mistral-7B-v0.1 base model hosted on Vast.ai, along with two LoRA adapters:

predibase/gsm8k: Designed for solving math problems like the GSM8K benchmark.
predibase/customer_support: Specializes in customer service query classification.

We deploy the base model once and dynamically switch between these specialized LoRA adapters, enabling efficient multi-model serving on the same underlying GPU instance.
# Install the Vast.ai CLI
pip install vastai==0.2.6

# Configure the CLI with your Vast.ai API key
export VAST_API_KEY="your_vast_api_key"
vastai set api-key $VAST_API_KEY
The mistralai/Mistral-7B-v0.1 model requires at least 16GB of VRAM, but it's safer to select a larger instance (e.g., 32GB of VRAM) to allow for LoRA adapter loading.
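As a rough sanity check on that requirement, a 7B-parameter model in 16-bit precision needs about 14-15 GB for its weights alone, before accounting for the KV cache and adapter overhead:

# Back-of-the-envelope VRAM estimate (approximate figures, not measured values)
params = 7.3e9           # approximate parameter count of Mistral-7B-v0.1
bytes_per_param = 2      # fp16 / bf16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.1f} GB for weights alone")   # ~14.6 GB

With that in mind, search Vast.ai for a suitable offer: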
vastai search offers "compute_cap >= 750 \
geolocation=US \
gpu_ram >= 32 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 80 \
rentable = true"
Make sure you have accepted the usage terms for Mistral-7B-v0.1 on Huggingface before proceeding.
# Replace your_instance_id with the ID of the offer you selected from the search results above
export INSTANCE_ID=your_instance_id

vastai create instance $INSTANCE_ID --image ghcr.io/predibase/lorax:main \
  --env '-p 8080:80 --shm-size 1g -e HUGGING_FACE_HUB_TOKEN="your_hf_token"' \
  --disk 80 --args --model-id mistralai/Mistral-7B-v0.1
This command runs the Lorax server that hosts the base model and dynamically loads adapters on demand.
Now, we need to get our IP address and port to call our model. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done.
Next, go to the Instances tab in the Vast.ai Console and find the instance you just created.
At the top of the instance, there is a button showing an IP address. Click it, and a panel will appear listing the IP address and the forwarded ports. You should see something like:
Open Ports
XX.XX.XXX.XX:YYYY -> 8080/tcp
You will need the IP address (XX.XX.XXX.XX) and the port (YYYY) for the next step.
Lorax supports on-demand loading of LoRAs from the Huggingface Hub, minimizing memory overhead versus deploying multiple large models simultaneously.
First, configure environment variables to connect to your Vast instance and set your Huggingface token:
HF_TOKEN = "your_hf_token"
VAST_IP_ADDRESS = "your_instance_ip"
VAST_PORT = "your_vast_port"
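Before sending generation requests, you can optionally verify that the server is ready. This is a minimal sketch that assumes Lorax exposes the /health endpoint it inherits from text-generation-inference:

import requests

health_url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/health"
resp = requests.get(health_url, timeout=5)
# A 200 status code indicates the base model is loaded and the server is accepting requests.
print("Server ready:", resp.status_code == 200)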
Running the predibase/gsm8k adapter (math problem solving)
Prepare a prompt:
question = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
math_prompt = f"Please answer the following question: {question}\nAnswer"
Make a POST request to the /generate endpoint, specifying the LoRA adapter ID:
import requests
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {HF_TOKEN}"
}
data = {
"inputs": math_prompt,
"parameters": {
"max_new_tokens": 100,
"adapter_id": "predibase/gsm8k",
"adapter_source": "hub"
}
}
url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/generate"
response = requests.post(url, headers=headers, json=data)
print("Response:", response.json()["generated_text"])
Response:
He runs 3*3=<<3*3=9>>9 sprints a week
So he runs 9*60=<<9*60=540>>540 meters a week
#### 540
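GSM8K-style completions place the final answer after a #### marker, so the numeric result is easy to extract programmatically. The helper below is a hypothetical convenience for this post, not part of the Lorax API:

def extract_gsm8k_answer(generated_text: str) -> str:
    # GSM8K solutions conventionally put the final answer after "####".
    if "####" in generated_text:
        return generated_text.split("####")[-1].strip()
    return generated_text.strip()

answer = extract_gsm8k_answer(response.json()["generated_text"])
print("Final answer:", answer)  # -> 540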
Running the predibase/customer_support adapter (customer support classification)
Prepare a customer support transcript prompt:
transcript = "Hi I am having trouble with my account. It says I need to reset my password to log in but I already reset my password."
customer_support_prompt = f"""Consider the case of a customer contacting the support center.
The term task type refers to the reason for why the customer contacted support.
### The possible task types are: account issue, billing issue, product issue, none of the above
Summarize the issue/question/reason that drove the customer to contact support:
Transcript: {transcript}
Task Type:
"""
Make the API call:
data = {
"inputs": customer_support_prompt,
"parameters": {
"max_new_tokens": 100,
"adapter_id": "predibase/customer_support",
"adapter_source": "hub"
}
}
response = requests.post(url, headers=headers, json=data)
print("Response:", response.json()["generated_text"])
Response:
account issue
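Both requests hit the same deployment and differ only in adapter_id, so switching tasks is just a parameter change. A small helper (hypothetical, built on the same /generate endpoint and variables used above) makes that explicit:

def generate(prompt: str, adapter_id: str, max_new_tokens: int = 100) -> str:
    # Same base model deployment; only the LoRA adapter changes per request.
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "adapter_id": adapter_id,
            "adapter_source": "hub",
        },
    }
    resp = requests.post(url, headers=headers, json=payload)
    resp.raise_for_status()
    return resp.json()["generated_text"]

print(generate(math_prompt, "predibase/gsm8k"))
print(generate(customer_support_prompt, "predibase/customer_support"))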
Lorax provides compatibility with the OpenAI API, allowing easy integration into existing apps with minimal code changes.
pip install --upgrade openai
from openai import OpenAI
client = OpenAI(
api_key="EMPTY",
base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1",
default_headers={"Authorization": f"Bearer {HF_TOKEN}"}
)
resp = client.completions.create(
model="predibase/gsm8k",
prompt=math_prompt,
max_tokens=100,
temperature=0.7
)
print("Response:", resp.choices[0].text)
Output:
James runs 3 x 3 = <<3*3=9>>9 sprints in a week.
He runs 9 x 60 = <<9*60=540>>540 meters in a week.
#### 540
resp = client.completions.create(
model="predibase/customer_support",
prompt=customer_support_prompt,
max_tokens=100,
temperature=0.7
)
print("Response:", resp.choices[0].text)
Output:
account issue
Serving multiple specialized AI models traditionally means deploying each fine-tuned model independently, which consumes large amounts of GPU memory and leads to higher costs and slower inference when switching between tasks.
By contrast, Lorax's approach uses a single shared base model and dynamically loads lightweight LoRA adapters tailored to different tasks. This yields several benefits: only one copy of the large base model sits in GPU memory, many fine-tuned variants can be served from a single instance, switching between tasks requires loading only a small adapter rather than an entire model, and overall infrastructure cost drops accordingly.
Combined with Vast.ai’s cost-effective, flexible GPU rental options, this method provides a scalable, production-ready solution for deploying multi-model AI services with a minimal footprint.
Efficiently serving multiple machine learning models is vital for scaling AI-powered products. The combination of Lorax’s dynamic LoRA adapter serving and Vast.ai’s flexible GPU infrastructure offers a powerful, cost-effective, and scalable way to support many specialized models on a single deployment.
Whether solving math problems, classifying customer inquiries, or other tasks, this approach reduces overhead and unlocks practical multi-model serving for modern AI applications.
Try setting up your own multi-LoRA serving environment today to experience the benefits firsthand!
Happy modeling! 🚀