Serving Rerankers on Vast.ai using vLLM

Introduction
Rerankers are powerful tools that excel at determining the relevance between pairs of text - whether you're matching search queries to documents, evaluating LLM outputs against prompts, or finding similar content in a database. Unlike simple keyword matching or embedding similarity, these specialized models perform a detailed comparison between inputs, capturing nuanced relationships that simpler methods might miss. They're particularly valuable in RAG (Retrieval Augmented Generation) systems, recommendation engines, and content filtering pipelines, where they can significantly improve quality while requiring minimal computational overhead.
vLLM has recently expanded its capabilities to include reranker model serving, offering compatibility with both OpenAI and Cohere APIs. In this guide, we'll focus on deploying the BAAI/bge-reranker-base model - a powerful yet efficient reranker designed for semantic similarity scoring.
Vast.ai provides a marketplace for renting GPU compute, offering a cost-effective alternative to major cloud providers. Vast has GPU SKUs that you cannot normally find on other clouds, including 4000-series GPUs, which are cheaper at the cost of less VRAM. Reranker models are often quite small, so we can take advantage of these especially affordable prices.
In this guide, we will:
- Set up a Vast.ai instance with the right GPU specifications for serving rerankers
- Deploy the model using vLLM's optimized inference server
- Demonstrate two ways to interact with the reranker:
  - Using the Cohere-compatible API for batch reranking
  - Using the OpenAI-compatible API for cross-encoder scoring
This setup provides a production-ready environment for serving reranker models, with the ability to handle both batch reranking requests and individual similarity scoring tasks.
Setting Up Vast.ai
First, we'll install the Vast.ai API:
pip install --upgrade vastai
Next, we'll set our API key (found on the Account Page):
export VAST_API_KEY="your-key-here"
vastai set api-key $VAST_API_KEY
Choosing Hardware
The BAAI/bge-reranker-base model has modest requirements compared to larger language models. Here's what we need:
- 16GB of GPU RAM for:
  - Model weights (~278M parameters)
  - Batch processing overhead
  - System operations
- Single GPU with Turing architecture or newer
- Static IP for stable API endpoint
- At least one direct port for the API server
Note that you could likely drop to 8GB of GPU RAM for this model for an even more affordable setup.
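As a rough sanity check on these numbers: at fp16 precision (2 bytes per parameter), the weights alone come to roughly half a gigabyte, leaving the rest of the GPU RAM for activations, batching, and CUDA overhead. A quick back-of-the-envelope calculation:

```python
# Rough fp16 memory footprint of the BAAI/bge-reranker-base weights.
params = 278_000_000    # ~278M parameters
bytes_per_param = 2     # fp16 stores 2 bytes per parameter

weight_gb = params * bytes_per_param / 1e9
print(f"Approximate weight memory: {weight_gb:.2f} GB")  # ~0.56 GB
```

This is why even an 8GB card is plausible for this model, unlike for multi-billion-parameter LLMs.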
We'll use the vastai search offers command to find instances that meet our requirements:
vastai search offers "compute_cap >= 750 \
gpu_ram >= 16 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
rentable = true"
Deploying the Server
Next, we'll copy the ID of our chosen instance into INSTANCE_ID below, set our HUGGINGFACE_TOKEN, and rent the instance, deploying BAAI/bge-reranker-base using vLLM's OpenAI-compatible server:
export INSTANCE_ID=<instance-id>
export HUGGINGFACE_TOKEN=<your-hf-token>
vastai create instance $INSTANCE_ID \
    --image vllm/vllm-openai:latest \
    --env "-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=$HUGGINGFACE_TOKEN" \
    --disk 40 \
    --args --model BAAI/bge-reranker-base
This setup:
- Uses vLLM's optimized inference server
- Exposes port 8000 inside the container (Vast.ai will forward this to a different external port)
- Downloads and serves the BAAI/bge-reranker-base model
Verify the Setup
Before proceeding, verify that your instance is running correctly:
- Go to the Instances tab in the Vast AI Console
- Wait for the instance to download the image and model (this may take a few minutes)
- Find your instance's IP address and port from the "Open Ports" panel - you'll see something like XX.XX.XXX.XX:YYYY -> 8000/tcp, where YYYY is the external port that forwards to the container's port 8000
- Use these values in VAST_IP_ADDRESS and VAST_PORT for the test request below
export VAST_IP_ADDRESS="your-ip-here"
export VAST_PORT="your-port-here"
curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is deep learning?",
"documents": [
"Deep learning is a type of machine learning"
]
}'
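If the server is up, this call should return JSON containing a results list with an index and relevance_score per document (field names here follow the Cohere-style response shape; the exact envelope may vary slightly between vLLM versions). A small sketch of pulling sorted scores out of that shape:

```python
# Extract (index, relevance_score) pairs from a Cohere-style /rerank
# response, sorted best-first. The sample payload mimics the shape the
# server returns; the field names are assumptions based on the Cohere API.
def extract_rankings(response_json):
    results = response_json["results"]
    pairs = [(r["index"], r["relevance_score"]) for r in results]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

sample = {
    "results": [
        {"index": 0, "relevance_score": 0.1},
        {"index": 1, "relevance_score": 0.9},
    ]
}
print(extract_rankings(sample))  # [(1, 0.9), (0, 0.1)]
```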
Using the Reranker
vLLM provides two ways to rerank documents using the same underlying model:
- OpenAI API compatible (/score)
  - Raw scoring between query and documents
  - Returns similarity scores for manual sorting
  - Best for custom ranking logic
- Cohere API compatible (/rerank)
  - Direct reranking of documents against a query
  - Returns pre-sorted results
  - Best for quick integration
OpenAI-Compatible Endpoint
First, we'll create a function to call the /score endpoint. We'll need to enter the IP_ADDRESS and PORT from above:
import requests

IP_ADDRESS = ""
PORT = ""

def openai_score(query, documents):
    base_url = f"http://{IP_ADDRESS}:{PORT}"
    # Format request for the /score endpoint
    test_request = {
        "model": "BAAI/bge-reranker-base",
        "text_1": query,      # Query is text_1
        "text_2": documents   # Documents are text_2
    }
    # Make the request and print the raw response
    response = requests.post(f"{base_url}/score", json=test_request)
    print("Status code:", response.status_code)
    print("Raw response:", response.text)
    # If successful, print formatted results
    if response.status_code == 200:
        data = response.json()
        scores = [(doc, item["score"]) for doc, item in zip(documents, data["data"])]
        scores.sort(key=lambda x: x[1], reverse=True)
        print("\nRanked results:")
        for text, score in scores:
            print("\nScore:", score)
            print("Text:", text)
We'll start with a simple example to show what happens when we have completely irrelevant documents:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with many layers",
"The weather is nice today",
"Deep learning enables computers to learn from large amounts of data",
"I like pizza"
]
openai_score(query,documents)
In this example, we can see how effectively the reranker distinguishes between relevant and irrelevant content:
- The most relevant document (score ~1.0) provides a direct definition of deep learning
- The second document (score ~0.18) mentions deep learning but provides less specific information
- The irrelevant documents about weather and pizza receive nearly zero scores (~0.00004)
Ranked results:
Score: 0.99951171875
Text: Deep learning is a subset of machine learning that uses neural networks with many layers
Score: 0.17626953125
Text: Deep learning enables computers to learn from large amounts of data
Score: 3.737211227416992e-05
Text: The weather is nice today
Score: 3.737211227416992e-05
Text: I like pizza
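The near-zero scores for irrelevant documents make simple cutoff filtering practical: in a RAG pipeline, for instance, you might keep only documents above some threshold before building the prompt. A minimal sketch (the 0.01 threshold is an arbitrary illustration, not a tuned value):

```python
# Keep only (text, score) pairs at or above a relevance cutoff.
# The default threshold is an arbitrary illustration; tune it per task.
def filter_relevant(scored_docs, threshold=0.01):
    return [(text, score) for text, score in scored_docs if score >= threshold]

scored = [
    ("Deep learning is a subset of machine learning", 0.9995),
    ("The weather is nice today", 0.00004),
]
print(filter_relevant(scored))  # keeps only the deep learning document
```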
Next, we'll run it with a slightly more realistic set of documents to see how the reranker handles more nuanced differences in relevance:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input",
"Machine learning algorithms enable computers to learn from data without being explicitly programmed",
"The latest smartphone features advanced AI capabilities for photo enhancement",
"Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection",
"A neural network is inspired by the biological neural networks that constitute animal brains",
"Cloud computing provides scalable infrastructure for training deep learning models",
"The history of artificial intelligence dates back to the 1950s",
"Deep learning models require significant computational resources and large datasets for training"
]
openai_score(query,documents)
The reranker shows sophisticated understanding in this more nuanced example:
- The comprehensive definition gets a perfect score (1.0)
- Application examples (computer vision) score moderately well (~0.43)
- Related concepts (machine learning, neural networks) get low but non-zero scores (~0.03)
- Infrastructure and historical context receive nearly zero scores (<0.001)
This demonstrates how the reranker can capture subtle differences in relevance, not just obvious distinctions between relevant and irrelevant content.
Ranked results:
Score: 1.0
Text: Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input
Score: 0.425537109375
Text: Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection
Score: 0.034942626953125
Text: Machine learning algorithms enable computers to learn from data without being explicitly programmed
Score: 0.001674652099609375
Text: A neural network is inspired by the biological neural networks that constitute animal brains
Score: 0.0004711151123046875
Text: The history of artificial intelligence dates back to the 1950s
Score: 0.00026535987854003906
Text: Deep learning models require significant computational resources and large datasets for training
Score: 5.7816505432128906e-05
Text: Cloud computing provides scalable infrastructure for training deep learning models
Score: 3.737211227416992e-05
Text: The latest smartphone features advanced AI capabilities for photo enhancement
Cohere-Compatible Endpoint
The /rerank endpoint provides a higher-level interface that:
- Directly reranks documents against a query
- Returns pre-sorted results
- Handles all scoring logic internally
- Simplifies integration into existing pipelines
First, we'll install Cohere.
pip install --upgrade cohere
We'll then create a function to call the Cohere-compatible endpoint.
import cohere

IP_ADDRESS = ""
PORT = ""

def cohere_reranker(query, documents):
    base_url = f"http://{IP_ADDRESS}:{PORT}"
    # Initialize the v2 client with our endpoint
    co = cohere.ClientV2("sk-fake-key", base_url=base_url)
    result = co.rerank(
        model="BAAI/bge-reranker-base",
        query=query,
        documents=documents
    )
    print("\nRanked results:")
    for doc in result.results:
        print(f"\nScore: {doc.relevance_score}")
        print(f"Text: {doc.document.text}")  # Access text through document.text
First, we'll test with our simple example:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with many layers",
"The weather is nice today",
"Deep learning enables computers to learn from large amounts of data",
"I like pizza"
]
cohere_reranker(query,documents)
Output:
Ranked results:
Score: 0.99951171875
Text: Deep learning is a subset of machine learning that uses neural networks with many layers
Score: 0.17626953125
Text: Deep learning enables computers to learn from large amounts of data
Score: 3.737211227416992e-05
Text: The weather is nice today
Score: 3.737211227416992e-05
Text: I like pizza
Notice that the Cohere endpoint produces identical scores to the OpenAI endpoint - this is because both are using the same underlying model, just with different APIs. The key difference is that the Cohere endpoint automatically handles the sorting and formatting of results.
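If you want to confirm this parity programmatically while evaluating the two endpoints, a tolerance-based comparison of the score lists is enough. This is a hypothetical helper, not part of either API:

```python
import math

# Check that two endpoints produced the same scores, within tolerance.
# Hypothetical helper for comparing /score and /rerank outputs.
def scores_match(scores_a, scores_b, tol=1e-9):
    return len(scores_a) == len(scores_b) and all(
        math.isclose(a, b, abs_tol=tol) for a, b in zip(scores_a, scores_b)
    )

openai_scores = [0.99951171875, 0.17626953125, 3.737211227416992e-05]
cohere_scores = [0.99951171875, 0.17626953125, 3.737211227416992e-05]
print(scores_match(openai_scores, cohere_scores))  # True
```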
Next, we'll try our more complex example:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input",
"Machine learning algorithms enable computers to learn from data without being explicitly programmed",
"The latest smartphone features advanced AI capabilities for photo enhancement",
"Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection",
"A neural network is inspired by the biological neural networks that constitute animal brains",
"Cloud computing provides scalable infrastructure for training deep learning models",
"The history of artificial intelligence dates back to the 1950s",
"Deep learning models require significant computational resources and large datasets for training"
]
cohere_reranker(query,documents)
Output:
Ranked results:
Score: 1.0
Text: Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input
Score: 0.425537109375
Text: Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection
Score: 0.034942626953125
Text: Machine learning algorithms enable computers to learn from data without being explicitly programmed
Score: 0.001674652099609375
Text: A neural network is inspired by the biological neural networks that constitute animal brains
Score: 0.0004711151123046875
Text: The history of artificial intelligence dates back to the 1950s
Score: 0.00026535987854003906
Text: Deep learning models require significant computational resources and large datasets for training
Score: 5.7816505432128906e-05
Text: Cloud computing provides scalable infrastructure for training deep learning models
Score: 3.737211227416992e-05
Text: The latest smartphone features advanced AI capabilities for photo enhancement
Again we see identical scores to the OpenAI endpoint, demonstrating that:
- Both APIs provide consistent access to the same model
- The Cohere endpoint offers a more streamlined interface
- You can choose whichever API better fits your application's needs
The main advantages of the Cohere endpoint are:
- Pre-sorted results (no need to sort scores manually)
- Simpler response format
- Built-in batching support
- Familiar interface for existing Cohere users
Key Features
- Optimized Inference: vLLM's server provides efficient batch processing and automatic memory management, making it easy to serve rerankers without dealing with low-level optimizations.
- Dual API Support: Access the model through both OpenAI and Cohere-compatible APIs, allowing flexible integration with existing applications.
- Cost-Effective Deployment: Vast.ai's marketplace lets you access powerful GPUs at a fraction of traditional cloud costs, with easy scaling as your needs grow.
What's Now Possible
With this setup, you can build:
- More accurate semantic search systems
- Better RAG applications with filtered context
- Content recommendation systems
- Semantic duplicate detection
- Document clustering and organization
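As one illustration of the RAG use case above, a typical integration places the reranker as a second stage after a cheap first-pass retriever: recall a broad candidate set, then rerank and keep the top few. A minimal sketch, with a toy word-overlap retriever and scorer standing in for a real vector store and the live /score endpoint:

```python
# Two-stage retrieval sketch: broad first-pass recall, then rerank.
# first_stage and rerank_scores are stand-ins for a real vector store
# and the vLLM /score endpoint.
def first_stage(query, corpus, k=4):
    # Toy recall: keep documents sharing any word with the query.
    words = set(query.lower().split())
    hits = [d for d in corpus if words & set(d.lower().split())]
    return hits[:k]

def rerank_scores(query, docs):
    # Stand-in scorer: fraction of query words present in each document.
    words = set(query.lower().split())
    return [len(words & set(d.lower().split())) / len(words) for d in docs]

def retrieve(query, corpus, top_n=2):
    candidates = first_stage(query, corpus)
    scores = rerank_scores(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

corpus = [
    "deep learning uses neural networks",
    "pizza is tasty",
    "learning from data is machine learning",
]
print(retrieve("what is deep learning", corpus))
# → ['deep learning uses neural networks', 'learning from data is machine learning']
```

In a real pipeline you would replace first_stage with your embedding search and rerank_scores with a call to openai_score or cohere_reranker from above.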
The combination of vLLM's efficient serving and Vast.ai's affordable GPUs makes it practical to deploy rerankers in production. You can start small and scale up as needed, while maintaining high performance and cost efficiency. Happy ranking!


