Serving Rerankers on Vast.ai using vLLM

Introduction
Rerankers are powerful tools that excel at determining the relevance between pairs of text - whether you're matching search queries to documents, evaluating LLM outputs against prompts, or finding similar content in a database. Unlike simple keyword matching or embedding similarity, these specialized models perform a detailed comparison between inputs, capturing nuanced relationships that simpler methods might miss. They're particularly valuable in RAG (Retrieval Augmented Generation) systems, recommendation engines, and content filtering pipelines, where they can significantly improve quality while requiring minimal computational overhead.
vLLM has recently expanded its capabilities to include reranker model serving, offering compatibility with both OpenAI and Cohere APIs. In this guide, we'll focus on deploying the BAAI/bge-reranker-base model - a powerful yet efficient reranker designed for semantic similarity scoring.
Vast.ai provides a marketplace for renting GPU compute, offering a cost-effective alternative to major cloud providers. Vast has GPU SKUs that you cannot normally find on other clouds, including 4000-series GPUs, which are cheaper at the cost of less VRAM. Reranker models are often quite small, so we can take advantage of these especially affordable prices.
In this guide, we will:
- Set up a Vast.ai instance with the right GPU specifications for serving rerankers
- Deploy the model using vLLM's optimized inference server
- Demonstrate two ways to interact with the reranker:
  - Using the Cohere-compatible API for batch reranking
  - Using the OpenAI-compatible API for cross-encoder scoring
This setup provides a production-ready environment for serving reranker models, with the ability to handle both batch reranking requests and individual similarity scoring tasks.
Setting Up Vast.ai
First, we'll install the Vast.ai API:
pip install --upgrade vastai
Next, we'll set our API key (found on the Account Page):
export VAST_API_KEY="your-key-here"
vastai set api-key $VAST_API_KEY
Choosing Hardware
The BAAI/bge-reranker-base model has modest requirements compared to larger language models. Here's what we need:
- 16GB of GPU RAM for:
  - Model weights (~278M parameters)
  - Batch processing overhead
  - System operations
- Single GPU with Turing architecture or newer
- Static IP for stable API endpoint
- At least one direct port for the API server
Note that you could likely drop to 8GB of GPU RAM for this model for an even more affordable setup.
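As a rough sanity check on these numbers: at fp16 precision (2 bytes per parameter), the weights alone come to roughly half a gigabyte, leaving the rest of the GPU RAM for activations, batching, and CUDA overhead. A quick back-of-the-envelope calculation:

```python
# Rough fp16 memory footprint of the BAAI/bge-reranker-base weights.
params = 278_000_000    # ~278M parameters
bytes_per_param = 2     # fp16 stores 2 bytes per parameter

weight_gb = params * bytes_per_param / 1e9
print(f"Approximate weight memory: {weight_gb:.2f} GB")  # ~0.56 GB
```

This is why even an 8GB card is plausible for this model, unlike for multi-billion-parameter LLMs.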
We'll use the vastai search offers command to find instances that meet our requirements:
vastai search offers "compute_cap >= 750 \
gpu_ram >= 16 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
rentable = true"
Deploying the Server
Next, we'll copy the ID of our chosen instance into INSTANCE_ID below, set our HUGGINGFACE_TOKEN, and rent the instance, deploying BAAI/bge-reranker-base using vLLM's OpenAI-compatible server:
export INSTANCE_ID=<instance-id>
export HUGGINGFACE_TOKEN=<your-hf-token>
vastai create instance $INSTANCE_ID \
    --image vllm/vllm-openai:latest \
    --env "-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=$HUGGINGFACE_TOKEN" \
    --disk 40 \
    --args --model BAAI/bge-reranker-base
This setup:
- Uses vLLM's optimized inference server
- Exposes port 8000 inside the container (Vast.ai will forward this to a different external port)
- Downloads and serves the BAAI/bge-reranker-base model
Verify the Setup
Before proceeding, verify that your instance is running correctly:
- Go to the Instances tab in the Vast AI Console
- Wait for the instance to download the image and model (this may take a few minutes)
- Find your instance's IP address and port from the "Open Ports" panel - you'll see something like XX.XX.XXX.XX:YYYY -> 8000/tcp, where YYYY is the external port that forwards to the container's port 8000
- Use these values in VAST_IP_ADDRESS and VAST_PORT for the test request below
export VAST_IP_ADDRESS="your-ip-here"
export VAST_PORT="your-port-here"
curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is deep learning?",
"documents": [
"Deep learning is a type of machine learning"
]
}'
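If the server is up, this call should return JSON containing a results list with an index and relevance_score per document (field names here follow the Cohere-style response shape; the exact envelope may vary slightly between vLLM versions). A small sketch of pulling sorted scores out of that shape:

```python
# Extract (index, relevance_score) pairs from a Cohere-style /rerank
# response, sorted best-first. The sample payload mimics the shape the
# server returns; the field names are assumptions based on the Cohere API.
def extract_rankings(response_json):
    results = response_json["results"]
    pairs = [(r["index"], r["relevance_score"]) for r in results]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

sample = {
    "results": [
        {"index": 0, "relevance_score": 0.1},
        {"index": 1, "relevance_score": 0.9},
    ]
}
print(extract_rankings(sample))  # [(1, 0.9), (0, 0.1)]
```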
Using the Reranker
vLLM provides two ways to rerank documents using the same underlying model:
- OpenAI API compatible (/score)
  - Raw scoring between query and documents
  - Returns similarity scores for manual sorting
  - Best for custom ranking logic
- Cohere API compatible (/rerank)
  - Direct reranking of documents against a query
  - Returns pre-sorted results
  - Best for quick integration
OpenAI-Compatible Endpoint
First, we'll create a function to call the /score endpoint. We'll need to enter the IP_ADDRESS and PORT from above:
import requests

IP_ADDRESS = ""
PORT = ""

def openai_score(query, documents):
    base_url = f"http://{IP_ADDRESS}:{PORT}"
    # Format request for the /score endpoint
    test_request = {
        "model": "BAAI/bge-reranker-base",
        "text_1": query,      # Query is text_1
        "text_2": documents   # Documents are text_2
    }
    # Make the request and print the raw response
    response = requests.post(f"{base_url}/score", json=test_request)
    print("Status code:", response.status_code)
    print("Raw response:", response.text)
    # If successful, print formatted results
    if response.status_code == 200:
        data = response.json()
        scores = [(doc, item["score"]) for doc, item in zip(documents, data["data"])]
        scores.sort(key=lambda x: x[1], reverse=True)
        print("\nRanked results:")
        for text, score in scores:
            print("\nScore:", score)
            print("Text:", text)
We'll start with a simple example to show what happens when we have completely irrelevant documents:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with many layers",
"The weather is nice today",
"Deep learning enables computers to learn from large amounts of data",
"I like pizza"
]
openai_score(query,documents)
In this example, we can see how effectively the reranker distinguishes between relevant and irrelevant content:
- The most relevant document (score ~1.0) provides a direct definition of deep learning
- The second document (score ~0.18) mentions deep learning but provides less specific information
- The irrelevant documents about weather and pizza receive nearly zero scores (~0.00004)
Ranked results:
Score: 0.99951171875
Text: Deep learning is a subset of machine learning that uses neural networks with many layers
Score: 0.17626953125
Text: Deep learning enables computers to learn from large amounts of data
Score: 3.737211227416992e-05
Text: The weather is nice today
Score: 3.737211227416992e-05
Text: I like pizza
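The near-zero scores for irrelevant documents make simple cutoff filtering practical: in a RAG pipeline, for instance, you might keep only documents above some threshold before building the prompt. A minimal sketch (the 0.01 threshold is an arbitrary illustration, not a tuned value):

```python
# Keep only (text, score) pairs at or above a relevance cutoff.
# The default threshold is an arbitrary illustration; tune it per task.
def filter_relevant(scored_docs, threshold=0.01):
    return [(text, score) for text, score in scored_docs if score >= threshold]

scored = [
    ("Deep learning is a subset of machine learning", 0.9995),
    ("The weather is nice today", 0.00004),
]
print(filter_relevant(scored))  # keeps only the deep learning document
```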
Next, we'll run it with a slightly more realistic set of documents to see how the reranker handles more nuanced differences in relevance:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input",
"Machine learning algorithms enable computers to learn from data without being explicitly programmed",
"The latest smartphone features advanced AI capabilities for photo enhancement",
"Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection",
"A neural network is inspired by the biological neural networks that constitute animal brains",
"Cloud computing provides scalable infrastructure for training deep learning models",
"The history of artificial intelligence dates back to the 1950s",
"Deep learning models require significant computational resources and large datasets for training"
]
openai_score(query,documents)
The reranker shows sophisticated understanding in this more nuanced example:
- The comprehensive definition gets a perfect score (1.0)
- Application examples (computer vision) score moderately well (~0.43)
- Related concepts (machine learning, neural networks) get low but non-zero scores (~0.03)
- Infrastructure and historical context receive nearly zero scores (<0.001)
This demonstrates how the reranker can capture subtle differences in relevance, not just obvious distinctions between relevant and irrelevant content.
Ranked results:
Score: 1.0
Text: Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input
Score: 0.425537109375
Text: Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection
Score: 0.034942626953125
Text: Machine learning algorithms enable computers to learn from data without being explicitly programmed
Score: 0.001674652099609375
Text: A neural network is inspired by the biological neural networks that constitute animal brains
Score: 0.0004711151123046875
Text: The history of artificial intelligence dates back to the 1950s
Score: 0.00026535987854003906
Text: Deep learning models require significant computational resources and large datasets for training
Score: 5.7816505432128906e-05
Text: Cloud computing provides scalable infrastructure for training deep learning models
Score: 3.737211227416992e-05
Text: The latest smartphone features advanced AI capabilities for photo enhancement
Cohere-Compatible Endpoint
The /rerank endpoint provides a higher-level interface that:
- Directly reranks documents against a query
- Returns pre-sorted results
- Handles all scoring logic internally
- Simplifies integration into existing pipelines
First, we'll install Cohere.
pip install --upgrade cohere
We'll then create a function to call the Cohere-compatible endpoint.
import cohere

IP_ADDRESS = ""
PORT = ""

def cohere_reranker(query, documents):
    base_url = f"http://{IP_ADDRESS}:{PORT}"
    # Initialize the v2 client with our endpoint
    co = cohere.ClientV2("sk-fake-key", base_url=base_url)
    result = co.rerank(
        model="BAAI/bge-reranker-base",
        query=query,
        documents=documents
    )
    print("\nRanked results:")
    for doc in result.results:
        print(f"\nScore: {doc.relevance_score}")
        print(f"Text: {doc.document.text}")  # Access text through document.text
First, we'll test with our simple example:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with many layers",
"The weather is nice today",
"Deep learning enables computers to learn from large amounts of data",
"I like pizza"
]
cohere_reranker(query,documents)
Output:
Ranked results:
Score: 0.99951171875
Text: Deep learning is a subset of machine learning that uses neural networks with many layers
Score: 0.17626953125
Text: Deep learning enables computers to learn from large amounts of data
Score: 3.737211227416992e-05
Text: The weather is nice today
Score: 3.737211227416992e-05
Text: I like pizza
Notice that the Cohere endpoint produces identical scores to the OpenAI endpoint - this is because both are using the same underlying model, just with different APIs. The key difference is that the Cohere endpoint automatically handles the sorting and formatting of results.
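If you want to confirm this parity programmatically while evaluating the two endpoints, a tolerance-based comparison of the score lists is enough. This is a hypothetical helper, not part of either API:

```python
import math

# Check that two endpoints produced the same scores, within tolerance.
# Hypothetical helper for comparing /score and /rerank outputs.
def scores_match(scores_a, scores_b, tol=1e-9):
    return len(scores_a) == len(scores_b) and all(
        math.isclose(a, b, abs_tol=tol) for a, b in zip(scores_a, scores_b)
    )

openai_scores = [0.99951171875, 0.17626953125, 3.737211227416992e-05]
cohere_scores = [0.99951171875, 0.17626953125, 3.737211227416992e-05]
print(scores_match(openai_scores, cohere_scores))  # True
```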
Next, we'll try our more complex example:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input",
"Machine learning algorithms enable computers to learn from data without being explicitly programmed",
"The latest smartphone features advanced AI capabilities for photo enhancement",
"Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection",
"A neural network is inspired by the biological neural networks that constitute animal brains",
"Cloud computing provides scalable infrastructure for training deep learning models",
"The history of artificial intelligence dates back to the 1950s",
"Deep learning models require significant computational resources and large datasets for training"
]
cohere_reranker(query,documents)
Output:
Ranked results:
Score: 1.0
Text: Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input
Score: 0.425537109375
Text: Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection
Score: 0.034942626953125
Text: Machine learning algorithms enable computers to learn from data without being explicitly programmed
Score: 0.001674652099609375
Text: A neural network is inspired by the biological neural networks that constitute animal brains
Score: 0.0004711151123046875
Text: The history of artificial intelligence dates back to the 1950s
Score: 0.00026535987854003906
Text: Deep learning models require significant computational resources and large datasets for training
Score: 5.7816505432128906e-05
Text: Cloud computing provides scalable infrastructure for training deep learning models
Score: 3.737211227416992e-05
Text: The latest smartphone features advanced AI capabilities for photo enhancement
Again we see identical scores to the OpenAI endpoint, demonstrating that:
- Both APIs provide consistent access to the same model
- The Cohere endpoint offers a more streamlined interface
- You can choose whichever API better fits your application's needs
The main advantages of the Cohere endpoint are:
- Pre-sorted results (no need to sort scores manually)
- Simpler response format
- Built-in batching support
- Familiar interface for existing Cohere users
Key Features
- Optimized Inference: vLLM's server provides efficient batch processing and automatic memory management, making it easy to serve rerankers without dealing with low-level optimizations.
- Dual API Support: Access the model through both OpenAI and Cohere-compatible APIs, allowing flexible integration with existing applications.
- Cost-Effective Deployment: Vast.ai's marketplace lets you access powerful GPUs at a fraction of traditional cloud costs, with easy scaling as your needs grow.
What's Now Possible
With this setup, you can build:
- More accurate semantic search systems
- Better RAG applications with filtered context
- Content recommendation systems
- Semantic duplicate detection
- Document clustering and organization
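As one illustration of the RAG use case above, a typical integration places the reranker as a second stage after a cheap first-pass retriever: recall a broad candidate set, then rerank and keep the top few. A minimal sketch, with a toy word-overlap retriever and scorer standing in for a real vector store and the live /score endpoint:

```python
# Two-stage retrieval sketch: broad first-pass recall, then rerank.
# first_stage and rerank_scores are stand-ins for a real vector store
# and the vLLM /score endpoint.
def first_stage(query, corpus, k=4):
    # Toy recall: keep documents sharing any word with the query.
    words = set(query.lower().split())
    hits = [d for d in corpus if words & set(d.lower().split())]
    return hits[:k]

def rerank_scores(query, docs):
    # Stand-in scorer: fraction of query words present in each document.
    words = set(query.lower().split())
    return [len(words & set(d.lower().split())) / len(words) for d in docs]

def retrieve(query, corpus, top_n=2):
    candidates = first_stage(query, corpus)
    scores = rerank_scores(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

corpus = [
    "deep learning uses neural networks",
    "pizza is tasty",
    "learning from data is machine learning",
]
print(retrieve("what is deep learning", corpus))
# → ['deep learning uses neural networks', 'learning from data is machine learning']
```

In a real pipeline you would replace first_stage with your embedding search and rerank_scores with a call to openai_score or cohere_reranker from above.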
The combination of vLLM's efficient serving and Vast.ai's affordable GPUs makes it practical to deploy rerankers in production. You can start small and scale up as needed, while maintaining high performance and cost efficiency. Happy ranking!


