February 20, 2025 · Industry
Rerankers are powerful tools that excel at determining the relevance between pairs of text, whether you're matching search queries to documents, evaluating LLM outputs against prompts, or finding similar content in a database. Unlike simple keyword matching or embedding similarity, these specialized models perform a detailed comparison between inputs, capturing nuanced relationships that simpler methods miss. They're particularly valuable in RAG (Retrieval Augmented Generation) systems, recommendation engines, and content filtering pipelines, where they can significantly improve quality with minimal computational overhead.
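To make the reranker's role concrete, here is a minimal sketch of a two-stage pipeline: a cheap first stage produces candidates, and a pairwise scoring function reorders them. The toy_score function below is a stand-in for a real reranker model, purely for illustration:

```python
# Hypothetical sketch: where a reranker sits in a two-stage retrieval pipeline.
# A fast first stage (keyword or embedding search) returns candidates; the
# reranker then scores each (query, document) pair and reorders them.

def rerank(query, candidates, score_fn):
    """Reorder candidates by a pairwise relevance score, highest first."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

def toy_score(query, doc):
    """Toy stand-in for a reranker model: fraction of query tokens in the doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

candidates = ["deep learning uses neural networks", "the weather is nice"]
ranked = rerank("what is deep learning", candidates, toy_score)
```

In the rest of this guide, the `score_fn` role is played by a served BAAI/bge-reranker-base model rather than token overlap.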
vLLM has recently expanded its capabilities to include reranker model serving, offering compatibility with both the OpenAI and Cohere APIs. In this guide, we'll focus on deploying the BAAI/bge-reranker-base model, a powerful yet efficient reranker designed for semantic similarity scoring.
Vast.ai provides a marketplace for renting GPU compute, offering a cost-effective alternative to major cloud providers. Vast has GPU SKUs that you cannot normally find on the big clouds, including 4000-series GPUs, which are cheaper at the cost of less VRAM. Reranker models are often quite small, so we can take advantage of these extra-affordable prices.
In this guide, we will:

- Search for and rent a suitable GPU instance on Vast.ai
- Deploy the BAAI/bge-reranker-base model with vLLM's OpenAI-compatible server
- Test the deployment through both the OpenAI-style /score endpoint and the Cohere-style /rerank endpoint

This setup provides a production-ready environment for serving reranker models, with the ability to handle both batch reranking requests and individual similarity scoring tasks.
First, we'll install the Vast.ai API:
pip install --upgrade vastai
Next, we'll set our API key (found on the Account Page):
export VAST_API_KEY="your-key-here"
vastai set api-key $VAST_API_KEY
The BAAI/bge-reranker-base model has modest requirements compared to larger language models. Here's what we need:

- At least 16 GB of GPU RAM
- A single GPU with compute capability 7.5 or higher
- A static IP with at least one open direct port

Note that you could potentially go down to 8 GB of GPU RAM for this model for even more affordable usage.
We'll use the vastai search offers
command to find instances that meet our requirements:
vastai search offers "compute_cap >= 750 \
gpu_ram >= 16 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
rentable = true"
Next, we'll copy the ID from our chosen offer into the INSTANCE_ID variable below, along with our HUGGINGFACE_TOKEN, to rent an instance and deploy BAAI/bge-reranker-base using vLLM's OpenAI-compatible server:
export INSTANCE_ID=<instance-id>
vastai create instance $INSTANCE_ID \
--image vllm/vllm-openai:latest \
--env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=<HUGGINGFACE_TOKEN>' \
--disk 40 \
--args --model BAAI/bge-reranker-base
This setup:

- Uses the official vllm/vllm-openai Docker image
- Exposes port 8000 for the API server
- Allocates 40 GB of disk space
- Serves the BAAI/bge-reranker-base model

Before proceeding, verify that your instance is running correctly:
Check the instance's port mappings in the Vast.ai console (or via vastai show instances). You should see a mapping like:

XX.XX.XXX.XX:YYYY -> 8000/tcp

where YYYY is the external port that forwards to the container's port 8000. Copy the IP address and external port into VAST_IP_ADDRESS and VAST_PORT for the test request below:

export VAST_IP_ADDRESS="your-ip-here"
export VAST_PORT="your-port-here"
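If you'd rather extract these values programmatically, a small helper can split a mapping line into its IP and port (parse_port_mapping is our own hypothetical name, not part of the Vast.ai CLI):

```python
# Hypothetical helper: parse a Vast.ai port-mapping line such as
# "XX.XX.XXX.XX:YYYY -> 8000/tcp" into the (ip, port) pair used below.

def parse_port_mapping(mapping):
    external, _, _ = mapping.partition(" -> ")  # drop the "-> 8000/tcp" part
    ip, _, port = external.partition(":")       # split "ip:port"
    return ip, port

# Example with a placeholder mapping line:
ip, port = parse_port_mapping("203.0.113.7:41234 -> 8000/tcp")
```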
curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is deep learning?",
"documents": [
"Deep learning is a type of machine learning"
]
}'
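The same test can be expressed in Python instead of curl. The sketch below builds the request body (build_rerank_request is a hypothetical helper; only the payload shape comes from the curl example), with the network call left commented out since it needs a running instance:

```python
# Build the same JSON body as the curl /rerank test above.

def build_rerank_request(query, documents, model="BAAI/bge-reranker-base"):
    return {"model": model, "query": query, "documents": documents}

payload = build_rerank_request(
    "What is deep learning?",
    ["Deep learning is a type of machine learning"],
)

# To actually send it (requires the running instance from above):
# import requests
# url = "http://<VAST_IP_ADDRESS>:<VAST_PORT>/rerank"
# print(requests.post(url, json=payload).json())
```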
vLLM provides two ways to rerank documents using the same underlying model:

- OpenAI API compatible (/score)
- Cohere API compatible (/rerank)
First, we'll create a function to call the /score
endpoint. We'll need to enter the IP_ADDRESS
and PORT
from above:
import requests

IP_ADDRESS = ""
PORT = ""

def openai_score(query, documents):
    base_url = f"http://{IP_ADDRESS}:{PORT}"

    # Format request for the score endpoint
    test_request = {
        "model": "BAAI/bge-reranker-base",
        "text_1": query,      # Query is text_1
        "text_2": documents   # Documents are text_2
    }

    # Make request and print raw response
    response = requests.post(f"{base_url}/score", json=test_request)
    print("Status code:", response.status_code)
    print("Raw response:", response.text)

    # If successful, print formatted results
    if response.status_code == 200:
        data = response.json()
        scores = [(doc, item["score"]) for doc, item in zip(documents, data["data"])]
        scores.sort(key=lambda x: x[1], reverse=True)
        print("\nRanked results:")
        for text, score in scores:
            print("\nScore:", score)
            print("Text:", text)
We'll start with a simple example to show what happens when we have completely irrelevant documents:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with many layers",
"The weather is nice today",
"Deep learning enables computers to learn from large amounts of data",
"I like pizza"
]
openai_score(query,documents)
In this example, we can see how effectively the reranker distinguishes between relevant and irrelevant content:
Ranked results:
Score: 0.99951171875
Text: Deep learning is a subset of machine learning that uses neural networks with many layers
Score: 0.17626953125
Text: Deep learning enables computers to learn from large amounts of data
Score: 3.737211227416992e-05
Text: The weather is nice today
Score: 3.737211227416992e-05
Text: I like pizza
Next, we'll run it with a slightly more realistic set of documents to see how the reranker handles more nuanced differences in relevance:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input",
"Machine learning algorithms enable computers to learn from data without being explicitly programmed",
"The latest smartphone features advanced AI capabilities for photo enhancement",
"Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection",
"A neural network is inspired by the biological neural networks that constitute animal brains",
"Cloud computing provides scalable infrastructure for training deep learning models",
"The history of artificial intelligence dates back to the 1950s",
"Deep learning models require significant computational resources and large datasets for training"
]
openai_score(query,documents)
In this more nuanced example, the reranker captures subtle differences in relevance, not just the obvious distinction between relevant and irrelevant content:
Ranked results:
Score: 1.0
Text: Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input
Score: 0.425537109375
Text: Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection
Score: 0.034942626953125
Text: Machine learning algorithms enable computers to learn from data without being explicitly programmed
Score: 0.001674652099609375
Text: A neural network is inspired by the biological neural networks that constitute animal brains
Score: 0.0004711151123046875
Text: The history of artificial intelligence dates back to the 1950s
Score: 0.00026535987854003906
Text: Deep learning models require significant computational resources and large datasets for training
Score: 5.7816505432128906e-05
Text: Cloud computing provides scalable infrastructure for training deep learning models
Score: 3.737211227416992e-05
Text: The latest smartphone features advanced AI capabilities for photo enhancement
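Scores like these are typically used to cut off weak matches before passing context to an LLM. Here is a hypothetical filter over (text, score) pairs, with min_score and top_k chosen arbitrarily for illustration:

```python
def filter_by_score(ranked, min_score=0.01, top_k=3):
    """Keep at most top_k (text, score) pairs whose score clears min_score."""
    kept = [(text, score) for text, score in ranked if score >= min_score]
    return kept[:top_k]

# (text, score) pairs shaped like the output above, truncated for brevity
ranked = [
    ("Deep learning is a subset of machine learning ...", 1.0),
    ("Deep learning has revolutionized computer vision ...", 0.4255),
    ("Machine learning algorithms enable computers ...", 0.0349),
    ("A neural network is inspired by ...", 0.0017),
]
kept = filter_by_score(ranked)
```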
The /rerank endpoint provides a higher-level interface that:

- Automatically sorts results by relevance score
- Returns each document alongside its score in a structured response
- Works as a drop-in target for the Cohere client library
First, we'll install the Cohere Python client:
pip install --upgrade cohere
We'll then create a function to call the Cohere-compatible endpoint.
import cohere

IP_ADDRESS = ""
PORT = ""

def cohere_reranker(query, documents):
    base_url = f"http://{IP_ADDRESS}:{PORT}"

    # Initialize v2 client with our endpoint; the API key is unused by vLLM
    co = cohere.ClientV2("sk-fake-key", base_url=base_url)

    result = co.rerank(
        model="BAAI/bge-reranker-base",
        query=query,
        documents=documents
    )

    print("\nRanked results:")
    for doc in result.results:
        print(f"\nScore: {doc.relevance_score}")
        print(f"Text: {doc.document.text}")  # Access text through document.text
First, we'll test with our simple example:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with many layers",
"The weather is nice today",
"Deep learning enables computers to learn from large amounts of data",
"I like pizza"
]
cohere_reranker(query,documents)
Output:
Ranked results:
Score: 0.99951171875
Text: Deep learning is a subset of machine learning that uses neural networks with many layers
Score: 0.17626953125
Text: Deep learning enables computers to learn from large amounts of data
Score: 3.737211227416992e-05
Text: The weather is nice today
Score: 3.737211227416992e-05
Text: I like pizza
Notice that the Cohere endpoint produces identical scores to the OpenAI endpoint - this is because both are using the same underlying model, just with different APIs. The key difference is that the Cohere endpoint automatically handles the sorting and formatting of results.
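Since both endpoints expose the same model, a small adapter can normalize either response shape into plain (text, score) pairs. Both function names below are our own, sketched from the response formats shown in this guide:

```python
def normalize_score_response(documents, response_json):
    """Pair each input document with its /score result, sorted best-first."""
    pairs = [(doc, item["score"])
             for doc, item in zip(documents, response_json["data"])]
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs

def normalize_rerank_results(results):
    """Cohere /rerank results arrive pre-sorted; just extract the pairs."""
    return [(r.document.text, r.relevance_score) for r in results]

# Demo with a mocked /score response body:
demo = normalize_score_response(
    ["doc a", "doc b"],
    {"data": [{"score": 0.1}, {"score": 0.9}]},
)
```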
Next, we'll try our more complex example:
query = "What is Deep Learning?"
documents = [
"Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input",
"Machine learning algorithms enable computers to learn from data without being explicitly programmed",
"The latest smartphone features advanced AI capabilities for photo enhancement",
"Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection",
"A neural network is inspired by the biological neural networks that constitute animal brains",
"Cloud computing provides scalable infrastructure for training deep learning models",
"The history of artificial intelligence dates back to the 1950s",
"Deep learning models require significant computational resources and large datasets for training"
]
cohere_reranker(query,documents)
Output:
Ranked results:
Score: 1.0
Text: Deep learning is a subset of machine learning that uses neural networks with multiple layers to progressively extract higher-level features from raw input
Score: 0.425537109375
Text: Deep learning has revolutionized computer vision, enabling tasks like facial recognition and object detection
Score: 0.034942626953125
Text: Machine learning algorithms enable computers to learn from data without being explicitly programmed
Score: 0.001674652099609375
Text: A neural network is inspired by the biological neural networks that constitute animal brains
Score: 0.0004711151123046875
Text: The history of artificial intelligence dates back to the 1950s
Score: 0.00026535987854003906
Text: Deep learning models require significant computational resources and large datasets for training
Score: 5.7816505432128906e-05
Text: Cloud computing provides scalable infrastructure for training deep learning models
Score: 3.737211227416992e-05
Text: The latest smartphone features advanced AI capabilities for photo enhancement
Again we see identical scores to the OpenAI endpoint, demonstrating that:

- Both endpoints run the same underlying BAAI/bge-reranker-base model
- Only the request and response formats differ

The main advantages of the Cohere endpoint are:

- Results arrive pre-sorted by relevance score
- Documents are returned alongside their scores in a structured response
- You can reuse the existing Cohere client library

With this setup, you can build:

- RAG pipelines that rerank retrieved passages before generation
- Recommendation engines that score candidate items against a user query
- Content filtering systems that surface only the most relevant documents
The combination of vLLM's efficient serving and Vast.ai's affordable GPUs makes it practical to deploy rerankers in production. You can start small and scale up as needed, while maintaining high performance and cost efficiency. Happy ranking!