
Reranking

Retrieval (whether dense, sparse, or hybrid) optimizes for recall — finding potentially relevant documents. Reranking optimizes for precision — ensuring the top results are truly the most relevant. Reranking is a second-pass scoring step that dramatically improves the quality of the retrieved context.

Why Reranking?

Dense retrieval uses bi-encoders: the query and document are encoded separately, then compared by cosine similarity. This is fast but approximate because the query and document never "see" each other during encoding.

Reranking uses cross-encoders: the query and document are processed together, allowing the model to attend to the interaction between them. This is slower but much more accurate.

code
Bi-encoder (retrieval):              Cross-encoder (reranking):

Query → Encoder → vec_q ─┐           Query ─┐
                         ├→ Score           ├→ Model → Score
Doc   → Encoder → vec_d ─┘           Doc  ──┘

Score = cosine(vec_q, vec_d)         Score = cross_attention(query, doc)
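
The left-hand side of the diagram is what makes bi-encoders fast: document vectors can be computed once and cached, so query time costs one encode plus a cheap similarity comparison per candidate. A minimal sketch in pure Python (toy vectors and the `doc_vectors` / `cosine` names are illustrative, not from any library):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Bi-encoder style: document vectors are precomputed and cached offline,
# so each query is one encode + N cheap comparisons.
doc_vectors = {"doc-a": [0.9, 0.1, 0.0], "doc-b": [0.2, 0.8, 0.1]}
query_vector = [0.8, 0.2, 0.1]

scores = {doc_id: cosine(query_vector, vec) for doc_id, vec in doc_vectors.items()}
best = max(scores, key=scores.get)
print(best)  # doc-a: its vector points the same way as the query's
```

A cross-encoder, by contrast, cannot cache anything per document: every query-document pair must go through the model together, which is exactly why it is reserved for a small second-pass candidate set.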

Cross-Encoder Reranking with sentence-transformers

bash
uv add sentence-transformers
python
from sentence_transformers import CrossEncoder

# Load a cross-encoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Documents retrieved by first-pass search
query = "How does Docker networking work?"
retrieved_docs = [
    "Docker Compose orchestrates multiple containers for local development.",
    "Docker networking allows containers to communicate with each other and the host.",
    "The Docker daemon manages container lifecycle and image building.",
    "Bridge networks are the default networking mode in Docker.",
    "FastAPI builds REST APIs with automatic documentation.",
    "Docker overlay networks enable communication across multiple hosts.",
]

# Score each query-document pair
pairs = [[query, doc] for doc in retrieved_docs]
scores = model.predict(pairs)

# Rerank by score (highest first)
ranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)

print("Reranked results:")
for doc, score in ranked:
    print(f"  [{score:.3f}] {doc}")
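
Cross-encoder scores are mainly meaningful relative to each other within one candidate set, so comparing raw scores across queries or against a fixed threshold can mislead. If a downstream component needs scores on a fixed scale, one simple option is min-max normalization over the candidates — a sketch (the `minmax_normalize` helper is hypothetical, not part of sentence-transformers):

```python
def minmax_normalize(scores: list[float]) -> list[float]:
    """Rescale one query's candidate scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all candidates scored identically
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

print(minmax_normalize([8.2, -3.1, 4.0]))  # best candidate → 1.0, worst → 0.0
```

This keeps the ranking intact while giving later stages (filtering, UI display) a stable 0-1 scale per query.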
When to rerank

Rerank when:

  • Your first-pass retrieval returns 20-100 candidates
  • Precision at the top K (K=3-5) matters for your application
  • Your users complain about irrelevant results

Skip reranking when:

  • You retrieve only 3-5 candidates (the overhead isn't worth it)
  • Latency requirements are strict (cross-encoders add 50-200ms)
  • First-pass retrieval is already highly accurate
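
The two quantitative rules of thumb above (candidate count and latency headroom) can be encoded as a small gate. A sketch — `should_rerank` and its thresholds are illustrative values taken from the lists above, not a standard API:

```python
def should_rerank(n_candidates: int, latency_budget_ms: float) -> bool:
    """Rough gate based on the rules of thumb above."""
    if n_candidates <= 5:        # too few candidates for reordering to matter
        return False
    if latency_budget_ms < 200:  # a cross-encoder pass (~50-200 ms) may blow the budget
        return False
    return True

print(should_rerank(50, 1000))  # many candidates, generous budget → True
print(should_rerank(4, 1000))   # only a handful retrieved → False
```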

Cohere Rerank API

Cohere Rerank is a hosted reranking service that is easy to integrate:

bash
uv add cohere
python
import cohere

co = cohere.ClientV2(api_key="your-cohere-api-key")

query = "How does Docker networking work?"
documents = [
    "Docker Compose orchestrates multiple containers for local development.",
    "Docker networking allows containers to communicate with each other and the host.",
    "The Docker daemon manages container lifecycle and image building.",
    "Bridge networks are the default networking mode in Docker.",
    "Docker overlay networks enable communication across multiple hosts.",
]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=documents,
    top_n=3,
)

print("Top 3 reranked results:")
for result in response.results:
    print(f"  [score={result.relevance_score:.3f}] {documents[result.index]}")

Full RAG Pipeline with Reranking

python
from openai import OpenAI
import cohere

openai_client = OpenAI()
cohere_client = cohere.ClientV2(api_key="your-cohere-api-key")

def rag_with_reranking(query: str, collection, top_k: int = 3) -> str:
    """Full RAG pipeline: retrieve → rerank → generate."""

    # Step 1: First-pass retrieval (get more candidates than needed)
    results = collection.query(
        query_texts=[query],
        n_results=20,  # retrieve 20, rerank down to top_k
    )
    candidates = results["documents"][0]

    # Step 2: Rerank
    rerank_response = cohere_client.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_k,
    )

    # Get top reranked documents
    top_docs = [candidates[r.index] for r in rerank_response.results]

    # Step 3: Generate answer with reranked context
    context = "\n\n".join(top_docs)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based on the provided context. "
                           "If the context doesn't contain the answer, say so.",
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
    )

    return response.choices[0].message.content

# Usage
answer = rag_with_reranking("How does Docker networking work?", collection)
print(answer)

Reranker Model Comparison

Model                    Type                  Speed    Quality    Cost
ms-marco-MiniLM-L-6-v2   Local cross-encoder   Fast     Good       Free
ms-marco-electra-base    Local cross-encoder   Medium   Better     Free
Cohere rerank-v3.5       API                   Fast     Best       $0.002/query
BGE-reranker-large       Local cross-encoder   Slow     Very Good  Free
Latency budget

In a RAG pipeline, allocate your latency budget:

  • Retrieval: 50-100ms
  • Reranking: 50-200ms
  • LLM generation: 500-2000ms

Reranking typically adds 10-20% to total latency but can improve answer quality by 15-30%.
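
To see where your own pipeline actually spends that budget, a minimal timing wrapper helps; `timed` and the commented stage names below are placeholders, not part of any library:

```python
import time
from typing import Callable

def timed(label: str, fn: Callable, *args):
    """Run a pipeline stage and report its wall-clock cost in ms."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {elapsed_ms:.0f} ms")
    return result

# Usage against the budget above (stage functions are hypothetical):
# candidates = timed("retrieval", retrieve, query)
# top_docs   = timed("reranking", rerank, query, candidates)
# answer     = timed("generation", generate, query, top_docs)
```

Measuring each stage separately makes it easy to check whether reranking stays within the 50-200 ms slice of the budget.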