# Reranking
Retrieval (whether dense, sparse, or hybrid) optimizes for recall — finding potentially relevant documents. Reranking optimizes for precision — ensuring the top results are truly the most relevant. Reranking is a second-pass scoring step that dramatically improves the quality of retrieved context.
## Why Reranking?
Dense retrieval uses bi-encoders: the query and document are encoded separately, then compared by cosine similarity. This is fast but approximate because the query and document never "see" each other during encoding.
Reranking uses cross-encoders: the query and document are processed together, allowing the model to attend to the interaction between them. This is slower but much more accurate.
```
Bi-encoder (retrieval):              Cross-encoder (reranking):

Query → Encoder → Vec                Query ─┐
                                            ├→ Model → Score
Doc   → Encoder → Vec                Doc  ──┘

Score = cosine(vec_q, vec_d)         Score = cross_attention(query, doc)
```
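The bi-encoder side of this comparison can be sketched with plain NumPy: document vectors are embedded once ahead of time, and scoring a query is just one cosine similarity per document. The vectors below are made up for illustration — in practice they come from the embedding model:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings -- in practice these come from the bi-encoder
query_vec = np.array([0.9, 0.1, 0.2])
doc_vecs = {
    "networking doc": np.array([0.8, 0.2, 0.1]),
    "unrelated doc": np.array([0.1, 0.9, 0.7]),
}

# Document vectors are precomputed; scoring is one dot product per doc
scores = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)
print(best)  # the networking doc scores higher
```

This is exactly why bi-encoders are fast: document embeddings are reused across every query. A cross-encoder cannot precompute anything per document, because each score requires a forward pass over the query and document together.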
## Cross-Encoder Reranking with sentence-transformers
```bash
uv add sentence-transformers
```
```python
from sentence_transformers import CrossEncoder

# Load a cross-encoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Documents retrieved by first-pass search
query = "How does Docker networking work?"
retrieved_docs = [
    "Docker Compose orchestrates multiple containers for local development.",
    "Docker networking allows containers to communicate with each other and the host.",
    "The Docker daemon manages container lifecycle and image building.",
    "Bridge networks are the default networking mode in Docker.",
    "FastAPI builds REST APIs with automatic documentation.",
    "Docker overlay networks enable communication across multiple hosts.",
]

# Score each query-document pair
pairs = [[query, doc] for doc in retrieved_docs]
scores = model.predict(pairs)

# Rerank by score
ranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)

print("Reranked results:")
for doc, score in ranked:
    print(f"  [{score:.3f}] {doc}")
```
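One caveat: the scores the ms-marco cross-encoders return from `predict` are typically raw logits, not probabilities, so they can be negative or well above 1. If you want scores in the 0–1 range (for example, to apply a relevance threshold), one option is to map them through a sigmoid yourself — a minimal sketch:

```python
import numpy as np

def to_probabilities(logits: np.ndarray) -> np.ndarray:
    """Map unbounded cross-encoder logits to (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-logits))

# Example values, shaped like what CrossEncoder.predict returns
logits = np.array([9.2, 1.1, -4.7])
probs = to_probabilities(logits)
print(probs)
```

The sigmoid is monotonic, so the ranking order is unchanged — only the scale becomes interpretable.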
**Rerank when:**
- Your first-pass retrieval returns 20-100 candidates
- Precision at the top K (K=3-5) matters for your application
- Your users complain about irrelevant results
**Skip reranking when:**
- You retrieve only 3-5 candidates (the overhead isn't worth it)
- Latency requirements are strict (cross-encoders add 50-200ms)
- First-pass retrieval is already highly accurate
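These rules of thumb can be folded into a small gate that decides at request time whether the second pass is worth running. The function name and thresholds below are illustrative, not from any library:

```python
def should_rerank(num_candidates: int, latency_budget_ms: float,
                  rerank_cost_ms: float = 150.0) -> bool:
    """Heuristic gate: rerank only when there are enough candidates
    to reorder and the latency budget can absorb the extra pass."""
    if num_candidates <= 5:
        # Too few candidates -- the overhead isn't worth it
        return False
    if rerank_cost_ms > latency_budget_ms:
        # Budget too tight for a cross-encoder pass
        return False
    return True

print(should_rerank(num_candidates=50, latency_budget_ms=500))  # True
print(should_rerank(num_candidates=3, latency_budget_ms=500))   # False
```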
## Cohere Rerank API
Cohere Rerank is a hosted reranking service that is easy to integrate:
```bash
uv add cohere
```
```python
import cohere

co = cohere.ClientV2(api_key="your-cohere-api-key")

query = "How does Docker networking work?"
documents = [
    "Docker Compose orchestrates multiple containers for local development.",
    "Docker networking allows containers to communicate with each other and the host.",
    "The Docker daemon manages container lifecycle and image building.",
    "Bridge networks are the default networking mode in Docker.",
    "Docker overlay networks enable communication across multiple hosts.",
]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=documents,
    top_n=3,
)

print("Top 3 reranked results:")
for result in response.results:
    print(f"  [score={result.relevance_score:.3f}] {documents[result.index]}")
```
## Full RAG Pipeline with Reranking
```python
from openai import OpenAI
import cohere

openai_client = OpenAI()
cohere_client = cohere.ClientV2(api_key="your-cohere-api-key")

def rag_with_reranking(query: str, collection, top_k: int = 3) -> str:
    """Full RAG pipeline: retrieve → rerank → generate."""
    # Step 1: First-pass retrieval (get more candidates than needed)
    results = collection.query(
        query_texts=[query],
        n_results=20,  # Retrieve 20, will rerank to top 3
    )
    candidates = results["documents"][0]

    # Step 2: Rerank
    rerank_response = cohere_client.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_k,
    )

    # Get top reranked documents
    top_docs = [candidates[r.index] for r in rerank_response.results]

    # Step 3: Generate answer with reranked context
    context = "\n\n".join(top_docs)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based on the provided context. "
                           "If the context doesn't contain the answer, say so.",
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
    )
    return response.choices[0].message.content

# Usage (assumes an existing Chroma collection)
answer = rag_with_reranking("How does Docker networking work?", collection)
print(answer)
```
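In production, it is worth guarding the rerank step: a hosted API can fail or time out, and falling back to the first-pass order keeps the pipeline alive. A sketch — `rerank_with_fallback` is a hypothetical helper, and the `rerank_fn` callable stands in for the Cohere call above:

```python
from typing import Callable, Sequence

def rerank_with_fallback(
    query: str,
    candidates: Sequence[str],
    rerank_fn: Callable[[str, Sequence[str], int], list],
    top_k: int = 3,
) -> list:
    """Try the reranker; on any failure, fall back to first-pass order."""
    try:
        return rerank_fn(query, candidates, top_k)
    except Exception:
        # First-pass retrieval order is already roughly relevance-sorted
        return list(candidates[:top_k])

# Stub reranker that always fails, to exercise the fallback path
def broken_reranker(query, docs, k):
    raise TimeoutError("rerank service unavailable")

docs = ["doc-a", "doc-b", "doc-c", "doc-d"]
print(rerank_with_fallback("q", docs, broken_reranker))  # ['doc-a', 'doc-b', 'doc-c']
```

The fallback degrades quality, not availability: users get the bi-encoder ordering instead of an error.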
## Reranker Model Comparison
| Model | Type | Speed | Quality | Cost |
|---|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | Local cross-encoder | Fast | Good | Free |
| ms-marco-electra-base | Local cross-encoder | Medium | Better | Free |
| Cohere rerank-v3.5 | API | Fast | Best | $0.002/query |
| BGE-reranker-large | Local cross-encoder | Slow | Very Good | Free |
In a RAG pipeline, allocate your latency budget:
- Retrieval: 50-100ms
- Reranking: 50-200ms
- LLM generation: 500-2000ms
Reranking typically adds 10-20% to total latency but can improve answer quality by 15-30%.
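A quick sanity check on that budget, using mid-range numbers (purely illustrative):

```python
# Mid-range latency estimates from the budget above (milliseconds)
retrieval_ms = 75
rerank_ms = 125
generation_ms = 1000

total_without = retrieval_ms + generation_ms
total_with = total_without + rerank_ms
overhead_pct = 100 * rerank_ms / total_without

print(f"without reranking: {total_without} ms")   # 1075 ms
print(f"with reranking:    {total_with} ms")      # 1200 ms
print(f"overhead:          {overhead_pct:.0f}%")  # 12%
```

Because LLM generation dominates the pipeline, the reranking pass is a small relative cost for a meaningful quality gain.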