# RAG Evaluation
Building a RAG system is one thing; knowing whether it actually works is another. The RAGAS (Retrieval Augmented Generation Assessment) framework provides metrics to evaluate both the retrieval and generation components of your RAG pipeline.
## Why Evaluate RAG?
Without evaluation, you are guessing whether changes to your chunking strategy, embedding model, or prompt actually improve the system. RAG evaluation gives you:
- A baseline — How good is your current system?
- A feedback loop — Did the change help or hurt?
- Visibility — Is the problem in retrieval or generation?
## RAGAS Metrics
RAGAS evaluates four key aspects:
| Metric | What It Measures | Requires |
|---|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? | Question, Answer, Context |
| Answer Relevancy | Is the answer relevant to the question? | Question, Answer |
| Context Precision | Are the retrieved chunks relevant? | Question, Context |
| Context Recall | Did retrieval find all necessary information? | Question, Context, Ground Truth |
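Each metric needs a different subset of fields, so it is worth validating rows before running an evaluation. A minimal sketch, using the dataset column names from the table (`question`, `answer`, `contexts`, `ground_truth`) — the helper itself is my own illustration, not part of RAGAS:

```python
# Which dataset columns each metric needs, per the table above.
REQUIRED_COLUMNS = {
    "faithfulness": {"question", "answer", "contexts"},
    "answer_relevancy": {"question", "answer"},
    "context_precision": {"question", "contexts"},
    "context_recall": {"question", "contexts", "ground_truth"},
}

def missing_columns(row: dict, metrics: list[str]) -> set[str]:
    """Return the fields this row lacks for the chosen metrics."""
    needed = set().union(*(REQUIRED_COLUMNS[m] for m in metrics))
    return needed - set(row)

row = {"question": "What is Docker?", "answer": "A container platform.", "contexts": ["..."]}
print(missing_columns(row, ["faithfulness", "context_recall"]))  # {'ground_truth'}
```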
## Installing RAGAS

```bash
uv add ragas
```
## Preparing Evaluation Data
RAGAS requires a dataset of question-answer-context triples:
```python
from datasets import Dataset

# Prepare evaluation data.
# Each row needs: question, answer, contexts, ground_truth.
eval_data = {
    "question": [
        "What is Docker networking?",
        "How does FAISS work?",
        "What are the benefits of hybrid search?",
    ],
    "answer": [
        "Docker networking allows containers to communicate with each other and the host system using bridge, overlay, and host network modes.",
        "FAISS uses approximate nearest neighbor algorithms to quickly find similar vectors in large datasets.",
        "Hybrid search combines dense and sparse retrieval for better recall, improving results by 15-25% over either method alone.",
    ],
    "contexts": [
        [
            "Docker networking provides several modes: bridge (default), overlay (multi-host), host (no isolation), and macvlan (physical network).",
            "Containers on the same bridge network can communicate using container names.",
        ],
        [
            "FAISS implements IVF and HNSW indexing for fast approximate nearest neighbor search.",
            "FAISS runs in-memory and supports GPU acceleration for billion-scale datasets.",
        ],
        [
            "Dense search captures semantic similarity while sparse search catches exact keyword matches.",
            "Hybrid search using RRF fusion outperforms individual methods by 15-25% on standard benchmarks.",
        ],
    ],
    "ground_truth": [
        "Docker networking enables container communication via bridge, overlay, host, and macvlan modes.",
        "FAISS provides fast similarity search using ANN algorithms like IVF and HNSW.",
        "Hybrid search combines semantic (dense) and keyword (sparse) retrieval for better results.",
    ],
}

dataset = Dataset.from_dict(eval_data)
```
## Running Evaluation

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(results)
# Example output:
# {'faithfulness': 0.89, 'answer_relevancy': 0.92, 'context_precision': 0.85, 'context_recall': 0.78}
```
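Because each metric maps to a specific pipeline component, a score dict can be turned into a triage list. A small sketch of that idea (the 0.8 threshold and the culprit descriptions are illustrative choices of mine, not RAGAS guidance):

```python
def triage(scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Map each low metric score to the pipeline component it implicates."""
    culprits = {
        "faithfulness": "generation (answer not grounded in context)",
        "answer_relevancy": "generation (answer drifts off-question)",
        "context_precision": "retrieval (irrelevant chunks retrieved)",
        "context_recall": "retrieval (needed information missing)",
    }
    return [f"{m}: {culprits[m]}" for m, s in scores.items() if s < threshold]

scores = {"faithfulness": 0.89, "answer_relevancy": 0.92,
          "context_precision": 0.85, "context_recall": 0.78}
for issue in triage(scores):
    print(issue)  # flags context_recall as the weak spot
```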
## Understanding the Metrics

### Faithfulness

Measures whether the generated answer is supported by the retrieved context:

```python
# High faithfulness: the answer's claims are found in the context
# Question: "What is Docker?"
# Context:  "Docker is a platform for containerizing applications."
# Answer:   "Docker is a containerization platform." ✅ Faithful

# Low faithfulness: the answer makes claims not in the context
# Question: "What is Docker?"
# Context:  "Docker is a platform for containerizing applications."
# Answer:   "Docker was founded in 2013 by Solomon Hykes." ❌ Not in context
```
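For intuition, faithfulness is roughly the fraction of answer claims supported by the context. A toy sketch of that ratio — RAGAS uses an LLM to extract claims and judge support, whereas this stand-in uses naive substring matching purely to illustrate the idea:

```python
def toy_faithfulness(answer_claims: list[str], context: str) -> float:
    """Fraction of answer claims found in the retrieved context.

    Toy illustration only: real faithfulness scoring uses an LLM judge,
    not substring matching.
    """
    if not answer_claims:
        return 0.0
    supported = sum(1 for claim in answer_claims if claim.lower() in context.lower())
    return supported / len(answer_claims)

context = "Docker is a platform for containerizing applications."
claims = ["platform for containerizing applications", "founded in 2013"]
print(toy_faithfulness(claims, context))  # 0.5: one of two claims is supported
```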
### Answer Relevancy

Measures whether the answer actually addresses the question:

```python
# High relevancy: the answer addresses the question directly
# Question: "What is the default Docker network mode?"
# Answer:   "The default Docker network mode is bridge." ✅ Relevant

# Low relevancy: the answer is off-topic
# Question: "What is the default Docker network mode?"
# Answer:   "Docker is a popular tool for containerization." ❌ Irrelevant
```
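RAGAS scores relevancy by generating candidate questions from the answer and comparing embeddings; a crude lexical-overlap stand-in of my own, just to make the intuition concrete on the two examples above:

```python
def toy_relevancy(question: str, answer: str) -> float:
    """Crude overlap between question terms and answer terms.

    Toy illustration only: RAGAS uses LLM-generated questions plus
    embedding similarity, not keyword overlap.
    """
    stop = {"what", "is", "the", "a", "an", "of", "for", "how", "does"}
    q_terms = {w.strip("?.").lower() for w in question.split()} - stop
    a_terms = {w.strip("?.").lower() for w in answer.split()} - stop
    if not q_terms:
        return 0.0
    return len(q_terms & a_terms) / len(q_terms)

q = "What is the default Docker network mode?"
print(toy_relevancy(q, "The default Docker network mode is bridge."))       # 1.0
print(toy_relevancy(q, "Docker is a popular tool for containerization."))   # 0.25
```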
If faithfulness is low:
- Check whether the retrieved context actually contains the needed information (a low context recall score points to retrieval, not generation)
- Tighten your system prompt to instruct the model to answer only from the provided context
- Add an instruction like "If the context doesn't contain the answer, say 'I don't have enough information'"
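Those last two points can be combined into a grounding-focused system prompt. A sketch; the exact wording is just a starting point to tune against your own faithfulness scores:

```python
GROUNDED_SYSTEM_PROMPT = """You are a helpful assistant. Answer ONLY using the provided context.
- Do not use outside knowledge, even if you know the answer.
- If the context doesn't contain the answer, say "I don't have enough information."
- Quote or paraphrase the context; never invent details."""

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble a chat payload that keeps the model anchored to the context."""
    return [
        {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_messages("Docker is a platform for containerizing applications.",
                      "Who founded Docker?")
print(msgs[0]["content"].splitlines()[0])
```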
## A/B Testing Your RAG Pipeline

Each pipeline answers the same questions but produces its own answers and retrieved contexts, so each gets its own dataset; only the questions and ground truths stay fixed between runs.

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Evaluate baseline (recursive chunking, Chroma, no reranking)
baseline_results = evaluate(
    dataset=baseline_dataset,
    metrics=[faithfulness, answer_relevancy],
)

# Evaluate improved version (semantic chunking, hybrid search, reranking)
improved_results = evaluate(
    dataset=improved_dataset,  # same questions and ground truths for a fair comparison
    metrics=[faithfulness, answer_relevancy],
)

# Compare
print("Baseline:", baseline_results)
print("Improved:", improved_results)

metrics = ["faithfulness", "answer_relevancy"]
for metric in metrics:
    base = baseline_results[metric]
    improved = improved_results[metric]
    delta = improved - base
    emoji = "📈" if delta > 0 else "📉"
    print(f"{emoji} {metric}: {base:.3f} → {improved:.3f} ({delta:+.3f})")
```
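Aggregate deltas can hide regressions on individual questions. A sketch of a per-question comparison, assuming you have extracted each run's per-row scores into plain lists (the scores below are made-up examples):

```python
def per_question_deltas(questions: list[str],
                        baseline: list[float],
                        improved: list[float]) -> list[tuple[str, float]]:
    """Pair each question with its score change, worst regressions first."""
    deltas = [(q, new - old) for q, old, new in zip(questions, baseline, improved)]
    return sorted(deltas, key=lambda pair: pair[1])

questions = ["What is Docker networking?", "How does FAISS work?"]
baseline_scores = [0.90, 0.70]   # hypothetical per-question faithfulness, baseline run
improved_scores = [0.80, 0.95]   # same questions, improved pipeline
for q, d in per_question_deltas(questions, baseline_scores, improved_scores):
    print(f"{d:+.2f}  {q}")
```

The overall average went up here, yet the first question regressed, which is exactly the kind of trade-off a single aggregate number hides.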
## Generating Synthetic Test Data

Creating evaluation data manually is tedious. Use an LLM to generate it:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List

client = instructor.from_openai(OpenAI())

class QAPair(BaseModel):
    question: str = Field(description="A question that can be answered from the text")
    answer: str = Field(description="The ground truth answer")
    difficulty: str = Field(description="easy, medium, or hard")

class SyntheticEvalData(BaseModel):
    qa_pairs: List[QAPair]

def generate_eval_data(document: str, n_questions: int = 5) -> List[QAPair]:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SyntheticEvalData,
        messages=[{
            "role": "user",
            "content": f"Generate {n_questions} question-answer pairs from this text:\n\n{document}",
        }],
    )
    return result.qa_pairs

# Generate from your knowledge base
qa_pairs = generate_eval_data("Docker networking provides several modes...")
for qa in qa_pairs:
    print(f"Q: {qa.question}")
    print(f"A: {qa.answer}")
    print(f"Difficulty: {qa.difficulty}\n")
```
LLM-generated evaluation data may miss edge cases that real users encounter. Supplement synthetic data with:
- Real user queries from logs
- Edge cases you've observed
- Questions specifically designed to test failure modes
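One way to combine these sources is to tag each question's origin so you can slice metrics by source later. A sketch; the `source` field and dedup-by-question-text policy are my own conventions, not a RAGAS requirement:

```python
def build_eval_questions(synthetic: list[dict], real_logs: list[str]) -> list[dict]:
    """Merge synthetic QA pairs with real user queries, deduplicating by question text."""
    rows, seen = [], set()
    for qa in synthetic:
        key = qa["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            rows.append({**qa, "source": "synthetic"})
    for q in real_logs:
        key = q.strip().lower()
        if key not in seen:
            seen.add(key)
            # Real queries arrive without ground truth; label them for manual curation.
            rows.append({"question": q, "answer": None, "source": "user_log"})
    return rows

rows = build_eval_questions(
    [{"question": "What is Docker networking?", "answer": "It connects containers."}],
    ["what is docker networking?", "Why is my container losing DNS?"],
)
print(len(rows))  # 2: the duplicate user query was dropped
```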