# RAG Evaluation
Building a RAG system is one thing; knowing whether it actually works is another. The RAGAS (Retrieval Augmented Generation Assessment) framework provides metrics to evaluate both the retrieval and generation components of your RAG pipeline.
## Why Evaluate RAG?
Without evaluation, you are guessing whether changes to your chunking strategy, embedding model, or prompt actually improve the system. RAG evaluation gives you:
- A baseline — How good is your current system?
- A feedback loop — Did the change help or hurt?
- Visibility — Is the problem in retrieval or generation?
## RAGAS Metrics
RAGAS evaluates four key aspects:
| Metric | What It Measures | Requires |
|---|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? | Question, Answer, Context |
| Answer Relevancy | Is the answer relevant to the question? | Question, Answer |
| Context Precision | Are the retrieved chunks relevant? | Question, Context |
| Context Recall | Did retrieval find all necessary information? | Question, Context, Ground Truth |
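Each metric needs a different subset of fields, so it is worth validating rows before running an evaluation. A minimal sketch, using the dataset column names from the table (`question`, `answer`, `contexts`, `ground_truth`) — the helper itself is my own illustration, not part of RAGAS:

```python
# Which dataset columns each metric needs, per the table above.
REQUIRED_COLUMNS = {
    "faithfulness": {"question", "answer", "contexts"},
    "answer_relevancy": {"question", "answer"},
    "context_precision": {"question", "contexts"},
    "context_recall": {"question", "contexts", "ground_truth"},
}

def missing_columns(row: dict, metrics: list[str]) -> set[str]:
    """Return the fields this row lacks for the chosen metrics."""
    needed = set().union(*(REQUIRED_COLUMNS[m] for m in metrics))
    return needed - set(row)

row = {"question": "What is Docker?", "answer": "A container platform.", "contexts": ["..."]}
print(missing_columns(row, ["faithfulness", "context_recall"]))  # {'ground_truth'}
```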
## Installing RAGAS

```bash
uv add ragas
```
## Preparing Evaluation Data
RAGAS requires a dataset of question-answer-context triples:
```python
from datasets import Dataset

# Prepare evaluation data.
# Each row needs: question, answer, contexts, ground_truth.
eval_data = {
    "question": [
        "What is Docker networking?",
        "How does FAISS work?",
        "What are the benefits of hybrid search?",
    ],
    "answer": [
        "Docker networking allows containers to communicate with each other and the host system using bridge, overlay, and host network modes.",
        "FAISS uses approximate nearest neighbor algorithms to quickly find similar vectors in large datasets.",
        "Hybrid search combines dense and sparse retrieval for better recall, improving results by 15-25% over either method alone.",
    ],
    "contexts": [
        [
            "Docker networking provides several modes: bridge (default), overlay (multi-host), host (no isolation), and macvlan (physical network).",
            "Containers on the same bridge network can communicate using container names.",
        ],
        [
            "FAISS implements IVF and HNSW indexing for fast approximate nearest neighbor search.",
            "FAISS runs in-memory and supports GPU acceleration for billion-scale datasets.",
        ],
        [
            "Dense search captures semantic similarity while sparse search catches exact keyword matches.",
            "Hybrid search using RRF fusion outperforms individual methods by 15-25% on standard benchmarks.",
        ],
    ],
    "ground_truth": [
        "Docker networking enables container communication via bridge, overlay, host, and macvlan modes.",
        "FAISS provides fast similarity search using ANN algorithms like IVF and HNSW.",
        "Hybrid search combines semantic (dense) and keyword (sparse) retrieval for better results.",
    ],
}

dataset = Dataset.from_dict(eval_data)
```
## Running Evaluation

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(results)
# Example output:
# {'faithfulness': 0.89, 'answer_relevancy': 0.92, 'context_precision': 0.85, 'context_recall': 0.78}
```
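Because each metric maps to a specific pipeline component, a score dict can be turned into a triage list. A small sketch of that idea (the 0.8 threshold and the culprit descriptions are illustrative choices of mine, not RAGAS guidance):

```python
def triage(scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Map each low metric score to the pipeline component it implicates."""
    culprits = {
        "faithfulness": "generation (answer not grounded in context)",
        "answer_relevancy": "generation (answer drifts off-question)",
        "context_precision": "retrieval (irrelevant chunks retrieved)",
        "context_recall": "retrieval (needed information missing)",
    }
    return [f"{m}: {culprits[m]}" for m, s in scores.items() if s < threshold]

scores = {"faithfulness": 0.89, "answer_relevancy": 0.92,
          "context_precision": 0.85, "context_recall": 0.78}
for issue in triage(scores):
    print(issue)  # flags context_recall as the weak spot
```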
## Understanding the Metrics

### Faithfulness

Measures whether the generated answer is supported by the retrieved context:

```python
# High faithfulness: the answer's claims are found in the context
# Question: "What is Docker?"
# Context:  "Docker is a platform for containerizing applications."
# Answer:   "Docker is a containerization platform." ✅ Faithful

# Low faithfulness: the answer makes claims not in the context
# Question: "What is Docker?"
# Context:  "Docker is a platform for containerizing applications."
# Answer:   "Docker was founded in 2013 by Solomon Hykes." ❌ Not in context
```
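For intuition, faithfulness is roughly the fraction of answer claims supported by the context. A toy sketch of that ratio — RAGAS uses an LLM to extract claims and judge support, whereas this stand-in uses naive substring matching purely to illustrate the idea:

```python
def toy_faithfulness(answer_claims: list[str], context: str) -> float:
    """Fraction of answer claims found in the retrieved context.

    Toy illustration only: real faithfulness scoring uses an LLM judge,
    not substring matching.
    """
    if not answer_claims:
        return 0.0
    supported = sum(1 for claim in answer_claims if claim.lower() in context.lower())
    return supported / len(answer_claims)

context = "Docker is a platform for containerizing applications."
claims = ["platform for containerizing applications", "founded in 2013"]
print(toy_faithfulness(claims, context))  # 0.5: one of two claims is supported
```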
### Answer Relevancy

Measures whether the answer actually addresses the question:

```python
# High relevancy: the answer addresses the question directly
# Question: "What is the default Docker network mode?"
# Answer:   "The default Docker network mode is bridge." ✅ Relevant

# Low relevancy: the answer is off-topic
# Question: "What is the default Docker network mode?"
# Answer:   "Docker is a popular tool for containerization." ❌ Irrelevant
```
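RAGAS scores relevancy by generating candidate questions from the answer and comparing embeddings; a crude lexical-overlap stand-in of my own, just to make the intuition concrete on the two examples above:

```python
def toy_relevancy(question: str, answer: str) -> float:
    """Crude overlap between question terms and answer terms.

    Toy illustration only: RAGAS uses LLM-generated questions plus
    embedding similarity, not keyword overlap.
    """
    stop = {"what", "is", "the", "a", "an", "of", "for", "how", "does"}
    q_terms = {w.strip("?.").lower() for w in question.split()} - stop
    a_terms = {w.strip("?.").lower() for w in answer.split()} - stop
    if not q_terms:
        return 0.0
    return len(q_terms & a_terms) / len(q_terms)

q = "What is the default Docker network mode?"
print(toy_relevancy(q, "The default Docker network mode is bridge."))       # 1.0
print(toy_relevancy(q, "Docker is a popular tool for containerization."))   # 0.25
```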
If faithfulness is low:
- Check whether the retrieved context actually contains the needed information (a low context recall score points to retrieval, not generation)
- Tighten your system prompt to instruct the model to answer only from the provided context
- Add an instruction like "If the context doesn't contain the answer, say 'I don't have enough information'"
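Those last two points can be combined into a grounding-focused system prompt. A sketch; the exact wording is just a starting point to tune against your own faithfulness scores:

```python
GROUNDED_SYSTEM_PROMPT = """You are a helpful assistant. Answer ONLY using the provided context.
- Do not use outside knowledge, even if you know the answer.
- If the context doesn't contain the answer, say "I don't have enough information."
- Quote or paraphrase the context; never invent details."""

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble a chat payload that keeps the model anchored to the context."""
    return [
        {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_messages("Docker is a platform for containerizing applications.",
                      "Who founded Docker?")
print(msgs[0]["content"].splitlines()[0])
```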
## A/B Testing Your RAG Pipeline

Each pipeline answers the same questions but produces its own answers and retrieved contexts, so each gets its own dataset; only the questions and ground truths stay fixed between runs.

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Evaluate baseline (recursive chunking, Chroma, no reranking)
baseline_results = evaluate(
    dataset=baseline_dataset,
    metrics=[faithfulness, answer_relevancy],
)

# Evaluate improved version (semantic chunking, hybrid search, reranking)
improved_results = evaluate(
    dataset=improved_dataset,  # same questions and ground truths for a fair comparison
    metrics=[faithfulness, answer_relevancy],
)

# Compare
print("Baseline:", baseline_results)
print("Improved:", improved_results)

metrics = ["faithfulness", "answer_relevancy"]
for metric in metrics:
    base = baseline_results[metric]
    improved = improved_results[metric]
    delta = improved - base
    emoji = "📈" if delta > 0 else "📉"
    print(f"{emoji} {metric}: {base:.3f} → {improved:.3f} ({delta:+.3f})")
```
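Aggregate deltas can hide regressions on individual questions. A sketch of a per-question comparison, assuming you have extracted each run's per-row scores into plain lists (the scores below are made-up examples):

```python
def per_question_deltas(questions: list[str],
                        baseline: list[float],
                        improved: list[float]) -> list[tuple[str, float]]:
    """Pair each question with its score change, worst regressions first."""
    deltas = [(q, new - old) for q, old, new in zip(questions, baseline, improved)]
    return sorted(deltas, key=lambda pair: pair[1])

questions = ["What is Docker networking?", "How does FAISS work?"]
baseline_scores = [0.90, 0.70]   # hypothetical per-question faithfulness, baseline run
improved_scores = [0.80, 0.95]   # same questions, improved pipeline
for q, d in per_question_deltas(questions, baseline_scores, improved_scores):
    print(f"{d:+.2f}  {q}")
```

The overall average went up here, yet the first question regressed, which is exactly the kind of trade-off a single aggregate number hides.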
## Generating Synthetic Test Data

Creating evaluation data manually is tedious. Use an LLM to generate it:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List

client = instructor.from_openai(OpenAI())

class QAPair(BaseModel):
    question: str = Field(description="A question that can be answered from the text")
    answer: str = Field(description="The ground truth answer")
    difficulty: str = Field(description="easy, medium, or hard")

class SyntheticEvalData(BaseModel):
    qa_pairs: List[QAPair]

def generate_eval_data(document: str, n_questions: int = 5) -> List[QAPair]:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SyntheticEvalData,
        messages=[{
            "role": "user",
            "content": f"Generate {n_questions} question-answer pairs from this text:\n\n{document}",
        }],
    )
    return result.qa_pairs

# Generate from your knowledge base
qa_pairs = generate_eval_data("Docker networking provides several modes...")
for qa in qa_pairs:
    print(f"Q: {qa.question}")
    print(f"A: {qa.answer}")
    print(f"Difficulty: {qa.difficulty}\n")
```
LLM-generated evaluation data may miss edge cases that real users encounter. Supplement synthetic data with:
- Real user queries from logs
- Edge cases you've observed
- Questions specifically designed to test failure modes
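One way to combine these sources is to tag each question's origin so you can slice metrics by source later. A sketch; the `source` field and dedup-by-question-text policy are my own conventions, not a RAGAS requirement:

```python
def build_eval_questions(synthetic: list[dict], real_logs: list[str]) -> list[dict]:
    """Merge synthetic QA pairs with real user queries, deduplicating by question text."""
    rows, seen = [], set()
    for qa in synthetic:
        key = qa["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            rows.append({**qa, "source": "synthetic"})
    for q in real_logs:
        key = q.strip().lower()
        if key not in seen:
            seen.add(key)
            # Real queries arrive without ground truth; label them for manual curation.
            rows.append({"question": q, "answer": None, "source": "user_log"})
    return rows

rows = build_eval_questions(
    [{"question": "What is Docker networking?", "answer": "It connects containers."}],
    ["what is docker networking?", "Why is my container losing DNS?"],
)
print(len(rows))  # 2: the duplicate user query was dropped
```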