Contextual Retrieval

The problem: chunks lose context when extracted. The fix: prepend a custom context sentence to each chunk before embedding.

The Core Idea

Anthropic published this technique in September 2024. The insight is deceptively simple:

Before embedding a chunk, prepend a short context sentence that describes where the chunk comes from and what it's about — generated by a fast LLM using the full document as input.

code

BEFORE (standard chunk):
"The revenue was $1.2B, up 15% YoY."

AFTER (contextual chunk):
"This chunk is from the Q3 2024 earnings report for Acme Corp, discussing 
financial performance. The revenue was $1.2B, up 15% YoY."

The embedding now captures the context. The chunk is self-sufficient.

Why This Works

Without context, the embedding of "The revenue was $1.2B" looks like a million other revenue sentences. With context, it's uniquely positioned in the embedding space as "Acme Corp Q3 2024 revenue."

Anthropic reported 49% reduction in retrieval failures on their benchmarks.

Implementation

bash

uv add anthropic langchain-anthropic langchain chromadb

python

import anthropic
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document
from typing import Optional

client = anthropic.Anthropic()


def generate_chunk_context(
    full_document: str,
    chunk: str,
    model: str = "claude-haiku-4-5",
) -> str:
    """
    Use Claude to generate a context sentence for a chunk,
    given the full document.
    
    Uses prompt caching for the full document (major cost saving
    when processing many chunks from the same document).
    """
    response = client.beta.messages.create(
        model=model,
        max_tokens=200,
        system="You are a document analyzer. Given a document and a chunk from it, "
               "write 1-2 sentences of context that describe what the chunk is about "
               "and where it fits in the document. Be concise and specific.",
        messages=[
            {
                "role": "user",
                "content": [
                    # Full document with cache_control — this gets cached!
                    {
                        "type": "text",
                        "text": f"<document>\n{full_document}\n</document>",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": (
                            f"<chunk>\n{chunk}\n</chunk>\n\n"
                            "Write a brief context sentence for this chunk. "
                            "Return ONLY the context sentence, nothing else."
                        ),
                    },
                ],
            }
        ],
        betas=["prompt-caching-2024-07-31"],
    )
    
    return response.content[0].text.strip()


def contextual_chunk(
    full_document: str,
    chunk: str,
) -> str:
    """Prepend generated context to a chunk."""
    context = generate_chunk_context(full_document, chunk)
    return f"{context}\n\n{chunk}"


def build_contextual_vectorstore(
    documents: list[str],
    collection_name: str = "contextual_rag",
) -> Chroma:
    """
    Build a Chroma vectorstore using contextual retrieval.
    
    For each document:
    1. Split into chunks
    2. Generate context for each chunk (with caching)
    3. Prepend context + embed
    """
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    all_docs = []
    
    for doc_idx, document in enumerate(documents):
        chunks = splitter.split_text(document)
        print(f"Document {doc_idx + 1}: {len(chunks)} chunks")
        
        for chunk_idx, chunk in enumerate(chunks):
            print(f"  Generating context for chunk {chunk_idx + 1}/{len(chunks)}...")
            contextualized = contextual_chunk(full_document=document, chunk=chunk)
            
            all_docs.append(Document(
                page_content=contextualized,
                metadata={
                    "doc_idx": doc_idx,
                    "chunk_idx": chunk_idx,
                    "original_chunk": chunk,  # keep original for display
                }
            ))
    
    print(f"\nBuilding vectorstore with {len(all_docs)} contextual chunks...")
    vectorstore = Chroma.from_documents(
        documents=all_docs,
        embedding=OpenAIEmbeddings(),
        collection_name=collection_name,
    )
    print("Done!")
    return vectorstore


# Usage
documents = [
    """
    Acme Corporation Q3 2024 Earnings Report
    
    Financial Performance:
    Revenue reached $1.2 billion, representing 15% year-over-year growth.
    Operating income was $240 million, with margins expanding 200 basis points.
    
    Product Segment:
    The cloud division grew 42% to $450 million, now representing 37.5% of revenue.
    Hardware sales declined 8% to $320 million due to supply chain constraints.
    
    Outlook:
    Management raised full-year guidance to $4.8-5.0 billion in revenue.
    Q4 expected revenue: $1.4-1.5 billion.
    """,
]

vectorstore = build_contextual_vectorstore(documents)

# Query
results = vectorstore.similarity_search("What happened with cloud revenue?", k=3)
for r in results:
    print(f"\n--- Result ---")
    print(r.page_content[:300])

Cost Optimization with Prompt Caching

The key insight from Anthropic's blog: when processing many chunks from the same document, the document gets cached after the first call. Subsequent chunks reuse the cache, saving up to 90% of token costs.

python

def batch_contextualize(full_document: str, chunks: list[str]) -> list[str]:
    """
    Process all chunks from one document.
    The first call caches the document; subsequent calls use cache.
    This is ~10x cheaper than processing chunks separately.
    """
    contextualized_chunks = []
    
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}/{len(chunks)}", end="\r")
        contextualized = contextual_chunk(full_document, chunk)
        contextualized_chunks.append(contextualized)
    
    return contextualized_chunks

Cost estimate (claude-haiku-4-5):

Without caching: ~$0.25 per 100-page PDF
With caching: ~$0.03 per 100-page PDF (~90% savings)

Comparing Standard vs Contextual RAG

python

from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Build both vectorstores
standard_chunks = splitter.split_text(document)
standard_vs = Chroma.from_texts(
    standard_chunks,
    OpenAIEmbeddings(),
    collection_name="standard_rag"
)

contextual_vs = build_contextual_vectorstore([document], "contextual_rag")

# Test query
query = "What is the cloud division revenue growth?"

standard_results = standard_vs.similarity_search(query, k=3)
contextual_results = contextual_vs.similarity_search(query, k=3)

print("=== Standard RAG ===")
for r in standard_results:
    print(f"  {r.page_content[:150]}")

print("\n=== Contextual RAG ===")
for r in contextual_results:
    print(f"  {r.page_content[:150]}")