Context Engineering
Context engineering is the discipline of designing what goes in the context window — the system prompt, background documents, instructions, tools, and conversation history — to make an LLM behave consistently and correctly across all inputs.
- Prompt Engineering: How to phrase one query to get a good answer
- Context Engineering: How to architect the entire context so the model behaves reliably on every query
The Context Window Budget
Every LLM has a finite context window (measured in tokens). You decide how to spend it:
┌─────────────────────────────────────────────────────────────┐
│ Context Window (e.g. 200,000 tokens for Claude) │
│ │
│ System Prompt ████████████ ~2,000 tokens │
│ Tools/Schema ████████ ~1,500 tokens │
│ Retrieved Docs (RAG) ████████████████████ ~50,000 │
│ Conversation History ████████████ ~20,000 tokens │
│ Current User Message ██ ~500 tokens │
│ ───────────────────────────────────────────────────────── │
│ Reserve for Output ████ ~4,096 tokens │
└─────────────────────────────────────────────────────────────┘
Context engineering is about budgeting well — putting the right information in the right position.
System Prompt Design
The system prompt is the most important thing you control. It runs on every request.
Anatomy of a Great System Prompt
import anthropic
SYSTEM_PROMPT = """# Role
You are TDS Assistant, an expert teaching assistant for the IIT Madras "Tools in Data Science" course.
# Primary Responsibilities
- Answer student questions about course topics: Python, FastAPI, Docker, LLMs, RAG, agents
- Provide working code examples when asked
- Point students to relevant week/lab content when appropriate
# Tone and Style
- Be precise and technical — students are intermediate-to-advanced programmers
- Use concrete examples over abstract explanations
- If you don't know something specific to the course, say so clearly
# Constraints
- Only answer questions relevant to the course topics listed above
- Do not write complete assignments for students — guide them to the answer
- Always test code mentally before sharing it
# Output Format
- For code: use fenced code blocks with the language specified
- For explanations: use headers to organize long answers
- For step-by-step guides: use numbered lists
"""
client = anthropic.Anthropic()
def ask(question: str, history: list = None) -> str:
messages = history or []
messages.append({"role": "user", "content": question})
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
system=SYSTEM_PROMPT,
messages=messages,
)
return response.content[0].text
System Prompt Anti-Patterns
❌ Too vague:
"Be helpful and answer questions."
❌ Contradictory:
"Be concise. Always provide thorough, detailed explanations."
❌ No format guidance:
(model returns inconsistent formats for the same query type)
❌ Missing constraints:
(model happily writes student assignments for them)
✅ Good system prompt structure:
- Role / Who you are
- Responsibilities / What you do
- Tone / How you communicate
- Constraints / What you won't do
- Output Format / How to structure responses
AGENTS.md
AGENTS.md is a convention for documenting how AI coding agents should interact with a codebase. Tools like Claude Code, Cursor, and GitHub Copilot read this file to understand project-specific context.
# TDS Course API — Agent Context
## Project Overview
FastAPI service for the TDS course platform. Handles student auth, lab submissions, and grade tracking.
## Tech Stack
- **Runtime**: Python 3.12, UV package manager
- **Framework**: FastAPI 0.115+
- **Database**: PostgreSQL 16 via asyncpg
- **Cache**: Redis 7
- **Tests**: pytest + httpx AsyncClient
## Directory Structure
```
src/
├── api/ # FastAPI routes (one file per resource)
├── models/ # Pydantic request/response models
├── db/ # Database layer (asyncpg, no ORM)
├── services/ # Business logic
└── tests/ # mirrors src/ structure
```
## Coding Conventions
- All DB functions are async; use `await` everywhere
- Use `structlog` for logging, not `print` or stdlib logging
- Pydantic models live in `models/`, not inline in routes
- Every route MUST have a corresponding test in `tests/`
- Error handling: raise `HTTPException`, never return error dicts
## Common Commands
```bash
uv run pytest # run all tests
uv run uvicorn src.main:app --reload # dev server
uv run alembic upgrade head # run migrations
```
## DO NOT
- Use synchronous database calls (no sqlite3 directly)
- Commit secrets or API keys
- Use `import *` anywhere
- Write tests that hit real external APIs (mock them)
## When Adding a New Endpoint
1. Add route in `src/api/`
2. Add Pydantic models in `src/models/`
3. Add business logic in `src/services/`
4. Add test in `src/tests/`
5. Update this file if the pattern changes
CLAUDE.md
CLAUDE.md is specifically recognized by Claude Code (Anthropic's CLI coding agent). It provides project context that gets prepended to every Claude Code session.
# TDS Lab 3 — YouTube Pipeline
## What This Project Does
Extracts YouTube video subtitles, identifies topics using an LLM, and generates
a structured JSON summary with timestamps.
## Architecture
```
YouTube URL
↓ yt-dlp
VTT Subtitle File
↓ parse_vtt()
Timestamped Segments (list of {text, start_time})
↓ LLM (Claude) — topic extraction
Topics with timestamps
↓ LLM (Claude) — structured JSON
Final summary.json
```
## Key Files
- `pipeline.py` — main orchestrator
- `subtitle_parser.py` — VTT parsing logic
- `llm_client.py` — all LLM calls go here
- `models.py` — Pydantic models for the output
## Environment Variables Required
```
ANTHROPIC_API_KEY=sk-ant-...
```
## Running the Pipeline
```bash
uv run python pipeline.py "https://youtube.com/watch?v=VIDEO_ID"
# → creates output/VIDEO_ID_summary.json
```
## Gotchas
- VTT files have duplicate lines (de-duplicate before processing)
- Auto-generated captions use `.en-US.vtt` extension
- Rate limit: 1 request/second to avoid Anthropic 429 errors
- Videos > 2 hours: chunk into 30-min segments before sending to LLM
## Testing
```bash
uv run pytest tests/ -v
# Integration tests require ANTHROPIC_API_KEY to be set
```
Context Window Management
The Position Effect
Information at the beginning and end of the context window is remembered better than the middle (the "lost in the middle" problem).
┌──────────────────────────────────────────┐
│ Position │ Attention Weight │
│───────────────────│──────────────────────│
│ Start (system) │ ████████████ HIGH │
│ Middle │ ████ LOW │
│ End (recent msgs) │ ████████████ HIGH │
└──────────────────────────────────────────┘
Implication: Put your most critical instructions at the start (system prompt) and repeat key constraints at the end of the context if needed.
Context Compression
For long conversations, compress older history:
async def compress_history(history: list[dict], max_tokens: int = 4000) -> list[dict]:
"""Compress old conversation history to save context space."""
if not history or len(history) < 6:
return history # keep short histories as-is
# Keep the last 4 messages verbatim (most recent context)
recent = history[-4:]
older = history[:-4]
if not older:
return recent
# Summarize older messages with LLM
older_text = "\n".join(
f"{m['role'].upper()}: {m['content'][:200]}..."
for m in older
)
summary_response = client.messages.create(
model="claude-haiku-4-5-20251001", # cheap model for summarization
max_tokens=512,
messages=[{
"role": "user",
"content": f"Summarize this conversation in 3-5 bullet points, preserving key decisions and facts:\n\n{older_text}"
}],
)
summary = summary_response.content[0].text
compressed = [{
"role": "user",
"content": f"[Earlier conversation summary]:\n{summary}\n\n[Continuing from here...]"
}]
return compressed + recent
# Use in a chat loop
history = []
while True:
user_input = input("You: ")
history.append({"role": "user", "content": user_input})
# Compress if getting long
history = await compress_history(history)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=SYSTEM_PROMPT,
messages=history,
)
reply = response.content[0].text
print(f"Assistant: {reply}")
history.append({"role": "assistant", "content": reply})
Context for RAG (Retrieved Documents)
Structure retrieved documents for maximum readability by the model:
def build_rag_context(query: str, documents: list[dict]) -> str:
"""Build a well-structured context block for RAG."""
docs_block = "\n\n".join([
f"<document id='{i+1}' source='{doc['source']}'>\n{doc['text']}\n</document>"
for i, doc in enumerate(documents)
])
return f"""<retrieved_documents>
{docs_block}
</retrieved_documents>
Using ONLY the information in the documents above, answer this question:
{query}
If the answer is not in the documents, say "I don't have enough information to answer that."
Cite document IDs when you use information from them, e.g. [doc 2]."""
Practical: Multi-Turn Context with System Prompt
import anthropic
from typing import Optional
class ConversationManager:
def __init__(self, system_prompt: str, model: str = "claude-sonnet-4-6"):
self.client = anthropic.Anthropic()
self.system = system_prompt
self.model = model
self.history: list[dict] = []
self.total_input_tokens = 0
self.total_output_tokens = 0
def chat(self, message: str, max_tokens: int = 1024) -> str:
self.history.append({"role": "user", "content": message})
response = self.client.messages.create(
model=self.model,
max_tokens=max_tokens,
system=self.system,
messages=self.history,
)
reply = response.content[0].text
self.history.append({"role": "assistant", "content": reply})
# Track usage
self.total_input_tokens += response.usage.input_tokens
self.total_output_tokens += response.usage.output_tokens
return reply
def get_usage(self) -> dict:
# Rough cost estimate for Claude Sonnet 4.6 (May 2026 pricing)
input_cost = self.total_input_tokens * 3 / 1_000_000 # $3/MTok
output_cost = self.total_output_tokens * 15 / 1_000_000 # $15/MTok
return {
"input_tokens": self.total_input_tokens,
"output_tokens": self.total_output_tokens,
"estimated_cost_usd": round(input_cost + output_cost, 4),
}
def reset(self):
self.history = []
# Usage
bot = ConversationManager(
system_prompt="You are a helpful Python tutor. Keep answers concise and code-focused.",
)
print(bot.chat("What's the difference between a list and a tuple?"))
print(bot.chat("When should I use each?"))
print(bot.chat("Give me an example of when a tuple is better."))
print(bot.get_usage())
# → {'input_tokens': 1247, 'output_tokens': 312, 'estimated_cost_usd': 0.0084}
Summary
| Concept | Key Principle |
|---|---|
| System prompt | Role + Responsibilities + Tone + Constraints + Format |
| AGENTS.md | Project context for AI coding agents (universal) |
| CLAUDE.md | Project context specifically for Claude Code |
| Position effect | Important info at start/end, not buried in middle |
| Context compression | Summarize old history with cheap model, keep recent verbatim |
| RAG context | Wrap documents in XML tags, cite by ID |