Prompt Caching
Prompt caching is the skill of making repeated LLM calls cheaper and faster by reusing stable context (policies, rubrics, long instructions, docs) instead of re-sending and re-processing it every time.
In 2026, almost every production LLM app does some form of caching: provider prompt caching, app-level response caching, retrieval caching, and tool-result caching.
Learning goals
- Design prompts that are cache-friendly
- Implement a simple app-level cache (SQLite) with safe invalidation
- Measure cost/latency impact with real numbers
What can be cached?
- Prompt prefix (stable system instructions + rubric + tool specs)
- RAG retrieval results (top-k chunks for a query)
- Tool results (expensive API calls)
- Final model outputs (when inputs repeat)
Cache only when inputs are stable and the output is safe to reuse (no user-specific secrets, no time-sensitive answers).
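That safety check can itself be made explicit as a small guard. A minimal sketch: `should_cache` and the time-sensitivity patterns below are illustrative assumptions, not a complete policy.

```python
import re

# Illustrative heuristic: questions mentioning "today", "now", etc. are
# probably time-sensitive and should not be served from cache.
TIME_SENSITIVE = re.compile(r"\b(today|now|current|latest)\b", re.IGNORECASE)

def should_cache(prompt: str, contains_user_secrets: bool) -> bool:
    """Return True only when the input is stable and the output is safe to reuse."""
    if contains_user_secrets:
        return False
    if TIME_SENSITIVE.search(prompt):
        return False
    return True
```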
Cache-friendly prompt design (the 80/20)
- Put stable content in a versioned block (e.g., `COURSE_CONTEXT_v3`).
- Keep dynamic content (the user question) in a small trailing section.
- Avoid cache-busters: timestamps, random IDs, unstable ordering.
Example template:

```
[SYSTEM] COURSE_CONTEXT_v3
- course: TDS 2026
- style: concise, technical, cite retrieved context
- rules: never invent citations

[USER]
Question: {{question}}
Context: {{retrieved_chunks}}
```
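One way to render this template (a sketch; `build_prompt` and the constant name are illustrative) keeps the stable block byte-identical across calls, which is exactly what provider prefix caching needs to match:

```python
# The versioned block is a module-level constant, so its bytes never vary.
COURSE_CONTEXT_V3 = """\
COURSE_CONTEXT_v3
- course: TDS 2026
- style: concise, technical, cite retrieved context
- rules: never invent citations"""

def build_prompt(question: str, retrieved_chunks: str) -> tuple[str, str]:
    """Return (system, user): stable prefix first, dynamic content last."""
    user = f"Question: {question}\nContext: {retrieved_chunks}"
    return COURSE_CONTEXT_V3, user
```

Note that anything dynamic (timestamps, request IDs) stays out of the system block entirely.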
App-level response caching (SQLite example)
Even if your provider does prompt caching, app-level caching is useful for:
- CI test suites / eval harnesses
- Dev workflows
- Replaying agent runs
```python
import hashlib
import json
import sqlite3
from typing import Any

DB = "prompt_cache.sqlite"

conn = sqlite3.connect(DB)
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS cache (
        k TEXT PRIMARY KEY,
        v TEXT NOT NULL,
        created_at INTEGER NOT NULL
    )
    """
)

def cache_key(model: str, system: str, user: str, template_version: str) -> str:
    """Stable key: hash the canonical JSON of everything the answer depends on."""
    raw = json.dumps(
        {"model": model, "system": system, "user": user, "v": template_version},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def cache_get(k: str) -> Any | None:
    row = conn.execute("SELECT v FROM cache WHERE k=?", (k,)).fetchone()
    return json.loads(row[0]) if row else None

def cache_put(k: str, v: Any) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO cache (k, v, created_at) VALUES (?, ?, strftime('%s','now'))",
        (k, json.dumps(v)),
    )
    conn.commit()
```
Invalidation strategy
- Bump `template_version` when you change instructions.
- Include a hash of the retrieval context if the answer depends on documents.
- Use a TTL if outputs should expire.
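Against the schema above, a TTL can be enforced with a periodic purge in one statement. A sketch, assuming the same `cache` table; `TTL_SECONDS` is an illustrative default.

```python
import sqlite3
import time

TTL_SECONDS = 24 * 3600  # illustrative: expire entries after one day

def purge_expired(conn: sqlite3.Connection, ttl: int = TTL_SECONDS) -> int:
    """Delete rows older than `ttl` seconds; returns the number removed."""
    cutoff = int(time.time()) - ttl
    cur = conn.execute("DELETE FROM cache WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```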
Measuring impact
Track:
- input tokens, output tokens
- p50 / p95 latency
- cache hit rate
A good first goal: >30% hit rate in dev/eval workflows.
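The hit-rate metric is cheap to instrument. A minimal sketch (the `CacheStats` class is illustrative; latencies could be summarized into p50/p95 with `statistics.quantiles`):

```python
from dataclasses import dataclass, field

@dataclass
class CacheStats:
    """Count hits and misses; collect per-call latencies for p50/p95 later."""
    hits: int = 0
    misses: int = 0
    latencies_ms: list[float] = field(default_factory=list)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```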
Mini-lab (optional)
Build a CLI command `tds-ask "..."` that:
- caches the stable course primer
- caches full answers by (model, prompt_version, normalized question)
- prints cache hit/miss + latency
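The "normalized question" part of the key might look like this. A sketch under stated assumptions: the exact normalization rules are up to you; these are illustrative.

```python
import re

def normalize_question(q: str) -> str:
    """Illustrative normalization: lowercase, collapse whitespace, strip
    trailing punctuation, so trivially different phrasings share a cache key."""
    q = q.strip().lower()
    q = re.sub(r"\s+", " ", q)
    return q.rstrip("?!. ")
```

Aggressive normalization raises the hit rate but risks conflating genuinely different questions, so start conservative.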
Where this fits
Prompt caching pairs naturally with Structured Output and Function Calling: both reduce retries and make agent loops cheaper.