
Prompt Caching

Prompt caching is the technique of making repeated LLM calls cheaper and faster by reusing stable context (policies, rubrics, long instructions, docs) instead of re-sending and re-processing it on every call.

In 2026, almost every production LLM app does some form of caching: provider prompt caching, app-level response caching, retrieval caching, and tool-result caching.

Learning goals

  • Design prompts that are cache-friendly
  • Implement a simple app-level cache (SQLite) with safe invalidation
  • Measure cost/latency impact with real numbers

What can be cached?

  1. Prompt prefix (stable system instructions + rubric + tool specs)
  2. RAG retrieval results (top-k chunks for a query)
  3. Tool results (expensive API calls)
  4. Final model outputs (when inputs repeat)

Don’t cache blindly

Cache only when inputs are stable and the output is safe to reuse (no user-specific secrets, no time-sensitive answers).

Cache-friendly prompt design (the 80/20)

  • Put stable content in a versioned block (e.g., COURSE_CONTEXT_v3).
  • Keep dynamic content (user question) in a small section.
  • Avoid cache-busters: timestamps, random IDs, unstable ordering.

Example template:

```text
[SYSTEM] COURSE_CONTEXT_v3
- course: TDS 2026
- style: concise, technical, cite retrieved context
- rules: never invent citations

[USER]
Question: {{question}}
Context: {{retrieved_chunks}}
```
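
The template above can be assembled in code. A minimal sketch (helper and variable names are hypothetical) that keeps the stable block byte-identical across calls, so provider-side prefix caching can match it:

```python
# Hypothetical helper: emit the stable, versioned block first and keep it
# byte-for-byte identical across calls; dynamic content goes last.
STABLE_SYSTEM_V3 = (
    "COURSE_CONTEXT_v3\n"
    "- course: TDS 2026\n"
    "- style: concise, technical, cite retrieved context\n"
    "- rules: never invent citations"
)

def build_messages(question: str, retrieved_chunks: str) -> list[dict]:
    # Never interleave timestamps or request IDs into the system block:
    # any byte that changes breaks the cacheable prefix.
    return [
        {"role": "system", "content": STABLE_SYSTEM_V3},
        {"role": "user", "content": f"Question: {question}\nContext: {retrieved_chunks}"},
    ]
```

The key design choice is ordering: everything before the first dynamic byte is the cacheable prefix, so the stable block always comes first.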

App-level response caching (SQLite example)

Even if your provider does prompt caching, app-level caching is useful for:

  • CI test suites / eval harnesses
  • Dev workflows
  • Replaying agent runs

```python
import hashlib
import json
import sqlite3
from typing import Any

DB = "prompt_cache.sqlite"

conn = sqlite3.connect(DB)
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS cache (
        k TEXT PRIMARY KEY,
        v TEXT NOT NULL,
        created_at INTEGER NOT NULL
    )
    """
)

def cache_key(model: str, system: str, user: str, template_version: str) -> str:
    # Hash every input that should change the answer; sort_keys keeps the
    # serialization (and therefore the key) deterministic.
    raw = json.dumps(
        {"model": model, "system": system, "user": user, "v": template_version},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def cache_get(k: str) -> Any | None:
    row = conn.execute("SELECT v FROM cache WHERE k=?", (k,)).fetchone()
    return json.loads(row[0]) if row else None

def cache_put(k: str, v: Any) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO cache (k, v, created_at) VALUES (?, ?, strftime('%s','now'))",
        (k, json.dumps(v)),
    )
    conn.commit()
```
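
Putting the pieces together: a standalone sketch of the check-then-call pattern (the helpers are repeated with an in-memory database so the snippet runs on its own, and llm_fn stands in for your real client call):

```python
import hashlib
import json
import sqlite3
from typing import Callable

# In-memory DB so this sketch is self-contained; in the app you'd reuse
# the on-disk connection and helpers defined above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (k TEXT PRIMARY KEY, v TEXT NOT NULL)")

def cache_key(model: str, system: str, user: str, template_version: str) -> str:
    raw = json.dumps(
        {"model": model, "system": system, "user": user, "v": template_version},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def cached_call(model: str, system: str, user: str, version: str,
                llm_fn: Callable[[str], str]) -> tuple[str, bool]:
    """Return (answer, was_cache_hit); call the model only on a miss."""
    k = cache_key(model, system, user, version)
    row = conn.execute("SELECT v FROM cache WHERE k=?", (k,)).fetchone()
    if row:
        return json.loads(row[0]), True
    answer = llm_fn(user)
    conn.execute("INSERT OR REPLACE INTO cache (k, v) VALUES (?, ?)",
                 (k, json.dumps(answer)))
    conn.commit()
    return answer, False

calls = []
def fake_llm(user: str) -> str:
    calls.append(user)
    return f"answer to: {user}"

a1, hit1 = cached_call("m", "sys", "What is caching?", "v3", fake_llm)
a2, hit2 = cached_call("m", "sys", "What is caching?", "v3", fake_llm)
# Second identical call is served from the cache; fake_llm ran once.
```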

Invalidation strategy

  • Bump template_version when you change instructions.
  • Include retrieval context hash if the answer depends on documents.
  • Use a TTL if outputs should expire.
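
The TTL bullet can be layered onto the created_at column from the schema above. A minimal sketch (the TTL value and function name are assumptions; shown with an in-memory table so it runs standalone):

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (k TEXT PRIMARY KEY, v TEXT, created_at INTEGER)")

TTL_SECONDS = 3600  # assumption: one hour; tune per use case

def cache_get_ttl(k: str, ttl: int = TTL_SECONDS):
    row = conn.execute("SELECT v, created_at FROM cache WHERE k=?", (k,)).fetchone()
    if row is None:
        return None
    v, created_at = row
    if time.time() - created_at > ttl:
        # Expired: delete lazily and report a miss.
        conn.execute("DELETE FROM cache WHERE k=?", (k,))
        conn.commit()
        return None
    return json.loads(v)
```

Lazy deletion on read keeps the code simple; a periodic `DELETE ... WHERE created_at < ?` sweep works too if the table grows large.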

Measuring impact

Track:

  • input tokens, output tokens
  • p50 / p95 latency
  • cache hit rate

A good first goal: >30% hit rate in dev/eval workflows.
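
In dev, a small counter object is enough to track these numbers. A sketch (naive nearest-rank percentiles, fine for a dashboard but not for serious load testing):

```python
from dataclasses import dataclass, field

@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, hit: bool, latency_ms: float) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self.latencies_ms.append(latency_ms)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def percentile(self, p: float) -> float:
        # Nearest-rank over the sorted samples; good enough for dev metrics.
        xs = sorted(self.latencies_ms)
        idx = min(len(xs) - 1, int(p / 100 * len(xs)))
        return xs[idx]
```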

Mini-lab (optional)

Build a CLI command tds-ask "..." that:

  • caches the stable course primer
  • caches full answers by (model, prompt_version, normalized question)
  • prints cache hit/miss + latency
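
For the "normalized question" part of the answer-cache key, a hedged sketch (the exact rules are an assumption; tune them to your traffic):

```python
import re

# Hypothetical normalizer: without it, "What is RAG?" and "  what is rag "
# would produce different keys and miss each other.
def normalize_question(q: str) -> str:
    q = q.strip().lower()
    q = re.sub(r"\s+", " ", q)  # collapse runs of whitespace
    q = q.rstrip("?!. ")        # trailing punctuation rarely changes intent
    return q
```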

Where this fits

Prompt caching pairs naturally with Structured Output and Function Calling: both reduce retries and make agent loops cheaper.