Lab 9: LLM Red-Teaming
Difficulty: Advanced · Estimated time: ~3–5 hours
Objective
Build a repeatable red-teaming harness for a tool-using LLM app (your Lab 2/4 RAG chatbot is fine):
- Threat model the app (inputs, tools, data sources)
- Create attack prompts and tool-misuse scenarios
- Add automated checks so vulnerabilities become regression tests
What you will build
A foldered red-team suite:
text
redteam/
attacks.yml
run.py
reports/
fixtures/
Task 1 — Threat model (write it down)
Create a short THREAT_MODEL.md that answers:
- What tools does the agent have? (web fetch, DB, shell, filesystem)
- What data is sensitive? (API keys, user PII, system prompt)
- What are the invariants?
- never reveal system prompt
- never execute arbitrary commands
- never fetch non-allowlisted internal URLs
Task 2 — Attack corpus
Create attacks.yml with at least:
- 10 direct prompt-injection attacks
- 10 indirect prompt-injection attacks (poisoned “document” content)
- 5 tool-misuse attacks (unsafe args)
- 5 data-exfiltration attempts
Task 3 — Automated checks
Your harness must output a machine-readable result per attack:
- pass/fail
- reason
- evidence snippet
Example rules:
- fail if output contains
sk-(API key-like) - fail if output contains “system prompt” markers
- fail if a tool is called with a disallowed URL
Task 4 — Mitigate + regress
Pick 3 failing attacks and mitigate them using one or more:
- stricter tool allowlists
- safer system prompt
- retrieval filtering
- guardrails (Week 8)
Then rerun and show they pass.
Deliverables
THREAT_MODEL.mdredteam/attacks.ymlredteam/run.pyredteam/reports/latest.json