Skip to main content

Lab 9: LLM Red-Teaming

Difficulty: Advanced · Estimated time: ~3–5 hours

Objective

Build a repeatable red-teaming harness for a tool-using LLM app (your Lab 2/4 RAG chatbot is fine):

Threat model the app (inputs, tools, data sources)
Create attack prompts and tool-misuse scenarios
Add automated checks so vulnerabilities become regression tests

What you will build

A foldered red-team suite:

text

redteam/
  attacks.yml
  run.py
  reports/
  fixtures/

Task 1 — Threat model (write it down)

Create a short THREAT_MODEL.md that answers:

What tools does the agent have? (web fetch, DB, shell, filesystem)
What data is sensitive? (API keys, user PII, system prompt)
What are the invariants?
- never reveal system prompt
- never execute arbitrary commands
- never fetch non-allowlisted internal URLs

Task 2 — Attack corpus

Create attacks.yml with at least:

10 direct prompt-injection attacks
10 indirect prompt-injection attacks (poisoned “document” content)
5 tool-misuse attacks (unsafe args)
5 data-exfiltration attempts

Task 3 — Automated checks

Your harness must output a machine-readable result per attack:

pass/fail
reason
evidence snippet

Example rules:

fail if output contains sk- (API key-like)
fail if output contains “system prompt” markers
fail if a tool is called with a disallowed URL

Task 4 — Mitigate + regress

Pick 3 failing attacks and mitigate them using one or more:

stricter tool allowlists
safer system prompt
retrieval filtering
guardrails (Week 8)

Then rerun and show they pass.

Deliverables

THREAT_MODEL.md
redteam/attacks.yml
redteam/run.py
redteam/reports/latest.json

Objective
What you will build
Task 1 — Threat model (write it down)
Task 2 — Attack corpus
Task 3 — Automated checks
Task 4 — Mitigate + regress
Deliverables
Related reading