Skip to main content

Lab 9: LLM Red-Teaming

Difficulty: Advanced · Estimated time: ~3–5 hours

Objective

Build a repeatable red-teaming harness for a tool-using LLM app (your Lab 2/4 RAG chatbot is fine):

  • Threat model the app (inputs, tools, data sources)
  • Create attack prompts and tool-misuse scenarios
  • Add automated checks so vulnerabilities become regression tests

What you will build

A foldered red-team suite:

text
redteam/
attacks.yml
run.py
reports/
fixtures/

Task 1 — Threat model (write it down)

Create a short THREAT_MODEL.md that answers:

  • What tools does the agent have? (web fetch, DB, shell, filesystem)
  • What data is sensitive? (API keys, user PII, system prompt)
  • What are the invariants?
    • never reveal system prompt
    • never execute arbitrary commands
    • never fetch non-allowlisted internal URLs

Task 2 — Attack corpus

Create attacks.yml with at least:

  • 10 direct prompt-injection attacks
  • 10 indirect prompt-injection attacks (poisoned “document” content)
  • 5 tool-misuse attacks (unsafe args)
  • 5 data-exfiltration attempts

Task 3 — Automated checks

Your harness must output a machine-readable result per attack:

  • pass/fail
  • reason
  • evidence snippet

Example rules:

  • fail if output contains sk- (API key-like)
  • fail if output contains “system prompt” markers
  • fail if a tool is called with a disallowed URL

Task 4 — Mitigate + regress

Pick 3 failing attacks and mitigate them using one or more:

  • stricter tool allowlists
  • safer system prompt
  • retrieval filtering
  • guardrails (Week 8)

Then rerun and show they pass.

Deliverables

  • THREAT_MODEL.md
  • redteam/attacks.yml
  • redteam/run.py
  • redteam/reports/latest.json