OpenClaw / LLM Red-Teaming
LLM red-teaming is the practice of systematically attacking an LLM app (especially tool-using agents) and converting every discovered failure into a regression test.
Learning goals
- Build a threat model for an LLM app (inputs, tools, data stores)
- Create an attack corpus that stays useful as the app evolves
- Turn attacks into automated tests in CI
The modern LLM attack surface
- Prompt injection (direct + indirect via retrieved docs)
- Data exfiltration (secrets, system prompts, tool outputs)
- Tool misuse (unsafe arguments, command execution, SSRF-like fetches)
- Policy bypass (jailbreaks, instruction hierarchy confusion)
A simple red-team workflow
- Define the app’s security invariants (what must never happen)
- Create a prompt corpus (
attacks/) - Run attacks against the app (scripted harness)
- Record failures + mitigations
- Add regression tests so the same exploit never ships again
What “good” looks like
- Attacks are versioned and run in CI
- You measure:
- refusal rate on disallowed requests
- false positives (over-blocking)
- data leakage checks
Guardrails are necessary but not sufficient
NeMo Guardrails and similar tools reduce risk, but you still need:
- strict input validation
- tool allowlists
- least-privilege credentials
- logging + monitoring
See: NeMo Guardrails and LLM Security.
Mini-lab (optional)
Build a red-team harness for your RAG chatbot:
- 30 prompt-injection attacks (including indirect injection via docs)
- automated evaluation that asserts “no secrets leaked”
- a report that maps each attack → mitigation