This framework takes a white-box approach: you feed it your agent's architecture, its tool definitions, and its role configuration. It then generates thousands of multi-turn attack sequences that are specific to what your agent can actually do. In our benchmarks, white-box attacks found 5x more vulnerabilities than black-box approaches.
Some of the threat categories it covers that we think are under explored: chained data exfiltration, where a single prompt chains read_file into send_email and your data is gone before any alert fires. Cascading hallucination attacks that gradually corrupt agent reasoning across a conversation. Rogue agent behavior where agents get manipulated into taking actions outside their scope (unauthorized Slack messages, GitHub commits, webhook triggers). Indirect prompt injection via retrieved documents, emails, or web content that hijack your agent mid-task. Multi-agent privilege escalation where a compromised sub-agent poisons context flowing to an orchestrator. Out-of-band exfiltration through DNS lookups, HTTP callbacks, or steganographic patterns that bypass DLP entirely.
None of these show up in a CVE scanner. The biggest vulnerability in an agentic system isn't a code bug; it's what a rogue user or rogue agent can convince your AI to do.
Stack: TypeScript, MIT license. Here's a longer write up: https://votal.ai/white-box-red-teaming-for-agentic-ai-an-ope...
Would love feedback on the attack catalog structure, the white-box approach vs. black-box tradeoffs, and any threat categories we're missing. PRs and issues welcome. Thank you.