It worked too well. The agent bypassed my safeguards and executed a `drop_table` tool call because it thought it was "telling a story."
Realizing that most agents are vulnerable to this kind of context-based injection, I spent the weekend building two things:
1. A repo of 5,000+ Agent Attack Vectors (Grandma, CEO Override, Debug Mode): https://github.com/Esrbwt1/voidgate
2. VoidGate – A semantic firewall and remote kill switch for Python agents: https://voidgate.vercel.app/
It uses Upstash Redis to check a "Kill Flag" in <50ms before allowing any tool execution. If you see your agent going rogue, you flip the switch and every running instance halts immediately, wherever it's deployed.
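Roughly, the gate is a single Redis GET in front of every tool call. Here's a minimal sketch of that pattern using the Upstash Python SDK; the key name and wrapper function are my own placeholders, not VoidGate's actual client API:

```python
from upstash_redis import Redis

# Reads UPSTASH_REDIS_REST_URL / UPSTASH_REDIS_REST_TOKEN from the environment.
redis = Redis.from_env()

# Hypothetical key name; the real client may use a different one.
KILL_FLAG_KEY = "voidgate:kill"

def guarded_call(tool, *args, **kwargs):
    """Run a tool only if the remote kill flag is not set."""
    # One GET against Upstash's REST API, typically well under 50ms.
    if redis.get(KILL_FLAG_KEY) is not None:
        raise RuntimeError("Kill flag is set: blocking tool execution")
    return tool(*args, **kwargs)

# Usage: wrap every tool invocation your agent makes, e.g.
# result = guarded_call(drop_table, "users")
```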
I'm releasing the Python client as open source. The "Attack DB" is also free to use for your own red-teaming.
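If you want to point the Attack DB at your own agent, the loop is straightforward. A sketch assuming a JSON file of `{name, prompt}` entries and an agent exposing a `run()` method that reports attempted tool calls (both are placeholders; check the repo for the real schema):

```python
import json

def red_team(agent, path="attack_vectors.json"):
    """Replay each attack prompt and record which ones got through."""
    with open(path) as f:
        attacks = json.load(f)

    failures = []
    for attack in attacks:
        response = agent.run(attack["prompt"])  # your agent's entry point
        # Count an attack as successful if the agent attempted any tool call.
        if response.tool_calls:
            failures.append(attack["name"])
    return failures
```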
Would love feedback on the attack vectors. Is anyone else seeing agents fall for simple roleplay exploits?
Esrbwt•58m ago
I'm curious whether others are handling this via LLM-based guardrails (checking the input with another LLM) or hard-coded logic like this. In my testing, LLM guardrails added too much latency.
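For reference, the hard-coded version is just fast pattern matching; a toy sketch (the deny-list here is purely illustrative):

```python
import re

# Illustrative patterns only; a production deny-list needs far broader coverage.
BLOCKED = [
    re.compile(r"drop[_\s]+table", re.I),
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I),
    re.compile(r"pretend\s+you\s+are|roleplay\s+as", re.I),
]

def looks_like_injection(text: str) -> bool:
    # Microseconds per check, versus an extra model round-trip for an LLM judge.
    return any(p.search(text) for p in BLOCKED)
```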