The thing that surprised me most was how unreliable even basic guardrails were once you gave agents real tools. The gap between "works in a demo" and "works in production with adversarial input" is massive.
Curious how you handle the evaluation side. When someone claims a successful jailbreak, is that verified automatically or manually? Seems like auto-verification could itself be exploitable.
Evaluation is automated and server-side. We check whether the agent actually did the thing it wasn’t supposed to (tool calls, actions, outputs) rather than just pattern-matching on the response text (at least for the first challenge where the agent is manipulated to call the reveal_access_code tool). But honestly you’re touching on something we’ve been debating internally - the evaluator itself is an attack surface. We’ve kicked around the idea of making “break the evaluator” an explicit challenge. Not sure yet.
What were you seeing at Octomind with the browsing agents? Was it mostly stuff embedded in page content or were attacks coming through structured data / metadata too? Are bad actors sophisticated enough already to exploit this?
Anthropic just showed us that the problem isn't what people think it is. They found that attackers don't try to hack the safety features head-on. Instead they just... ask the AI to do a bunch of separate things that sound totally normal. "Run a security scan." "Check the credentials." "Extract some data." Each request by itself is fine. But put them together and boom, you've hacked the system.
The issue is safety systems only look at one request at a time. They miss what's actually happening because they're not watching the pattern. You can block 95% of obvious jailbreaks and still get totally compromised.
So yeah, publishing the exploits every week is actually smart. It forces companies to stop pretending their guardrails are good enough and actually do something about it.
For example, I've seen Recursive Execution work: where you don't just plant a prompt in a page, but you plant a prompt that specifically instructs the agent to use a second tool (like a calculator or code interpreter) to execute a hidden payload. Many guardrails seem to focus on the 'retrieval' phase but drop their guard once the agent moves to the 'execution' phase of a sub-task.
Has anyone else noticed specific 'blind spots' that appear only when an agent is halfway through a multi-tool chain? It feels like the more tools we give them, the more surface area we create for these 'logic leaps.
zachdotai•1h ago
Context stuffing - flood the conversation with benign text, bury a prompt injection in the middle. The agent's attention dilutes across the context window and the instruction slips through. Guardrails that work fine on short exchanges just miss it.
Indirect injection via tool outputs - if the agent can browse or search, you don't attack the conversation at all. You plant instructions in a page the agent retrieves. Most guardrails only watch user input, not what comes back from tools.
Both are really simple. That's kind of the point.
We build runtime security for AI agents at Fabraix and we open-sourced a playground to stress-test this stuff in the open. Weekly challenges, visible system prompts, real agent capabilities. Winning techniques get published. Community proposes and votes on what gets tested next.