AI agents are easy to break

4•zachdotai•1h ago

Comments

zachdotai•1h ago

Two techniques that keep working against agents with real tools:

Context stuffing - flood the conversation with benign text, bury a prompt injection in the middle. The agent's attention dilutes across the context window and the instruction slips through. Guardrails that work fine on short exchanges just miss it.

Indirect injection via tool outputs - if the agent can browse or search, you don't attack the conversation at all. You plant instructions in a page the agent retrieves. Most guardrails only watch user input, not what comes back from tools.

Both are really simple. That's kind of the point.

We build runtime security for AI agents at Fabraix and we open-sourced a playground to stress-test this stuff in the open. Weekly challenges, visible system prompts, real agent capabilities. Winning techniques get published. Community proposes and votes on what gets tested next.

bothlabs•1h ago

This is a neat idea. At my last company (Octomind) we built AI agents for end-to-end testing and ran into the indirect injection problem constantly. Agents that browse or interact with web pages are especially vulnerable because you can't sanitize the entire internet.

The thing that surprised me most was how unreliable even basic guardrails were once you gave agents real tools. The gap between "works in a demo" and "works in production with adversarial input" is massive.

Curious how you handle the evaluation side. When someone claims a successful jailbreak, is that verified automatically or manually? Seems like auto-verification could itself be exploitable.

zachdotai•1h ago

Yeah the demo-to-production gap is massive. We see the same thing with browser agents being potentially the most vulnerable. And I think this is because of context being stuffed with the web page html that it obscures small injection attempts.

Evaluation is automated and server-side. We check whether the agent actually did the thing it wasn’t supposed to (tool calls, actions, outputs) rather than just pattern-matching on the response text (at least for the first challenge where the agent is manipulated to call the reveal_access_code tool). But honestly you’re touching on something we’ve been debating internally - the evaluator itself is an attack surface. We’ve kicked around the idea of making “break the evaluator” an explicit challenge. Not sure yet.

What were you seeing at Octomind with the browsing agents? Was it mostly stuff embedded in page content or were attacks coming through structured data / metadata too? Are bad actors sophisticated enough already to exploit this?

Kshamiyah•1h ago

Yeah, I think Fabraix is doing something really important here.

Anthropic just showed us that the problem isn't what people think it is. They found that attackers don't try to hack the safety features head-on. Instead they just... ask the AI to do a bunch of separate things that sound totally normal. "Run a security scan." "Check the credentials." "Extract some data." Each request by itself is fine. But put them together and boom, you've hacked the system.

The issue is safety systems only look at one request at a time. They miss what's actually happening because they're not watching the pattern. You can block 95% of obvious jailbreaks and still get totally compromised.

So yeah, publishing the exploits every week is actually smart. It forces companies to stop pretending their guardrails are good enough and actually do something about it.

zachdotai•45m ago

The multi-step thing is exactly what makes agents with real tools so much harder to secure than chat-based setups. Each action looks fine in isolation, it's the sequence that's the problem. And most (but not all) guardrail systems are stateless, they evaluate each turn on its own.

XeonQ8•34m ago

Great point on the indirect injection via tool outputs. I’ve noticed a similar 'tool-chain' vulnerability when working with agents that handle multi-step data processing.

For example, I've seen Recursive Execution work: where you don't just plant a prompt in a page, but you plant a prompt that specifically instructs the agent to use a second tool (like a calculator or code interpreter) to execute a hidden payload. Many guardrails seem to focus on the 'retrieval' phase but drop their guard once the agent moves to the 'execution' phase of a sub-task.

Has anyone else noticed specific 'blind spots' that appear only when an agent is halfway through a multi-tool chain? It feels like the more tools we give them, the more surface area we create for these 'logic leaps.

Cognitive Threat Scanner – Detect manipulation in content using SCT taxonomy

Don't Use Moving Averages (2024)

When Every "AI Headshot" Looked Fake, I Spent 2 Weeks Hacking Together My Own

Canonicalize Your Web Identity and Achieve Data Sovereignty with Pesos

The Barbican Basin

One in Four Smartphones Are Now iPhones

Google follows Anthropic: Antigravity sub can't be used in OpenCode/etc.

The Simulation Argument Revisited in a Functional Universe

Building a production-grade SaaS product just with AI

Show HN: Godot MCP – Give AI assistants full access to the Godot editor

Show HN: Tahc – AI chat widget that hands off to a human when it can't answer

"Energy Based" model vs. frontier AI models for Sudoku

FDA says companies can claim "no artificial colors" if they use natural dyes

So, if Rust is in Linux can it be in Emacs, too?

Show HN: Clap.Net – Source generated CLI Parsing for .NET (Inspired by Clap-Rs)

Apparently, there is a website that ships 52.5 MB of CSS

What I learned when I started assigning the hard reading again

Show HN: TapTap AI – Use Your OpenClaw Agent from Apple Watch/AirPods/CarPlay

Two co-founders of Elon Musk's xAI resign, joining exodus

Show HN: Renovate – The Kubernetes-Native Way

Show HN: Superjson – Simple, beautiful JSON explorer

Show HN: Minimal Pomodoro timer for macOS (1.7MB, now with keyboard shortcuts)

FAA Lifts Closure at El Paso Airport

An AI-generated pull request that makes sense

Deploying Rust to Production Checklist

Show HN: Triclock – A Triangular Clock

Show HN: Deeploy v0.3.0 – terminal-first VPS app deployment tool

Show HN: Eryx, a fast WASM-based Python sandbox with native extension support

Don't Go Monolithic; the Enterprise Agent Stack Is Stratifying

Ejabberd 26.02 / ProcessOne – Erlang Jabber/XMPP/Matrix Server – Communication