[LIVE DEMO] AI Agents Jailbreak Themselves Without Any Attack
Normally, getting an LLM to produce harmful content — self-harm instructions, weapon tutorials, exploit code — requires a pretty sophisticated attack. Prompt injection, jailbreaks, adversarial suffixes, the whole arms race.
I found that in an agent setting, you don't need any of that. You just give the model a normal task — say, training a LlamaGuard content moderation model — and it will produce a full harmful text dataset on its own. No refusal. No hesitation. It thinks it's doing its job.
I tested 100 frontier models. Basically every model can be triggered this way. GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6 — all of them. Every major provider. Zero adversarial effort required.
This is a big deal for anyone deploying agents in production — tools like Openclaw, Claude Code, Codex, or any agentic framework that gives LLMs file access and code execution. If your agent touches sensitive data in science, healthcare, or security workflows, it could generate harmful content as a side effect of doing its job.
I want to share this finding because I think both developers building on LLMs and normal users need to be aware. This is real — I've included live demos as proof so you can see it happening, not just take my word for it:
pythonsen•1h ago
Normally, getting an LLM to produce harmful content — self-harm instructions, weapon tutorials, exploit code — requires a pretty sophisticated attack. Prompt injection, jailbreaks, adversarial suffixes, the whole arms race.
I found that in an agent setting, you don't need any of that. You just give the model a normal task — say, training a LlamaGuard content moderation model — and it will produce a full harmful text dataset on its own. No refusal. No hesitation. It thinks it's doing its job.
I tested 100 frontier models. Basically every model can be triggered this way. GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6 — all of them. Every major provider. Zero adversarial effort required.
This is a big deal for anyone deploying agents in production — tools like Openclaw, Claude Code, Codex, or any agentic framework that gives LLMs file access and code execution. If your agent touches sensitive data in science, healthcare, or security workflows, it could generate harmful content as a side effect of doing its job. I want to share this finding because I think both developers building on LLMs and normal users need to be aware. This is real — I've included live demos as proof so you can see it happening, not just take my word for it:
85 reproducible prompt if you want to try it yourself: https://github.com/wuyoscar/ISC-Bench