Show HN: Open-source playground to red-team AI agents with exploits published

17•zachdotai•2h ago

We build runtime security for AI agents. The playground started as an internal tool that we used to test our own guardrails. But we kept finding the same types of vulnerabilities because we think about attacks a certain way. At some point you need people who don't think like you.

So we open-sourced it. Each challenge is a live agent with real tools and a published system prompt. Whenever a challenge is over, the full winning conversation transcript and guardrail logs get documented publicly.

Building the general-purpose agent itself was probably the most fun part. Getting it to reliably use tools, stay in character, and follow instructions while still being useful is harder than it sounds. That alone reminded us how early we all are in understanding and deploying these systems at scale.

First challenge was to get an agent to call a tool it's been told to never call.

Someone got through in around 60 seconds without ever asking for the secret directly (which taught us a lot).

Next challenge is focused on data exfiltration with harder defences: https://playground.fabraix.com

Comments

hellocr7•1h ago

I have tried to manipulate it using base64 encoding and translaion into other languages which didnt work so far but seems to be that llm as a judge is a very fragile defence for this. Would be cool to add a leaderboard though

zachdotai•28m ago

Thanks for trying it out! Base64 and language switching are solid approaches but they don't tend to work anymore with the latest models in my experience.

You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.

Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...

The leaderboard should start populating once we have more submissions!

spranab•10m ago

Ran fabraix/playground through IdeaCred (automated repo scorer) — 61/100, strongest in undefined.

Free badge/profile: https://ideacred.com/profile/fabraix

Show HN: Free OpenAI API Access with ChatGPT Account

Show HN: GDSL – 800 line kernel: Lisp subset in 500, C subset in 1300

Show HN: Open-source playground to red-team AI agents with exploits published

Show HN: Signet – Autonomous wildfire tracking from satellite and weather data

Show HN: What if your synthesizer was powered by APL (or a dumb K clone)?

Show HN: Lockstep – A data-oriented programming language

Show HN: Ritual – An Open Source Local Monochrome Themed Habit Tracker PWA

Show HN: Nova–Self-hosted personal AI learns from corrections &fine-tunes itself

Show HN: Webassembly4J Run WebAssembly from Java

Show HN: Goal.md, a goal-specification file for autonomous coding agents

Show HN: Tmux-nvim-navigator – Seamless navigation with zero Neovim config

Show HN: Flutterby, an App for Flutter Developers

Show HN: HUMANTODO

Show HN: Han – A Korean programming language written in Rust

Show HN: Claude's 2x usage promotion (March 2026) in your timezone

Show HN: Ichinichi – One note per day, E2E encrypted, local-first

Show HN: HN Skins – Available Skins: Cafe, Courier, London, Midnight, Terminal

Show HN: GitAgent – An open standard that turns any Git repo into an AI agent

Show HN: Detach – Mobile UI for managing AI coding agents from your phone

Show HN: AgentMailr – dedicated email inboxes for AI agents

Show HN: GrobPaint: Somewhere Between MS Paint and Paint.net

Show HN: Channel Surfer – Watch YouTube like it’s cable TV

Show HN: Sway, a board game benchmark for quantum computing

Show HN: Context Gateway – Compress agent context before it hits the LLM

Show HN: Voice-tracked teleprompter using on-device ASR in the browser

Show HN: Data-anim – Animate HTML with just data attributes

Show HN: RSS tool to remix feeds, build from webpages, and skip podcast reruns

Show HN: Axe – A 12MB binary that replaces your AI framework

Show HN: Dialtone watcher – what is my laptop doing and am I normal

Show HN: Ink – Deploy full-stack apps from AI agents via MCP or Skills

Show HN: Open-source playground to red-team AI agents with exploits published

Comments

Show HN: Free OpenAI API Access with ChatGPT Account

Show HN: GDSL – 800 line kernel: Lisp subset in 500, C subset in 1300

Show HN: Open-source playground to red-team AI agents with exploits published

Show HN: Signet – Autonomous wildfire tracking from satellite and weather data

Show HN: What if your synthesizer was powered by APL (or a dumb K clone)?

Show HN: Lockstep – A data-oriented programming language

Show HN: Ritual – An Open Source Local Monochrome Themed Habit Tracker PWA

Show HN: Nova–Self-hosted personal AI learns from corrections &fine-tunes itself

Show HN: Webassembly4J Run WebAssembly from Java

Show HN: Goal.md, a goal-specification file for autonomous coding agents

Show HN: Tmux-nvim-navigator – Seamless navigation with zero Neovim config

Show HN: Flutterby, an App for Flutter Developers

Show HN: HUMANTODO

Show HN: Han – A Korean programming language written in Rust

Show HN: Claude's 2x usage promotion (March 2026) in your timezone

Show HN: Ichinichi – One note per day, E2E encrypted, local-first

Show HN: HN Skins – Available Skins: Cafe, Courier, London, Midnight, Terminal

Show HN: GitAgent – An open standard that turns any Git repo into an AI agent

Show HN: Detach – Mobile UI for managing AI coding agents from your phone

Show HN: AgentMailr – dedicated email inboxes for AI agents

Show HN: GrobPaint: Somewhere Between MS Paint and Paint.net

Show HN: Channel Surfer – Watch YouTube like it’s cable TV

Show HN: Sway, a board game benchmark for quantum computing

Show HN: Context Gateway – Compress agent context before it hits the LLM

Show HN: Voice-tracked teleprompter using on-device ASR in the browser

Show HN: Data-anim – Animate HTML with just data attributes

Show HN: RSS tool to remix feeds, build from webpages, and skip podcast reruns

Show HN: Axe – A 12MB binary that replaces your AI framework

Show HN: Dialtone watcher – what is my laptop doing and am I normal

Show HN: Ink – Deploy full-stack apps from AI agents via MCP or Skills