Show HN: Open-source playground to red-team AI agents with exploits published

17•zachdotai•3h ago

We build runtime security for AI agents. The playground started as an internal tool that we used to test our own guardrails. But we kept finding the same types of vulnerabilities because we think about attacks a certain way. At some point you need people who don't think like you.

So we open-sourced it. Each challenge is a live agent with real tools and a published system prompt. Whenever a challenge is over, the full winning conversation transcript and guardrail logs get documented publicly.

Building the general-purpose agent itself was probably the most fun part. Getting it to reliably use tools, stay in character, and follow instructions while still being useful is harder than it sounds. That alone reminded us how early we all are in understanding and deploying these systems at scale.

First challenge was to get an agent to call a tool it's been told to never call.

Someone got through in around 60 seconds without ever asking for the secret directly (which taught us a lot).

Next challenge is focused on data exfiltration with harder defences: https://playground.fabraix.com

Comments

hellocr7•1h ago

I tried manipulating it with base64 encoding and translation into other languages, which hasn't worked so far, but it seems that LLM-as-a-judge is a very fragile defence for this. Would be cool to add a leaderboard though.

zachdotai•32m ago

Thanks for trying it out! Base64 and language switching are solid approaches but they don't tend to work anymore with the latest models in my experience.

You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.
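A minimal sketch of one way to limit that failure mode, assuming a deterministic rule sits in front of the judge so that no amount of persuasive context can unlock a forbidden tool. This is not the playground's actual guardrail code; `FORBIDDEN_TOOLS`, `ToolCall`, `JudgeVerdict`, `authorize`, and the tool names are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical tool name: the thread does not say which tool was forbidden.
FORBIDDEN_TOOLS = frozenset({"reveal_secret"})


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass(frozen=True)
class JudgeVerdict:
    """Stand-in for whatever an LLM judge returns (assumed shape)."""
    approved: bool
    reasoning: str


def authorize(call: ToolCall, verdict: JudgeVerdict) -> bool:
    """Allow a call only if it passes the hard rule AND the judge approves."""
    # The hard rule looks at the tool name alone, so conversation
    # context (fabricated or not) cannot influence it.
    if call.name in FORBIDDEN_TOOLS:
        return False
    # Everything else still goes through the judge's decision.
    return verdict.approved


# A judge that has been talked into approving, as in the first challenge.
persuaded = JudgeVerdict(approved=True, reasoning="authorised experiment context")
print(authorize(ToolCall("reveal_secret", {}), persuaded))               # False
print(authorize(ToolCall("search_docs", {"q": "pricing"}), persuaded))   # True
```

The design choice here is that the judge can only narrow what is allowed, never widen it. That trades flexibility for predictability, and it only helps for rules that can be stated without reading the conversation.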

Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...

The leaderboard should start populating once we have more submissions!
