So we open-sourced it. Each challenge is a live agent with real tools and a published system prompt. Whenever a challenge is over, the full winning conversation transcript and guardrail logs get documented publicly.
Building the general-purpose agent itself was probably the most fun part. Getting it to reliably use tools, stay in character, and follow instructions while still being useful is harder than it sounds. That alone reminded us how early we all are in understanding and deploying these systems at scale.
First challenge was to get an agent to call a tool it's been told to never call.
Someone got through in around 60 seconds without ever asking for the secret directly (which taught us a lot).
Next challenge is focused on data exfiltration with harder defences: https://playground.fabraix.com
hellocr7•1h ago
zachdotai•28m ago
You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.
Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...
The leaderboard should start populating once we have more submissions!