Show HN: Open-source playground to red-team AI agents with exploits published

https://github.com/fabraix/playground
17•zachdotai•3h ago
We build runtime security for AI agents. The playground started as an internal tool that we used to test our own guardrails. But we kept finding the same types of vulnerabilities because we think about attacks a certain way. At some point you need people who don't think like you.

So we open-sourced it. Each challenge is a live agent with real tools and a published system prompt. Whenever a challenge is over, the full winning conversation transcript and guardrail logs get documented publicly.

Building the general-purpose agent itself was probably the most fun part. Getting it to reliably use tools, stay in character, and follow instructions while still being useful is harder than it sounds. That alone reminded us how early we all are in understanding and deploying these systems at scale.

First challenge was to get an agent to call a tool it's been told to never call.

Someone got through in around 60 seconds without ever asking for the secret directly (which taught us a lot).

Next challenge is focused on data exfiltration with harder defences: https://playground.fabraix.com

Comments

hellocr7•1h ago
I tried to manipulate it using base64 encoding and translation into other languages, which hasn't worked so far, but it seems that LLM-as-a-judge is a very fragile defence for this. Would be cool to add a leaderboard though.
zachdotai•32m ago
Thanks for trying it out! Base64 and language switching are solid approaches but they don't tend to work anymore with the latest models in my experience.
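For anyone wondering how a simple text-level check can also be hardened against the base64 trick, here is a rough sketch. It assumes a hypothetical phrase-based filter; `expand_base64`, `violates_policy`, and the banned phrase are made up for illustration and are not the playground's guardrail. The idea is to decode anything that looks like base64 and run the check on the decoded text as well.

```python
import base64
import binascii
import re

# Runs of 16+ base64 alphabet characters, optionally padded (heuristic, assumed threshold).
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def expand_base64(text: str) -> str:
    """Return the text with any decodable base64 runs appended in plain form."""
    decoded_parts = []
    for match in B64_TOKEN.finditer(text):
        token = match.group(0)
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid base64, or not readable text: ignore this run.
            continue
        decoded_parts.append(decoded)
    if not decoded_parts:
        return text
    return text + "\n" + "\n".join(decoded_parts)


def violates_policy(text: str, banned_phrases: list[str]) -> bool:
    """Hypothetical filter: check banned phrases against raw and decoded text."""
    expanded = expand_base64(text).lower()
    return any(phrase.lower() in expanded for phrase in banned_phrases)
```

This only covers one encoding and one kind of filter. Translation-based attempts would need a different normalisation step.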

You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.
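One way to picture the failure mode is to contrast the judge with a hard rule that never sees the conversation. The sketch below is an assumption for illustration only, not how the playground's guardrail is built. `ToolCall`, `gate`, and the tool name `reveal_secret` are hypothetical. A denylisted tool is refused before the judge's verdict is consulted, so a persuasive "authorised experiment" story cannot flip the outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


class ToolCallBlocked(Exception):
    """Raised when a tool call is refused by the gate."""


# Hypothetical tool name, used only for this example.
DENYLIST = frozenset({"reveal_secret"})


def gate(call: ToolCall, judge_approved: bool) -> ToolCall:
    """Apply the hard denylist first, then fall back to the judge's verdict."""
    # The denylist check ignores conversation context and the judge entirely.
    if call.name in DENYLIST:
        raise ToolCallBlocked(f"{call.name} is denylisted")
    # For everything else, the (fallible) judge still has a say.
    if not judge_approved:
        raise ToolCallBlocked(f"{call.name} rejected by judge")
    return call
```

The trade-off is flexibility: a static denylist cannot express "allowed only in some situations", which is presumably why a judge gets used in the first place.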

Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...

The leaderboard should start populating once we have more submissions!
