It’s an open-source sandbox that runs an automated barrage of standard exploits against your target LLM to see if it leaks data or ignores core instructions.
How it works under the hood: - The UI is built with Streamlit, backend is FastAPI, and dependency management is handled by `uv`. - You paste your system prompt and hit run. It fires 12 baseline attack vectors (Direct leaks, XSS payloads, Context overflows, etc.) concurrently. - The core mechanic is "LLM-as-a-Judge". It uses a hardcoded `gpt-4.1-mini` with strict alignment rules to systematically evaluate the target's responses. - It supports OpenAI, Anthropic, and a solid list of open-weight models via OpenRouter (including DeepSeek V3/R1, Qwen 2.5, and Llama 3.3).
There is a hosted free version if you want to play with it immediately (I capped it at 15 requests/IP to survive the launch), but the entire tool is open-source and takes 30 seconds to spin up locally with Docker or `uv`.
Repo: https://github.com/BreakMyAgent/breakmyagent-os Live demo: https://breakmyagent.dev
Next on the roadmap: I'm building a dedicated CLI/GitHub Action so teams can drop this into their own CI/CD pipelines to block prompt regressions. I'm also developing a PoC for multi-turn agentic fuzzing and expanding the payload database for complex tool-spoofing.
I’d love to hear your feedback! What other test configurations (besides temperature and response format) do you think are essential for a tool like this? Also open to any feedback on the architecture, the judge prompt, or specific zero-day vectors you'd like to see included in the public database.