frontpage.

Hi everyone,

I’ve been working on an open-source tool called Flakestorm to test the reliability of AI agents before they hit production.

Most agent testing today focuses on eval scores or happy-path prompts. In practice, agents tend to fail in more mundane ways: typos, tone shifts, long context, malformed input, or simple prompt injections — especially when running on smaller or local models. Flakestorm applies chaos-engineering ideas to agents. Instead of testing one prompt, it takes a “golden prompt”, generates adversarial mutations (semantic variations, noise, injections, encoding edge cases), runs them against your agent, and produces a robustness score plus a detailed HTML report showing what broke.

Key points: Local-first (uses Ollama for mutation generation)

Tested with Qwen / Gemma / other small models Works against HTTP agents, LangChain chains, or Python callables No cloud or API keys required This started as a way to debug my own agents after seeing them behave unpredictably under real user input. I’m still early and trying to understand how useful this is outside my own workflow.

I’d really appreciate feedback on: Whether this overlaps with how you test agents today Failure modes you’ve seen that aren’t covered Whether “chaos testing for agents” is a useful framing, or if this should be thought of differently Repo: https://github.com/flakestorm/flakestorm Docs are admittedly long.

Thanks for taking a look.

A Night Without the Nerds – Claude Opus 4.6, Field-Tested

Could ionospheric disturbances influence earthquakes?

SpaceX's next astronaut launch for NASA is officially on for Feb. 11 as FAA clea

Show HN: One-click AI employee with its own cloud desktop

Show HN: Poddley – Search podcasts by who's speaking

Same Surface, Different Weight

The Rise of Spec Driven Development

The first good Raspberry Pi Laptop

Seas to Rise Around the World – But Not in Greenland

Will Future Generations Think We're Gross?

State Department will delete Xitter posts from before Trump returned to office

Show HN: Verifiable server roundtrip demo for a decision interruption system

Impl Rust – Avro IDL Tool in Rust via Antlr

Stories from 25 Years of Software Development

minikeyvalue

Neomacs: GPU-accelerated Emacs with inline video, WebKit, and terminal via wgpu

Show HN: Moli P2P – An ephemeral, serverless image gallery (Rust and WebRTC)

How I grow my X presence?

What's the cost of the most expensive Super Bowl ad slot?

What if you just did a startup instead?

Hacking up your own shell completion (2020)

Show HN: Gorse 0.5 – Open-source recommender system with visual workflow editor

GLM-OCR: Accurate × Fast × Comprehensive

Local Agent Bench: Test 11 small LLMs on tool-calling judgment, on CPU, no GPU

Show HN: AboutMyProject – A public log for developer proof-of-work

Expertise, AI and Work of Future [video]

So Long to Cheap Books You Could Fit in Your Pocket

PID Controller

SpaceX Rocket Generates 100GW of Power, or 20% of US Electricity

Kubernetes MCP Server

Show HN: Flakestorm – Chaos engineering for AI agents (local-first, open source)