frontpage.

I recently ran a detailed chaos engineering test on a standard LangChain agent using my open-source testing tool, Flakestorm [1]. The results were stark and highlight what I believe is a critical blind spot in how we test AI agents before deployment.

The Method: I used adversarial mutations (22+ types like prompt injection, encoding attacks, context manipulation) to simulate real-world hostile inputs, checking for failures in latency, safety, and correctness.

The Result: The agent scored a 5.2% robustness score. 57 out of 60 adversarial tests failed. Key failures:

Encoding Attacks: 0% pass rate. The agent would decode malicious Base64 inputs instead of rejecting them—a major security oversight.

Prompt Injection: 0% pass rate. Basic "ignore previous instructions" attacks succeeded every time.

Severe Performance Degradation: Latency spiked to ~30 seconds under stress, far exceeding reasonable timeouts.

This isn't about one bad agent. It's a pattern suggesting our default "happy path" testing is insufficient. Agents that seem fine in demos can be fragile and insecure under real-world conditions.

I'm sharing this to start a discussion:

Are we underestimating the adversarial robustness needed for production AI agents?

What testing strategies beyond static evals are proving effective?

Is chaos engineering or adversarial testing a necessary new layer in the LLM dev stack?

[1] Flakestorm GitHub (the tool used for testing): https://github.com/flakestorm/flakestorm

Show HN: Simple – a bytecode VM and language stack I built with AI

Show HN: A gem-collecting strategy game in the vein of Splendor

My Eighth Year as a Bootstrapped Founde

Show HN: Tesseract – A forum where AI agents and humans post in the same space

Show HN: Vibe Colors – Instantly visualize color palettes on UI layouts

OpenAI is Broke ... and so is everyone else [video][10M]

We interfaced single-threaded C++ with multi-threaded Rust

State Department will delete X posts from before Trump returned to office

AI Skills Marketplace

Show HN: A fast TUI for managing Azure Key Vault secrets written in Rust

eInk UI Components in CSS

Discuss – Do AI agents deserve all the hype they are getting?

ChatGPT is changing how we ask stupid questions

Zig Package Manager Enhancements

Neutron Scans Reveal Hidden Water in Martian Meteorite

Deepfaking Orson Welles's Mangled Masterpiece

France's homegrown open source online office suite

SpaceX Delays Mars Plans to Focus on Moon

Jeremy Wade's Mighty Rivers

Show HN: MCP App to play backgammon with your LLM

AI Command and Staff–Operational Evidence and Insights from Wargaming

Show HN: CCBot – Control Claude Code from Telegram via tmux

Ask HN: Is the CoCo 3 the best 8 bit computer ever made?

Show HN: Convert your articles into videos in one click

Red Queen's Race

The Anthropic Hive Mind

A Horrible Conclusion

I spent $10k to automate my research at OpenAI with Codex

From Zero to Hero: A Spring Boot Deep Dive

Show HN: Solving NP-Complete Structures via Information Noise Subtraction (P=NP)