
Show HN: Statewright – Visual state machines that make AI agents reliable

https://github.com/statewright/statewright

3•azurewraith•1h ago

Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

I'm Ben Cochran. I've spent 20+ years in the trenches across full-stack engineering, DevOps, high-performance computing and ML, with stints at NVIDIA, AMD and various other organizations, most recently as a Distinguished Engineer.

For agents to work reliably you either need massive parameter counts or massive context windows to keep the solution spaces workable. Most people are brute forcing reliability with bigger models and longer prompts.

What if I made the problem smaller instead of making the model bigger?

I took a different approach: I set smaller models, in the 13-20B parameter range, to solving real SWE-bench problems. I constrained the tool and solution spaces using formal state machines. Each state in the machine defines which tools the model can access, how many iterations it gets and which transitions are valid. A planning state gets read-only tools. An implementation state gets edit tools (scoped to prevent mega edits) and write-friendly bash tools. The testing state gets bash, but only for testing commands. The model cannot physically skip steps or use the wrong tool at the wrong time. This is enforced via protocol, not via prompts.
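As a rough illustration of the per-state constraints described above (not Statewright's actual code or definition format), here is a minimal Python sketch. The state names, tool names, iteration budgets and the `Machine` class are all hypothetical, chosen only to mirror the plan / implement / test split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    name: str
    allowed_tools: frozenset   # tools the model may call in this state
    max_iterations: int        # tool-call budget for this state
    transitions: frozenset     # states reachable from here


# Hypothetical workflow; names and budgets are illustrative assumptions.
WORKFLOW = {
    "plan": State("plan", frozenset({"read_file", "grep"}), 5,
                  frozenset({"implement"})),
    "implement": State("implement",
                       frozenset({"read_file", "edit_file", "bash_write"}), 10,
                       frozenset({"test"})),
    "test": State("test", frozenset({"bash_test"}), 5,
                  frozenset({"implement", "done"})),
    "done": State("done", frozenset(), 0, frozenset()),
}


class Machine:
    """Deterministic gatekeeper: no LLM involved in these checks."""

    def __init__(self, workflow, start):
        self.workflow = workflow
        self.current = workflow[start]
        self.iterations = 0

    def call_tool(self, tool):
        """Return (allowed, message) for a tool call in the current state."""
        if tool not in self.current.allowed_tools:
            return False, f"'{tool}' is not available in state '{self.current.name}'"
        if self.iterations >= self.current.max_iterations:
            return False, f"iteration budget exhausted in state '{self.current.name}'"
        self.iterations += 1
        return True, "ok"

    def transition(self, target):
        """Move to `target` only if the current state lists it as valid."""
        if target not in self.current.transitions:
            return False, f"no transition from '{self.current.name}' to '{target}'"
        self.current = self.workflow[target]
        self.iterations = 0  # each state gets a fresh budget
        return True, "ok"
```

In this sketch, a model sitting in `plan` that asks for `edit_file` gets a refusal with a reason, and it cannot jump straight to `test` because `plan` only lists `implement` as a valid next state.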

The results were more promising than I expected. The improvements were consistent across multiple model families irrespective of age (qwen-coder, gpt-oss, gemma4) above the 13B parameter inflection point. Below that, models can navigate the state machine but can't retain enough context to produce accurate edits. More on the research bit: https://statewright.ai/research

Surprisingly, this yielded improvements in frontier models as well. Haiku and Sonnet start to punch above their weight, and Opus solves more reliably with fewer tokens and fewer death spirals. Fine-tuning did not yield these kinds of functional improvements for me. The takeaway, it seems, is that context window utilization matters more than raw context size: a tightly scoped working context at each step outperforms a model given carte blanche over everything. Constraining LLMs, which are non-deterministic, with deterministic code is a pattern that nobody is currently talking about.

So, I built Statewright. Its core is a Rust engine that evaluates state machine definitions: states, transitions, guards and tool restrictions. The orchestration doesn't use an LLM; it just enforces the state machine. On top of that is a plugin layer that integrates with Claude Code (and soon Codex, Cursor and others) via MCP. When you activate a workflow, hooks enforce the guardrails per state automatically. The model sees 5 tools available instead of dozens, gets clear instructions for the current phase, and transitions when conditions are met. Importantly, it tells the model when it's attempting something that is out of scope or incorrect, or when it needs to try something else after getting stuck.
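To make "guards" concrete, here is a small Python sketch of guarded transitions that return feedback the model can act on. It is an assumption-laden illustration, not the Rust engine: the `Transition` tuple, the context keys (`files_edited`, `tests_passed`) and the hint strings are all hypothetical.

```python
from typing import Callable, NamedTuple


class Transition(NamedTuple):
    source: str
    target: str
    guard: Callable[[dict], bool]  # deterministic check over workflow context
    hint: str                      # feedback returned when the guard fails


# Hypothetical guards; the context keys are assumptions for this sketch.
TRANSITIONS = [
    Transition("implement", "test",
               lambda ctx: ctx.get("files_edited", 0) > 0,
               "make at least one edit before testing"),
    Transition("test", "done",
               lambda ctx: ctx.get("tests_passed") is True,
               "tests must pass before finishing"),
    Transition("test", "implement",
               lambda ctx: ctx.get("tests_passed") is False,
               "go back to implementation only after a failing test run"),
]


def try_transition(current, target, ctx):
    """Return (new_state, message); the state changes only if a guard passes."""
    for t in TRANSITIONS:
        if t.source == current and t.target == target:
            if t.guard(ctx):
                return target, "ok"
            return current, f"blocked: {t.hint}"
    return current, f"blocked: no transition from '{current}' to '{target}'"
```

The point of the sketch is that the refusal is plain code with a reason attached, so the model is told why a move was rejected instead of being left to guess.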

You can ask your agent, via MCP, to build a state machine that solves a problem in your current context. The visual editor at statewright.ai lets you tweak these workflows in a graph view, where you can clearly see the failure paths, the retry loops and the approval gates. State machines aren't DAGs; they loop and retry, which is what agentic work actually needs.
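A tiny Python sketch of why the loop matters: the path below revisits `implement` after a failed test, which a DAG could not express, and falls through to a failure path once a retry budget runs out. The state names, the `escalate` state and the `max_retries` parameter are hypothetical.

```python
def run(test_results, max_retries=3):
    """Walk implement -> test, looping back on failure.

    `test_results` is a sequence of booleans, one per test run.
    Ends in 'done' on the first pass, or in 'escalate' (a hypothetical
    failure/approval state) when retries or results run out.
    """
    path = ["implement"]
    retries = 0
    for passed in test_results:
        path.append("test")
        if passed:
            path.append("done")
            return path
        if retries >= max_retries:
            break
        retries += 1
        path.append("implement")  # back-edge: this is what makes it a cycle
    path.append("escalate")
    return path
```

For example, `run([False, True])` visits `implement` twice before reaching `done`, while `run([False, False], max_retries=1)` ends in `escalate`.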

Statewright is currently live with a free tier. Try it out in Claude Code by running the following:

/plugin marketplace add statewright/statewright

/plugin install statewright

/reload-plugins

Then say "start the bugfix workflow" or run /statewright start bugfix. You'll need to paste your API key when prompted. The latest versions of Claude may complain -- paste the API key again and say you really mean it; Claude is just being cautious here.

Feedback is welcome on the workflow editor and the plugin experience, and tell me which workflows you'd want to build first. Agents are suggestions, states are laws.
