Show HN: Statewright – Visual state machines that make AI agents reliable

https://github.com/statewright/statewright

13•azurewraith•4h ago

Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

I'm Ben Cochran. I've spent 20+ years in the trenches across full-stack engineering, DevOps, high-performance computing and ML, with stints at NVIDIA, AMD and various other organizations, most recently as a Distinguished Engineer.

For agents to work reliably you either need massive parameter counts or massive context windows to keep the solution spaces workable. Most people are brute forcing reliability with bigger models and longer prompts.

What if I made the problem smaller instead of making the model bigger?

I took a different approach: I set smaller models, in the 13-20B parameter range, to work on real SWE-bench problems and constrained the tool and solution spaces using formal state machines. Each state in the machine defines which tools the model can access, how many iterations it gets and what transitions are valid. A planning state gets read-only tools. An implementation state gets edit tools (scoped to prevent mega edits) and write-friendly bash tools. The testing state gets bash, but only for testing commands. The model cannot physically skip steps or use the wrong tool at the wrong time. It is enforced via protocol, not via prompts.
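To make the idea concrete, here is a minimal Python sketch of per-state tool scoping, iteration budgets and valid transitions. It is not Statewright's actual schema or engine (which is written in Rust); the state names, tool names, budgets and the `Machine` class are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    name: str
    allowed_tools: frozenset   # tools the model may call in this state
    max_iterations: int        # tool-call budget for this state
    transitions: frozenset     # states reachable from this one


# Hypothetical workflow; names and numbers are assumptions, not Statewright's format.
WORKFLOW = {
    "plan": State("plan", frozenset({"read_file", "search"}), 5,
                  frozenset({"implement"})),
    "implement": State("implement", frozenset({"read_file", "edit_file", "bash_write"}), 8,
                       frozenset({"test"})),
    "test": State("test", frozenset({"bash_test"}), 3,
                  frozenset({"implement", "done"})),
    "done": State("done", frozenset(), 0, frozenset()),
}


class Machine:
    def __init__(self, workflow, start):
        self.workflow = workflow
        self.current = workflow[start]
        self.iterations = 0

    def call_tool(self, tool):
        """Return (allowed, message). The check runs in code, not in the prompt."""
        if tool not in self.current.allowed_tools:
            return False, f"'{tool}' is not available in state '{self.current.name}'"
        if self.iterations >= self.current.max_iterations:
            return False, f"iteration budget exhausted in state '{self.current.name}'"
        self.iterations += 1
        return True, "ok"

    def transition(self, target):
        """Move to `target` only if the current state lists it as a valid transition."""
        if target not in self.current.transitions:
            return False, f"cannot go from '{self.current.name}' to '{target}'"
        self.current = self.workflow[target]
        self.iterations = 0  # each state starts with a fresh budget
        return True, "ok"
```

In this sketch a model in the `plan` state that asks for `edit_file` gets a rejection message back, and it cannot jump straight to `test` because `plan` only lists `implement` as a successor.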

The results were more promising than I expected. The improvements held across multiple model families irrespective of age (qwen-coder, gpt-oss, gemma4) and were consistent above the 13B parameter inflection point. Below that, models can navigate the state machine but can't retain enough context to produce accurate edits. More on the research bit: https://statewright.ai/research

Surprisingly, this yielded improvements in frontier models as well. Haiku and Sonnet start to punch above their weight, and Opus solves more reliably with fewer tokens and fewer death spirals. Fine tuning did not yield these kinds of functional improvements for me. The takeaway, it seems, is that context window utilization matters more than raw context size - a tightly scoped working context at each step outperforms a model given carte blanche over everything. Constraining LLMs, which are non-idempotent, with deterministic code is a pattern that nobody is currently talking about.

So, I built Statewright. Its core is a Rust engine that evaluates state machine definitions: states, transitions, guards and tool restrictions. Its orchestration doesn't use an LLM, just enforces the state machine. On top of that is a plugin layer that integrates with Claude Code (and soon Codex, Cursor and others) via MCP. When you activate a workflow, hooks enforce the guardrails per state automatically. The model sees 5 tools available instead of dozens, gets clear instructions for the current phase and transitions when conditions are met. Importantly it tells the model when it's attempting to do something that isn't in scope, incorrect or when it needs to try something else after getting stuck.

You can use your agent via MCP to build a state machine for you to solve a problem in your current context. The visual editor at statewright.ai lets you tweak these workflows in a graph view... You can clearly see the failure paths, the retry loops and the approval gates. State machines aren't DAGs; they loop and retry, which is what agentic work actually needs.
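The difference between a looping workflow and a DAG can be checked mechanically. The Python sketch below runs a depth-first search over a transition graph and reports whether it contains a cycle; the two example graphs are hypothetical and only meant to show a retry loop next to a straight pipeline.

```python
def has_cycle(graph):
    """Return True if the directed graph (dict of node -> successors) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:
                return True        # back edge: we returned to a node on the path
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Hypothetical bugfix workflow: failed tests or a rejected approval loop back.
BUGFIX = {
    "plan": ["implement"],
    "implement": ["test"],
    "test": ["implement", "approve"],
    "approve": ["implement", "done"],
    "done": [],
}

# The same steps as a one-way pipeline, which is a DAG.
PIPELINE = {
    "plan": ["implement"],
    "implement": ["test"],
    "test": ["done"],
    "done": [],
}
```

Here `has_cycle(BUGFIX)` is true because of the `test` to `implement` retry edge, while `has_cycle(PIPELINE)` is false.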

Statewright is currently live with a free tier. Try it out in Claude Code by running the following:

/plugin marketplace add statewright/statewright

/plugin install statewright

/reload-plugins

Then "start the bugfix workflow" or /statewright start bugfix. You'll need to paste your API key when prompted. The latest versions of Claude may complain -- paste the API key again and say you really mean it, Claude is just being cautious here.

Feedback is welcome on the workflow editor and the plugin experience, and tell me what workflows you'd want to build first. Agents are suggestions, states are laws.

Comments

giancarlostoro•30m ago

Interesting. I built a ticketing system similar to Beads, which has yielded more predictable results with Claude and other models, and I'm currently building a custom harness. I'm able to use offline models, though my GPU RAM bandwidth is much lower. I'm also planning on doing something similar to what you've built, namely the editing tools and whatnot. I hate how long it takes for Claude to look for files; it feels wasteful. I'm still astounded that everyone else has figured out ways to speed up harnesses, but Claude Code is still slow like a slug. I don't even care if I am waiting on the LLM in terms of slowness, but running local tools slowly bothers the living crap out of me. Stop using grep, RIPGREP IS FASTER!

In any case, I'll have to check out Statewright after work ;)

password4321•17m ago

Does it make sense to ship an MCP code mode API? I'm surprised you're recommending MCP as-is when concerned about context usage optimization. I don't have a lot of hands-on experience either way yet so I'm curious what's best and/or most popular... I understand MCP is less effort and still affordable at VC-subsidised prices.

azurewraith•11m ago

For the integration piece that ties into Claude Code and other places where AI is used most frequently? Yes, I think it does... We're not fighting context in Opus/Haiku as much as we are in smaller models, and we're only adding about 6 tools here, which is a smaller footprint than other MCP exposures. Smaller models have a more direct/tight interface that doesn't bloat the tool space in my experimentation (using the core directly).

davidkpiano•16m ago

Pretty cool. Looks like stately.ai but catered towards agentic state machine workflows. Really interesting!
