Show HN: Statewright – Visual state machines that make AI agents reliable

https://github.com/statewright/statewright

22•azurewraith•5h ago

Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

I'm Ben Cochran. I've spent 20+ years in the trenches across full-stack engineering, DevOps, high-performance computing and ML, with stints at NVIDIA, AMD and various other organizations, most recently as a Distinguished Engineer.

For agents to work reliably you either need massive parameter counts or massive context windows to keep the solution spaces workable. Most people are brute forcing reliability with bigger models and longer prompts.

What if I made the problem smaller instead of making the model bigger?

I took a different approach: I set smaller models, in the 13-20B parameter range, to solving real SWE-bench problems, and I constrained the tool and solution spaces using formal state machines. Each state in the machine defines which tools the model can access, how many iterations it gets and which transitions are valid. A planning state gets read-only tools. An implementation state gets edit tools (scoped to prevent mega edits) and write-friendly bash tools. The testing state gets bash, but only for testing commands. The model cannot physically skip steps or use the wrong tool at the wrong time. It is enforced via protocol, not via prompts.
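As a rough illustration of the per-state constraints described above, here is a minimal Python sketch. It is not Statewright's actual engine or schema (the engine is written in Rust); the state names, tool names, iteration budgets and the `Machine` class are all hypothetical, chosen only to show tool access being enforced in code rather than in the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    name: str
    allowed_tools: frozenset  # tools the model may call in this state
    max_iterations: int       # tool-call budget for this state
    transitions: frozenset    # names of states reachable from here


# Hypothetical workflow; names and numbers are illustrative only.
STATES = {
    s.name: s
    for s in [
        State("plan", frozenset({"read_file", "search"}), 5, frozenset({"implement"})),
        State("implement", frozenset({"read_file", "edit_file"}), 8, frozenset({"test"})),
        State("test", frozenset({"run_tests"}), 3, frozenset({"implement", "done"})),
        State("done", frozenset(), 0, frozenset()),
    ]
}


class Machine:
    def __init__(self, states, start):
        self.states = states
        self.current = states[start]
        self.iterations = 0

    def call_tool(self, tool):
        """Permit a tool call only if the state allows it and budget remains."""
        if tool not in self.current.allowed_tools:
            return False
        if self.iterations >= self.current.max_iterations:
            return False
        self.iterations += 1
        return True

    def transition(self, target):
        """Move to target only if the current state lists it as valid."""
        if target not in self.current.transitions:
            return False
        self.current = self.states[target]
        self.iterations = 0  # each state starts with a fresh budget
        return True


m = Machine(STATES, "plan")
print(m.call_tool("edit_file"))   # False: planning is read-only
print(m.transition("test"))       # False: cannot skip implementation
print(m.transition("implement"))  # True
print(m.call_tool("edit_file"))   # True
```

The point of the sketch is that the check lives outside the model: a disallowed call is rejected no matter what the prompt says.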

The results were more promising than I would have expected. Across multiple model families, irrespective of age (qwen-coder, gpt-oss, gemma4), the improvements were consistent above the 13B parameter inflection point. Below that, models can navigate the state machine but can't retain enough context to produce accurate edits. More on the research bit: https://statewright.ai/research

Surprisingly, this yielded improvements in frontier models as well. Haiku and Sonnet start to punch above their weight, and Opus solves more reliably with fewer tokens and fewer death spirals. Fine-tuning did not yield these kinds of functional improvements for me. The takeaway, it seems, is that context window utilization matters more than raw context size - a tightly scoped working context at each step outperforms a model given carte blanche over everything. Constraining LLMs, which are non-idempotent, with deterministic code is a pattern that I don't see many people talking about.

So, I built Statewright. Its core is a Rust engine that evaluates state machine definitions: states, transitions, guards and tool restrictions. The orchestration doesn't use an LLM; it just enforces the state machine. On top of that is a plugin layer that integrates with Claude Code (and soon Codex, Cursor and others) via MCP. When you activate a workflow, hooks enforce the guardrails per state automatically. The model sees 5 tools available instead of dozens, gets clear instructions for the current phase and transitions when conditions are met. Importantly, it tells the model when it's attempting something that is out of scope or incorrect, or when it needs to try something else after getting stuck.
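To make the guard and feedback behaviour concrete, here is a small Python sketch of how a deterministic check might answer the model. Everything in it is an assumption for illustration: the `Verdict` and `Transition` types, the context keys (`files_edited`, `tests_passed`) and the message wording are hypothetical and are not taken from Statewright's engine or its MCP interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    allowed: bool
    message: str  # text fed back to the model


@dataclass
class Transition:
    guard: Callable[[dict], bool]  # deterministic check over workflow context
    hint: str                      # what to tell the model if the guard fails


# Hypothetical tool scopes and guarded transitions for a bugfix-style workflow.
TOOLS = {
    "implement": {"read_file", "edit_file"},
    "test": {"run_tests"},
}
TRANSITIONS = {
    ("implement", "test"): Transition(
        lambda ctx: ctx.get("files_edited", 0) > 0,
        "Make at least one edit before moving to testing.",
    ),
    ("test", "done"): Transition(
        lambda ctx: ctx.get("tests_passed", False),
        "Tests have not passed; go back to implement and try a different fix.",
    ),
}


def check_tool(state: str, tool: str) -> Verdict:
    """Reject out-of-scope tools and tell the model what is available instead."""
    allowed = TOOLS.get(state, set())
    if tool in allowed:
        return Verdict(True, "ok")
    available = ", ".join(sorted(allowed))
    return Verdict(False, f"'{tool}' is out of scope in '{state}'. Available: {available}.")


def check_transition(state: str, target: str, ctx: dict) -> Verdict:
    """Allow a transition only if it exists and its guard holds for ctx."""
    t = TRANSITIONS.get((state, target))
    if t is None:
        return Verdict(False, f"No transition from '{state}' to '{target}'.")
    if not t.guard(ctx):
        return Verdict(False, t.hint)
    return Verdict(True, f"Entering '{target}'.")


print(check_tool("test", "edit_file").message)
print(check_transition("test", "done", {"tests_passed": False}).message)
print(check_transition("test", "done", {"tests_passed": True}).message)
```

No LLM is involved in either check; the model only receives the resulting message and has to act on it.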

You can use your agent via MCP to build a state machine for you to solve a problem in your current context. The visual editor at statewright.ai lets you tweak these workflows in a graph view... You can clearly see the failure paths, the retry loops and the approval gates. State machines aren't DAGs; they loop and retry, which is what agentic work actually needs.
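The loop-and-retry point can be sketched in a few lines of Python. The workflow below is hypothetical (state names, the `MAX_RETRIES` cap and the scripted test outcomes are illustrative assumptions, not Statewright's bugfix workflow); it only shows why a back edge from testing to implementation, which a DAG cannot express, is useful.

```python
# Hypothetical workflow as a map: (state, outcome) -> next state.
EDGES = {
    ("plan", "ok"): "implement",
    ("implement", "ok"): "test",
    ("test", "pass"): "approve",
    ("test", "fail"): "implement",  # back edge: this makes the graph cyclic
    ("approve", "ok"): "done",
}
MAX_RETRIES = 2  # illustrative cap on failed test runs before giving up


def run(test_results):
    """Walk the workflow with scripted test outcomes; return the visited path."""
    results = iter(test_results)
    state, retries, path = "plan", 0, ["plan"]
    while state not in ("done", "failed"):
        if state == "test":
            outcome = next(results)
            if outcome == "fail":
                retries += 1
                if retries > MAX_RETRIES:
                    # Retry budget exhausted: take the failure path.
                    state = "failed"
                    path.append(state)
                    continue
        else:
            outcome = "ok"
        state = EDGES[(state, outcome)]
        path.append(state)
    return path


print(run(["fail", "pass"]))
print(run(["fail", "fail", "fail"]))
```

The first run revisits `implement` once before reaching `done`; the second exhausts its retries and ends in `failed` instead of looping forever.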

Statewright is currently live with a free tier. Try it out in Claude Code by running the following:

/plugin marketplace add statewright/statewright

/plugin install statewright

/reload-plugins

Then "start the bugfix workflow" or /statewright start bugfix. You'll need to paste your API key when prompted. The latest versions of Claude may complain -- paste the API key again and say you really mean it, Claude is just being cautious here.

Feedback is welcome on the workflow editor and the plugin experience, and tell me what workflows you'd want to build first. Agents are suggestions, states are laws.

Comments

giancarlostoro•1h ago

Interesting. I built a ticketing system similar to Beads, which has yielded more predictable results with Claude and other models, and I'm currently building a custom harness. I'm able to use offline models, though my GPU RAM bandwidth is much lower. I'm also planning on doing something similar to what you've built, namely the editing tools and whatnot. I hate how long it takes for Claude to look for files; it feels wasteful. I'm still astounded that everyone else has figured out ways to speed up harnesses, but Claude Code is still slow like a slug. I don't even care if I am waiting on the LLM in terms of slowness, but running local tools slowly bothers the living crap out of me. Stop using grep, RIPGREP IS FASTER!

In any case, I'll have to check out Statewright after work ;)

azurewraith•6m ago

I feel you on how sluggish Claude Code can be, you just never know what those pulsing prompts are doing in the background...

Given Statewright plugs into Claude Code, there is a little added overhead from managing the state machine logic, but for complicated workflows, if it saves you a few debug loops, mass edit reversions or death spirals, I think the case for including it can be pretty solid.

password4321•1h ago

Does it make sense to ship an MCP code mode API? I'm surprised you're recommending MCP as-is when concerned about context usage optimization. I don't have a lot of hands-on experience either way yet so I'm curious what's best and/or most popular... I understand MCP is less effort and still affordable at VC-subsidised prices.

azurewraith•56m ago

For the integration piece that ties into Claude Code and other places where AI is used most frequently? Yes, I think it does... We're not fighting context in Opus/Sonnet as much as we are in smaller models, and we're only adding about 6 tools here, which is a smaller footprint than other MCP exposures. In my experimentation, smaller models get a more direct, tighter interface that doesn't bloat the tool space (using the core directly).

davidkpiano•1h ago

Pretty cool. Looks like stately.ai but catered towards agentic state machine workflows. Really interesting!

azurewraith•44m ago

Stately is pretty neat, I hadn't come across it yet... kind of like a state machine langflow or Node-RED.

I see constant posts on Reddit/HN about the ways that AI is amazing and at the same time is fudging it (literally). Nobody can make reliability guarantees on something that's non-deterministic and non-idempotent. Nobody's AI workflow suite of tools can claim this. Prompting gets you closer to the mark but is still non-deterministic. Breaking the problem down into chunks with valid transition criteria, so that even tiny models can step through them, gets us closer, I believe, to where we want to be semantically.

