Show HN: Statewright – Visual state machines that make AI agents reliable

https://github.com/statewright/statewright
13•azurewraith•4h ago
Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

I'm Ben Cochran. I spent 20+ years in the trenches across full-stack engineering, DevOps, high-performance computing and ML, with stints at NVIDIA, AMD and various other organizations, most recently as a Distinguished Engineer.

For agents to work reliably you either need massive parameter counts or massive context windows to keep the solution spaces workable. Most people are brute forcing reliability with bigger models and longer prompts.

What if I made the problem smaller instead of making the model bigger?

I took a different approach: I set smaller models, in the 13-20B parameter range, to solving real SWE-bench problems, and constrained the tool and solution spaces using formal state machines. Each state in the machine defines which tools the model can access, how many iterations it gets and which transitions are valid. A planning state gets read-only tools. An implementation state gets edit tools (scoped to prevent mega edits) and write-friendly bash tools. The testing state gets bash, but only for testing commands. The model cannot physically skip steps or use the wrong tool at the wrong time. This is enforced via protocol, not via prompts.
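To make the per-state constraints concrete, here is a minimal Python sketch of the idea, not Statewright's actual code or schema. The class names, tool names and iteration budgets are all hypothetical; the point is only that each state carries an allow-list of tools, an iteration budget and a set of valid next states, and that plain code rejects anything else.

```python
from dataclasses import dataclass


class WorkflowError(Exception):
    """Raised when the model attempts something the current state forbids."""


@dataclass(frozen=True)
class State:
    name: str
    allowed_tools: frozenset   # tools the model may call in this state
    max_iterations: int        # tool-call budget for this state
    next_states: frozenset     # valid transition targets


class Workflow:
    def __init__(self, states, initial):
        self.states = {s.name: s for s in states}
        self.current = self.states[initial]
        self.iterations = 0

    def use_tool(self, tool):
        """Count a tool call, or refuse it if the state does not allow it."""
        state = self.current
        if tool not in state.allowed_tools:
            raise WorkflowError(f"{tool!r} is not available in state {state.name!r}")
        if self.iterations >= state.max_iterations:
            raise WorkflowError(f"iteration budget exhausted in state {state.name!r}")
        self.iterations += 1

    def transition(self, target):
        """Move to another state only along a declared transition."""
        if target not in self.current.next_states:
            raise WorkflowError(f"no transition {self.current.name!r} -> {target!r}")
        self.current = self.states[target]
        self.iterations = 0  # each state gets a fresh budget


def bugfix_workflow():
    # Hypothetical tool names and budgets, chosen only for illustration.
    return Workflow(
        [
            State("planning", frozenset({"read_file", "search"}), 5,
                  frozenset({"implementation"})),
            State("implementation", frozenset({"read_file", "edit_file", "bash_write"}), 8,
                  frozenset({"testing"})),
            State("testing", frozenset({"run_tests"}), 3,
                  frozenset({"implementation", "done"})),
            State("done", frozenset(), 0, frozenset()),
        ],
        initial="planning",
    )
```

In this sketch, calling `use_tool("edit_file")` while in `planning` raises `WorkflowError`, and so does `transition("testing")`, because planning can only hand off to implementation.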

The results were more promising than I expected. Across multiple model families, irrespective of age (qwen-coder, gpt-oss, gemma4), the improvements were consistent above the 13B parameter inflection point. Below that, models can navigate the state machine but can't retain enough context to produce accurate edits. More on the research: https://statewright.ai/research

Surprisingly, this yielded improvements in frontier models as well. Haiku and Sonnet start to punch above their weight, and Opus solves more reliably with fewer tokens and fewer death spirals. Fine-tuning did not yield these kinds of functional improvements for me. The takeaway, it seems, is that context window utilization matters more than raw context size: a tightly scoped working context at each step outperforms a model given carte blanche over everything. Constraining LLMs, which are non-deterministic, with deterministic code is a pattern that nobody is currently talking about.

So, I built Statewright. Its core is a Rust engine that evaluates state machine definitions: states, transitions, guards and tool restrictions. The orchestration doesn't use an LLM; it just enforces the state machine. On top of that is a plugin layer that integrates with Claude Code (and soon Codex, Cursor and others) via MCP. When you activate a workflow, hooks enforce the guardrails per state automatically. The model sees 5 tools available instead of dozens, gets clear instructions for the current phase and transitions when conditions are met. Importantly, it tells the model when it's attempting something that is out of scope or incorrect, or when it needs to try something else after getting stuck.
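As a rough illustration of guards plus feedback, the Python sketch below shows a deterministic check that returns a message for the model instead of silently failing. It is an assumption-laden toy: it is not the Rust engine, it does not follow Claude Code's hook schema, and `Decision`, `check_tool_call`, `check_transition`, the tool names and the guard conditions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    message: str  # text that would be sent back to the model


# Hypothetical per-state tool allow-lists.
TOOLS_BY_STATE = {
    "planning": {"read_file", "search"},
    "implementation": {"read_file", "edit_file", "bash_write"},
    "testing": {"run_tests"},
}

# Hypothetical guards: a transition is valid only if its predicate holds.
GUARDS = {
    ("planning", "implementation"): lambda ctx: bool(ctx.get("plan")),
    ("implementation", "testing"): lambda ctx: ctx.get("files_changed", 0) > 0,
}


def check_tool_call(state, tool):
    """Allow the call, or explain what is in scope for the current state."""
    allowed = TOOLS_BY_STATE[state]
    if tool in allowed:
        return Decision(True, "ok")
    available = ", ".join(sorted(allowed))
    return Decision(
        False,
        f"'{tool}' is out of scope in state '{state}'. Available tools: {available}.",
    )


def check_transition(src, dst, ctx):
    """Allow the transition only if it is declared and its guard passes."""
    guard = GUARDS.get((src, dst))
    if guard is None:
        return Decision(False, f"There is no transition from '{src}' to '{dst}'.")
    if not guard(ctx):
        return Decision(False, f"The guard for '{src}' -> '{dst}' is not satisfied yet.")
    return Decision(True, "ok")
```

No LLM is involved in either function: the decision comes from a dictionary lookup and a predicate, and the `message` field is what lets the model correct course.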

You can use your agent via MCP to build a state machine that solves a problem in your current context. The visual editor at statewright.ai lets you tweak these workflows in a graph view, where you can clearly see the failure paths, the retry loops and the approval gates. State machines aren't DAGs; they loop and retry, which is what agentic work actually needs.
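The difference from a DAG can be sketched with a small transition table in Python. This is a hypothetical example, not a Statewright workflow definition: the state names, event names and `MAX_RETRIES` cap are made up. The back-edge from testing to implementation is what makes the graph cyclic, and the retry cap routes to a terminal `escalated` state so the loop cannot run forever.

```python
# (state, event) -> next state. All names here are illustrative assumptions.
TRANSITIONS = {
    ("planning", "done"): "implementation",
    ("implementation", "done"): "testing",
    ("testing", "pass"): "approval",
    ("testing", "fail"): "implementation",  # retry loop: a back-edge, so not a DAG
    ("approval", "accepted"): "finished",
    ("approval", "rejected"): "planning",   # approval gate can send work back
}

MAX_RETRIES = 2  # hypothetical cap on test failures before escalating
TERMINAL = {"finished", "escalated"}


def run(events, start="planning"):
    """Replay a sequence of events and return the list of states visited."""
    state = start
    path = [state]
    retries = 0
    for event in events:
        if state in TERMINAL:
            break
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not valid in state {state!r}")
        if key == ("testing", "fail"):
            retries += 1
            if retries > MAX_RETRIES:
                # Too many failed attempts: stop looping and hand off.
                state = "escalated"
                path.append(state)
                break
        state = TRANSITIONS[key]
        path.append(state)
    return path
```

For example, `run(["done", "done", "fail", "done", "pass", "accepted"])` visits implementation and testing twice before reaching `finished`, which a DAG could not express without unrolling the loop.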

Statewright is currently live with a free tier. Try it out in Claude Code by running the following:

/plugin marketplace add statewright/statewright

/plugin install statewright

/reload-plugins

Then say "start the bugfix workflow" or run /statewright start bugfix. You'll need to paste your API key when prompted. The latest versions of Claude may complain; if so, paste the API key again and say you really mean it. Claude is just being cautious here.

Feedback is welcome on the workflow editor and the plugin experience, and tell me what workflows you'd want to build first. Agents are suggestions, states are laws.

Comments

giancarlostoro•30m ago
Interesting. I built a ticketing system similar to Beads, which has yielded more predictable results with Claude and other models, and I'm currently building a custom harness. I'm able to use offline models, though my GPU RAM bandwidth is much lower. I'm also planning on doing something similar to what you've built, namely the editing tools and whatnot. I hate how long it takes for Claude to look for files; it feels wasteful. I'm still astounded that everyone else has figured out ways to speed up harnesses, but Claude Code is still slow like a slug. I don't even care if I am waiting on the LLM in terms of slowness, but running local tools slowly bothers the living crap out of me. Stop using grep, RIPGREP IS FASTER!

In any case, I'll have to check out Statewright after work ;)

password4321•17m ago
Does it make sense to ship an MCP code mode API? I'm surprised you're recommending MCP as-is when concerned about context usage optimization. I don't have a lot of hands-on experience either way yet so I'm curious what's best and/or most popular... I understand MCP is less effort and still affordable at VC-subsidised prices.
azurewraith•11m ago
For the integration piece that ties into Claude Code and other places where AI is used most frequently? Yes, I think it does. We're not fighting context in Opus/Haiku as much as we are in smaller models, and we're only adding about 6 tools here, which is a smaller footprint than other MCP exposures. In my experimentation, smaller models (using the core directly) have a more direct, tighter interface that doesn't bloat the tool space.
davidkpiano•16m ago
Pretty cool. Looks like stately.ai but catered towards agentic state machine workflows. Really interesting!

Googlebook

https://googlebook.google/
173•tambourine_man•1h ago•207 comments

CERT is releasing six CVEs for serious security vulnerabilities in dnsmasq

https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2026q2/018471.html
42•chizhik-pyzhik•51m ago•4 comments

Why senior developers fail to communicate their expertise

https://www.nair.sh/guides-and-opinions/communicating-your-expertise/why-senior-developers-fail-t...
119•nilirl•3h ago•49 comments

Rendering the Sky, Sunsets, and Planets

https://blog.maximeheckel.com/posts/on-rendering-the-sky-sunsets-and-planets/
313•ibobev•5h ago•26 comments

The Future of Obsidian Plugins

https://obsidian.md/blog/future-of-plugins/
139•xz18r•3h ago•62 comments

Dead.Letter (CVE-2026-45185) – How XBOW found an unauthenticated RCE on Exim

https://xbow.com/blog/dead-letter-cve-2026-45185-xbow-found-rce-exim
21•fedek_•1h ago•8 comments

Reimagining the mouse pointer for the AI era

https://deepmind.google/blog/ai-pointer/
34•devhouse•1h ago•24 comments

Instructure pays ransom to Canvas hackers

https://www.insidehighered.com/news/tech-innovation/administrative-tech/2026/05/11/instructure-pa...
139•Cider9986•16h ago•110 comments

Bambu Lab is abusing the open source social contract

https://www.jeffgeerling.com/blog/2026/bambu-lab-abusing-open-source-social-contract/
763•rubenbe•4h ago•267 comments

When life gives you lemons, write better error messages

https://wix-ux.com/when-life-gives-you-lemons-write-better-error-messages-46c5223e1a2f
46•luispa•3d ago•11 comments

Learning Software Architecture

https://matklad.github.io/2026/05/12/software-architecture.html
442•surprisetalk•9h ago•82 comments

Show HN: Agentic interface for mainframes and COBOL

https://www.hypercubic.ai/hopper
25•sai18•1h ago•6 comments

Screenshots of Old Desktop OSes

http://www.typewritten.org/Media/
575•adunk•13h ago•294 comments

Launch HN: Voker (YC S24) – Analytics for AI Agents

https://voker.ai
28•ttpost•3h ago•13 comments

The Moth Story Map

https://themoth.org/dispatches/story-map
7•jxmorris12•3d ago•0 comments

Postmortem: TanStack NPM supply-chain compromise

https://tanstack.com/blog/npm-supply-chain-compromise-postmortem
1028•varunsharma07•21h ago•433 comments

Canada’s Bill C-22 Is a Repackaged Version of Last Year’s Surveillance Nightmare

https://www.eff.org/deeplinks/2026/05/canadas-bill-c-22-repackaged-version-last-years-surveillanc...
59•Brajeshwar•1h ago•20 comments

Show HN: Statewright – Visual state machines that make AI agents reliable

https://github.com/statewright/statewright
13•azurewraith•4h ago•4 comments

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

https://github.com/cactus-compute/needle
4•HenryNdubuaku•1h ago•0 comments

Text Blaze (YC W21) Is Hiring for a No-AI Summer Internship

https://www.ycombinator.com/companies/text-blaze/jobs/P4CCN62-the-blaze-no-ai-summer-internship
1•scottfr•7h ago

The Real Story of Troy

https://storica.club/blog/troy-was-real/
25•cemsakarya•2d ago•13 comments

Profiling.sampling – Statistical Profiler

https://docs.python.org/3.15/library/profiling.sampling.html#module-profiling.sampling
74•djoldman•2d ago•21 comments

The Surprisingly Long Life of the Vacuum Tube

https://www.construction-physics.com/p/the-surprisingly-long-life-of-the
45•surprisetalk•1d ago•26 comments

eBay Rejects GameStop's $56B Takeover as Not Credible

https://www.bloomberg.com/news/articles/2026-05-12/ebay-rejects-gamestop-s-56-billion-takeover-as...
178•voisin•3h ago•160 comments

They Live (1988) inspired Adblocker

https://github.com/davmlaw/they_live_adblocker
500•tokenburner•18h ago•159 comments

If AI writes your code, why use Python?

https://medium.com/@NMitchem/if-ai-writes-your-code-why-use-python-bf8c4ba1a055
809•indigodaddy•22h ago•846 comments

Testing UPS Output Waveforms

https://www.lttlabs.com/articles/2026/05/12/ups-exploration
18•LabsLucas•2h ago•7 comments

Amazon employees are "tokenmaxxing" due to pressure to use AI tools

https://arstechnica.com/ai/2026/05/amazon-employees-are-tokenmaxxing-due-to-pressure-to-use-ai-to...
169•Bender•2h ago•149 comments

Show HN: Gigacatalyst – Extend your SaaS with an embedded AI builder

20•namanyayg•2h ago•7 comments

EU to crack down on TikTok, Instagram's 'addictive design' targeting kids

https://www.cnbc.com/2026/05/12/tiktok-instagram-social-media-addictive-eu-crack-down.html
427•thm•8h ago•379 comments