Runtime validation is still fucked in AI coding agents

1•sebringj•1h ago

AI agents (Cursor, Claude computer-use, Copilot agent mode, etc.) have gotten stupidly good at spitting out code. Prompt → boom, clean code. The marketing says "it just works."

It fucking doesn't.

You run it in a real app and immediately hit the same bullshit wall every time:

- Hallucinated logic only reveals itself under real data or edge cases
- UI updates magically forget to sync across devices (mobile → web = sad trombone)
- API calls quietly return 401s or other crap that gets swallowed in some lazy try-catch
- Vision-based agents crawl like molasses (2–10s per action) and torch tokens like it's free
- Background pings and unrelated fetches make it impossible to tell what actually caused what
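To make the swallowed-401 point concrete, here's a rough Python sketch. Everything in it (`HttpError`, `lazy_fetch`, `visible_fetch`, `expired_token_call`) is a made-up name for illustration, not from any real library; it only contrasts hiding the failure with keeping it as data an agent could read.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class HttpError(Exception):
    """Hypothetical error a client raises on a non-2xx response."""

    def __init__(self, status: int, url: str):
        super().__init__(f"HTTP {status} for {url}")
        self.status = status
        self.url = url


def lazy_fetch(call: Callable[[], dict]) -> Optional[dict]:
    """The anti-pattern: every failure collapses into None, so a 401 vanishes."""
    try:
        return call()
    except Exception:
        return None


@dataclass
class FetchResult:
    ok: bool
    data: Optional[dict] = None
    status: Optional[int] = None
    error: Optional[str] = None


def visible_fetch(call: Callable[[], dict]) -> FetchResult:
    """Same call, but the failure survives as structured data."""
    try:
        return FetchResult(ok=True, data=call())
    except HttpError as exc:
        return FetchResult(ok=False, status=exc.status, error=str(exc))


def expired_token_call() -> dict:
    """Stand-in for a real request that comes back 401 (assumed scenario)."""
    raise HttpError(401, "/api/profile")
```

With `lazy_fetch` the caller just sees `None` and the UI shrugs; with `visible_fetch` the status code is still there for whoever (or whatever) is validating the run.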

I tried pretty much everything out there and none of it quite scratched the itch I had: fast, structured, cross-platform runtime visibility without vision bloat or having to wire up a ton of hooks.

Quick rundown of the usual suspects:

- Pure vision/computer-use (Claude 3.5/4, ADEPT-style): zero setup, works on anything, but latency from hell and token burn is brutal for anything longer than a demo
- Playwright / browser MCP servers: fast and structured for web, but web-only, selectors shatter like glass, no native mobile
- Appium + vision hybrids: cross-platform on paper, but still vision-dependent and setup is a pain
- Sandboxed agents (OpenHands, SWE-agent): decent for repo tasks and shell stuff, not so much for live app UI/network state
- Explicit hooks/bridges: precise when you bother adding them, but requires code changes, which sucks

Couldn't find anything that gave me low-latency structured JSON state (UI elements, network, errors, logs) across platforms, local-first, without the usual trade-offs. So yeah, I got fed up and built a small local MCP server to solve it for myself.
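For a picture of what "structured JSON state" could look like, here's a minimal Python sketch. This is not the schema of Autonomo or any other tool; `Snapshot`, `NetworkEvent`, and the `action_id` tag are hypothetical names I'm assuming purely for illustration. The idea is that each network event carries the action that triggered it, so background pings don't muddy cause and effect.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class NetworkEvent:
    method: str
    url: str
    status: int
    # Hypothetical tag: which user/agent action triggered this request.
    # None means background noise (polling, analytics pings, etc.).
    action_id: Optional[str] = None


@dataclass
class Snapshot:
    ui: List[dict] = field(default_factory=list)
    network: List[NetworkEvent] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)

    def caused_by(self, action_id: str) -> List[NetworkEvent]:
        """Only the requests attributed to one action."""
        return [e for e in self.network if e.action_id == action_id]

    def failures(self) -> List[NetworkEvent]:
        """Any request with a 4xx/5xx status, attributed or not."""
        return [e for e in self.network if e.status >= 400]

    def to_json(self) -> str:
        """Serialize the whole snapshot as JSON text for the agent."""
        return json.dumps(asdict(self), sort_keys=True)
```

An agent reading this gets a few hundred bytes of JSON per step instead of a screenshot, and `caused_by("tap-login")` answers "what did my tap actually do" directly.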

Full disclosure: it's called Autonomo MCP https://github.com/sebringj/autonomo — very early, just launched.

I don't usually do this "I built a thing" thing — my open-source contributions are mostly small fixes and PRs — but I honestly couldn't see a better way in the current landscape.

It is my hope that Anthropic (or someone) will eventually ship a clean native solution for this. They already fixed tool calling with BM25 tool search to shrink context like crazy; I'd love to see them (or the industry) make runtime validation "just work" out of the box too.

Sometimes when you code in a vacuum you think your shit smells good. lmk if I'm off base here, I grew up with a mean grandpa so I'm cool with it.

Comments

GahLak•1h ago

You've nailed the real friction point that demos gloss over: agents are great at generation but terrible at verification in production systems. The vision latency tax is brutal once you hit real workflows.

sebringj•1h ago

Ya, for real. My boss was like, let's do e2e testing with AI, look for solutions out there... then like 2 days later he's like, wtf is this bill, and I was like, you wanted that, right? I was using vision calls in Azure Foundry and it came to over 100 bucks or something in just 2 days of me setting it up and trying it out with all the test cases it had.
