
Show HN: Coding agent where a second agent QAs every PR in a real browser

https://www.notesasm.com/

1•kavin_key•35m ago

Hi HN. I've been building this for the last few months, and it's at the point where outside eyes would help more than another week of solo iteration.

It's a kanban board where each ticket runs two agents back to back:

Build agent: runs in a sandboxed temp dir against a shallow clone of the user's repo, makes the change, pushes a branch, opens a PR. Uses the Claude Agent SDK.

QA agent: waits for the preview deploy to come up, then drives a real browser via Browserbase against the preview and verifies the change works against the ticket's acceptance criteria. Screenshots and an mp4 of the QA session get attached to the PR.
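The "wait for the preview deploy" step can be sketched as a poll with a deadline. This is an illustrative sketch only, not the platform's code: `wait_for_preview`, the `is_up` probe, and the default timeout and interval values are all hypothetical. The probe is injected so the loop does not depend on any particular HTTP client.

```python
import time
from typing import Callable


def wait_for_preview(
    is_up: Callable[[], bool],
    timeout_s: float = 300.0,   # assumed value, not from the post
    interval_s: float = 5.0,    # assumed value, not from the post
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_up() until it returns True or the timeout elapses.

    Returns True if the preview came up, False if the deadline passed.
    """
    deadline = clock() + timeout_s
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False  # caller can treat this as an environmental failure
        sleep(interval_s)
```

In practice `is_up` would be something like an HTTP GET against the preview URL that checks for a 200. A timeout here is a natural candidate for the "preview never deployed" environmental case rather than a code failure.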

If QA fails, the build agent reruns with the QA report as context, up to 3 iterations. Before each retry, a classifier reads the failure and decides whether it was a real code bug or environmental (Clerk didn't load, preview never deployed, Browserbase session got 403'd, etc.). Environmental failures break the loop instead of iterating on infra noise. This was the single biggest reliability win.
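As a rough sketch of the loop described above (illustrative only: `run_ticket`, `classify_failure`, `QAResult`, the status strings, and the keyword list are hypothetical names and values, and "up to 3 iterations" is assumed here to mean three build/QA passes in total):

```python
from dataclasses import dataclass
from typing import Callable

MAX_ITERATIONS = 3  # from the post; assumed to count total build/QA passes

# Hypothetical keywords; the real list is hand-curated and not shown in the post.
ENVIRONMENTAL_KEYWORDS = ("clerk", "preview never deployed", "403")


@dataclass
class QAResult:
    passed: bool
    report: str


def classify_failure(report: str) -> str:
    """Return 'environmental' if the report matches an infra keyword, else 'code'."""
    text = report.lower()
    if any(keyword in text for keyword in ENVIRONMENTAL_KEYWORDS):
        return "environmental"
    return "code"


def run_ticket(build: Callable[[str], None], qa: Callable[[], QAResult]) -> str:
    """Run build then QA, retrying only when the failure looks like a code bug."""
    context = ""  # previous QA report, fed into the next build attempt
    for _ in range(MAX_ITERATIONS):
        build(context)
        result = qa()
        if result.passed:
            return "passed"
        if classify_failure(result.report) == "environmental":
            # Break the loop instead of iterating on infra noise.
            return "blocked_environmental"
        context = result.report
    return "failed"
```

The point of the early return is that another build pass cannot fix a preview that never deployed, so retrying only spends tokens and time.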

The other side is input. The platform exposes an MCP server, so from Claude Code or any MCP client you can say "make a ticket for X" and it lands in the backlog. The original reason I built any of this was that writing tickets was the bottleneck for me, not writing code.

A few implementation notes that might be interesting:

The build agent's system prompt forbids the Task / Agent (subagent) tool. Spawning subagents inside the SDK consistently hung for 4+ minutes. Staying in the main session with Read/Edit/Bash/Glob/Grep is dramatically more reliable.

Postgres schema is applied on startup from a single schema.sql, idempotent with IF NOT EXISTS everywhere. No migrations directory. Adding a column is "edit schema.sql, push, restart." This is the highest-leverage decision I've made on the backend.
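To illustrate the apply-on-startup idea, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere, and the `tickets` table, its columns, and `apply_schema` are hypothetical, not the platform's real schema.

```python
import sqlite3

# Hypothetical schema; in the post this lives in a single schema.sql file.
# On Postgres, new columns would use ALTER TABLE ... ADD COLUMN IF NOT EXISTS,
# which SQLite does not support, so only tables and indexes are shown here.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS tickets (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'backlog'
);
CREATE INDEX IF NOT EXISTS idx_tickets_status ON tickets (status);
"""


def apply_schema(conn: sqlite3.Connection) -> None:
    """Apply the whole schema; safe to call on every startup."""
    conn.executescript(SCHEMA_SQL)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    apply_schema(conn)
    apply_schema(conn)  # second run is a no-op because of IF NOT EXISTS
```

The trade-off is that this only covers additive changes. Renames, type changes, and drops are not expressible with `IF NOT EXISTS` and would still need a one-off statement.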

QA has a fast mode (local Chromium for anonymous routes) and a deep mode (Browserbase + residential proxies + stealth, for anything behind auth). The mode is per-ticket because cheap-and-fast loses signal once you go past the login wall.

A background sweeper force-fails any job running over 60 min. The SDK can hang in ways asyncio.wait_for doesn't always clean up through the subprocess boundary, so the kill switch is a belt-and-suspenders guard.
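The sweeper logic reduces to a comparison against a start timestamp. A minimal in-memory sketch, where the `Job` shape, the status strings, and `sweep` are hypothetical (the real version presumably runs against Postgres on a timer):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_RUNTIME = timedelta(minutes=60)  # from the post


@dataclass
class Job:
    id: int
    status: str
    started_at: datetime


def sweep(jobs: list[Job], now: datetime) -> list[int]:
    """Force-fail running jobs older than MAX_RUNTIME and return their ids."""
    failed = []
    for job in jobs:
        if job.status == "running" and now - job.started_at > MAX_RUNTIME:
            job.status = "failed"
            failed.append(job.id)
    return failed
```

Because the check depends only on stored state and the clock, it still works when the process that was running the job is hung and cannot report its own failure.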

Stack: FastAPI on Railway, Postgres, Claude Agent SDK, Browserbase, Vercel for previews, Clerk for auth, Resend for transactional email, MCP over HTTP. Frontend is one HTML file on Vercel, no build step, no framework, just vanilla JS and Clerk loaded from CDN.

What's not working well yet: deep-mode QA still occasionally gets stuck on CAPTCHAs in unfamiliar OAuth flows. The classifier's environmental-failure list is a set of hand-curated keywords, which is fragile. Spend tracking is per-job, but I haven't built per-workspace budget caps yet. PR previews on Vercel sometimes take 2-3 min to come up, which the QA agent has to wait through.

It's in alpha with a waitlist. Free during alpha, paid plans later. The whole platform was built using Claude Code, so this has been dogfooding itself for the entire build.

Site: https://notesasm.com

Would love feedback, especially on: the dual-agent loop design, the classifier approach, what kinds of tickets would actually break this on your repo, and prior art I should be aware of (I know about Devin, OpenHands, SWE-agent; what else?).
