It's a kanban board where each ticket runs two agents back to back:
Build agent: runs in a sandboxed temp dir against a shallow clone of the user's repo, makes the change, pushes a branch, opens a PR. Uses the Claude Agent SDK.
QA agent: waits for the preview deploy to come up, then drives a real browser via Browserbase against the preview and checks the change against the ticket's acceptance criteria. Screenshots and an MP4 of the QA session get attached to the PR.
If QA fails, the build agent reruns with the QA report as context, up to 3 iterations. Before each retry, a classifier reads the failure and decides whether it was a real code bug or environmental (Clerk didn't load, preview never deployed, Browserbase session got 403'd, etc.). Environmental failures break the loop instead of iterating on infra noise. Breaking out early like this was the single biggest reliability win.
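To make the loop concrete, here is a minimal Python sketch of the control flow described above. It is not the actual implementation: the `build` and `qa` callables, the `QAResult` type, the status strings, and the keyword list are all hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ITERATIONS = 3  # "up to 3 iterations" from the description above

# Hypothetical keyword list; the real hand-curated list is not shown in this post.
ENVIRONMENTAL_MARKERS = ("clerk failed to load", "preview never deployed", "403")


@dataclass
class QAResult:
    passed: bool
    report: str


def is_environmental(report: str) -> bool:
    """Keyword classifier: True if the failure looks like infra noise, not a code bug."""
    text = report.lower()
    return any(marker in text for marker in ENVIRONMENTAL_MARKERS)


def run_ticket(
    build: Callable[[Optional[str]], None],
    qa: Callable[[], QAResult],
) -> str:
    """Run build then QA, retrying the build with the QA report as context."""
    qa_report: Optional[str] = None
    for _ in range(MAX_ITERATIONS):
        build(qa_report)  # first run gets no report, retries get the last one
        result = qa()
        if result.passed:
            return "passed"
        if is_environmental(result.report):
            # Break the loop rather than asking the build agent to "fix" infra noise.
            return "environmental_failure"
        qa_report = result.report
    return "failed"
```

The point of the early return is that an environmental failure never consumes one of the three build attempts with a misleading QA report as context.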
The other side is input. The platform exposes an MCP server, so from Claude Code or any MCP client you can say "make a ticket for X" and it lands in the backlog. The original reason I built any of this was that writing tickets was the bottleneck for me, not writing code.
A few implementation notes that might be interesting:
The build agent's system prompt forbids the Task / Agent (subagent) tool. Spawning subagents inside the SDK was hanging for 4+ minutes consistently. Staying in the main session with Read/Edit/Bash/Glob/Grep is dramatically more reliable.
Postgres schema is applied on startup from a single schema.sql, idempotent with IF NOT EXISTS everywhere. No migrations directory. Adding a column is "edit schema.sql, push, restart." This is the highest-leverage decision I've made on the backend.
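As a rough sketch of what that can look like (the table, column, and index names here are made up, and `apply_schema` is a hypothetical helper, not the platform's code): note that `CREATE TABLE IF NOT EXISTS` on its own does not add columns to a table that already exists, so in this sketch a later column gets its own `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statement, which Postgres supports.

```python
# Hypothetical schema text. In the setup described above this would live in schema.sql.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS tickets (
    id BIGSERIAL PRIMARY KEY,
    title TEXT NOT NULL
);
-- A column added after the table already exists in production
ALTER TABLE tickets ADD COLUMN IF NOT EXISTS qa_mode TEXT NOT NULL DEFAULT 'fast';
CREATE INDEX IF NOT EXISTS tickets_qa_mode_idx ON tickets (qa_mode);
"""


def apply_schema(conn, schema_sql: str = SCHEMA_SQL) -> None:
    """Run the whole schema on startup.

    `conn` is assumed to be a DB-API style connection (for example from psycopg).
    Every statement is idempotent, so running this on each boot is safe.
    """
    cur = conn.cursor()
    cur.execute(schema_sql)  # no parameters, so the statements are sent as one batch
    conn.commit()
```

Because every statement is a no-op when its object already exists, the same call works for a fresh database and for one that has been running for months.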
QA has a fast mode (local Chromium for anonymous routes) and a deep mode (Browserbase + residential proxies + stealth, for anything behind auth). The mode is per-ticket because cheap-and-fast loses signal once you go past the login wall.
A background sweeper force-fails any job running over 60 min. The SDK can hang in ways asyncio.wait_for doesn't always clean up through the subprocess boundary, so the kill switch is a belt-and-suspenders guard.
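The sweeper logic itself is small. Below is a simplified sketch over in-memory dicts; the field names (`status`, `started_at`, `error`) are assumptions, and the real version presumably runs as a periodic task issuing an UPDATE against Postgres rather than mutating a list.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_RUNTIME = timedelta(minutes=60)  # the 60 min cutoff mentioned above


def sweep_stale_jobs(jobs: list, now: Optional[datetime] = None) -> list:
    """Mark every job that has been running longer than MAX_RUNTIME as failed.

    Each job is a dict with hypothetical keys: id, status, started_at (tz-aware).
    Returns the ids that were force-failed.
    """
    now = now or datetime.now(timezone.utc)
    failed_ids = []
    for job in jobs:
        if job["status"] == "running" and now - job["started_at"] > MAX_RUNTIME:
            job["status"] = "failed"
            job["error"] = "force-failed by sweeper: exceeded max runtime"
            failed_ids.append(job["id"])
    return failed_ids
```

The check depends only on the stored start time, so it still fires when the worker that owns the job is itself the thing that hung.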
Stack: FastAPI on Railway, Postgres, Claude Agent SDK, Browserbase, Vercel for previews, Clerk for auth, Resend for transactional email, MCP over HTTP. Frontend is one HTML file on Vercel, no build step, no framework, just vanilla JS and Clerk loaded from CDN.
What's not working well yet: deep-mode QA still occasionally gets stuck on CAPTCHAs in unfamiliar OAuth flows. The classifier's environmental-failure list is hand-curated keywords, which is fragile. Spend tracking is per-job, but I haven't built per-workspace budget caps yet. PR previews on Vercel sometimes take 2-3 min to come up, and the QA agent has to wait through that.
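For the preview wait specifically, the shape is a poll-with-deadline loop. Here is a sketch, assuming a caller-supplied `probe` that returns True once the preview URL responds; the 5 minute timeout and 5 second interval are illustrative numbers, not the platform's real settings.

```python
import time
from typing import Callable


def wait_for_preview(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or the deadline passes.

    Returns True if the preview came up in time, False on timeout.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() + interval_s > deadline:
            return False  # the next poll would land past the deadline
        sleep(interval_s)
```

A False result here maps naturally onto the "preview never deployed" case, which counts as environmental rather than as a code bug.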
It's in alpha with a waitlist. Free during alpha, paid plans later. The whole platform was built using Claude Code, so this has been dogfooding itself for the entire build.
Site: https://notesasm.com
Would love feedback, especially on: the dual-agent loop design, the classifier approach, what kinds of tickets would actually break this on your repo, and prior art I should be aware of (I know about Devin, OpenHands, SWE-agent; what else?).