Show HN: OmoiOS – 190K lines of Python to stop babysitting AI agents (Apache 2.0)

2•kanddle•1h ago

AI coding agents generate decent code. The problem is everything around the code - checking progress, catching drift, deciding if it's actually done. I spent months trying to make autonomous agents work. The bottleneck was always me.

Attempt 1 - Claude/GPT directly: works for small stuff, but you re-explain context endlessly.

Attempt 2 - Copilot/Cursor: great autocomplete, but I was still doing 95% of the thinking.

Attempt 3 - continuous agents: they keep working without prompting, but "no errors" doesn't mean "feature works."

Attempt 4 - parallel agents: faster wall-clock, but now you're manually reviewing even more output.

The common failure: nobody verifies whether the output satisfies the goal. That somebody was always me. So I automated that job.

OmoiOS is a spec-driven orchestration system. You describe a feature, and it:

1. Runs a multi-phase spec pipeline (Explore > Requirements > Design > Tasks) with LLM evaluators scoring each phase. Retry on failure, advance on pass. By the time agents code, requirements have machine-checkable acceptance criteria.

2. Spawns isolated cloud sandboxes per task. Your local env is untouched. Agents get ephemeral containers with full git access.

3. Validates continuously - a separate validator agent checks each task against acceptance criteria. Failures feed back for retry. No human in the loop between steps.

4. Discovers new work - validation can spawn new tasks when agents find missing edge cases. The task graph grows as agents learn.
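To make the "retry on failure, advance on pass" gate in step 1 concrete, here is a minimal sketch. It is not the actual OmoiOS code: `run_pipeline`, the `generate` and `evaluate` callables, the 0.8 threshold, and the retry limit of 3 are all assumptions for illustration. In a real system both callables would be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List

# Phase names taken from the post; everything else here is hypothetical.
PHASES = ["explore", "requirements", "design", "tasks"]


@dataclass
class PhaseResult:
    phase: str
    output: str
    score: float
    attempts: int


def run_pipeline(
    generate: Callable[[str, str, int], str],  # (phase, context, attempt) -> draft
    evaluate: Callable[[str, str], float],     # (phase, draft) -> score in [0, 1]
    threshold: float = 0.8,                    # assumed pass mark
    max_attempts: int = 3,                     # assumed retry budget
) -> List[PhaseResult]:
    """Run each phase in order, retrying until the evaluator passes it."""
    results: List[PhaseResult] = []
    context = ""  # output of the previous phase feeds the next one
    for phase in PHASES:
        for attempt in range(1, max_attempts + 1):
            output = generate(phase, context, attempt)
            score = evaluate(phase, output)
            if score >= threshold:
                break  # advance on pass
        else:
            # Retry budget exhausted without a passing score.
            raise RuntimeError(f"phase {phase!r} failed after {max_attempts} attempts")
        results.append(PhaseResult(phase, output, score, attempt))
        context = output
    return results


if __name__ == "__main__":
    # Stub callables standing in for LLM generation and LLM scoring.
    def fake_generate(phase: str, context: str, attempt: int) -> str:
        return f"{phase}-v{attempt}"

    def fake_evaluate(phase: str, output: str) -> float:
        return 0.5 if output == "design-v1" else 0.9

    for result in run_pipeline(fake_generate, fake_evaluate):
        print(result)
```

The point of the hard failure after the retry budget is that a phase that never passes stops the pipeline before any agent starts coding against a weak spec.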

What's hard (honest):

- Spec quality is the bottleneck. Vague spec = agents spinning.

- Validation is domain-specific. API correctness is easy. UI quality is not.

- Discovery branching can grow the task graph unexpectedly.

- Sandbox overhead adds latency per task. Worth it, but a tradeoff.

- Merging parallel branches with real conflicts is the hardest problem.

- Guardian monitoring (per-agent trajectory analysis) has rough edges still.

Stack: Python/FastAPI, PostgreSQL+pgvector, Redis (~190K lines). Next.js 15 + React Flow (~83K lines TS). Claude Agent SDK + Daytona Cloud. 686 commits since Nov 2025, built solo. Apache 2.0.

I keep coming back to the same problem: structured spec generation that produces genuinely machine-checkable acceptance criteria. Has anyone found an approach that works for non-trivial features, or is this just fundamentally hard?
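For anyone unsure what "machine-checkable" means here, a minimal sketch of the idea: each acceptance criterion is a predicate over something observable, and a validator reports which ones fail. This is not the OmoiOS implementation; `Criterion`, `validate`, the `AC-*` ids, and the shape of the `observed` dict are all made up for illustration, using the easy API-correctness case.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    id: str
    description: str
    check: Callable[[Dict[str, Any]], bool]  # predicate over observed behavior


def validate(criteria: List[Criterion], observed: Dict[str, Any]) -> List[str]:
    """Return the ids of criteria that the observed behavior does not satisfy."""
    failed: List[str] = []
    for criterion in criteria:
        try:
            ok = criterion.check(observed)
        except Exception:
            # A check that cannot even run (e.g. missing field) counts as a failure.
            ok = False
        if not ok:
            failed.append(criterion.id)
    return failed


# Hypothetical criteria for a "create user" endpoint.
criteria = [
    Criterion("AC-1", "POST /users returns 201", lambda o: o["status"] == 201),
    Criterion("AC-2", "response body contains an id", lambda o: "id" in o["body"]),
]

if __name__ == "__main__":
    print(validate(criteria, {"status": 201, "body": {"id": 1}}))
    print(validate(criteria, {"status": 500, "body": {}}))
```

The failed ids are what would feed back into a retry. The open question above is how to get an LLM to produce predicates like these reliably for features where "observed behavior" is not a simple response payload.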

GitHub: https://github.com/kivo360/OmoiOS

Live: https://omoios.dev

Comments

kanddle•1h ago

Creator here. TL;DR: OmoiOS takes a feature description, generates structured specs with acceptance criteria, dispatches agents to isolated cloud sandboxes, validates each task autonomously, and produces a PR. You review the PR, not every intermediate step.

The core insight: AI coding tools are great at generating code, but someone still has to verify the output matches the goal. Usually that someone is you. OmoiOS automates that oversight loop.

How this compares to what you're probably using:

- vs Claude Code / Cursor: great interactive tools where you're in the loop. OmoiOS is for when you want to write the spec, approve the plan, and walk away.

- vs Codex: both produce PRs, but Codex is prompt-driven (individual tasks). OmoiOS is spec-driven (full feature lifecycle). Also open-source and not locked to one provider.

- vs Kiro: both spec-driven, but Kiro is a VS Code fork for interactive work. OmoiOS runs autonomously in the cloud. Also open-source, self-hostable, multi-model.

- vs CrewAI / LangGraph: agent frameworks (primitives). OmoiOS is an opinionated system — full lifecycle from spec to PR.

- vs Devin: OmoiOS is open-source, self-hostable, shows you the plan before executing. Devin is a black box.

Built with Claude Agent SDK + FastAPI + PostgreSQL + Next.js 15. Apache 2.0 — fork it, self-host it, build on it.

Happy to go deep on the spec pipeline, the validation loop, or the multi-agent coordination.

genxy•23m ago

The pervasive use of AI to write posts makes them exhausting to read.
