Attempt 1 - Claude/GPT directly: works for small stuff, but you re-explain context endlessly.
Attempt 2 - Copilot/Cursor: great autocomplete, still doing 95% of the thinking.
Attempt 3 - continuous agents: they keep working without prompting, but "no errors" doesn't mean "feature works."
Attempt 4 - parallel agents: faster wall-clock, but now you're manually reviewing even more output.
The common failure: none of these tools verify whether the output satisfies the goal. Someone has to, and that someone was always me. So I automated that job.
OmoiOS is a spec-driven orchestration system. You describe a feature, and it:
1. Runs a multi-phase spec pipeline (Explore > Requirements > Design > Tasks) with LLM evaluators scoring each phase. Retry on failure, advance on pass. By the time agents code, requirements have machine-checkable acceptance criteria.
2. Spawns isolated cloud sandboxes per task. Your local env is untouched. Agents get ephemeral containers with full git access.
3. Validates continuously - a separate validator agent checks each task against acceptance criteria. Failures feed back for retry. No human in the loop between steps.
4. Discovers new work - validation can spawn new tasks when agents find missing edge cases. The task graph grows as agents learn.
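The evaluator-gated pipeline in step 1 can be sketched as a loop over phases with retry-or-advance logic. This is a minimal illustration, not OmoiOS's actual internals: the phase names come from the post, but the function names, 0.8 threshold, and retry count are assumptions.

```python
# Hypothetical sketch of an evaluator-gated spec pipeline.
# Phase names are from the post; everything else is illustrative.

PHASES = ["explore", "requirements", "design", "tasks"]
PASS_THRESHOLD = 0.8  # assumed cutoff, not OmoiOS's real value
MAX_RETRIES = 3

def run_phase(phase: str, context: dict) -> dict:
    # Stand-in for an agent call that drafts this phase's artifact.
    return {"phase": phase, "draft": f"{phase} for {context['feature']}"}

def score_phase(phase: str, artifact: dict) -> float:
    # Stand-in for an LLM evaluator scoring the artifact in [0, 1].
    return 0.9 if artifact.get("draft") else 0.0

def run_pipeline(feature: str) -> dict:
    context = {"feature": feature}
    for phase in PHASES:
        for _ in range(MAX_RETRIES):
            artifact = run_phase(phase, context)
            if score_phase(phase, artifact) >= PASS_THRESHOLD:
                context[phase] = artifact  # advance on pass
                break                      # otherwise retry the phase
        else:
            raise RuntimeError(f"{phase} did not pass evaluation")
    return context
```

The point of the gate is that a phase's output never reaches the next phase without an evaluator signing off, which is what lets the coding agents start from requirements that already carry acceptance criteria.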
What's hard (honest):
- Spec quality is the bottleneck. Vague spec = agents spinning.
- Validation is domain-specific. API correctness is easy. UI quality is not.
- Discovery branching can grow the task graph unexpectedly.
- Sandbox overhead adds latency per task. Worth it, but a tradeoff.
- Merging parallel branches with real conflicts is the hardest problem.
- Guardian monitoring (per-agent trajectory analysis) still has rough edges.
Stack: Python/FastAPI, PostgreSQL+pgvector, Redis (~190K lines). Next.js 15 + React Flow (~83K lines TS). Claude Agent SDK + Daytona Cloud. 686 commits since Nov 2025, built solo. Apache 2.0.
I keep coming back to the same problem: structured spec generation that produces genuinely machine-checkable acceptance criteria. Has anyone found an approach that works for non-trivial features, or is this just fundamentally hard?
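For concreteness, here is one possible shape for a machine-checkable acceptance criterion: a declarative record that compiles down to an executable check. The dataclass fields and check logic are my illustration, not OmoiOS's schema.

```python
# Illustrative only: one way to make an acceptance criterion executable.
# The field names and check logic are assumptions, not OmoiOS's schema.
from dataclasses import dataclass

@dataclass
class AcceptanceCriterion:
    endpoint: str           # e.g. "POST /users"
    expected_status: int    # e.g. 201
    required_fields: tuple  # fields the response body must contain

    def check(self, status: int, body: dict) -> bool:
        return (status == self.expected_status
                and all(f in body for f in self.required_fields))

criterion = AcceptanceCriterion("POST /users", 201, ("id", "email"))
print(criterion.check(201, {"id": 7, "email": "a@b.c"}))  # True
print(criterion.check(500, {"error": "boom"}))            # False
```

Criteria like this are easy for API work; the open question in the post is how to generate equally checkable criteria for fuzzier goals like UI quality.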
GitHub: https://github.com/kivo360/OmoiOS Live: https://omoios.dev
kanddle•2h ago
The core insight: AI coding tools are great at generating code, but someone still has to verify the output matches the goal. Usually that someone is you. OmoiOS automates that oversight loop.
How this compares to what you're probably using:
- vs Claude Code / Cursor: great interactive tools where you're in the loop. OmoiOS is for when you want to write the spec, approve the plan, and walk away.
- vs Codex: both produce PRs, but Codex is prompt-driven (individual tasks). OmoiOS is spec-driven (full feature lifecycle). Also open-source and not locked to one provider.
- vs Kiro: both spec-driven, but Kiro is a VS Code fork for interactive work. OmoiOS runs autonomously in the cloud. Also open-source, self-hostable, multi-model.
- vs CrewAI / LangGraph: agent frameworks (primitives). OmoiOS is an opinionated system — full lifecycle from spec to PR.
- vs Devin: OmoiOS is open-source, self-hostable, shows you the plan before executing. Devin is a black box.
Built with Claude Agent SDK + FastAPI + PostgreSQL + Next.js 15. Apache 2.0 — fork it, self-host it, build on it.
Happy to go deep on the spec pipeline, the validation loop, or the multi-agent coordination.