Show HN: OctopusGarden – An autonomous software factory (specs in, code out)

https://github.com/foundatron/octopusgarden

6•foundatron•6h ago

I built this over the weekend after reading about StrongDM's software factory (their writeup: https://factory.strongdm.ai/, Simon Willison's deep dive: https://simonwillison.net/2026/Feb/7/software-factory/, Dan Shapiro's Five Levels: https://www.danshapiro.com/blog/2026/01/the-five-levels-from...). OctopusGarden is an open-source implementation of the pattern StrongDM described: holdout scenarios, probabilistic satisfaction scoring via LLM-as-judge, and a convergence loop that iterates until the code works; no human code review in the loop.

What stood out to me was that this architecture largely rhymes with the workflows I and others already run with coding agents. It basically automates the connective tissue between the steps I was already doing by hand in Claude Code, and then brute-forces a result. In the dark factory model, a spec goes in, code gets generated and built in Docker, the build is validated against scenarios the agent never saw and scored, and failures feed back until it converges.
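To make the shape of that loop concrete, here is a minimal Python sketch. It is not OctopusGarden's actual code: `Scenario`, `converge`, the `generate` and `judge` callables, and the iteration cap are all names I made up for illustration, and the Docker build step is left out. The 0.95 default mirrors the 95% satisfaction bar mentioned in the comments.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    description: str  # held out: never passed to the generator


def converge(
    spec: str,
    scenarios: list[Scenario],
    generate: Callable[[str, list[str]], str],  # (spec, feedback) -> code; hypothetical LLM call
    judge: Callable[[str, Scenario], float],    # (code, scenario) -> satisfaction in [0, 1]
    threshold: float = 0.95,
    max_iterations: int = 10,                   # assumed cap so the loop always terminates
) -> tuple[str, float, int]:
    """Generate, score against holdout scenarios, feed failures back, repeat."""
    if not scenarios:
        raise ValueError("need at least one scenario to score against")

    feedback: list[str] = []
    best_code, best_score = "", -1.0
    for iteration in range(1, max_iterations + 1):
        code = generate(spec, feedback)
        scores = {s.name: judge(code, s) for s in scenarios}
        mean = sum(scores.values()) / len(scores)
        if mean > best_score:
            best_code, best_score = code, mean
        if mean >= threshold:
            return code, mean, iteration
        # Only a failure summary goes back, not the scenario text itself,
        # so the holdout set stays hidden from the generator.
        feedback = [
            f"{name}: satisfaction {score:.2f}"
            for name, score in scores.items()
            if score < threshold
        ]
    # Did not converge: return the best attempt seen and let the caller decide.
    return best_code, best_score, max_iterations
```

In a real factory `generate` and `judge` would both be LLM calls, and `judge` would exercise the built container rather than look at source text.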

I've mostly tried it with standard CRUD/REST API apps, and it works. I haven't tried anything with HTML/JS yet. You can try the sample specs in the repo.

Some raw notes from the experience:

1. I don't want to maintain the code these factories generate. It works. The phenotype is (largely) correct, but the genotype is pretty wild and messy. I did not use OctopusGarden to build OctopusGarden (you can tell because it uses strict linting and tests). I know the point of these systems is zero human in the loop, but I think there's a real opportunity to get factories to generate code that humans actually want to maintain. I'm going to work on getting OctopusGarden there.

2. Compliance might be a nightmare. In my day job I think a lot about ISO 27001 and SOC 2 compliance. The idea of deploying dark-factory-generated projects into my environments and checking compliance boxes sounds painful. That might just be the current state of OctopusGarden and the code it generates, but I think we can get to a point where generated code is completely linted, statically checked, and tested inside the factory. That's not OctopusGarden today, but maybe it will be there next week? I can see this moving fast.

3. These dark factory apps will be hard to debug. There was a Claude outage today and I couldn't run my smoke tests or generate new apps. I don't want to maintain services that can't be debugged and fixed by a human in a pinch. We're already partially there with AI-assisted code, but this factory-generated code is even more convoluted. Requiring AI to create a new app version is probably worth it, but it's still one more thing standing between you and quickly patching an urgent bug.

4. Security needs a better story. These things need real security hardening. Maybe that's just better spec files and scenarios, maybe it's something more. I'm going to drink a strong cola and think about this one.

5. The unit of responsibility keeps growing. Last year we said code must come in PR-sized bites — that's how we manage risk. Now we're talking about deploying meshes of services created and deployed with no humans in the loop (except at creation). AI-generated services could really push the scale of what people are willing to accept responsibility for. Most SRE teams manage 1-5 services at big companies. Will that number increase per team? How much GDP is one person willing to manage via agents? Just a shower thought.

6. I was surprised this works. I'm surprised at how easy it was to make. I'm surprised more of these aren't out there already. I only did a couple of GitHub searches and didn't find many. I'm bad at searching. Sorry if I didn't find your project.

Comments

deltaops•6h ago

This is exactly the problem we're tackling! We built DeltaOps (delta-ops-mvp.vercel.app) - human-in-the-loop governance for autonomous agents. You hit the nail on the head with "no human in the loop" - that's the gap. DeltaOps adds a layer where agents can work autonomously, but critical actions (deploys, code merges, spending) require human approval. Also addresses your compliance concerns - every action is logged and approved. Would love to chat about integrating governance into dark factories!

foundatron•6h ago

Cool site, good idea. Maybe I'm underestimating it (I probably am), but I don't think it's a huge leap from what I published today to the compliance vision you're tackling.

guerython•6h ago

Curious how you are handling those guard logs and approvals in OctopusGarden?

foundatron•6h ago

Right now OctopusGarden logs every LLM call with token counts and cost, and the SQLite store records each run and iteration (spec hash, scores per scenario, generated code). So you get a full trace of what was generated, what it was tested against, and how it scored.
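For a feel of what that kind of trace looks like, here is a minimal sketch using Python's built-in `sqlite3`. The table names, column names, and helper functions are my own guesses for illustration, not the real OctopusGarden schema; only the recorded fields (spec hash, per-scenario scores, generated code) come from the description above.

```python
import hashlib
import json
import sqlite3

# Hypothetical schema: one row per run, one row per iteration of that run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id        INTEGER PRIMARY KEY,
    spec_hash TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS iterations (
    id          INTEGER PRIMARY KEY,
    run_id      INTEGER NOT NULL REFERENCES runs(id),
    number      INTEGER NOT NULL,
    scores_json TEXT NOT NULL,  -- {"scenario name": satisfaction}
    code        TEXT NOT NULL
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def start_run(conn: sqlite3.Connection, spec: str) -> int:
    # Hash the spec so a run can be tied to the exact text it was built from.
    spec_hash = hashlib.sha256(spec.encode("utf-8")).hexdigest()
    cur = conn.execute("INSERT INTO runs (spec_hash) VALUES (?)", (spec_hash,))
    conn.commit()
    return cur.lastrowid


def record_iteration(
    conn: sqlite3.Connection, run_id: int, number: int, scores: dict[str, float], code: str
) -> None:
    conn.execute(
        "INSERT INTO iterations (run_id, number, scores_json, code) VALUES (?, ?, ?, ?)",
        (run_id, number, json.dumps(scores, sort_keys=True), code),
    )
    conn.commit()


def trace(conn: sqlite3.Connection, run_id: int) -> list[tuple[int, dict, str]]:
    """Return every iteration of a run in order: (number, scores, code)."""
    rows = conn.execute(
        "SELECT number, scores_json, code FROM iterations WHERE run_id = ? ORDER BY number",
        (run_id,),
    ).fetchall()
    return [(number, json.loads(scores), code) for number, scores, code in rows]
```

Calling `trace(conn, run_id)` gives the audit trail for one run: what was generated at each iteration and how each scenario scored it.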

For approvals, the current model is that the spec is the approval. If the spec is right and scenarios pass at 95%+ satisfaction, the code ships. There's no PR review step by design (the "code is opaque weights" philosophy).

That said, you could totally layer approvals on top. Gate on spec changes, require sign-off before a run kicks off, or add a human checkpoint between "converged" and "deployed." The tool doesn't enforce a deployment pipeline, so that's up to your org's workflow.
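As a sketch of what that layering could look like (nothing like this exists in the tool today), here is a small gate in Python that sits between "converged" and "deployed". `Decision`, `RunResult`, `gate`, and the idea of an allowlist of approved spec hashes are all hypothetical names for illustration; the 0.95 default is the satisfaction bar described above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REJECT = "reject"                  # did not meet the satisfaction bar
    NEEDS_APPROVAL = "needs_approval"  # converged, but a human has to act first
    SHIP = "ship"


@dataclass(frozen=True)
class RunResult:
    spec_hash: str
    satisfaction: float  # mean scenario satisfaction in [0, 1]


def gate(
    result: RunResult,
    approved_spec_hashes: set[str],  # specs a human has already reviewed
    signed_off: bool,                # human checkpoint after convergence
    threshold: float = 0.95,
) -> Decision:
    """Decide whether a converged run may be deployed."""
    if result.satisfaction < threshold:
        return Decision.REJECT
    if result.spec_hash not in approved_spec_hashes:
        # The spec changed since it was last approved: gate on spec changes.
        return Decision.NEEDS_APPROVAL
    if not signed_off:
        return Decision.NEEDS_APPROVAL
    return Decision.SHIP
```

Setting `signed_off=True` unconditionally collapses this back to the current "spec is the approval" model, with the spec allowlist as the only human step.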

Worth noting: this is purely a hobby project at this point. It hasn't been used in any commercial setting. The guard rails and approval workflow stuff is where it would need the most work before anyone used it for real.

jlongo78•3h ago

The hardest part of autonomous pipelines like this is observability when things go sideways. Specs rarely survive first contact with real codebases intact. Worth investing heavily in session persistence and replay so you can audit exactly what the agent reasoned at each step. Being able to resume a failed run mid-conversation rather than starting cold saves enormous time. Multi-agent parallelism also compounds fast, so a grid view across simultaneous runs becomes essential pretty quickly.
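A minimal sketch of the persistence-and-resume idea in Python, assuming an append-only JSON Lines log with one record per completed iteration. The file layout, the `iteration` field, and the function names are assumptions for illustration, not something OctopusGarden provides.

```python
import json
from pathlib import Path


def append_step(log_path: Path, step: dict) -> None:
    """Append one completed iteration to the session log (one JSON object per line)."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(step) + "\n")


def load_steps(log_path: Path) -> list[dict]:
    """Replay the session: every recorded step, in the order it happened."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def resume_point(log_path: Path) -> int:
    """Iteration number to run next: 1 for a fresh session, else last recorded + 1."""
    steps = load_steps(log_path)
    return steps[-1]["iteration"] + 1 if steps else 1
```

A failed run can then restart at `resume_point(log_path)` with the earlier feedback reloaded from `load_steps`, instead of starting cold.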
