Show HN: OctopusGarden – An autonomous software factory (specs in, code out)

https://github.com/foundatron/octopusgarden

5•foundatron•1h ago

I built this over the weekend after reading about StrongDM's software factory (their writeup: https://factory.strongdm.ai/, Simon Willison's deep dive: https://simonwillison.net/2026/Feb/7/software-factory/, Dan Shapiro's Five Levels: https://www.danshapiro.com/blog/2026/01/the-five-levels-from...). OctopusGarden is an open-source implementation of the pattern StrongDM described: holdout scenarios, probabilistic satisfaction scoring via LLM-as-judge, and a convergence loop that iterates until the code works; no human code review in the loop.

What stood out to me was that this architecture largely rhymes with the workflows I and others already run with coding agents. It basically automates the connective tissue between the steps I was already doing by hand in Claude Code, then brute-forces a result. In the dark factory model, a spec goes in, code gets generated and built in Docker, the build is validated against scenarios the agent never saw and scored, and failures feed back until it converges.
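For readers who want the shape of that loop, here is a minimal sketch in Python. It is not OctopusGarden's actual code: `converge`, `Attempt`, and the `generate` and `validate` callables are hypothetical names, the Docker build is assumed to happen inside `validate`, and averaging scenario scores against a 0.95 threshold is an assumption about how convergence is decided.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Attempt:
    iteration: int
    code: str
    scores: Dict[str, float]  # holdout scenario name -> satisfaction in [0, 1]

    @property
    def satisfaction(self) -> float:
        # Assumption: overall satisfaction is the plain mean across scenarios.
        if not self.scores:
            return 0.0
        return sum(self.scores.values()) / len(self.scores)


def converge(
    spec: str,
    generate: Callable[[str, List[str]], str],    # hypothetical: (spec, feedback) -> code
    validate: Callable[[str], Dict[str, float]],  # hypothetical: build + judge -> scores
    threshold: float = 0.95,
    max_iterations: int = 10,
) -> List[Attempt]:
    """Regenerate code until holdout scenarios are satisfied or the budget runs out."""
    history: List[Attempt] = []
    feedback: List[str] = []
    for i in range(1, max_iterations + 1):
        code = generate(spec, feedback)
        scores = validate(code)
        attempt = Attempt(i, code, scores)
        history.append(attempt)
        if attempt.satisfaction >= threshold:
            break
        # Feed back only the names of failing scenarios, not their contents,
        # so the holdout set stays unseen by the generator.
        feedback = [name for name, score in scores.items() if score < threshold]
    return history
```

How much of a failing scenario to reveal in the feedback is a design choice. This sketch passes only scenario names, which keeps the holdout property but gives the generator less to work with.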

I've tried it with mostly standard CRUD/REST API apps and it works. I haven't tried anything with HTML/JS yet. You can try the sample specs in the repo.

Some raw notes from the experience:

1. I don't want to maintain the code these factories generate. It works. The phenotype is (largely) correct, but the genotype is pretty wild and messy. I did not use OctopusGarden to build OctopusGarden (you can tell because it uses strict linting and tests). I know the point of these systems is zero human in the loop, but I think there's a real opportunity to get factories to generate code that humans actually want to maintain. I'm going to work on getting OctopusGarden there.

2. Compliance might be a nightmare. In my day job I think a lot about ISO 27001 and SOC 2 compliance. The idea of deploying dark-factory-generated projects into my environments and checking compliance boxes sounds painful. That might just be the current state of OctopusGarden and the code it generates, but I think we can get to a point where generated code is completely linted, statically checked, and tested inside the factory. That's not OctopusGarden today, but maybe it will be there next week? I can see this moving fast.

3. These dark factory apps will be hard to debug. There was a Claude outage today and I couldn't run my smoke tests or generate new apps. I don't want to maintain services that can't be debugged and fixed by a human in a pinch. We're already partially there with AI-assisted code, but this factory-generated code is even more convoluted. Requiring AI to create a new app version is probably worth it...but it's still yet another thing between you and quickly patching an urgent bug.

4. Security needs a better story. These things need real security hardening. Maybe that's just better spec files and scenarios, maybe it's something more. I'm going to drink a strong cola and think about this one.

5. The unit of responsibility keeps growing. Last year we said code must come in PR-sized bites, because that's how we manage risk. Now we're talking about deploying meshes of services created and deployed with no humans in the loop (except at creation). AI-generated services could really push the scale of what people are willing to accept responsibility for. Most SRE teams manage 1-5 services at big companies. Will that number increase per team? How much GDP is one person willing to manage via agents? Just a shower thought.

6. I was surprised this works. I'm surprised at how easy it was to make. I'm surprised more of these aren't out there already. I only did a couple of GitHub searches and didn't find many. I'm bad at searching. Sorry if I didn't find your project.

Comments

deltaops•1h ago

This is exactly the problem we're tackling! We built DeltaOps (delta-ops-mvp.vercel.app) - human-in-the-loop governance for autonomous agents. You hit the nail on the head with "no human in the loop" - that's the gap. DeltaOps adds a layer where agents can work autonomously, but critical actions (deploys, code merges, spending) require human approval. Also addresses your compliance concerns - every action is logged and approved. Would love to chat about integrating governance into dark factories!

foundatron•1h ago

Cool site, good idea. Maybe I'm underestimating it (I probably am), but I don't think it's a huge leap from what I published today to the compliant vision you're tackling.

guerython•1h ago

Curious how you are handling those guard logs and approvals in OctopusGarden?

foundatron•1h ago

Right now OctopusGarden logs every LLM call with token counts and cost, and the SQLite store records each run and iteration (spec hash, scores per scenario, generated code). So you get a full trace of what was generated, what it was tested against, and how it scored.
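A rough idea of what such a trace store could look like, sketched with Python's standard `sqlite3` module. The table names, column names, and helper functions are all hypothetical and are not taken from the OctopusGarden repo. They only mirror the fields mentioned above (spec hash, per-scenario scores, generated code, token counts, cost).

```python
import hashlib
import json
import sqlite3

# Hypothetical schema: one row per run, per iteration, and per LLM call.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY,
    spec_hash TEXT NOT NULL,
    started_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS iterations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    number INTEGER NOT NULL,
    code TEXT NOT NULL,
    scores_json TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS llm_calls (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cost_usd REAL NOT NULL
);
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def start_run(conn, spec_text):
    # Hash the spec so every run is tied to the exact text that was "approved".
    spec_hash = hashlib.sha256(spec_text.encode("utf-8")).hexdigest()
    cur = conn.execute("INSERT INTO runs (spec_hash) VALUES (?)", (spec_hash,))
    conn.commit()
    return cur.lastrowid


def record_iteration(conn, run_id, number, code, scores):
    conn.execute(
        "INSERT INTO iterations (run_id, number, code, scores_json) VALUES (?, ?, ?, ?)",
        (run_id, number, code, json.dumps(scores, sort_keys=True)),
    )
    conn.commit()


def record_llm_call(conn, run_id, input_tokens, output_tokens, cost_usd):
    conn.execute(
        "INSERT INTO llm_calls (run_id, input_tokens, output_tokens, cost_usd) VALUES (?, ?, ?, ?)",
        (run_id, input_tokens, output_tokens, cost_usd),
    )
    conn.commit()


def run_cost(conn, run_id):
    row = conn.execute(
        "SELECT COALESCE(SUM(cost_usd), 0.0) FROM llm_calls WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    return row[0]
```

With a layout like this, "what was generated, what it was tested against, and how it scored" becomes a join on `run_id`.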

For approvals, the current model is that the spec is the approval. If the spec is right and scenarios pass at 95%+ satisfaction, the code ships. There's no PR review step by design (the "code is opaque weights" philosophy).
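One way to read "95%+ satisfaction", sketched in Python under loud assumptions: each holdout scenario is judged several times, each judgment is a pass/fail verdict, a scenario's satisfaction is its pass fraction, and the run ships when the mean across scenarios clears the threshold. The function names and the aggregation rule are hypothetical. The real tool may score per scenario or weight scenarios differently.

```python
from statistics import mean
from typing import Dict, List


def scenario_satisfaction(verdicts: List[bool]) -> float:
    """Fraction of judge verdicts that passed for one scenario."""
    if not verdicts:
        raise ValueError("a scenario needs at least one verdict")
    return sum(verdicts) / len(verdicts)


def overall_satisfaction(verdicts_by_scenario: Dict[str, List[bool]]) -> float:
    # Assumption: every scenario counts equally toward the overall score.
    return mean(scenario_satisfaction(v) for v in verdicts_by_scenario.values())


def should_ship(verdicts_by_scenario: Dict[str, List[bool]], threshold: float = 0.95) -> bool:
    return overall_satisfaction(verdicts_by_scenario) >= threshold
```

A stricter variant would require every scenario to clear the threshold on its own, so one badly failing scenario cannot hide behind several perfect ones.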

That said, you could totally layer approvals on top. Gate on spec changes, require sign-off before a run kicks off, or add a human checkpoint between "converged" and "deployed." The tool doesn't enforce a deployment pipeline, so that's up to your org's workflow.
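A minimal sketch of those checkpoints in Python, assuming approvals are tracked by spec hash and by run ID. `ApprovalGate` and its methods are hypothetical and live outside OctopusGarden. Nothing like this exists in the tool today.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


def spec_hash(spec_text: str) -> str:
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()


@dataclass
class ApprovalGate:
    """Hypothetical human checkpoints layered around a factory run."""

    approved_specs: Set[str] = field(default_factory=set)
    deploy_signoffs: Set[str] = field(default_factory=set)

    def approve_spec(self, spec_text: str) -> str:
        # A human signs off on the exact spec text; any edit changes the hash.
        digest = spec_hash(spec_text)
        self.approved_specs.add(digest)
        return digest

    def can_start_run(self, spec_text: str) -> bool:
        return spec_hash(spec_text) in self.approved_specs

    def sign_off_deploy(self, run_id: str) -> None:
        self.deploy_signoffs.add(run_id)

    def can_deploy(self, run_id: str, converged: bool) -> bool:
        # Deployment needs both a converged run and an explicit human sign-off.
        return converged and run_id in self.deploy_signoffs
```

Keying approval to the spec hash means a changed spec has to be re-approved before a new run can start, which matches the "gate on spec changes" idea.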

Worth noting: this is purely a hobby project at this point. It hasn't been used in any commercial setting. The guard rails and approval workflow stuff is where it would need the most work before anyone used it for real.
