Ask HN: What does your agentic software dark factory look like?

4•ElFitz•3h ago

In some of the comment threads around here, a few of you shared interesting ideas and patterns, enough that I believe everyone interested in harness engineering is working on some sort of software dark factory or another.

We have OpenAI’s Symphony[1], StrongDM’s Factory[2], Yegge’s GasTown[3], and probably a few others I’ve missed.

So I’m curious. What have you been working on? What have you learned? What has worked and what has failed? And what do you think comes after?

I’ll go first. The first thing I tried that yielded interesting results was, when possible, providing a ground truth or reference for the model to iterate against: screenshots or mockups for UI work, API contracts and unit / integration tests for logic. That’s the Ralph Loop we all know and love. A feedback loop.
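A minimal sketch of that loop, assuming the agent and the reference check are plain callables. The names `propose`, `check`, and `iterate_against_reference` are hypothetical and only illustrate the shape: propose, check against the ground truth, feed the failures back.

```python
from typing import Callable, List, Tuple


def iterate_against_reference(
    propose: Callable[[List[str]], str],
    check: Callable[[str], List[str]],
    max_rounds: int = 5,
) -> Tuple[str, List[str], int]:
    """Ask for a candidate, check it against the reference, feed failures back.

    `propose` stands in for the model call and receives the previous failures.
    `check` stands in for the ground truth (tests, contract, screenshot diff)
    and returns a list of failure messages, empty when the candidate passes.
    """
    failures: List[str] = []
    candidate = ""
    for round_no in range(1, max_rounds + 1):
        candidate = propose(failures)
        failures = check(candidate)
        if not failures:
            return candidate, failures, round_no
    # Budget exhausted: return the last candidate with its remaining failures.
    return candidate, failures, max_rounds
```

The round cap is a design choice of the sketch, not something the loop requires; without it a check that can never pass keeps the agent running.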

The second (obvious, I know) was splitting planning and implementation.

Reviews by other models and iterative loops came next, with appreciable results. However, the implementing agent would often wiggle out by deferring things into oblivion or by declaring important feedback out of scope. Another feedback loop. I’ve found that turning those reviews into "hard gates" has its own set of issues, as reviewing agents will always find something to nitpick, turning these iterative implementation approaches into near-infinite loops.
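One possible way to keep a hard gate from looping forever, sketched under assumptions: the reviewer tags each finding with a severity, only some severities block, and the number of rounds is capped. The severity labels, `Finding`, and `review_gate` are hypothetical names for illustration, not part of any tool mentioned here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Finding:
    severity: str  # hypothetical labels: "blocker", "major", "nit"
    message: str


BLOCKING = {"blocker", "major"}


def review_gate(
    review: Callable[[], List[Finding]],
    fix: Callable[[List[Finding]], None],
    max_rounds: int = 3,
) -> Tuple[bool, int, List[Finding]]:
    """Gate on blocking findings only, with a capped number of review rounds.

    Returns (passed, rounds_used, non_blocking_findings_from_last_review).
    """
    deferred: List[Finding] = []
    for round_no in range(1, max_rounds + 1):
        findings = review()
        blocking = [f for f in findings if f.severity in BLOCKING]
        deferred = [f for f in findings if f.severity not in BLOCKING]
        if not blocking:
            return True, round_no, deferred
        if round_no < max_rounds:
            fix(blocking)
    # Still blocked after the last round: hand over to a human.
    return False, max_rounds, deferred
```

Nitpicks are returned rather than dropped, so they can be logged or batched without holding up the gate.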

Combining these reviews and committing plans alongside the code led to an interesting accident: reviewing agents spontaneously and unexpectedly picked up on those plans and drastically improved their feedback by comparing plan and implementation (should have been obvious, and you can imagine my surprise the first time GitHub Copilot actually provided useful feedback instead of the usual typo nitpicks).

Then a comment here led me to an adversarial green team / red team process.

A first agent creates a spec (based on StrongDM’s NLSpec) from my initial plan and gets it reviewed, including a detailed API.

A red team agent writes unit and integration tests based on these specs, and gets them reviewed.

Then a green team agent is given those same specs and API, and implements the actual feature or fix, and iterates against the tests, without any access to the tests themselves, only which tests failed and what they were testing. This prevents it from gaming the tests.
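A rough sketch of the information barrier described above, assuming each test result carries a name, a one-line description, and the test source. `RedTeamResult` and `redact_for_green_team` are hypothetical names; the point is only that the green team's report is built from an allow-list of fields.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RedTeamResult:
    name: str
    description: str  # what the test is checking, in plain words
    passed: bool
    source: str       # test code: must never reach the green team
    traceback: str    # may quote assertions, so it is withheld as well


def redact_for_green_team(results: List[RedTeamResult]) -> List[Dict[str, str]]:
    """Keep only the failing tests' names and descriptions."""
    return [
        {"test": r.name, "checks": r.description}
        for r in results
        if not r.passed
    ]
```

Building the report from named fields, rather than stripping fields from the full result, means a field added to the result later does not leak by default.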

Finally, once tests pass, a reviewing agent reviews the implementation against the specs.

This was nice. And it allows mixing and matching models, thinking levels, and providers. But both the green and red teams would sometimes diverge from the initial specs and API, sometimes with good reason.

So another agent was brought in to evaluate those divergences when they occur and, if they are valid improvements, restart the process from the spec generation point, with the new insights. Yet another feedback loop.
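The restart logic could look roughly like this. It is a sketch under assumptions: the red/green/review stages are collapsed into a single `build` callable that returns a description of the divergence (or `None`), and all names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def run_pipeline(
    write_spec: Callable[[List[str]], str],
    build: Callable[[str], Optional[str]],
    is_valid_improvement: Callable[[str, str], bool],
    max_restarts: int = 2,
) -> Tuple[str, str, List[str]]:
    """Restart from spec generation whenever a divergence is judged valid.

    `build` stands in for the red team, green team, and review stages; it
    returns None when everything stayed within the spec, otherwise a short
    description of how the work diverged from it.
    Returns (status, last_spec, accepted_insights).
    """
    insights: List[str] = []
    spec = ""
    for _ in range(max_restarts + 1):
        spec = write_spec(insights)
        divergence = build(spec)
        if divergence is None:
            return "done", spec, insights
        if not is_valid_improvement(spec, divergence):
            # Not an improvement: stop here so the stage can be redone
            # against the unchanged spec.
            return "rejected", spec, insights
        insights.append(divergence)
    return "restart-limit", spec, insights
```

The restart cap is an addition of the sketch; it keeps two agents that keep "improving" the spec from cycling indefinitely.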

And finally, integrating logs, OTel traces, and stack traces into the process. These agents seem remarkably capable at sifting through these, and end-to-end observability drastically improved results. Again, a feedback loop.

That’s all for me so far. Curious to see what other insights, findings, lessons or learnings everyone else has to share on this!

It’s a fun ride.

Comments

aosaigh•34m ago

What is a dark factory?

