I built a production app in a week by managing a swarm of 20 AI agents

https://zachwills.net/i-managed-a-swarm-of-20-ai-agents-for-a-week-here-are-the-8-rules-i-learned/

14•zachwills•2h ago

Comments

zachwills•2h ago

Hey HN, author here.

I spent last week in a deep-dive experiment to see how far I could push modern agentic workflows on a greenfield project. I wanted to move past simple code generation and see if I could build a system where I was orchestrating a team of agents to build a full application.

The results were pretty wild (~800 commits, 100+ PRs, and a functioning app we use internally at my company), but the most interesting part was the playbook of rules I had to develop to make it work. The post covers the 8 rules I learned, from managing the AI's context window with sub-agents and manual checkpoints, to creating autonomous test loops, to why I had to become ruthless about restarting failed runs.

A few quick notes to preempt questions:

Tech Stack: The core of this was Claude Code, a custom parallelization script, and open-source MCPs like Serena.

Cost: The token cost was significant (~$6k). This was an experiment to push the limits, not to optimize for cost efficiency... yet.

Effort: This was not a standard 40-hour week. It was an intense, "in the hole" sprint with a very high cognitive load.

I’m convinced the role of an engineer is shifting from a hands-on coder to an architect of these intelligent systems. I’m curious to hear how others are approaching this. What workflows or tools for managing agents have you found to be effective?

gjsman-1000•1h ago

Ask yourself what happens when Claude releases Claude Code 2.0, which just uses AI to self-manage the swarms from a simpler prompt.

esafak•1h ago

You would do the same thing with less work. Progress! If only, though; I find it easy to stump even the best model on brownfield code bases.

zachwills•1h ago

Yea, generally I think humans just move farther to the right. Higher leverage. Work on the stuff only you can work on.

giancarlostoro•1h ago

> I’m convinced the role of an engineer is shifting from a hands-on coder to an architect of these intelligent systems.

I think this is what a lot of people do and don't understand. The AI is only as good as your ability to architect the software. I think I'd be okay if someone wrote AI generated code if they hand wrote all unit testing to ensure that the AI isn't producing garbage, I think this is a reasonable trade.

zachwills•1h ago

Agreed 100%!

risyachka•1h ago

So where is the app?

zachwills•1h ago

From another comment: "It's an internal app with auth, ties to LLMs for analysis, ties to 3rd party data sources for aggregation of data, etc. It's not a Slackbot! It's actually something that we can use instead of paying ~$30k/yr (what we were quoted for similar products by vendors). I'd consider it an alpha release in it's current form."

pavel_lishin•1h ago

> The results were pretty wild (~800 commits, 100+ PRs, and a functioning app we use internally at my company)

I'd be very interested to hear less about the process, and more about the app itself.

If a week "in the hole" resulted in a Slackbot that greets people when they enter a Slack channel, I'd be significantly less impressed than if you'd built, say, a CI pipeline, or something that automatically creates Jira tickets based on outages, or automatically handles subscription renewals or something.

Plus, the number of commits & PRs is absolutely not a useful metric. How well is the app running? How many bugs are you finding day to day? How much functionality is missing, how easy is it to add new functionality based on user feedback, etc? Monitoring?

zachwills•1h ago

Great points -- thanks for contributing to the discussion. It's an internal app with auth, ties to LLMs for analysis, ties to 3rd party data sources for aggregation of data, etc. It's not a Slackbot! It's actually something that we can use instead of paying ~$30k/yr (what we were quoted for similar products by vendors). I'd consider it an alpha release in it's current form.

pavel_lishin•50m ago

> Great points -- thanks for contributing to the discussion.

Again, like my other comment, I'm not being snarky - but it feels weird and inauthentic to be thanked for contributing to the discussion, and sounds like you ran my comment through ChatGPT as well and asked it to post a response.

But the end result indeed sounds like more than just a trivial app, so I am genuinely impressed it's working well. (I'd love to see how long it would take another team of engineers, or another single engineer I guess, to re-create with the same functionality.)

zachwills•4m ago

Haha, I know what you mean. I just believe in being civil. Some of the comments on here are just rude; so I am grateful when people are constructive or generally active in the discussion vs. random hate or bashing.

esafak•1h ago

$6K/week is a senior engineer's salary. How long would it have taken a crack coder to do the work?

zachwills•1h ago

Good question, potentially they could have done the same. I'm not an IC day-to-day, so this was me getting back in the saddle. I imagine someone who is currently a high-level IC using the approaches I did could have accomplished even more.

doctorpangloss•1h ago

What did you use to write the blog post?

striking•1h ago

I think it's ChatGPT (the headings have a particular tone, the lists, etc.; maybe 4o?) but with a good starting point and some manual editing. I normally find it pretty hard to get through AI-written content but this was for the most part inoffensive to my tastes (though still pretty rough at points).

zachwills•1h ago

Copy/paste from another comment: "My process is to dictate and word vomit all of my thoughts to AI (in this case Gemini 2.5 Pro) and then refine from there. I know there's some amount of "smells like AI" in the writing, but at this point I don't think it takes away from the lessons shared. That said, maybe it's worth it to modify things so there is less "AI sounding" verbiage. Don't want folks to write it off as "AI Slop" because of certain phrases. TBD!"

Mind sharing which points were pretty rough? I don't want the message to get lost in the "oof this is AI slop" mess, so always interested in what might be hitting folks the wrong way so I can adapt.

zachwills•1h ago

Copy/pasting from another comment:

> My process is to dictate and word vomit all of my thoughts to AI (in this case Gemini 2.5 Pro) and then refine from there. I know there's some amount of "smells like AI" in the writing, but at this point I don't think it takes away from the lessons shared. That said, maybe it's worth it to modify things so there is less "AI sounding" verbiage. Don't want folks to write it off as "AI Slop" because of certain phrases. TBD!

ariwilson•1h ago

Editing notes on the content - the content of this blog post is clearly AI-enhanced (emdash) which is ok but the tone is not. It's super flowery and oversold - example "It was a shift from a single laser to a powerful floodlight."

lm28469•1h ago

> this wasn't just X, it was Y

I'm not even going to bother reading the rest. Just give me the bullet points you fed the LLM at that point

zachwills•1h ago

My process is to dictate and word vomit all of my thoughts to AI (in this case Gemini 2.5 Pro) and then refine from there. I know there's some amount of "smells like AI" in the writing, but at this point I don't think it takes away from the lessons shared.

That said, maybe it's worth it to modify things so there is less "AI sounding" verbiage. Don't want folks to write it off as "AI Slop" because of certain phrases. TBD!

WhitneyLand•37m ago

Yes you’re right, it’s AI enhanced to the point of sounding not like an engineer authentically trying to communicate with other engineers.

However it’s also the direction given to the AI enhancement - it’s steered toward buzzword/hype style.

“I managed a swarm of AI agents…”? How about instead just “I ran multiple instances of Claude Code and the results seem promising”? No swarms necessary.

mpeg•1h ago

It would be a more interesting blog post if you showed that the app they built is actually production ready.

In my experience, letting LLMs work autonomously just produces code slop.

zachwills•1h ago

Not being controversial, but I genuinely believe it's user error when it's pure slop. There are too many people delivering real value with it for that not to be the case. I think there is an art to how you prompt, what problems you give it, how you break things down, etc. Also knowing what good looks like and being able to steer it in the right direction. The 20 agents I had running at various times were still taking input from me and I was interjecting when I saw them going off track (which happened!).

Model matters too. Lots of what I did was with Opus. Significantly better than Sonnet.

turtlebits•1h ago

No mention of what was actually built?

Also, I have no desire to micromanage a bunch of robots. I rather build things.

zachwills•1h ago

It's an internal tool we're using at my company. Generally offers automatic insights into work being done. I will share more about it when I can!

verdverm•1h ago

There is a similar difference between management track and those that have no interest with humans.

I've primarily stayed working with one, but sometimes two, to first understand how to manage one, more like a pair-programming setup, but also see if they can be used more like a managed developer. They still do too many dumb things to let them loose, I have my doubts this will change without significant change to the underlying models.

Also, you need to spend the time to read what they wrote and check not just for correctness and test passing, but is this even the right approach. They misinterpret the question or task, even when spelled out painful detail. They lose their focus or get caught in loops.They love to write everything from scratch instead of using the helper packages sitting right there. It's like the bell curve flattened out, they are equally impressive as they are astonishingly moronic, context dependent (pun intended)

pavel_lishin•51m ago

> Also, I have no desire to micromanage a bunch of robots. I rather build things.

I gotta say, nearly every thing I've built both at work and at home ends up being another tool I have to end up micro-managing to some degree or another.

skydhash•1h ago

> And, to be clear, when I say production ready I mean it had good test coverage, CI/CD, includes auth, background jobs, and integrates LLMs and other 3rd party APIs. I’d consider it a respectable alpha version of the product.

So does any template off GitHub?

When you run the Laravel Installer, you got pretty much all of these for free (sans LLM) And within a normal work week, you can have a valid prototype with an admin panel (Laravel Nova), and a lot of stuff that the laravel ecosystem provides. The issue is not coding those stuff (they've already been coded for you). The issue is to know what exactly do you want that is not part of some templates, the actual business value of the software.

zachwills•1h ago

You're 100% right; there are ways to move quickly without using AI to write the code. I also agree with you that you need to know what you're doing. If you're a total newbie you likely won't get as far using Laravel or something like that vs. a veteran who's been there & done that before.

pavel_lishin•52m ago

> You're 100% right;

This is a serious question, I'm not being snarky - do you run your comments through AI before posting them here?

zachwills•7m ago

No. That would be too much effort, honestly. Comments are less "polished" and formal IMO.

xnx•1h ago

There's a lot of disagreement about what "agent" means. As far as I can tell it is the same as AI chat, but with a really long response time.

verdverm•1h ago

I have not seen discrepancy in what "agent" means in the AI space. At the top is generally a chat model. Underneath are tools for (i.e.) working with files, MCP servers for more complex actions, and even more models for specialized tasks. It takes longer because there is more going on under the hood

zachwills•1h ago

Yep, agreed with you that agent generally means an LLM with access to tools it knows how to use (like MCPs) that can go relatively far/deep with guidance (minimal or otherwise).

zerocharge•1h ago

It's funny how things have evolved. The modern tech stack is so inefficient at providing solutions that people grab and hold on to code gen as some kind of magic machine. We have gone from pragmatic, fast and stable foundation systems that could be extended by some small snippets of code (Drupal), RAD systems (Delphi, Visual Basic) where you could drop some components and some lines of code on a form to create functionality to some of the most verbose and ugly things that have ever reared its head (React, Angular, home-made REST systems - instead of SOAP). Less code is better. The code generated today using vibe coding won't be looked back upon. It's like fast fashion. Throw it in the trash.

zachwills•1h ago

Thanks for sharing your perspective. I don't agree that code generated by AI will be easily distinguishable from code generated by hand; I think it depends on who is authoring. Also, for what it's worth, if it achieves the business goals, is stable and performant, does it matter?

zerocharge•50m ago

I see where you're coming from. Some code may stick, and some generated code actually looks decent. There is a lot of coffee-fueled code that will never see the light of day again as well. It's just that there has been an inflation in code. Almost like the quantity has a positive quality all on its own, when that should be a negative ("Look at this 50k code app made in 1 day"). I think the next generation of code generators should be penalized for writing verbose code. Hopefully they'll ease the burden of maintaining systems instead of adding to it. My experience is that once a code base gets a certain size, agents start to lose the big picture - and often repeats themselves. Adding similar but not quite the same code. Less code will probably help here too. Still. If it works, it works. It's the next phase after the system's first steps I worry about (extending and maintaining).

WhitneyLand•28m ago

“if it achieves the business goals, is stable and performant, does it matter?”

Yes, if you want code to be maintainable over time, or to be able to evolve as a coherent foundation that lends itself to a long term direction and vision.

zachwills•5m ago

I'd consider that part of the business goals, personally. I also think over time we will end up with more people doing what I talked about in the post, so the code that gets generated by those workflows will become the norm and will impact long-term direction/vision.

FWIW, the vision and direction can and should still be dictated by the human engineer. That's what engineers will be doing IMO vs. coding.

lubujackson•1h ago

This tracks my experience, though I haven't gone into the "swarm" approach. One thing I don't think gets enough discussion is:

> The cognitive load of this new state was immense. After about three hours of intense orchestration, I would feel completely burnt.

I definitely feel that. It's like the job becomes solely architecting the system (not terrible) but you still need to hold the entire program in your head (deeply taxing) while it is updated by a 1,000 tiny cuts which can dislodge your mental model... and then you are screwed. It is important center yourself from time to time and ensure you are on solid ground. The best way to do that is to give things a minute to sink in. Unless I am boilerplating everything, I don't WANT to go faster.

In general, I don't have a lot of desire to go down this orchestrator path. Maybe once tooling and things settle down a bit it can be helpful at times, but usually I get more value from letting things breathe, especially if there are structural changes or new ideas that bubble up and need to be considered.

Before LLMs, I would never re-architect on the fly, but I am more willing to make structural changes mid-flight now because it is so costless. Things I used to architect "good enough" (with a plan to revise in 12 months now) can be done precisely right the first time after going halfway down a path then noodling on it for an a day or so.

This is "Not the Way to Architect," but it sure is an effective way to go from idea -> complete.

bananapub•1h ago

why aren't you more embarrassed about pushing, to HN, a badly written blog post, that you didn't even write?

one of the weirdest things about LLMs now being cheap and easily available is finding out how many people have no pride in their work, and will just "send a PR to the curl team without understanding it" or "publish some crap blog post, on their personal blog, and then push it to HN".

why?

it's great that you found a tool that can generate text for you to pass off as your own, but why did you then stop giving a fuck about what you're passing off as your own?

zachwills•1h ago

I take a lot of pride in my work -- I am still proud of the post, even if some don't like it. Different flavors exist for a reason. I don't hide the fact that I use Wispr Flow, dictate my thoughts, and let AI clean it up, and edit from there. That is the most efficient way to do what I'm doing and juggle what I'm juggling.

Maybe you should contribute some of your ideas vs. bashing others who do?

alaoallaka•1h ago

If anyone has links to maintaining a vibe coded (meaning you trust the LLM to some level as opposed to giving it a thorough review) product over a long period of time I’d love to see that! As is it’s always like this - lines of code, test coverage %, # of commits, etc. Metrics higher ups and others who don’t really know how to measure quality software love to point too. The only metric that’s reliable is the code, but that’s not often shared for projects done this way. And when they are, they don’t live up to the hype.

I’m not claiming vibe coded projects don’t have a place. I’m skeptical that using English prompts can build a maintainable code base as well as using a programming language.

FDA approves new Covid-19 vaccines in US but limits who can get them

As the Colorado River dries up, states angle for influence over water rights

Early gunpowder was made from the 'pisse' of church ladies

I 3D Printed a Kayak in Less Than 24 hours [video]

Zoom Docs: Going Beyond Meetings

Drowning prevention program comes to a halt at the CDC

The One Where We Just Steal the Vulnerabilities

Firefox Has Moved to Firefox.com

Heartbreak as a Service?

Widespread Data Theft Targets Salesforce Instances via Salesloft Drift

Natural Language CSV/TSV transformer – anyone interested?

Why are residential property tax rates regressive

Streamline LLM Evaluation with Stax

RFK Jr. Promises to Reveal the 'Cause' of Autism Next Month

System Eval with Obsidian and Claude Code

Break my algorithm –> take the plaintext $20 Bitcoin you can control (Round 4)

Imgburn.com spread malware (sombody bought mirror website)

Advanced Context Engineering for Agents [video]

Online Safety Act – Blackmailed after submitting identity docs to scam site

Decoding an SF Craigslist "furniture" listing that appears to be a coded drug ad

OAuth Device Flow Vulnerabilities: Analysis of 2024-2025 Attack Wave

Enterprise Sales = the Opposite of "Learn by Doing"

Wearable Lego Shirt by Neil Snowball [video]

Simulating Product–Market Fit Using Python

Big Tech Power Rankings

China sends an AI to its space station

You can now buy your v0 credits with crypto

4chan Sues UK Ofcom over Online Safety Act

First Home Guarantee – Housing Australia

Apparently it's easy to detect LLM-generated text now

FDA approves new Covid-19 vaccines in US but limits who can get them

As the Colorado River dries up, states angle for influence over water rights

Early gunpowder was made from the 'pisse' of church ladies

I 3D Printed a Kayak in Less Than 24 hours [video]

Zoom Docs: Going Beyond Meetings

Drowning prevention program comes to a halt at the CDC

The One Where We Just Steal the Vulnerabilities

Firefox Has Moved to Firefox.com

Heartbreak as a Service?

Widespread Data Theft Targets Salesforce Instances via Salesloft Drift

Natural Language CSV/TSV transformer – anyone interested?

Why are residential property tax rates regressive

Streamline LLM Evaluation with Stax

RFK Jr. Promises to Reveal the 'Cause' of Autism Next Month

System Eval with Obsidian and Claude Code

Break my algorithm –> take the plaintext $20 Bitcoin you can control (Round 4)

Imgburn.com spread malware (sombody bought mirror website)

Advanced Context Engineering for Agents [video]

Online Safety Act – Blackmailed after submitting identity docs to scam site

Decoding an SF Craigslist "furniture" listing that appears to be a coded drug ad

OAuth Device Flow Vulnerabilities: Analysis of 2024-2025 Attack Wave

Enterprise Sales = the Opposite of "Learn by Doing"

Wearable Lego Shirt by Neil Snowball [video]

Simulating Product–Market Fit Using Python

Big Tech Power Rankings

China sends an AI to its space station

You can now buy your v0 credits with crypto

4chan Sues UK Ofcom over Online Safety Act

First Home Guarantee – Housing Australia

Apparently it's easy to detect LLM-generated text now

I built a production app in a week by managing a swarm of 20 AI agents

Comments