Building a Personal AI Factory

https://www.john-rush.com/posts/ai-20250701.html

125•derek•7h ago

Comments

IncreasePosts•6h ago

Okay, what is he actually building with this?

I have a problem where half the times I see people talking about their AI workflow, I can't tell if they are talking about some kind of dream workflow that they have, or something they're actually using productively

ClawsOnPaws•6h ago

I keep coming to the same conclusion, which basically is: if I had an LLM write it for me, I just don't care about it. There are 2 projects out of the maybe 50 or so that are LLM generated, and even for those two I cared enough to make changes myself without an LLM. The rest just sit there because one day I thought huh wouldn't it be neat if, and then realized actually I cared more about having that thought than having the result of that thought. Then you end up fighting with different models and implementation details and then it messes up something and you go back and forth about how you actually want it to work, and somehow this is so much more draining and exhausting than just getting the work done manually with some slight completion help perhaps, maybe a little bit of boilerplate fill-in. And yes, this is after writing extensive design docs, then having some reasoning LLM figure out the tasks that need to be completed, then having some models talk back and forth about what needs to happen and while it's happening, and then I spent a whole lot of money on what exactly? Questionably working software that kinda sorta does what I wanted it to do? If I have a clear idea, or an existing codebase, if I end up guiding it along, agents and stuff are pretty cool I guess. But vibe coding? Maybe I'm in the minority here but as soon as it's a non trivial app, not just a random small script or bespoke app kind of deal, it's not fun, I often don't get the results I actually wanted out of it even if I tried to be as specific as I wanted with my prompting and design docs and example data and all that, it's expensive, code is still messy as heck, and at the end I feel like I just spent a whole lot of time actually literally arguing with my computer. Why would I want to do that?

jwpapi•5h ago

I’ve written a full stack monorepo with over 1,000 files alone now. I’ve started with AI doing a lot of the work, but the percentage goes down and down. For me a good codebase is not about how much you’ve written, but about how it’s architectured. I want to have an app that has the best possible user and dev experience meaning its easy to maintain and easy to extend. This is achieved by making code easy to understand, for yourself, for others.

In my case it’s more like developing a mindset building a framework than to push feature after feature. I would think it’s like that for most companies. You can get an unpolished version of most apps easily, but polishing takes 3-5x the time.

Lets not talk about development robustness, backend security etc etc. Like AI has just way too many slippages for me in these cases.

However I would still consider myself a heavy AI user, but I mainly use it to discuss plans,(what google used to be) or to check it if I’ve forgotten anything.

For most features in my app I’m faster typing it out exactly the way I want it. (with a bit of auto-complete) The whole brain-coordination works better.

I guess long talk, but you’re not alone trust your instinct. You don’t seem narrow minded.

ozten•5h ago

What does the full stack monorepo do?

tptacek•4h ago

We just had a story last night about a Python cryptography maintainer using Claude to add formally-verified optimizations to LLVM. I think the ship has sailed on skepticism about whether LLMs are going to produce valuable code; you can follow Simon Willison's blog for more examples.

steveklabnik•6h ago

I'd love to see more specifics here, that is, how Claude and o3 talk to each other, an example session, etc.

breckenedge•6h ago

I presume via Goose via MCP in Claude Code:

> I also have a local mcp which runs Goose and o3.

steveklabnik•6h ago

Ah, I skimmed the docs for Goose but I couldn't figure out exactly what it is that it does, which is a common issue for docs.

For example: https://block.github.io/goose/docs/category/tutorials/ I just want to see an example workflow before I set this up in CI or build a custom extension to it!

breckenedge•6h ago

Classic Steve Klabnik comment.

IncreasePosts•5h ago

An uncommon Aaron Breckenridge comment

steveklabnik•4h ago

It's true that I deeply care about docs! Turns out they're good for both humans and LLMs :)

schmookeeg•5h ago

I use Zen MCP and OpenRouter. Every once in awhile, my instance of claude code will "phone a friend" and use Gemini for a code review. Often unprompted, sometimes me asking for "analysis" or "ultrathink" about a thorny feature when I doubt the proposed implementation will work out or cause footguns.

It's wild to see in action when it's unprompted.

For planning, I usually do a trip out to Gemini to check our work, offer ideas, research, and ratings of completeness. The iterations seem to be helpful, at least to me.

Everyone in these sorta threads asks for "proofs" and I don't really know what to offer. It's like 4 cents for a second opinion on what claude's planning has cooked up, and the detailed response has been interesting.

I loaded 10 bucks onto OpenRouter last month and I think I've pulled it down by like 50 cents. Meanwhile I'm on Claude Max @ $200/mo and GPT Plus for another $20. The OpenRouter stuff seems like less than couch change.

$0.02 :D

conradev•5h ago

proof -> show the code if you can!

Then engineers can judge for themselves

schmookeeg•5h ago

Yeahhhhhh I've been to enough code reviews / PR reviews to know this will result in 100 opinions about what color the drapes should be and what a catastrophe we've vibe coded for ourselves. If I shoot something to GH I'll highlight it for others, but nothing yet. I can appreciate this makes me look like I'm shilling.

It makes usable code for my projects. It often gets into the weeds and makes weird tesseracts of nonsense that I need to discover, tear down, and re-prompt it to not do that again.

It's cheap or free to try. It saves me time, particularly in languages I am not used to daily driving. Funnily enough, I get madder when I have it write ts/py/sql code since I'm most conversant in those, but for fringe stuff that I find tedious like AWS config and tests -- it mostly just works.

Will it rot my brain? Maybe? If this thing turns me from an engineer to a PM, well, I'll have nobody to blame but myself as I irritate other engineers and demand they fibonacci-size underdefined jira tix. :D

I think there's going to be a lot of momentum in this direction in the coming year. I'm fortunate that my clients embrace this stuff and we all look for the same hallucinations in the codebase and shut them down and laugh together, but I worry that I'm not exactly justifying my rate by being an LLM babysitter.

steveklabnik•5h ago

It’s not about proof: it’s that at this point I’m a fairly heavy Claude Code user and I’d like to up my game, but I’m also not so up on many of these details that I can just figure out how to give this a try just from the description of it. I’m already doing plan-up-front workflows with just Claude, but haven’t figured out some of this more advanced stuff.

I have two MCPs installed (playwright and context7) but it never seems like Claude decides to reach for them on its own.

I definitely appreciate why you’re not posting code, as you said in another comment.

Aeolun•53m ago

> I have two MCPs installed (playwright and context7) but it never seems like Claude decides to reach for them on its own.

Not even when you add ‘memories’ that tell it to always use those tools in certain situations?

My admonitions to always run repomix at the start of coding, and always run the build command before crying victory seem to be followed pretty well anyway.

Uehreka•2h ago

> Everyone in these sorta threads asks for "proofs" and I don't really know what to offer

I’ve tried building these kinds of multi agent systems a couple times, and I’ve found that there’s a razor thin edge between a nice “humming along” system I feel good about and a “car won’t start” system where the first LLM refuses to properly output JSON and then the rest of them start reading each others <think> thoughts.

The difference seems to often come down to:

- Which LLM wrappers are you using? Are they using/exposing features like MCP, tools and chain-of-thought correctly for the particular models you’re using?

- What are your prompts? What are the 5 bullet points with capital letters that need to be in there to keep things in line? Is there a trick to getting certain LLMs to actually use the available MCP tools?

- Which particular LLM versions are you using? I’ve heard people say that Claude Sonnet 4 is actually better than Claude Opus 4 sometimes, so it’s not always an intuitive “pick the best model” kind of thing.

- Is your system capable of “humming along” for hours or is this a thing where you’re doing a ton of copy-paste between interfaces? If it’s the latter then hey, whatever works for you works for you. But a lot of people see the former as a difficult-to-attain Holy Grail, so if you’ve figured out the exact mixture of prompts/tools that makes that happen people are gonna want to know the details.

The overall wisdom in the post about inputs mattering more than outputs etc is totally spot on, and anyone who hasn’t figured that out yet should master that before getting into these weeds. But for those of us who are on that level, we’d love to know more about exactly what you’re getting out of this and how you’re doing it.

(And thanks for the details you’ve provided so far! I’ll have to check out Zen MCP)

skybrian•6h ago

> It’s essentially free to fire off a dozen attempts at a task - so I do.

What sort of subscription plan is that?

steveklabnik•6h ago

Claude Code's $200 Max subscription can take a lot of usage. I haven't done a dozen things at once, but I have worked on two side projects simultaneously with it before.

ccusage shows me getting over 10x the value of paying via API tokens this month so far...

simonw•4h ago

I had to look that up: https://github.com/ryoppippi/ccusage

  npx ccusage@latest

Outputs a table of your token usage over the last few days, which it reads from the jsonl files that Claude Code leaves tucked away in the ~/.claude/ directory.

steveklabnik•4h ago

Don’t sleep on the other options either, the live updates are cool, see where you’re at in the five hour session.

Aeolun•50m ago

Given you can nearly run two full code instances with Opus, and Opus is claimed to be 5x more expensive than Sonnet, you can maybe do 10 sonnet instances at the same time?

photon_garden•6h ago

It’s hard to evaluate setups like this without knowing how the resulting code is being used.

Standalone vibe coded apps for personal use? Pretty easy to believe.

Writing high quality code in a complex production system? Much harder to believe.

kasey_junk•5h ago

I don’t really understand this article or the workflow it’s describing as it’s kind of vague.

But I use multiple agents talking to each other, async agents, git work trees etc on complex production systems as my day to day workflow. I wouldn’t say I go so far as to never change the outputs but I certainly view it as signal when I don’t get the outputs I want that I need to work on my workflow.

9cb14c1ec0•4h ago

Exactly. I use claude code as a major speedup in coding, but I stay in the loop on every code change to make sure it is creating an optimal system. The few times that I've just let it run have resulted in bugs that customers had to deal with.

Aeolun•58m ago

I think you can probably get a pretty decent thing going if you have models review output they haven’t written themselves (not still in context anyway)

solomonb•6h ago

> When something goes wrong, I don’t hand-patch the generated code. I don’t argue with claude. Instead, I adjust the plan, the prompts, or the agent mix so the next run is correct by construction.

I don't think "correct by construction" means what OP thinks it means.

btbuildem•5h ago

Also, aren't they just rolling the dice here? Can you turn down the temperature via Claude Code?

vFunct•5h ago

The issue I'm facing with multiple agents working on separate work trees is that each independent agent tends to have completely different ideas on absolutely every detail, leading to inconsistent user experience.

For example, an agent working on the dashboard for the Documents portion of my project has a completely different idea from the agent working on the dashboard for the Design portion of my project. The design consistency is not there, not just visually, but architecturally. Database schema and API ideas are inconsistent, for example. Even on the same input things are wildly different. It seems that if it can be different, it will be different.

You start to update instruction files to get things consistent, but then these end up being thousands of lines on a large project just to get the foundations right, eating into the context window.

I think ultimately we might need smaller language models trained on certain rules & schemas only, instead of on the universe of ideas that a prompt could result in. Small language models are likely the correct path.

Swizec•5h ago

> each independent agent tends to have completely different ideas on absolutely every detail, leading to inconsistent user experience

> The design consistency is not there, not just visually, but architecturally.

Seniors always gonna have to senior. Doesn't matter if the coders are AI or humans. You have to make sure you provide enough structures for the agents to move in roughly the same direction while allowing enough flexibility that you're not better off just writing the code.

pjm331•5h ago

I’ve had success with building the first version of a thing mostly by hand and then telling Claude code to look at it as an example of how to do things when building the next N of them

swader999•2h ago

The things that work on a regular dev team translate well to the agentic mode.

marviel•5h ago

Thanks for the writeup!

I talked about a similar, but slightly simpler workflow in my post on "Vibe Specs".

https://lukebechtel.com/blog/vibe-speccing

I use these rules in all my codebases now. They essentially cause the AI to do two things differently:

(1) ask me questions first (2) Create a `spec.md` doc, before writing any code.

Seems not too dissimilar from yours, but I limit it to a single LLM

rolha-capoeira•2h ago

I guess a lot of us are trying this (naturally) as solo devs, where we can take an engineering-first mindset and build a machine or factory that spits out gizmos. I haven't gotten to the finish line, mostly because for me, the holy grail is code confidence via e2e tests that the agent generated (separately, not alongside the implementation).

marviel•2h ago

Totally. Yeah I think your approach is a solid take!

simonw•4h ago

My hunch is that this article is going to be almost completely impenetrable to people who haven't yet had the "aha" moment with Claude Code.

That's the moment when you let "claude --dangerously-skip-permissions" go to work on a difficult problem and watch it crunch away by itself for a couple of minutes running a bewildering array of tools until the problem is fixed.

I had it compile, run and debug a Mandelbrot fractal generator in 486 assembly today, executing in Docker on my Mac, just to see how well it could do. It did great! https://gist.github.com/simonw/ba1e9fa26fc8af08934d7bc0805b9...

gerdesj•4h ago

Crack on - this is YC!

Why are you not already a unicorn?

lucubratory•4h ago

An LLM wrapper does not have serious revenue potential. Being able to do very impressive things with Claude Code has a pretty strict ceiling on valuation because at any point Anthropic could destroy your business by removing access, incorporating whatever you're doing into their core feature set, etc.

petesergeant•3h ago

Having worked with some serious pieces of enterprise software, I don't think this is right. Anthropic is not going to perfect multi-vendor integrations, spin up a support team, and solution architect your problems for you. Enterprise software gets into the walls, and can be very hard to displace once deployed. If you build an LLM-wrapper resume parser, once you've got it into your client's workflows, they're going to find it hard to unembed it to replace it with raw Anthropic.

ffsm8•25m ago

But if you did become a unicorn, It would suddenly become very easy to replace for anthropic, because they're the ones actually providing the sauce and can just replicate your efforts. So your window of opportunity is to be too small for anthropic to notice and get interested. That can't be called unicorn

That was the point he was making, at least that's how I understood it

zackify•4h ago

If it helps anyone else. I downgraded from Claude max to pro for $20 and the usage limits are really good.

I think they’re trying to compete with Gemini cli and now I’m glad I’m paying less

ffsm8•27m ago

you will run through the pro rate limiting within <1h if you do it the way the article lays out.

But yeah, if you're babysitting a single agent, only applying after reading what it wants to do ... You'll be fine for 3-4 hours before the token limit refreshed after the 5th

csomar•25m ago

I am on max and burning daily (ccusage) roughly my monthly subscription. It is not clear whether the API is very overpriced or we are getting aggressively subsidized. I can afford $100-200/month but not $3.000. Let's hope this last for a good while as GitHub copilot turned off the tap on unlimited usage very recently.

low_common•3h ago

That's a pretty trivial example for one of these IDEs to knock out. Assembly is certainly in their training sets, and obviously docker is too. I've watched cursor absolutely run amok when I let it play around in some of my codebase.

I'm bullish it'll get there sooner rather than later, but we're not there yet.

simonw•1h ago

I think the hardest problem in computer science right now may be coming up with an LLM demo that doesn't get called "pretty trivial".

fragmede•1h ago

I think Cloudflare's oauth library qualifies https://news.ycombinator.com/item?id=44159166

skydhash•1h ago

Because they are trivial in a way that you can go on GitHub and copy one of those while not pretending LLM isn't a mashup of the internet.

What people agree on being non-trivial is working on a real project. There's a lot of opensource projects that could benefit from a useful code contribution. But they only got slop thrown at them.

simonw•1h ago

How about landing a compiler optimization in LLVM? https://simonwillison.net/2025/Jun/30/llvm/

(Someone on here already called that a "tinkertoy greenfield project" yesterday.)

skydhash•25m ago

I took the time to investigate the work being done there (all those years learning assembly and computer architecture come in handy), and it confirms (to me) that the key aspect of using LLM is pattern matching. Meaning you know that there's a solution out there (in this case, anything involving multiplying/dividing by a power of 2 can use such trick) and framing your problem (intentionally or not) and you'll get a derived text that will contain a possible solution.

But there's nothing truly novel in the result. The key aspect is being similar enough to something that's already in the training data so that the LLM can extrapolate the rest. The hint can be quite useful and sometimes you have something that shorten the implementation time, but you have to at least have some basic understanding of the domain in order to recognize the signs.

The issue is that the result is always tainted by your prompt. The signs may be there because of your prompt and not because there's some kind of data that need s to be explored further. And sometimes it's a bad fit, similar but different (what you want and what you get). So for the few domain that's valuable to me, I prefer to construct my own mental database that can lead me to concrete artifacts (books, articles, blog posts,...) that exists outside the influence of my query.

ADDENDUM

I can use LLMs with great results and I've done so. But it's more rewarding (and more useful to me) to actually think through the problem and learning from references. Instead of getting a perfect (or wobbly or the wrong category) circle that fits my query, I go to find a strange polygon formed (by me) from other strange polygon. Then because I know I need a circle, I only need to find its center and its radius.

It's slower, but the next time I need another circle (or a square) from the same polygon, it's going to be faster and faster.

csomar•20m ago

That's a very simple example/context that I suspect most LLMs will be able to knock out with minimal frustration. I had much more complex Rust dependency upgrade done on a 30+ iterations on very custom code (wasm stuff where training data is probably scarce). Claude would ping context7 and mcp-lsp to get details. You do find its limits after a while though and as you push it harder.

dkdcio•4h ago

I went down this (and even built a bit of internal web tooling) —- it’s like playing multiple games of online poker for me (instead of the factoria analogy here)

it’s really promising, but I found focusing on a single task and doing it well is still more efficient for now. excited for where this goes

geekymartian•4h ago

ADHD coding, brute forcing product generation until you get it right? Just freaking write the code that you can expand and modify in the future instead of increasing your carbon footprint.

cube00•4h ago

The end goal is to remove the developer from this equation.

Business owner asks for a new CRUD app and there it is in production.

Of course it's full of full of bugs, slow as syrup, saves to a public unauthed database but that's none of my business *gulps scalding hot tea*

gerdesj•4h ago

"Here’s the secret sauce: iterate the inputs":

No it isn't. There are no short cuts to ... anything. You expend a lot of input for a lot of output and I'm not too sure you understand why.

"Example: an agent once wrote code ..." - not exactly world beating.

If you believe this will take over the world, then go full on startup. YC is your oyster.

I've run my own firm for 25 years. Nothing exciting and certainly not YC excitable.

You wont with this.

tranchebald•2h ago

You come across as a massive hater. Maybe it’s a cultural thing. Do you actually have employees?

apwell23•4h ago

ppl are getting slowly disillusioned with vibe coding.

yes AI assisted workflow might be here to stay but it won't be the magical put programmers out of job thing.

And this the best product market fit for LLMs. I imagine it will be even worse in other domains.

azan_•4h ago

Are they though? I’m seeing more and more people that used gpt4 and got substandard results get blown away with Claude code and opus once they gave it a chance. Also remember that progress has not stopped (whether it has slowed down is also controversial), so I wouldn’t make strong assumptions that ai won’t replace many devs. I hope it won’t, I really like intellectual work associated with it.

petesergeant•3h ago

> ppl are getting slowly disillusioned with vibe coding.

This is the absolute polar opposite from my experience. I'm in a large non-tech community with a coders channel, and every day we get a few more Claude Code converts. I would say that vibe-coding is moving into the main-stream with experienced, professional developers who were deeply skeptical a few months ago. It's no longer fancy auto-complete: I have myself seen the magic of wishing a (low importance) front-end app into existence from scratch in an hour or so that would have taken me an order of magnitude more time beforehand.

apwell23•2h ago

oh yea thats true. I was talking more about ppl who have been vibe coding for a while.

https://www.reddit.com/r/ClaudeAI/comments/1loj3a0/this_pret...

c4pt0r•4h ago

Maybe a bit off-topic, but the minimalist style of the blog looks really cool.

petesergeant•3h ago

> I keep several claude code windows open, each on its own git-worktree.

Can someone convince me they're doing their due-diligence on this code if they're using this approach? I am smart and I am experienced, and I have trouble keeping on top of the changes and subtle bugs being created by one Claude Code.

webprofusion•1h ago

The basic idea is that you can continuously document what your system should do (high level and detailed features), how it should prove it has done that, optionally how you want it to do it (architecture and code style etc).

The multi-model AI part is just the (current) tool to help avoid bias and make fine tuned selections for certain parts of the task.

Eventually large complex systems will be built and re-built from a set of requirements and software will finally match the stated requirements. The only "legacy code" will be legacy requirements specifications. Fix your requirements, not the generated code.

guicen•1h ago

This "AI factory for everyone" model may be able to break resource inequality and allow people from more places to participate in truly valuable entrepreneurship.

namuol•1h ago

No real mention of results that aren’t self-referential.

I guess vibe-coding is on its way to becoming the next 3D printing: Expensive hobby best suited for endless tinkering. What’s today’s vibe coding equivalent of a “benchy”? Todo apps?

SchemaLoad•1h ago

3D printing actually is useful though. Basically everyone designing products or any kind of engineering is using it. The only reason it never took off for the average consumer is that every pre designed piece of plastic junk you could ever want to download and print is already available from Amazon.

In a pre online shopping world 3D printing would be far more useful for the average person. Going forward it looks like it's only really useful for people who can design their own files for actually custom stuff you can't buy.

namuol•37m ago

Yeah I’m not saying either aren’t useful, just that they can both be a trap for tinkerers.

am17an•1h ago

I actually don't understand how you can offload the instruction pointer of the program to another program, permanently. How are you accountable for anything then? You can't debug, you can't program, just a tourist in your own home. Own your code, even if AI wrote it.

The Fed says this is a cube of $1M. They're off by half a million

Hilbert's sixth problem: derivation of fluid equations via Boltzmann's theory

Fakespot shuts down today after 9 years of detecting fake product reviews

Figma Files Registration Statement for Proposed Initial Public Offering

Why Do Swallows Fly to the Korean DMZ?

Code⇄GUI bidirectional editing via LSP

Feasibility study of a mission to Sedna - Nuclear propulsion and solar sailing

Show HN: I made a 2D game engine in Dart

Ask HN: Who is hiring? (July 2025)

The Roman Roads Research Association

Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages

Soldier's wrist purse discovered at Roman legionary camp

I built something that changed my friend group's social fabric

Building a Personal AI Factory

Ask HN: Who wants to be hired? (July 2025)

Effectiveness of trees in reducing temperature, outdoor heat exposure in Vegas

Victory Shoot: Hanemono in Toy Form

OpenFLOW – Quickly make beautiful infrastructure diagrams local to your machine

Australians to face age checks from search engines

Show HN: Core – open source memory graph for LLMs – shareable, user owned

Converting a large mathematical software package written in C++ to C++20 modules

Show HN: Jobs by Referral: Find jobs in your LinkedIn network

The Hoyle State (2021)

Graph Theory Applications in Video Games

Swearing as a Response to Pain: Assessing Effects of Novel Swear Words

Voyage of Magellan – Epilogue: Sailor of Eternal Fame

The wanton destruction of a creative-tech era

Cua (YC X25) is hiring an engineer

Ask HN: Freelancer? Seeking freelancer? (July 2025)

All Good Editors Are Pirates: In Memory of Lewis H. Lapham

Building a Personal AI Factory

Comments

The Fed says this is a cube of $1M. They're off by half a million

Hilbert's sixth problem: derivation of fluid equations via Boltzmann's theory

Fakespot shuts down today after 9 years of detecting fake product reviews

Figma Files Registration Statement for Proposed Initial Public Offering

Why Do Swallows Fly to the Korean DMZ?

Code⇄GUI bidirectional editing via LSP

Feasibility study of a mission to Sedna - Nuclear propulsion and solar sailing

Show HN: I made a 2D game engine in Dart

Ask HN: Who is hiring? (July 2025)

The Roman Roads Research Association

Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages

Soldier's wrist purse discovered at Roman legionary camp

I built something that changed my friend group's social fabric

Building a Personal AI Factory

Ask HN: Who wants to be hired? (July 2025)

Effectiveness of trees in reducing temperature, outdoor heat exposure in Vegas

Victory Shoot: Hanemono in Toy Form

OpenFLOW – Quickly make beautiful infrastructure diagrams local to your machine

Australians to face age checks from search engines

Show HN: Core – open source memory graph for LLMs – shareable, user owned

Converting a large mathematical software package written in C++ to C++20 modules

Show HN: Jobs by Referral: Find jobs in your LinkedIn network

The Hoyle State (2021)

Graph Theory Applications in Video Games

Swearing as a Response to Pain: Assessing Effects of Novel Swear Words

Voyage of Magellan – Epilogue: Sailor of Eternal Fame

The wanton destruction of a creative-tech era

Cua (YC X25) is hiring an engineer

Ask HN: Freelancer? Seeking freelancer? (July 2025)

All Good Editors Are Pirates: In Memory of Lewis H. Lapham