Getting AI to work in complex codebases

https://github.com/humanlayer/advanced-context-engineering-for-coding-agents/blob/main/ace-fca.md

103•dhorthy•5h ago

Comments

malfist•2h ago

This article bases its argument on the predicate that AI _at worst_ will increase developer productivity be 0-10%. But several studies have found that not to be true at all. AI can, and does, make some people less effective

dhorthy•2h ago

definitely - the standford video has a slide about how many cases caused people to be even slower than without AI

mcny•1h ago

Question for discussion - what steps can I take as a human to set myself up for success where success is defined by AI made me faster, more efficient etc?

0xblacklight•1h ago

In many cases (though not all) it's the same thing that makes for great engineering managers:

smart generalists with a lot of depth in maybe a couple of things (so they have an appreciation for depth and complexity) but a lot of breadth so they can effectively manage other specialists,

and having great technical communication skills - be able to communicate what you want done and how without over-specifying every detail, or under-specifying tasks in important ways.

Peritract•23m ago

>where success is defined by AI made me faster, more efficient etc?

I think this attitude is part of the problem to me; you're not aiming to be faster or more efficient (and using AI to get there), you're aiming to use AI (to be faster and more efficient).

A sincere approach to improvement wouldn't insist on a tool first.

CharlesW•2h ago

Both (1) "AI can, and does, make some people less effective" and (2) "the average productivity boost (~20%) is significant" (per Stanford's analysis) can be true.

The article at the link is about how to use AI effectively in complex codebases. It emphasizes that the techniques described are "not magic", and makes very reasonable claims.

dingnuts•1h ago

the techniques described sound like just as much work, if not more, than just writing the code. the claimed output isn't even that great, it's comparable to the speed you would expect a skilled engineer to move at in a startup environment

CharlesW•1h ago

> the techniques described sound like just as much work, if not more, than just writing the code.

That's very fair, and I believe that's true for you and for many experienced software developers who are more productive than the average developer. For me, AI-assisted coding is a significant net win.

dhorthy•1h ago

I tend to think about it like vim - you will feel slow and annoyed for the first few weeks, but investing in these skills are massive +EV long term

criemen•50m ago

Yet a lot of people never bother to learn vim, and are still outstanding and productive engineers. We're surely not seeing any memos "Reflexive vim usage is now a baseline expectation at [our company]" (context: https://x.com/tobi/status/1909251946235437514)

The as-of-yet unanswered question is: Is this the same? Or will non-LLM-using engineers be left behind?

withinboredom•1h ago

How many skilled engineers can you afford to hire? Vs. Far more mediocre engineers who know how to leverage these tools?

f59b3743•1h ago

You should bet your career on it

dingnuts•1h ago

I have

dhorthy•1h ago

me too :)

simonw•48m ago

"AI can, and does, make some people less effective"

So those people should either stop using it or learn to use it productively. We're not doomed to live in a world where programmers start using AI, lose productivity because of it and then stay in that less productive state.

bgwalter•23m ago

If managers are convinced by stakeholders who relentlessly put out pro-"AI" blog posts, then a subset of programmers can be forced to at least pretend to use "AI".

They can be forced to write in their performance evaluation how much (not if, because they would be fired) "AI" has improved their productivity.

telliott1984•35m ago

There's also the more insidious gap between perceived productivity and actual productivity. Doesn't help that nobody can agree on how to measure productivity even without AI.

tschellenbach•2h ago

I wrote this blogpost on the same topic: https://getstream.io/blog/cursor-ai-large-projects/

It's super effective with the right guardrails and docs. It also works better on languages like Go instead of Python.

dhorthy•2h ago

why do you think go is better than python (i have some thoughts but curious your take)

polishdude20•2h ago

probably because it's typed?

0xblacklight•1h ago

Among other things; coding agents that can get feedback by running a compile step on top of the linter will tend to produce better output.

Also, strongly-typed languages tend to catch more issues through the language server which the agent can touch through LSP.

mholm•1h ago

imo:

1. Go's spec and standard practices are more stable, in my experience. This means the training data is tighter and more likely to work.

2. Go's types give the llm more information on how to use something, versus the python model.

3. Python has been an entry-level accessible language for a long time. This means a lot of the code in the training set is by amateurs. Go, ime, is never someone's first language. So you effectively only get code from someone who has already has other programming experience.

4. Go doesn't do much 'weird' stuff. It's not hard to wrap your head around.

dhorthy•1h ago

yeah i love that there is a lot of source data for "what is good idiomatic go" - the model doesn't have it all in the training set but you can easily collect coding standards for go with deep research or something

And then I find models try to write scripts/manual workflows for testing, but Go is REALLY good for doing what you might do in a bash script, and so you can steer the model to build its own feedback loop as a harness in go integration tests (we do a lot of this in github.com/humanlayer/humanlayer/tree/main/hld)

ath3nd•2h ago

Why though. Why should we do that?

If AI is so groundbreaking, why do we have to have guides and jump through 3000 hoops just so we can make it work?

ej88•1h ago

why do we have guides and lessons on how to use a chainsaw when we can hack the tree with an axe?

leptons•1h ago

The chainsaw doesn't sometimes chop off your arm when you are using it correctly.

crent•1h ago

If you swing an axe with a lack of hand eye coordination you don't think it's possible to seriously injure yourself?

logicchains•1h ago

Even if we had perfectly human-level AI it'd still need management, just like human workers do, and turns out effective management is actually nontrivial.

bluefirebrand•1h ago

I don't want to effectively manage the idiot box

I want to do the work

spaniard89277•1h ago

Because now your manager will measure on LOCs against other engineers again and it's only software engineers worrying about complexity, maintainability, and, in summary, the health of the very creature it's going to pay your salary.

This is the new world we live in. Anyone who actually likes coding should seriously look for other venues because this industry is for other type of people now.

I use AI in my job. I went from tolerable (not doing anything fancy) to unbearable.

I'm actually looking to become a council employee with a boring job and code my own stuff, because if this is what I have to do moving forward, I rather go back to non-coding jobs.

dhorthy•1h ago

i strongly disagree with this - if anything, using AI to code real production code in real complex codebase is MORE technical than just writing software.

Staff/Principal engineers already spend a lot more time designing systems than writing code. They care a lot about complexity, maintainability, and good architecture.

The best people I know who have been using these techniques are former CTOs, former core Kubernetes contributors, have built platforms for CRDTs at scale, and many other HIGHLY technical pursuits.

0xblacklight•1h ago

if nuclear power is so much better than coal, why do we need to learn how to safely operate a reactor just to make it work? Coal is so much easier

shafyy•2h ago

> Heck even Amjad was on a lenny's podcast 9 months ago talking about how PMs use Replit agent to prototype new stuff and then they hand it off to engineers to implement for production.

Please kill me now

jb2403•1h ago

It’s refreshing to read a full article this was written by a human. Content +++

cheschire•1h ago

As an aside, this single markdown file as an entire GitHub repo is a unique approach to blog posts.

dhorthy•1h ago

s/unique/lazy

merlincorey•1h ago

It seems we're still collectively trying to figure out the boundaries of "delegation" versus "abstraction" which I personally don't think are the same thing, though they are certainly related and if you squint a bit you can easily argue for one or the other in many situations.

> We've gotten claude code to handle 300k LOC Rust codebases, ship a week's worth of work in a day, and maintain code quality that passes expert review.

This seems more like delegation just like if one delegated a coding task to another engineer and reviewed it.

> That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly (which, for most of us, is never).

This seems more like abstraction just like if one considers Python a sort of higher level layer above C and C a higher level layer above Assembly, except now the language is English.

Can it really be both?

dhorthy•1h ago

I would say its much more about abstraction and the leverage abstractions give you.

You'll also note that while I talk about "spec driven development", most of the tactical stuff we've proven out is downstream of having a good spec.

But in the end a good spec is probably "the right abstraction" and most of these techniques fall out as implementation details. But to paraphrase sandy metz - better to stay in the details than to accidentally build against the wrong abstraction (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction)

I don't think delegation is right - when me and vaibhav shipped a week's worth of work in a day, we were DEEPLY engaged with the work, we didn't step away from the desk, we were constantly resteering and probably sent 50+ user messages that day, in addition to some point-edits to markdown files along the way.

fusslo•1h ago

Maybe I am just misunderstanding. I probably am; seems like it happens more and more often these days

But.. I hate this. I hate the idea of learning to manage the machine's context to do work. This reads like a lecture in an MBA class about managing certain types of engineers, not like an engineering doc.

Never have I wanted to manage people. And never have I even considered my job would be to find the optimum path to the machine writing my code.

Maybe firmware is special (I write firmware)... I doubt it. We have a cursor subscription and are expected to use it on production codebases. Business leaders are pushing it HARD. To be a leader in my job, I don't need to know algorithms, design patterns, C, make, how to debug, how to work with memory mapped io, what wear leveling is, etc.. I need to know 'compaction' and 'context engineering'

I feel like a ship corker inspecting a riveted hull

jnwatson•1h ago

I've started to use agents on some very low-level code, and have middling results. For pure algorithmic stuff, it works great. But I asked it to write me some arm64 assembly and it failed miserably. It couldn't keep track of which registers were which.

jmkni•1h ago

I imagine the LLM's have been trained on a lot less firmware code than say, HTML

dolebirchwood•12m ago

Guess it boils down to personality, but I personally love it. I got into coding later in life, and coming from a career that involved reading and writing voluminous amounts of text in English. I got into programming because I wanted to build web applications, not out of any love for the process of programming in and of itself. The less I have to think and write in code, the better. Much happier to be reading it and reviewing it than writing it myself.

daxfohl•1h ago

> And yeah sure, let's try to spend as many tokens as possible

It'd be nice if the article included the cost for each project. A 35k LOC change in a 350k codebase with a bunch of back and forth and context rewriting over 7 hours, would that be a regular subscription, max subscription, or would that not even cover it?

CharlesW•1h ago

From a cost perspective, you would definitely want a Claude Max subscription for this.

dhorthy•1h ago

yes - correct. For the record, if spending raw tokens, the 2 prs to baml cost about $650.

but yes we switched off per-token this week because we ran out of anthropic credits, we're on max plan now

daxfohl•42m ago

Haha, when I asked claude the question, it estimated $20-45. https://claude.ai/share/5c3b0592-7bc9-4c40-9049-459058b16920

Horrible, right? When I asked gemini, it guessed 37 cents! https://g.co/gemini/share/ff3ed97634ba

daxfohl•1h ago

Oh, oops it says further down

> oh, and yeah, our team of three is averaging about $12k on opus per month

I'll have to admit, I was intrigued with the workflow at first. But emm, okay, yeah, I'll keep handwriting my open source contributions for a while.

vanillax•1h ago

Doesnt githubs new speckit solve this? https://github.com/github/spec-kit

0xblacklight•1h ago

how does this solve it?

ooopakaj•1h ago

I’m not an expert in either language, but seeing a 20k LoC PR go up (linked in the article) would be an instant “lgtm, asshole” kind of review.

> I had to learn to let go of reading every line of PR code

Ah. And I’m over here struggling to get my teammates to read lines that aren’t in the PR.

Ah well, if this stuff works out it’ll be commoditized like the author said and I’ll catch up later. Hard to evaluate the article given the authors financial interest in this succeeding and my lack of domain expertise.

ActionHank•44m ago

I dunno man, I usually close the PR when someone does that and tell them to make more atomic changes.

Would you trust an colleague who is over confident, lies all the time, and then pushes a huge PR? I wouldn't.

Our_Benefactors•37m ago

Closing someone else’s PR is an actively hostile move. Opening a 20k LOC isn’t great either, but going ahead and closing it is rude as hell.

ActionHank•34m ago

Dumping a huge PR across a shared codebase wherein everyone else also has to deal with the risk of you monumental changes is pretty rude as well, I would even go so far as to say that it is likely selfishly risky.

bak3y•33m ago

Opening a 20k LOC PR is an actively hostile move worthy of an appropriate response.

Closed > will not review > make more atomic changes.

wwweston•13m ago

A 20k LOC PR isn’t reviewable in any normal workflow/process.

The only moves are refusing to review it, taking it up the chain of authority, or rubber stamping it with a note to the effect that it’s effectively unreviewable so rubber stamping must be the desired outcome.

GoatInGrey•7m ago

If somebody did this, it means they ignored their team's conventions and offloaded work onto colleagues for their own convenience. Being considered rude by the offender is not a concern of mine when dealing with a report who pulls this kind of antisocial crap.

koakuma-chan•1h ago

Context has never been the bottleneck for me. AI just stops working when I reach certain things that AI doesn't know how to do.

jmkni•1h ago

My problem is it keeps working, even when it reaches certain things it doesn't know how to do.

I've been experimenting with Github agents recently, they use GPT-5 to write loads of code, and even make sure it compiles and "runs" before ending the task.

Then you go and run it and it's just garbage, yeah it's technically building and running "something", but often it's not anything like what you asked for, and it's splurged out so much code you can't even fix it.

Then I go and write it myself like the old days.

koakuma-chan•54m ago

I have same experience with CC. It loves to comment out code, add a "fallback" implementation that returns mock data, and act like the thing works.

0xblacklight•1h ago

> Context has never been the bottleneck for me. AI just stops working when I reach certain things that AI doesn't know how to do.

It's context all the way down. That just means you need to find and give it the context to enable it to figure out how to do the thing. Docs, manuals, whatever. Same stuff that you would use to enable a human that doesn't know how to do it to figure out how.

koakuma-chan•1h ago

At that point it's easier to implement the thing yourself, and then let AI work with that.

bluefirebrand•1h ago

Or just forget the AI entirely, if you can build it yourself then do it yourself

I treat "uses AI tools" as a signal that a person doesn't know what they are doing

lacy_tinpot•1h ago

Specifically what did you have difficulty implementing where it "just stops working"?

koakuma-chan•1h ago

Anything it has not been trained on. Try getting AI to use OpenAI's responses API. You will have to try very hard to convince it not to use the chat completions API.

anthonypasq•1h ago

in cursor you can index docs by just adding a url and then reference it like file context in the editor

dhorthy•1h ago

yeah once again you need the right context to override what's in the weights. It may not know how to use the responses api, so you need to provide examples in context (or tools to fetch them)

philipp-gayret•1h ago

Can't agree with the formula for performance, on the "/ size" part. You can have a huge codebase, but if the complexity goes up with size then you are screwed. Wouldn't a huge but simple codebase be practical and fine for AI to deal with?

The hierarchy of leverage concept is great! Love it. (Can't say I like the 1 bad line of CLAUDE.md is 100K lines of bad code; I've had some bad lines in my CLAUDE.md from time to time - I almost always let Claude write it's own CLAUDE.md.).

dhorthy•1h ago

i mean there's also the fact that claude code injects this system message into your claude.md which means that even if your claude.md sucks you will probably be okay:

<system-reminder> IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context or otherwise consider it in your response unless it is highly relevant to your task. Most of the time, it is not relevant. </system-reminder>

lots of others have written about this so i won't go deep but its a clear product decision, but if you don't know what's in your context window, you can't respond/architect your balance between claude.md and /commands well.

jascha_eng•1h ago

Except for ofc pushing their own product (humanlayer) and some very complex prompt template+agent setups that are probably overkill for most, the basics in this post about compaction and doing human review at the correct level are pretty good pointers. And giving a bit of a framework to think within is also neat

hellovai•1h ago

if you haven't tried the research -> plan -> implementation approach here, you are missing out on how good LLMs are. it completely changed my perspective.

the key part was really just explicitly thinking about different levels of abstraction at different levels of vibecoding. I was doing it before, but not explicitly in discrete steps and that was where i got into messes. The prior approach made check pointing / reverting very difficult.

When i think of everything in phases, i do similar stuff w/ my git commits at "phase" levels, which makes design decision easier to make.

I also do spend ~4-5 hours cleaning up the code at the very very end once everything works. But its still way faster than writing hard features myself.

0xblacklight•1h ago

tbh I think the thing that's making this new approach so hard to adopt for many people is the word "vibecoding"

Like yes vibecoding in the lovable-esque "give me an app that does XYZ" manner is obviously ridiculous and wrong, and will result in slop. Building any serious app based on "vibes" is stupid.

But if you're doing this right, you are not "coding" in any traditional sense of the word, and you are *definitely* not relying on vibes

Maybe we need a new word

dhorthy•1h ago

alex reibman proposed hyperengineering

i've also heard "aura coding", "spec-driven development" and a bunch of others I don't love.

but we def need a new word cause vibe coding aint it

simonw•49m ago

I'm sticking to the original definition of "vibe coding", which is AI-generated code that you don't review.

If you're properly reviewing the code, you're programming.

The challenge is finding a good term for code that's responsibly written with AI assistance. I've been calling it "AI-assisted programming" but that's WAY too long.

giancarlostoro•44m ago

> but not explicitly in discrete steps and that was where i got into messes.

I've said this repeatedly, I mostly use it for boilerplate code, or when I'm having a brain fart of sorts, I still love to solve things for myself, but AI can take me from "I know I want x, y, z" to "oh look I got to x, y, z in under 30 minutes, which could have taken hours. For side projects this is fine.

I think if you do it piecemeal it should almost always be fine. When you try to tell it to do two much, you and the model both don't consider edge cases (Ask it for those too!) and are more prone for a rude awakening eventually.

r2ob•1h ago

I refactored CPython using GPT-5, turning the compiler bilingual for english and portuguese keywords.

https://github.com/ricardoborges/cpython

what web programming task GPT-5 can't handle?

iambateman•1h ago

I built a package which I use for large codebase work[0].

It starts with /feature, and takes a description. Then it analyzes the codebase and asks questions.

Once I’ve answered questions, it writes a plan in markdown. There will be 8-10 markdowns files with descriptions of what it wants to do and full code samples.

Then it does a “code critic” step where it looks for errors. Importantly, this code critic is wrong about 60% of the time. I review its critique and erase a bunch of dumb issues it’s invented.

By that point, I have a concise folder of changes along with my original description, and it’s been checked over. Then all I do is say “go” to Claude Code and it’s off to the races doing each specific task.

This helps it keep from going off the rails, and I’m usually confident that the changes it made were the changes I wanted.

I use this workflow a few times per day for all the bigger tasks and then use regular Claude code when I can be pretty specific about what I want done. It’s proven to be a pretty efficient workflow.

[0] GitHub.com/iambateman/speedrun

iagooar•1h ago

I am working on a project with ~200k LoC, entirely written with AI codegen.

These days I use Codex, with GPT-5-Codex + $200 Pro subscription. I code all day every day and haven't yet seen a single rate limiting issue.

We've come a long way. Just 3-4 months ago, LLMs would start doing a huge mess when faced with a large codebase. They would have massive problems with files with +1k LoC (I know, files should never grow this big).

Until recently, I had to religiously provide the right context to the model to get good results. Codex does not need it anymore.

Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.

My personal workflow when building bigger new features:

1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)

2. Prompt the model to create a PRD

3. CHECK the PRD, improve and enrich it - this can take hours

4. Actually have the AI agent generate the code and lots of tests

5. Use AI code review tools like CodeRabbit, or recently the /review function of Codex, iterate a few times

6. Check and verify manually - often times, there are a few minor bugs still in the implementation, but can be fixed quickly - sometimes I just create a list of what I found and pass it for improving

With this workflow, I am getting extraordinary results.

AMA.

mentos•1h ago

What platform are you developing for, web?

Did you start with Cursor and move to Codex or only ever Codex?

drewnick•1h ago

Not OP, but I use Codex for back-end, scripting, and SQL. Claude Code for most front-end. I have found that when one faces a challenge, the other often can punch through and solve the problem. I even have them work together (moving thoughts and markdown plans back and fourth) and that works wonders.

My progression: Cursor in '24, Roo code mid '25, Claude Code in Q2 '25, Codex CLI in Q3 `25.

iagooar•44m ago

Cursor for me until 3-4 weeks ago, now Codex CLI most of the time.

These tools change all the time, very quickly. Important to stay open to change though.

iagooar•52m ago

Yes, it is a web project with next.js + Typescript + Tailwind + Postgres (Prisma).

I started with Cursor, since it offers a well-rounded IDE with everything you need. It also used to be the best tool for the job. These days Codex + GPT-5-Codex is king. But I sometimes go back to Cursor, especially when reading / editing the PRDs or if I need the ocasional 2nd opinion by Claude.

Mockapapella•1h ago

This sounds very similar to my workflow. Do you have pre-commits or CI beyond testing? I’ve started thinking about my codebase as an RL environment with the pre-commits as hyperparameters. It’s fascinating seeing what coding patterns emerge as a result.

iagooar•45m ago

Yes, I do have automated linting (a bit of a PITA at this scale). On the CI side I am using Github Actions - it does the job, but haven't put much work into it yet.

Generally I have observed that using a statically typed language like Typescript helps catching issues early on. Had much worse results with Ruby.

joshvm•14m ago

I think pre-commit is essential. I enforce conventional commits (+ a hook which limits commit length to 50 chars) and for Python, ruff with many options enabled. Perhaps the most important one is to enforce complexity limits. That will catch a lot of basic mistakes. Any sanity checks that you can make deterministic are a good idea. You could even add unit tests to pre-commit, but I think it's fine to have the model run pytest separately.

The models tend to be very good about syntax, but this sort of linting will often catch dead code like unused variables or arguments.

You do need to rule-prompt that the agent may need to run pre-commit multiple times to verify the changes worked, or to re-add to the commit. Also, frustratingly, you also need to be explicit that pre-commit might fail and it should fix the errors (otherwise sometimes it'll run and say "I ran pre-commit!") For commits there are some other guardrails, like blanket denying git add <wildcard>.

Claude will sometimes complain via its internal monologue when it fails a ton of linter checks and is forced to write complete docstrings for everything. Sometimes you need to nudge it to not give up, and then it will act excited when the number of errors goes down.

criemen•1h ago

What does PRD mean? I never heard that acronym before.

dhorthy•57m ago

https://en.wikipedia.org/wiki/Product_requirements_document

iagooar•57m ago

Product Requirements Document

It is a fairly standardized way of capturing the essens of a new feature. It covers most important aspects of what the feature is about, the goals, the success criteria, even implementation details where it makes sense.

If there is interest, I can share the outline/template of my PRDs.

canadiantim•23m ago

I'd be very interested

jrmiii•58m ago

> Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.

I'm interested in hearing more about this - any resource you can point me at or do you mind elaborating a bit? TIA!

giancarlostoro•52m ago

> 1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)

I just ask it to give me instructions for a coding agent and give it a small description of what I want to do, it looks at my code, and details what I describes as best as it can, and usually I have enough to let Junie (JetBrains AI) run on.

I can't personally justify $200 a month, I would need to see seriously strong results for that much. I use AI piecemeal because it has always been the best way to use it. I still want to understand the codebase. When things break its mostly on you to figure out what broke.

ActionHank•46m ago

lol, my guy invented programming with more steps

nzach•31m ago

What is you opinion on what is the "right level of detail" that we should use when creating technical documents the LLM will use to implement features ?

When I started leaning heavily into LLMs I was using really detailed documentations. Not '20 minutes of voice recordings', but my specification documents would easily hit hundreds of lines even for simple features.

The result was decent, but extremely frustrating. Because it would often deliver 80% to 90% but the final 10% to 20% it could never get right.

So, what I naturally started doing was to care less about the details of the implementation and focus on the behavior I want. And this led me to simpler prompts, to the point that I don't feel the need to create a specification document anymore. I just use the plan mode in Claude Code and it is good enough for me.

One way that I started to think about this was that really specific documentations were almost as if I was 'over-fitting' my solution over other technically viable solutions the model could come up with. One example would be if I want to sort an array, I could either ask for "sort the array" or "merge sort the array". And by forcing a merge sort I may end up with a worse solution. Admittedly sort is a pretty simple and unlikely example, but this could happen with any topic. You may ask the model to use a hash-set but a better solution would be to use a bloom filter.

Given all that, do you think investing so much time into your prompts provides a good ROI compared with the alternative of not really min-maxing every single prompt?

danielbln•30m ago

I can recommend one more thing: tell the LLM frequently to "ask me clarifying questions". It's simple, but the effect is quite dramatic, it really cuts down on ambiguity and wrong directions without having to think about every little thing ahead of time.

daxfohl•19m ago

Have you considered or tried adding steps to create / review an engineering design doc? Jumping straight from PRD to a huge code change seems scary. Granted, given that it's fast and cheap to throw code away and start over, maybe engineering design is a thing of the past. But still, it seems like it would be useful to have it delineate the high-level decisions and tradeoffs before jumping straight into code; once the code is generated it's harder to think about alternative approaches.

GoatInGrey•18m ago

And I assume there's no actual product that customers are using that we could also demo? Because only 1 out of every 20 or so claims of awesomeness actually has a demoable product to back up those claims. The 1 who does usually has immediate problems. Like an invisible text box rendered over the submit button on their Contact Us page preventing an onClick event for that button.

In case it wasn't obvious, I have gone from rabidly bullish on AI to very bearish over the last 18 months. Because I haven't found one instance where AI is running the show and things aren't falling apart in not-always-obvious ways.

prisenco•17m ago

Are you vibe coding or have the 200k LoC been human reviewed?

daxfohl•12m ago

Which of these steps do you think/wish could be automated further? Most of the latter ones seem like throwing independent AI reviewers could almost fully automate it, maybe with a "notify me" option if there's something they aren't confident about? Could PRD review be made more efficient if it was able to color code by level of uncertainty? For 1, could you point it to a feed of customer feedback or something and just have the day's draft PRD up and waiting for you when you wake up each morning?

potamic•37m ago

There are a lot of people declaring this, proclaiming that about working with AI, but nobody presents the details. Talk is cheap, show me the prompts. What will be useful is to check in all the prompts along with code. Every commit generated by AI should include a prompt log recording all the prompts that led to the change. One should be able to walkthrough the prompt log just as they may go through the commit log and observe firsthand how the code was developed.

spariev•36m ago

Thanks for sharing, I wonder how do you keep the stylistic and mental alignment of the codebase - is this happens during the code review or there are specific instructions during at the plan/implement stages?

jgilias•29m ago

I used to do these things manually in Cursor. Then I had to take a few months off programming, and when I came back and updated Cursor I found out that it now automatically does ToDos, as well as keeps track of the context size and compresses it automatically by summarising the history when it reaches some threshold.

With this I find that most of the shenanigans of manual context window managing with putting things in markdown files is kind of unnecessary.

You still need to make it plan things, as well as guide the research it does to make sure it gets enough useful info into the context window, but in general it now seems to me like it does a really good job with preserving the information. This is with Sonnet 4

YMMV

gloosx•23m ago

It's strange that author is bragging that this 35K LOC was researched and implemented in 7 hours, but there are 40 commits spanning across 7 days. Was it 1 hour per day or what?

Also quite funny that one of the latest commits is "ignore some tests" :D

dhorthy•21m ago

if you read further down, I acknowledge this

> While the cancelation PR required a little more love to take things over the line, we got incredible progress in just a day.

daxfohl•4m ago

FWIW I think your style is better and more honest than most advocates. But I'd really love to see some examples of things that completely failed. Because there have to be some, right? But you hardly ever see an article from an AI advocate about something that failed, nor from an AI skeptic about something that succeeded. Yet I think these would be the types of things that people would truly learn from. But maybe it's not in anyone's financial interest to cross borders like that, for those who are heavily vested in the ecosystem.

Libghostty is coming

Markov chains are the original language models

Android users can now use conversational editing in Google Photos

Find SF parking cops

How to draw construction equipment for kids

Launch HN: Strata (YC X25) – One MCP server for AI to handle thousands of tools

Go has added Valgrind support

From MCP to shell: MCP auth flaws enable RCE in Claude Code, Gemini CLI and more

Always Invite Anna

Mesh: I tried Htmx, then ditched it

Nine things I learned in ninety years

x402 — An open protocol for internet-native payments

Getting More Strategic

Restrictions on house sharing by unrelated roommates

Getting AI to work in complex codebases

Structured Outputs in LLMs

Thundering herd problem: Preventing the stampede

OpenDataLoader-PDF: An open source tool for structured PDF parsing

Zoxide: A Better CD Command

Zinc (YC W14) Is Hiring a Senior Back End Engineer (NYC)

Agents turn simple keyword search into compelling search experiences

Shopify, pulling strings at Ruby Central, forces Bundler and RubyGems takeover

Denmark wants to push through Chat Control

YAML document from hell (2023)

Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput

Processing Strings 109x Faster Than Nvidia on H100

The Great American Travel Book: The book that helped revive a genre

Smooth weighted round-robin balancing

Show HN: Kekkai – a simple, fast file integrity monitoring tool in Go

Permeable materials in homes act as sponges for harmful chemicals: study

Libghostty is coming

Markov chains are the original language models

Android users can now use conversational editing in Google Photos

Find SF parking cops

How to draw construction equipment for kids

Launch HN: Strata (YC X25) – One MCP server for AI to handle thousands of tools

Go has added Valgrind support

From MCP to shell: MCP auth flaws enable RCE in Claude Code, Gemini CLI and more

Always Invite Anna

Mesh: I tried Htmx, then ditched it

Nine things I learned in ninety years

x402 — An open protocol for internet-native payments

Getting More Strategic

Restrictions on house sharing by unrelated roommates

Getting AI to work in complex codebases

Structured Outputs in LLMs

Thundering herd problem: Preventing the stampede

OpenDataLoader-PDF: An open source tool for structured PDF parsing

Zoxide: A Better CD Command

Zinc (YC W14) Is Hiring a Senior Back End Engineer (NYC)

Agents turn simple keyword search into compelling search experiences

Shopify, pulling strings at Ruby Central, forces Bundler and RubyGems takeover

Denmark wants to push through Chat Control

YAML document from hell (2023)

Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput

Processing Strings 109x Faster Than Nvidia on H100

The Great American Travel Book: The book that helped revive a genre

Smooth weighted round-robin balancing

Show HN: Kekkai – a simple, fast file integrity monitoring tool in Go

Permeable materials in homes act as sponges for harmful chemicals: study

Getting AI to work in complex codebases

Comments