In the course of my work, I have found they ask valuable clarifying questions. I don’t care how they do it.
Nearly all of my "agents" are required to ask at least three clarifying questions before they're allowed to do anything (code, write a PRD, write an email newsletter, etc)
Force it to ask one at a time and it's even better, though the improvement isn't as step-function as the one you get vs. just letting it run off your initial ask.
I think the reason is exactly what you state @7thpower: it takes a lot of thinking to really provide enough context and direction to an LLM, especially (in my opinion) because they're so cheap and require no social capital cost (vs asking a colleague / employee—where if you have them work for a week just to throw away all their work it's a very non-zero cost).
Prompt 1: <define task> Do not write any code yet. Ask any questions you need for clarification now.
Prompt 2: <answer questions> Do not write any code yet. What additional questions do you have?
Iterate until the questions become unimportant.
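If you'd rather automate that loop than drive it by hand, here's a minimal sketch using the Anthropic Python SDK; the model id and the stop condition (a blank answer) are placeholder choices, not part of the original workflow:

import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY from the environment
history = [{
    "role": "user",
    "content": "<define task> Do not write any code yet. "
               "Ask any questions you need for clarification now.",
}]

while True:
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=1024,
        messages=history,
    )
    questions = reply.content[0].text
    print(questions)
    history.append({"role": "assistant", "content": questions})

    answers = input("Your answers (leave blank once the questions stop mattering): ")
    if not answers.strip():
        break  # planning is done; switch to implementation prompts
    history.append({
        "role": "user",
        "content": answers + " Do not write any code yet. "
                   "What additional questions do you have?",
    })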
It consumes ~30-40% of the tokens associated with a project, in my experience, but they seem to be used in a more productive way long-term, as it doesn't need to rehash anything later on if it got covered in planning. That said, I don't pay too close attention to my consumption, as I found that QwenCoder 30B will run on my home desktop PC (48GB RAM/12GB vRAM) in a way that's plenty functional and accomplishes my goals (albeit a little slower than Copilot on most tasks).
if you’re a heavy user you should pay for a monthly subscription for Claude Code which is significantly cheaper than API costs.
If you don't mind sharing, I'm really curious - what kind of things do you build and what is your skillset?
Other than just dumping 10M tokens of chats into a gist and saying "read through everything I said back and forth with Claude for a week."
But I think I've got the start of a useful summary format: it takes every prompt and points to the corresponding code commit produced by the AI, plus a line-diff count and a summary of the task. Check it out below.
https://github.com/sutt/agro/blob/master/docs/dev-summary-v1...
(In this case it's a Python CLI AI-coding framework that I'm using to build the package itself.)
What kind of software are you building that you couldn't before?
I would if there were any positive ROI for these $12k/year, or if it were a small enough fraction of my income. For me, neither are true, so I don’t :).
Like the sibling comments, I would be interested in your perspective on what kind of things you do with so many tokens.
I do freelancing mostly for fun though, picking projects I like, not directly for the money, but this is where I definitely see multiples of difference on what you can charge.
I’m not sure a single human could audit & review the output of $1k/mo in tokens from frontier models at the current market rate. I’m not sure they could even audit half that.
At any rate, I could easily go through that much with Opus because it’s expensive and often I’m loading the context window to do discovery, this may include not only parts of a codebase but also large schemas along with samples of inputs and outputs.
When I’m done with that, I spend a bunch of turns defining exactly what I want.
Now that MCP tools work well, there is also a ton of back and forth that happens there (this is time efficient, not cost efficient). It all adds up.
I have Claude code max which helps, but one of the reasons it’s so cheap is all of the truncation it does, so I have a different tool I use that lets me feed in exactly the parts of a codebase that I want to, which can be incredibly expensive.
This is all before the expenses associated with testing and evals.
I’m currently consulting, a lot of the code is ultimately written by me, and everything gets validated by me (if the LLM tells me about how something works, I don’t just take its word for it, I go look myself), but a lot of the work for me happens before any code is actually written.
My ability (usually clarity of mind and patience) to review an LLMs output is still a gating factor, but the costs can add up quickly.
I use it all the time. I am not into Claude Code-style agentic coding; I'm more of the "change the relevant lines and let me review" type.
I work in web dev. In VS Code I can easily select a line of code that's wrong, which I know how to fix but am honestly too tired to type, press Ctrl+I, and tell it to fix it. I know the fix, so I can easily review it.
GPT-4.1 agent mode is unlimited on the Pro tier. It's half the cost of Claude, Gemini, and ChatGPT. The VS Code integration alone is worth it.
Now, that's not the AI-does-everything coding these companies are marketing and want you to do; I treat it more like an assistant. For me it's perfect.
The AI might write ten versions. Versions 1-9 don't compile, but it automatically makes changes and gets further each time. Version 10 actually builds and seems to pass your test suite. That is the version you review!
—and you might not review the whole thing! 20 lines in, you realize the AI has taken a stupid approach that will obviously break, so you stop reading and tell the AI it messed up. This triggers another ~5 rounds of producing code before something compiles, which you can then review, hopefully in full this time if it did a good job.
I guess I see why the salesmen don't mention this... but it seems really important for everyone to know?
But it's true that I'm always surprised when people talk about using Claude on the beach or whatever, I love Claude Code but I have to test and test and test again per each incremental feature.
There's a lot of tokens used up quickly for those tools to query the code base, documentation, try changes, run commands, re-run commands to call tools correctly, fix errors, etc.
A full day of Opus 4.1 or GPT 5 high reasoning doing pair programming or guided code review across multiple issues or PRs in parallel will burn the max monthly limits and then stop you, or cost $1500 in top-up credits for a 15-hour day. Wait, WTF, that's $300k/year! OK, while true, that misses that it's accomplishing 6-8 workstreams in parallel, all day, with no drop in efficacy.
At enterprise procurement cost rates, hiring a {{specific_tech}} expert can run $240/hr or $3500/day and is (a) less knowledgeable about the 3+ year old tech the enterprise is using, (b) wants to advise instead of type.
So the question then isn't what it costs, it's what's the cost of being blocked and in turn blocking committers waiting for reviews? Similarly, what's the cost of a Max for a dev that doesn't believe in using it?
TL;DR: At the team level, for guided experts and disbelievers, API likely ends up cheaper again.
It's important not to think in terms of generalities like this. How they approach this depends on your test framework, and even on the language you use. If disabling tests is easy and common in that language/framework, it's more likely to do it.
For testing a CLI, I currently use run_tests.sh and never once has it tried to disable a test. Though that can be its own problem when it hits one it can't debug.
#!/usr/bin/env bash
# run_tests.sh
# Handle multiple script arguments or default to all .sh files
scripts=("${@/#/./examples/}")        # prefix each argument with ./examples/
[ $# -eq 0 ] && scripts=(./examples/*.sh)

for script in "${scripts[@]}"; do
  [ -n "$LOUD" ] && echo "$script"
  output=$(bash -x "$script" 2>&1) || {
    echo ""
    echo "Error in $script:"
    echo "$output"
    exit 1
  }
done
echo " OK"
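As written, calling it with no arguments runs every .sh script under examples/, while passing one or more names (e.g. ./run_tests.sh smoke.sh, a hypothetical example) runs just those from that directory; setting LOUD=1 prints each script name as it runs.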
----
Another tip: for specific tasks don't bother with "please read file x.md". Claude Code (and others) accept the @file syntax, which puts that file into context right away.
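For example (hypothetical file names), a prompt like "Refactor @src/cli.py to follow the conventions in @docs/style.md" pulls both files into the context without a separate read step.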
There was a discussion here 3 days ago: https://news.ycombinator.com/item?id=44957443 .
Not 'this is a separate project'. Not 'project documentation file'.
You can have READMEs dotted all over a project if that's necessary.
It's simply a file that a previous developer is asking you to read before you start mucking around in that directory.
While I can understand why someone might develop that first-impression, it's never been safe to assume, especially as one starts working with larger projects or at larger organizations. It's not that unusual for essential sections of the same big project to have their own make-files, specialized utility scripts, tweaks to auto-formatter, etc.
In other cases things are together in a repo for reasons of coordination: Consider frontend/backend code which runs with different languages on different computers, with separate READMEs etc. They may share very little in terms of their build instructions, but you want corresponding changes on each end of their API to remain in lockstep.
Another example: One of my employer's projects has special GNU gettext files for translation and internationalization. These exist in a subdirectory with its own documentation and support scripts, but it absolutely needs to stay within the application that is using it for string-conversions.
I disagree; that's merely correlated with the real purpose: a file-level README is something an author adds when they anticipate something about the mental state of future readers. In particular, that a reader won't arrive already prepared to understand or navigate.
While that often happens at "project roots", it is by no means exclusive to them, and it can still happen in sections with extremely tight dependencies or coupling that could never exist independently.
Analogy: While many roads begin with a speed-limit sign, it is not true that every speed-limit sign indicates you've entered a new road.
This is very very wrong. Anthropic's Max plan is like 10% of the cost of paying for tokens directly if you are a heavy user. And if you still hit the rate-limits, Claude Code can roll-over into you paying for tokens through API credits. Although, I have never hit the rate limits since I upgraded to the $200/month plan.
I acknowledge that and get like $400 worth of tokens from my $20 Claude Code Pro subscription every month.
I'm building tools I can use when the VC money runs out or a clear winner gets on top and the prices shoot up to realistic levels.
At that point I've hopefully got enough local compute to run a local model though.
It’s not an advertisement; I apologize if I come off as a Claude Code fanboy.
If you read part 1 of my post (linked in my OP) you will see that I disclosed exactly how much I paid for my usage, and also the reasons that I ended up choosing Claude Code over other agents.
Also, if you sign up for Anthropic’s feedback program you get a 30% reduction on API usage.
It is insanity to spend thousands of dollars a month when you could be spending hundreds for the exact same product.
It's an absolute no-brainer. And it's not even "either or". You can use both the plan and fallback to the API when you get rate-limited. A 30% discount on tokens cannot match the 90% discount on tokens you get using the plan. The math is so unbelievably in favour of the plan.
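As a rough illustration using the figures in this thread: usage that would cost $2,000/month at API prices comes to $1,400 with a 30% discount, versus a flat $200 on the Max plan.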
There are probably people who are not heavy users where the plan may not make sense. But for heavy users, you are burning piles of your own money by not using Anthropic's Max plan. You only need a week of moderate usage a month and already the plan will have paid for itself compared to paying for API credits directly.
Maybe that's the key piece of information you're missing where you mistakenly thought the rate-limits applied to the whole month, when in fact they apply to a 4-hour window.
The blog post starts with: "I’m not a professional developer, just a hobbyist with aspirations."
Is this a vibe blog promoting Misanthropic Claude Vibe? It is hard to tell, since all "AI" promotion blogs are unstructured and chaotic.
If you are considering leveraging any of the documentation or examples, you need to validate that the documentation or example actually matches what is currently in the code.
I have better luck being more concise and avoiding anthropomorphizing. Something like:
"validate documentation against existing code before implementation"
Should accomplish the same thing!
This is definitely a way in which working with LLMs is frustrating. I find them helpful, but I don't know that I'm getting "better" at using them. Every time I feel like I've discovered something, it seems to be situation specific.
But I also use more casual style when investigating. “See what you think about the existing inheritance model, propose any improvements that will make it easier to maintain. I was thinking that creating a new base class for tree and flower to inherit from might make sense, but maybe that’s over complicating things”
(Expressing uncertainty seems to help avoid the model latching on to every idea with “you’re absolutely right!”)
Also, there's a big difference between giving general "always on" context (as in agents.md) for vibe coding - like "validate against existing code" etc - versus bouncing ideas in a chat session like your example, where you don't necessarily have a specific approach in mind and burning a few extra tokens for a one off query is no big deal.
Context isn't free (either literally or in terms of processing time) and there's definitely a balance to be found for a given task.
"You can ask the agent for advice on ways to improve your application, but be really careful; it loves to “improve” things, and is quick to suggest adding abstraction layers, etc. Every single idea it gives you will seem valid, and most of them will seem like things that you should really consider doing. RESIST THE URGE..."
A thousand times this. LLMs love to over-engineer things. I often wonder how much of this is attributable to the training data...
The key thing in both cases, human and AI, is to be super clear about goals. Don’t say “how can this be improved”, say “what can we do to improve maintainability without major architectural changes” or “what changes would be required to scale to 100x volume” or whatever.
Open-ended, poorly-defined asks are bad news in any planning/execution based project.
AI tools can also take a swing at that kind of thing. But without a product/business intent it’s just shooting in the dark, whether human or AI.
It's better for things that are well isolated and definitely completely "inside the box", with no apparent way for the effects to reach outside the module. But you never know when you've overlooked something, or when some later refactoring invalidates the originally sane and clean assumptions without anyone noticing, because whoever does the refactoring only looks at a sub-section of the code. So it is not just a question of getting it right for the current system, but of anticipating that anything that can go wrong might indeed go wrong if I leave enough opportunities (complexity) around, even in modules that are well-encapsulated right now.
I mean, it's like having more than one database and you have to use both and keep them in sync. Who does that voluntarily? There's already caching inside many of the lower levels, from SSDs, CPUs, to the OS, and it's complex enough already, and can lead to unexpected behavior. Adding even more of that in the app itself does not appeal to me, if I can help it. I'm just way too stupid for all this complexity, I need it nice and simple. Well, as nice and simple as it gets these days, we seem to be moving towards biological system level complexity in larger IT systems.
If you are not writing the end system but a library, there is also the possibility that the actual system will do its own caching at a higher level. I would carefully evaluate whether there is really a need to do any caching inside my library; depending on how it is used, the higher level doing it too would likely make it obsolete, because the library functions will not be called as often as predicted in the first place.
There is also that you need a very different focus and mindset for the caching code compared to the code doing the actual work. For caching, you look at very different things than what you think about for the algorithm: for the app you think at a higher level, about how to get work done, while for caching you go down into the oily and dirty gearboxes of the machine and check all the shafts and gears and connections. Ideally caching would not be part of the business code at all, but that is hard to avoid, and the result is messy: very different kinds of code, dealing with very different problems, sitting close together or even intertwined.
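One way to at least keep the two kinds of code apart (a minimal sketch with hypothetical names, not a prescription): leave the library function cache-free and let the application wrap it at the call site if and when it actually needs to.

import time
from functools import lru_cache

# Library code: does the work, knows nothing about caching or invalidation.
def resolve_price(sku: str) -> float:
    time.sleep(0.1)          # stand-in for an expensive lookup or computation
    return float(len(sku))   # dummy result for the sketch

# Application code: opts into caching only where it decides it needs it.
# If a higher layer already caches, this wrapper simply never gets written.
cached_resolve_price = lru_cache(maxsize=1024)(resolve_price)

The business logic stays readable on its own, and the cache can be tuned, swapped out, or deleted without touching the library.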
I'd reckon anywhere between 99.9%-100%. Give or take.
Probably CLAUDE.md is a better place?
> Too much context
Claude’s Sub-agents[1] seem to be a promising way of getting around this, though I haven’t had time to play with the feature much. E.g. when you need to take a context-busting action like debugging dependencies, spin up a separate agent to read the output and summarize it. Then your top-level context doesn’t get polluted.
[1]: https://docs.anthropic.com/en/docs/claude-code/sub-agents
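For a rough idea of the shape (based on the linked docs; the name, description, and tool list below are made up): a sub-agent is just a Markdown file dropped into .claude/agents/ with YAML frontmatter, something like:

---
name: log-summarizer
description: Read long build or dependency-debugging output and report back only the relevant errors.
tools: Read, Grep, Bash
---
You read noisy command output on behalf of the main agent. Return a short
summary: the failing step, the key error messages, and a likely cause.
Never paste full logs back into your reply.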
https://gist.github.com/nicwolff/273d67eb1362a2b1af42e822f6c...
I have experienced this on many occasions. It ultimately adds up to a sneakily false sense of code stability.