In the course of my work, I have found they ask valuable clarifying questions. I don’t care how they do it.
Nearly all of my "agents" are required to ask at least three clarifying questions before they're allowed to do anything (code, write a PRD, write an email newsletter, etc)
Force it to ask one at a time and it's even better, though the improvement isn't as step-function as the one you get vs. just letting it run off your initial ask.
I think the reason is exactly what you state @7thpower: it takes a lot of thinking to really provide enough context and direction to an LLM, especially (in my opinion) because they're so cheap and require no social capital cost (vs asking a colleague / employee—where if you have them work for a week just to throw away all their work it's a very non-zero cost).
Prompt 1: <define task> Do not write any code yet. Ask any questions you need for clarification now.
Prompt 2: <answer questions> Do not write any code yet. What additional questions do you have?
Iterate until the questions become unimportant.
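If you'd rather automate that loop than drive it by hand, here's a minimal sketch using the Anthropic Python SDK; the model id and the stop condition (a blank answer) are placeholder choices, not part of the original workflow:

import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY from the environment
history = [{
    "role": "user",
    "content": "<define task> Do not write any code yet. "
               "Ask any questions you need for clarification now.",
}]

while True:
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=1024,
        messages=history,
    )
    questions = reply.content[0].text
    print(questions)
    history.append({"role": "assistant", "content": questions})

    answers = input("Your answers (leave blank once the questions stop mattering): ")
    if not answers.strip():
        break  # planning is done; switch to implementation prompts
    history.append({
        "role": "user",
        "content": answers + " Do not write any code yet. "
                   "What additional questions do you have?",
    })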
It consumes ~30-40% of the tokens associated with a project, in my experience, but they seem to be used in a more productive way long-term, as it doesn't need to rehash anything later on if it got covered in planning. That said, I don't pay too close attention to my consumption, as I found that QwenCoder 30B will run on my home desktop PC (48GB RAM/12GB vRAM) in a way that's plenty functional and accomplishes my goals (albeit a little slower than Copilot on most tasks).
if you’re a heavy user you should pay for a monthly subscription for Claude Code which is significantly cheaper than API costs.
If you don't mind sharing, I'm really curious - what kind of things do you build and what is your skillset?
Other than just dumping 10M tokens of chats into a gist and saying "read through everything I said back and forth with Claude for a week."
But I think I've got the start of a useful summary format: it takes every prompt and points to the corresponding code commit produced by the AI, plus a line-diff count and a summary of the task. Check it out below.
https://github.com/sutt/agro/blob/master/docs/dev-summary-v1...
(In this case it's a Python CLI AI-coding framework that I'm using to build the package itself.)
What kind of software are you building that you couldn't before?
I would if there were any positive ROI for these $12k/year, or if it were a small enough fraction of my income. For me, neither are true, so I don’t :).
Like the sibling comments, I would be interested in your perspective on what kind of things you do with so many tokens.
I do freelancing mostly for fun though, picking projects I like, not directly for the money, but this is where I definitely see multiples of difference on what you can charge.
I’m not sure a single human could audit & review the output of $1k/mo in tokens from frontier models at the current market rate. I’m not sure they could even audit half that.
At any rate, I could easily go through that much with Opus because it’s expensive and often I’m loading the context window to do discovery, this may include not only parts of a codebase but also large schemas along with samples of inputs and outputs.
When I’m done with that, I spend a bunch of turns defining exactly what I want.
Now that MCP tools work well, there is also a ton of back and forth that happens there (this is time efficient, not cost efficient). It all adds up.
I have Claude code max which helps, but one of the reasons it’s so cheap is all of the truncation it does, so I have a different tool I use that lets me feed in exactly the parts of a codebase that I want to, which can be incredibly expensive.
This is all before the expenses associated with testing and evals.
I’m currently consulting, a lot of the code is ultimately written by me, and everything gets validated by me (if the LLM tells me about how something works, I don’t just take its word for it, I go look myself), but a lot of the work for me happens before any code is actually written.
My ability (usually clarity of mind and patience) to review an LLMs output is still a gating factor, but the costs can add up quickly.
I use it all the time. I am not into Claude Code-style agentic coding; I'm more of the "change the relevant lines and let me review" type.
I work in web dev. In VS Code I can easily select a line of code that's wrong, which I know how to fix but am honestly too tired to type, press Ctrl+I, and tell it to fix it. I know the fix, so I can easily review it.
GPT-4.1 agent mode is unlimited on the Pro tier. It's half the cost of Claude, Gemini, and ChatGPT. The VS Code integration alone is worth it.
Now, that's not the AI-does-everything coding these companies are marketing and want you to do; I treat it more like an assistant. For me it's perfect.
The AI might write ten versions. Versions 1-9 don't compile, but it automatically makes changes and gets further each time. Version 10 actually builds and seems to pass your test suite. That is the version you review!
—and you might not review the whole thing! 20 lines in, you realize the AI has taken a stupid approach that will obviously break, so you stop reading and tell the AI it messed up. This triggers another ~5 rounds of producing code before something compiles, which you can then review, hopefully in full this time if it did a good job.
I guess I see why the salesmen don't mention this... but it seems really important for everyone to know?
But it's true that I'm always surprised when people talk about using Claude on the beach or whatever, I love Claude Code but I have to test and test and test again per each incremental feature.
There's a lot of tokens used up quickly for those tools to query the code base, documentation, try changes, run commands, re-run commands to call tools correctly, fix errors, etc.
A full day of Opus 4.1 or GPT 5 high reasoning doing pair programming or guided code review across multiple issues or PRs in parallel will burn the max monthly limits and then stop you, or cost $1500 in top-up credits for a 15-hour day. Wait, WTF, that's $300k/year! OK, while true, that misses that it's accomplishing 6-8 workstreams in parallel, all day, with no drop in efficacy.
At enterprise procurement cost rates, hiring a {{specific_tech}} expert can run $240/hr or $3500/day and is (a) less knowledgeable about the 3+ year old tech the enterprise is using, (b) wants to advise instead of type.
So the question then isn't what it costs, it's what's the cost of being blocked and in turn blocking committers waiting for reviews? Similarly, what's the cost of a Max for a dev that doesn't believe in using it?
TL;DR: At the team level, for guided experts and disbelievers, API likely ends up cheaper again.
It's important not to think in terms of generalities like this. How they approach this depends on your test framework, and even on the language you use. If disabling tests is easy and common in that language/framework, it's more likely to do it.
For testing a CLI, I currently use run_tests.sh and never once has it tried to disable a test. Though that can be its own problem when it hits one it can't debug.
#!/usr/bin/env bash
# run_tests.sh
# Handle multiple script arguments or default to all .sh files
scripts=("${@/#/./examples/}")        # prefix each argument with ./examples/
[ $# -eq 0 ] && scripts=(./examples/*.sh)

for script in "${scripts[@]}"; do
  [ -n "$LOUD" ] && echo "$script"
  output=$(bash -x "$script" 2>&1) || {
    echo ""
    echo "Error in $script:"
    echo "$output"
    exit 1
  }
done
echo " OK"
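As written, calling it with no arguments runs every .sh script under examples/, while passing one or more names (e.g. ./run_tests.sh smoke.sh, a hypothetical example) runs just those from that directory; setting LOUD=1 prints each script name as it runs.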
----
Another tip: for specific tasks don't bother with "please read file x.md". Claude Code (and others) accept the @file syntax, which puts that file into context right away.
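For example (hypothetical file names), a prompt like "Refactor @src/cli.py to follow the conventions in @docs/style.md" pulls both files into the context without a separate read step.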
There was a discussion here 3 days ago: https://news.ycombinator.com/item?id=44957443 .
Not 'this is a separate project'. Not 'project documentation file'.
You can have READMEs dotted all over a project if that's necessary.
It's simply a file that a previous developer is asking you to read before you start mucking around in that directory.
While I can understand why someone might develop that first-impression, it's never been safe to assume, especially as one starts working with larger projects or at larger organizations. It's not that unusual for essential sections of the same big project to have their own make-files, specialized utility scripts, tweaks to auto-formatter, etc.
In other cases things are together in a repo for reasons of coordination: Consider frontend/backend code which runs with different languages on different computers, with separate READMEs etc. They may share very little in terms of their build instructions, but you want corresponding changes on each end of their API to remain in lockstep.
Another example: One of my employer's projects has special GNU gettext files for translation and internationalization. These exist in a subdirectory with its own documentation and support scripts, but it absolutely needs to stay within the application that is using it for string-conversions.
I disagree; that's merely correlated with the real purpose: a file-level README is something an author adds when they anticipate something about the mental state of future readers. In particular, that a reader won't arrive already prepared to understand or navigate.
While that often happens at "project roots", it is by no means exclusive to them, and it can still happen in sections with extremely tight dependencies or coupling that could never exist independently.
Analogy: While many roads begin with a speed-limit sign, it is not true that every speed-limit sign indicates you've entered a new road.
This is very very wrong. Anthropic's Max plan is like 10% of the cost of paying for tokens directly if you are a heavy user. And if you still hit the rate-limits, Claude Code can roll-over into you paying for tokens through API credits. Although, I have never hit the rate limits since I upgraded to the $200/month plan.
I acknowledge that and get like $400 worth of tokens from my $20 Claude Code Pro subscription every month.
I'm building tools I can use when the VC money runs out or a clear winner gets on top and the prices shoot up to realistic levels.
At that point I've hopefully got enough local compute to run a local model though.
It’s not an advertisement; I apologize if I come off as a Claude Code fanboy.
If you read part 1 of my post (linked in my OP) you will see that I disclosed exactly how much I paid for my usage, and also the reasons that I ended up choosing Claude Code over other agents.
Also, if you sign up for Anthropic’s feedback program you get a 30% reduction on API usage.
It is insanity to spend thousands of dollars a month when you could be spending hundreds for the exact same product.
It's an absolute no-brainer. And it's not even "either or". You can use both the plan and fallback to the API when you get rate-limited. A 30% discount on tokens cannot match the 90% discount on tokens you get using the plan. The math is so unbelievably in favour of the plan.
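As a rough illustration using the figures in this thread: usage that would cost $2,000/month at API prices comes to $1,400 with a 30% discount, versus a flat $200 on the Max plan.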
There are probably people who are not heavy users where the plan may not make sense. But for heavy users, you are burning piles of your own money by not using Anthropic's Max plan. You only need a week of moderate usage a month and already the plan will have paid for itself compared to paying for API credits directly.
Maybe that's the key piece of information you're missing where you mistakenly thought the rate-limits applied to the whole month, when in fact they apply to a 4-hour window.
The blog post starts with: "I’m not a professional developer, just a hobbyist with aspirations."
Is this a vibe blog promoting Misanthropic Claude Vibe? It is hard to tell, since all "AI" promotion blogs are unstructured and chaotic.
If you are considering leveraging any of the documentation or examples, you need to validate that the documentation or example actually matches what is currently in the code.
I have better luck being more concise and avoiding anthropomorphizing. Something like:
"validate documentation against existing code before implementation"
Should accomplish the same thing!
This is definitely a way in which working with LLMs is frustrating. I find them helpful, but I don't know that I'm getting "better" at using them. Every time I feel like I've discovered something, it seems to be situation specific.
But I also use more casual style when investigating. “See what you think about the existing inheritance model, propose any improvements that will make it easier to maintain. I was thinking that creating a new base class for tree and flower to inherit from might make sense, but maybe that’s over complicating things”
(Expressing uncertainty seems to help avoid the model latching on to every idea with “you’re absolutely right!”)
Also, there's a big difference between giving general "always on" context (as in agents.md) for vibe coding - like "validate against existing code" etc - versus bouncing ideas in a chat session like your example, where you don't necessarily have a specific approach in mind and burning a few extra tokens for a one off query is no big deal.
Context isn't free (either literally or in terms of processing time) and there's definitely a balance to be found for a given task.
"You can ask the agent for advice on ways to improve your application, but be really careful; it loves to “improve” things, and is quick to suggest adding abstraction layers, etc. Every single idea it gives you will seem valid, and most of them will seem like things that you should really consider doing. RESIST THE URGE..."
A thousand times this. LLMs love to over-engineer things. I often wonder how much of this is attributable to the training data...
The key thing in both cases, human and AI, is to be super clear about goals. Don’t say “how can this be improved”, say “what can we do to improve maintainability without major architectural changes” or “what changes would be required to scale to 100x volume” or whatever.
Open-ended, poorly-defined asks are bad news in any planning/execution based project.
AI tools can also take a swing at that kind of thing. But without a product/business intent it’s just shooting in the dark, whether human or AI.
It's better for things that are well isolated and definitely completely "inside the box", with no apparent way for the effects to reach outside the module. But you never know when you've overlooked something, or when some later refactoring invalidates the originally sane and clean assumptions without anyone noticing, because whoever does the refactoring only looks at a sub-section of the code. So it is not just a question of getting it right for the current system, but of anticipating that anything that can go wrong might indeed go wrong if I leave enough opportunities (complexity) around, even in modules that are well-encapsulated right now.
I mean, it's like having more than one database and you have to use both and keep them in sync. Who does that voluntarily? There's already caching inside many of the lower levels, from SSDs, CPUs, to the OS, and it's complex enough already, and can lead to unexpected behavior. Adding even more of that in the app itself does not appeal to me, if I can help it. I'm just way too stupid for all this complexity, I need it nice and simple. Well, as nice and simple as it gets these days, we seem to be moving towards biological system level complexity in larger IT systems.
If you are not writing the end system but a library, there is also the possibility that the actual system will do its own caching at a higher level. I would carefully evaluate whether there is really a need to do any caching inside my library; depending on how it is used, the higher level doing it too would likely make it obsolete, because the library functions will not be called as often as predicted in the first place.
There is also that you need a very different focus and mindset for the caching code compared to the code doing the actual work. For caching, you look at very different things than what you think about for the algorithm: for the app you think at a higher level, about how to get work done, while for caching you go down into the oily and dirty gearboxes of the machine and check all the shafts and gears and connections. Ideally caching would not be part of the business code at all, but that is hard to avoid, and the result is messy: very different kinds of code, dealing with very different problems, sitting close together or even intertwined.
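One way to at least keep the two kinds of code apart (a minimal sketch with hypothetical names, not a prescription): leave the library function cache-free and let the application wrap it at the call site if and when it actually needs to.

import time
from functools import lru_cache

# Library code: does the work, knows nothing about caching or invalidation.
def resolve_price(sku: str) -> float:
    time.sleep(0.1)          # stand-in for an expensive lookup or computation
    return float(len(sku))   # dummy result for the sketch

# Application code: opts into caching only where it decides it needs it.
# If a higher layer already caches, this wrapper simply never gets written.
cached_resolve_price = lru_cache(maxsize=1024)(resolve_price)

The business logic stays readable on its own, and the cache can be tuned, swapped out, or deleted without touching the library.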
I'd reckon anywhere between 99.9%-100%. Give or take.
Probably CLAUDE.md is a better place?
> Too much context
Claude’s Sub-agents[1] seem to be a promising way of getting around this, though I haven’t had time to play with the feature much. E.g. when you need to take a context-busting action like debugging dependencies, spin up a separate agent to read the output and summarize it. Then your top-level context doesn’t get polluted.
[1]: https://docs.anthropic.com/en/docs/claude-code/sub-agents
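For a rough idea of the shape (based on the linked docs; the name, description, and tool list below are made up): a sub-agent is just a Markdown file dropped into .claude/agents/ with YAML frontmatter, something like:

---
name: log-summarizer
description: Read long build or dependency-debugging output and report back only the relevant errors.
tools: Read, Grep, Bash
---
You read noisy command output on behalf of the main agent. Return a short
summary: the failing step, the key error messages, and a likely cause.
Never paste full logs back into your reply.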
https://gist.github.com/nicwolff/273d67eb1362a2b1af42e822f6c...
I have experienced this on many occasions. It ultimately adds up to a sneakily false sense of code stability.