There's something a little better about the tool use loop, which is nice.
But Claude seems a little dumber and is aggressive about "getting things done", often ignoring common sense, explicit instructions, or design information.
If I tell it to make a test pass, it will sometimes change my database structure to avoid having to debug the test. At least twice it deleted protobufs from my project and replaced them with JSON because it struggled to immediately debug a proto issue.
It is sometimes acceptable for humans to use judgment and defer work; the machine doesn’t have judgment so it is not acceptable for it to do so.
Like just now it said "great, the tests are consistently passing!" So I ran the same test command and 4 of the 7 tests are so broken they don't even build.
My immediate and obvious response is "you broke them!" (at least to myself), but I do appreciate that it's trying to keep focused in some strange way. A simple "commit, fix failing tests" prompt will generally take care of it.
I've been working on my "/implement" command to do a better job of checking that the full test suite is all green before asking if I want to clear the task and merge the feature branch.
Then clear the context and move on to the next task. Context pollution is real and can hurt you.
We captured debug logs and described the issue in detail to Gemini 2.5 Flash, giving it the nginx logs from the one second before and after an example incident (about 10k log entries).
It came back with a clear verdict, saying
"The smoking gun is here: 2025/07/24 21:39:51 [debug] 32#32: *5902095 rport:443 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.233.100.128, server: grpc-ai-test.not-relevant.org, request: POST /org.not-relevant.cloud.api.grpc.CloudEventsService/startStreaming HTTP/2.0, upstream: grpc://10.233.75.54:50051, host: grpc-ai-test.not-relevant.org"
and gave me a detailed action plan.
I was thinking this is cool, don't need to use my head on this, until I realized that the log entry simply did not exist. It was entirely made up.
(And yes I admit, I should know better than to do lousy prompting on a cheap foundation model)
In that case, you have to put a stop to it and point out that it would already be done if it hadn’t decided to blow it all up in an effort to write a one-time-use codemod. Of course it agrees with that point, as it agrees with everything. It’s the epitome of strong opinions loosely held.
Interestingly, it’s the only LLM I’ve seen behave that way. Others simply acknowledge the failure and, after a few hints, eventually get everything working.
Claude just hopes I won’t notice its tricks. It makes me wonder what else it might try to hide when misalignment has more serious consequences.
So I guess the blog team also uses Claude
> "Instead of remembering complex Kubernetes commands, they ask Claude for the correct syntax, like "how to get all pods or deployment status," and receive the exact commands needed for their infrastructure work."
Duh, you can ask an LLM tech questions and stuff. What is the point of putting something like that on the tech blog of a company which is supposed to be working on bleeding-edge tech?
Even with people who do use it, they might be thinking about it narrowly. They use it for code generation, but might not think to use it for simplified man pages.
Of course there are people who are the exact opposite and use it for every last thing they do. And maybe from this they learn how to better approach their prompts.
This bullet point is funny:
> Treat it like a slot machine
> Save your state before letting Claude work, let it run for 30 minutes, then either accept the result or start fresh rather than trying to wrestle with corrections. Starting over often has a higher success rate than trying to fix Claude's mistakes.
That's easy to say when the employee is not personally paying the massive amount of compute running Claude Code for a half-hour.
Sorry boss, it looks like we need to hire more software engineers since the AI route still isn't mathing.
Well, Anthropic sure thinks that you should. Number go up!
Except the power and cooling demands of the current crop of GPUs mean you are not fitting full density in a rack. There is a real, material increase in fiber use because your equipment is now more distributed (and 800Gbps interconnects are NOT cheap).
You can't capitalize power costs: this is now a non-trivial cost to account for. And the more power you use for compute, the more power you have to use for cooling... (Power density is now so high that cooling with something other than air is looking not just attractive but like it is going to be a requirement.)
Meanwhile the cost of lending right now is high compared to recent decades...
The accounting side of things isn't as pretty as one would like it to be.
https://archive.nytimes.com/www.nytimes.com/books/97/05/18/r...?
Pros:
- Saved Time!
- Scalable!
- Big Bill?
Cons:
- Big Bill
- AI written code
Even better if you use an LLM tool with hook support: just have the hook run formatters on the file after each edit.
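Something like this works as the hook script (a minimal sketch in Node/TypeScript; the stdin JSON shape and field names are assumptions based on Claude Code's PostToolUse hooks, so check your tool's docs):

```typescript
// format-hook.ts: run a formatter on whatever file the agent just edited.
// ASSUMPTION: the hook receives JSON on stdin that includes the edited file's
// path, e.g. { "tool_input": { "file_path": "src/foo.ts" } }. Adjust as needed.
import { execFileSync } from "node:child_process";

let raw = "";
process.stdin.on("data", (chunk) => (raw += chunk));
process.stdin.on("end", () => {
  try {
    const payload = JSON.parse(raw);
    const file = payload?.tool_input?.file_path; // assumed field name
    if (typeof file !== "string") return;
    // Run Prettier in place; swap in gofmt, black, rustfmt, etc. for other stacks.
    execFileSync("npx", ["prettier", "--write", file], { stdio: "inherit" });
  } catch {
    // A formatting failure shouldn't block the agent, so swallow errors here.
  }
});
```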
Same here. It's either capable of working unsupervised or not. And if not, you have to start wondering what you're even doing if you're at your keyboard, running tools, editing code that you don't like, etc.
We're still working out the edge cases with these "Full" self-driving editors. It vastly diminishes the usefulness if it's going to spend 20 minutes (and $) on stupid simple things.
> We're still working out the edge cases
The difficult part is that, like with FSD, it's mostly edge cases. Sure, the air is 3-dimensional, but driving is too dynamic and volatile. Every single road is different, and you have to rely on heuristics meant for humans.
It's stupid easy for humans to tell what a yellow line is and what a stop sign looks like, but it's not so easy for computers. These are human tools - physical things we look at with our eyes, not easy to measure. Things in the air, by contrast, are quite easy to measure.
On top of the visual heuristics, everything changes all the time and very fast. You look away from the road and look back and you don't know what you're gonna see. It's why texting and driving is so dangerous.
First, I want to summon my car. Then, when leaving, if I’m in a dense area with lots of shopping, the roads can be a pain. You have to exit right, immediately get into the left lane, three lanes over, the second of the right turn only lanes, etc
My guess is that context = main thing + somewhat unrelated thing is too big a space for the models to perform well at this point in time.
The practical solution is to remove the need for the model to figure it out each time and instead explicitly tell it as much as possible beforehand in CLAUDE.md.
They take less than a second to run, can run on every save, and are free
You could say I've seen A LOT of poorly written human generated code.
Yet, I still trust it more. Why? Well, one of the big reasons is exactly what we're joking about. I can trust a human to iterate. Lack of iteration would be fine if everything were containerized and code operated in an unchanging environment[0]. But in the real world, code needs to be iterated on, constantly. Good code doesn't exist. If it does exist, it doesn't stay good for long.
Another major problem is that AI generates code that optimizes for human preference, not correctness. Even the terrible students who were just doing enough to scrape by weren't trying to mask mistakes[1], but were still optimizing for correctness, even if it was the bare minimum. I can still walk through that code with the human and we can figure out what went wrong. I can ask the human about the code and I can tell a lot by their explanation, even if they make mistakes[2]. I can't trust the AI to tell an accurate account of even its own code because it doesn't actually understand. Even the dumb human has a much larger context window. They can see all the code. They can actually talk to me and try to figure out the intent. They will challenge me if I'm wrong! And for the love of god, I'm going to throw them out if they are just constantly showering me with praise and telling me how much of a genius I am. I don't want to work with someone where I feel like at any moment they're going to start trying to sell me a used car.
There's a lot of reasons, more than I list here. Do I still prompt LLMs and use them while I write code? Of course. Do I trust it to write code? Fuck no. I know it isn't trivial to see that middle ground if all you do is vibe code or hate writing code so much you just want to outsource it, but there's a lot of room here between having some assistant and having AI write code. Like the OP suggests, someone has got to write that 10-20%. That doesn't mean I've saved 80% of my time, I maybe saved 20%. Pareto is a bitch.
[0] Ever hear of "code rot?"
[1] Well... I'd rightfully dock points if they wrote obfuscated code...
[2] A critical skill of an expert in any subject is the ability to identify other experts. https://xkcd.com/451/
What makes you think that agents can't iterate?
> I'm going to throw them out if they are just constantly showering me with praise and telling me how much of a genius I am
You can tell the agent to have the persona of an arrogant ass if you prefer it.
Plus, the entire session/task history goes into every LLM prompt, not just the last message. So for every turn of the loop the LLM has the entire context with everything that previously happened in it, along with added "memories" and instructions.
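In loop form that's roughly the following (a minimal sketch; `callModel`, `runTool`, and the message shapes are placeholders, not any vendor's actual SDK):

```typescript
// Sketch of an agent loop: the whole transcript is resent on every turn,
// so context (and cost) grows with each tool call and message.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Placeholders for whatever model API and tool runner you actually use.
declare function callModel(messages: Message[]): Promise<Message>;
declare function runTool(request: Message): Promise<Message>;

async function agentLoop(task: string, memories: string): Promise<Message[]> {
  const history: Message[] = [
    { role: "system", content: memories },           // CLAUDE.md-style instructions
    { role: "user", content: task },
  ];
  for (let turn = 0; turn < 50; turn++) {
    const reply = await callModel(history);           // gets the ENTIRE history every time
    history.push(reply);
    if (!reply.content.includes("TOOL_CALL")) break;  // crude stop condition for the sketch
    history.push(await runTool(reply));               // tool output also stays in context
  }
  return history;
}
```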
> What makes you think that agents can't iterate?
Please RTFA or RTF topmost comment in the thread. Can they? Yes. Will they reliably? If so, why would it be better to restart...
But the real answer to your question: personal experience
TFA says:
Engineers use Claude Code for rapid prototyping by enabling "auto-accept mode" (shift+tab) and setting up autonomous loops in which Claude writes code, runs tests, and iterates continuously.
The tool rapidly prototypes features and iterates on ideas without getting bogged down in implementation details
Not only can you, some providers recommend it and their tools provide it, like ChatGPT Codex (the web tool). Can’t find where I read it but I’m pretty sure Anthropic devs said early on that they kick off the same prompt to Claude Code in multiple simultaneous runs.
Personally, I’ve had decent success from this way of working.
you can do the same for $200/month
I have been pretty successful at using llms for code generation.
I have a simple rule that something is either >90% AI or none at all (excluding inline completions and very obvious text editing).
The model has an inherent understanding of some problems due to its training data (e.g. setting up a web server with little to no deps in golang), which it can do with almost 100% certainty, so it's really easy to blaze through those in a few minutes, and then I can set up the architecture for some very flat code flows. This can genuinely improve my output by 30%-50%.
10% is the time it works 100% of the time.
I have been using Cline in VSCode, and I've been enjoying it a lot.
Vibe coding in Python is seductive but ultimately you end up in a bad place with a big bill to show for it.
Vibe coding in Haskell is a "how much money am I willing to pour in per unit clean, correct, maintainable code" exercise. With GHC cranked up to `-Wall -Werror` and some nasty property tests? Watching Claude Code try to weasel out with a mock goes from infuriating to amusing: bam, unused parameter! Now why would the test suite be demanding that a property holds on an unused parameter...
And Haskell is just an example; TypeScript is in some ways even more powerful in its type system, so lots of projects have scope to dabble with what I'm calling "hyper modern vibe coding": just start putting a bunch of really nasty fastcheck and generic bounds on stuff and watch Claude Code try to cheat. Your move, Claude Code, I know you want to check off that line on the TODO list like I want to breathe, so what's it gonna be?
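For the TypeScript flavour, a rough sketch with fast-check (the function and property here are invented for illustration, and a Jest-style runner is assumed): a lazy mock that ignores an input can't satisfy a property that quantifies over that input.

```typescript
import fc from "fast-check";

// The behaviour we actually want: applying a discount never increases the price
// and scales with the rate. A lazy mock that returns a constant, or one that
// ignores `rate`, fails as soon as fast-check explores the input space.
function applyDiscount(price: number, rate: number): number {
  return price * (1 - rate);
}

describe("applyDiscount", () => {
  it("never increases the price and actually uses the rate", () => {
    fc.assert(
      fc.property(
        fc.double({ min: 0, max: 1e6, noNaN: true }),  // price
        fc.double({ min: 0, max: 1, noNaN: true }),    // rate
        (price, rate) => {
          const out = applyDiscount(price, rate);
          return out <= price && Math.abs(out - price * (1 - rate)) < 1e-9;
        }
      )
    );
  });
});
```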
I find it usually gives up and does the work you paid for.
A bigger issue here is that the random process is not a good engineering pattern. It's not repeatable, does not drive coherent architecture, and struggles with complex problems. In my experience, problem size correlates inversely with generated code quality. Engineering is a process of divide-and-conquer and there is a good reason people don't use bogo (random) sort in production.
More specifically, if you only look at the final code, you are either spending a lot of time reviewing the code or accepting the code with less review scrutiny. Carefully reviewing semi random diffs seems like a poor use of time... so I suspect the default is less review scrutiny and higher tech debt. Interestingly enough, higher tech debt might be an acceptable tradeoff if you believe that soon Code Assistants will be good enough to burn the tech debt down autonomously or with minimal oversight.
On the other hand, if the code you are writing is not allowed to fail, the stakes change and you can't pick the less review option. I never thought to codify it as a process, but here is what I do to guide the development process:
- Start by stating the problem and asking Claude Code to: analyze the existing code, restate the problem in a structured fashion, scan the codebase for existing patterns solving the problem, brainstorm alternative solutions. An enhancement here could be to have a map / list of the codebase to improve the search.
- Evaluate presented solutions and iterate on the list. Add problem details, provide insight, eliminate the solutions that would not work. A lot of times I have enough context to pick a winner here, but if not, I ask for more details about each solution and their relative pros and cons.
- Ask Claude to provide a detailed plan for the down-selected solution. Carefully review the plan (a significantly faster endeavor compared to reviewing the whole diff). Iterate on the plan as needed; after that, tell Claude to save the plan for comparison after the implementation and then to get cracking.
- Review Claude's report of what was implemented vs. what was initially planned. This step is crucial because Claude will try dumb things to get things working, and I've already done the legwork on making sure we're not doing anything dumb in the previous step. Make changes as needed.
- After implementation, I generally do a pass on the unit tests because Claude is extremely prolific with them. You generally need to let it write unit tests to make sure it is on the right track. Here, I ask it to scan all of the unit tests and identify similar or identical code. After that, I ask for refactor options that most importantly maximize clarity, secondly minimize lines of code, and thirdly minimize diffs. Pick the best ones.
Yes, I accept that the above process takes significantly longer for any single change; however, in my experience, it produces far superior results in a bounded amount of time.
P.S. if you got this far please leave some feedback on how I can improve the flow.
Recently, I realized that this applies not only to the first 70–80% of a project but sometimes also to the final 70-80%.
I couldn’t make progress with Claude on a major refactoring from scratch, so I started implementing it myself. Once I had shaped the idea clearly enough but in a very early state, I handed it back to Claude to finish and it worked flawlessly, down to the last CHANGELOG entry, without any further input from me.
I saw this as a form of extensive guardrails or prompting-by-example.
It'll create a massive bespoke class to do something that is already in the stdlib.
But if there's a pattern of already using stdlib functions, it can copy that easily.
Having worked with a number of people like the ones I’ve described above, the way I’ve worked with them has helped me get better results from LLMs for coding. The difference is that you can help a junior grow over time; LLMs forget once that context is gone (CLAUDE.md helps, but it's not perfect).
Should be the same party that is getting the rewards of the productivity gains.
> /undo
> /clear
> ↑ ↑ ↑ ⏎
- Custom accessibility solution for family members
- The team created prototype "phone tree" systems to help team members connect with the right lawyer at Anthropic
- Team coordination tools
- Rapid prototyping for solution validation
So, not legal
I use it at home via the $20/m subscription and am piloting it at work via AWS Bedrock. When used with the Bedrock APIs, at the end of every session it shows you the dollar amount spent, which is a bit disconcerting. I hope the fine-grained metering of inference is a temporary situation; otherwise I think it will have a chilling/discouraging effect on software developers, leading to less experimentation, fewer rewrites, and overall lower quality.
I imagine Anthropic gets to consume it unmetered internally, so they probably completely avoid this problem.
That's why you don't pay the yearly license for anything at this point in time. Pay monthly and evaluate before each bill if there's something better out already.
Nope.
More open models ship every day and are 80% cheaper for similar and sometimes better performance, depending on the task.
You can use Qwen-3 Coder (a 480 billion parameter model with 35 billion active per forward pass, i.e. 8 of 160 experts) for $0.302/M input tokens and $0.302/M output tokens via OpenRouter.
Claude 4 Sonnet is $3/M input tokens and $15/M output tokens.
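Back-of-the-envelope with those listed rates (a sketch only; the monthly token counts are made-up assumptions, and real agent sessions skew heavily toward input tokens because the whole history is resent every turn):

```typescript
// Rough cost comparison at the prices quoted above (USD per million tokens).
const qwen3Coder = { input: 0.302, output: 0.302 };
const claudeSonnet4 = { input: 3.0, output: 15.0 };

// Hypothetical month of agent use: 50M input tokens, 5M output tokens.
const usage = { inputM: 50, outputM: 5 };

function monthlyCost(p: { input: number; output: number }): number {
  return p.input * usage.inputM + p.output * usage.outputM;
}

console.log(monthlyCost(qwen3Coder).toFixed(2));    // ~16.61
console.log(monthlyCost(claudeSonnet4).toFixed(2)); // ~225.00
```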
Several utilities let you use Claude Code with these models at will.
You're like in the top 0.05% of earners in the software field.
Of course, if you save 10 hours per month, the math starts making more sense for others.
And this is assuming LLM prices are stable, which I very much doubt they are, since everyone is price dumping to get market share.
Nobody is investing half a trillion in a tech without expecting a 10x return.
And I'm fairly sure that soon those $20/month subscriptions will sell your data, shove ads everywhere, AND basically only let you have that junior dev for 30 minutes per day or 2 days a month.
And the $200/month will probably be $500-1000 with more limitations.
Still cheap, but AI can't run an entire project, can't deliver. So the human will be in the loop, as you said, so at least a partial cost on top.
You can use these models through Claude Code; I do it everyday.
Some developers are running smaller versions of these LLMs on their own hardware, paying no one.
So I don’t think Anthropic and the other companies can dramatically increase their prices without losing the customers that helped them go from $0 to $4 billion in revenue in 3 years.
Users can easily move between different AI platforms with no lock-in, which makes it harder to increase prices and proceed to enshitify their platforms.
$100k/mo on CloudWatch corresponds to a moderately large software business, assuming basic best practices are followed. Optimization projects can often run into major cost overruns where the people time exceeds the discounted future free cash flow savings from the optimization.
That being said, a team of 5 on a small revenue/infra spend racking up $100k/mo is excessive. Pedantically, CloudWatch/Datadog are SaaS vendors - $100k/mo on Prometheus would correspond to a 20-node SSD cluster in the cloud, which could easily handle several tens of millions of metrics per second from tens of thousands of metric producers. If you went to raw colocation facility costs, you'd have over a hundred dual-Xeon machines with multi-TB direct-attached SSD, supporting hundreds of thousands of servers producing hundreds of millions of data points per second.
Human time is really the main trade-off.
I’m legitimately surprised at your feeling on this. I might not want the granular cost put in my face constantly but I do like the ability to see how much my queries cost when I am experimenting with prompt setup for agents. Occasionally I find wording things one way or the other has a significantly cheaper cost.
Why do you think it will lead to a chilling effect instead of the normal effect of engineers ruthlessly innovating costs down now that there is a measurable target?
i know some swift so i checked on what it was doing. for a quick hack project it did all the work and easily updated things i saw issues with.
for a one-off like that, not bad at all. not too dissimilar from your example.
I can assure you that I don’t at all care about the MAYBE $10 charge my monster Claude Code session billed the company. They also clearly said “don’t worry about cost, just go figure out how to work with it”
By a more honest standard we are still a very long way away from AI suggesting new ANN architectures, new approaches to managing RLHF, better training data, new benchmarks, etc etc. LLMs are nowhere close to being able to improve themselves.
The documentation is good, but is kept relatively general and I have a feeling that the quality of Claude Code's output really depends on the specific setup and prompts you use.
[0] https://fortune.com/2025/02/04/anthropic-tells-job-candidate...
I’ve got much better at using Claude.md and plan files, etc., but it still goes off the rails so quickly when I try to get it to follow a normal TDD test/edit/build/commit workflow. It will report success, and then the unit tests will fail or the working copy is dirty or there are build errors, etc. An LLM may be great for figuring out what code to write and what weird tool incantation to run to debug something, but I am fed up enough I want to write my own agent, because all I want is a switch-statement state machine to manage workflow.
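That switch-statement state machine really is most of it; a minimal sketch (the `ask` helper and the npm/git commands are placeholders for whatever your project actually uses):

```typescript
import { execSync } from "node:child_process";

// Hypothetical helper that sends a prompt to the LLM and applies its edits.
declare function ask(prompt: string): Promise<void>;

type State = "WRITE_TEST" | "EDIT" | "BUILD" | "TEST" | "COMMIT" | "DONE";

// The exit code of a real command, not the model's self-report, decides the next state.
const ok = (cmd: string): boolean => {
  try { execSync(cmd, { stdio: "inherit" }); return true; } catch { return false; }
};

async function tddLoop(task: string): Promise<void> {
  let state: State = "WRITE_TEST";
  for (let step = 0; step < 100 && state !== "DONE"; step++) {
    switch (state) {
      case "WRITE_TEST": await ask(`Write a failing test for: ${task}`); state = "BUILD"; break;
      case "EDIT":       await ask("Make the failing tests pass without breaking the others."); state = "BUILD"; break;
      case "BUILD":      state = ok("npm run build") ? "TEST" : "EDIT"; break;
      case "TEST":       state = ok("npm test") ? "COMMIT" : "EDIT"; break;
      case "COMMIT":     ok(`git add -A && git commit -m "${task}"`); state = "DONE"; break;
    }
  }
}
```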
It definitely has a "we use AI enough that we've lost the ability to communicate coherently" vibe to it.
But, if they had an expert in networking build it in the first place, would they have not avoided the error entirely up front?
I can just talk to it like a person and explain the full context / history of things. Way faster than typing it all out.
https://apps.apple.com/us/app/voice-type-local-dictation/id6...
The developer is pretty cool too. I found a few bugs here and there and reported them. He responds pretty much immediately.
I highly recommend getting a good microphone, I use a Rode smartlav. It makes a huge difference.
I type a lot faster than I speak :D
Even if it gets slightly garbled, I often will add a note in my context that I'm using speech recognition. Then Claude will handle the potentially garbled or unclear sections perfectly or ask follow-up questions if it's unclear.
I often work on large, complicated projects that span the whole codebase and multiple micro services. So it's often a blend of engineering, architectural, and product priorities. I can end up talking for paragraphs or multiple pages to fully explain the context. Then Claude typically has follow-up questions, things that aren't clear, or issues that I didn't catch.
Honestly, I just get sick of typing out "dissertations" every time. It's easier just to have a conversation, save it to a file, and then use that as context to start a new thread and do the work.
The copy aspect was the main value prop for the app I chose: Voice Type. You can do ctrl-v to start recording, again to stop, and it pastes it in the active text box anywhere on your computer.
So I just avoid it and generally think the whole thing isn’t serious, because nobody seems to care enough about the safety implications of building AGI with legal terms which are logically impossible to satisfy to demonstrate appropriate attention to detail (aka, yall are noobs)
Maybe before boasting about how your internal teams use your product, add an option for external companies to pay for it!
Industry leading AI models but basic things like subscription management are unsolved…
The only reason why "team" plans exist is to have centralised billing and licensing.
This seems rather inefficient, and also surprising that Claude Code was even needed for this.
Is it really value add to my life that I know some detail on page A or have some API memorized?
I’d rather we be putting smart people in charge of using AI to build out great products.
It should make things 10000x more competitive. I for one am excited AF for what the future holds.
If people want to be purists and pat themselves on the back sure. I mean people have hobbies like arts.
AI mostly provides negative efficiency gains.
This will not change in the future.
yes, actually. Maybe not intimate details but knowing what's available in the API heavily helps with problem solving
Since I don't like it automatically making changes to files, I copy & paste the code from the terminal to the IDE. That seems slow at first, but it allows me to correct the bigger and smaller issues on the fly faster than prompting Claude toward my preferred solution. In my opinion, this makes more sense since I have more control and it is easier to spot problematic code. When fixing such issues, I point Claude to the changes afterwards so it can add them to its context.
For me Claude is like a very (over) confident junior developer. You have to keep an eye on them, and if it's faster to do it yourself, just do it and explain to them why you did it. (That might be a bad approach for juniors, but for Claude it works for me.)
Btw, can we talk about the fact that this blog post is written by the company that's trying to sell the tool? We should take it with a huge grain of salt. Like most of what these AI companies tell us, it should probably be ignored 90% of the time. They either want to raise money or to get bought by some other company in the end...
Unlike Gemini CLI which will just rush into implementation without hesitation :D
It is very interesting to me how the differences in our intelligences are physically manifested in text. It is one argument against hard takeoff: the bioneuron can encode information in a sweet spot that cannot be targeted by the perceptron, by the so-called neuralese, by any amount of distillation.
The most effective way I’ve found to use CC so far is this workflow:
Have a detailed and also compressed spec in an md file. It can be called anything, because you’re going to reference it explicitly in every prompt. (CC usually forgets about CLAUDE.md ime)
Start with the user story, and ask it to write a high-level staged implementation plan with atomic steps. Review this plan and have CC rewrite as necessary. (Another md file results.)
Then, based on this file, ask it to write a detailed implementation plan, also with atomic stages. Then review it together and ask if it’s ready to implement.
Then tell Claude to go ahead and implement it on a branch.
Remember the automated tests and functional testing.
Then merge.
I've written a little about some my findings and workflow in detail here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
Richard Stallman is rolling in his grave with this.
But in all seriousness, nice work, I think this _is_ where the industry is going, hopefully we don't have to rely on using proprietary models forever though.
You can set up a FOSS toolchain to do similar work, it’s just something I haven’t spent the time on. I probably should.
- there's a devlog showing all the prompts and accepted outputs: https://github.com/sutt/agro/blob/master/docs/dev-summary-v1...
- and you can look at the ai-generated tests (as is being discussed above) and see they aren't very well thought out for the behavior, but are syntactically impressive: https://github.com/sutt/agro/tree/master/tests
- check out the case-studies in the docs if you're interested in more ideas.
The downside is I don’t have as much of a grasp on what’s actually happening in my project, while with hand-written projects I’d know every detail.
Not a gotcha, I'm just extremely skeptical that AI is at a point where it can carry the level of responsibility you're describing and have it turn into good code long term.
I agree there are times like this where it doesn't work and makes the title worse, though. The submitter can edit it, and this reformatting (and others) won't be applied again.