Claude Sonnet 4 now supports 1M tokens of context

https://www.anthropic.com/news/1m-context

890•adocomplete•9h ago

Comments

throwaway888abc•9h ago

holy moly! awesome

tankenmate•9h ago

This is definitely good to have this as an option but at the same time having more context reduces the quality of the output because it's easier for the LLM to get "distracted". So, I wonder what will happen to the quality of code produced by tools like Claude Code if users don't properly understand the trade off being made (if they leave it in auto mode of coding right up to the auto compact).

jasonthorsness•9h ago

What do you recommend doing instead? I've been using Claude Code a lot but am still pretty novice at the best practices around this.

TheDong•9h ago

Have the AI produce a plan that spans multiple files (like "01 create frontend.md", "02 create backend.md", "03 test frontend and backend running together.md"), and then create a fresh context for each step if it looks like re-using the same context is leading it to confusion.

Also, commit frequently, and if the AI constantly goes down the wrong path ("I can't create X so I'll stub it out with Y, we'll fix it later"), you can update the original plan with wording to tell it not to take that path ("Do not ever stub out X, we must make X work"), and then start a fresh session with an older and simpler version of the code and see if that fresh context ends up down a better path.

You can also run multiple attempts in parallel if you use tooling that supports that (containers + git worktrees is one way)

F7F7F7•9h ago

Inventivatbly the files become a mess of their own. Changes and learnings from one part of the plan often dont result in adaptation to impacted plans down chain.

In the end you have a mish mash of half implemented plans and now you’ve lost context too. Which leads to blowing tokens on trying to figure out what’s been implemented, what’s half baked, and what was completely ignored.

Any links to anyone who’s built something at scale using this method? It always sounds good on paper.

I’d love to find a system that works.

brandall10•8h ago

My system is to create detailed feature files up to a few hundred lines in size that are immutable, and then have a status.md file (preferably kept to about 50 lines) that links to a current feature that is used as a way to keep track of the progress on that feature.

Additionally I have a Claude Code command with instructions referencing the status.md, how to select the next task, how to compact status.md, etc.

Every time I'm done with a unit of work from that feature - always triggered w/ ultrathink - I'll put up a PR and go through the motions of extra refactors/testing. For more complex PRs that require many extra commits to get prod ready I just let the sessions auto-compact.

After merging I'll clear the context and call the CC command to progress to the next unit of work.

This allows me to put up to around 4-5 meaningful PRs per feature if it's reasonably complex while keeping the context relatively tight. The current project I'm focused on is just over 16k LOC in swift (25k total w/ tests) and it seems to work pretty well - it rarely gets off track, does unnecessary refactors, destroys working features, etc.

nzach•7h ago

Care to elaborate on how you use the status.md file? What exactly you put in there, and what value does it bring?

brandall10•3h ago

When I initially have it built from a feature file, it pulls in the most pertinent high level details from that and creates a supercharged task list that is updated w/ implementation details as the feature progresses.

As it links to the feature file as well, that is pulled into the context, but status.md is there to essentially act as a 'cursor' to where it is in the implementation and provide extended working memory - that Claude itself manages - specific to that feature. With that you can work on bite sized chunks of the feature each with a clean context. When the feature is complete it is trashed.

I've seen others try to achieve similar things by making CLAUDE.md or the feature file mutable but that IME is a bad time. CLAUDE.md should be lean with the details to work on the project, and the feature file can easily be corrupted in an unintended way allowing things to go wayward in scope.

nzach•8h ago

In my experience it works better if you create one plan at a time. Create a prompt, make claude implement it and then you make sure it is working as expected. Only then you ask for something new.

I've created an agent to help me create the prompts, it goes something like this: "You are an Expert Software Architect specializing in creating comprehensive, well-researched feature implementation prompts. Your sole purpose is to analyze existing codebases and documentation to craft detailed prompts for new features. You always think deeply before giving an answer...."

My workflow is: 1) use this agent to create a prompt for my feature; 2) ask claude to create a plan for the just created prompt; 3) ask claude to implement said plan if it looks good.

cube00•7h ago

>You always think deeply before giving an answer...

Nice try but they're not giving you the "think deeper" level just because you asked.

nzach•7h ago

https://docs.anthropic.com/en/docs/build-with-claude/prompt-...

dpe82•7h ago

Actually that's exactly how you do it.

theshrike79•6h ago

I use Gemini-cli (free 2.5 pro for an undetermined time before it self-lobotomises and switches to lite) to keep the specs up to date.

The actual tasks are stored in Github issues, which Claude (and sometimes Gemini when it feels like it) can access using the `gh` CLI tool.

But it's all just project management, if what the code says drifts from what's in the specs (for any reason), one of them has to change.

Claude does exactly what the documentation says, it doesn't evaluate the fact that the code is completely different and adapt - like a human would.

bredren•6h ago

Don’t rely entirely on CC. Once a milestone has been reached, copy the full patch to clipboard and the technical spec covering this. Provide the original files, the patch and the spec to Gemini and say ~a colleague did the work and does it fulfill the aims to best practices and spec?

Pick among the best feedback to polish the work done by CC—-it will miss things that Gemini will catch.

Then do it again. Sometimes CC just won’t follow feedback well and you gotta make the changes yourself.

If you do this you’ll be more gradual but by nature of the pattern look at the changes more closely.

You’ll be able to realign CC with the spec afterward with a fresh context and the existing commits showing the way.

Fwiw, this kind of technique can be done entirely without CC and can lead to excellent results faster, as Gemini can look at the full picture all at once, vs having to force cc to consume vs hen and peck slices of files.

wongarsu•8h ago

Changing the prompt and rerunning is something where Cursor still has a clear edge over Claude Code. It's such a powerful technique for keeping the context small because it keeps the context clear of back-and-forths and dead ends. I wish it was more universally supported

abound•7h ago

I do this all the time in Claude Code, you hit Escape twice and select the conversation point to 'branch' from.

agotterer•8h ago

I use the main Claude code thread (I don’t know what to call it) for planning and then explicitly tell Claude to delegate certain standalone tasks out to subagents. The subagents don’t consume the main threads context window. Even just delegating testing, debugging, and building will save a ton context.

sixothree•6h ago

/clear often is really the first tool for management. Do this when you finish a task.

tehlike•9h ago

Some reference:

https://simonwillison.net/2025/Jun/29/how-to-fix-your-contex...

https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-ho...

bachittle•8h ago

As of now it's not integrated into Claude Code. "We’re also exploring how to bring long context to other Claude products". I'm sure they already know about this issue and are trying to think of solutions before letting users incur more costs on their monthly plans.

PickledJesus•8h ago

Seems to be for me, I came to look at HN because I saw it was the default in CC

novaleaf•7h ago

where do you see it in CC?

PickledJesus•7h ago

I got a notification when I opened it, indicating that the default had changed, and I can see it on /model.

Only on a max (20x) account, not there on a Pro one.

novaleaf•7h ago

thanks, FYI I'm on a max 20x also and I don't see it!

tankenmate•3h ago

maybe a staggered release?

Wowfunhappy•3h ago

I'm curious, what does it say on /model?

For reference, my options are:

    ╭─────────────────────────────────────────────────────────────────────────────────────────────╮
    │                                                                                             │
    │  Select Model                                                                               │
    │  Switch between Claude models. Applies to this session and future Claude Code sessions.     │
    │  For custom model names, specify with --model.                                              │
    │                                                                                             │
    │     1. Default (recommended)  Opus 4.1 for up to 50% of usage limits, then use Sonnet 4     │
    │     2. Opus                   Opus 4.1 for complex tasks · Reaches usage limits faster      │
    │     3. Sonnet                 Sonnet 4 for daily use                                        │
    │     4. Opus Plan Mode         Use Opus 4.1 in plan mode, Sonnet 4 otherwise                 │
    │                                                                                             │
    ╰─────────────────────────────────────────────────────────────────────────────────────────────╯

novaleaf•2h ago

me also

dbreunig•4h ago

The team at Chroma is currently looking into this and should have some figures.

falcor84•9h ago

Strange that they don't mention whether that's enabled or configurable in Claude Code.

farslan•9h ago

Yeah same, I'm curious about this. I would guess it's by default enabled with Claude Code.

csunoser•9h ago

They don't say it outright. But I think it is not in Claude Code yet.

> We’re also exploring how to bring long context to other Claude products. - Anthropic

That is, any other product that is not Anthropic API tier 4 or Amazon bedrock.

CharlesW•9h ago

From a co-marketing POV, it's considered best practice to not discuss home-grown offerings in the same or similar category as products from the partners you're featuring.

It's likely they'll announce this week, albeit possibly just within the "what's new" notes that you see when Claude Code is updated.

reasonableklout•31m ago

They just sent an email that the feature is in beta in CC.

faangguyindia•9h ago

In my testing the gap between claude and gemini pro 2.5 is close. My company is in asia pacific and we can't get access to claude via vertex for some stupid reason.

but i tested it via other providers, the gap used to be huge but now not.

Tostino•9h ago

For me the gap is pretty large (in Gemini Pro 2.5's favor).

For reference, the code I am working on is a Spring Boot / (Vaadin) Hilla multi-module project with helm charts for deployment and a separate Python based module for ancillary tasks that were appropriate for it.

I've not been able to get any good use out of Sonnet in months now, whereas Gemini Pro 2.5 has (still) been able to grok the project well enough to help out.

jona777than•9h ago

I initially found Gemini Pro 2.5 to work well for coding. Over time, I found Claude to be more consistently productive. Gemini Pro 2.5 became my go-to for use cases benefitting from larger context windows. Claude seemed to be the safer daily driver (if I needed to get something done.)

All that being said, Gemini has been consistently dependable when I had asks that involved large amounts of code and data. Claude and the OpenAI models struggled with some tasks that Gemini responsively satisfied seemingly without "breaking a sweat."

Lately, it's been GPT-5 for brainstorming/planning, Claude for hammering out some code, Gemini when there is huge data/code requirements. I'm curious if the widened Sonnet 4 context window will change things.

llm_nerd•9h ago

Opus 4.1 is a much better model for coding than Sonnet. The latter is good for general queries / investigations or to draw up some heuristics.

I have paid subscriptions to both Gemini Pro and Claude. Hugely worthwhile expense professionally.

faangguyindia•8h ago

when gemini 2.5 pro gets stuck, i often use deep seek r1 in architect mode and qwen3 in coder mode in aider and it solves all the problem

last month i ran into some wicked dependency bug and only chatgpt could solve it which i am guessing is the case because it has hot data from github?

On the other hand, i really need a tool like aider where i can use various models in "architect" and "coder" mode.

what i've found is better reasoning models tend to be bad at writing actual code, and models like qwen3 coder seems better.

deep seek r1 will not write reliable code but it will reason well and map out the path forward.

i wouldn't be surprised if sonnets success was doing EXACTLY this behind the scenes.

but now i am looking for pure models who do not use this black magic hack behind API.

I want more control at tool end where i can alter the prompts and achieve results i want

this is one reason i do not use claude code etc....

aider is 80% of what i want wish it had more of what i want though.

i just don't know why no one has build a perfect solution to this yet.

Here are things i am missing in aider

1. Automatic model switching, use different models for asking questions about the code, different one for planning a feature, different one for writing actual code.

2. Self determine, if a feature needs a "reasoning" model or coding model will suffice.

3. be able to do more, selectively send context and drop the files we don't need. Intelligently add files to context which will be touched by the feature, not after having done all code planning asking to add files, then doing it all over again with more context available.

penguin202•9h ago

Claude doesn't have a mid-life crisis and try to `rm -rf /` or delete your project.

film42•8h ago

Agree but pricing wise, Gemini 2.5 pro wins. Gemini input tokens are half the cost of Claude 4. Output is $5/million cheaper than Claude. But, document processing is significantly cheaper. A 5MB PDF (customer invoice) with Gemini is like 5k tokens vs 56k with Claude.

The only downside with Gemini (and it's a big one) is availability. We get rate limited by their dynamic QoS all the time even if we haven't reached our quota. Our GCP sales rep keeps recommending "provisioned throughput," but it's both expensive, and doesn't fit our workload type. Plus, the VertexAI SDK is kind of a PITA compared to Anthropic.

Alex-Programs•6h ago

Google products are such a pain to work with from an API perspective that I actively avoid them where possible.

artursapek•9h ago

Eagerly waiting for them to do this with Opus

irthomasthomas•9h ago

Imagine paying $20 a prompt?

artursapek•8h ago

If I can give it a detailed spec, walk away and do something else for 20 minutes, and come back to work that would have taken me 2 hours, then that's a steal.

dbbk•2h ago

You can just do this now though. In fact you could go a step further and set up the GitHub Action, then you can kick off Claude from the iOS GitHub app from the beach and review the PR when it's done.

datadrivenangel•8h ago

Depending on how many prompts per hour you're looking at, that's probably same order of magnitude as expensive SAAS. A fancy CRM seat can be ~$2000 per month (or more), which assuming 50 hours per week x 4 weeks per month is $10 per hour ($2000/200 hours). A lot of money, but if it makes your sales people more productive, it's a good investment. Assuming that you're paying your sales people say 240K per year, ($20,000 per month), then the SAAS cost is 10% of their salary.

This explains DataDog pricing. Maybe it will give a future look at AI pricing.

mettamage•9h ago

Shame it's only the API. Would've loved to see it via the web interface on claude.ai itself.

minimaxir•9h ago

Can you even fit 200+k tokens worth of context in the web interface? IMO Claude's API workbench is the worst of the three major providers.

mettamage•9h ago

Via text files right? Just drag and drop.

data-ottawa•8h ago

When working on artifacts after a few change requests it definitely can.

77pt77•8h ago

Even if you can't, a conversation can easily get larger than that.

fblp•8h ago

I assume this will mean that long chats continue to get the "prompt is too long" error?

penguin202•9h ago

But will it remember any of it, and stop creating new redundant files when it can't find or understand what its looking for?

1xer•9h ago

moaaaaarrrr

aliljet•9h ago

This is definitely one of my CORE problem as I use these tools for "professional software engineering." I really desperately need LLMs to maintain extremely effective context and it's not actually that interesting to see a new model that's marginally better than the next one (for my day-to-day).

However. Price is king. Allowing me to flood the context window with my code base is great, but given that the price has substantially increased, it makes sense to better manage the context window into the current situation. The value I'm getting here flooding their context window is great for them, but short of evals that look into how effective Sonnet stays on track, it's not clear if the value actually exists here.

rootnod3•9h ago

Flooding the context also means increasing the likelihood of the LLM confusing itself. Mainly because of the longer context. It derails along the way without a reset.

aliljet•9h ago

How do you know that?

EForEndeavour•9h ago

https://onnyunhui.medium.com/evaluating-long-context-lengths...

bigmadshoe•9h ago

https://research.trychroma.com/context-rot

joenot443•7h ago

This is a good piece. Clearly it's a pretty complex problem and the intuitive result a layman engineer like myself might expect doesn't reflect the reality of LLMs. Regex works as reliably on 20 characters as it does 2m characters; the only difference is speed. I've learned this will probably _never_ be the case with LLMs, there will forever exist some level of epistemic doubt in its result.

When they announced Big Contexts in 2023, they referenced being able to find a single changed sentence in the context's copy of Great Gatsby[1]. This example seemed _incredible_ to me at the time but now two years later I'm feeling like it was pretty cherry-picked. What does everyone else think? Could you feed a novel into an LLM and expect it to find the single change?

[1] https://news.ycombinator.com/item?id=35941920

dang•1h ago

Discussed here:

Context Rot: How increasing input tokens impacts LLM performance - https://news.ycombinator.com/item?id=44564248 - July 2025 (59 comments)

F7F7F7•9h ago

What do you think happens when things start falling outside of its context window? It loses access to parts of your conversation.

And that’s why it will gladly rebuild the same feature over and over again.

anonz4FWNqnX•9h ago

I've had similar experiences. I've gone back and forth between running models locally and using the commercial models. The local models can be incredibly useful (gemma, qwen), but they need more patience and work to get them to work.

One advantage to running locally[1] is that you can set the context length manually and see how well the llm uses it. I don't have an exact experience to relay, but it's not unusual for models to be allow longer contexts, but ignore that context.

Just making the context big doesn't mean the LLM is going to use it well.

[1] I've using lm studio on both a macbook air and a macbook pro. Even a macbook air with 16G can run pretty decent models.

nomel•7h ago

A good example of this was the first Gemini model that allowed 1 million tokens, but would lose track of the conversation after a couple paragraphs.

rootnod3•8h ago

The longer the context and the discussion goes on, the more it can get confused, especially if you have to refine the conversation or code you are building on.

Remember, in its core it's basically a text prediction engine. So the more varying context there is, the more likely it is to make a mess of it.

Short context: conversion leaves the context window and it loses context. Long context: it can mess with the model. So the trick is to strike a balance. But if it's an online models, you have fuck all to control. If it's a local model, you have some say in the parameters.

fkyoureadthedoc•8h ago

https://github.com/adobe-research/NoLiMa

giancarlostoro•8h ago

Here's a paper from MIT that covers how this could be resolved in an interesting fashion:

https://hanlab.mit.edu/blog/streamingllm

The AI field is reusing existing CS concepts for AI that we never had hardware for, and now these people are learning how applied Software Engineering can make their theoretical models more efficient. It's kind of funny, I've seen this in tech over and over. People discover new thing, then optimize using known thing.

mamp•7h ago

Unfortunately, I think the context rot paper [1] found that the performance degradation when context increased still occurred in models using attention sinks.

1. https://research.trychroma.com/context-rot

kridsdale3•3h ago

The fact that this is happening is where the tremendous opportunity to make money as an experienced Software Engineer currently lies.

For instance, a year or two ago, the AI people discovered "cache". Imagine how many millions the people who implemented it earned for that one.

giancarlostoro•21m ago

I've been thinking the same, and its things that you don't need some crazy ML degree to know how to do... A lot of the algorithms are known... for a while now... Milk it while you can.

Wowfunhappy•5h ago

I keep reading this, but with Claude Code in particular, I consistently find it gets smarter the longer my conversations go on, peaking right at the point where it auto-compacts and everything goes to crap.

This isn't always true--some conversations go poorly and it's better to reset and start over--but it usually is.

benterix•9h ago

> it's not clear if the value actually exists here.

Having spent a couple of weeks on Claude Code recently, I arrived to the conclusion that the net value for me from agentic AI is actually negative.

I will give it another run in 6-8 months though.

wahnfrieden•8h ago

Did you try with using Opus exclusively?

freedomben•8h ago

Do you know if there's a way to force Claude code to do that exclusively? I've found a few env vars online but they don't seem to actually work

wahnfrieden•8h ago

Peter Steinberger has been documenting his workflows and he relies exclusively on Opus at least until recently. (He also pays for a few Max 20x subscriptions at once to avoid rate limits.)

atonse•8h ago

You can type /config and then go to the setting to pick a model.

gdudeman•8h ago

Yes: type /model and then pick Opus 4.1.

artursapek•8h ago

You can "force" it by just paying them $200 (which is nothing compared to the value)

parineum•8h ago

Value is irrelevant. What's the return on investment you get from spending $200?

Collecting value doesn't really get you anywhere if nobody is compensating you for it. Unless someone is going to either pay for it for you or give you $200/mo post-tax dollars, it's costing you money.

wahnfrieden•8h ago

The return for me is faster output of features, fixes, and polish for my products which increases revenue above the cost of the tool. Did you need to ask this?

parineum•7h ago

Yes, I did. Not everybody has their own product that might benefit from a $200 subscription. Most of us work for someone else and, unless that person is paying for the subscription, the _value_ it adds is irrelevant unless it results in better compensation.

Furthermore, the advice was given to upgrade to a $200 subscription from the $20 subscription. The difference in value that might translate into income between the $20 option and the $200 option is very unclear.

wahnfrieden•7h ago

If you are employed you should petition your employer for tools you want. Maybe you can use it to take the day off earlier or spend more time socializing. Or to get a promotion or performance bonus. Hopefully not just to meet rising productivity expectations without being handed the tools needed to achieve that. Having full-time access to these tools can also improve your own skills in using them, to profit from in a later career move or from contributing toward your own ends.

parineum•6h ago

I'm not disputing that. I'm just pushing back against the casual suggestion (not by you) to just go spend $200.

No doubt that you should ask you employer for the tools you want/need to do your job but plenty of us are using this kind of thing casually and the response to "Any way I can force it to use [Opus] exclusively?" is "Spend $200, it's worth it." isn't really helpful, especially in the context where the poster was clearly looking to try it out to see if it was worth it.

Aeolun•2h ago

If you have the money, and like coding your own stuff, the $200 is worth it. If you just code for the enterprise? Not so much.

epiccoleman•7h ago

is Opus that much better than Sonnet? My sub is $20 a month, so I guess I'd have to buy that I'm going to get a 10x boost, which seems dubious

theshrike79•6h ago

With the $20 plan you get Opus on the web and in the native app. Just not in Claude Code.

IMO it's pretty good for design, but with code it gets in its head a bit too much and overthinks and overcomplicates solutions.

artursapek•2h ago

Yes, Opus is much better at complicated architecture

mark_l_watson•8h ago

I am sort of with you. I am down to asking Gemini Pro a couple of questions a day, use ChatGPT just a few times a week, and about once a week use gemini-cli (either a short free session, or a longer session where I provide my API key.)

That said I spend (waste?) an absurdly large amount of time each week experimenting with local models (sometimes practical applications, sometimes ‘research’).

mikepurvis•8h ago

For a bit more nuance, I think I would my overall net is about break even. But I don't take that as "it's not worth it at all, abandon ship" but rather that I need to hone my instinct of what is and is not a good task for AI involvement, and what that involvement should look like.

Throwing together a GHA workflow? Sure, make a ticket, assign it to copilot, check in later to give a little feedback and we're golden. Half a day of labour turned into fifteen minutes.

But there are a lot of tasks that are far too nuanced where trying to take that approach just results in frustration and wasted time. There it's better to rely on editor completion or maybe the chat interface, like "hey I want to do X and Y, what approach makes sense for this?" and treat it like a rubber duck session with a junior colleague.

cambaceres•8h ago

For me it’s meant a huge increase in productivity, at least 3X.

Since so many claim the opposite, I’m curious to what you do more specifically? I guess different roles/technologies benefit more from agents than others.

I build full stack web applications in node/.net/react, more importantly (I think) is that I work on a small startup and manage 3 applications myself.

datadrivenangel•8h ago

How do you structure your applications for maintainability?

dingnuts•8h ago

You have small applications following extremely common patterns and using common libraries. Models are good at regurgitating patterns they've seen many times, with fuzzy find/replace translations applied.

Try to build something like Kubernetes from the ground up and let us know how it goes. Or try writing a custom firmware for a device you just designed. Something like that.

elevatortrim•8h ago

I think there are two broad cases where ai coding is beneficial:

1. You are a good coder but working on a new (to you) or building a new project, or working with a technology you are not familiar with. This is where AI is hugely beneficial. It does not only accelerate you, it lets you do things you could not otherwise.

2. You have spent a lot of time on engineering your context and learning what AI is good at, and using it very strategically where you know it will save time and not bother otherwise.

If you are a really good coder, really familiar with the project, and mostly changing its bits and pieces rather than building new functionality, AI won’t accelerate you much. Especially if you did not invest the time to make it work well.

nicce•8h ago

> I build full stack web applications in node/.net/react, more importantly (I think) is that I work on a small startup and manage 3 applications myself.

I think this is your answer. For example, React and JavaScript are extremely popular and aged. Are you using TypeScript and want to get most of the types or are you accepting everything that LLM gives as JavaScript? How interested you are about the code whether it is using "soon to be deprecated" functions or the most optimized loop/implementation? How about the project structure?

In other cases, the more precision you need, less effective LLM is.

rs186•8h ago

3X if not 10X if you are starting a new project with Next.js, React, Tailwind CSS for a fullstack website development, that solves an everyday problem. Yeah I just witnessed that yesterday when creating a toy project.

For my company's codebase, where we use internal tools and proprietary technology, solving a problem that does not exist outside the specific domain, on a codebase of over 1000 files? No way. Even locating the correct file to edit is non trivial for a new (human) developer.

GenerocUsername•8h ago

Your first week of AI usage should be crawling your codebase and generating context.md docs that can then be fed back into future prompts so that AI understands your project space, packages, apis, and code philosophy.

I guarantee your internal tools are not revolutionary, they are just unrepresented in the ML model out of the box

nicce•8h ago

Even then, are you even allowed to use AI in such codebase. Is some part of the code "bought", e.g. commercial compiler generated with specific license? Is pinky promise from LLM provider enough?

orra•8h ago

That sounds incredibly boring.

Is it effective? If so I'm sure we'll see models to generate those context.md files.

cpursley•7h ago

Yes. And way less boring than manually reading a section of a codebase to understand what is going on after being away from it for 8 months. Claude's docs and git commit writing skills are worth it for that alone.

blitztime•8h ago

How do you keep the context.md updated as the code changes?

shmoogy•7h ago

I tell Claude to update it generally but you can probably use a hook

tombot•7h ago

This, while it has context of the current problem, just ask Claude to re-read it's own documentation and think of things to add that will help it in the future

MattGaiser•8h ago

Yeah, anecdotally it is heavily dependent on:

1. Using a common tech. It is not as good at Vue as it is at React.

2. Using it in a standard way. To get AI to really work well, I have had to change my typical naming conventions (or specify them in detail in the instructions).

nicce•8h ago

React also seems to be actually alias for Next.js. Models have hard time to make the difference.

mike_hearn•8h ago

My codebase has about 1500 files and is highly domain specific: it's a tool for shipping desktop apps[1] that handles all the building, packaging, signing, uploading etc for every platform on every OS simultaneously. It's written mostly in Kotlin, and to some extent uses a custom in-house build system. The rest of the build is Gradle, which is a notoriously confusing tool. The source tree also contains servers, command line tools and a custom scripting language which is used for all the scripting needs of the project [2].

The code itself is quite complex and there's lots of unusual code for munging undocumented formats, speaking undocumented protocols, doing cryptography, Mac/Windows specific APIs, and it's all built on a foundation of a custom parallel incremental build system.

In other words: nightmare codebase for an LLM. Nothing like other codebases. Yet, Claude Code demolishes problems in it without a sweat.

I don't know why people have different experiences but speculating a bit:

1. I wrote most of it myself and this codebase is unusually well documented and structured compared to most. All the internal APIs have full JavaDocs/KDocs, there are extensive design notes in Markdown in the source tree, the user guide is also part of the source tree. Files, classes and modules are logically named. Files are relatively small. All this means Claude can often find the right parts of the source within just a few tool uses.

2. I invested in making a good CLAUDE.md and also wrote a script to generate "map.md" files that are at the top of every module. These map files contain one-liners of what every source file contains. I used Gemini to make these due to its cheap 1M context window. If Claude does struggle to find the right code by just reading the context files or guessing, it can consult the maps to locate the right place quickly.

3. I've developed a good intuition for what it can and cannot do well.

4. I don't ask it to do big refactorings that would stress the context window. IntelliJ is for refactorings. AI is for writing code.

[1] https://hydraulic.dev

[2] https://hshell.hydraulic.dev/

tptacek•7h ago

That's an interesting comment, because "locating the correct file to edit" was the very first thing LLMs did that was valuable to me as a developer.

evantbyrne•8h ago

The problem with these discussions is that almost nobody outside of the agency/contracting world seems to track their time. Self-reported data is already sketchy enough without layering on the issue of relying on distant memory of fine details.

andrepd•8h ago

Self-reports are notoriously overexcited, real results are, let's say, not so stellar.

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

logicprog•5h ago

Here's an in depth analysis and critique of that study by someone whose job is literally to study programmers psychologically and has experience in sociology studies: https://www.fightforthehuman.com/are-developers-slowed-down-...

Basically, the study has a fuckton of methodological problems that seriously undercut the quality of its findings, and even assuming its findings are correct, if you look closer at the data, it doesn't show what it claims to show regarding developer estimations, and the story of whether it speeds up or slows down developers is actually much more nuanced and precisely mirrors what the developers themselves say in the qualitative quote questionaire, and relatively closely mirrors what the more nuanced people will say here — that it helps with things you're less familiar with, that have scope creep, etc a lot more, but is less or even negatively useful for the opposite scenarios — even in the worst case setting.

Not to mention this is studying a highly specific and rare subset of developers, and they even admit it's a subset that isn't applicable to the whole.

acedTrex•8h ago

I have yet to get it to generate code past 10ish lines that I am willing to accept. I read stuff like this and wonder how low yall's standards are, or if you are working on projects that just do not matter in any real world sense.

spicyusername•8h ago

4/5 times I can easily get 100s of lines output, that only needs a quick once over.

1/5 times, I spend an extra hour tangled in code it outputs that I eventually just rewrite from scratch.

Definitely a massive net positive, but that 20% is extremely frustrating.

acedTrex•8h ago

That is fascinating to me, i've never seen it generate that much code that is actually something i would consider correct. It's always wrong in some way.

LinXitoW•3h ago

In my experience, if I have to issue more than 2 corrections, I'm better off restarting and beefing up the prompt or just doing it myself

dillydogg•8h ago

Whenever I read comments from the people singing their praises of the technology, it's hard not to think of the study that found AI tools made developers slower in early 2025.

>When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

mstkllah•7h ago

Ah, the very extensive study with 16 developers. Bulletproof results.

izacus•6h ago

Yeah, we should listen to the one "trust me bro" dude instead.

troupo•5h ago

Compared to "it's just a skill issue you're not prompting it correctly" crowd with literally zero actionable data?

logicprog•5h ago

Not to mention this is studying a highly specific and rare subset of developers, and they even admit it's a subset that isn't applicable to the whole.

dillydogg•5h ago

This is very helpful, thank you for the resource

djeastm•7h ago

Standards are going to be as low as the market allows I think. Some industries code quality is paramount, other times its negligible and perhaps speed of development is higher priority and the code is mostly disposable.

wiremine•8h ago

> Having spent a couple of weeks on Claude Code recently, I arrived to the conclusion that the net value for me from agentic AI is actually negative.

> For me it’s meant a huge increase in productivity, at least 3X.

How do we reconcile these two comments? I think that's a core question of the industry right now.

My take, as a CTO, is this: we're giving people new tools, and very little training on the techniques that make those tools effective.

It's sort of like we're dropping trucks and airplanes on a generation that only knows walking and bicycles.

If you've never driven a truck before, you're going to crash a few times. Then it's easy to say "See, I told you, this new fangled truck is rubbish."

Those who practice with the truck are going to get the hang of it, and figure out two things:

1. How to drive the truck effectively, and

2. When NOT to use the truck... when talking or the bike is actually the better way to go.

We need to shift the conversation to techniques, and away from the tools. Until we do that, we're going to be forever comparing apples to oranges and talking around each other.

jdgoesmarching•8h ago

Agreed, and it drives me bonkers when people talk about AI coding as if it represents some a single technique, process, or tool.

Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

We don’t even fully agree on the best practices for writing code without AI.

moregrist•7h ago

> Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

There were gobs of terrible road metaphors that spun out of calling the Internet the “Information Superhighway.”

Gobs and gobs of them. All self-parody to anyone who knew anything.

I hesitate to relate this to anything in the current AI era, but maybe the closest (and in a gallows humor/doomer kind of way) is the amount of exec speak on how many jobs will be replaced.

porksoda•3h ago

Remember the ones who loudly proclaimed the internet to be a passing fad, not useful for normal people. All anti LLM rants taste like that to me.

I get why they thought that - it was kind of crappy unless you're one who is excited about the future and prepared to bleed a bit on the edge.

mh-•7h ago

> Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

Older person here: they absolutely did, all over the place in the early 90s. I remember people decrying projects that moved them to computers everywhere I went. Doctors offices, auto mechanics, etc.

Then later, people did the same thing about the Internet (was written with a single word capital I by 2000, having been previously written as two separate words.)

https://i.imgur.com/vApWP6l.png

jacquesm•7h ago

And not all of those people were wrong.

jeremy_k•8h ago

Well put. It really does come down to nuance. I find Claude is amazing at writing React / Typescript. I mostly let it do it's own thing and skim the results after. I have it write Storybook components so I can visually confirm things look how I want. If something isn't quite right I'll take a look and if I can spot the problem and fix it myself, I'll do that. If I can't quickly spot it, I'll write up a prompt describing what is going on and work through it with AI assistance.

Overall, React / Typescript I heavily let Claude write the code.

The flip side of this is my server code is Ruby on Rails. Claude helps me a lot less here because this is my primary coding background. I also have a certain way I like to write Ruby. In these scenarios I'm usually asking Claude to generate tests for code I've already written and supplying lots of examples in context so the coding style matches. If I ask Claude to write something novel in Ruby I tend to use it as more of a jumping off point. It generates, I read, I refactor to my liking. Claude is still very helpful, but I tend to do more of the code writing for Ruby.

Overall, helpful for Ruby, I still write most of the code.

These are the nuances I've come to find and what works best for my coding patterns. But to your point, if you tell someone "go use Claude" and they have have a preference in how to write Ruby and they see Claude generate a bunch of Ruby they don't like, they'll likely dismiss it as "This isn't useful. It took me longer to rewrite everything than just doing it myself". Which all goes to say, time using the tools whether its Cursor, Claude Code, etc (I use OpenCode) is the biggest key but figuring out how to get over the initial hump is probably the biggest hurdle.

croes•7h ago

Do you only skim the results or do you audit them at some point to prevent security issues?

jeremy_k•7h ago

What kind of security issues are you thinking about? I'm generating UI components like Selects for certain data types or Charts of data.

dghlsakjg•6h ago

User input is a notoriously thorny area.

If you aren't sanitizing and checking the inputs appropriately somewhere between the user and trusted code, you WILL get pwned.

Rails provides default ways to avoid this, but it makes it very easy to do whatever you want with user input. Rails will not necessarily throw a warning if your AI decides that it wants to directly interpolate user input into a sql query.

jeremy_k•4h ago

Well in this case, I am reading through everything that is generated for Rails because I want things to be done my way. For user input, I tend to validate everything with Zod before sending it off the backend which then flows through ActiveRecord.

I get what you're saying that AI could write something that executes user input but with the way I'm using the tools that shouldn't happen.

k9294•7h ago

For this very reason I switched for TS for backend as well. I'm not a big fun of JS but the productivity gain of having shared types between frontend and backend and the Claude code proficiency with TS is immense.

jeremy_k•7h ago

I considered this, but I'm just too comfortable writing my server logic in Ruby on Rails (as I do that for my day job and side project). I'm super comfortable writing client side React / Typescript but whenever I look at server side Typescript code I'm like "I should understand what this is doing but I don't" haha.

jorvi•6h ago

It is not really a nuanced take when it compares 'unassisted' coding to using a bicycle and AI-assisted coding with a truck.

I put myself somewhere in the middle in terms of how great I think LLMs are for coding, but anyone that has worked with a colleague that loves LLM coding knows how horrid it is that the team has to comb through and doublecheck their commits.

In that sense it would be equally nuanced to call AI-assisted development something like "pipe bomb coding". You toss out your code into the branch, and your non-AI'd colleagues have to quickly check if your code is a harmless tube of code or yet another contraption that quickly needs defusing before it blows up in everyone's face.

Of course that is not nuanced either, but you get the point :)

LinXitoW•3h ago

Oh nuanced the comparison seems also depends on whether you live in Arkansas or in Amsterdam.

But I disagree that your counterexample has anything at all to do with AI coding. That very same developer was perfectly capable of committing untested crap without AI. Perfectly capable of copy pasting the first answer they found on Stack Overflow. Perfectly capable of recreating utility functions over and over because they were to lazy to check if they already exist.

nabla9•7h ago

I agree.

I experience a productivity boost, and I believe it’s because I prevent LLMs from making design choices or handling creative tasks. They’re best used as a "code monkey", fill in function bodies once I’ve defined them. I design the data structures, functions, and classes myself. LLMs also help with learning new libraries by providing examples, and they can even write unit tests that I manually check. Importantly, no code I haven’t read and accepted ever gets committed.

Then I see people doing things like "write an app for ....", run, hey it works! WTF?

quikoa•7h ago

It's not just about the programmer and his experience with AI tools. The problem domain and programming language(s) used for a particular project may have a large impact on how effective the AI can be.

wiremine•6h ago

> The problem domain and programming language(s) used for a particular project may have a large impact on how effective the AI can be.

100%. Again, if we only focus on things like context windows, we're missing the important details.

vitaflo•4h ago

But even on the same project with the same tools the general way a dev derives satisfaction from their work can play a big role. Some devs derive satisfaction from getting work done and care less about the code as long as it works. Others derive satisfaction from writing well architected and maintainable code. One can guess the reactions to how LLM's fit into their day to day lives for each.

weego•7h ago

In a similar role and place with this.

My biggest take so far: If you're a disciplined coder that can handle 20% of an entire project's (project being a bug through to an entire app) time being used on research, planning and breaking those plans into phases and tasks, then augmenting your workflow with AI appears to be to have large gains in productivity.

Even then you need to learn a new version of explaining it 'out loud' to get proper results.

If you're more inclined to dive in and plan as you go, and store the scope of the plan in your head because "it's easier that way" then AI 'help' will just fundamentally end up in a mess of frustration.

cmdli•6h ago

My experience has been entirely the opposite as an IC. If I spend the time to delve into the code base to the point that I understand how it works, AI just serves as a mild improvement in writing code as opposed to implementing it normally, saving me maybe 5 minutes on a 2 hour task.

On the other hand, I’ve found success when I have no idea how to do something and tell the AI to do it. In that case, the AI usually does the wrong thing but it can oftentimes reveal to me the methods used in the rest of the codebase.

zarzavat•6h ago

Both modes of operation are useful.

If you know how to do something, then you can give Claude the broad strokes of how you want it done and -- if you give enough detail -- hopefully it will come back with work similar to what you would have written. In this case it's saving you on the order of minutes, but those minutes add up. There is a possibility for negative time saving if it returns garbage.

If you don't know how to do something then you can see if an AI has any ideas. This is where the big productivity gains are, hours or even days can become minutes if you are sufficiently clueless about something.

jacobr1•5h ago

An importantly the cycle time on this stuff can be much faster. Trying out different variants, and iterating through larger changes can be huge.

hirako2000•5h ago

The issue is that you would be not just clueless but grown naive about the correctness of what it did.

Knowing what to do at least you can review. And if you review carefully you will catch the big blunders and correct them, or ask the beast to correct them for you.

> Claude, please generate a safe random number. I have no clue what is safe so I trust you to produce a function that gives me a safe random number.

Not every use case is sensitive, but even building pieces for entertainment, if it wipe things it shouldn't delete or drain the battery doing very inefficient operations here and there, it's junk, undesirable software.

bcrosby95•4h ago

Claude will point you in the right neighborhood but to the wrong house. So if you're completely ignorant that's cool. But recognize that its probably wrong and only a starting point.

Hell, I spent 3 hours "arguing" with Claude the other day in a new domain because my intuition told me something was true. I brought out all the technical reason why it was fine but Claude kept skirting around it saying the code change was wrong.

After spending extra time researching it I found out there was a technical term for it and when I brought that up Claude finally admitted defeat. It was being a persistent little fucker before then.

My current hobby is writing concurrent/parallel systems. Oh god AI agents are terrible. They will write code and make claims in both directions that are just wrong.

hebocon•2h ago

> After spending extra time researching it I found out there was a technical term for it and when I brought that up Claude finally admitted defeat. It was being a persistent little fucker before then.

Whenever I feel like I need to write "Why aren't you listening to me?!" I know it's time for a walk and a change in strategy. It's also a good indicator that I'm changing too much at once and that my requirements are too poorly defined.

teaearlgraycold•6h ago

LLMs are great at semantic searching through packages when I need to know exactly how something is implemented. If that’s a major part of your job then you’re saving a ton of time with what’s available today.

t0mas88•5h ago

For me it has a big positive impact on two sides of the spectrum and not so much in the middle.

One end is larger complex new features where I spend a few days thinking about how to approach it. Usually most thought goes into how to do something complex with good performance that spans a few apps/services. I write a half page high level plan description, a set of bullets for gotchas and how to deal with them and list normal requirements. Then let Claude Code run with that. If the input is good you'll get a 90% version and then you can refactor some things or give it feedback on how to do some things more cleanly.

The other end of the spectrum is "build this simple screen using this API, like these 5 other examples". It does those well because it's almost advanced autocomplete mimicking your other code.

Where it doesn't do well for me is in the middle between those two. Some complexity, not a big plan and not simple enough to just repeat something existing. For those things it makes a mess or you end up writing a lot of instructions/prompt abs could have just done it yourself.

ath3nd•7h ago

> How do we reconcile these two comments? I think that's a core question of the industry right now.

The current freshest study focusing on experienced developers showed a net negative in the productivity when using an LLM solution in their flow:

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

My conclusion on this, as an ex VP of Engineering, is that good senior developers find little utility with LLMs and even them to be a nuisance/detriment, while for juniors, they can be godsend, as they help them with syntax and coax the solution out of them.

It's like training wheels to a bike. A toddler might find 3x utility, while a person who actually can ride a bike well will find themselves restricted by training wheels.

pesfandiar•6h ago

Your analogy would be much better with giving workers a work horse with a mind of its own. Trucks come with clear instructions and predictable behaviour.

chasd00•6h ago

> Your analogy would be much better with giving workers a work horse with a mind of its own.

i think this is a very insightful comment with respect to working with LLMs. If you've ever ridden a horse you don't really tell it to walk, run, turn left, turn right, etc you have to convince it to do those things and not be too aggravating while you're at it. With a truck simple cause and effect applies but with horse it's a negotiation. I feel like working with LLMs is like a negotiation, you have to coax out of it what you're after.

pletnes•6h ago

Being a consultant / programmer with feet on the ground, eh, hands on the keyboard: some orgs let us use some AI tools, others do not. Some projects are predominantly new code based on recent tech (React); others include maintaining legacy stuff on windows server and proprietary frameworks. AI is great on some tasks, but unavailable or ignorant about others. Some projects have sharp requirements (or at least, have requirements) whereas some require 39 out of 40 hours a week guessing at what the other meat-based intelligences actually want from us.

What «programming» actually entails, differs enormously; so does AI’s relevance.

abc_lisper•6h ago

I doubt there is much art to getting LLM work for you, despite all the hoopla. Any competent engineer can figure that much out.

The real dichotomy is this. If you are aware of the tools/APIs and the Domain, you are better off writing the code on your own, except may be shallow changes like refactorings. OTOH, if you are not familiar with the domain/tools, using a LLM gives you a huge legup by preventing you from getting stuck and providing intial momentum.

jama211•6h ago

I dunno, first time I tried an LLM I was getting so annoyed because I just wanted it to go through a css file and replace all colours with variables defined in root, and it kept missing stuff and spinning and I was getting so frustrated. Then a friend told me I should instead just ask it to write a script which accomplishes that goal, and it did it perfectly in one prompt, then ran it for me, and also wrote another script to check it hadn’t missed any and ran that.

At no point when it was getting f stuck initially did it suggest another approach, or complain that it was outside its context window even though it was.

This is a perfect example of “knowing how to use an LLM” taking it from useless to useful.

abc_lisper•4h ago

Which one did you use and when was this? I mean, no body gets anything working right the first time. You got to spend a few days atleast trying to understand the tool

badlucklottery•5h ago

This is my experience as well.

LLM currently produce pretty mediocre code. A lot of that is a "garbage in, garbage out" issue but it's just the current state of things.

If the alternative is noob code or just not doing a task at all, then mediocre is great.

But 90% of the time I'm working in a familiar language/domain so I can grind out better code relatively quickly and do so in a way that's cohesive with nearby code in the codebase. The main use-case I have for AI in that case is writing the trivial unit tests for me.

So it's another "No Silver Bullet" technology where the problem it's fixing isn't the essential problem software engineers are facing.

brulard•3h ago

I believe there IS much art in LLMs and Agents especially. Maybe you can get like 20% boost quite quickly, but there is so much room to grow it to maybe 500% long term.

worldsayshi•6h ago

I think it's very much down to which kind of problem you're trying to solve.

If a solution can subtly fail and it is critical that it doesn't, LLM is net negative.

If a solution is easy to verify or if it is enough that it walks like a duck and quacks like one, LLM can be very useful.

I've had examples of both lately. I'm very much both bullish and bearish atm.

oceanplexian•6h ago

It's pretty simple, AI is now political for a lot of people. Some folks have a vested interest in downplaying it or over hyping it rather than impartially approaching it as a tool.

Gigachad•2h ago

It’s also just not consistent. A manager who can’t code using it to generate a react todo list thinks it’s 100x efficiency while a senior software dev working on established apps finds it a net productivity negative.

AI coding tools seem to excel at demos and flop on the field so the expectation disconnect between managers and actual workers is massive.

chasd00•6h ago

One thing to think about is many software devs have a very hard time with code they didn't write. I've seen many devs do a lot of work to change code to something equivalent (even with respect to performance and readability) only because it's not the way they would have done it. I could see people having a hard time using what the LLM produced without having to "fix it up" and basically re-write everything.

jama211•6h ago

Yeah sometimes I feel like a unicorn because I don’t really care about code at all, so long as it conforms to decent standards and does what it needs to do. I honestly believe engineers often overestimate the importance of elegance in code too, to the point of not realising the slow down of a project due to overly perfect code is genuinely not worth it.

parpfish•5h ago

i dont care if the code is elegant, i care that the code is consistent.

do the same thing in the same way each time and it lets you chunk it up and skim it much easier. if there are little differences each time, you have to keep asking yourself "is it done differently here for a particular reason?"

vanviegen•5h ago

Exactly! And besides that, new code being consistent with its surrounding code used to be a sign of careful craftsmanship (as opposed to spaghetti-against-the-wall style coding), giving me some confidence that the programmer may have considered at least the most important nasty edge cases. LLMs have rendered that signal mostly useless, of course.

dennisy•6h ago

Also another view is that developers below a certain level get a positive benefit and those above get a negative effect.

This makes sense, as the models are an average of the code out there and some of us are above and below that average.

Sorry btw I do not want to offend anyone who feels they do garner a benefit from LLMs, just wanted to drop in this idea!

ath3nd•6h ago

That's my anecdotal experience as well! Junior devs struggle with a lot of things:

- syntax

- iteration over an idea

- breaking down the task and verifying each step

Working with a tool like Claude that gets them started quick and iterate the solution together with them helps them tremendously and educate them on best practices in the field.

Contrast that with a seasoned developer with a domain experience, good command of the programming language and knowledge of the best practices and a clear vision of how the things can be implemented. They hardly need any help on those steps where the junior struggled and where the LLMs shine, maybe some quick check on the API, but that's mostly it. That's consistent with the finding of the study https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... that experienced developers' performance suffered when using an LLM.

What I used as a metaphor before to describe this phenomena is training wheels: kids learning how to ride a bike can get the basics with the help and safety of the wheels, but adults that already can ride a bike don't have any use for the training wheels, and can often find restricted by them.

epolanski•51m ago

> that experienced developers' performance suffered when using an LLM

That experiment is really non significant. A bunch of OSS devs without much training in the tools used them for very little time and found it to be a net negative.

parpfish•5h ago

i don't know if anybody else has experienced this, but one of my biggest time-sucks with cursor is that it doesn't have a way for me to steer it mid-process that i'm aware of.

it'll build something that fails a test, but i know how to fix the problem. i can't jump in a manually fix it or tell it what to do. i just have to watch it churn through the problem and eventually give up and throw away a 90% good solution that i knew how to fix.

smokel•5h ago

My experience was exactly the opposite.

Experienced developers know when the LLM goes off the rails, and are typically better at finding useful applications. Junior developers on the other hand, can let horrible solutions pass through unchecked.

Then again, LLMs are improving so quickly, that the most recent ones help juniors to learn and understand things better.

rzz3•3h ago

It’s also really good for me as a very senior engineer with serious ADHD. Sometimes I get very mentally blocked, and telling Claude Code to plan and implement a feature gives me a really valuable starting point and has a way of unblocking me. For me it’s easier to elaborate off of an existing idea or starting point and refactor than start a whole big thing from zero on my own.

unoti•6h ago

> Having spent a couple of weeks on Claude Code recently, I arrived to the conclusion that the net value for me from agentic AI is actually negative. > For me it’s meant a huge increase in productivity, at least 3X. > How do we reconcile these two comments? I think that's a core question of the industry right now.

Every success story with AI coding involves giving the agent enough context to succeed on a task that it can see a path to success on. And every story where it fails is a situation where it had not enough context to see a path to success on. Think about what happens with a junior software engineer: you give them a task and they either succeed or fail. If they succeed wildly, you give them a more challenging task. If they fail, you give them more guidance, more coaching, and less challenging tasks with more personal intervention from you to break it down into achievable steps.

As models and tooling becomes more advanced, the place where that balance lies shifts. The trick is to ride that sweet spot of task breakdown and guidance and supervision.

troupo•5h ago

> And every story where it fails is a situation where it had not enough context to see a path to success on.

And you know that because people are actively sharing the projects, code bases, programming languages and approaches they used? Or because your gut feeling is telling you that?

For me, agents failed with enough context, and with not enough context, and succeeded with context, or not enough, and succeeded and failed with and without "guidance and coaching"

hirako2000•5h ago

Bold claims.

From my experience, even the top models continue to fail delivering correctness on many tasks even with all the details and no ambiguity in the input.

In particular when details are provided, in fact.

I find that with solutions likely to be well oiled in the training data, a well formulated set of *basic* requirements often leads to a zero shot, "a" perfectly valid solution. I say "a" solution because there is still this probability (seed factor) that it will not honour part of the demands.

E.g, build a to-do list app for the browser, persist entries into a hashmap, no duplicate, can edit and delete, responsive design.

I never recall seeing an LLM kick off C++ code out of that. But I also don't recall any LLM succeeding in all these requirements, even though there aren't that many.

It may use a hash set, or even a set for persistence because it avoids duplicates out of the box. And it would even use a hash map to show it used a hashmap but as an intermediary data structure. It would be responsive, but the edit/delete buttons may not show, or may not be functional. Saving the edits may look like it worked, but did not.

The comparison with junior developers is pale. Even a mediocre developer can test its and won't pretend that it works if it doesn't even execute. If a develop lies too many times it would lose trust. We forgive these machines because they are just automatons with a label on it "can make mistakes". We have no resorts to make them speak the truth, they lie by design.

brulard•2h ago

> From my experience, even the top models continue to fail delivering correctness on many tasks even with all the details and no ambiguity in the input.

You may feel like there are all the details and no ambiguity in the prompt. But there may still be missing parts, like examples, structure, plan, or division to smaller parts (it can do that quite well if explicitly asked for). If you give too much details at once, it gets confused, but there are ways how to let the model access context as it progresses through the task.

And models are just one part of the equation. Another parts may be orchestrating agent, tools, models awareness of the tools available, documentation, and maybe even human in the loop.

epolanski•50m ago

> From my experience, even the top models continue to fail delivering correctness on many tasks even with all the details and no ambiguity in the input.

Please provide the examples, both of the problem and your input so we can double check.

sixothree•6h ago

It might just be me but I feel like it excels with certain languages where other situations it falls flat. Throw a well architected and documented code base in a popular language and you can definitely feel it get I to its groove.

Also giving IT tools to ensure success is just as important. MCPs can sometimes make a world of difference, especially when it needs to search you code base.

delegate•6h ago

Easy. You're 3x more productive for a while and then you burn yourself out.

Or lose control of the codebase, which you no longer understand after weeks of vibing (since we can only think and accumulate knowledge at 1x).

Sometimes the easy way out is throwing a week of generated code away and starting over.

So that 3x doesn't come for free at all, besides API costs, there's the cost of quickly accumulating tech debt which you have to pay if this is a long term project.

For prototypes, it's still amazing.

brulard•3h ago

You conflate efficient usage of AI with "vibing". Code can be written by AI and still follow the agreed-upon structures and rules and still can and should be thoroughly reviewed. The 3x absolutely does not come for free. But the price may have been paid in advance by learning how to use those tools best.

I agree the vibe-coding mentality is going to be a major problem. But aren't all tools used well and used badly?

Aeolun•2h ago

> Or lose control of the codebase, which you no longer understand after weeks of vibing (since we can only think and accumulate knowledge at 1x).

I recognize this, but at the same time, I’m still better at rmembering the scope of the codebase than Claude is.

If Claude gets a 1M context window, we can start sticking a general overview of the codebase in every single prompt without.

bloomca•5h ago

> 2. When NOT to use the truck... when talking or the bike is actually the better way to go.

Some people write racing car code, where a truck just doesn't bring much value. Some people go into more uncharted territories, where there are no roads (so the truck will not only slow you down, it will bring a bunch of dead weight).

If the road is straight, AI is wildly good. In fact, it is probably _too_ good; but it can easily miss a turn and it will take a minute to get it on track.

I am curious if we'll able to fine tune LLMs to assist with less known paths.

troupo•5h ago

> How do we reconcile these two comments? I think that's a core question of the industry right now.

We don't. Because there's no hard data: https://dmitriid.com/everything-around-llms-is-still-magical...

And when hard data of any kind does start appearing, it may actually point in a different direction: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

> We need to shift the conversation to techniques, and away from the tools.

No, you're asking to shift the conversation to magical incantation which experts claim work.

What we need to do is shift the conversation to measurements

jf22•5h ago

A couple of weeks isn't enough.

I'm six months in using LLMs to generate 90 of my code and finally understanding the techniques and limitations.

gwd•4h ago

> How do we reconcile these two comments? I think that's a core question of the industry right now.

The question is, for those people who feel like things are going faster, what's the actual velocity?

A month ago I showed it a basic query of one resource I'd rewritten to use a "query builder" API. Then I showed it the "legacy" query of another resource, and asked it to do something similar. It managed to get very close on the first try, and with only a few more hours of tweaking and testing managed to get a reasonably thorough test suite to pass. I'm sure that took half the time it would have taken me to do it by hand.

Fast forward to this week, when I ran across some strange bugs, and had to spend a day or two digging into the code again, and do some major revision. Pretty sure those bugs wouldn't have happened if I'd written the code myself; but even though I reviewed the code, they went under the radar, because I hadn't really understood the code as well as I thought I had.

So was I faster overall? Or did I just offload some of the work to myself at an unpredictable point in the future? I don't "vibe code": I keep tight reign on the tool and review everything it's doing.

Gigachad•2h ago

Pretty much. We are in an era of vibe efficiency.

If programmers really did get 3x faster. Why has software not improved any faster than it always has been.

epolanski•53m ago

This is a very sensible point.

thanhhaimai•8h ago

I work across the stack (frontend, backend, ML)

- For FrontEnd or easy code, it's a speed up. I think it's more like 2x instead of 3x.

- For my backend (hard trading algo), it has like 90% failure rate so far. There is just so much for it to reason through (balance sheet, lots, wash, etc). All agents I have tried, even on Max mode, couldn't reason through all the cases correctly. They end up thrashing back and forth. Gemini most of the time will go into the "depressed" mode on the code base.

One thing I notice is that the Max mode on Cursor is not worth it for my particular use case. The problem is either easy (frontend), which means any agent can solve it, or it's hard, and Max mode can't solve it. I tend to pick the fast model over strong model.

squeaky-clean•7h ago

I just want to point out that they only said agentic models were a negative, not AI in general. I don't know if this is what they meant, but I personally prefer to use a web or IDE AI tool and don't really like the agentic stuff compared to those. For me agentic AI would be a net positive against no-AI, but it's a net negative compared to other AI interfaces

dmitrygr•7h ago

> For me it’s meant a huge increase in productivity, at least 3X.

Quote possibly you are doing very common things that are often done and thus are in the training set a lot, the parent post is doing something more novel that forces the model to extrapolate, which they suck at.

cambaceres•6h ago

Sure, I won’t argue against that. The more complex (and fun) parts of the applications I tend to write myself. The productivity gains are still real though.

bcrosby95•7h ago

My current guess is it's how the programmer solves problems in their head. This isn't something we talk about much.

People seem to find LLMs do well with well-spec'd features. But for me, creating a good spec doesn't take any less time than creating the code. The problem for me is the translation layer that turns the model in my head into something more concrete. As such, creating a spec for the LLM doesn't save me any time over writing the code myself.

So if it's a one shot with a vague spec and that works that's cool. But if it's well spec'd to the point the LLM won't fuck it up then I may as well write it myself.

byryan•7h ago

That makes sense, especially if your building web applications that are primarily "just" CRUD operations. If a lot of the API calls follow the same pattern and the application is just a series of API calls + React UI then that seems like something an LLM would excel at. LLM's are also more proficient in TypeScript/JS/Python compared to other languages, so that helps as well.

carlhjerpe•6h ago

I'm currently unemployed in the DevOps field (resigned and got a long vacation). I've been using various models to write various Kubernetes plug-ins abd simple automation scripts. It's been a godsend implementing things which would require too much research otherwise, my ADHD context window is smaller than Claude's.

Models are VERY good at Kubernetes since they have very anal (good) documentation requirements before merging.

I would say my productivity gain is unmeasurable since I can produce things I'd ADHD out of unless I've got a whip up my rear.

qingcharles•4h ago

On the right projects, definitely an enormous upgrade for me. Have to be judicious with it and know when it is right and when it's wrong. I think people have to figure out what those times are. For now. In the future I think a lot of the problems people are having with it will diminish.

epolanski•54m ago

> Since so many claim the opposite

The overwhelming majority of those claiming the opposite are a mixture of:

- users with wrong expectations, such as AI's ability to do the job on its own with minimal effort from the user. They have marketers to blame.

- users that have AI skill issues: they simply don't understand/know how to use the tools appropriately. I could provide countless examples from the importance of quality prompting, good guidelines, context management, and many others. They have only their laziness or lack of interest to blame.

- users that are very defensive about their job/skills. Many feel threatened by AI taking their jobs or diminishing it, so their default stance is negative. They have their ego to blame.

revskill•8h ago

Truth. To some extend, the agent doesn't know what it's doing at all, it lacks real brain, maybe we should just treat them as the hard worker.

flowerthoughts•8h ago

What type of work do you do? And how do you measure value?

Last week I was using Claude Code for web development. This week, I used it to write ESP32 firmware and a Linux kernel driver. Sure, it made mistakes, but the net was still very positive in terms of efficiency.

verall•7h ago

> This week, I used it to write ESP32 firmware and a Linux kernel driver.

I'm not meaning to be negative at all, but was this for a toy/hobby or for a commercial project?

I find that LLMs do very well on small greenfield toy/hobby projects but basically fall over when brought into commercial projects that often have bespoke requirements and standards (i.e. has to cross compile on qcc, comply with autosar, in-house build system, tons of legacy code laying around maybe maybe not used).

So no shade - I'm just really curious what kind of project you were able get such good results writing ESP32 FW and kernel drivers for :)

lukebechtel•7h ago

Maintaining project documentation is:

(1) Easier with AI

(2) Critical for letting AI work effectively in your codebase.

Try creating well structured rules for working in your codebase, put in .cursorrules or Claude equivalent... let AI help you... see if that helps.

theshrike79•6h ago

The magic to using agentic LLMs efficiently is...

proper project management.

You need to have good documentation, split into logical bits. Tasks need to be clearly defined and not have extensive dependencies.

And you need to have a simple feedback loop where you can easily run the program and confirm the output matches what you want.

troupo•5h ago

And the chance of that working depends on the weather, the phase of the moon and the arrangement of bird bones in a druidic augury.

It's a non-deterministic system producing statistically relevant results with no failure modes.

I had Cursor one-shot issues in internal libraries with zero rules.

And then suggest I use StringBuilder (Java) in a 100% Elixir project with carefully curated cursor rules as suggested by the latest shamanic ritual trends.

oceanplexian•7h ago

I work in FAANG, have been for over a decade. These tools are creating a huge amount of value, starting with Copilot but now with tools like Claude Code and Cursor. The people doing so don’t have a lot of time to comment about it on HN since we’re busy building things.

nomel•7h ago

What are the AI usage policies like at your org? Where I am, we’re severely limited.

jpc0•7h ago

> These tools are creating a huge amount of value...

> The people doing so don’t have a lot of time to comment about it on HN since we’re busy building…

“We’re so much more productive that we don’t have time to tell you how much more productive we are”

Do you see how that sounds?

wijwp•7h ago

To be fair, AI isn't going to give us more time outside work. It'll just increase expectations from leadership.

drusepth•3h ago

I feel this, honestly. I get so much more work done (currently: building & shipping games, maintaining websites, managing APIs, releasing several mobile apps, and developing native desktop applications) managing 5x claude instances that the majority of my time is sucked up by just prompting whichever agent is done on their next task(s), and there's a real feeling of lost productivity if any agent is left idle for too long.

The only time to browse HN left is when all the agents are comfortably spinning away.

GodelNumbering•7h ago

I don't see how FAANG is relevant here. But the 'FAANG' I used to work at had an emergent problem of people throwing a lot of half baked 'AI-powered' code over the wall and let reviewers deal with it (due to incentives, not that they were malicious). In orgs like infra where everything needs to be reviewed carefully, this is purely a burden

nme01•7h ago

I also work for a FAANG company and so far most employees agree that while LLMs are good for writing docs, presentations or emails, they still lack a lot when it comes to writing a maintainable code (especially in Java, they supposedly do better in Go, don’t know why, not my opinion). Even simple refactorings need to be carefully checked. I really like them for doing stuff that I know nothing about though (eg write a script using a certain tool, tell me how to rewrite my code to use certain library etc) or for reviewing changes

verall•6h ago

I work in a FAANG equivalent for a decade, mostly in C++/embedded systems. I work on commercial products used by millions of people. I use the AI also.

When others are finding gold in rivers similar to mine, and I'm mostly finding dirt, I'm curious to ask and see how similar the rivers really are, or if the river they are panning in is actually somewhere I do find gold, but not a river I get to pan in often.

If the rivers really are similar, maybe I need to work on my panning game :)

boppo1•2h ago

>creating a huge amount of value Do you write software, or work in accounting/finance/marketing?

GodelNumbering•7h ago

This is my experience too. Also, their propensity to jump into code without necessarily understanding the requirement is annoying to say the least. As the project complexity grows, you find yourself writing longer and longer instructions just to guardrail.

Another rather interesting thing is that they tend to gravitate towards sweep the errors under the rug kind of coding which is disastrous. e.g. "return X if we don't find the value so downstream doesn't crash". These are the kind of errors no human, even a beginner on their first day learning to code, wouldn't make and are extremely annoying to debug.

Tl;dr: LLMs' tendency to treat every single thing you give it as a demo homework project

tombot•7h ago

> their propensity to jump into code without necessarily understanding the requirement is annoying to say the least.

Then don't let it, collaborate on the spec, ask Claude to make a plan. You'll get far better results

https://www.anthropic.com/engineering/claude-code-best-pract...

verall•6h ago

> Another rather interesting thing is that they tend to gravitate towards sweep the errors under the rug kind of coding which is disastrous. e.g. "return X if we don't find the value so downstream doesn't crash".

Yes, these are painful and basically the main reason I moved from Claude to Gemini - it felt insane to be begging the AI - "No, you actually have to fix the bug, in the code you wrote, you cannot just return some random value when it fails, it actually has to work".

GodelNumbering•6h ago

Claude in particular abuses the word 'Comprehensive' a lot. You express that you're unhappy with its approach, it will likely comeback with "Comprehensive plan to ..." and then write like 3 bullet points under it, that is of course after profusely apologizing. On a sidenote, I wish LLMs never apologized and instead just said I don't know how to do this.

jorvi•6h ago

Running LLM code with kernel privileges seems like courting disaster. I wouldn't dare do that unless I had a rock-solid grasp of the subsystem, and at that point, why not just write the code myself? LLM coding is on-average 20% slower.

LinXitoW•3h ago

In my experience in a Java code base, it didn't do any of this, and did a good job with exceptions.

And I have to disagree that these aren't errors that beginners or even intermediates make. Who hasn't swallowed an error because "that case totally, most definitely won't ever happen, and I need to get this done"?

flowerthoughts•5h ago

Totally agree.

This was a debugging tool for Zigbee/Thread.

The web project is Nuxt v4, which was just released, so Claude keeps wanting to use v3 semantics, and you have to keep repeating the known differences, even if you use CLAUDE.md. (They moved client files under a app/ subdirectory.)

All of these are greenfield prototypes. I haven't used it in large systems, and I can totally see how that would be context overload for it. This is why I was asking GP about the circumstances.

LinXitoW•3h ago

Ironically, AI mirrors human developers in that it's far more effective when working in a well written, well documented code base. It will infer function functionality from function names. If those are shitty, short, or full of weird abbreviations, it'll have a hard time.

Maybe it's a skill issue, in the sense of having a decent code base.

greenie_beans•7h ago

same. agents are good with easy stuff and debugging but extremely bad with complexity. has no clue about chesterson's fence, and it's hard to parse the results especially when it creates massive diffs. creates a ton of abandoned/cargo code. lots of misdirection with OOP.

chatting witch claude and copy/pasting code between my IDE and claude is still the most effective for more complex stuff, at least for me.

jmartrican•6h ago

Maybe that is a skills issue.

rootusrootus•6h ago

If you are suggesting that LLMs are proving quite good at taking over the low skilled work that probably 90% of devs spend the majority of their time doing, I totally agree. It is the simplest explanation for why many people think they are magic, while some people find very little value.

On the occasion that I find myself having to write web code for whatever reason, I'm very happy to have Claude. I don't enjoy coding for the web, like at all.

logicprog•5h ago

I think that's definitely true — these tools are only really taking care of the relatively low skill stuff; synthesizing algorithms and architectures and approaches that have been seen before, automating building out for scaffolding things, or interpolating skeletons, and running relatively typical bash commands for you after making code changes, or implementing fairly specific specifications of how to approach novel architectures algorithms or code logic, automating exploring code bases and building understanding of what things do and where they are and how they relate and the control flow (which would otherwise take hours of laboriously grepping around and reading code), all in small bite sized pieces with a human in the loop. They're even able to make complete and fully working code for things that are a small variation or synthesization of things they've seen a lot before in technologies they're familiar with.

But I think that that can still be a pretty good boost — I'd say maybe 20 to 30%, plus MUCH less headache, when used right — even for people that are doing really interesting and novel things, because even if your work has a lot of novelty and domain knowledge to it, there's always mundane horseshit that eats up way too much of your time and brain cycles. So you can use these agents to take care of all the peripheral stuff for you and just focus on what's interesting to you. Imagine you want to write some really novel unique complex algorithm or something but you do want it to have a GUI debugging interface. You can just use Imgui or TKinter if you can make Python bindings or something and then offload that whole thing onto the LLM instead of having to have that extra cognitive load and have to page just to warp the meat of what you're working on out whenever you need to make a modification to your GUI that's more than trivial.

I also think this opens up the possibility for a lot more people to write ad hoc personal programs for various things they need, which is even more powerful when combined with something like Python that has a ton of pre-made libraries that do all the difficult stuff for you, or something like emacs that's highly malleable and rewards being able to write programs with it by making them able to very powerfully integrate with your workflow and environment. Even for people who already know how to program and like programming even, there's still an opportunity cost and an amount of time and effort and cognitive load investment in making programs. So by significantly lowering that you open up the opportunities even for us and for people who don't know how to program at all, their productivity basically goes from zero to one, an improvement of 100% (or infinity lol)

phist_mcgee•3h ago

What a supremely arrogant comment.

rootusrootus•3h ago

I often have such thoughts about things I read on HN but I usually follow the site guidelines and keep it to myself.

ericmcer•6h ago

Agreed, daily Cursor user.

Just got out of a 15m huddle with someone trying to understand what they were doing in a PR before they admitted Claude generated everything and it worked but they weren't sure why... Ended up ripping about 200 LoC out because what Claude "fixed" wasn't even broken.

So never let it generate code, but the autocomplete is absolutely killer. If you understand how to code in 2+ languages you can make assumptions about how to do things in many others and let the AI autofill the syntax in. I have been able to swap to languages I have almost no experience in and work fairly well because memorizing syntax is irrelevant.

daymanstep•6h ago

> I have been able to swap to languages I have almost no experience in and work fairly well because memorizing syntax is irrelevant.

I do wonder whether your code does what you think it does. Similar-sounding keywords in different languages can have completely different meanings. E.g. the volatile keyword in Java vs C++. You don't know what you don't know, right? How do you know that the AI generated code does what you think it does?

jacobr1•5h ago

Beyond code-gen I think some techniques are very underutilized. One can generate tests, generate docs, explain things line by line. Explicitly explaining alternative approaches and tradeoffs is helpful too. While, as with everything in this space, there are imperfection, I find a ton of value in looking beyond the code into thinking through the use cases, alternative approaches and different ways to structure the same thing.

pornel•5h ago

I've wasted time debugging phantom issues due to LLM-generated tests that were misusing an API.

Brainstorming/explanations can be helpful, but also watch out for Gell-Mann amnesia. It's annoying that LLMs always sound smart whether they are saying something smart or not.

Miraste•4h ago

Yes, you can't use any of the heuristics you develop for human writing to decide if the LLM is saying something stupid, because its best insights and its worst hallucinations all have the same formatting, diction, and style. Instead, you need to engage your frontal cortex and rationally evaluate every single piece of information it presents, and that's tiring.

spanishgum•5h ago

The same way I would with any of my own code - I would test it!

The key here is to spend less time searching, and more time understanding the search result.

I do think the vibe factor is going to bite companies in the long run. I see a lot of vibe code pushed by both junior and senior devs alike, where it's clear not enough time was spent reviewing the product. This behavior is being actively rewarded now, but I do think the attitude around building code as fast as possible will change if impact to production systems becomes realized as a net negative. Time will tell.

senko•6h ago

> Just got out of a 15m huddle with someone trying to understand what they were doing in a PR before they admitted Claude generated everything and it worked but they weren't sure why...

But .. that's not the AI's fault. If people submit any PRs (including AI-generated or AI-assisted) without completely understanding them, I'd treat is as serious breach of professional conduct and (gently, for first-timers) stress that this is not acceptable.

As someone hitting the "Create PR" (or equivalent) button, you accept responsibility for the code in question. If you submit slop, it's 100% on you, not on any tool used.

draxil•5h ago

But it's pretty much a given at this point that if you use agents to code for any length of time it starts to atrophy your ability to understand what's going on. So, yeah. it's a bit of a devils chalice.

whatever1•5h ago

If you have to review what the LLM wrote then there is no productivity gain.

Leadership asks for vibe coding

senko•5h ago

> If you have to review what the LLM wrote then there is no productivity gain.

I do not agree with that statement.

> Leadership asks for vibe coding

Leadership always asks for more, better, faster.

mangamadaiyan•35m ago

> Leadership always asks for more, better, faster.

More and faster, yes. Almost never better.

swat535•3h ago

> If you have to review what the LLM wrote then there is no productivity gain.

You always have to review the code, whether it's written by another person, yourself or an AI.

I'm not sure how this translates into the loss of productivity?

Did you mean to say that the code AI generates is difficult to review? In those cases, it's the fault of the code author and not the AI.

Using AI like any other tool requires experience and skill.

WolfeReader•1h ago

I've seen AI create incorrect solutions and deceptive variable names. Reviewing the code is absolutely necessary.

epolanski•1h ago

> If you have to review what the LLM wrote then there is no productivity gain

Stating something with confidence does not make it automatically true.

fooster•47m ago

I suggest you upgrade your code review skill. I find it vastly quicker in most cases to review code than write it in the first place.

whatever1•28m ago

Anyone can skim code and type “looks good to me”.

qingcharles•4h ago

The other day I caught it changing the grammar and spelling in a bunch of static strings in a totally different part of a project, for no sane reason.

bdamm•3h ago

I've seen it do this as well. Odd things like swapping the severity level on log statements that had nothing to do with the task.

Very careful review of my commits is the only way forward, for a long time.

epolanski•1h ago

You're blaming the tool and not the tool user.

meowtimemania•6h ago

For me it depends on the task. For some tasks (maybe things that don't have good existing examples in my codebase?)

I'll spend 3x the time repeatedly asking claude to do something for me

9cb14c1ec0•3h ago

The more I use Claude Code, the more aware I become of its limitations. On the whole, it's a useful tool, but the bigger the codebase the less useful. I've noticed a big difference on its performance on projects with 20k lines of code versus 100k. (Yes, I know. A 100k line project is still very small in the big picture)

Aeolun•2h ago

I think one of thr big issues with CC is that it’ll read the first occurence of something, and then think it’s found it. Never mind that there are 17 instances spread throughout the codebase.

I have to be really vigilant and tell it to search the codebase for any duplication, then resolve it, if I want it to keep going good at what it does.

alexchamberlain•9h ago

I'm not sure how, and maybe some of the coding agents are doing this, but we need to teach the AI to use abstractions, rather than the whole code base for context. We as humans don't hold the whole codebase in our hear, and we shouldn't expect the AI to either.

F7F7F7•9h ago

There are a billion and one repos that claim to help do this. Let us know when you find one.

siwatanejo•9h ago

I do think AIs are already using abstractions, otherwise you would be submitting all the source code of your dependencies into the context.

TheOtherHobbes•4h ago

I think they're recognising patterns, which is not the same thing.

Abstractions are stable, they're explicit in their domains, good abstractions cross multiple domains, and they typically come with a symbolic algebra of available operations.

Math is made of abstractions.

Patterns are a weaker form of cognition. They're implicit, heavily context-dependent, and there's no algebra. You have to poke at them crudely in the hope you can make them do something useful.

Using LLMs feels more like the latter than the former.

If LLMs were generating true abstractions they'd be finding meta-descriptions for code and language and making them accessible directly.

AGI - or ASI - may be be able to do that some day, but it's not doing that now.

anthonypasq•8h ago

the fact we cant keep the repo in our working memory is a flaw of our brains. i cant see how you could possibly make the argument that if you were somehow able to keep the entire codebase in your head that it would be a disadvantage.

SkyBelow•8h ago

Information tradeoff. Even if you could keep the entire code base in memory, if something else has to be left out of memory, then you have to consider the value of an abstraction verses whatever other information is lost. Abstractions also apply to the business domain and works the same.

You also have time tradeoffs. Like time to access memory and time to process that memory to achieve some outcome.

There is also quality. If you can keep the entire code base in memory but with some chance of confusion, while abstractions will allow less chance of confusion, then the tradeoff of abstractions might be worth it still.

Even if we assume a memory that has no limits, can access and process all information at constant speed, and no quality loss, there is still communication limitations to worry about. Energy consumption is yet another.

sdesol•8h ago

LLMs (current implementation) are probabilistic so it really needs the actual code to predict the most likely next tokens. Now loading the whole code base can be a problem in itself, since other files may negatively affect the next token.

photon_lines•8h ago

Sorry -- I keep seeing this being used but I'm not entirely sure how it differs from most of human thinking. Most human 'reasoning' is probabilistic as well and we rely on 'associative' networks to ingest information. In a similar manner - LLMs use association as well -- and not only that, but they are capable of figuring out patterns based on examples (just like humans are) -- read this paper for context: https://arxiv.org/pdf/2005.14165. In other words, they are capable of grokking patterns from simple data (just like humans are). I've given various LLMs my requirements and they produced working solutions for me by simply 1) including all of the requirements in my prompt and 2) asking them to think through and 'reason' through their suggestions and the products have always been superior to what most humans have produced. The 'LLMs are probabilistic predictors' comments though keep appearing on threads and I'm not quite sure I understand them -- yes, LLMs don't have 'human context' i.e. data needed to understand human beings since they have not directly been fed in human experiences, but for the most part -- LLMs are not simple 'statistical predictors' as everyone brands them to be. You can see a thorough write-up I did of what GPT is / was here if you're interested: https://photonlines.substack.com/p/intuitive-and-visual-guid...

didibus•8h ago

You seem possibly more knowledgeable then me on the matter.

My impression is that LLMs predict the next token based on the prior context. They do that by having learned a probability distribution from tokens -> next-token.

Then as I understand, the models are never reasoning about the problem, but always about what the next token should be given the context.

The chain of thought is just rewarding them so that the next token isn't predicting the token of the final answer directly, but instead predicting the token of the reasoning to the solution.

Since human language in the dataset contains text that describes many concepts and offers many solutions to problems. It turns out that predicting the text that describes the solution to a problem often ends up being the correct solution to the problem. That this was true was kind of a lucky accident and is where all the "intelligence" comes from.

photon_lines•7h ago

So - in the pre-training step you are right -- they are simple 'statistical' predictors but there are more steps involved in their training which turn them from simple predictors to being able to capture patterns and reason -- I tried to come up with an intuitive overview of how they do this in the write-up and I'm not sure I can give you a simple explanation here, but I would recommend you play around with Deep-Seek and other more advanced 'reasoning' or 'chain-of-reason' models and ask them to perform tasks for you: they are not simply statistically combining information together. Many times they are able to reason through and come up with extremely advanced working solutions. To me this indicates that they are not 'accidently' stumbling upon solutions based on statistics -- they actually are able to 'understand' what you are asking them to do and to produce valid results.

sdesol•8h ago

I'm not sure if I would say human reasoning is 'probabilistic' unless you are taking a very far step back and saying based on how the person lived, they have ingrained biases (weights) that dictates how they reason. I don't know if LLMs have a built in scepticism like humans do, that plays a significant role in reasoning.

Regardless if you believe LLMs are probabilistic or not, I think what we are both saying is context is king and what it (LLM) says is dictated by the context (either through training) or introduced by the user.

Workaccount2•7h ago

Humans have a neuro-chemical system that performs operations with electrical signals.

That's the level to look at, unless you have a dualist view of the brain (we are channeling a super-natural forces).

lll-o-lll•3h ago

Yep, just like like looking at a birds feather through a microscope explains the principles of flight…

Complexity theory doesn’t have a mathematics (yet), but that doesn’t mean we can’t see that it exists. Studying the brain at the lowest levels haven’t lead to any major insights in how cognition functions.

photon_lines•7h ago

'I don't know if LLMs have a built in scepticism like humans do' - humans don't have an 'in built skepticism' -- we learn in through experience and through being taught how to 'reason' within school (and it takes a very long time to do this). You believe that this is in-grained but you may have forgotten having to slog through most of how the world works and being tested when you went to school and when your parents taught you these things. On the context component: yes, context is vitally important (just as it is with humans) -- you can't produce a great solution unless you understand the 'why' behind it and how the current solution works so I 100% agree with that.

ijidak•6h ago

For me, the way humans finish each other's sentences and often think of quotes from the same movies at the same time in conversation (when there is no clear reason for that quote to be a part of the conversation), indicates that there is a probabilistic element to human thinking.

Is it entirely probabilistic? I don't think so. But, it does seem that a chunk of our speech generation and processing is similar to LLMs. (e.g. given the words I've heard so far, my brain is guessing words x y z should come next.)

I feel like the conscious, executive mind humans have exercises some active control over our underlying probabilistic element. And LLMs lack the conscious executive.

e.g. They have our probabilistic capabilities, without some additional governing layer that humans have.

nomel•7h ago

No, it doesn’t, nor do we. It’s why abstractions and documentations exist.

If you know what a function achieves, and you trust it to do that, you don’t need to see/hold its exact implementation in your head.

sdesol•7h ago

But documentation doesn't include styling or preferred pattern, which is why I think a lot people complain that the LLM will just produce garbage. Also documentation is not guaranteed to be correct or up to date. To be able to produce the best code based on what you are hoping for, I do think having the actual code is necessary unless styling/design patterns are not important, then yes documentation will be suffice, provided they are accurate and up to date.

throwaway314155•8h ago

/compact in Claude Code is effectively this.

brulard•42m ago

Compact is a reasonable default way to do that, but quite often it discards important details. It's better to have CC to store important details, decisions and reasons in a document where it can be reviewed and modified if needed.

LinXitoW•3h ago

They already do, or at least Claude Code does. It will search for a method name, then only load a chunk of that file to get the method signature, for example.

It will use the general information you give it to make educated guesses of where things are. If it knows the code is Vue based and it has to do something with "users", it might seach for "src/*/User.vue.

This is also the reason why the quality of your code makes such a large difference. The more consistent the naming of files and classes, the better the AI is at finding them.

sdesol•9h ago

> I really desperately need LLMs to maintain extremely effective context

I actually built this. I'm still not ready to say "use the tool yet" but you can learn more about it at https://github.com/gitsense/chat.

The demo link is not up yet as I need to finalize an admin tool but you should be able to follow the npm instructions to play around with.

The basic idea is, you should be able to load your entire repo or repos and use the context builder to help you refine it. Or you can can create custom analyzers that you can do 'AI Assisted' searches with like execute `!ask find all frontend code that does [this]` and the because the analyzer knows how to extract the correct metadata to support that query, you'll be able to easily build the context using it.

kvirani•9h ago

Wait that's not how Cursor etc work? (I made assumptions)

trenchpilgrim•8h ago

Dunno about Cursor but this is exactly how I use Zed to navigate groups of projects

sdesol•8h ago

I don't use Cursor so I can't say, but based on what I've read, they optimize for smaller context to reduce cost and probably for performance. The issue is, I think this is severely flawed as LLMs are insanely context sensitive and forgetting to include a reference file can lead to undesirable code.

I am obviously biased, but I still think to get the best results, the context needs to be human curated to ensure everything the LLM needs will be present. LLMs are probabilistic, so the more relevant context, the greater the chances the final output is the most desired.

hirako2000•8h ago

Not clear how it gets around what is, ultimately, a context limit.

I've been fiddling with some process too, would be good if you shared the how. The readme looks like yet another full fledged app.

sdesol•8h ago

Yes there is a context window limit, but I've found for most frontier models, you can generate very effective code if the context window is under 75,000 tokens provided the context is consistent. You have to think of everything from a probability point of view and the more logical the context, the greater the chances of better code.

For example, if the frontend doesn't need to know the backend code (other than the interface) not including the backend code to solve a frontend one to solve a specific problem can reduce context size and improve the chances of expected output. You just need to ensure you include the necessary interface documenation.

As for the full fledged app, I think you raised a good point and I should add a 'No lock in' section for why to use it. The app has a message tool that lets you pick and choose what messages to copy. Once you've copied the context (including any conversation messages that can help the LLM), you can use the context where ever you want.

My strategy with the app is to be the first place you goto to start a conversation before you even generate code, so my focus is helping you construct contexts (the smaller the better) to feed into LLMs.

handfuloflight•8h ago

Doesn't Claude Code do all of this automatically?

sdesol•8h ago

I haven't looked at Claud Code, so I don't know if they have analyzers or not that understands how to extract any type of data other than specific coding data that it is trained on. Based on the runtime for some tasks, I would not be surprised if it is going through all the files and asking "is this relevant"

My tool is mainly targeted at massive code bases and enterprise as I still believe the most efficient way to build accurate context is by domain experts.

Right now, I would say 95% of my code is AI generated (98% human architectured) and I am spending about $2 a day on LLM costs and the code generation part usually never runs more than 30 seconds for most tasks.

handfuloflight•8h ago

Well you should look at it, because it's not going through all files. I looked at your product and the workflow is essentially asking me to do manually what Claude Code does auto. Granted, manually selecting the context will probably lead to lower costs in any case because Claude Code invokes tool calls like grep to do its search, so I do see merit in your product in that respect.

sdesol•7h ago

Looking at the code, it does have some sort of automatic discovery. I also don't know how scalable Claude Code is. I've spent over a decade thinking about code search, so I know what the limitations are for enterprise code.

One of the neat tricks that I've developed is, I would load all my backend code for my search component and then I would ask the LLM to trace a query and create a context bundle for only the files that are affected. Once the LLM has finished, I just need to do a few clicks to refine a 80,000 token size window down to about 20,000 tokens.

I would not be surprised if this is one of the tricks that it does as it is highly effective. Also, yes my tool is manual, but I treat conversations as durable asset so in the future, you should be able to say, last week I did this, load the same files and LLM will know what files to bring into context.

handfuloflight•7h ago

Excellent, I look forward to trying it out, at minimum to wean off dependency to Claude Code and it's likely current state of overspending on context. I agree with looking at conversations as durable assets.

sdesol•7h ago

> current state of overspending on context

The thing that is killing me when I hear about Claude Code and other agent tools is the amount of energy they must be using. People say they let the task run for an hour and I can't help but to think how much energy is being used and if Claude Code is being upfront with how much things will actually cost in the future.

pacoWebConsult•7h ago

FWIW Claude code conversations are also durable. You can resume any past conversation in your project. They're stored as jsonl files within your `$HOME/.claude` directory. This retains the actual context (including your prompts, assistant responses, tool usages, etc) from that conversation, not just the files you're affecting as context.

sdesol•7h ago

Thanks for the info. I actually want to make it easy for people to review aider, plandex, claude code, etc. conversations so I will probably look at importing them.

My goal isn't to replace the other tools, but to make them work smarter and more efficiently. I also think we will in a year or two, start measuring performance based on how developers interact with LLMs (so management will want to see the conversations). Instead of looking at code generated, the question is going to be, if this person is let go, what is the impact based on how they are contributing via their conversations.

ec109685•1h ago

It greps around the code like an intern would. You have to have patience and be willing to document workflows and correct when it gets things wrong via CLAUDE.md files.

sdesol•31m ago

Honestly, grepping isn't a bad strategy if there is enough context to generate focused keywords/patterns to search. The "let Claude Code think for 10 minutes or more", makes a lot more sense now, as this brute force method can take some time.

msikora•2h ago

Why not build this as an MCP so that people can plug it into their favorite platform?

sdesol•48m ago

An MCP is definitely on the roadmap. My objective is to become the context engine for LLMs so having a MCP is required. However, there will be things from a UX perspective that you'll lose out on if you just use the MCP.

seanmmward•8h ago

The primary use case isn't just about shoving more code in context, although depending on the task, there is an irredicible minimum context needed for it to capture all the needed understanding. The 1M context model is a unique beast in terms of how you need to feed it, and its real power is being able to tackle long horizon tasks which require iterative exploration, in context learning, and resynthesis. Ie, some problems are breadth (go fix an api change in 100 files), other however require depth (go learn from trying 15 different ways to solve this problem). 1M Sonnet is unique in its capabilities for the latter in particular.

hinkley•7h ago

Sounds to me like your problem has shifted from how much the AI tool costs per hour to how much it costs per token because resetting a model happens often enough that the price doesn't amortize out per hour. That giant spike every ?? months overshadows the average cost per day.

I wonder if this will become more universal, and if we won't see a 'tick-tock' pattern like Intel used, where they tweak the existing architecture one or more times between major design work. The 'tick' is about keeping you competitive and the 'tock' is about keeping you relevant.

TZubiri•5h ago

"However. Price is king. Allowing me to flood the context window with my code base is great"

I don't vibe code, but in general having to know all of the codebase to be able to do something is a smell, it's spagghetti, it's lack of encapsulation.

When I program I cannot think about the whole database, I have a couple of files open tops and I think about the code in those files.

This issue of having to understand the whole codebase, complaining about abstractions, microservices, and OOP, and wanting everything to be in a "simple" monorepo, or a monolith; is something that I see juniors do, almost exclusively.

ants_everywhere•5h ago

> I really desperately need LLMs to maintain extremely effective context

The context is in the repo. An LLM will never have the context you need to solve all problems. Large enough repos don't fit on a single machine.

There's a tradeoff just like in humans where getting a specific task done requires removing distractions. A context window that contains everything makes focus harder.

For a long time context windows were too small, and they probably still are. But they have to get better at understanding the repo by asking the right questions.

stuartjohnson12•5h ago

> An LLM will never have the context you need to solve all problems.

How often do you need more than 10 million tokens to answer your query?

ants_everywhere•4h ago

I exhaust the 1 million context windows on multiple models multiple times per day.

I haven't used the Llama 4 10 million context window so I don't know how it performs in practice compared to the major non-open-source offerings that have smaller context windows.

But there is an induced demand effect where as the context window increases it opens up more possibilities, and those possibilities can get bottlenecked on requiring an even bigger context window size.

For example, consider the idea of storing all Hollywood films on your computer. In the 1980s this was impossible. If you store them in DVD or Bluray quality you could probably do it in a few terabytes. If you store them in full quality you may be talking about petabytes.

We recently struggled to get a full file into a context window. Now a lot of people feel a bit like "just take the whole repo, it's only a few MB".

brulard•1h ago

I think you misunderstand how context in current LLMs works. To get the best results you have to be very careful to provide what is needed for immediate task progression, and postpone context thats needed later in the process. If you give all the context at once, you will likely get quite degraded output quality. Thats like if you want to give a junior developer his first task, you likely won't teach him every corner of your app. You would give him context he needs. It is similar with these models. Those that provided 1M or 2M of context (Gemini etc.) were getting less and less useful after cca 200k tokens in the context.

Maybe models would get better in picking up relevant information from large context, but AFAIK it is not the case today.

onion2k•4h ago

Large enough repos don't fit on a single machine.

I don't believe any human can understand a problem if they need to fit the entire problem blem domain in their head, and the scope of a domain that doesn't fit on a computer. You have to break it down into a manageable amount of information to tackle it in chunks.

If a person can do that, so can an LLM prompted to do that by a person.

sdesol•4h ago

> But they have to get better at understanding the repo by asking the right questions.

How I am tackling this problem is making it dead simple for users to create analyzers that are designed to enriched text data. You can read more about how it would be used in a search at https://github.com/gitsense/chat/blob/main/packages/chat/wid...

The basic idea is, users would construct analyzers with the help of LLMs to extract the proper metadata that can be semantically searched. So when the user does an AI Assisted search with my tool, I would load all the analyzers (description and schema) into the system prompt and the LLM can determine which analyzers can be used to answer the question.

A very simplistic analyzer would be to make it easy to identify backend and frontend code so you can just use the command `!ask find all frontend files` and the LLM will construct a deterministic search that knows to match for frontend files.

rootnod3•9h ago

So, more tokens means better but at the same time more tokens means it distracts itself too much along the way. So at the same time it is an improvement but also potentially detrimental. How are those things beneficial in any capacity? What was said last week? Embrace AI or leave?

All I see so far is: don't embrace and stay.

rootnod3•8h ago

So, I see this got downvoted. Instead of just downvoting, I would prefer to have a counter-argument. Honestly. I am on the skeptic side of LLM, but would not mind being turned to the other side with some solid arguments.

pupppet•9h ago

How does anyone send these models that much context without it tripping over itself? I can't get anywhere near that much before it starts losing track of instruction.

9wzYQbTYsAIc•9h ago

I’ve been having decent luck telling it to keep track of itself in a .plan file, not foolproof, of course, but it has some ability to “preserve context” between contexts.

Right now I’m experimenting with using separate .plan files for tracking key instructions across domains like architecture and feature decisions.

CharlesW•8h ago

> I’ve been having decent luck telling it to keep track of itself in a .plan file, not foolproof, of course, but it has some ability to “preserve context” between contexts.

This is the way. Not only have I had good luck with both a TASKS.md and TASKS-COMPLETE.md (for history), but I have an .llm/arch full of AI-assisted, for-LLM .md files (auth.md, data-access.md, etc.) that document architecture decisions made along the way. They're invaluable for effectively and efficiently crossing context chasms.

collinvandyck76•1h ago

Yeah, this. Each project I work on has it's own markdown file named for the ticket or the project. Committed on the branch, and I have claude rewrite it with the "current understanding" periodically. After compacting, I have it re-read the MD file and we get started again. Quite nice.

olddustytrail•8h ago

I think it's key to not give it contradictory instructions, which is an easy mistake to make if you forget where you started.

As an example, I know of an instance where the LLM claimed it had tried a test on its laptop. This obviously isn't true so the user argued with it. But they'd originally told it that it was a Senior Software Engineer so playing that role, saying you tested locally is fine.

As soon as you start arguing with those minor points you break the context; now it's both a Software Engineer and an LLM. Of course you get confused responses if you do that.

pupppet•8h ago

The problem I often have is I may have instruction like-

General instruction: - Do "ABC"

If condition == whatever: - Do "XYZ" instead

I have a hard time making the AI obey the instances I wish to override my own instruction and without having full control of the input context, I can't just modify my 'General Instruction' on a case by case basis to simply avoid having to contradict myself.

olddustytrail•4h ago

That's a difficult case where you might want to collect your good context and shift it to a different session.

It would be nice if the UI made that easy to do.

greenfish6•9h ago

Yes, but if you look in the rate limit notes, the rate limit is 500k tokens / minite for tier 4, which we are on. Given how stingy anthropic has been with rate limit increases, this is for very few people right now

alvis•9h ago

Context window after certain size doesn’t bring in much benefit but higher bill. If it still keeps forgetting instructions it would be just much easier to be ended up with long messages with higher context consumption and hence the bill

I’d rather having an option to limit the context size

EcommerceFlow•9h ago

It does if you're working with bigger codebases. I've found copy/pasting my entire codebase + adding a <task> works significantly better than cursor.

spiderice•7h ago

How does one even copy their entire codebase? Are you saying you attach all the files? Or you use some script to copy all the text to your clipboard? Or something else?

EcommerceFlow•4h ago

I created a script that outputs the entire codebase to a text file (also allows me to exclude files/folders/node_modules), separating and labeling each file in the program folder.

I then structure my prompts around like so:

<project_code> ``` ``` </project_code>

<heroku_errors> " " </heroku_errors>

I've been using this with Google Ai studio and it's worked phenomenally. 1 million tokens is A LOT of code, so I'd imagine this would work for lots n lots of project type programs.

swader999•2h ago

Repomix, there's a cli and an MCP.

andrewstuart•9h ago

Oh man finally. This has been such a HUGE advantage for Gemini.

Could we please have zip files too? ChatGPT and Gemini both unpack zip files via the chat window.

Now how about a button to download all files?

qsort•9h ago

I won't complain about a strict upgrade, but that's a pricy boi. Interesting to see differential pricing based on size of input, which is understandable given the O(n^2) nature of attention.

isoprophlex•9h ago

1M of input... at $6/1M input tokens. Better hope it can one-shot your answer.

elitan•7h ago

have you ever hired humans?

bicepjai•4h ago

Depends on which human you tried :) Donot underestimate yourself !

rafaelero•9h ago

god they keep raising prices

revskill•9h ago

The critical issue with LLM which never beats human: break what worked.

henriquegodoy•9h ago

Thats incredible to see how ai models are improving, i'm really happy with this news. (imo it's more impactful than the release of gpt5) now, we need more tokens per second, and then the self-improvement of the model will accelerate.

lherron•8h ago

Wow, I thought they would feel some pricing pressure from GPT5 API costs, but they are doubling down on their API being more expensive than everyone else.

sebzim4500•8h ago

I think it's the right approach, the cost of running these things as coding assistants is negligable compared to the benefit of even a slight model improvement.

AtNightWeCode•7h ago

GPT5 API uses more tokens for answers of the same quality as previous versions. Fell into that trap myself. I use both Claude and OpenAI right now. Will probably drop OpenAI since they are obviously not to be trusted considering the way they do changes.

shamano•8h ago

1M tokens is impressive, but the real gains will come from how we curate context—compact summaries, per-repo indexes, and phase resets. Bigger windows help; guardrails keep models focused and costs predictable.

jbellis•8h ago

Just completed a new benchmark that sheds some light on whether Anthropic's premium is worth it.

(Short answer: not unless your top priority is speed.)

https://brokk.ai/power-rankings

24xpossible•8h ago

Why no Grok 4?

Zorbanator•6h ago

You should be able to guess.

rcanepa•7h ago

I recently switched to the $200 CC subscription and I think I will stay with it for a while. I briefly tested whatever version of ChatGPT 5 comes with the free Cursor plan and it was unbearably slow. I could not really code with it as I was constantly getting distracted while waiting for a response. So, speed matters a lot for some people.

Someone1234•8h ago

Before this they supposedly had a longer context window than ChatGPT, but I have workloads that abuse the heck out of context windows (100-120K tokens). ChatGPT genuinely seems to have a 32K context window, in the sense that is legitimately remembers/can utilize everything within that window.

Claude previously had "200K" context windows, but during testing it wouldn't even hit a full 32K before hitting a wall/it forgetting earlier parts of the context. They also have extremely short prompt limits relative to the other services around, making it hard to utilize their supposedly larger context windows (which is suspicious).

I guess my point is that with Anthropic specifically, I don't trust their claims because that has been my personal experience. It would be nice if this "1M" context window now allows you to actually use 200K though, but it remains to be seen if it can even do that. As I said with Anthropic you need to verify everything they claim.

Etheryte•8h ago

Strong agree, Claude is very quick to forget things like "don't do this", "never do this" or things it tried that were wrong. It will happily keep looping even in very short conversations, completely defeating the purpose of using it. It's easy to game the numbers, but it falls apart in the real world.

joquarky•3h ago

I've found it better to use antonyms than negations most situations.

lvl155•8h ago

Only time this is useful is to do init on a sizable code base or dump a “big” csv.

film42•8h ago

The 1M token context was Gemini's headlining feature. Now, the only thing I'd like Claude to work on is tokens counted towards document processing. Gemini will often bill 1/10th the tokens Anthropic does for the same document.

varyherb•8h ago

I believe this can be configured in Claude Code via the following environment variable:

ANTHROPIC_BETAS="context-1m-2025-08-07" claude

falcor84•5h ago

Have you tested it? I see that this env var isn't specified in their docs

https://docs.anthropic.com/en/docs/claude-code/settings#envi...

bazhand•3h ago

    Add these settings to your `.claude/settings.json`:

     ```json
     {
       "env": {
         "ANTHROPIC_CUSTOM_HEADERS": {"anthropic-beta":
     "context-1m-2025-08-07"},
         "ANTHROPIC_MODEL": "claude-sonnet-4-20250514",
         "CLAUDE_CODE_MAX_OUTPUT_TOKENS": 8192
       }
     }
     ```

varyherb•35m ago

Yup! Claude Code has a lot of undocumented configuration. Once I saw the beta header value in their docs [1], I tried to see in their source code if there was anyway to specify this flag via env var config. Their source code is already on your computer, just gotta dig through the minified JS :) Try:

`cat $(which claude) | grep ANTHROPIC_BETAS`

Sibling comment's approach with the other (documented) env var works too.

[1] https://docs.anthropic.com/en/docs/build-with-claude/context...

gdudeman•8h ago

A tip for those who both use Claude Code and are worried about token use (which you should be if you're stuffing 400k tokens into context even if you're on 20x Max):

  1. Build context for the work you're doing. Put lots of your codebase into the context window.
  2. Do work, but at each logical stopping point hit double escape to rewind to the context-filled checkpoint. You do not spend those tokens to rewind to that point.
  3. Tell Claude your developer finished XYZ, have it read it into context and give high level and low level feedback (Claude will find more problems with your developer's work than with yours).

If you want to have multiple chats running, use /resume and pull up the same thread. Hit double escape to the point where Claude has rich context, but has not started down a specific rabbit hole.

rvnx•8h ago

Thank you for the tips, do you know how to rollback latest changes ? Trying very hard to do it, but seems like Git is the only way ?

gdudeman•7h ago

Git or my favorite "Undo all of those changes."

spike021•7h ago

this usually gets the job done for me as well

SparkyMcUnicorn•7h ago

I haven't used it, but saw this the other day: https://github.com/RonitSachdev/ccundo

rtuin•2h ago

Quick tip when working with Claude Code and Git: When you're happy with an intermediate result, stage the changes by running `git add` (no commit). That makes it possible to always go back to the staged changes when Claude messes up. You can then just discard the unstaged changes and don't have to roll back to the latest commit.

seperman•6h ago

Very interesting. Why does Claude find more problems if we mention the code is written by another developer?

bgilly•6h ago

In my experience, Claude will criticize others more than it will criticize itself. Seems similar to how LLMs in general tend to say yes to things or call anything a good idea by default.

I find it to be an entertaining reflection of the cultural nuances embedded into training data and reinforcement learning processes.

mcintyre1994•6h ago

Total guess, but maybe it breaks it out of the sycophancy that most models seem to exhibit?

I wonder if they’d also be better at things like telling you an idea is dumb if you tell it it’s from someone else and you’re just assessing it.

gdudeman•4h ago

Claude is very agreeable and is an eager helper.

It gives you the benefit of the doubt if you're coding.

It also gives you the benefit of the doubt if you're looking for feedback on your developers work. If you give it a hint of distrust "my developer says they completed this, can you check and make sure, give them feedback....?" Claude will look out for you.

sixothree•6h ago

I've been using Serena MCP to keep my context smaller. It seems to be working because claude uses it pretty much exclusively to search the codebase.

lucasfdacunha•5h ago

Could you elaborate a bit on how that works? Does it need any changes in how you use Claude?

sixothree•48m ago

No. I have three MCPs installed and this is the only one that doesn’t need guidance. You’ll see it using it for search and finding references and such. It’s a one line install and no work to maintain.

The advantage is that Claude won’t have to use the file system to find files. And it won’t have to go read files into context to find what it’s looking for. It can use its context for the parts of code that actually matter.

And I feel like my results have actually been much better with this.

yahoozoo•6h ago

I thought double escape just clears the text box?

gdudeman•4h ago

With an empty text box, double escape shows you a list of previous inputs from you. You can go back and fork at any one of those.

oars•5h ago

I tell Claude that it wrote XYZ in another session (I wrote it) then use that context to ask questions or make changes.

gdudeman•4h ago

I'll note this saves a lot of wait time as well! No sitting there while a new Claude builds context from scratch.

i_have_an_idea•4h ago

This sounds like the programmer equivalent of astrology.

> Build context for the work you're doing. Put lots of your codebase into the context window.

If you don’t say that, what do you think happens as the agent works on your codebase.

insane_dreamer•4h ago

I usually tell CC (or opencode, which I've been using recently) to look up the files and find the relevant code. So I'm not attaching a huge number of files to the context. But I don't actually know whether this saves tokens or not.

Wowfunhappy•3h ago

I do this all the time and it sometimes works, but it's not a silver bullet. Sometimes Claude benefits from having the full conversation.

FajitaNachos•29m ago

What's the benefit to using claude code CLI directly over something like Cursor?

nojs•9m ago

In my experience jumping back like this is risky unless you explicitly tell it you made changes, otherwise they will get clobbered because it will update files based on the old context.

Telling it to “re-read” xyz files before starting works though.

ZeroCool2u•8h ago

It's great they've finally caught up, but unfortunate it's on their mid-tier model only and it's laughably expensive.

thimabi•8h ago

Oh, well, ChatGPT is being left in the dust…

When done correctly, having one million tokens of context window is amazing for all sorts of tasks: understanding large codebases, summarizing books, finding information on many documents, etc.

Existing RAG solutions fill a void up to a point, but they lack the precision that large context windows offer.

I’m excited for this release and hope to see it soon on the UI as well.

OutOfHere•8h ago

Fwiw, OpenAI does have a decent active API model family of GPT-4.1 with a 1M context. But yes, the context of the GPT-5 models is terrible in comparison, and it's altogether atrocious for the GPT-5-Chat model.

The biggest issue in ChatGPT right now is a very inconsistent experience, presumably due to smaller models getting used even for paid users with complex questions.

kotaKat•8h ago

A million tokens? Damn, I’m gonna need a lot of quarters to play this game at Chuck-E-Cheese.

xnx•8h ago

1M context windows are not created equal. I doubt Claude's recall is as good as Gemini's 1M context recall. https://cloud.google.com/blog/products/ai-machine-learning/t...

xnx•8h ago

Good analysis here: https://news.ycombinator.com/item?id=44878999

> the model that’s best at details in long context text and code analysis is still Gemini.

> Gemini Pro and Flash, by comparison, are far cheaper

firasd•8h ago

A big problem with the chat apps (ChatGPT; Claude.ai) is the weird context window hijinks. Especially ChatGPT does wild stuff.. sudden truncation; summarization; reinjecting 'ghost snippets' etc

I was thinking this should be up to the user (do you want to continue this conversation with context rolling out of the window or start a new chat) but now I realized that this is inevitable given the way pricing tiers and limited computation works. Like the only way to have full context is use developer tools like Google AI Studio or use a chat app that wraps the API

With a custom chat app that wraps the API you can even inject the current timestamp into each message and just ask the LLM btw every 10 minutes just make a new row in a markdown table that summarizes every 10 min chunk

cruffle_duffle•6h ago

> btw every 10 minutes just make a new row in a markdown table that summarizes every 10 min chunk

Why make it time based instead of "message based"... like "every 10 messages, summarize to blah-blah.md"?

dev0p•6h ago

Probably it's more cost effective and less error prone to just dump the message log rather than actively rethink the context window, costing resources and potentially losing information in the process. As the models gets better, this might change.

firasd•5h ago

Sure. But you'd want to help out the LLM with a message count like this is message 40, this is message 41... so when it hits message 50 it's like ahh time for a new summary and call the memory_table function (cause it's executing the earlier standing order in your prompt)

tosh•8h ago

How did they do the 1M context window?

Same technique as Qwen? As Gemini?

deadbabe•8h ago

Unfortunately, larger context isn’t really the answer after a certain point. Small focused context is better, lazily throwing a bunch of tokens in as a context is going to yield bad results.

ramoz•8h ago

Awesome addition to a great model.

The best interface for long context reasoning has been AIStudio by Google. Exceptional experience.

I use Prompt Tower to create long context payloads.

simianwords•8h ago

How does "supporting 1M tokens" really work in practice? Is it a new model? Or did they just remove some hard coded constraint?

eldenring•7h ago

Serving a model efficiently at 1M context is difficult and could be much more expensive/numerically tricky. I'm guessing they were working on serving it properly, since its the same "model" in scores and such.

simianwords•7h ago

Thanks - still not clear what they did really. Some inference time hacks?

FergusArgyll•5h ago

That would imply the model always had a 1m token context but they limited it in the api and app? That's strange because they can just charge more for every token past 250k (like google does, I believe).

But if not shouldn't it have to be completely retrained model? it's clearly not that - good question!

Aeolun•1h ago

They already had 0.5M context window on the enteprise version.

nickphx•8h ago

Yay, more room for stray cats.

alienbaby•8h ago

The fracturing of all the models offered across providers is annoying. The number of different models and the fact a given model will have different capabilities from different providers is ridiculous.

chrisweekly•7h ago

Peer of this post currently also on HN front page, comparing perf for Claude vs Gemini, w/ 1M tokens: https://news.ycombinator.com/item?id=44878999

DiabloD3•7h ago

Neat. I do 1M tokens context locally, and do it entirely with a single GPU and FOSS software, and have access to a wide range of models of equivalent or better quality.

Explain to me, again, how Anthropic's flawed business model works?

codazoda•6h ago

Tell us more?

DiabloD3•5h ago

Nothing really to say, its just like everyone else's inference setups.

Select a model that produces good results, has anywhere from 256k to 1M context (ex: Qwen3-Coder can do 1M), is under one of the acceptable open weights licenses, and run it in llama.cpp.

llama.cpp can split layers between active and MoE, and only load the active ones into vram, leaving the rest of it available for context.

With Qwen3-Coder-30B-A3B, I can use Unsloth's Q4_K_M, consume a mere 784MB of VRAM with the active layers, then consume 27648MB (kv cache) + 3096MB (context) with the kv cache quantized to iq4_nl. This will fit onto a single card with 32GB of VRAM, or slightly spill over on 24GB.

Since I don't personally need that much, I'm not pouring entire projects into it (I know people do this, and more data does not produce better results), I bump it down to 512k context and fit it in 16.0GB, to avoid spill over on my 24GB card. In the event I do need the context, I am always free to enable it.

I do not see a meaningful performance difference between all on the card and MoE sent to RAM while active is on VRAM, its very much a worthwhile option for home inference.

Edit: For completeness sake, 256k context with this configuration is 8.3GB total VRAM, making _very_ budget good inference absolutely possible.

ffitch•7h ago

I wonder how modern models fair on NovelQA and FLenQA (benchmarks that test ability to understand long context beyond needle in a haystack retrieval). The only such test on a reasoning model that I found was done on o3-mini-high (https://arxiv.org/abs/2504.21318), it suggests that reasoning noticeably improves FLenQA performance, but this test only explored context up to 3,000 tokens.

dang•7h ago

Related ongoing thread:

Claude vs. Gemini: Testing on 1M Tokens of Context - https://news.ycombinator.com/item?id=44878999 - Aug 2025 (9 comments)

whalesalad•7h ago

My first thought was "gg no re" can't wait to see how this changes compaction requirements in claude code.

pmxi•7h ago

The reason I initially got interested in Claude was because they were the first to offer a 200K token context window. That was massive in 2023. However, they didn't keep up once Gemini offered a 1M token window last year.

I'm glad to see an attempt to return to having a competitive context window.

markb139•7h ago

I’ve tried 2 AI tools recently. Neither could produce the correct code to calculate the CPU temperature on a Raspberry Pi RP2040. The code worked, looked ok and even produced reasonable looking results - until I put a finger on the chip and thus raised the temp. The calculated temperature went down. As an aside the free version of chatGPT didn’t know about anything newer than 2023 so couldn’t tell me about the RP2350

anvuong•7h ago

How can you be sure putting the finger on the chip raise the temp? If you feel hot that means heat from the chip is being transferred to your finger, that may decrease the temp, no?

broshtush•7h ago

From my understanding putting your finger on an uncooled CPU acts like a passive cooler, thus actually decreasing temperature.

fwip•7h ago

I don't think a larger context window would help with that.

fpauser•4h ago

Best comment ;)

ghjv•7h ago

wouldn't your finger have acted as a heat sink, lowering the temp? sounds like the program may have worked correctly. could be worth trying again with a hot enough piece of metal instead of your finger

logicchains•7h ago

With that pricing I can't imagine why anyone would use Claude Sonnet through the API when Gemini 2.5 Pro is both better and cheaper (especially at long-context understanding).

CuriouslyC•3h ago

Claude is a good deal with the $20 subscription giving a fair amount of sonnet use with Code. It's also got a very distinct voice as far as LLMs go, and tends to produce cleaner/clearer writing in general. I wouldn't use the API in an application but the subscription feels like a pretty good deal.

siva7•7h ago

Ah, so claude code on subscription will become a crippled down version

joduplessis•7h ago

As far as coding goes Claude seems to be the most competent right now, I like it. GPT5 is abysmal - I'm not sure if they're bugs, or what, but the new release takes a good few steps back. Gemini still a hit and miss - and Grok seems to be a poor man's Claude (where code is kind of okay, a bit buggy and somehow similar to Claude).

brokegrammer•6h ago

Many people are confused about the usefulness of 1M tokens because LLMs often start to get confused after about 100k. But this is big for Claude 4 because it uses automatic RAG when the context becomes large. With optimized retrieval thanks to RAG, we'll be able to make good use of those 1M tokens.

m4r71n•5h ago

How does this work under the hood? Does it build an in-memory vector database of the input sources and runs queries on top of that data to supplement the context window?

Balgair•6h ago

Wow!

As a fiction writer/noodler this is amazing. I can put not just a whole book in as before, not just a whole series, but the entire corpus of authors in.

I mean, from the pov of biography writers, this is awesome too. Just dump it all in, right?

I'll have to switch using to Sonnet 4 now for workflows and edit my RAG code to be longer windows, a lot longer

irthomasthomas•6h ago

Brain: Hey, you going to sleep? Me: Yes. Brain: That 200,001st token cost you $600,000/M.

qwertox•6h ago

> desperately need LLMs to maintain extremely effective context

Last time I used Gemini it did something very surprising: instead of providing readable code, it started to generate pseudo-minified code.

Like on CSS class would become one long line of CSS, and one JS function became one long line of JS, with most of the variable names minified, while some remained readable, but short. It did away with all unnecessary spaces.

I was asking myself what is happening here, and my only explanation was that maybe Google started training Gemini on minified code, on making Gemini understand and generate it, in order to maximize the value of every token.

ericol•6h ago

"...in API"

That's a VERY relevant clarification. this DOESN'T apply to web or app users.

Basically, if you want a 1M context window you have to specifically pay for it.

sporkland•6h ago

Does anyone have data on how much better these 1M token context models produce better results than the more limited windows alongside certain RAG implementations? Or how much better in the face of RAG the 200k vs 1M token models perform on a benchmark?

poniko•6h ago

[Claude usage limit reached. Your limit will reset at..] .. eh lunch is a good time to go home anyways..

chmod775•6h ago

For some context, only the tweaks files and scripting parts of Cyberpunk 2077 are ~2 million LOC.

not_that_d•5h ago

My experience with the current tools so far:

1. It helps to get me going with new languages, frameworks, utilities or full green field stuff. After that I expend a lot of time parsing the code to understand what it wrote that I kind of "trust" it because it is too tedious but "it works".

2. When working with languages or frameworks that I know, I find it makes me unproductive, the amount of time I spend writing a good enough prompt with the correct context is almost the same or more that if I write the stuff myself and to be honest the solution that it gives me works for this specific case but looks like a junior code with pitfalls that are not that obvious unless you have the experience to know it.

I used it with Typescript, Kotlin, Java and C++, for different scenarios, like websites, ESPHome components (ESP32), backend APIs, node scripts etc.

Botton line: usefull for hobby projects, scripts and to prototypes, but for enterprise level code it is not there.

jeremywho•5h ago

My workflow is to use Claude desktop with the filesystem mcp server.

I give claude the full path to a couple of relevant files related to the task at hand, ie where the new code should hook into or where the current problem is.

Then I ask it to solve the task.

Claude will read the files, determine what should be done and it will edit/add relevant files. There's typically a couple of build errors I will paste back in and have it correct.

Current code patterns & style will be maintained in the new code. It's been quite impressive.

This has been with Typescript and C#.

I don't agree that what it has produced for me is hobby-grade only...

taberiand•5h ago

I've been using it the same way. One approach that's worked well for me is to start a project and first ask it to analyse and make a plan with phases for what needs to be done, save that plan into the project, then get it to do each phase in sequence. Once it completes a phase, have it review the code to confirm if the phase is complete. Each phase of work and review is a new chat.

This way helps ensure it works on manageable amounts of code at a time and doesn't overload its context, but also keeps the bigger picture and goal in sight.

mnky9800n•4h ago

I find that sometimes this works great and sometimes it happily tells you everything works and your code fails successfully and if you aren’t reading all the code you would never know. It’s kind of strange actually. I don’t have a good feeling when it will get everything correct and when it will fail and that’s what is disconcerting. I would be happy to be given advice on what to do to untangle when it’s good and when it’s not. I love chatting with Claude code about code. It’s annoying that it doesn’t always get it right and also doesn’t really interact with failure like a human would. At Least in my experience anyways.

taberiand•2h ago

Of course, everything needs to be verified - I'm just trying to figure out a process that enables it to work as effectively as it can on large code bases in a structured way. Committing each stage to git, fixing issues and adjusting the context still comes into play.

nwatson•4h ago

One can also integrate with, say, a running PyCharm with the Jetbrains IDE MCP server. Claude Desktop can then interact directly with PyCharm.

hamandcheese•4h ago

Any particular reason you prefer that over Claude code?

jeremywho•4h ago

I'm on windows. Claude Code via WSL hasn't been as smooth a ride.

JyB•2h ago

That's exactly how you should do it. You can also plug in an MCP for your CI or mention cli.github.com in your prompt to also make it iterate on CI failures.

Next you use claude code instead and you make several work on their own clone on their own workspace and branches in the background; So you can still iterate yourself on some other topic on your personal clone.

Then you check out its tab from time to time and optionally checkout its branch if you'd rather do some updates yourself. It's so ingrained in my day-to-day flow now it's been super impressive.

risyachka•5h ago

Pretty much my experience too.

I usually go to option 2 - just write it by myself as it is same time-wise but keeps skills sharp.

fpauser•5h ago

Not to degenerate is really challenging these days. There are the bubbles that simulate multiple realities to us and try to untrain us logic thinking. And there are the llms that try to convice us that self thinking is unproductive. I wonder when this digitalophily suddenly turns into digitalophobia.

flowerthoughts•5h ago

I predict microservices will get a huge push forward. The question then becomes if we're good enough at saying "Claude, this is too big now, you have to split it in two services" or not.

If LLMs maintain the code, the API boundary definitions/documentation and orchestration, it might be manageable.

fsloth•5h ago

Why microservices? Monoliths with code-golfed minimal implementation size (but high quality architecture) implemented in strongly typed language would consume far less tokens (and thus would be cheaper to maintain).

arwhatever•4h ago

Won’t this cause [insert LLM] to lose context around the semantics of messages passed between microservices?

You could then put all services in 1 repo, or point LLM at X number of folders containing source for all X services, but then it doesn’t seem like you’ll have gained anything, and at the cost of added network calls and more infra management.

urbandw311er•4h ago

Why not just cleanly separated code in a single execution environment? No need to actually run the services in separate execution environments just for the sake of an LLM being able to parse it, that’s crazy! You can just give it the files or folders it needs for the particular services within the project.

Obviously there’s still other reasons to create micro services if you wish, but this does not need to be another reason.

fpauser•5h ago

Same conclusion here. Also good for analyzing existing codebases and to generate documentation for undocumented projects.

j45•4h ago

It's quite good at this, I have been tying in Gemini Pro with this too.

johnisgood•5h ago

> but for enterprise level code it is not there

It is good for me in Go but I had to tell it what to write and how.

sdesol•4h ago

I've been able to create a very advanced search engine for my chat app that is more than enterprise ready. I've spent a decade thinking about search, but in a different language. Like you, I needed to explain what I knew about writing a search engine in Java for the LLM, to write it in JavaScript using libraries I did not know and it got me 95% of the way there.

It is also incredibly important to note that the 5% that I needed to figure out was the difference between throw away code and something useful. You absolutely need domain knowledge but LLMs are more than enterprise ready in my opinion.

Here is some documentation on how my search solution is used in my app to show that it is not a hobby feature.

https://github.com/gitsense/chat/blob/main/packages/chat/wid...

johnisgood•4h ago

Thanks for your reply, I am in the same boat, and it works for me, like it seems to work for you. So as long as we are effective with it, why not? Of course I am not doing things blindly and expect good results.

jiggawatts•5h ago

Something I’ve discovered is that it may be worthwhile writing the prompt anyway, even for a framework you’re an expert with. Sometimes the AIs will surprise me with a novel approach, but the real value is that the prompt makes for excellent documentation of the requirements! It’s a much better starting point for doc-comments or PR blurbs than after-the-fact ramblings.

viccis•4h ago

I agree. For me it's a modern version of that good ol "rails new" scaffolding with Ruby on Rails that got you started with a project structure. It makes sense because LLMs are particularly good at tasks that require little more knowledge than just a near perfect knowledge of the documentation of the tooling involved, and creating a well organized scaffold for a greenfield project falls squarely in that area.

For legacy systems, especially ones in which a lot of the things they do are because of requirements from external services (whether that's tech debt or just normal growing complexity in a large connected system), it's less useful.

And for tooling that moves fast and breaks things (looking at you, Databricks), it's basically worthless. People have already brought attention to the fact that it will only be as current as its training data was, and so if a bunch of terminology, features, and syntax have changed since then (ahem, Databricks), you would have to do some kind of prompt engineering with up to date docs for it to have any hope of succeeding.

pvorb•2h ago

I'm wondering what exact issue you are referring to with Databricks? I can't remember a time I had to change a line I wrote during the past 2.5 years I've been using it. Or are you talking about non-breaking changes?

alfalfasprout•4h ago

The bigger problem I'm seeing is engineers that become over reliant on vibe coding tools are starting to lose context on how systems are designed and work.

As a result, their productivity might go up on simple "ticket like tasks" where it's basically just simple implementation (find the file(s) to edit, modify it, test it) but when they start using it for all their tasks suddenly they don't know how anything works. Or worse, they let the LLM dictate and bad decisions are made.

These same people are also very dogmatic on the use of these tools. They refuse to just code when needed.

Don't get me wrong, this stuff has value. But I just hate seeing how it's made many engineers complacent and accelerated their ability to add to tech debt like never before.

mnky9800n•4h ago

Yea that’s right. It’s kind of annoying how useful it is for hobby projects and it is suddenly useless on anything at work. Haha. I love Claude code for some stuff (like generating a notebook to analyse some data). But it really just disconnects you from the problem you are solving without you going through everything it writes. And I’m really bullish on ai coding tools haha, for example:

https://open.substack.com/pub/mnky9800n/p/coding-agents-prov...

pqs•4h ago

I'm not a programmer, but I need to write python and bash programs to do my work. I also have a few websites and other personal projects. Claude Code helps me implement those little projects I've been wanting to do for a very long time, but I couldn't due to the lack of coding experience and time. Now I'm doing them. Also now I can improve my emacs environment, because I can create lisp functions with ease. For me, this is the perfect tool, because now I can do those little projects I couldn't do before, making my life easier.

zingar•3h ago

Big +1 to customizing emacs! Used to feel so out of reach, but now I basically rolled my own cursor.

chamomeal•3h ago

LLMs totally kick ass for making bash scripts

dboreham•3h ago

Strong agree. Bash is so annoying that there have been many scripts that I wanted to have, but just didn't write (did the thing manually instead) rather than go down the rabbit hole of Bash nonsense. LLMs turn this on its head. I probably have LLMs write 1-2 bash scripts a week now, that I commit to git for use now and later.

unshavedyak•2h ago

Similarly my Nix[OS] env had a ton of annoyances and updates needed that i didn't care to do. My first week of Claude saw tons of Nix improvements for my environment across my three machines (desk, server, macbook) and it's a much more rich environment.

Claude did great at Nix, something i struggled with due to lack of documentation. It was far from perfect, but it usually pointed me towards the answer that i could later refine with it. Felt magical.

int_19h•2h ago

Why not use a more sensible shell, e.g. Fish?

MangoCoffee•3h ago

At the end of the day, all tools are made to make their users' lives easier.

I use GitHub Copilot. I recently did a vibe code hobby project for a command line tool that can display my computer's IP, hard drive, hard drive space, CPU, etc. GPT 4.1 did coding and Claude did the bug fixing.

The code it wrote worked, and I even asked it to create a PowerShell script to build the project for release

dekhn•2h ago

For context I'm a principal software engineer who has worked in and out of machine learning for decades (along with a bunch of tech infra, high performance scientific computing, and a bunch of hobby projects).

In the few weeks since I've started using Gemini/ChatGPT/Claude, I've

1. had it read my undergrad thesis and the paper it's based on, implementing correct pytorch code for featurization and training, along wiht some aspects of the original paper that I didn't include in my thesis. I had been waiting until retirement until taking on this task.

2. had it write a bunch of different scripts for automating tasks (typically scripting a few cloud APIs) which I then ran, cleaning up a long backlog of activities I had been putting off.

3. had it write a yahtzee game and implement a decent "pick a good move" feature . It took a few tries but then it output a fully functional PyQt5 desktop app that played the game. It beat my top score of all time in the first few plays.

4. tried to convert the yahtzee game to an android app so my son and I could play. This has continually failed on every chat agent I've tried- typically getting stuck with gradle or the android SDK. This matches my own personal experience with android.

5. had it write python and web-based g-code senders that allowed me to replace some tools I didn't like (UGS). Adding real-time vis of the toolpath and objects wasn't that hard either. Took about 10 minutes and it cleaned up a number of issues I saw with my own previous implementations (multithreading). It was stunning how quickly it can create fully capable web applications using javascript and external libraries.

6. had it implement a gcode toolpath generator for basic operations. At first I asked it to write Rust code, which turned out to be an issue (mainly because the opencascade bindings are incomplete), it generated mostly functional code but left it to me to implement the core algorithm. I asked it to switch to C++ and it spit out the correct code the first time. I spent more time getting cmake working on my system than I did writing the prompt and waiting for the code.

7. had it Write a script to extract subtitles from a movie, translate them into my language, and re-mux them back into the video. I was able to watch the movie less than an hour after having the idea- and most of that time was just customizing my prompt to get several refinements.

8. had it write a fully functional chemistry structure variational autoencoder that trains faster and more accurate than any I previously implemented.

9. various other scientific/imaging/photography related codes, like impleemnting multi-camera rectification, so I can view obscured objects head-on from two angled cameras.

With a few caveats (Android projects, Rust-based toolpath generation), I have been absolutely blown away with how effective the tools are (especially used in a agent which has terminal and file read/write capabilities). It's like having a mini-renaissance in my garage, unblocking things that would have taken me a while, or been so frustrating I'd give up.

I've also found that AI summaries in google search are often good enough that I don't click on links to pages (wikipedia, papers, tutorials etc). The more experience I get, the more limitations I see, but many of those limitations are simply due to the extraordinary level of unnecessary complexity required to do nearly anything on a modern computer (see my comments about about Android apps & gradle).

stpedgwdgfhgdd•4h ago

For enterprise software development CC is definitely there. 100k Go code paas platform, micro services architecture, mono repo is manageable.

The prompt needs to be good, but in plan mode it will iteratively figure it out.

You need to have automated tests. For enterprise software development that actually goes without saying.

dclowd9901•3h ago

It also steps right over easy optimizations. I was doing a query on some github data (tedious work) and rather than preliminarily filter down using the graphql search method, it wanted to comb through all PRs individually. This seems like something it probably should have figured out.

amelius•3h ago

It is very useful for small tasks like fixing network problems, or writing regexp patterns based on a few examples.

MarcelOlsz•3h ago

Here's how YOU can save $200/mo!

brulard•3h ago

For me it was like this for like a year (using Cline + Sonnet & Gemini) until Claude Code came out and until I learned how to keep context real clean. The key breakthrough was treating AI as an architect/implementer rather than a code generator.

Most recently I ask first CC to create a design document for what we are going to do. He has instructions to look into the relevant parts of the code and docs to reference them. I review it and few back-and-forths we have defined what we want to do. Next step is to chunk it into stages and even those to smaller steps. All this may take few hours, but after this is well defined, I clear the context. I then let him read the docs and implement one stage. This goes mostly well and if it doesn't I either try to steer him to correct it, or if it's too bad, I improve the docs and start this stage over. After stage is complete, we commit, clear context and proceed to next stage.

This way I spend maybe a day creating a feature that would take me maybe 2-3. And at the end we have a document, unit tests, storybook pages, and features that gets overlooked like accessibility, aria-things, etc.

At the very end I like another model to make a code review.

Even if this didn't make me faster now, I would consider it future-proofing myself as a software engineer as these tools are improving quickly

imiric•3h ago

This is a common workflow that most advanced users are familiar with.

Yet even following it to a T, and being really careful with how you manage context, the LLM will still hallucinate, generate non-working code, steer you into wrong directions and dead ends, and just waste your time in most scenarios. There's no magical workflow or workaround for avoiding this. These issues are inherent to the technology, and have been since its inception. The tools have certainly gotten more capable, and the ecosystem has matured greatly in the last couple of years, but these issues remain unsolved. The idea that people who experience them are not using the tools correctly is insulting.

I'm not saying that the current generation of this tech isn't useful. I've found it very useful for the same scenarios GP mentioned. But the above issues prevent me from relying on it for anything more sophisticated than that.

brulard•1h ago

> These issues are inherent to the technology

That's simply false. Even if LLMs don't produce correct and valid code on first shot 100% times of the cases, if you use an agent, it's simply a matter of iterations. I have claude code connected to Playwright, context7 for docs and to Playwright, so it can iterate by itself if there are syntax errors, runtime errors or problems with the data on the backend side. Currently I have near zero cases when it does not produce valid working code. If it is incorrect in some aspect, it is then not that hard to steer it to better solution or to fix yourself.

And even if it failed in implementing most of these stages of the plan, it's not all wasted time. I brainstormed ideas, formed the requirements, specifications to features and have clear documentation and plan of the implementation, unit tests, etc. and I can use it to code it myself. So even in the worst case scenario my development workflow is improved.

mathiaspoint•58m ago

It definitely isn't. LLMs often end up stuck in weird corners they just don't get and need someone familiar with the theory of what they're working on to unstick them. If the agent is the same model as the code generator it won't be able to on its own.

john-tells-all•2h ago

I've seen this referred to as Chain of Thought. I've used it with great success a few times.

https://martinfowler.com/articles/2023-chatgpt-xu-hao.html

aatd86•2h ago

For me it's the opposite. As long as I ask for small tasks, or error checking, it can help. But I'd rather think of the overall design myself because I tend to figure out corner cases or superlinear complexities much better. I develop better mental models than the NNs. That's somewhat of a relief.

Also the longer the conversation goes, the less effective it gets. (saturated context window?)

brulard•1h ago

I don't think thats the opposite. I have an idea what I want and to some extent how I want it to be done. The design document starts with a brainstorming where I throw all my ideas at the agent and we iterate together.

> Also the longer the conversation goes, the less effective it gets. (saturated context window?)

Yes, this is exactly why I said the breakthrough came for me when I learned how to keep the context clean. That means multiple times in the process I ask the model to put the relevant parts of our discussion into an MD document, I may review and edit it and I reset the context with /clear. Then I have him read just the relevant things from MD docs and we continue.

ramshanker•1h ago

Same here. A small variation: I explicitly use website to manage what context it gets to see.

brulard•1h ago

What do you mean by website? An HTML doc?

drums8787•3h ago

My experience is the opposite I guess. I am having a great time using claude to quickly implement little "filler features" that require a good amount of typing and pulling from/editing different sources. Nothing that requires much brainpower beyond remembering the details of some sub system, finding the right files, and typing.

Once the code is written, review, test and done. And on to more fun things.

Maybe what has made it work is that these tasks have all fit comfortably within existing code patterns.

My next step is to break down bigger & more complex changes into claude friendly bites to save me more grunt work.

unlikelytomato•2h ago

I wish I shared this experience. There are virtually no filter features for me to work on. When things feel like filler on my team, it's generally a sign of tech debt and we wouldn't want to have it generate all the code it would take. What are some examples of filler features for you?

On the other hand, it does cost me about 8 hours a week debugging issues created by bad autocompletes from my team. The last 6 months have gotten really bad with that. But that is a different issue.

apimade•2h ago

Many who say LLMs produce “enterprise-grade” code haven’t worked in mid-tier or traditional companies, where projects are held together by duct tape, requirements are outdated, and testing barely exists. In those environments, enterprise-ready code is rare even without AI.

For developers deeply familiar with a codebase they’ve worked on for years, LLMs can be a game-changer. But in most other cases, they’re best for brainstorming, creating small tests, or prototyping. When mid-level or junior developers lean heavily on them, the output may look useful.. until a third-party review reveals security flaws, performance issues, and built-in legacy debt.

That might be fine for quick fixes or internal tooling, but it’s a poor fit for enterprise.

bityard•2h ago

I work in the enterprise, although not as a programmer, but I get to see how the sausage is made. And describing code as "enterprise grade" would not be a compliment in my book. Very analogous to "contractor grade" when describing home furnishings.

Aeolun•2h ago

Umm, Claude Code is a lot better than a lot of enterprise grade code I see. And it actually learns from mistakes with a properly crafted instruction xD

therealpygon•2h ago

I mostly agree, with the caveat that I would say it can certainly be useful when used appropriately as an “assistant”. NOT vibe coding blindly and hoping what I end up with is useful. “Implement x specific thing” (e.g. add an edit button to component x), not “implement a whole new epic feature that includes changes to a significant number of files”. Imagine meeting a house builder and saying “I want a house”, then leaving and expecting to come back to exactly the house you dreamed of.

I get why, it’s a test of just how intuitive the model can be at planning and execution which drives innovation more than 1% differences in benchmarks ever will. I encourage that innovation in the hobby arena or when dogfooding your AI engineer. But as a replacement developer in an enterprise where an uncaught mistake could cost millions? No way. I wouldn’t even want to be the manager of the AI engineering team, when they come looking for the only real person to blame for the mistake not being caught.

For additional checks/tasks as a completely extra set of eyes, building internal tools, and for scripts? Sure. It’s incredibly useful with all sorts of non- application development tasks. I’ve not written a batch or bash script in forever…you just don’t really need to much anymore. The linear flow of most batch/bash/scripts (like you mentioned) couldn’t be a more suitable domain.

Also, with a basic prompt, it can be an incredibly useful rubber duck. For example, I’ll say something like “how do you think I should solve x problem”(with tools for the codebase and such, of course), and then over time having rejected and been adversarial to every suggestion, I end up working through the problem and have a more concrete mental design. Think “over-eager junior know-it-all that tries to be right constantly” without the person attached and you get a better idea of what kind of LLM output you can expect including following false leads to test your ideas. For me it’s less about wanting a plan from the LLM, and more about talking through the problems I think my plan could solve better, when more things are considered outside the LLMs direct knowledge or access.

“We can’t do that, changing X would break Y external process because Z. Summarize that concern into a paragraph to be added to the knowledge base. Then, what other options would you suggest?”

tonyhart7•2h ago

it depends on model but sonnet is more than capable for enterprise code

when you stuck at claude doing dumb shit, you didnt give the model enough context to know better the system

after following spec driven development, works with LLM in large code base make it so much easier than without it like its heaven and hell differences

but also it increase in token cost exponentially, so there's that

hoppp•2h ago

I used it with Tyopescript and Go, SQL, Rust

Using it with rust is just horrible imho. Lots and lots of errors, I cant wait to stop this rust project already. But the project itself is quite complex

Go on the other hand is super productive, mainly because the language is already very simple. I can move 2x fast

Typescript is fine, I use it for react components and it will do animations Im lazy to do...

SQL and postgresql is fine, I can do it without it also, I just dont like to write stored functions cuz of the boilerplatey syntax, a little speed up saves me from carpal tunnel

epolanski•35m ago

I really find your experience strikingly different than mine, I'll share you my flow:

- step A: ask AI to write a featureA-requirements.md file at the root of the project, I give it a general description for the task, then have it ask me as many questions as possible to refine user stories and requirements. It generally comes up with a dozen or more of questions, of which multiples I would've not thought about and found out much later. Time: between 5 and 40 minutes. It's very detailed.

- step B: after we refine the requirements (functional and non functional) we write together a todo plan as featureA-todo.md. I refine the plan again, this is generally shorter than the requirements and I'm generally done in less than 10 minutes.

- step C: implementation phase. Again the AI does most of the job, I correct it at each edit and point flaws. Are there cases where I would've done that faster? Maybe. I can still jump in the editor and do the changes I want. This step in general includes comprehensive tests for all the requirements and edge cases we have found in step A, both functional, integration and E2Es. This really varies but it is generally highly tied to the quality of phase A and B. It can be as little as few minutes (especially true when we indeed come up with the most effective plan) and as much as few hours.

- step D: documentation and PR description. With all of this context (in requirements and todos) at this point updating any relevant documentation and writing the PR description is a very short experiment.

In all of that: I have textual files with precise coding style guidelines, comprehensive readmes to give precise context, etc that get referenced in the context.

Bottom line: you might be doing something profoundly wrong, because in my case, all of this planning, requirements gathering, testing, documenting etc is pushing me to deliver a much higher quality engineering work.

TZubiri•5h ago

Remember kids, just because you CAN doesn't mean you SHOULD

mrcwinn•5h ago

This tells me they've gotten very good at caching and modeling the impact of caching.

fpauser•5h ago

O observed that claude produces a lot of bloat. Wonder how such llm generated projects age.

cadamsdotcom•5h ago

I’m glad to see the only company chasing margins - which they get by having a great product and a meticulous brand - finding even more ways to get margin. That’s good business.

howinator•5h ago

I could be wrong, but I think this pricing is the first to admit that cost scales quadratically with number of tokens. It’s the first time I’ve seen nonlinear pricing from an LLM provider which implicitly mirrors the inference scaling laws I think we're all aware of.

jpau•4h ago

Google[1] also has a "long context" pricing structure. OpenAI may be considering offering similar since they do not offer their priority processing SLAs[2] for context >128K.

[1] https://cloud.google.com/vertex-ai/generative-ai/pricing

[2] https://openai.com/api-priority-processing/

reverseblade2•5h ago

Does this cover subscription?

anonym29•3h ago

API only for now, but at the very bottom of the post: "We're also exploring how to bring long context to other Claude products."

So, not yet, but maybe someday?

_joel•4h ago

Fantastic, use up your quota even more quickly. :)

phyzix5761•4h ago

What I've found with LLMs is they're basically a better version of Google Search. If I need a quick "How do I do..." or if I need to find a quick answer to something its way more useful than Google and the fact that I can ask follow up questions is amazing. But for any serious deep work it has a long way to go.

mr_moon•3h ago

I feel exactly the same way. why skim and sift 15 different stackoverflow posts when an LLM can pick out exactly the info I need?

I don't need to spin up an entire feature in a few seconds. I need help understanding where something is broken; what are some opinions o best practice; or finding out what a poorly written snippet is doing.

context still v important for this though and I appreciate cranking that capacity. "read 15000 stackoverflow posts for me please"

anvuong•3h ago

The action of sifting through through poop to find gold actually positively develops my critical thinking skill. I, too, went through a phase of just asking LLM for a specific concept instead of Googling it and weave through dozens of wiki pages or niche mailing list discussions. It did improve my productivity but I feel like it dulls my brain. So recently I have to tone that down and force myself to go back to the old way. Maybe too much of a good thing is bad.

Whatarethese•3h ago

This is my primary use of AI. Looking for a new mountain bike and using AI to list and compare parts of the bike and which is best for my use case scenario. Works pretty well so far.

meander_water•4h ago

I like to spend a lot of time in "Ask" mode in Cursor. I guess the equivalent in Claude code is "plan" mode.

Where I have minimal knowledge about the framework or language, I ask a lot of questions about how the implementation would work, what the tradeoffs are etc. This is to minimize any misunderstanding between me and the tool. Then I ask it to write the implementation plan, and execute it one by one.

Cursor lets you have multiple tabs open so I'll have a Ask mode and Agent mode running in parallel.

This is a lot slower, and if it was a language/framework I'm familiar with I'm more likely to execute the plan myself.

itissid•3h ago

My experience with Claude code beyond building anything bigger than a webpage, a small API, a tutorial on CSS etc has been pretty bad. I think context length is a manageable problem, but not the main one. I used it to write a 50K LoC python code base with 300 unit tests and it went ok for the first few weeks and then it failed. This is after there is a CLAUDE.md file for every single module that needs it as well as detailed agents for testing, design, coding and review.

I won't going into a case by case list of its failures, The core of the issue is misaligned incentives, which I want to get into:

1. The incentives for coding agent, in general and claude, are writing LOTS of code. None of them — O — are good at the planning and verification.

2. The involvement of the human, ironically, in a haphazard way in the agent's process. And this has to do with how the problem of coding for these agents is defined. Human developers are like snow flakes when it comes to opinions on software design, there is no way to apply each's preference(except paper machet and superglue SO, Reddit threads and books) to the design of the system in any meaningful way and that makes a simple system way too complex or it makes a complex problem simplistic.

  - There is no way to evolve the plan to accept new preferences except text in CLAUDE.md file in git that you will have to read through and edit.

  - There is no way to know the near term effect of code choices now on 1 week from now. 

  - So much code is written that asking a person to review it in case you are at the envelope and pushing the limit feels morally wrong and an insane ask. How many of your Code reviews are instead replaced by 15-30 min design meetings to instead solicit feedback on design of the PR — because it so complex — and just push the PR into dev? WTF am I even doing I wonder.

  - It does not know how far to explore for better rewards and does not know it better from local rewards, Resulting in commented out tests and deleting arbitrary code, to make its plan "work".

In short code is a commodity for CEOs of Coding agent companies and CXOs of your company to use(sales force has everyone coding, but that just raises the floor and its a good thing, it does NOT lower the bar and make people 10x devs). All of them have bought into this idea that 10x is somehow producing 10x code. Your time reviewing and unmangling and mainitaining the code is not the commodity. It never ever was.

lpa22•3h ago

One of the most helpful usages of CC so far is when I simply ask:

"Are there any bugs in the current diff"

It analyzes the changes very thoroughly, often finds very subtle bugs that would cost hours of time/deployments down the line, and points out a bunch of things to think through for correctness.

swyx•2h ago

maybe want to reify that as a claude code hook!

bertil•2h ago

That matches my experience with non-coding tasks: it’s not very creative, but it’s a comprehensive critical reader.

neucoas•1h ago

I am trying this tomorrow

lpa22•47m ago

Let me know how it goes. It’s a game changer

KTibow•1h ago

I'm surprised that works even without telling it to think/think hard/think harder/ultrathink.

i_have_an_idea•2h ago

While this is cool, can anything be done about the speed of inference?

At least for my use, 200K context is fine, but I’d like to see a lot faster task completion. I feel like more people would be OK with the smaller context if the agent acts quickly (vs waiting 2-3 mins per prompt).

maxnevermind•2h ago

Does very large context significantly increase a response time? Are there any benchmarks/leader-boards estimating different models in that regard?

hoppp•2h ago

So I can upload 1M tokens per prompt but pay $3 per 1M input tokens?

Its really expensive to use.

Aeolun•1h ago

Only the first time. After that it’s 0.3 per 1M input tokens (cached).

psyclobe•2h ago

Isn’t Opus better? Whenever I run out of Opus tokens and get kicked down to Sonnet it’s quite a shock sometimes.

But man I’m at the perfect stage in my career for these tools. I know a lot, I understand a lot, I have a lot of great ideas-but I’m getting kinda tired of hammering out code all day long. Now with Claude I am just busting ass executing in all these ideas and tests and fixes-never going back!

Aeolun•1h ago

Haha, I think I recognize that. I’m just worried my actual skills will athrophy while I use Claude Code like I’m a manager on steroids.

stavros•1h ago

By definition the thing that atrophies is the thing you never need to use.

as367•2h ago

That is an unfortunate logo.

tomsanbear•1h ago

I just want a better way to invalidate old context... It's great that I can fit more context, but the main challenge is claude getting sidetracked with 10 invalid grep calls, pytest dumping a 10k token stack trace etc.... And yes the ability to go back in time via esc+esc is great but I want claude to read the error stack learn from it and purge from its context or at least let me lobotomize ot selectively... Learning and discarding the raw output from tool calls feels like the missing piece here still.

aledalgrande•52m ago

I hope that they are going to put something in Claude Code to display if you're entering the expensive window. Sometime I just keep the conversation going. I wouldn't want that to burn my Max credits 2x faster.

terminalshort•44m ago

Yeah, that 1 MM tokens is a $15 (IIRC) API call. That's gonna add up quick! My favorite hypothetical AI failure scenario is that LLM agents eventually achieve human level general intelligence, but have to burn so many tokens to do it that they actually become more expensive than a human.

socrateslee•50m ago

It's like double "double the dose"

forgingahead•29m ago

Wish it was on the web app as well!

williamtrask•9m ago

Claude is down.

EDIT: for the moment... it supports 0 tokens of context xD

nojs•7m ago

Currently the quality seems to degrade long before the context limit is reached, as the context becomes “polluted”.

Should we expect the higher limit to also increase the practical context size proportionally?

Claude Sonnet 4 now supports 1M tokens of context

Search all text in New York City

Scapegoating the Algorithm

Ashet Home Computer

Show HN: Building a web search engine from scratch with 3B neural embeddings

Journaling using Nix, Vim and coreutils

A gentle introduction to anchor positioning

Training language models to be warm and empathetic makes them less reliable

Show HN: Omnara – Run Claude Code from anywhere

Multimodal WFH setup: flight SIM, EE lab, and music studio in 60sqft/5.5M²

Blender is Native on Windows 11 on Arm

AI Eroded Doctors' Ability to Spot Cancer Within Months in Study

The Missing Protocol: Let Me Know

WHY2025: How to become your own ISP [video]

Launch HN: Design Arena (YC S25) – Head-to-head AI benchmark for aesthetics

LLMs aren't world models

Go 1.25 Release Notes

Why are there so many rationalist cults?

The Equality Delete Problem in Apache Iceberg

RISC-V single-board computer for less than 40 euros

Debian GNU/Hurd 2025 released

Visualizing quaternions, an explorable video series

Dumb to managed switch conversion (2010)

Weave (YC W25) is hiring a founding AI engineer

Fixing a loud PSU fan without dying

Galileo’s telescopes: Seeing is believing (2010)

Nexus: An Open-Source AI Router for Governance, Control and Observability

Australian court finds Apple, Google guilty of being anticompetitive

How to safely escape JSON inside HTML SCRIPT elements

Comparing baseball greats across eras, who comes out on top?

Claude Sonnet 4 now supports 1M tokens of context

Search all text in New York City

Scapegoating the Algorithm

Ashet Home Computer

Show HN: Building a web search engine from scratch with 3B neural embeddings

Journaling using Nix, Vim and coreutils

A gentle introduction to anchor positioning

Training language models to be warm and empathetic makes them less reliable

Show HN: Omnara – Run Claude Code from anywhere

Multimodal WFH setup: flight SIM, EE lab, and music studio in 60sqft/5.5M²

Blender is Native on Windows 11 on Arm

AI Eroded Doctors' Ability to Spot Cancer Within Months in Study

The Missing Protocol: Let Me Know

WHY2025: How to become your own ISP [video]

Launch HN: Design Arena (YC S25) – Head-to-head AI benchmark for aesthetics

LLMs aren't world models

Go 1.25 Release Notes

Why are there so many rationalist cults?

The Equality Delete Problem in Apache Iceberg

RISC-V single-board computer for less than 40 euros

Debian GNU/Hurd 2025 released

Visualizing quaternions, an explorable video series

Dumb to managed switch conversion (2010)

Weave (YC W25) is hiring a founding AI engineer

Fixing a loud PSU fan without dying

Galileo’s telescopes: Seeing is believing (2010)

Nexus: An Open-Source AI Router for Governance, Control and Observability

Australian court finds Apple, Google guilty of being anticompetitive

How to safely escape JSON inside HTML SCRIPT elements

Comparing baseball greats across eras, who comes out on top?

Claude Sonnet 4 now supports 1M tokens of context

Comments