Fair transactions involve fair and transparent measurements of goods exchanged. I'm going to cancel my subscription this month.
Running non-deterministic software for deterministic tasks is still an area where efficiency can improve.
> The March 6 change makes Claude Code cheaper, not more expensive.

1h TTL for every request could cost more, not less.

Feels very AI.

> Restore 1h as the default / expose as configurable?

1h everywhere would increase total cost given the request mix, so we're not planning a global toggle.
They won't show a toggle because it will increase costs for some unknown percentage of requests?
There must be a better way to do this. The core issue for the consumer is the pricing difference. If they made cache writes the same price as regular writes, that would solve the whole problem. If you really want to push it, apply that pricing only to requests where the number of cache hits is > 0 (to avoid people setting this flag without intending to use it), and you've solved the whole issue.
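Roughly, that proposal could look like the sketch below (all per-million-token prices are illustrative placeholders, not Anthropic's actual rate card):

# Sketch of the proposed rule: bill cache writes at the plain input rate,
# but only on requests that actually hit the cache at least once.
# All prices are illustrative placeholders, in dollars per million tokens.
INPUT_RATE = 3.00
CACHE_WRITE_RATE = 3.75   # what cache writes might cost today
CACHE_READ_RATE = 0.30
OUTPUT_RATE = 15.00

def request_cost(input_tok, cache_write_tok, cache_read_tok, output_tok):
    # If the request benefited from the cache, price writes like plain input.
    write_rate = INPUT_RATE if cache_read_tok > 0 else CACHE_WRITE_RATE
    per_m = lambda tokens, rate: tokens / 1_000_000 * rate
    return (per_m(input_tok, INPUT_RATE)
            + per_m(cache_write_tok, write_rate)
            + per_m(cache_read_tok, CACHE_READ_RATE)
            + per_m(output_tok, OUTPUT_RATE))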
And if you can't stomach OpenAI, GLM 5.1 is actually quite competent. About Opus 4.5 / GPT 5.2 quality.
Anthropic sells you 'knowledge' in the form of 'tokens', and you spend money rolling the dice, spinning the roulette wheel and inserting coins for another try. Then they add limits and dumb down the models (their gambling machines), and you keep paying for the wrong answers.
Once you hit your limit or Anthropic changes the usage limits, they don't care and halt your usage for a while.
If you don't like any of that, just save your money and use local LLMs instead.
Especially when it's on purpose.
Any highlights you can share here? I'm always looking to improve my setup.
> As the Codex promotion on Plus winds down today
I’ve moved away from Claude and toward open-source models plus a ChatGPT subscription.
That setup has worked really well for me: the subscription is generous, the API is flexible, and it fits nicely into my workflow. GPT-5.4 + Swival (https://swival.dev) are now my daily drivers.
Either you are using it wrong or you are working in a totally different field.
https://docs.github.com/en/copilot/concepts/billing/copilot-...
This clearly isn't true for agentic mode though. This document is extremely misleading. VSCode has the `chat.agent.maxRequests` option which lets you define how many requests an agent can use before it asks if you want to continue iterating, and the default is not one. A long running session (say, implementing an openspec proposal) can easily eat through dozens of requests. I have a prompt that I use for security scanning and with a single input/request (`/prompt`) it will use anywhere between 17 and 25 premium requests without any user input.
The overall context windows are smaller with Copilot I believe, but it doesn't appear to be hurting my work.
I'm using it for approx 4 hours a day most days. Generally one-shotting fun ideas I thoroughly plan out in planning mode first, and I have my own version of the idea -> plan -> analyse -> document implementation phases -> implement-via-agent loop. Simulations, games, stuff I'm curious about, and resurrecting old projects that never really got off the ground.
Now a single question consistently uses around 15% of my quota
Once people can no longer think for themselves and businesses expect the level of productivity witnessed before, we will have no choice but to cough up whatever the providers bill us.
Online advertising is now ubiquitous, terrible, and mandatory for anyone who wants to do e-commerce. You can't run a mass-market online business without buying Adwords, Instagram Ads, etc.
AI will be ubiquitous, and then it will get worse and more expensive. But we will be unable to return to the prior status quo.
It occurred to me that an outright rejection of these tools is brewing but can't quite materialise yet.
Is that bad? After all, even if they hiked the price to infinity, you wouldn't be worse off than if AI didn't exist, because you could still code by hand. Moreover, if it's really in a "business" (employment?) context, the tools should be provided by your employer, not least for compliance/security reasons. The "expectation" angle doesn't make sense either. If it's actually more efficient than coding by hand, people will eventually adopt it, word will get around, and expectations will rise irrespective of whether you used it or not.
My argument was not about AI. Rather about the practice of Anthropic and the likes.
Demand is higher than supply; it is just the start of the bubble.
Everyone and their dog is burning tokens on stupid shit that would be freed up if they instead asked for deterministic code for the task and then ran that code. OpenAI and Anthropic are cutting free use and decreasing limits because they are not able to meet the demand.
When the general public catches up with how to really use it, demand will fall, and the supply being built today will become oversupply. That's when the bubble will burst.
I say 5 more years.
There's this honeymoon period with Claude you experience for a month or two followed by a trough of disillusionment, and then a rebound after a model update (rinse and repeat). It doesn't help that Anthropic is experiencing a vicious compute famine atm.
Add to that the fact that we are being taken for fools with dramatic announcements and FOMO messaging. I even suspect some reaction farming is going on to boost posts from people boasting about Claude models.
These don't happen for codex. Nor for mistral. Nor for deepseek. It can't just be that Claude code is so much better.
There are open weight models that work perfectly fine for most cases, at a fraction of the cost. Why are more people not talking about those? Manipulation.
I often compare with Gemini. Sure, those Google servers are super fast, but I can't see it being better. Qwen and DeepSeek simply work better for me.
Haven't tested Mistral in a while, you may be right.
People try the U.S. models and feel comfortable using them (I can see the logic), but mostly it's brand recognition. Anthropic and OpenAI are the best, aren't they? When the models jam, users blame themselves.
It’s further frustrating that I have committed to certain project deadlines knowing that I’d be able to complete it in X amount of time with agent tooling. That agentic tooling is no longer viable and I’m scrambling to readjust expectations and how much I can commit to.
However, his response gaslights us, because the math in the OP's opening post demonstrates this is not true: it shows reads 26x higher, so at least in his case the cache is not doing what the Anthropic employee describes.
Clearly we are being charged for less optimization here, and the message being given (from my perspective, by Anthropic) is that if you are in a special situation, your needs don't matter and we will close your thread without really listening.
It's also in the interest of the users to keep certain params private; we are meant to deduce that. Did you not?
Are there any other $50B+ Valuation companies that care about special situations? If so, who?
During core US business hours, I have to actively keep a session going or I risk a massive jump in usage while the entire thread rebuilds. During weekends or off-hours, I never see the crazy jumps in usage - even if I let threads sit stale.
How is this normal?
But the opacity itself is a bit offensive to me. It feels shady somehow.
The thing is, if it's going to be this expensive, it's not going to be worth it for me. Then I'd rather do it myself. I'm never going to pay for a €100 subscription; that's insane. It's more than my monthly energy bill.
Maybe from a business standpoint it still makes sense because you can use it to make money, but as a consumer no way.
For those not in the Google Gemini/Antigravity sphere, over the last month or so that community has been experiencing nothing short of contempt from Google when attempting to address an apparent bait and switch on quota expectations for their pro and ultra customers (myself included). [1]
While I continue to pay for my Google Pro subscription, probably out of some Stockholm Syndrome, beaten-wife-level loyalty and false hope that it is just a bug and not Google being Google and self-immolating a good product, I have since moved to Kiro for my IDE and Codex for my CLI and am as happy as a clam with this new setup.
[1] https://github.com/google-gemini/gemini-cli/issues/24937
Huh?
The reddit summary comment makes no sense. How are they getting revenues without ads or paying customers?
"After" makes more sense.
FTA:
>The company has yet to show a profit and is searching for ways to make money to cover its high computing costs and infrastructure plans.
I'm dying to see an S-1 filing from Anthropic or OpenAI. I don't actually think inference is as cheap as people say if you consider the total cost (hardware, energy, capex, etc.).
IMO they need as many users as possible before their IPO - then the changes will really begin.
There are a lot of angles you can take from that as a starting point, and I'm not confident that I fully understand it, so I'll leave it to the reader.
The parent's argument is that the marginal cost of inference is minimal. However, the fundamental flaw is that he's separating inference from the high cost of training frontier models. It's a cross-subsidy that can't be ignored.
You also can't put ads in code completion AIs, because the instant you do, their utility to me at work drops to negative. Guess how much money companies are going to pay for negative-value AIs? Let's just say it won't exactly pay for the AI bubble. If a code agent puts an ad for, well, anything into code that gets served out to a customer, someone's going to sue. The merits of the case won't matter, nor will the fact that the customer "should have caught it in review"; the lawsuit and the public reputation hit (how many people here are reading this and salivating at the thought of being able to post an angrygram about AIs being nothing but ad machines?) still cost way too much for the AI companies creating the agents to risk.
Valuations have already reached the point where these companies can run their own nuclear power stations, fund development of new hardware and techniques, and boost the capabilities of their models by 10x.
However, I've found that the flash quota is much more generous. I have been building a trio drive FOC system for the STM32G474 and basically prompting my way through the process. I have yet to be able to run completely out of flash quota in a given five hour time window. It is definitely completing the work a lot faster than I could do myself -- mainly due to its patience with trying different things to get to the bottom of problems. It's not perfect but it's pretty good. You do often have to pop back in and clean up debris left from debugging or attempts that went nowhere, or prompt the AI to do so, but that's a lot easier than figuring things out in the first place as long as you keep up with it.
I say this as someone who was really skeptical of AI coding until fairly recently. A friend gave me a tutorial last weekend, basically pointing out that you need to instruct the AI to test everything. Getting hardware-in-loop unit tests up and running was a big turning point for productivity on this project. I also self-wired a bunch of the peripherals on my dev board so that the unit tests could pretend to be connected to real external devices.
I think it helps a lot that I've been programming for the last twenty years, so I can sometimes jump in when it looks like the AI is spinning its wheels. But anyway, that's my experience. I'm just using flash and plan mode for everything and not running out of the $20/mo quota, probably getting things done 3x as fast as I could if I were writing everything myself.
Can confirm, I initially enjoyed the 5-hour limits on Gemini CLI and Antigravity so much that I paid for a full year, thinking it was a great decision
In the following months, they significantly cut the 5-hour limits (not sure if they even exist anymore), introduced the unrealistically bad weekly limit that I can fully consume in 1-2 hours, introduced the monthly AI credits system, and added ads to upgrade to Ultra everywhere.
At the very least the Gemini mobile app / web app is still kinda useful for project planning and day-to-day use I guess. They also bumped the storage from 2TB to 5TB, but I don't even use that
Looks like enshittification on steroids, honestly.
I ended up buying the $100 Codex plan. So far it has been much more generous with usage and more accurate than Claude for the kind of work I do.
That said, Codex has its own issues. Its personality can be a bit off-putting for my taste. I had to add extra instructions in Agents.md just to make it less snarky. I was annoyed enough that I explicitly told it not to use the word “canonical.”
On UI/UX taste, I still think current Codex is behind the Jan/Feb era of Claude Code. Claude used to have much better finesse there. But for backend logic, hard debugging, and complex problem-solving, Codex has been clearly better for me. These days I use Impeccable Skillset inside Codex to compensate for the weaker UI taste, but it still does not quite match the polish and instinct Claude Code used to have.
I used to be a huge Claude Code advocate. At this point, I cannot recommend it in good conscience.
My advice now is simple: try the $20 plans for Codex and Cursor, and see which one matches your workflow and vibes best
Give it a custom sandbox and context for the work, so it has no opportunity to roam around when not required. AI agentic coding is hugely wasteful of context and tokens in general (compared to generic chat, which is how most people use AI), there's a whole lot of scope for improvement there.
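A minimal sketch of that idea, assuming your agent is a CLI you can launch in an arbitrary directory (the `claude` command stands in for whatever agent you use); only the paths you name get copied into a throwaway workspace, so there is nothing else for it to roam through:

import shutil, subprocess, tempfile
from pathlib import Path

def run_agent_in_sandbox(repo: Path, relevant_paths: list[str], agent_cmd=("claude",)):
    # Copy only the paths relevant to the task into a scratch workspace.
    scratch = Path(tempfile.mkdtemp(prefix="agent-sandbox-"))
    for rel in relevant_paths:
        src, dst = repo / rel, scratch / rel
        if src.is_dir():
            shutil.copytree(src, dst)
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
    # Launch the agent with the scratch directory as its working directory.
    subprocess.run(list(agent_cmd), cwd=scratch)
    return scratch  # diff against the repo and copy changes back yourself

# Hypothetical usage: give the agent only the billing module and its tests.
# run_agent_in_sandbox(Path.home() / "code/myapp", ["src/billing", "tests/billing"])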
It does seem like a cynical attempt to make more money.
My experience is limited only to CC, Gemini-cli, and Codex - not Aider yet, trying different combinations of different models.
But, from my experience, CC puts everything else to shame.
How does Cursor compare? Has anyone found an Aider combination that works as well?
It was pretty much the first CLI agent and had the benchmark that was the go-to at the start of LLM coding. Now the benchmark doesn't get updated, and Aider never got a mention in discussions of CLI tools until now.
My best guess is this is the result of the companies running "experiments" to test changes. Or it's just all in my head :)
It's not under load either; it's just fully downgraded. It feels more like they're dialing in what they can get away with, but they're pushing it very far.
Still, in comparison with Claude Code, the quota of Codex is a much better deal. However, they should not make it worse...
At the same time, they’ve been giving out a ton of additional quota resets seemingly every other week (and committed to an additional reset for every million additional users until they hit 10mil on codex).
So they’ve really set a high bar for people’s expectations on their quota limits.
Once they drop the 2x promotion for good and stop the frequent resets, there are going to be a lot of complaints.
This is what I'm working on proving now.
It is more that there is a confidence score while thinking. Opus will quit if it is too high and will grind on if the confidence score is close to the real answer. Haiku handles this well too.
If you give Sonnet a hard task, it won't quit when it should.
Nonetheless, that issue has been fixed with Opus.
I'll try to show that using Opus on tasks of medium to hard difficulty is consistently the same price or cheaper than running them with Haiku and Sonnet, while easier tasks, the known busy work, are cheaper to run with Haiku.
When will people realize this is the same as vendor lock-in?
"Maybe if I spend more money on the max plan it will be better" > no it will be the same "Maybe if I change my prompt it will work" > no it will be the same "Maybe if I try it via this API instead of that API it will improve" > no it will be the same.
Claude, ChatGPT, Gemini etc all of these SOTA models are carefully trained, with platforms carefully designed to get you to pay more for "better" output, or try different things instead of using a different product.
It's to keep you in the ecosystem and keep you exploring. There is a reason you can't see the layers upon layers of scaffolding they have. And there's a reason why, two weeks after a major update, the model is suddenly "bad" and "frustrating". It's the same reason it's done with A/B testing: when you complain, someone else has no issues, and when they complain, you have no issues. It muddies the water intentionally.
None of it is because you're doing anything wrong; it's not a skill issue. It's a careful strategy to extract as much engagement and money from customers as possible. It's the same reason they give people who buy new gun skins in Call of Duty easier matchmaking for their first couple of games.
The only mistake you made was paying MORE, hoping it would get better. It won't, that's not what makes them money. Making people angry and making people waste their time, while others have no issues, and making them explore and try different things for longer so they can show to investors how long people use these AI tools is what makes them money.
When competitors have a better product, these issues go away. When a new model is released, these issues don't exist.
I was paying a ton of money for Claude. Once I stopped and cancelled my subscription entirely, suddenly Sonnet 4.6 is performing like Opus and I don't have prompts using 10% of my quota in one message despite being the same complexity.
OpenCode is great though, and can (for now) use an OpenAI subscription.
Codex consumes way fewer resources and is much snappier.
What I did instead is tune the prompt for gemma 4 26b and a 3090. Worked like a charm. Sometimes you have to run the main prompt and then a refinement prompt or split the processing into cases but it’s doable.
Now I’m waiting for anyone to put up some competition against NVIDIA so I can finally be able to afford a workstation GPU for a price less than a new kidney.
I guess this is fitting when the person who submitted the issue is in "AI | Crypto".
Well, there's no crying at the casino when you exhaust your usage or token limit.
The house (Anthropic) always wins.
I strongly believe Google's legs will allow it to sustain this influx of compute and still not do the rug-pull that OAI or Anthropic will be forced to do as more people come onboard the code-gen use case.
I'm curious what are people doing that is consuming your limits? I can't imagine filling the $200 a month plan unless I was essentially using Claude code itself as the api to mass process stuff? For basic coding what are people doing?
I suspect I was getting rate limited very aggressively on Thursday last week. It honestly infuriated me, because I'm paying $200 a month for this thing. If it's going to rate limit me, at least tell me what it's doing instead of just making it seem like it's taking 12 hours to run through something that I would expect to be 15 minutes. The worst part is that it never even finished it.
My gut feeling is this is not enough money for them by far (not to mention their investors), and we'll eventually get ratcheted up in line with dev salaries. E.g. "look how many devs you didn't have to hire", etc.
Think Twitter's fail-whale problems. Sometimes you are lucky, sometimes you aren't. Why? We won't know until Anthropic figures it out and from the outside it sure looks like they're struggling.
As of now, I'm consistently hitting my 5 hour limit in less than 1 hour during North American business hours. I'm getting to the point where I basically can't use CC for work unless I work very early or late in the day.
Either they decimated the limits internally, or they broke something.
Tried all the third-party tricks (headroom, etc.), switched to 200k context window, switched back to 4.5.
I hope 4.5 will help, but the rest of the efforts didn’t move the needle much
No FOMO
For context, with Google AI Pro, I can burn through the Antigravity weekly limit in 1-2 hours if I force it to use Gemini 3.1 Pro. Meanwhile Gemini 3 Flash is basically unlimited but frequently produces buggy code or fails to implement things how I personally would (it feels like it doesn't "think" like a software dev).
I also tried VS Code + Cline + OpenRouter + MiniMax M2.7. It's quite cheap and seems to be better than Gemini 3 Flash, but it gets really pricy as the context fills up because prompt caching is not supported for MiniMax on OpenRouter. The result itself usually needs 3-6 revisions on average so the context fills up pretty often
Eventually I got Claude Max 5x to try for a month. VS Code + Claude Code extension on a ~15k lines codebase, model set to "Default", and effort set to "Max". So far it's been really good: 0-2 revisions on average, and most of the time it implements things exactly how I would or better. And, like I said, I can only consume 40-60% of the 5-hour limits no matter how hard I try
Granted, I'm not forcing it to use Opus like OP (nor do I use complicated skills or launch multiple tasks at the same time), but I feel like they really nailed the right balance of when to use which model and how to pass context between them. Or at least enough that I haven't felt the need to force it to use Opus all the time.
https://www.reddit.com/r/ClaudeAI/comments/1s4idaq/update_on...
It’s been unusable for me as my daily coding agent. I run out of credits in the pro account in an hour or so. Before that I had never reached the session limit. Switched back to Junie with Gemini/chatgpt.
Anthropic is not incentivized to reduce token use, only to increase it, which is what we are seeing with Opus 4.6, and now they are putting the screws on.
I am getting bored of having to plan my weekends around quota limit reset times...
To try things out you can use llama.cpp with Vulkan or even CPU and a small model like Gemma 4 26B-A4B or Gemma 4 31B or Qwen 3.5 35-A3B or Qwen3.5 27B. Some of the smaller quants fit within 16GB of GPU memory. The default people usually go with now is Q4_K_XL, a 4-bit quant for decent performance and size.
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF
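If you prefer driving it from code rather than the llama.cpp CLI, the llama-cpp-python bindings load the same GGUF files; a minimal sketch (the model filename is an assumption; use whichever quant you actually downloaded):

from llama_cpp import Llama  # pip install llama-cpp-python

# Assumed local filename: the Q4_K_XL quant of one of the models linked above.
llm = Llama(
    model_path="gemma-4-26B-A4B-it-Q4_K_XL.gguf",
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload everything that fits to the GPU; use 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])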
Get a second hand 3090/4090 or buy a new Intel Arc Pro B70. Use MoE models and offload to RAM for best bang for your buck. For speed try to find a model that fits entirely within VRAM. If you want to use multiple GPUs you might want to switch to vLLM or something else.
You can try any of the following models:
High-end: GLM 5.1, MiniMax 2.7
Medium: Gemma 4, Qwen 3.5
https://unsloth.ai/docs/models/minimax-m27
https://unsloth.ai/docs/models/glm-5.1
https://unsloth.ai/docs/models/gemma-4
This is by design, of course. Anyone who has been paying even the slightest bit of attention knows these subscriptions are not sustainable, and the prices will have to go up over time. Quietly reducing the usage limits that they were never specific about in the first place is much easier than raising the prices of the individual subscription tiers, with the same effect.
If you want to know what kind of prices you'll be paying to fuel your vibe coding addiction in a few years, try out API pricing for a bit, and try not to cry when your $100 credit is gone in 2 days.
With that said, I pay for the Pro subscription ($20/mo) and I've hit limits maybe 2-3 times over a period of 4 months building a simple running app in Python. I'd not call it production ready, but it's not nothing either.
If people were considerably more willing to aggressively prune their context and scope tasks well, they could get a lot more done with it, at least in my experience. Anthropic can’t really fix anything because the underlying model architecture can’t be “patched”. But I definitely feel a lot of people can’t wrap their heads around the new paradigms needed to effectively prompt these models.
Additionally, opting out is always an option… but these types of issues feel more like laziness than real, structural issues with the model/harness…
No they can't. When I buy an annual subscription and prepay for the year, they can't just go "ok now you get one token a month" a day in. I bought the plan as I bought it. They can't change anything until the next renewal.
a) quotas will get restricted
b) the subscription plan prices will go up
c) all LLMs will become good enough at coding tasks
I just open sourced a coding agent https://github.com/dirac-run/dirac
The entire goal is to be token efficient (over 50% cheaper) and, by extension, take advantage of LLMs' better reasoning at shorter context lengths.
This really started as an internal side project that made me more productive, I hope it will help others too. Apache 2.0
Currently it still can't compete with the subsidized coding plan rates when using Anthropic API pricing (even though it beats CC when both use an API key), which tells me that all subscription plan operators are losing money on such plans.
In theory the /stats command tells you how many tokens you've used, which you could use to compute how much you are getting for your subscription. In practice it doesn't contain any useful info; it may be counting only what is printed to the terminal or something. My stats suggest my Claude Code usage is a tiny number of tokens, but that must be an extremely underestimated count, or they are charging much more per token on the subscription than on the API (which is not supposed to be the case).
Last week's free extra usage quota shed some light on this. It seems like the reported tokens are probably between 1/30th and 1/100th of the actual tokens billed, judging from how they billed (/stats went up 10k tokens and I was billed $7.10). With the API it should be $25 for a million tokens.
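Back-of-the-envelope check on those numbers (a sketch, using only the figures quoted above):

# Figures from the comment above; the $25/M rate is the commenter's number, not a quoted price list.
reported_tokens = 10_000      # what /stats showed
billed_dollars = 7.10         # what the extra-quota accounting charged
api_rate_per_million = 25.0   # assumed $ per million tokens

implied_tokens = billed_dollars / api_rate_per_million * 1_000_000   # 284,000 tokens
ratio = implied_tokens / reported_tokens                             # ~28x
print(f"{implied_tokens:,.0f} implied tokens, ~{ratio:.0f}x what /stats reported")

That lands right around the 1/30th end of the estimate.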
For general queries and investigation I will use whatever public/free model is available without being logged in. Not having a bunch of prior state stacked up all the time is a feature for me. This is essentially my google replacement.
For very specific technical work against code files, I use prepaid OAI tokens in VS Code Copilot as a "custom" model (it's just GPT-5.4).
I burn through maybe $30 worth of tokens per month with this approach. A big advantage of prepaying for the API tokens is that I can look at everything copilot is doing in my usage logs. If I use the precanned coding agent products, the prompts are all hidden in another layer of black box.
Here’s what I’ve done to mostly fix my usage issues:
* Turn on max thinking in every session. It saves tokens overall because I'm not correcting it or having it waste energy on dead paths.
* Keep active sessions active. It seems like caches are expiring after ~5 minutes (especially during peak usage). When the caches expire, it seems like all tokens need to be rebuilt, and this gets especially bad as token usage goes up (rough math in the sketch after this list).
* Compact after 200k tokens as soon as I reasonably can. I have no data, but my usage absolutely skyrockets as I get into longer sessions. This is the most frustrating part because Anthropic forced the 1M model on everyone.
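Rough math on why an expired cache hurts, with illustrative prices (placeholders, not Anthropic's published rates): re-reading a 200k-token prefix from cache is cheap, while re-writing it after the cache expires costs an order of magnitude more.

# Illustrative per-million-token prices (placeholders, not a real rate card).
CACHE_READ = 0.30    # $/M when the prefix is still cached
CACHE_WRITE = 3.75   # $/M when it has to be rebuilt after the cache expired

context_tokens = 200_000
warm_turn = context_tokens / 1e6 * CACHE_READ      # ~$0.06 per turn
cold_rebuild = context_tokens / 1e6 * CACHE_WRITE  # ~$0.75 to rebuild the prefix
print(f"warm turn ≈ ${warm_turn:.2f}, cold rebuild ≈ ${cold_rebuild:.2f} ({cold_rebuild / warm_turn:.1f}x)")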
At least up until recently the 1M model was separated into /model opus[1M]
They also silently raised how much usage input tokens consume, so it's a double whammy.
This is definitely true. Ever since I realized there is an /effort max option I am no longer fighting it that much and wasting hours.
Vibes, indeed.
> * Keep active sessions active. It seems like caches are expiring after ~5 minutes (especially during peak usage). When the caches expire, it seems like all tokens need to be rebuilt, and this gets especially bad as token usage goes up.
Is this as opaque on their end as it sounds, or is there a way to check?
Good chance it's not real or misdiagnosed. But it gives me some degree of schadenfreude to see it happening to the Claude Code repo.
Probably a combination of it being vibe coded shit and something in the backend I expect.
They inflated how much their tools burn tokens from day one, pretty much. Remember all the stupid research and reports Claude always wanted to do, no matter what you asked it? Other tools are much smarter, so this is not such a big deal.
More importantly, these moves tend to reverberate through the industry, so I expect others will clamp down on usage a lot, and this will spoil my joy of using AI without counting every token.
Burning tokens doesn't just waste your allotment, it also wastes your time. This gave rise to the turbo offering where you get responses faster but burn 2x your tokens.
Taking a second opinion has significantly helped me design the system better, and it helped me uncover my own and Claude's blind spots.
Also, I agree that it spends and wastes a lot of tokens on web search and often gets stuck in a loop.
Going forward, I will always use all 3 of them. My main coding agent is still Claude for now, but I'm happy to see this field evolving so fast, and it's easy to switch and use the others on the same project.
No network effects or lock-in for the customer. It's great to live in this period of time.
It does seem like this new routing is worse for the consumer in terms of code quality and token usage somehow.
It is hard now to hit the limit...
Cache reads cost $0.31
Cache writes cost $105
Input tokens cost $0.04
Output tokens cost $28.75
The total spent in the session is $134.10, while the Pro Max 5x subscription is $100.
Even taking Anthropic's API pricing, we arrive at $80.58. Below the subscription price, but not by much.
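The line items above do add up; a quick sanity check:

# Line items from the session above, in dollars.
cache_reads, cache_writes, input_tokens, output_tokens = 0.31, 105.00, 0.04, 28.75
total = cache_reads + cache_writes + input_tokens + output_tokens
print(round(total, 2))                      # 134.1
print(round(cache_writes / total * 100))    # cache writes are ~78% of the session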
It's just the end of the free tokens, nothing to see here. It's easy to feel like you're doing "moderate" or even "light" usage because you use so few input tokens, but those "agentic workflows" are simply not viable financially.
We're generating all of the code for swamp[1] with AI. We review all of that generated code with AI (this is done with the Anthropic API). Every part of our SDLC is pure AI + compute. Many feature requests every day. Bug fixes, etc.
Never hit the quota once. Something weird is definitely going on.
But for people who go > 5 minutes between prompts and get no cache hits, usage is eaten up quickly. Especially when passing in hundreds of thousands of tokens of conversation history.
I know my quota goes a lot further when I sit down and keep sessions active, and much less far when I'm distracted and let it sit for 10+ minutes between queries.
It’s a guess. But n=1 and possible confirmation bias noted, it’s what I’m seeing.
What it does for you is simple: if you want to automate something, it does. Load the AI harness of your choice, tell it what to automate, and swamp builds extensions for whatever it needs to accomplish your task.
It keeps a perfect memory of everything that was done, manages secrets through vaults (which are themselves extensions it can write) and leaves behind repeatable workflows. People have built all sorts of shit - full vm lifecycle management, homelab setups, manage infrastructure in aws and azure.
What's also interesting is the way we're building it. I gave a brief description in my initial comment.
The sociotechnical stuff with System Initiative was made by your CEO? The guy who is really into music? And I don't even know how long that product was a thing before the pivot. Not long!
"effortLevel": "high",
"autoUpdatesChannel": "stable",
"minimumVersion": "2.1.34",
"env": {
"DISABLE_AUTOUPDATER": 1,
"CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": 1
}
I also had to:
1. Nuke all other versions within ~/.local/share/claude/versions/ except 2.1.34.
2. Link ~/.local/bin/claude -> ~/.local/share/claude/versions/2.1.34.
This seems to have fixed my problem of running out of quota quickly. I have periods of intense use (nights, weekends) and no use (day job). Before these changes, I was running out of quota rather quickly. I am on the same $100 plan.
I am not sure the adaptive thinking setting is relevant for this version, but in the future it will help once they fix all the quota & cache issues. Seriously thinking about switching to Codex though. Gemini is far behind from what I have tried so far.
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
export MAX_THINKING_TOKENS=31999
export DISABLE_AUTOUPDATER=1
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
Please unsubscribe from these services and see how they perform.
"Maybe if I spend more money on the max plan it will be better" > no it will be the same "Maybe if I change my prompt it will work" > no it will be the same "Maybe if I try it via this API instead of that API it will improve" > no it will be the same.
Claude, ChatGPT, Gemini etc all of these SOTA models are carefully trained, with platforms carefully designed to get you to pay more for "better" output, or try different things instead of using a different product.
It's to keep you in the ecosystem and keep you exploring. There is a reason you can't see the layers upon layers of scaffolding they have. And there's a reason why after 2 weeks post major update, the model is suddenly "bad" and "frustrating". It's the same reason its done with A/B testing, so when you complain, someone else has no issues, when they complain, you have no issues. It muddies the water intentionally.
None of it is because you're doing anything wrong, it's not a skill issue, it's a careful strategy to extract as much engagement and money from customers as possible. It's the same reason they give people who buy new gun skins in call of duty easier matches in matchmaking for the first couple games.
Stop paying more, stop buying these pro max plans, hoping it will get better. It won't, that's not what makes them money. Making people angry and making people waste their time, while others have no issues, and making them explore and try different things for longer so they can show to investors how long people use these AI tools is what makes them money.
When competitors have a better product these issues go away When a new model is released these issues don't exist
I was paying a ton of money for claude, once I stopped and cancelled my subscription entirely, suddenly sonnet 4.6 is performing like opus and I don't have prompts using 10% of my quota in one message despite being the same complexity.
I am tired of all the astroturf articles meant to blame the user with “tips” for using fewer tokens. I never had to (still don’t) think of this with Codex, and there has been a massive, obvious decline between Claude 1 month ago and Claude today.
What I wish for right now is for open-weight models and hardware companies (looking at you Apple) to make it possible to run local models with Opus 4.6-level intelligence.
@Anthropic I've cancelled my subscription. Good luck :)
For something I spend all my time using- I’d rather iterate with Claude. The personality makes a big difference to me.
Honestly when I get codex to review the work that Claude does (my own or my coworker's) it consistently finds terrible terrible bugs, usually missing error handling / negative conditions, or full on race conditions in critical paths.
I don't trust code written by Claude in a production environment.
All AI code needs review by human, and often by other AIs, but Opus 4.6 is the worst. It's way too "yeet"
The opus models are for building prototypes, not production software.
GPT 5.4 in codex is also way more efficient with tokens or budget. I can get a lot more done with it.
I don't like giving money to sama, but I hate bugs even more.
On the flip side: using Opus with a Baby Billy Freeman persona has never been more entertaining.
Since then, I've been seeing increased critique of Anthropic in particular (several front page posts on HN, especially in the past few days), either due to it being nerfed or just straight up eating up usage quota (which matches my personal experience). It appears that we're once again getting hit by enshittification of sorts.
Nowadays I rely a lot on LLMs on a daily basis for architecture and writing code, but I'm so glad that the majority of my experience came from the pre-AI era.
If you use these tools, make sure you don't let them atrophy your software engineering "muscles". I'm positive that in the long run LLMs are here to stay. The jump in what you can now self-host, or run on consumer hardware, is huge year after year. But if your abilities rely on one vendor, what happens if you come to work one day and find out you're locked out of your Swiss Army knife and you can no longer outsource thinking?
We've been investigating these reports, and a few of the top issues we've found are:
1. Prompt cache misses when using 1M token context window are expensive. Since Claude Code uses a 1 hour prompt cache window for the main agent, if you leave your computer for over an hour then continue a stale session, it's often a full cache miss. To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session), and are investigating defaulting to 400k context instead, with an option to configure your context window to up to 1M if preferred. To experiment with this now, try: CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 claude.
2. People pulling in a large number of skills, or running many agents or background automations, which sometimes happens when using a large number of plugins. This was the case for a surprisingly large number of users, and we are actively working on (a) improving the UX to make these cases more visible to users and (b) more intelligently truncating, pruning, and scheduling non-main tasks to avoid surprise token usage.
In the process, we ruled out a large number of hypotheses: adaptive thinking, other kinds of harness regressions, model and inference regressions.
We are continuing to investigate and prioritize this. The most actionable thing for people running into this is to run /feedback, and optionally post the feedback ids either here or in the Github issue. That makes it possible for us to debug specific reports.
I have yet to see Anthropic do the same. Sorry, but this whole thing seems to be quite on purpose.
I use Claude Code about 8hrs every work day extensively, and have yet to see any issues.
It really does seem like PEBKAC.
With a new version of Claude Code pretty much every day, constant changes to their usage rules (2x outside of peak hours, temporarily 2x for a few weeks, ...), hidden usage decisions (past 256k it looks like your usage consumes your limits faster) and model degradation (Opus 4.6 is now worse than Opus 4.5, as many have reported), I fail to see how it can be a user error.
The only user error I see here is still trusting Anthropic to be on the good side tbh.
If you need to hear it from someone else: https://www.youtube.com/watch?v=stZr6U_7S90
This is false. My guess is what is happening is #1 above, where restarting a stale session causes a 256k cache miss.
That said, I hear the frustration. We are actively working on improving rate limit predictability and visibility into token usage.
For example, I don't pull in tons of third-party skills, preferring to have a small list of ones I write and update myself, but it's not at all obvious to me that pulling in a big list of third-party skills (like I know a lot of people do with superpowers, gstack, etc...) would cause quota or cache miss issues, and if that's causing problems, I'd call that more of a UX footgun than user error. Same with the 1M context window being a heavily-touted feature that's apparently not something you want to actually take advantage of...
No! It’s the children who are wrong!
EDIT: prompt caching behavior -did- change! 1hr -> 5min on March 6th. I'm not sure how starting a fresh session fixes it, as it's just rebuilding everything. Why even make this available?
This is not accurate. The main agent typically uses a 1h cache (except for API customers, which can enable 1h but it is not on by default because it costs more). Sub-agents typically use a 5m cache.
It is a horrible error of judgement to insert a complex request for such a basic ability. It is also an error of judgement to have Claude decide whether it wants to improve the code or not.
It is so bad that I stopped working on my current project and went to try other models. So far Qwen is quite promising.
Maybe using a heartbeat to detect live sessions to cache longer than sessions the user has already closed. And only do it for long sessions where a cache miss would be very expensive.
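A sketch of that heartbeat idea (all thresholds made up): pick the cache TTL per request based on whether the session has pinged recently and whether the context is big enough that a miss would hurt.

import time

HEARTBEAT_WINDOW_S = 120       # assumed: session counts as live if pinged in the last 2 minutes
BIG_CONTEXT_TOKENS = 150_000   # assumed: only pay for the long TTL when a miss would be expensive

def choose_cache_ttl(last_heartbeat_ts: float, context_tokens: int) -> str:
    live = (time.time() - last_heartbeat_ts) < HEARTBEAT_WINDOW_S
    if live and context_tokens >= BIG_CONTEXT_TOKENS:
        return "1h"   # keep a big prefix warm for a session that is still active
    return "5m"       # default short TTL for idle sessions or small contexts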
Running Claude Cowork in the background will eat tokens, and it might not be the most efficient use of them.
Last, but not least, turning off 1M token context by default is helpful.