To me, it is hard to reject this hypothesis today. That Anthropic is rapidly trying to raise prices may betray that their recent lead comes at the cost of dramatically higher operating costs. Their gross margins for this past quarter will be an important data point on this.
I think the tendency for graphs of model assessment to display the log of cost/tokens on the x axis (e.g. Artificial Analysis' site) has obscured this dynamic.
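To make the log-axis point concrete, here is a minimal sketch. The dollar figures are made up purely for illustration and are not taken from any real chart. It shows that equally spaced points on a log10 x axis hide very different absolute cost gaps.

```python
import math

def log_axis_position(cost: float) -> float:
    """Horizontal position of a cost value on a log10 x axis."""
    return math.log10(cost)

# Hypothetical per-task costs in dollars (illustrative only).
costs = [0.10, 1.00, 10.00]
positions = [log_axis_position(c) for c in costs]

# Horizontal gaps between neighbouring points on the chart.
axis_gaps = [b - a for a, b in zip(positions, positions[1:])]

# Actual dollar gaps between the same neighbouring points.
dollar_gaps = [b - a for a, b in zip(costs, costs[1:])]

print("axis gaps:  ", axis_gaps)
print("dollar gaps:", dollar_gaps)
```

The two models on the right look exactly as far apart as the two on the left, even though the dollar gap is about ten times larger.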
So there's a push for them to increase revenue per user, which brings us closer to the real cost of running these models.
At that point you are beholden to your shareholders and no longer can eschew profit in favor of ethics.
Unfortunately, I think this is the beginning of the end of Anthropic and Amodei being a company and CEO you could actually get behind and believe were trying to do "the right thing".
It will become an increasingly more cutthroat competition between Anthropic and OpenAI (and perhaps Google eventually if they can close the gap between their frontier models and Claude/GPT) to win market share and revenue.
Perhaps Amodei will eventually leave Anthropic too and start yet another AI startup because of Anthropic's seemingly inevitable prioritization of profit over safety.
A publicly traded company is legally obligated to go against the global good.
Call me an optimist, but I'm still holding out hope that Amodei is doing, and still can do, the right thing. That hope is fading fast though.
So no matter what, if you do something lots of people like (and hence compensate you for), you will be evil.
It's a very interesting quirk of human intuition.
Can't blame someone who comes to such a conclusion about money and power.
Or they are just not willing to burn obscene levels of capital like OpenAI.
Much of the token usage is in reasoning, exploring, and code generation rather than outputs to the user.
Does making Claude sound like a caveman actually move the needle on costs? I am not sure anymore whether people are serious about this.
To me, caveman sounds bad and is not as easy to understand compared to normal English.
I'm already at 27% of my weekly limit in ONE DAY.
it seems to hallucinate a bit more (anecdotal)
People love to throw around "this is the dumbest AI will ever be", but the corollary to that is "this is the most aligned the incentives between model providers and customers will ever be" because we're all just burning VC money for now.
Please say this louder for everyone to hear. We are still at the stage where it is best for Anthropic's product to be as consumer-aligned (and cost-friendly) as possible. Anthropic is losing a lot of money. Both of those things will not be true in the near future.
https://platform.claude.com/docs/en/about-claude/pricing
So if you are generating more tokens, you are eating up your usage faster
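As a rough sketch of that arithmetic: the budget and per-task token counts below are hypothetical placeholders, not Anthropic's actual limits, and the 1.2x and 1.3x factors simply mirror the 20-30% figure quoted in this thread.

```python
def tasks_per_budget(budget_tokens: float, tokens_per_task: float) -> float:
    """How many tasks fit into a fixed token allowance."""
    return budget_tokens / tokens_per_task

# Hypothetical numbers, chosen only to make the ratio visible.
WEEKLY_BUDGET_TOKENS = 1_000_000.0
BASELINE_TOKENS_PER_TASK = 10_000.0

baseline_tasks = tasks_per_budget(WEEKLY_BUDGET_TOKENS, BASELINE_TOKENS_PER_TASK)

# Same per-token price, same allowance, but each task emits more tokens.
results = {}
for inflation in (1.2, 1.3):
    inflated = BASELINE_TOKENS_PER_TASK * inflation
    results[inflation] = tasks_per_budget(WEEKLY_BUDGET_TOKENS, inflated)

for inflation, tasks in results.items():
    lost = 1 - tasks / baseline_tasks
    print(f"{inflation:.1f}x tokens per task: {tasks:.1f} tasks ({lost:.0%} fewer)")
```

Note that 20-30% more tokens per task works out to roughly 17-23% fewer tasks per allowance, since the allowance is divided by the inflation factor rather than reduced by it.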
It is like comparing an 8K display to a 16K display: at normal viewing distance the difference is imperceptible, but 16K comes at a significant premium.
The same applies to intelligence. Sure, some users might register a meaningful bump, but if 99% can't tell the difference in their day-to-day work, does it matter?
A 20-30% cost increase needs to deliver a proportional leap in perceivable value.
For coding though, there is kind of no limit to the complexity of software. The more invariants and potential interactions the model can be aware of, the better presumably. It can handle larger codebases. Probably past the point where humans could work on said codebases unassisted (which brings other potential problems).
If I can get the performance I'm seeing out of free models on a 6-year-old MacBook Pro M1, it's a sign of things to come.
Frontier models will have their place for 1) extensive integrations and tooling and 2) massive context windows. But I could see a very real local-first near future where a good portion of compute and inference is run locally and only goes to a frontier model as needed.
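A minimal sketch of what such local-first routing could look like. Everything here is an assumption for illustration: the `Answer` type, the confidence score, the thresholds, and the two stub models are hypothetical and stand in for whatever local runtime and hosted API you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Answer:
    text: str
    confidence: float  # 0.0-1.0; assumed to come from the model or a verifier

Model = Callable[[str], Answer]

def route(prompt: str, local_model: Model, frontier_model: Model,
          threshold: float = 0.7, max_local_chars: int = 8000) -> Tuple[str, Answer]:
    """Try the local model first; escalate only when needed."""
    # Very long prompts go straight to the frontier model (context window).
    if len(prompt) > max_local_chars:
        return "frontier", frontier_model(prompt)
    answer = local_model(prompt)
    # Keep the local answer if it clears the confidence bar.
    if answer.confidence >= threshold:
        return "local", answer
    return "frontier", frontier_model(prompt)

# Stub models for demonstration only.
def tiny_local(prompt: str) -> Answer:
    return Answer("local draft", 0.9 if "rename" in prompt else 0.3)

def big_frontier(prompt: str) -> Answer:
    return Answer("frontier answer", 0.95)

print(route("rename this variable", tiny_local, big_frontier)[0])
print(route("design a distributed lock", tiny_local, big_frontier)[0])
```

The hard part in practice is the confidence signal; a real setup would probably need tests or a verifier rather than a self-reported score.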
It's not necessarily a single discrete point, I think. In my experience, it's tied to the quality/power of your harness and tooling. More powerful tooling has revealed differences between models that were previously not easy to notice. This matches your display analogy, because I'm essentially saying that the point at which display resolution improvements become imperceptible depends on how far away you sit.
Gamblers (vibe-coders) at Anthropic's casino realising that their new slot machine upgrade (Claude Opus) is now taking 20%-30% more credits for every push of the spin button.
Problem is, it advertises how good it is (unverified benchmarks) and has a better random number generator but it still can be rigged (made dumber) by the vendor (Anthropic).
The house (Anthropic) always wins.
> People just want free tools forever?
Using local models is the answer to this if you want to use AI models free forever.
These services were and still are wholly subsidized by VC money; in terms of price increases, you have seen nothing yet. Same with the competition...
Re-ran the bake-off with 4.7 authoring and… gpt5.4 still clearly winning. Same skills, same prompts, same agents.md.
> In Claude Code, we’ve raised the default effort level to xhigh for all plans.
Try changing your effort level and see what results you get
I find 5 thinking levels to be super confusing - I don't really get why they went from 3 -> 5
The final calculation assumes that Opus 4.7 uses the exact same trajectory + reasoning output as Opus 4.6. I have not verified, but I assume it not to be the case, given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc.
"given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc."
Opus 4.7 in general is more expensive for similar usage. Now we can argue that it provides better performance, all else being equal, but I haven’t been able to see that.
I think a big issue with the industry right now is it's constantly chasing higher performing models and that comes at the cost of everything else. What I would love to see in the next few years is all these frontier AI labs go from just trying to create the most powerful model at any cost to actually making the whole thing sustainable and focusing on efficiency.
The GPT-3 era was a taste of what the future could hold, but those models were toys compared to what we have today. We saw real gains during the GPT-4 / Claude 3 era, where they could start being used as tools but required quite a bit of oversight. Now in the GPT-5 / Claude 4 era I don't really think we need to go much further; we should start focusing on efficiency and sustainability.
What I would love the industry to start focusing on in the next few years is not on the high end but the low end. Focus on making the 0.5B - 1B parameter models better for specific tasks. I'm currently experimenting with fine-tuning 0.5B models for very specific tasks and long term I think that's the future of AI.
And if it's not good enough for coding, what kind of money, if any, would make it good enough?
Many providers out there host open weights models for cheap, try them out and see what you think before actually investing in hardware to run your own.
Do yourself a favor: Set up OpenCode and OpenRouter, and try all the models you want to try there.
Other than the top performers (e.g. GLM 5.1, Kimi K2.5, where required hardware is basically unaffordable for a single person), the open models are more trouble than they're worth IMO, at least for now (in terms of actually Getting Shit Done).
Fun fact: AWS offers Apple silicon EC2 instances you can spin up to test.
I also wonder if token utilization has or will ever find its way to employee performance reviews as these models go up in price.
Except, it's not that trivial to solve. I tried experimenting with asking the model to first give a list of symbols it will modify, and then just write the modified symbols. The results were OK, but less refined than when it echoes back the entire file.
The way I see it is that when you echo back the entire file, the process of thinking "should I do an edit here" is distributed over a longer span, so it has more room to make a good decision. Like instead of asking "which 2 of the 10 functions should you change" you're asking it "should you change method1? what about method2? what about method3?", etc., and that puts less pressure on the LLM.
Except, currently we are effectively paying for the LLM to make that decision for *every token*, which is terribly inefficient. So, there has to be some middle ground between expensively echoing back thousands of unchanged tokens and giving an error-ridden high-level summary. We just haven't found that middle ground yet.
I thought coding harnesses provided tools to apply diffs so the LLM didn't have to echo back the entire file?
So, in practice, many tools still work on the file level.
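For context, a diff-style edit tool often boils down to an exact-match search-and-replace. The sketch below is only an illustration of that idea, not any particular harness's implementation; the function names are made up, and character counts are used as a crude stand-in for tokens.

```python
def apply_edit(source: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`.

    Refuses ambiguous or missing matches, which is one reason
    these tools fail and fall back to rewriting the whole file.
    """
    count = source.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(old, new, 1)

def edit_payload_size(old: str, new: str) -> int:
    """Characters the model must emit for a targeted edit."""
    return len(old) + len(new)

# A small file with many unchanged functions and one to modify.
source = "".join(f"def f{i}():\n    return {i}\n\n" for i in range(50))
old_snippet = "def f7():\n    return 7\n"
new_snippet = "def f7():\n    return 70\n"

edited = apply_edit(source, old_snippet, new_snippet)

print("edit payload:", edit_payload_size(old_snippet, new_snippet), "chars")
print("full echo:   ", len(edited), "chars")
```

The exactly-one-match check is the fragile part: if the model misquotes a single character of the old snippet, the edit fails, and the cheap fix is often to echo the file again.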
grit.io was working on this years ago, not sure if they are still alive/around, but I liked their approach (just had a very buggy transformer/language).
Feels like LLMs are devolving into having a single, instantly recognizable and predictable writing style.
Recently it started prompting me for feedback even though I am on API access and have disabled this. When I did a deep dive of their feedback mechanism in the past (months ago, so it has probably changed a lot since then), the feedback prompt was pushing message ids even if you didn't respond. If you are on API usage and have told them no to training on your data, then anything pushing a message id implies that it is leaking information about your session. It is hard to keep auditing them when they push so many changes, so I am now 'default they are stealing my info' instead of believing their privacy/data use policy claims. Basically, my level of trust is eroding fast in their commitment to not training on me, and I am paying a premium to not have that happen.
https://docs.github.com/fr/copilot/reference/ai-models/suppo...
At 7.5x for 4.7, heck no. It isn't even clear it is an upgrade over Opus 4.6.
This is already becoming apparent as users are seeing quality degrade, which implies that Anthropic is dropping performance across the board to minimize financial losses.
> max: Max effort can deliver performance gains in some use cases, but may show diminishing returns from increased token usage. This setting can also sometimes be prone to overthinking. We recommend testing max effort for intelligence-demanding tasks.
> xhigh (new): Extra high effort is the best setting for most coding and agentic use cases
Ref: https://platform.claude.com/docs/en/build-with-claude/prompt...
Commercial inference providers serve Chinese models of comparable quality at 0.1x-0.25x. I think Anthropic realised that the game is up and they will not be able to hold the lead in quality forever so it's best to switch to value extraction whilst that lead is still somewhat there.
"Comparable" is doing some heavy lifting there. Comparable to Anthropic models in 1H'25, maybe.
uberman•1h ago
Given that Opus 4.6 and even Sonnet 4.6 are still valid options, for me the question is not "Does 4.7 cost more than claimed?" but "What capabilities does 4.7 give me that 4.6 did not?"
Yesterday 4.6 was a great option and it is too soon for me to tell if 4.7 is a meaningful lift. If it is, then I can evaluate if the increased cost is justified.
pier25•1h ago
ed_elliott_asc•1h ago
solenoid0937•1h ago
https://marginlab.ai/trackers/claude-code-historical-perform...
Majromax•1h ago
Moreover, on the companion codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked yet none correspond to a visual break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.
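On the sensitivity question, a back-of-the-envelope check helps. This is a sketch under stated assumptions: the pass rates and task counts are hypothetical (I don't know the tracker's actual sample size), and it treats each run as an independent binomial sample.

```python
import math

def pass_rate_std_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)

def is_detectable(rate_a: float, rate_b: float, n_tasks: int, z: float = 1.96) -> bool:
    """True if the gap exceeds ~95% noise for two independent runs."""
    se_a = pass_rate_std_error(rate_a, n_tasks)
    se_b = pass_rate_std_error(rate_b, n_tasks)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(rate_a - rate_b) > z * se_diff

# Hypothetical: model B really is 5 points better than model A.
small_suite = is_detectable(0.60, 0.65, n_tasks=50)
large_suite = is_detectable(0.60, 0.65, n_tasks=2000)

print("50 tasks:  ", small_suite)
print("2000 tasks:", large_suite)
```

With a small daily task sample, a real 5-point improvement would sit well inside the noise, so a flat line across releases does not by itself show the models are equally capable.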
cbg0•1h ago
addisonj•1h ago
But... Are you really going to rely completely on benchmarks that have time and time again been shown to be gamed as the complete story?
My take: It is pretty clear that the capacity crunch is real and the changes they made to effort are in part to reduce that. It likely changed the experience for users.
grim_io•1h ago
nfredericks•1h ago
grim_io•56m ago
Jeremy1026•1h ago
hypercube33•43m ago