To be clear, I'm not saying that it's a good thing, but it does seem to be going in this direction.
Under the hood, what was happening is that older models needed these reminders, while 4.7 no longer does. When we showed the reminders to 4.7, it tended to over-fixate on them. The fix was to stop adding the cyber reminders.
More here: https://x.com/ClaudeDevs/status/2045238786339299431
It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be as if we reversed the democratization of compilers and coding tooling that happened in the 90s and 00s, and the polished, more capable tools were again all proprietary.
You could call it a rug pull, but they may just be doing the math and realize this is where pricing needs to shift to before going public.
Oh well
OpenAI was built as you say. Google had a corporate motto of "Don't be evil" which they removed so they could, um, do evil stuff without cognitive dissonance, I guess.
This is the other kind of enshittification, where the businesses turn into power accumulators.
Plenty of OSS models have been released lately, with GLM and Kimi arguably the most interesting for the near-SOTA case ("give these companies a run for their money"). Of course, actually running them locally for anything other than very slow Q&A is hard.
Though, from my limited testing, the new model is far more token-hungry overall.
https://artificialanalysis.ai/?intelligence-efficiency=intel...
Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted, whether the output savings offset the input increase will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
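A quick back-of-the-envelope makes the trade-off concrete. All prices and token volumes below are hypothetical, not Anthropic's actual numbers; the point is just that the break-even depends entirely on your input:output token mix:

```python
# Back-of-the-envelope: does a cheaper output rate offset a pricier input rate?
# All prices ($/M tokens) and token volumes (M tokens) are made up for illustration.

def bill(input_m: float, output_m: float, in_price: float, out_price: float) -> float:
    """Total cost in dollars for a given token mix and price schedule."""
    return input_m * in_price + output_m * out_price

mix = (1000, 200)  # a 5:1 input-heavy workload, in millions of tokens

old = bill(*mix, in_price=3.0, out_price=15.0)  # hypothetical old rates
new = bill(*mix, in_price=3.8, out_price=8.0)   # hypothetical new rates

print(f"old: ${old:,.0f}  new: ${new:,.0f}")
# -> old: $6,000  new: $5,400 : the output discount wins at this mix.
# At a 20:1 mix (1000, 50): old = $3,750, new = $4,200 : the input hike dominates.
```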
I’ve noticed 4.7 cycling a lot more on basic tasks, though it also seems a bit better at holding long-running context.
In my opinion, we've reached some ceiling where more tokens lead only to incremental improvements. A conspiracy seems unlikely given all providers are still competing for customers, and a 50% token increase drives infra costs up dramatically too.
If I can have Claude write up the plan, and the other models actually execute it, I'd get the best of both worlds.
(Amusingly, I think Codex tolerates being invoked by Claude (de facto tolerated ToS violation), but not the other way around.)
If tech companies convince Congress that AI is an existential issue (in defense or even just productivity), then these companies will get subsidies forever.
And shafting your customers too hard is bad for business, so I expect only moderate shafting. (Kind of surprised at what I've been seeing lately.)
If the models don't get to a higher level of 'intelligence' and still struggle with certain basic tasks at the SOTA while also getting more expensive, then the pitch is misleading and unlikely to happen.
What I've been doing is running a dual-model setup — use the cheaper/faster model for the heavy lifting where quality variance doesn't matter much, and only route to the expensive one when the output is customer-facing and quality is non-negotiable. Cuts costs significantly without the user noticing any difference.
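A minimal sketch of that routing logic, with made-up model names and a deliberately crude "customer-facing" heuristic; this is the shape of the idea, not anyone's actual API:

```python
# Dual-model router: cheap model for internal heavy lifting,
# expensive model only when the output is customer-facing.
# Model identifiers and task fields are hypothetical.

CHEAP_MODEL = "fast-cheap-model"
EXPENSIVE_MODEL = "flagship-model"

def pick_model(task: dict) -> str:
    """Route to the flagship only when quality is non-negotiable."""
    if task.get("customer_facing") or task.get("quality") == "non_negotiable":
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

tasks = [
    {"name": "summarize internal logs", "customer_facing": False},
    {"name": "draft reply to a customer complaint", "customer_facing": True},
]
for t in tasks:
    print(f'{t["name"]} -> {pick_model(t)}')
```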
The real risk is that pricing like this pushes smaller builders toward open models or Chinese labs like Qwen, which I suspect isn't what Anthropic wants long term.
A smaller builder might reconsider (re)acquiring relevant skills and applying them. We don't suddenly lose the ability to program (or hire someone to do it) just because an inference provider is available.
There are 2 things to consider:
* Time to market.
* Building a house on someone else's land.
You're balancing the 2, hoping that you win the time to market (making the second point obsolete from a cost perspective) or that you have money to pivot to DIY.

This is going to be blunt, but this business model is fundamentally unsustainable, and "founders" don't get to complain that their prospecting costs went up. These businesses are setting themselves up to get Sherlocked.
The only realistic exit for these kinds of businesses is to score a couple gold nuggets, sell them to the highest bidder, and leave.
We'll be keeping an eye on open models (of which we already make good use). I think that's the way forward. Actually, it would be great if everybody put more focus on open models; perhaps we can come up with something like the "linux/postgres/git/http/etc" of LLMs: something we can all benefit from without it being monopolized by a single billionaire company. Wouldn't it be nice if we didn't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough.
One of two main reasons why I'm wary of LLMs. The other is fear of skill atrophy. These two problems compound. Skill atrophy is less bad if the replacement for the previous skill does not depend on a potentially less-than-friendly party.
We have built multi-cloud disaster recovery into our infrastructure. Something I would not have done yet had we not had LLMs.
I am learning at an incredible rate with LLMs.
What an interesting paradox-like situation.
And not even just understanding, but verifying that they’ve implemented the optimal solution.
But I’m so much more detached from the code; I don’t feel that ‘deep neural connection’ that comes from actually spending days locked in a refactor or debugging a really complex issue.
I don’t know how I feel about it.
But if you don't and there's no PR process (side projects), the motivation to form that connection is quite low.
Sure, you don't know the code by heart, but people debugging code translated to assembly already do that.
The big difference is being able to unleash scripts that invalidate an enormous number of hypotheses very fast and that can analyze the data.
Doing that by hand used to take hours, so it was a last-resort approach. Now it's very cheap, so validating many hypotheses is way cheaper!
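As a sketch of what cheap hypothesis elimination can look like in practice (the log file name, patterns, and hypotheses are all invented for illustration):

```python
# Rule out several failure hypotheses in one cheap pass over a log.
# Log file name, regex patterns, and hypotheses are hypothetical.
import re

def check(lines: list[str], pattern: str, hypothesis: str) -> None:
    """Count matching lines and report whether the hypothesis survives."""
    hits = sum(1 for line in lines if re.search(pattern, line))
    verdict = "still plausible" if hits else "ruled out"
    print(f"{hypothesis}: {verdict} ({hits} matching lines)")

with open("app.log") as f:  # hypothetical log file
    lines = f.read().splitlines()

check(lines, r"timed? ?out", "network timeouts cause the failures")
check(lines, r"OOMKilled",   "workers are being OOM-killed")
check(lines, r"deadlock",    "a lock-ordering bug is involved")
```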
I feel like my "debugging ability", in terms of value delivered, has gone way up. As a skill, it's changing in ways I can't quite pin down, but the value I'm delivering in debugging sessions has clearly gone way up.
Could you do it again without the help of an LLM?
If no, then can you really claim to have learned anything?
And yes: if LLMs disappear, then we'd need to hire a lot of people to maintain the infrastructure.
Which naturally is a part of the risk modeling.
It’s quite possible to be deep into solving a problem with an LLM guiding you where you’re reading and learning from what it says. This is not really that different from googling random blogs and learning from Stack Overflow.
Assuming everyone just sits there dribbling whilst Claude is in YOLO mode isn’t always correct.
You very much decide how you employ LLMs.
Nobody is holding a gun to your head to use them, in a certain way.
So if you use them in a way that increases your inherent risk, then you are doing it incredibly wrong.
I don't believe it. Having something else do the work for you is not learning, no matter how much you tell yourself it is.
Open your eyes, and you might become a believer.
I've worked with people who will look at code they don't understand, say "llm says this", and express zero intention of learning something.
It's like, why even review that PR in the first place if you don't even know what you're working with?
But it requires that one does not do something stupid.
E.g., for recurring tasks: keep the task specification in the source code and just ask Claude to execute it.
The same with all documentation, etc.
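For example, a recurring-task spec can live as a comment at the top of the file it governs, so the prompt is always just "execute the spec". Everything below (the file name, spec wording, and the `claude -p` invocation) is an illustrative assumption, not a prescribed setup:

```python
# tools/gen_docs.py : hypothetical example of an in-repo task spec.
#
# TASK SPEC (recurring; ask the agent to execute this verbatim):
#   1. Regenerate docs/api.md from the docstrings in this module.
#   2. Do not modify any file other than docs/api.md.
#   3. Run the doc linter and fix any warnings before finishing.
#
# Typical invocation (hypothetical; adapt to your agent CLI):
#   claude -p "Execute the TASK SPEC at the top of tools/gen_docs.py"

def generate_api_docs() -> None:
    """Placeholder for the behavior the spec documents."""
    ...
```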
My manager doesn't even want us to use Copilot locally. Now we're supposed to only use the GitHub Copilot cloud agent: one shot from prompt to PR. With people like that selling vendor lock-in for them, companies like GitHub, OpenAI, Anthropic, etc. don't even need sales and marketing departments!
That's an incentive difficult to reconcile with the user's benefit.
To keep this business running they do need to invest to make the best model, period.
It happens to be exactly what Anthropic's strategy is. That and great tooling.
The latest Claude still fails the car wash test.
After just ~4 prompts I blew past my daily limit. Another ~7 prompts and I blew past my weekly limit.
The entire HTML/CSS/JS was less than 300 lines of code.
I was shocked how fast it exhausted my usage limits.
With an enterprise subscription, the bill gets bigger, but it's not like a VP can easily send a memo to all staff that a migration is coming.
Individuals may end their subscriptions, which would ease the data-center usage and turn profits up.
The whole magic of (pre-nerfed) 4.6 was how it magically seemed to understand what I wanted, regardless of how well I articulated it.
Now Anthropic is framing the need to explicitly define instructions as a "feature"?!
My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.
Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.
Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.
But now I see that 4.7 is going to replace both 4.5 and 4.6 in VS Code Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance, and if the 4.5-to-4.6 change is any guide, more overthinking targeted at long-running tasks rather than fine-grained work. For me, that seems like a step backwards.
I hit my 5-hour limit within 2 hours yesterday. Initially I tried the batched mode for a refactor, but cancelled after seeing it take 30% of the limit within 5 minutes. A serial approach consumed less (took ~50 minutes, xhigh effort, ~60% of the remaining allocation IIRC), but still very clearly burned through the limit much faster than 4.6.
It feels like every exchange takes ~5% of the 5-hour limit now, when it used to be maybe ~1-2%. For reference, I'm on the Max 5x plan.
For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.
To me this seems more like it's trained to be concise by default, which I guess can be countered with preference instructions if required.
What's interesting to me is that they're using a new tokeniser. Does that mean they trained a new model from scratch? Or took an existing model and further trained it with a swapped-out tokeniser?
The looped-model research/speculation is also quite interesting - if done right, there are significant speed-ups and resource savings.
I'm surprised that it's 45%. Might go down (?) with longer context answers but still surprising. It can be more than 2x for small prompts.