Models are picky enough about prompting styles that changing to a new model every week/month becomes an added chunk of cognitive overload, testing and experimentation, plus even in developer tooling there have been minor grating changes in API invocations and use of parameters like temperature (I have a fairly low-level wrapper for OpenAI, and I had to tweak the JSON handling for GPT-5).
Also, there are just too many variations in API endpoints, providers, etc. We don’t really have a uniform standard. Since I don’t use “just” OpenAI, every single tool I try out requires me to jump through a bunch of hoops to grab a new API key, specify an endpoint, etc.—and it just gets worse if you use a non-mainstream AI endpoint.
They say that the number of users on Claude 4.5 spiked and then a significant number of users reverted to 4.0 with the trend going up, and they are talking about their usage metrics. So I don't get how your comment is relevant to the article ?
It loves doing a whole bunch of reasoning steps and prolaim how mucf of a very good job it did clearing up its own todo steps and all that mumbo jumbo, but at the end of the day, I only asked it a small piece of information about nginx try_files that even GPT3 could answer instantly.
Maybe before you make reasoning models that go on funny little sidequests wher they multiply numbers by 0 a couple of times, make it so its good at identfying the length of a task. ntil then, I'll ask little bro and advance only if necessity arrives. And if it ends up gathering dust, well... yeah.
(§) You know that it's a hyperlink, do you? /s
If you are talking about local models, you can switch that off. The reasoning is a common technique now to improve the accuracy of the output where the question is more complex.
Imagine waiting for a minute until Google spits out the first 10 results.
My prediction: All AI models of the future will give an immediate result, with more and more innovation in mechanisms and UX to drill down further on request.
Edit: After reading my reply I realize that this is also true for interactions with other people. I like interacting with people who give me a 1 sentence response to my question, and only start elaborating and going on tangents and down rabbit holes upon request.
A 5090 has 32GB of VRAM allowing you to run a 32B model in memory at Q6_K.
You can run larger models by splitting the GPU layers that are run in VRAM vs stored in RAM. That is slower, but still viable.
This means that you can run the Qwen3-Coder-30B-A3B model locally on a 4090 or 5090. That model is a Mixture of Experts model with 3B active parameters, so you really only need a card with 3B of VRAM so you could run it on a 3090.
The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Yes, it will be slower than running it in the cloud. But you can get a long way with a high-end gaming rig.
> The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Reminds me of running Doom when I had to hack config.sys to forage 640KB of memory.
Less than 0.1% of the people reading this are doing that. Me, I gave $20 to some cloud service and I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.
There was even a recent release of Granite4 that runs on a Raspberry Pi.
https://github.com/Jewelzufo/granitepi-4-nano
For my local work I use Ollama. (M4 Max 128GB)
- gpt-oss. 20b or 120b depending on complexity of use cases.
- granite4 for speed and lower complexity (around the same as gpt20b).
Inference for new releases is routinely bugged for at least a month or two as well, depending on how active the devs of a specific inference engine are and how much model creators collaborate. Personally, I hate how data from GPT's few week (and arguably somewhat ongoing) sycophancy rampage has leaked into datasets that are used for training local models, making a lot of new LLM releases insufferable to use.
What is unclear from the presentation is wether they do this or not. Do teams that use Sonnet 4.5 just always use it, and teams on Sonnet 4.0 likewise? Or do individuals decided which model to use on a per task basis.
Personally I tend to default to just 1, and only go to an alternative if it gets stuck or doesn't get me what I want.
As evidenced by furious posters on r/cursor, who make every prompt to super-opus-thinking-max+++ and are astonished when they have blown their monthly request allowance in about a day.
If I need another pair of (artificial) eyes on a difficult debugging problem, I’ll occasionally use a premium model sparingly. For chore tasks or UI layout tweaks, I’ll use something more economical (like grok-4-fast or claude-4.5-haiku - not old models but much cheaper).
Nevertheless, 7 datapoints does not a trend make (and the data presented certainly doesnt explain why). The daily variation is more than I would have expected, but could also be down to what day of the week the pizza party is or the weekly scrum meetings is at a few of their customers workplaces.
30 seconds-1 minute is just the time I am patient enough to wait as that's the time I am spending on writing a question.
Faster models just make too many mistakes / don't understand the question.
This data is basically meaningless, show us the latest stats.
It could be true that newer models just produce more tokens seemingly out of no reasons. But with the increasing number of tool definitions, in the long run, I think it will pay off.
Just a few days ago, I read "Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability"[1]. I think they have a valid point that this thinking process has significance as we are moving towards agents.
[1] https://www.minimax.io/news/why-is-interleaved-thinking-impo...
But when things get more complex, I prefer GPT-5, talking with it often gives me fresh ideas and new perspectives.
Once, I set up a proxy that allowed Claude and Codex to "pair program" and collaborate, and it was cool to watch them talk to each other, delegate tasks, and handle different bits and pieces until the task was done.
Last week Claude seemed to have a shift in the way it works. The way it summarises and outputs its results is different. For me it's gotten worse. Slower, worse results, more confusing narrowing down what actually changed etc etc.
Long story short, I wish I was able to checkpoint the entire system and just revert to how it was previously. I feel like it had gotten to a stage where I felt pretty satisfied, and whatever got changed ... I just want it reverted!
This (as well as the table above it) matches my experience. Sonnet 4.0 answers SO-type questions very fast and mostly accurately (if not on a niche topic), Sonnet 4.5 is a little bit more clever but can err on the side of complexity for complexity's sake, and can have a hard time getting out of a hole it dug for itself.
ChatGPT 5 is excellent at finding sources on the web; Gemini simply makes stuff up and continues to do so even when told to verify; ChatGPT provides link that work and are generally relevant.
Manfred•2h ago