I knew of all the 3.5s and the one 3.6, but I only just heard about the Plus.
OpenAI, on the other hand, has separate models optimized for coding (GPT-x-codex); Anthropic doesn't have this distinction.
https://deepinfra.com/zai-org/GLM-5.1
Looks like fp4 quantization now though? Last week it was showing fp8. Hm...
I also regularly see Deepinfra slow to an absolute crawl; I've actually gotten more consistent performance from Z.ai.
I really liked Deepinfra, but something doesn't seem right over there at the moment.
It's frankly a bummer that there doesn't seem to be a better serving option for GLM 5.1 than Z.ai, which seems to have reliability and cost issues.
I did give it one more complex task, and I was quite impressed by the result. I had a local setup with Tilt (tilt.dev), K3s, and a pnpm monorepo that was failing to run the web application's dev server; after inspecting the containers, GLM correctly figured out that it was a container image build cache issue and corrected the Tiltfile and build setup.
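To give a rough idea, the fix amounted to something like the sketch below: scoping the docker_build context so cache invalidation actually tracks the files that matter. All names here (image, paths, resource) are illustrative, not my actual setup.

    # Tiltfile (Starlark) -- illustrative sketch, not the exact fix GLM made.
    # Problem: Tilt kept reusing a stale cached image layer, so the pnpm
    # install step never re-ran when the lockfile changed.

    docker_build(
        'registry.local/web',                 # hypothetical image name
        context='.',
        dockerfile='apps/web/Dockerfile.dev', # hypothetical path
        # Limit the build context to the files that matter, so a change to
        # the app source or the lockfile invalidates the cached layers.
        only=['apps/web', 'pnpm-lock.yaml', 'package.json'],
    )

    k8s_yaml('deploy/web.yaml')               # hypothetical manifest path
    k8s_resource('web', port_forwards=3000)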
CC has limited capacity for Opus, but it's fairly generous for Sonnet. With Codex I've never had issues hitting my limits, and I'm only a Pro user.
They have difficulty supplying their users with capacity, though they acknowledged in an email that they're aware of it. During peak hours I experience degraded performance, but I'm on their lowest subscription tier, so I understand if my demand isn't prioritized during those hours.
For more complicated stuff, like queries or data comparison, Codex always seems behind for me.
It's not crushing Opus 4.5 in real-life use for me, but it's close enough to be near interchangeable with Sonnet for a lot of tasks, though some of the "savings" are eaten up by it seemingly using more tokens for tasks of similar complexity (I don't have enough data yet, but I've pushed ~500M tokens through it so far).
I find even the SOTA models to be far from trustworthy for anything beyond throwaway tasks. Supervising a less-than-SOTA model to save $10 to $100 per month is not attractive to me in the least.
I have been experimenting a lot with self-hosted models for smaller throwaway tasks. It's fun, but I'm not going to waste my time on it for the real work.
Now, given they can't satisfy current volume, they are forced to settle for just having crazy margins.
The idea that every new foundation model needs to be pretrained from scratch, using warehouses of GPUs to crunch the same 50 terabytes of data from the same original dumps of Common Crawl and various Russian pirate sites, is hard to justify on an intuitive basis. I think the hard work has already been done. We just don't know how to leverage it properly yet.
People think that Chinese AI labs are just super cool bros that love sharing for free.
They don't understand that it's just a state-sponsored venture meant to further entrench China in global supply chains and logistics. China's VCs are Chinese banks and a sprinkle of "private" money; private in quotes because technically it still belongs to the state anyway.
China doesn't have companies and a government like the US does. It just has the government, plus a thin veil of "company" that readily fools Westerners.
I still prefer that over US total dominance.
Let them fight it out.
And I've been using Claude, Gemini, GLM, and Qwen to double-check my math and my code, and to get practical information for making my path tracer more efficient. Claude and Gemini failed me a couple of times with wrong, misleading, or unnecessary information, while Qwen always gave me proper, practical, correct information. I've almost stopped using Claude and Gemini so as not to waste my time anymore.
Claude Code may shine at developing web applications, backends, and simple games, but it's definitely not for me. And this is just the story of my specific use case.
This is not my experience at all: Qwen3.6-Plus spits out multiple paragraphs of text for the prompts I give it. It wasn't like this before. Now I have to explicitly tell it not to yap so much and to keep it short, concise, and direct.
In my own experience, even with a web app of medium scale (think an Odoo-style ERP), they are next to useless at understanding and modeling the domain correctly, even with very detailed written specs fed in (a whole directory with an index.md, sub-sections, and more detailed sections/chapters in separate markdown files, with pointers in index.md). And I'm not talking open-weight models here; I'm talking SOTA Claude Opus 4.6, Gemini 3.1 Pro, etc.
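For concreteness, the spec directory looked roughly like this (the module names here are illustrative, not my actual spec):

    specs/
      index.md            # overview, with pointers into each chapter
      sales/
        orders.md         # a more detailed section/chapter
        invoicing.md
      inventory/
        stock-moves.md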
But that narrative isn't popular. I see parallels here with the crypto and NFT era. That was surely the future, and at least my firm pays me in crypto, whereas NFTs are used for rewarding bonuses.
OTOH, we spotted a wrong learning-rate formula on Wikipedia, and it's now correct :) We did it without Gemini, just our intuition of "hmm, this formula doesn't seem right"; that definitely inflated our egos.
They brag about Qwen but don't let people use it.
Someone1234•1h ago
Even many people on a Claude subscription aren't choosing, or able to choose, Opus 4.7 because of those cost/usage pressures; they often use Sonnet or an older Opus because of the value vs. quality curve.
hirako2000•1h ago
In any case, a benchmark provided by the vendor is always biased; they'll pick the frameworks where their model fares well and omit the others.
Independent benchmarks are the go-to.
vidarh•49m ago
If even cheaper models start reaching that level (GLM 5.1 is also close enough that I'm using it a lot), that's a big deal, and a totally valid reason to compare against Opus 4.5.
jasonjmcghee•29m ago
For me, Opus 4.5 and 4.6 feel very different from Sonnet.
Maybe I'm lazy or something, but in my experience Sonnet is much worse at correctly inferring intent if I've left any ambiguity.
And that effect really compounds.