I would be interested to know where the claim of the “killer combination” comes from. I would also like to know who the people behind Z.ai are — I haven’t heard of them before. Their plans seem crazy cheap compared to Anthropic, especially if their models actually perform better than Opus.
To be clear, Z.ai are the people who built GLM 4.5, so they're talking up their own product.
But to be fair, GLM 4.5 and GLM 4.5 Air are genuinely good coding models. GLM 4.5 Air costs about 10% of what Claude Sonnet does (when hosted on DeepInfra, at least), and it handles simple coding tasks quite quickly. I haven't tested the full GLM 4.5, but it seems to be popular as well.
If you can easily afford all the Claude Code tokens you want, then you'll probably get better results from Sonnet. But if you already know enough programming to work around any issues that arise, the GLM models are quite usable.
But you can't easily run GLM 4.5 Air locally without professional workstation- or server-grade hardware (an RTX 6000 Pro with 96 GB would be nice), at least not without a serious speed hit.
Still, it's a very interesting sign for the future of open coding models.
This is the data for that claim: https://huggingface.co/datasets/zai-org/CC-Bench-trajectorie...
Chinese software always has such a distinctive design language:
- prepay and then use credit to subscribe
- strange serif font
- that slider thing for captcha
But I'm going to try it out now.
Also fascinating how they solved the issue that Claude Code expects a model with a 200k+ token context window while GLM 4.5 has 128k.
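I don't know what they actually did, but the usual trick in a compatibility layer is to compact or drop older turns until the request fits the smaller window. A minimal sketch of that idea in Python, with a stand-in token counter (not their implementation):

  # Drop the oldest non-system turns until the request fits a smaller context window.
  # count_tokens is a rough stand-in; a real proxy would use the model's tokenizer.
  def count_tokens(msg: dict) -> int:
      return len(msg["content"]) // 4  # crude chars-per-token heuristic

  def fit_to_window(messages: list[dict], limit: int = 128_000, reserve: int = 8_000) -> list[dict]:
      budget = limit - reserve  # leave headroom for the model's reply
      system = [m for m in messages if m["role"] == "system"]
      rest = [m for m in messages if m["role"] != "system"]
      while rest and sum(map(count_tokens, system + rest)) > budget:
          rest.pop(0)  # the oldest user/assistant turn goes first
      return system + rest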
I think this is why many people have concerns about AI. This group can't express ideas neutrally; they have to hype up even a simple official documentation page.
Maybe it's best for shorter tasks or condensed context?
I find it interesting how many models are latching onto Claude Code's harness. I'm still using Cursor for work and personal projects, but I tried out opencode and Claude Code for a bit. I just miss having the checkpoints and whatnot.
I'm really concerned that some of the providers are serving quantized versions of the models so they can fit more models per card and run larger inference batches.
We are heavily incentivized to prioritize high-quality inference and make it transparent, and we have no incentive to offer quantized, poorly performing alternatives. We certainly hear plenty of anecdotal reports like this, but when we dig in, we generally don't see it.
An exception is when a model is first released -- for example, this terrific work by Artificial Analysis: https://x.com/ArtificialAnlys/status/1955102409044398415
It does take providers time to learn how to run the models in a high-quality way; my expectation is that the difference in quality will be (or already is) minimal over time. The large variance in that case was because gpt-oss had only been out for a couple of weeks.
For well-established models, our (admittedly limited) testing has not revealed much variance between providers in terms of quality. There is some, but it's not as if we see a couple of providers 'cheating' by secretly quantizing and clearly serving less intelligent versions of the model. We're going to get more systematic about it, though, and perhaps will uncover some surprises.
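If you want to spot-check this yourself, one rough approach is to pin the same prompt to different providers and diff the answers. A sketch using the OpenAI SDK against OpenRouter's provider-routing option; the provider slugs and model id are illustrative, and a single greedy run is only a crude signal, not proof of quantization:

  from openai import OpenAI

  client = OpenAI(
      base_url="https://openrouter.ai/api/v1",
      api_key="YOUR_OPENROUTER_KEY",
  )

  prompt = "Implement an LRU cache in Python with O(1) get and put."

  # Pin the request to one provider at a time and compare the answers.
  for provider in ["deepinfra", "novita"]:  # illustrative provider slugs
      resp = client.chat.completions.create(
          model="z-ai/glm-4.5-air",  # assumed model id, check the catalog
          messages=[{"role": "user", "content": prompt}],
          temperature=0,
          extra_body={"provider": {"order": [provider], "allow_fallbacks": False}},
      )
      print(provider, resp.choices[0].message.content[:200])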
This is quite nice. I'll try it out a bit longer over the weekend. I tested it using Claude Code with environment variable overrides.
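In case it's useful to anyone else, this is roughly the shape of the override, sketched with the Anthropic Python SDK; the base URL and model id are what I recall from their docs, so double-check them:

  import anthropic

  # Point the Anthropic SDK at an Anthropic-compatible endpoint instead of the default.
  # Claude Code itself picks up the same settings from ANTHROPIC_BASE_URL / ANTHROPIC_AUTH_TOKEN.
  client = anthropic.Anthropic(
      base_url="https://api.z.ai/api/anthropic",  # assumed endpoint, check Z.ai's docs
      api_key="YOUR_ZAI_KEY",
  )

  msg = client.messages.create(
      model="glm-4.5",  # assumed model id
      max_tokens=1024,
      messages=[{"role": "user", "content": "Write a binary search in Python."}],
  )
  print(msg.content[0].text)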
> GLM-4.5 and GLM-4.5-Air are our latest flagship models
Maybe it is great, but with a conflict of interest so obvious I can't exactly take their word for it.