GLM-5.2 is the new leading open weights model on Artificial Analysis

https://artificialanalysis.ai/articles/glm-5-2-is-the-new-leading-open-weights-model-on-the-artificial-analysis-intelligence-index

187•himata4113•2h ago

Comments

Tiberium•1h ago

It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.

I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.

Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.

Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
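A rough sketch of that conversion, using only the token figures quoted in this comment. It computes each model's average output tokens relative to GPT 5.5 xhigh, plus the break-even price ratio below which a more verbose model is still cheaper per request. The function names are illustrative, and it assumes cost scales linearly with output tokens (input tokens and caching are ignored), which is a simplification.

```python
# Average output tokens per task, in thousands, as quoted in the comment above.
AVG_OUTPUT_TOKENS_K = {
    "GPT 5.5 high": 10,
    "GPT 5.5 xhigh": 16,
    "Fable 5": 33,
    "Opus 4.8": 41,
    "GLM 5.2": 42,
}


def relative_usage(baseline: str) -> dict[str, float]:
    """Token usage of each model as a multiple of the baseline model's usage."""
    base = AVG_OUTPUT_TOKENS_K[baseline]
    return {name: tokens / base for name, tokens in AVG_OUTPUT_TOKENS_K.items()}


def break_even_price_ratio(model: str, baseline: str) -> float:
    """Largest per-token price ratio (model / baseline) at which the model
    costs no more per request than the baseline.

    ASSUMPTION: cost = output tokens * price per token; input tokens ignored.
    """
    return AVG_OUTPUT_TOKENS_K[baseline] / AVG_OUTPUT_TOKENS_K[model]


if __name__ == "__main__":
    for name, ratio in relative_usage("GPT 5.5 xhigh").items():
        print(f"{name}: {ratio:.2f}x")
    ratio = break_even_price_ratio("GLM 5.2", "GPT 5.5 xhigh")
    print(f"GLM 5.2 break-even price ratio: {ratio:.2f}")
```

Under these assumptions GLM 5.2 uses about 2.6x the output tokens of GPT 5.5 xhigh, so it is cheaper per request only if its per-token price is below roughly 38% of GPT 5.5's.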

bertili•1h ago

This is GLM 5.2 Max. GLM 5.2 High uses less than half[1] the tokens.

[1] https://z.ai/blog/glm-5.2

Tiberium•1h ago

Yes, but the Artificial Analysis result is also from GLM 5.2 (max), not high.

andai•50m ago

They have this with a lot of models, measuring only the max setting, while the one you'd actually want to use for most tasks is much lower.

epolanski•32m ago

For the brief period we had Fable, I never had to use it above medium.

Low nailed the overwhelming majority of mundane tasks on its own, medium was good for more complex stuff.

vorticalbox•46m ago

This is a problem I find with Opus: it will spend so long thinking, then going “but wait, what if”

To the point where I stop it and simply tell it to “start writing code, you can work it out as you go along”

Seems writer's block also affects LLMs

epolanski•33m ago

Fable was 20 times worse on that.

It's clear it was the vibe coding model: like no other model before, it fully turned you into its assistant instead of the other way around.

benjiro29•20m ago

GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.

If you want reasonable token usage, you need to run GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks). And it cuts token usage by 2 to 2.5x. GLM 5.2 Max is really something you only need for complex tasks.

In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price.

There has been really no training on Opus models going on, really, none, I tell you! /sarcasm

Havoc•1h ago

It’s pretty good. More talkative than 5.1. Reminds me of deepseek 4

Their servers are melting though - getting more timeouts etc

unrvl22•1h ago

Why aren't more people talking about this? It's literally Opus 4.7 quality at stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates 3x lower than the official ZAI API rates, which are already like 10x cheaper than Opus. (Crof and Umans btw)

This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.

unrvl22•1h ago

I cancelled my claude sub after realizing I can burn 300m tokens a day of this quality, for $50 a month.

Hamuko•1h ago

I’m not that interested in models that I can’t run on my desktop for ~0€, which is my AI budget.

igravious•1h ago

Cool beans. You're not the target audience then.

Hamuko•52m ago

Did I claim I was? I just said why I and people like me are not talking about it.

simianwords•38m ago

and he said it's cool

nh43215rgb•1h ago

> GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.

That is unfortunate...

CuriouslyC•1h ago

I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.

Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.

igravious•52m ago

After having got a taste of Fable 5, Opus 4.8 doesn't cut it for me any more -- and I don't know how to put this, I don't know if it's just me, but its rhetorical flourishes are starting to really grate on me, never mind that it is at times deliberately weasel-wordy and economical with the truth until pressed. Opus 4.8 is definitely a stronger coding agent than DeepSeek 4.0 or Kimi 2.7, succeeding where they flounder and fail, but its way of expressing itself conversationally is making me reconsider my subscription …

elwebmaster•39m ago

You are not alone. How about GPT 5.5? Does it come close to Fable 5?

fragmede•34m ago

5.5 is pretty good. It's no Fable though. It is definitely better than opus tho.

theplumber•25m ago

GPT 5.5 xhigh is smarter than Fable, but Fable, like Opus 4.8, is faster and seems more “agentic”. It’s easy to test this. Build a fairly complex piece of software with Claude (Opus or Fable).

Review the commits with both Claude and GPT 5.5 xhigh. You can see that Fable is still sloppier than GPT. You can test it the other way around as well (drive the dev with GPT and review with GPT and Claude). You get the same result. Claude has an edge though, and that’s in building more beautiful user interfaces.

kingstnap•1h ago

According to many benchmarks this model is straight up frontier level and Zai seriously cooked. Some of these numbers are incredible.

Excited to see if this turns out to be an open-weight Opus 4.5 or better.

andai•34m ago

The only benchmark that matters is your actual task.

I've had models that benched poorly but performed great. And I constantly see models at near the top of AA, which are terrible.

There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)

As far as they go, though, these harder benchmarks match my experience more closely:

https://deepswe.datacurve.ai/

and https://cognition.ai/blog/frontier-code

Where we see "top" models drop way down in score when given longer tasks.

That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)

By the time I'm done testing all the Chinese models, they'll be obsolete :)

davidwritesbugs•58m ago

I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be the same as lower Anthropic models, useful for lots of grunt work. The problem is that Zhipu really __really__ struggles with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model, though I see it is on OpenRouter so I may play. But the capacity issues mean DeepSeek is my main provider these days.

mohsen1•54m ago

I don't know if it is the harness or if the model is really not at the level those benchmarks are showing, because based on my own "feelings" after using it, it's not Opus 4.5 level. It can't figure things out in my project (https://tsz.dev), or maybe tsz is at a stage where things are getting too difficult even for frontier models to be productive. I had the most productive time the weekend Fable was available, and since then it's been pretty slow to make progress.

benjiro29•6m ago

Ah yes, the stealth advertisement post ...

tensegrist•50m ago

> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)

am i missing something?

xiaoyu2006•41m ago

Some models are heavily subsidized. Total params & active params are a better measurement of inference cost.

simianwords•39m ago

No models are subsidised -- there are lots of third party hosting services that will still run at breakeven/profit. (except Deepseek after discount)

OtherShrezzing•24m ago

I think they’ve just picked poor peer examples. Instead of choosing other models near 5.2 on the intelligence scale, they’ve picked some open models from further down the scale.
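For readers unsure what "on the Pareto frontier" means in the quoted passage, here is a minimal sketch. The costs per task are the ones quoted above. The scores are made-up placeholders, not Artificial Analysis figures: they simply assume, purely for illustration, that the pricier models score higher. "Model X" is a hypothetical entry added to show what a dominated model looks like.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost_per_task: float  # USD, from the quoted article text
    score: float          # PLACEHOLDER rank, not a real benchmark score


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep a model only if its score beats every model seen before it
    in a cheapest-first sweep (i.e. nothing cheaper is at least as good)."""
    frontier = []
    best_score = float("-inf")
    # On cost ties, visit the higher score first so it wins the tie.
    for m in sorted(models, key=lambda m: (m.cost_per_task, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


MODELS = [
    Model("GLM-5.2", 0.46, 5),
    Model("Kimi K2.6", 0.31, 4),
    Model("GLM-5.1", 0.25, 3),
    Model("MiniMax-M3", 0.18, 2),
    Model("DeepSeek V4 Pro", 0.05, 1),
    Model("Model X", 0.40, 2),  # hypothetical: pricier and weaker than GLM-5.1
]

if __name__ == "__main__":
    for m in pareto_frontier(MODELS):
        print(f"{m.name}: ${m.cost_per_task:.2f}, score {m.score}")
```

With these placeholder scores, the most expensive model is still on the frontier, because no cheaper model matches its score. Whether that holds for the real scores depends on data not given in this thread.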

rahidz•50m ago

Correct me if I'm wrong, but neither DeepSeek nor GLM have image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc. doesn't it? Or do they have alternate ways of doing so?

mordae•30m ago

They do not and it sucks for certain tasks.

It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.

adrian_b•30m ago

That's right, but there are other recent open weights and relatively big LLMs that are multimodal, e.g. MiniMax-M3.

With open weights LLMs, it is affordable to use many different models, each for whatever it is better.

Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.

dryarzeg•21m ago

Yes, you are right (as far as I'm aware). For things where you need the LLM to look at screenshots, photos or other images you can use Kimi-K2.6/K2.7 - comparable pricing, somewhat comparable performance and quality. You can even probably combine two models (e.g. Kimi and GLM) in one agent, using Kimi for multimodal inputs and GLM for everything else, although 1) I'm not sure if this will not cause some kind of context poisoning with low-quality patterns for the better performing model (e.g. in some cases Kimi may be worse than GLM, but GLM, when following up, may adopt the same reasoning patterns as Kimi, undermining its own performance), and 2) I'm not quite sure if it's possible with the tools currently available (I'm not really into agentic or chatbot stuff to be honest).
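The two approaches suggested in this sub-thread (routing image inputs to a multimodal model, or having a small vision model describe images as text for a text-only model) could be sketched roughly as follows. Everything here is hypothetical: the model identifiers are placeholders rather than real API names, and `describe_image` and `ask_text_model` stand in for whatever client calls you actually use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    text: str
    image_paths: list[str] = field(default_factory=list)


# PLACEHOLDER identifiers, not real API model names.
VISION_MODEL = "kimi-vision-placeholder"
TEXT_MODEL = "glm-text-placeholder"


def pick_model(message: Message) -> str:
    """Route messages with images to the multimodal model, the rest to the text model."""
    return VISION_MODEL if message.image_paths else TEXT_MODEL


def answer_via_descriptions(
    message: Message,
    describe_image: Callable[[str], str],   # hypothetical: small local vision model
    ask_text_model: Callable[[str], str],   # hypothetical: text-only LLM client
) -> str:
    """Turn each image into a text description, then send only text to the text model."""
    descriptions = [describe_image(path) for path in message.image_paths]
    prompt = "\n".join([*(f"[image: {d}]" for d in descriptions), message.text])
    return ask_text_model(prompt)
```

In the second function the text model only ever sees short descriptions, never the other model's reasoning. Whether that avoids the context-poisoning concern raised above is untested here.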

creamyhorror•43m ago

It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.

xiaoyu2006•42m ago

This open source model is quite near SOTA with only 700B/40B MoE. Truly efficient.

lousken•42m ago

Cerebras really needs to have this on their API list (if they even still exist).

Marciplan•37m ago

they went public a few weeks ago

lousken•6m ago

That's cool and all, but they are still on GLM 4.7

ramon156•39m ago

I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict itself and then not realize that one option is the definite option. Sometimes it's two statements that aren't even exclusive. Either way, a lot of tokens get wasted on this.

I haven't extensively used 5.2 yet, but it seems a lot better.

_pdp_•35m ago

I am helpful.

DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.

XCSme•26m ago

In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:

[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...

Pragmata•23m ago

So this basically means we will have a near opus level model able to be run locally in the next couple of months right?

QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs on the same hardware, right?

XCSme•8m ago

Which Opus?

GLM-5.2 is already close to Opus-4.7 level:

https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

XCSme•8m ago

Oh, or you meant a smaller model than GLM-5.2 with similar capabilities?

CubsFan1060•20m ago

Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.

re-thc•18m ago

> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

Years.

Even Microsoft said they don't have enough for Github and need to call Amazon.

Getting a few even at decent prices is hard. Unless the shortage eases...

petesergeant•15m ago

Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.

moffkalast•14m ago

So far there seems to be one major use-case for complete privacy, and that is legal work. You don't need top of the line models to search vast amounts of text in discovery and it needs to be completely confidential. There's quite a few lawyers over on r/localllama showing off their multi-GPU builds. Coincidentally they also have the vast funding required for it.

mrngld•4m ago

Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium, GLM5.1 xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd be a big gap to bridge.

https://artificialanalysis.ai/agents/coding-agents?coding-ag...

I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.

kissgyorgy•4m ago

I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.

Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.

The model might be good, but if the API is so bad, it's effectively useless.

[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...

