
GLM-5: From Vibe Coding to Agentic Engineering

https://z.ai/blog/glm-5
152•meetpateltech•1h ago

Comments

eugene3306•1h ago
why don't they publish at ARC-AGI ? too expensive?
Bolwin•1h ago
ARC-AGI was never a good benchmark; it tested spatial understanding more than reasoning. I'm glad it's no longer popular
falcor84•1h ago
What do you mean? It definitely tests reasoning as well, and if anything, I expect spatial and embodied reasoning to become more important in the coming years, as AI agents will be expected to take on more real world tasks.
eugene3306•56m ago
spatial or not, ARC-AGI is the only test that correlates with my impression from my coding requests
beAroundHere•1h ago
I'd say they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all.

I'm still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.

esafak•1h ago
I place GLM 4.7 behind Sonnet.
revolvingthrow•1h ago
Qwen and GLM both promise the stars in the sky every single release and the results are always firmly in the "whatever" range
esafak•1h ago
I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks led me to believe (esp. with regard to instruction following), but I'm willing to give it another try.
valvar•57m ago
Try Cerebras
w4yai•17m ago
Synthetic is a blessing when it comes to providing OSS models (including GLM); their team is responsive, and there's been no downtime or any issues for the last 6 months.

Full list of models provided: https://dev.synthetic.new/docs/api/models

Referral link if you're interested in trying it for free, plus a discount for the first month: https://synthetic.new/?referral=kwjqga9QYoUgpZV

jnd0•1h ago
Probably related: https://news.ycombinator.com/item?id=46974853
cmrdporcupine•56m ago
yes, plenty of good convo over there, the two should probably be merged
woah•1h ago
Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?
esafak•1h ago
Yes. https://z.ai/subscribe
leumon•51m ago
although apparently only the max subscription includes glm-5
su-m4tt•1h ago
dramatically cheaper.
algorithm314•1h ago
Here is the pricing per M tokens. https://docs.z.ai/guides/overview/pricing

Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?

There is also a GLM 5-code model.

logicprog•1h ago
I think it's likely more expensive because they have more activated parameters, which kind of outweighs the benefits of DSA?
l5870uoo9y•55m ago
It's roughly three times cheaper than GPT-5.2-codex, which in turn reflects the difference in energy cost between US and China.
re-thc•46m ago
It reflects the Nvidia tax overhead too.
anthonypasq•43m ago
1. electricity costs are at most 25% of inference costs so even if electricity is 3x cheaper in china that would only be a 16% cost reduction.

2. cost is only a singular input into price determination and we really have absolutely zero idea what the margins on inference even are so assuming the current pricing is actually connected to costs is suspect.
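The arithmetic in point 1 checks out; a quick sketch (the 25% electricity share and 3x price ratio are the commenter's assumptions, not verified figures):

```python
# Sanity check of the comment's arithmetic. Baseline inference cost = 1.0;
# only the electricity share of that cost changes between regions.
electricity_share = 0.25   # assumed fraction of inference cost that is electricity
price_ratio = 1 / 3        # assumed: Chinese electricity at 1/3 the US price

new_cost = (1 - electricity_share) + electricity_share * price_ratio
reduction = 1 - new_cost
print(f"overall cost reduction: {reduction:.1%}")  # ~16.7%
```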

pu_pe•1h ago
Really impressive benchmarks. It was commonly stated that open source models were lagging 6 months behind state of the art, but they are likely even closer now.
justinparus•1h ago
Been using GLM-4.7 for a couple weeks now. Anecdotally, it’s comparable to sonnet, but requires a little bit more instruction and clarity to get things right. For bigger complex changes I still use anthropic’s family, but for very concise and well defined smaller tasks the price of GLM-4.7 is hard to beat.
Aurornis•1h ago
The benchmarks are impressive, but it's comparing to last generation models (Opus 4.5 and GPT-5.2). The competitor models are new, but they would have easily had enough time to re-run the benchmarks and update the press release by now.

Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

cmrdporcupine•53m ago
I tried GLM 5 by API earlier this morning and was impressed.

Particularly for tool use.

throwup238•47m ago
> Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

Agreed. I think the problem is that while they can innovate at algorithms and training efficiency, the human part of RLHF just doesn't scale and they can't afford the massive amount of custom data created and purchased by the frontier labs.

IIRC it was the application of RLHF which solved a lot of the broken syntax generated by LLMs like unbalanced braces and I still see lots of these little problems in every open source model I try. I don't think I've seen broken syntax from the frontier models in over a year from Codex or Claude.

algorithm314•44m ago
Can't they just run the output through a compiler to get feedback? Syntax errors seem easier to get right.
rockinghigh•14m ago
They do. Pretty much all agentic models call linting, compiling and testing tools as part of their flow.
ej88•38m ago
the new meta is purchasing RL environments where models can be self-corrected (e.g. a compiler will error) after SFT + RLHF ran into diminishing returns. although there's still lots of demand for "real world" data for actually economically valuable tasks
yieldcrv•45m ago
come on guys, you were using Opus 4.5 literally a week ago and don't even like 4.6

something that is at parity with Opus 4.5 can ship everything you did in the last 8 weeks, ya know... when 4.5 came out

just remember to put all of this in perspective: most of the engineers and people here haven't even noticed any of this stuff, and if they have, they're too stubborn or policy-constrained to use it - and the open source nature of the GLM series helps the policy-constrained organizations, since they can theoretically run it internally or on-prem.

Aurornis•16m ago
> something that is at parity with Opus 4.5

You're assuming the conclusion

The previous GLM-4.7 was also supposed to be better than Sonnet and even match or beat Opus 4.5 in some benchmarks ( https://www.cerebras.ai/blog/glm-4-7 ) but in real world use it didn't perform at that level.

You can't read the benchmarks alone any more.

InsideOutSanta•33m ago
> it's comparing to last generation models (Opus 4.5 and GPT-5.2).

If it's anywhere close to those models, I couldn't possibly be happier. Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

Aurornis•17m ago
> Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

Before you get too excited, GLM-4.7 outperformed Opus 4.5 on some benchmarks too - https://www.cerebras.ai/blog/glm-4-7 See the LiveCodeBench comparison

The benchmarks of the open weights models are always more impressive than the performance. Everyone is competing for attention and market share so the incentives to benchmaxx are out of control.

InsideOutSanta•10m ago
Sure. My sole point is that calling Opus 4.5 and GPT-5.2 "last generation models" is discounting how good they are. In fact, in my experience, Opus 4.6 isn't much of an improvement over 4.5 for agentic coding.

I'm not immediately discounting Z.ai's claims because they showed with GLM-4.7 that they can do quite a lot with very little. And Kimi K2.5 is genuinely a great model, so it's possible for Chinese open-weight models to compete with proprietary high-end American models.

Aurornis•5m ago
I think there are two types of people in these conversations:

Those of us who just want to get work done don't care about comparisons to old models, we just want to know what's good right now. Issuing a press release comparing to old models when they had enough time to re-run the benchmarks and update the imagery is a calculated move where they hope readers won't notice.

There's another type of discussion where some just want to talk about how impressive it is that a model came close to some other model. I think that's interesting, too, but less so when the models are so big that I can't run them locally anyway. It's useful for making purchasing decisions for someone trying to keep token costs as low as possible, but for actual coding work I've never found it useful to use anything other than the best available hosted models at the time.

ionelaipatioaei•23m ago
I think the only advantage that closed models have is the tooling around them (Claude Code and Codex). At this point, if forced, I could totally live with open models only.
evv•16m ago
The tooling is totally replicated in open source. OpenCode and Letta are two notable examples, but there are surely more. I'm hacking on one in the evenings.

OpenCode in particular has huge community support around it - possibly more than Claude Code.

Daviey•1m ago
It's there now, `opencode models --refresh`
quikoa•1m ago
If tooling really is an advantage why isn't it possible to use the API with a subscription and save money?
dust42•18m ago
They are all just token generators without any intelligence. There is so little difference nowadays that I think in a blind test nobody will be able to differentiate the models - whether open source or closed source.

Today's meme was this question: "The car wash is only 50 meters from my house. I want to get my car washed, should I drive there or walk?"

Here is Claude's answer just right now:

"Walk! At only 50 meters (about 150 feet), it would take you less than a minute to walk there. Driving such a short distance would actually be less convenient - by the time you get in the car, start it, drive, and park, you could already be there on foot. Plus, you'd save fuel and your car would be right there waiting for you after the wash is done."

Here is ChatGPT, also right now:

"Since it’s only *50 meters* (about 160 feet), walking makes a lot more sense.

Here’s why:

* *It’s extremely close* — you’ll get there in under a minute.
* *Driving would actually be less efficient* — by the time you start the car, move it, and park, you’re already there.
* *Cold starts aren’t great for your engine* — very short drives don’t let the engine warm up properly.
* *It saves fuel and emissions*, even if it’s a tiny amount.
* You avoid the hassle of maneuvering such a short distance.

The only time driving might make sense is if:

* The weather is terrible (heavy rain, snow, extreme cold).
* You have mobility issues.
* The car wash requires you to stay in the vehicle and you don’t want to leave it there.

Otherwise, this is a perfect “walk it over” situation."

Please save us all that talk about frontier and SOTA, and that only the closedAI models are any good while the others are all so bad and benchmaxxed. For most purposes a Toyota is just as good as a BMW or a Merc or whatever luxury brand tickles your fancy. Even worse, the latest 80B Qwen Next is not far from Opus 4.6 but runs on my laptop.

haute_cuisine•16m ago
Doesn't seem to be the case, gpt 5.2 thinking replies: To get the car washed, the car has to be at the car wash — so unless you’re planning to push it like a shopping cart, you’ll need to drive it those 50 meters.
bonoboTP•15m ago
It's unclear where the car is currently from your phrasing. If you add that the car is in your garage, it says you'll need to drive to get the car into the wash.
king_phil•15m ago
Gemini 3 Pro:

This is a classic logistical puzzle!

Unless you have a very unique way of carrying your vehicle, you should definitely drive.

If you walk there, you'll arrive at the car wash, but your car will still be dirty back at your house. You need to take the car with you to get it washed.

Would you like me to check the weather forecast for $mytown to see if it's a good day for a car wash?

Scene_Cast2•15m ago
I just ran this with Gemini 3 Pro, Opus 4.6, and Grok 4 (the models I personally find the smartest for my work). All three answered correctly.
rockinghigh•12m ago
How is this riddle relevant to a coding model?
esafak•6m ago
It's not a coding model. Go to https://chat.z.ai/ and you'll see it is presented as a generalist.
Aurornis•10m ago
If you're asking simple riddles, you shouldn't be paying for SOTA frontier models with long context.

This is a silly test for the big coding models.

This is like saying "all calculators are the same, nobody needs a TI-89!" and then adding 1+2 on a pocket calculator to prove your point.

esafak•9m ago
This is a great comment. The frontier labs ought to be embarrassed they can't get this right. A human would not make the same mistake, after thinking for a second.
cherryteastain•4m ago
Gemini 3 Flash ("Fast" in the web app) did not have trouble with this:

Since you presumably want the car to be clean, the answer depends entirely on how the car is getting washed.

The Breakdown

If it’s a Professional Car Wash: You’ll need to drive. Bringing the car to the wash is generally the prerequisite for them washing it. Plus, walking 50 meters just to ask them to come pick up a car that is visible from their front door might result in some very confused looks from the staff.

If it’s a Self-Service Pressure Washer: Again, drive. Dragging a 1,500kg vehicle 50 meters by hand is an incredible workout, but probably not the "quick chore" you had in mind.

If you are hiring a Mobile Detailer: Walk. Stroll over there, tell them where you're parked, and walk back home to watch TV while they do the work in your driveway.

petetnt•1h ago
Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!
cmrdporcupine•54m ago
I find 5.3 very impressive TBH. Bigger jump than Opus 4.6.

But this here is excellent value, if they offer it as part of their subscription coding plan. Paying by token could really add up. I did about 20 minutes of work and it cost me $1.50USD, and it's more expensive than Kimi 2.5.

Still 1/10th the cost of Opus 4.5 or Opus 4.6 when paying by the token.

mnicky•53m ago
> I think GPT-5.3-Codex was a disappointment

Care to elaborate more?

meffmadd•1h ago
It will be tough to run on our 4x H200 node… I wish they stayed around the 350B range. MLA will reduce KV cache usage but I don’t think the reduction will be significant enough.
pcwelder•59m ago
It's live on openrouter now.

In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

For those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format. I ask it to do coding tasks using chat.md [1] + MCPs. And so far it's just not able to follow it at all.

[1] https://github.com/rusiaaman/chat.md
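A check like the one described above can be sketched as follows. Note the `<tool_call>{json}</tool_call>` syntax here is a made-up stand-in, not chat.md's actual format:

```python
# Sketch of a custom tool-call format check: does the model's output
# contain a well-formed call in the (hypothetical) expected syntax?
import json
import re

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def follows_format(output: str) -> bool:
    m = TOOL_CALL.search(output)
    if not m:
        return False
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return False
    return isinstance(call, dict) and "name" in call and "arguments" in call

ok = 'Sure. <tool_call>{"name": "read_file", "arguments": {"path": "a.py"}}</tool_call>'
bad = 'I will call read_file("a.py") now.'  # ignores the required format

print(follows_format(ok), follows_format(bad))  # True False
```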

manofmanysmiles•22m ago
I love the idea of chat.md.

I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.

I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.

nolist_policy•19m ago
Could also be the provider that is bad. Happens way too often on OpenRouter.
pcwelder•15m ago
I had added z-ai in allow list explicitly and verified that it's the one being used.
sergiotapia•2m ago
Be careful with openrouter. They routinely host quantized versions of models via their listed providers and the models just suck because of that. Use the original providers only.
ExpertAdvisor01•57m ago
They increased their prices substantially
woeirua•57m ago
It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.
syntaxing•44m ago
Tim Dettmers had an interesting take on this [1]. Fundamentally, the philosophy is different.

>China’s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI.

https://timdettmers.com/2025/12/10/why-agi-will-not-happen/

woeirua•32m ago
Sorry, but that's an exceptionally unimpressive article. The crux of his thesis is:

>The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns.

Literally everyone already knows the problems with scaling compute and data. This is not a deep insight. His assertion that we can't keep scaling GPUs is apparently not being taken seriously by _anyone_ else.

qprofyeh•16m ago
There are startups in this space getting funded as we speak: https://olix.com/blog/compute-manifesto
syntaxing•13m ago
Was more mentioning the article about the economic aspect of China vs US in terms of AI.

While I do understand your sentiment, it might be worth noting the author wrote bitsandbytes, which was one of the first libraries with quantization methods built in and was(?) one of the most used inference engines. I'm pretty sure transformers from HF still uses it as the Python-to-CUDA framework.

re-thc•31m ago
When you have export restrictions what do you expect them to say?

> They believe model capabilities do not matter as much as application.

Tell me their tone when their hardware can match up.

It doesn't matter because they can't make it matter (yet).

riku_iki•8m ago
maybe being in China gives them the advantage of electricity cost, which could be a big chunk of the bill..
karolist•54m ago
The number of times in the past year that competitors' benchmarks said something was close to Claude and it was actually remotely close in practice: 0
ionelaipatioaei•21m ago
I honestly feel like people are brainwashed by Anthropic propaganda when it comes to Claude. I think Codex is just way better, and Kimi 2.5 (and I think GLM 5 now) are perfectly fine as Claude replacements.
ChrisArchitect•49m ago
Earlier: https://news.ycombinator.com/item?id=46974853
nullbyte•40m ago
GLM 5 beats Kimi on SWE bench and Terminal bench. If it's anywhere near Kimi in price, this looks great.

Edit: Input tokens are twice as expensive. That might be a deal breaker.

westernzevon•23m ago
It seems to be much better at first pass tho. We'll see how real costs stack up
simonw•39m ago
Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...

Solid bird, not a great bicycle frame.

btown•25m ago
Thank you for continuing to maintain the only benchmarking system that matters!

Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/

_joel•17m ago
Now this is the test that matters, cheers Simon.
pwython•2m ago
How many pelican riding bicycle SVGs were there before this test existed? What if the training data is being polluted with all these wonky results...
surrTurr•28m ago
we're seeing so many LLM releases that they can't even keep their benchmark comparisons updated
cherryteastain•13m ago
What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips per reporting [1]. Hence we can conclude the Chinese semiconductor-to-model tech stack is only 3 months behind the US, considering Opus 4.5 released in November. (Excluding the lithography equipment here, as SMIC still uses older ASML DUV machines)

US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips as a direct result of past sanctions [2]. At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.

[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...

[2] https://www.reuters.com/world/china/chinas-customs-agents-to...

re-thc•5m ago
> What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips

Have any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained ones. No.

> US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips

Sort of. It's all a front. On both sides. China still ALWAYS had access to Nvidia chips - whether that's the "smuggled" ones or running them in another country. It's not costing Nvidia much. The opening of China sales for Nvidia likewise isn't as much of a boon. It's already priced in.

> At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China

Again, it's a front. It's about news and headlines. Just like when China banned lobsters from a certain country, the only thing that happened was that they went to Hong Kong or elsewhere, got rebadged and still went in.

mohas•12m ago
I kinda feel this benchmarking thing with Chinese models is like university Olympiads: they specifically study for those, but when the time comes for real-world work they seriously lag behind.
OsrsNeedsf2P•10m ago
I kinda feel like the goalposts are shifting. While we're not there yet, in a world where Chinese models surpass Western ones, HN will be nitpicking edge cases long after the ship sails
Oras•2m ago
I don’t mean to undermine the effort and improvement, but the usability of these models usually isn’t what their benchmarks suggest.

Last time there was a hype about GLM coding model, I tested it with some coding tasks and it wasn’t usable when comparing with Sonnet or GPT-5

I hope this one is different

goldenarm•12m ago
If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools:

Claude Opus 4.6: 65.5%

GLM-5: 62.6%

GPT-5.2: 60.3%

Gemini 3 Pro: 59.1%
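For reference, combining two benchmark scores this way is just the square root of their product; the inputs below are illustrative placeholders, not the actual SWE-bench/HLE numbers:

```python
# Geometric mean of two benchmark percentages, as used for the
# combined scores above. Input scores here are hypothetical.
from math import sqrt

def geometric_mean(a: float, b: float) -> float:
    return sqrt(a * b)

# e.g. a model scoring 80% on one benchmark and 49% on the other:
print(f"{geometric_mean(0.80, 0.49):.1%}")  # 62.6%
```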
