GLM 5.2 vs. Opus

https://techstackups.com/comparisons/glm-5.2-vs-opus/
100•ritzaco•1h ago

Comments

meander_water•1h ago
> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.

Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, for example, making up test results?) and steerability (does it obey my instructions, or does it just do what it thinks is best?).

ritzaco•1h ago
sure that's why we look at a mix of formal benchmarks, one longer analysis of a side-by-side, and various other people who we trust to form an opinion, all covered in the article - not intended to be a formal benchmark, there are enough of those.
patates•1h ago
Then maybe you should add that caveat emptor to the article?

You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader.

jameswhitford•1h ago
Hi, I am the author, I completely agree! I set out to run a vibe test on this one, not a benchmark, the real benchmarks are listed. My test shows what the models can do when both tasked with a long-running, technically difficult, one-shot task.

I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.

meander_water•1h ago
Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt.

Appreciate you sharing the results of your tests though!

wongarsu•1h ago
Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard
esperent•1h ago
On the other hand, I did just leave my pi agent running GPT 5.5 overnight on a clearly defined, long running task. It's been running about 10 hours now and it's mostly done. So this kind of use case is also valid.

Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.

unliftedq•33m ago
Totally agree, a single one-shot prompt can't prove anything.
greyman•1h ago
>On output tokens, GLM-5.2 is less than a fifth the price of Opus.

Opus is the most expensive model on pay-as-you-go pricing, but IMO a fair comparison should include subscription prices as well. For example when one has $100 Claude Max and use it up through the month, it might not be more expensive than GLM, or at least not 5x.

lithiumii•1h ago
GLM has subscription plans too.
linzhangrun•1h ago
Out of stock, unavailable
jameswhitford•1h ago
Yes, this is true. This test was run on a $20 Claude Pro subscription. I would definitely love to try using both models on the highest plans for a whole month and compare the two; it's a great format for a future head-to-head comparison.
buster•1h ago
Is it fair when the one is heavily subsidized and the other one is not?

I think it's most fair to compare the plain token pricing that is used by everyone.

esperent•1h ago
> Is it fair when the one is heavily subsidized

As a consumer, yes, it's totally fair. All that matters to me is the price I pay at the pump, not whether that price is "real" or not.

xlii•1h ago
I've been checking out GLM 5.2 on some projects and have a few thoughts on it:

- it takes its sweet time to get code rolling, not the fastest model by any means

- it strays a lot during discovery/planning but then corrects

- it's not steering friendly, as it hallucinates things that it doesn't follow later on

- its output is quite good

A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.

GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which frustrated me, so I blocked non-editing tool access and went AFK. After approx. 30 minutes I found that it had used the already-made benchmarks and some "conclusions" to optimize 3 choke points. The output pointed out that it couldn't validate its suspicions and asked for more data.

The implementation worked well; it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5's efforts on the same repo.

I would opt in to using it more, BUT GPT usually completes the same requests 5x faster.

GLM 5.2 was the spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be run in parallel).

Imanari•56m ago
This mirrors my experience. I have been using it in Pi. It is smart and output is good but it is not efficient in getting there.
ju-st•40m ago
which thinking level? max or high?
Oras•26m ago
Also pricing. I wanted to give it a try, but when pricing is only 30% cheaper than Opus, I wouldn't go for it with these issues.
jkwang•1h ago
GLM-5.2 is quietly becoming the most interesting open model release this year. The coding benchmarks are surprisingly close to frontier models at a fraction of the inference cost.
epolanski•49m ago
To me DS 4 is still the most interesting due to much lower costs. Also DS 4 training isn't done yet.

From my Opus vs DS 4 Pro personal benchmarks, 16 different real-life work tasks, DS 4 has performed as well as Opus 4.8 high overall but with a few drawbacks:

- on the 16 tasks, one needed several prompts to be steered back into the topic

- its review capabilities seem much worse

- DS4 had the cleanly better solution in 3 cases out of 16, with Opus "only" doing cleanly better 2 times out of 16. But still, I want to emphasize, it's the worst-case scenarios that imho matter the most, not the best ones, and on that front Opus outperformed.

That being said, I spent less than $2 on the API over 4 days of work, which is more or less what I would've spent with Anthropic APIs for less than one task.

em500•39m ago
We've had the great small Qwen 3.6 in early April that many could actually run on their laptop. Then similar from Google a few weeks later (Gemma4, better in prose, worse in code). Then the super cheap large Deepseek V4. Then antirez's DS4 build that made that actually runnable on MacBooks and Mac Studios. And now the "near-frontier / near-Opus" GLM 5.2.

For people who follow open LLMs, none of these were quiet and all were the most interesting open model release for a few days/weeks. In one or two months, it will be some other model again. Now I do appreciate the real rapid improvements in open models. But there's also a ton of hype and fast-fashion around all of this.

joshrw•1h ago
Chinese models optimize for benchmarks and do poorly in real-world tasks
epolanski•47m ago
Not my experience at all. I have written about comparing DS4 vs Opus 4.8 on 16 real-life work tasks in multiple posts.

Also, every single lab does RL on benchmarks, which is why Opus 4.6 was the last truly great assistant, after it, all models tend to drift into implementation asap.

IronWolve•1h ago
Having issues with coding a render for good looking realistic smoke coming off burning incense, opus 4.8 & gpt-5.5 both have code issues, glm-5.2 did it. Amazing.

The real time 3d fluid dynamics appear to be the tricky part, I wish I still had opus access, would love to see if it can do it.

cultofmetatron•1h ago
I seriously don't get all this big hullabaloo about one-shot prompting.

By definition, a single prompt won't capture the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.

I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift until its objectives are complete.

I'd also like to see it identify bugs and potential performance increases by examining existing code and suggesting refactors based on context it can pick up about the particular use case you are trying to create.

These are way more valuable metrics than "hey build X"

epolanski•54m ago
Yet this is how virtually everybody is benchmarking and fine tuning.

Since Opus 4.6 I've seen later Anthropic models being more and more capable on one hand, but also less useful on multi turn open tasks.

It feels like with each model they are more and more prone to go "their own way" and jump into the implementation as soon as they can.

I can't but blame it on benchmarks and fine tuning around prompt-to-solution work.

halyconWays•30m ago
"We did multi-shot prompting to try and get these two games into comparable states using these two different models."

"Well obviously you provided better follow-up prompts to the one that came out better."

Also, nothing about human-provided plan files and guardrails precludes the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response?

linzhangrun•1h ago
Just that their Coding Plan is too hard to get. I've been trying to grab it for a week and still can't get it
Aozora7•1h ago
I used GLM 5.0/5.1/5.2 for some projects, and for me, the area in which they lag behind frontier models the most is user interfaces. They get really close to Opus when it comes to pure algorithms, but when I need something like a web application or a mobile app that looks and works well, they are very noticeably worse than even Sonnet.
david_shi•57m ago
> GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.

Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.

speedgoose•52m ago
I think a bunch of real humans started to adopt the LLMs writing style.
himata4113•34m ago
Yep, as I reread my own sentences I notice these LLMisms and have to rewrite them quite often. Reading so much LLM output definitely impacts your writing style.
leumon•55m ago
I've seen GLM 5.2 struggle writing simple compilable C code. It might be good at web, but its world knowledge is limited due to the small model size, making its use quite limited in my opinion.
speedgoose•51m ago
While this is interesting, one single sample with different coding harnesses is not very scientific.
ulrikrasmussen•50m ago
> Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.

I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?

jack_pp•48m ago
This framing of local LLMs as free is stupid. Basically paying 100+ months' worth of API costs up front isn't free in the slightest. And it will be slower than non-local, your hardware will be outdated in 12 months, and it probably won't be able to run SOTA at anywhere near non-local speed in 20 months at most.
ulrikrasmussen•42m ago
Yeah, it glosses over a gigantic capital expenditure. It's sort of like saying that an open source modern CPU architecture allows you to build your own CPU "for free" (provided that you own and operate a fab).
cicko•25m ago
True. But there are other meanings of "free". I.e. nobody can say "from now on you no longer have access to model X because you're an asshole"
crimsoneer•46m ago
Practically nobody.
bestouff•41m ago
The price of a small house.
zkmon•50m ago
Cost difference matters most as cost optimization is the whole point of AI. Time difference (30 min vs 1 hr) is not a deal-breaker. The small precision gap on the first iteration does not matter for 99% of the work that happens in real world.
TurdF3rguson•47m ago
Pretty clearly it's beating Opus at [web dev](https://www.gptbased.com/) - on price, on score.. I mean what else is there?
trick-or-treat•36m ago
Latency? Just saying there's other things to consider.
jofzar•32m ago
I hate to be that guy, but real privacy policy on training data/it being hosted somewhere where I'm not worried about secrets being stored/leaked.
HPsquared•17m ago
Open weights win on that front surely?
Havoc•4m ago
Realistically you’d need to rotate secrets anyway once it moves from dev to production regardless of model provider
msejas•43m ago
Seeing the results, I don't see how they are even comparable. Opus is clearly far superior in most aspects: smoothness, design, functionality, etc.

At the end of the day, the time saved is more important than the cost for big players.

The ability to spawn 10 Claude agents and rush a project to outcompete someone is more important for big businesses imo. Also, the small details that GLM missed would take significantly more time to iron out, considering it already took double the time.

I do hope other (open weight) models catch up, but to act like they are anywhere close for me is a bit disingenuous.

close2•42m ago
I wonder how many tokens and how much time were used for the verifying part. Maybe GLM 5.2 instantly found the "solution" to read the screen pixel by pixel, but it could also have been a major token and time consumer.
trick-or-treat•32m ago
I could be wrong but I believe this is a non-vision model. Please weigh in to correct me bc I would love to be wrong
jofzar•34m ago
Great article,

My only, I guess feedback, is that it's not really clear about the price.

Would the 21.92 be the API pricing I guess?

Cost $5.39 (real billed) ~$21.92 (estimate, list pricing)

postatic•33m ago
I've signed up with Ollama to experiment with these open source models. For the past 3 months, it's just been experimenting, trying it out. GLM is the first model that I am using on a daily basis to do my coding work (as well as using Claude). It's good - I've been maxing out my Ollama usage limits everyday :)
_pdp_•27m ago
In the name of science we crafted an autonomous AI agent that builds games on a loop. It is based on GLM 5.2.

I am not sure where this is going to lead us but it is fun to watch.

sourcecodeplz•25m ago
What is this fashion of testing models by giving them one shot projects? Especially games. this is so stupid
wejick•19m ago
Totally agree with the general assessment. The biggest problem with Z.ai's models for a long time has not been quality, but inference speed and general capacity availability. Hopefully with this recent hype, there will be more providers on OpenRouter for 5.2.
usef-•41m ago
Z.ai is also believed to be "subsidised". Its parent company is running at a massive loss.

And Anthropic have claimed they expect their first profitable quarter this year. They may have bigger margins on their raw API than you realise.

Aozora7•1h ago
There is, for example, OpenCode Go subscription, which for $10 a month gives you a decently generous quota of GLM-5.2, among other models.

And z.ai themselves also have subscriptions.

sourcecodeplz•9m ago
to be exact, it gives you USD 60 of usage of open models.
KronisLV•8m ago
> For example when one has $100 Claude Max and use it up through the month, it might not be more expensive than GLM, or at least not 5x.

https://z.ai/subscribe

I’m currently trying to figure out whether a downgrade from Max 5x to Pro in combination with one of those would save me money and if so, how much.

pegasus•4m ago
Sure, real-world usage is always more difficult to benchmark, but the additional issue with the one-shot prompting benchmark is that by optimizing for it, models are nudged towards making all those assumptions they shouldn't really make. Maybe a better test would be to have a fully spec'd-out plan, but start with a one-shot, high-level prompt and expect the agent to discover your preferences by repeatedly asking for clarifications. The system that manages to suss out more of the details in the hidden spec this way, in fewer steps and with fewer unnecessary questions, would more likely be a truly well-calibrated agent.
jaapz•12m ago
When the model produces reasonable results from one prompt, you could assume that it will also return reasonable results through the follow up prompts.
LoganDark•5m ago
The thing with one-shot prompting is that it tests the ability for the model to make good choices on its own, rather than only instruction following.

Instruction following has been down for years, and while there are of course metrics that continue to improve as the frontier advances (for example, the ability to continue following the original instructions even as context grows), you can't really get that much better at performing a list of instructions as-written if the instructions are precise enough that there's no wiggle room for interpretation (which seems to be what you are describing).

For example, one of the things that got me the most excited for Fable 5 was its ability to work for over eight hours straight on a single instruction and seemingly faithfully the entire time. That was something I observed personally after trying out the same workflow that runs for maybe two or three hours with Opus and then still needs followups. Fable needed no followups. That's a game changer for me compared to the prior state of the art.

That kind of stuff is going to end up being the most beneficial to people who are touching the edges of their knowledge or even exploring completely new areas. And that type of work is exactly the kind of work that makes agentic coding so powerful, even as much as it gets harder to judge the quality of the work when you lack the skills yourself. It's a good thing that the quality increases across the board, even for skilled practitioners.

For example, even people who know how to write inference engines or how matmul kernels work or how to optimize model architecture can't always predict just the sheer breadth of things agents can try to improve performance, and sometimes you get over some wall and reach a completely different optimum that you just wouldn't have reached in any reasonable amount of time by applying traditional knowledge even if you're an expert in the field.

That kind of stuff is amazing. And that's exactly the kind of stuff that one-shot prompting is testing for. It's kind of like testing for the model's "innovation", as much of an oxymoron as that is.

41m ago
The price of a small house.
