GLM-5.1: Towards Long-Horizon Tasks

129•zixuanlimit•1h ago

Comments

dang•1h ago

[stub for offtopicness]

[[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]

smith7018•1h ago

Hmm, three spam comments posted within 9 minutes of each other. The accounts were created 15 minutes ago, 51 days ago, and 3 months ago.

Interesting.

Hopefully these aren't bots created by Z.AI because GLM doesn't need fake engagement.

dang•1h ago

These comments are probably either by friends of the OP or perhaps associated with the project somehow, which is against HN's rules but not the kind of attack we're mostly concerned with these days. Old-fashioned voting rings and booster comments aren't existential threats and actually bring up somewhat nostalgic feelings at the moment!

Thanks for watching out for the quality of HN...

ray__•20m ago

Would love to read a Tell HN post about the kinds of attacks you are concerned with!

tadfisher•53m ago

I moderate a medium-sized development subreddit. The sheer volume of spam advertising some AI SaaS company has skyrocketed over the past few months, like 10000%. Comment spam is now a service you can purchase [0][1], and I would not be surprised if Z.ai engaged some marketing firm which ended up purchasing this service.

There are YC members in the current batch who are spamming us right now [2]. They are all obvious engagement-bait questions which are conveniently answered with references to the SaaS.

[0]: https://www.reddit.com/r/DoneDirtCheap/comments/1n5gubz/get_...

[1]: https://www.reddit.com/r/AIJobs/comments/1oxjfjs/hiring_paid...

[2]: https://www.reddit.com/r/androiddev/comments/1sdyijs/no_code...

greenavocado•48m ago

Z.ai Discord is filled to the brim with people experiencing capacity issues. I had to cancel my subscription with Z.ai because the service was totally unusable. Their Discord is a graveyard of failures. I switched to Alibaba Cloud for GLM but now they hiked their coding plan to $50 a month which is 2.5x more expensive than ChatGPT Plus. Totally insane.

bigyabai•1h ago

It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts. When you crest 128k tokens, there's a high chance that the model will start spouting gibberish until you compact the history.

For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.

embedding-shape•1h ago

> It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts

Since the entire purpose, focus and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue make it not an OK model? It's bad at the thing it's supposed to be good at, no?

wolttam•54m ago

long(er) contexts (than the previous model)

It does devolve into gibberish at long context (~120k+ tokens by my estimation but I haven't properly measured), but this is still by far the best bang-for-buck value model I have used for coding.

It's a fine model

verdverm•35m ago

Have you tried gemma4?

I'm curious how the bang for buck ratio works in comparison. My initial tests for coding tasks have been positive and I can run it at home. Bigger models I assume are still better on harder tasks.

whimblepop•45m ago

That's pretty few, at least for the way I'm currently using LLMs. I have them do some Nix work (both debugging and coding) where accuracy and quality matter to me, so they're instructed to behave as I would when it comes to docs, always consulting certain docs and source code in a specific order. It's not unusual for them to chew through 200k - 600k tokens in a single session before they solve everything I want them to. That's what I currently think of when I think of "long horizon within a single context window".

So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.

azuanrb•43m ago

Have you compared it with using Claude Code as the harness? It performs much better than opencode for me.

jauntywundrkind•41m ago

Chiming in to second this issue. It is wildly frustrating.

I suspect that this isn't the model, but something that z.ai is doing with hosting it. At launch I was elated to find glm-5.1 was stable even as the context window filled all the way up (~200k). Whereas glm-5, while it could still talk and think, had forgotten the finer points of tool use to the point where it was making grievous errors as it went (burning gobs of tokens to fix duplicate code problems).

However, real brutal changes happened sometime in the last two or three months: the parent problem emerged and emerged hard, out of nowhere. Worse, for me, it seemed to be around 60k context windows, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless. That I could only work on small problems.

Thankfully the coherency barrier rose significantly around three weeks ago. It now seems to lose its mind and emit chaotic non-sentence gibberish around 100k for me. GLM-5 was already getting pretty shaky at this point, so I feel like I at least have some kind of parity. But at least glm-5 was speaking & thinking with real sentences, I could keep conversing with it somewhat, whereas glm-5.1 seems to go from perfectly level headed working fine to all of a sudden just total breakdown, hard switch, at such a predictable context window size.

It seems so so probable to me that this isn't the model that's making this happen: it's the hosting. There's some KV cache issue, or they are trying to expand the context window in some way, or to switch from one serving pool of small context to a big context serving pool, or something infrastructure wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope, but also, misery.

I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see. Especially having tasted a working glm-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation to not reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...

All such a shame because aside from totally going mad & speaking unpunctuated gibberish, glm-5.1 is clearly very very good and I trust it enormously.

cassianoleal•9m ago

I've done some very long sessions on OpenCode with Dynamic Context Pruning. Highly recommend it.

https://github.com/Opencode-DCP/opencode-dynamic-context-pru...

Yukonv•59m ago

Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run even with high end hardware.

[0] https://huggingface.co/unsloth/GLM-5.1-GGUF

RickHull•48m ago

I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now. Obvious quantization issues, going in circles, flipping from X to !X, injecting Chinese characters. It is useless now for any serious coding work.

kay_o•41m ago

I am on the mid tier Coding plan to try it out for the sake of curiosity.

During off-peak hours a simple 3 line CSS change took over 50 minutes, and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files.

unicornfinder•27m ago

I'm on their pro plan and I respectfully disagree - it's genuinely excellent with GLM 5.1 so long as you remember to /compact once it hits around 100k tokens. At that point it's pretty much broken and entirely unusable, but if you keep context under about 100k it's genuinely on par with Opus for me, and in some ways it's arguably better.

margorczynski•24m ago

It has been useless for a long time when compared to Opus or even something like Kimi. The saving grace was that it was dirt cheap but that doesn't matter if it can't do what I want even after many repeated tries and trying to push it to a correct solution.

satvikpendem•24m ago

Every model seems that way, going back to even GPT 3 and 4, the company comes out with a very impressive model that then regresses over a few months as the company tries to rein in inference costs through quantization and other methods.

wolttam•23m ago

This is surprising to me. Maybe because I'm on Pro, and not Lite. I signed up last week and managed to get a ton of good work done with 5.1. I think I did run into the odd quantization quirk, but overall: $30 well spent

Mashimo•23m ago

I'm also on the lite plan and have been using 5.1 for a few days now. It works fine for me.

But it's all casual side projects.

Edit: I often do /compact at around 100,000 tokens or switch to a new session. Maybe that is why.

benterix•10m ago

> Obvious quantization issues

Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?

alex7o•40m ago

To be honest I am a bit sad, as glm5.1 is producing much better TypeScript than opus or codex imo, but no matter what, it does sometimes go into schizo mode at some point over longer contexts. Not always tho, I have had multiple sessions go over 200k and be fine.

MegagramEnjoyer•10m ago

Why is that sad? A free and open source model outperforming their closed source counterparts is always a win for the users

kirby88•7m ago

I wonder how that compares to harness methods like MAKER https://www.cognizant.com/us/en/ai-lab/blog/maker
