GLM-5.2: The Most Powerful Open Model yet and the Brutal Reality of Running It

https://vettedconsumer.com/glm-5-2-the-most-powerful-open-weight-model-yet-and-the-brutal-reality-of-running-it-locally/

37•ermantrout•2h ago

Comments

walrus01•1h ago

Before people go and drop a gargantuan sum of money on a server capable of running it entirely in GPU, there's still a fair amount of used x86-64 servers capable of running it in CPU and RAM (using llama-server) for probably under $6000. For example a Dell R640 with two older Xeon 18-core CPUs and 1TB of RAM. Test it out at a slow token/sec rate and see if it fits your needs.

Same idea for Kimi.

qingcharles•1h ago

Agreed. There are some crazy good deals on these older servers. For me, the inference speed would be fine as I'd just get on with a million other tasks between each response.

sgc•1h ago

To check whether I understand how this all works: Wouldn't a 4-bit quant run reasonably well (for that hardware) with far less RAM, something like 1.5x the 476 GB, or 714 GB+ of RAM?

walrus01•53m ago

Yes, but the price difference between buying a used x86-64 server with 512GB and 1024GB isn't that great, and if you're already determined to buy the hardware to run a "large" model on CPU (e.g. not Qwen 3.6 35B-A3B, gemma4 or similar size), the loss of quality and sometimes suspicious nature of the output from a 4-bit quant might be undesirable vs running a Q8 quant or full precision.

You would also want a lot of RAM for context/KV cache to make it usable, so having just enough RAM to fit a Q4 model and run it (before any cache starts getting populated through active use) isn't enough.
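
A back-of-the-envelope version of the sizing argument above, as a minimal Python sketch. The 476 GB weight size and the 1.5x headroom factor are the figures quoted in this thread; treating the headroom as one flat multiplier that covers KV cache, context, and OS overhead is a simplifying assumption, and the function names are made up for illustration.

```python
# Rough RAM budget for CPU inference: weight size times a headroom factor.
# Both numbers come from the thread; the flat multiplier is an assumption
# standing in for KV cache, context buffers and OS overhead.

Q4_WEIGHTS_GB = 476      # quant size quoted in the thread
HEADROOM_FACTOR = 1.5    # rule of thumb quoted in the thread


def ram_needed_gb(weights_gb: float, headroom: float = HEADROOM_FACTOR) -> float:
    """Return the RAM budget in GB for a model of the given weight size."""
    return weights_gb * headroom


def fits(weights_gb: float, installed_gb: float,
         headroom: float = HEADROOM_FACTOR) -> bool:
    """True if the installed RAM covers the weights plus headroom."""
    return ram_needed_gb(weights_gb, headroom) <= installed_gb


if __name__ == "__main__":
    for installed in (512, 1024):
        verdict = "fits" if fits(Q4_WEIGHTS_GB, installed) else "too small"
        print(f"{installed} GB server: need {ram_needed_gb(Q4_WEIGHTS_GB):.0f} GB -> {verdict}")
```

Under these assumptions a 512 GB box holds the weights but not the headroom, which is the point made above about buying the 1024 GB configuration instead.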

tfirst•1h ago

If model performance continues to scale with model size, I have a hard time seeing how local models will have any chance of competing with models hosted on datacenter hardware.

1. There are strong economies of scale in hosting inference (batched prompts, high uptime, shared infrastructure).

2. There are physical limits on how much memory we will be able to produce over the next few years. Demand will probably scale at least as fast as production does, so we won't be saved by falling prices.

dabinat•1h ago

Cloud models will always be ahead, but not every task needs Fable-level intelligence. The number of usable situations for local models will increase as hardware and open-weight models improve.

walrus01•1h ago

There's value for many people and organizations in running a model locally on hardware they fully own and control (or pay to colocate in a datacenter somewhere) vs running a model on something owned/controlled by any third party, for example in highly privacy-sensitive or medical applications. It's not just a question of raw efficiency in dollars per token per day or tokens/second.

butvacuum•1h ago

For your first point: you've just repeated "shared tenant", a scaling factor that's been used since before the turn of the millennium. Uptime is, as always, an irrelevancy for personal/homelab vs cloud. It shifts from uptime to pure financial (capex first, then how you account for "wasted" time).

2) The current memory crunch is more political than cyclical. The only reason we have fabs as far into construction as we do is the CHIPS Act, which predates LLMs' public existence by more than 6 months. The horrific silicon prices are a direct result of OpenAI's openly illegal dealings. Their pretense of needing it for Stargate gets sundered further with each missed or cancelled deadline.

They predicted the political and regulatory outcome superbly.

kristianp•1h ago

Irritating LLMisms:

    - "real architecture trick"
    - "the honest hardware reality of running it at home."
    - "What it is — and what Z.ai claims"
    - "The one genuinely new idea"

And many more.

LeoPanthera•1h ago

I've been using "the one genuinely adjective noun" for years as a weird English tic, and it bothers me that it's become an LLM tell.

butvacuum•1h ago

That's because most of the "tells" expose more about the "reader" than the content.

CorpOverreach•1h ago

Yep. The entire thing. Instant turn-off when reading an article.

I'm sure the content does have some value, and perhaps someone spent time putting together an original copy that they thought was going to be made better by having AI "make it better".

Actually, I take some of that back - most of the site seems to be AI written, following the formula of "ingest multiple sources" => feed to AI => write article.

KaoruAoiShiho•1h ago

Terrible, zero-value article. I am extremely surprised it is upvoted.

That being said, Artificial Analysis just came out with a brand new benchmark where it scored between opus 4.8 and gpt-5.5 and well behind fable-5, so it's definitely frontier-ish: https://x.com/ArtificialAnlys/status/2067744637155226101

CorpOverreach•1h ago

I do think it's going to get harder and harder to run bleeding-edge models; this is just the start of it.

It being hard for the average joe to run these at their fullest potential is unfortunate, but the important part is that _you can_, assuming you can acquire the resources.

I think that's going to be important for the sake of preserving privacy and freedom of information in the long run. We're seeing this play out right now: Anthropic originally played the "safety" card for why they can't let everyone at Mythos, and subsequently got on the US Gov't radar, with access to Fable being pulled.

The next biggest milestone will be an open-weights challenger to Mythos. There'll be consequences to that, but I feel those are less bad than someone else deciding what you can and can't use a model for.

lamida•1h ago

Pretty sure the article is fully written by LLM without editing at all. See all the — emdash sprinkled all over.

blackoil•1h ago

I think people overrate the 'local' part of open models vs private ones. With OpenAI I have exactly one choice: I have to use them, even if they decide to double the cost or work with the govt to blow up my country. My $5 server can't run GLM, but I have a choice of many providers based on my requirements for cost, data residency, and political alignment.

easygenes•1h ago

Article reads as though written by someone who doesn't have much experience with deployments like this. Underestimates the memory needed to run with a reasonable amount of context. Misses two other obvious targets:

  1) 4x DGX Spark (or equivalent other GB10 boxes) with a switch (MikroTik CRS504 or CRS804) and TP=4.
  2) 4x RTX PRO 6000 box. Probably the most practical for cost/perf if you want on-prem as an individual.

Both would be best to run a 2-bit quant so everything can stay resident (article claims you could run a 4-bit quant with 4x RTX 6000 Ada, and while technically true it would mean a lot of the weights are streaming from DRAM, so it would be slow and impractical. You would need 8x RTX PRO 6000 to run 4 bit at a good speed).

This model quantizes unusually well: https://unsloth.ai/docs/models/glm-5.2#quantization-analysis
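
The residency argument above can be checked with a short sketch. Assumptions, none of which come from the article: 476 GB (a figure quoted elsewhere in this thread) is taken as the 4-bit weight size, a 2-bit quant is treated as exactly half of that (real dynamic quants do not scale this cleanly), per-device memory is 48 GB for RTX 6000 Ada, 96 GB for RTX PRO 6000 and 128 GB for DGX Spark, and 20% of memory is reserved for KV cache and runtime. The `Rig` class and helper names are hypothetical.

```python
from dataclasses import dataclass

Q4_WEIGHTS_GB = 476  # assumed 4-bit size, taken from a figure quoted in the thread


@dataclass(frozen=True)
class Rig:
    """A multi-device box described by device count and memory per device."""
    name: str
    devices: int
    gb_per_device: int

    @property
    def total_gb(self) -> int:
        return self.devices * self.gb_per_device


def quant_size_gb(bits: int, size_at_4bit_gb: float = Q4_WEIGHTS_GB) -> float:
    """Scale the 4-bit size linearly with bit width (a simplification)."""
    return size_at_4bit_gb * bits / 4


def stays_resident(rig: Rig, weights_gb: float, reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit in device memory after reserving room for KV cache."""
    return weights_gb <= rig.total_gb * (1 - reserve_fraction)


RIGS = [
    Rig("4x RTX 6000 Ada", 4, 48),
    Rig("4x RTX PRO 6000", 4, 96),
    Rig("8x RTX PRO 6000", 8, 96),
    Rig("4x DGX Spark", 4, 128),
]

if __name__ == "__main__":
    for rig in RIGS:
        for bits in (2, 4):
            ok = stays_resident(rig, quant_size_gb(bits))
            print(f"{rig.name:16} {bits}-bit ({quant_size_gb(bits):.0f} GB): "
                  f"{'resident' if ok else 'spills to DRAM'}")
```

With these numbers only the 8x RTX PRO 6000 box keeps a 4-bit quant fully resident, while the 4x RTX PRO 6000 and 4x DGX Spark setups need the 2-bit quant, which lines up with the comment above.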

redox99•46m ago

Can you really say you're running GLM 5.2 if it's a 2-bit quant? It might be usable but the capabilities will definitely not be the same.

ma2kx•29m ago

That's just stupid.

- Why should I run it on local hardware when there are already about a dozen US providers available?

- To compare the token usage per task with GLM 5.1 is worthless when GLM 5.1 is unable to do the task.

- Not even z.ai itself runs the model with BF16 weights.

- I couldn't care less how good the model is at drawing a pelican on a bicycle.
