GLM-5.2: The Most Powerful Open Model yet and the Brutal Reality of Running It

https://vettedconsumer.com/glm-5-2-the-most-powerful-open-weight-model-yet-and-the-brutal-reality-of-running-it-locally/

37•ermantrout•1h ago

Comments

walrus01•1h ago

Before people go and drop a gargantuan sum of money on a server capable of running it entirely in GPU, there's still a fair amount of used x86-64 servers capable of running it in CPU and RAM (using llama-server) for probably under $6000. For example a Dell R640 with two older Xeon 18-core CPUs and 1TB of RAM. Test it out at a slow token/sec rate and see if it fits your needs.

Same idea for Kimi.

qingcharles•50m ago

Agreed. There are some crazy good deals on these older servers. For me, the inference speed would be fine as I'd just get on with a million other tasks between each response.

sgc•36m ago

To check whether I understand how this all works: Wouldn't a 4 bit quant run reasonably well (for that hardware) with far less ram, something like 1.5x the 476gb, or 714gb+ ram?

walrus01•26m ago

Yes, but the price difference between buying a used x86-64 server with 512GB and 1024GB isn't that great, and if you're already determined to buy the hardware to run in CPU a "large" model (eg: Not Qwen 3.6 35B-A3B, gemma4 or similar size), the loss of quality and sometimes suspicious nature of the output from a 4-bit quant might be undesirable vs running a Q8 quant or full precision.

You would also want a lot of RAM for context/kv cache to make it usable so just the amount of RAM that will fit a Q4 model and run it (before any cache starts getting populated through active use) isn't enough.

tfirst•1h ago

If model performance continues to scale with model size, I have a hard time seeing how local models will have any chance of competing with models hosted on datacenter hardware.

1. There are strong economies of scale in hosting inference (batched prompts, high uptime, shared infrastructure).

2. There are physical limits on how much memory we will be able to produce over the next few years. Demand will probably scale at least as fast as production does, so we won't be saved by falling prices.

dabinat•1h ago

Cloud models will always be ahead, but not every task needs Fable-level intelligence. The number of usable situations for local models will increase as hardware and open-weight models improve.

walrus01•1h ago

There's a value for many people and organizations with running a model locally on hardware they fully own and control (or pay to colocate in a datacenter somewhere) vs running a model on something owned/controlled by any third party. For highly privacy sensitive, medical applications, etc. It's not just a question of raw efficiency in dollar per tokens per day or tokens/second.

butvacuum•44m ago

For your first point- You've just repeated "shared tenant." A scaling factor that's been used since before the turn of the milenium. Uptime is, as always, an irrelevancy for personal/homelab vs cloud. It shifts from uptime to pure financial (capex first, then how you account for "wasted" time).

2) The current memory crunch is more political than cyclical. The only reason we have fabs as far intro construction as we do is CHIPS Act. Which, predates LLMs public existance by more than 6mo. the horrific silicon prices are a direct result of openAI's openly Illegal dealings. Their pretense of needing it for stargate gets sundered further with each missed or cancelled deadline.

They predicted the political and regulatory outcome superbly.

kristianp•1h ago

Irritating LLMisms:

    - "real architecture trick"
    - "the honest hardware reality of running it at home."
    - "What it is — and what Z.ai claims"
    - "The one genuinely new idea"

And many more.

LeoPanthera•1h ago

I've been using "the one genuinely adjective noun" for years as a weird English tic, and it bothers me that it's become an LLM tell.

butvacuum•59m ago

That's because most the "tells" expose more about the "reader" than the content.

CorpOverreach•1h ago

Yep. The entire thing. Instant turn-off when reading an article.

I'm sure the content does have some value, and perhaps someone spent time putting together an original copy that they thought was going to be made better by having AI "make it better".

Actually, I take some of that back - most of the site seems to be AI written, following the formula of "ingest multiple sources" => feed to AI => write article.

KaoruAoiShiho•1h ago

Terrible zero value article, I am extremely surprised it is upvoted.

That being said Artificial Analysis just came out with a brand new benchmark where it scored between opus 4.8 and gpt-5.5 and well behind fable-5 so it's definitely frontier-ish https://x.com/ArtificialAnlys/status/2067744637155226101

CorpOverreach•54m ago

I do think it's going to get harder and harder to run bleeding-edge models; this is just the start of it.

It being hard for the average joe to run these at its fullest potential is unfortunate, but the important part is that _you can_ assuming you can acquire the resources.

I think that's going to be important for the sake of preserving privacy and freedom of information in the long run. We're seeing this play out right now with Anthropic originally playing the "safety" card for why they can't let everyone at Mythos and subsequently got on the US Gov't radar with access to Fable being pulled.

The next biggest milestone will be an open-weights challenger to Mythos. There'll be consequences to that, but I feel those are less worse than someone else deciding what you can and can't use a model for.

lamida•54m ago

Pretty sure the article is fully written by LLM without editing at all. See all the — emdash sprinkled all over.

blackoil•53m ago

I think people overrate 'local' part of open Models vs private. With OpenAI my choice is 1. I have to use them, even if they decide to double the cost or work with govt to blow my country. My $5 server can't run GLM but I have choice from many providers based on my requirements of cost, data residency, political alignment.

easygenes•52m ago

Article reads as though written by someone who doesn't have much experience with deployments like this. Underestimates the memory needed to run with a reasonable amount of context. Misses two other obvious targets:

  1) 4x DGX Spark (or equivalent other GB10 boxes) with a switch (MikroTik CRS504 or CRS804) and TP=4.
  2) 4x RTX PRO 6000 box. Probably the most practical for cost/perf if you want on-prem as an individual.

Both would be best to run a 2-bit quant so everything can stay resident (article claims you could run a 4-bit quant with 4x RTX 6000 Ada, and while technically true it would mean a lot of the weights are streaming from DRAM, so it would be slow and impractical. You would need 8x RTX PRO 6000 to run 4 bit at a good speed).

This model quantizes unusually well: https://unsloth.ai/docs/models/glm-5.2#quantization-analysis

redox99•19m ago

Can you really say you're running GLM 5.2 if its a 2 bit quant? It might be usable but the capabilities will definitely not be the same.

ma2kx•2m ago

Thats just stupid.

- Why should I run it on local hardware when there are already about a dozen US provider available?

- To compare the token usage per task with GLM 5.1 is worthless when GLM 5.1 is unable to do the task.

- Not even z.ai itself runs the model with BF16 weights.

- I couldn't care less how good the model is at drawing a pelican on a bicycle.

Google's Secret Warrant Fight over DOJ Pipe Bomb Probe Revealed

Is the world becoming more predictable?

The Ghost in the Ledger

Applied AI Engineer/ Product Builder/ Data Science

The Reason Your IT Team Isn't Getting Anything Done

Between backyards and nakamals: Shifting Australia–Vanuatu relations

CS 153: Frontier Systems

Ask HN: Has AI made digital distribution more powerful than code?

Our game anticheat has no kernel driver, we catch not block

Adama City Government Exposes 29 GB of Sensitive Ethiopian Citizens' Data

Writing for humans is the only SEO trick left

SQLite Hub

A bold satellite rescue mission came together in record time, but will it work?

AI coding: loop engineering a translator

The First Website Killed by Google: Their Online Answer Marketplace

Valve's Latest SteamOS Out

Where the Light Falls: Who Was Johannes Vermeer?

Show HN: Iamspeed.dev – Fast.com Style but for LLMs

We built a privacy-focused vector memory mobile app

US Tells ASML It's Concerned China May Have Top Chip Tool

Stack Overflow for Agents – Stack Overflow

Moving Beyond Fork() + Exec()

Agent Finder

Hierarchy of the Sciences

DeepSWE v1.1

Is It Time for a New Embedded Linux Build System?

A tiny ingestible sensor can measure temperature from inside the body

The gentrification of Harajuku: how the coolest city is becoming "uncool"

Grok Is More Important Than Clean Air, DOJ Says

Fractal OS