I am very excited for local LLMs. I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR.
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
Also, I'm not sure where you are getting the under 2k value. I bought a Framework Desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
* FP4 will mean an accuracy loss of about 3%. Not noticeable, but more chance for mistakes.
* FP2 ... which is what most people are able to run at home, for a "reasonable" price. You're looking at over 17% loss in accuracy.
At that point, you're running at less than claude-sonnet-4.6 level, as the accuracy losses compound. And reasonably priced is still in the ~$5000 range (a system with 192GB RAM + a 32GB GPU for active weights/KV cache).
For that price you could be using a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone compared with an FP2 GLM 5.2 version. And you're looking at < 10 tok/s. A Mac Studio with 512GB will net you 18 to 20+ tok/s with FP4, but ... I mean, those used to be $10,000.
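For anyone who wants to sanity-check the memory side of those quantization levels, here is a rough sketch. It assumes roughly 750B parameters, which is only my inference from the ~1.5 TB download size mentioned elsewhere in this thread being a 16-bit checkpoint, and it ignores the per-block scale factors and metadata that real quantized formats add on top.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores quantization scales, metadata, and runtime overhead,
    so real files will be somewhat larger.
    """
    return params_billions * bits_per_weight / 8


# Assumption (not an official figure): ~750B parameters,
# inferred from a ~1.5 TB 16-bit checkpoint.
ASSUMED_PARAMS_B = 750

if __name__ == "__main__":
    for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4), ("FP2", 2)]:
        size = weights_size_gb(ASSUMED_PARAMS_B, bits)
        print(f"{name}: ~{size:.1f} GB of weights")
```

Under that assumption the 2-bit weights alone land just under 192GB, which would line up with the 192GB + 32GB system described above, with the KV cache still needing room on top.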
Unfortunately the local hardware cost is a major issue for running large models like that.
* think Apple M6 or M7 with a currently unforeseen denser memory style, 256GB RAM
* a couple inference or cache improvements on the algorithmic side, using less RAM for context windows and doubling token speed again
* denser open source models, packing more experts for smaller active layers
it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s
For companies wanting LLMs in their products in general, I have to think going the local LLM route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products for.
On top of that, you will still be heavily quantized.
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
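One way to find out without guessing is to measure it on a sample of the actual weights. This is a minimal sketch using Python's standard `lzma` module; `sample_file_ratio` and the path you pass to it are hypothetical, and a general-purpose compressor is just my stand-in here, not something any model vendor recommends.

```python
import lzma
import os


def compression_ratio(data: bytes) -> float:
    """Original size divided by LZMA-compressed size (higher is better)."""
    if not data:
        raise ValueError("need at least one byte to measure")
    return len(data) / len(lzma.compress(data, preset=6))


def sample_file_ratio(path: str, sample_bytes: int = 64 * 1024 * 1024) -> float:
    """Estimate the ratio from the first sample_bytes of a weights file.

    The path is whatever shard you have on disk (hypothetical here).
    """
    with open(path, "rb") as f:
        return compression_ratio(f.read(sample_bytes))


if __name__ == "__main__":
    # Two reference points to compare a real weights file against:
    print("all zeros:", compression_ratio(bytes(1_000_000)))
    print("random bytes:", compression_ratio(os.urandom(1_000_000)))
```

The two reference points bracket the possibilities: all-zero data compresses enormously, random bytes essentially not at all. Running `sample_file_ratio` on one shard would show where the weights fall between them before committing to archiving the whole 1.5 TB.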
We are maybe 10 years off that.
RAM prices are going to continue to increase for the next 2 years at least.
Even putting that aside, it's currently around 40,000-70,000 EUR to run this with an FP8 quantization (which you need to get close to maximum performance).
To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
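To put a rough number on that headroom, here is a sketch of the standard KV cache size estimate for a grouped-query attention model. The layer count, KV head count, and head dimension below are hypothetical placeholders, not GLM 5.2's real configuration, and models that use compressed attention variants such as MLA store considerably less per token.

```python
def kv_cache_gb(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes for a 16-bit cache
) -> float:
    """KV cache size in GB (10^9 bytes) for plain grouped-query attention.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

if __name__ == "__main__":
    context = 131_072  # one 128k-token context
    one = kv_cache_gb(context, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"one agent:   ~{one:.1f} GB")
    print(f"four agents: ~{4 * one:.1f} GB")  # each subagent holds its own context
```

The point is only that the cache scales linearly with both context length and the number of concurrent contexts, so every subagent you run multiplies the memory needed beyond the weights themselves.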
I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
It's not like you'd lose capabilities, if anything this solution just gets better with time.
For who pays for it, obviously the employer would.
As for "how did I arrive at this number": it's a ballpark estimate from what I know about part costs. Most of that money will go towards AI cards. About $5k would be for the motherboard, CPU, power supply, etc.; $45k would be for as much RAM and as big/expensive NVIDIA cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.
https://unsloth.ai/docs/models/glm-5.2#usage-guide
In a prior thread, someone said it would take $500k in hardware:
https://news.ycombinator.com/item?id=48629970
NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at an $80-90k figure with today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, and a power supply. 576GB of VRAM.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
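To get a feel for what those throughput numbers mean per request, here is a back-of-the-envelope sketch. It assumes a single request with no batching or concurrency, and the prompt and output sizes are made-up examples.

```python
def request_seconds(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    decode_tps: float,
) -> float:
    """Rough wall-clock time for one request: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


if __name__ == "__main__":
    # Example sizes (assumptions): a 12k-token prompt and a 400-token answer,
    # at the ~1200 tok/s prefill and 40 tok/s decode figures quoted above.
    total = request_seconds(12_000, 400, prefill_tps=1200, decode_tps=40)
    print(f"~{total:.0f} s per request")
```

With long coding-agent prompts the prefill term can cost as much as the decode term, so the prefill figure matters about as much as the headline tok/s.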