1. There are strong economies of scale in hosting inference (batched prompts, high uptime, shared infrastructure).
2. There are physical limits on how much memory we will be able to produce over the next few years. Demand will probably scale at least as fast as production does, so we won't be saved by falling prices.
2) The current memory crunch is more political than cyclical. The only reason we have fabs as far into construction as we do is the CHIPS Act, which predates LLMs' public existence by more than six months. The horrific silicon prices are a direct result of OpenAI's openly illegal dealings. Their pretense of needing it for Stargate gets sundered further with each missed or cancelled deadline.
They predicted the political and regulatory outcome superbly.
- "real architecture trick"
- "the honest hardware reality of running it at home."
- "What it is — and what Z.ai claims"
- "The one genuinely new idea"
And many more. I'm sure the content does have some value, and perhaps someone spent time putting together an original copy that they thought was going to be improved by having AI "make it better".
Actually, I take some of that back - most of the site seems to be AI written, following the formula of "ingest multiple sources" => feed to AI => write article.
That being said, Artificial Analysis just came out with a brand new benchmark where it scored between opus 4.8 and gpt-5.5 and well behind fable-5, so it's definitely frontier-ish: https://x.com/ArtificialAnlys/status/2067744637155226101
It's unfortunate that it's hard for the average joe to run these at their fullest potential, but the important part is that _you can_, assuming you can acquire the resources.
I think that's going to be important for the sake of preserving privacy and freedom of information in the long run. We're seeing this play out right now: Anthropic originally played the "safety" card for why they can't let everyone at Mythos, and subsequently got on the US Gov't radar, with access to Fable being pulled.
The next big milestone will be an open-weights challenger to Mythos. There'll be consequences to that, but I feel those are less bad than someone else deciding what you can and can't use a model for.
1) 4x DGX Spark (or equivalent other GB10 boxes) with a switch (MikroTik CRS504 or CRS804) and TP=4.
2) 4x RTX PRO 6000 box. Probably the most practical for cost/perf if you want on-prem as an individual.
Both would be best running a 2-bit quant so everything can stay resident. (The article claims you could run a 4-bit quant with 4x RTX 6000 Ada, and while that is technically true, it would mean a lot of the weights are streaming from DRAM, so it would be slow and impractical. You would need 8x RTX PRO 6000 to run 4-bit at a good speed.) This model quantizes unusually well: https://unsloth.ai/docs/models/glm-5.2#quantization-analysis
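For anyone who wants to sanity-check the "stays resident" reasoning, here is a rough back-of-envelope sketch in Python. It only counts the weights (parameters times bits per weight) and compares that against pooled GPU memory with some headroom held back. The parameter count, the per-GPU memory figures, and the 10% reserve are all placeholder assumptions of mine, not numbers from the article or from Z.ai, so plug in real values before drawing conclusions.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8.0


def fits_resident(params_billion: float, bits_per_weight: float,
                  gpus: int, gb_per_gpu: float,
                  reserve_fraction: float = 0.10) -> bool:
    """True if the weights fit in pooled GPU memory after holding back a reserve.

    reserve_fraction is a guessed allowance for activations, KV cache and
    runtime overhead; it is an assumption, not a measured value.
    """
    usable_gb = gpus * gb_per_gpu * (1.0 - reserve_fraction)
    return weight_footprint_gb(params_billion, bits_per_weight) <= usable_gb


if __name__ == "__main__":
    HYPOTHETICAL_PARAMS_B = 1000.0  # placeholder, NOT the real parameter count
    # (label, number of devices, GB per device): memory sizes are assumptions
    setups = [("4x 96 GB cards", 4, 96.0), ("8x 96 GB cards", 8, 96.0)]
    for label, gpus, gb in setups:
        for bits in (2.0, 4.0):
            size = weight_footprint_gb(HYPOTHETICAL_PARAMS_B, bits)
            ok = fits_resident(HYPOTHETICAL_PARAMS_B, bits, gpus, gb)
            print(f"{label}, {bits:g}-bit: weights ~{size:.0f} GB, resident={ok}")
```

Real quant formats carry extra bits for scales and usually keep some tensors at higher precision, so the effective bits per weight tends to be higher than the nominal 2 or 4. Treat the result as a lower bound on memory needed.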
- Why should I run it on local hardware when there are already about a dozen US providers available?
- Comparing the token usage per task with GLM 5.1 is worthless when GLM 5.1 is unable to do the task.
- Not even z.ai itself runs the model with BF16 weights.
- I couldn't care less how good the model is at drawing a pelican on a bicycle.
walrus01•1h ago
Same idea for Kimi.
qingcharles•50m ago
sgc•36m ago
walrus01•26m ago
You would also want a lot of RAM for context/KV cache to make it usable, so having just enough RAM to fit a Q4 model and run it (before any cache starts getting populated through active use) isn't enough.
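To put a rough number on the cache part, here is a minimal Python sketch of the textbook KV cache estimate for a plain multi-head or grouped-query attention model: two tensors (K and V) per layer, each of size kv_heads x head_dim per token. I am assuming that layout for illustration only. Models that compress the cache (latent attention, sparse attention, quantized KV) will need less, and every shape in the example call is made up rather than taken from any model card.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2,
                batch: int = 1) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes).

    Assumes standard attention where each layer stores one K and one V
    vector of kv_heads * head_dim elements per token. bytes_per_elem=2
    corresponds to FP16/BF16 cache entries.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token_bytes * context_tokens * batch / 1e9


if __name__ == "__main__":
    # Hypothetical shapes for illustration only.
    for ctx in (8_192, 32_768, 131_072):
        gb = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=ctx)
        print(f"context {ctx:>7}: ~{gb:.1f} GB of KV cache on top of the weights")
```

The cache grows linearly with context length and with the number of concurrent sequences, which is why memory that just barely holds the weights runs out as soon as you start actually using the model.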