Jamesob's guide to running SOTA LLMs locally

https://github.com/jamesob/local-llm
44•livestyle•1h ago

Comments

beardsciences•1h ago
I am somewhere in the middle, where I want something with more than 48GB/$2k of VRAM, but less than 384GB/$40k.

I'm curious if GMKtec's EVO-X2, with ~96GB of usable VRAM, is still a good solution for something like this for $3,399.

sampullman•1h ago
I picked up the 128GB version when it was $2,199 and it runs Qwen 3.6 reasonably well with a 128k context. Not very useful for complex tasks but it can handle some web stuff.
mft_•42m ago
It has lower memory bandwidth than most comparable Macs.
datadrivenangel•54m ago
"A great way to go is 2x RTX 3090s for a total of 48GB VRAM total. You can then run Qwen3.6-27B, which is an awesome model."

Just want to note that for $3k you can get an M5 MacBook Pro with 48GB of shared memory, and it will not be a giant box. Also, consider committing to spending that money on a cloud hosting provider, which will be at least somewhat cheaper if not significantly cheaper. It is awesome being able to run models locally though.

jbellis•47m ago
That's a reasonable option, just be aware that you get about 1/3 as much memory bandwidth with the M5 Pro, or 2/3 with the M5 Max [now you're at $4100 for the lowest-end]. So both your prefill (flops-bound, M5 has a lot less) and decode (bw-bound) will be slower.
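
A rough sketch of why decode speed tracks memory bandwidth: if every weight has to be read once per generated token, bandwidth divided by weight size gives a ceiling on tokens per second. The 936 GB/s figure is the RTX 3090's spec-sheet bandwidth. The 4-bit weights, the dense 27B model, and the function name are assumptions for illustration, and the M5 Pro value is simply the 1/3 ratio quoted above, not a measured number.

```python
def decode_tps_ceiling(bandwidth_gb_s: float, params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on single-stream decode tokens/s for a dense model.

    Assumes all weights are streamed from memory once per token and
    ignores KV-cache reads, kernel overhead, and multi-GPU effects.
    """
    weights_gb = params_billion * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / weights_gb


RTX_3090_BW = 936.0           # GB/s, spec-sheet value for one card
M5_PRO_BW = RTX_3090_BW / 3   # assumption: the "about 1/3" ratio from the comment

for name, bw in [("RTX 3090", RTX_3090_BW), ("M5 Pro (assumed)", M5_PRO_BW)]:
    tps = decode_tps_ceiling(bw, params_billion=27, bits_per_weight=4)
    print(f"{name}: at most ~{tps:.0f} tokens/s")
```

Real throughput lands below this ceiling, but the ratio between two machines tends to follow the bandwidth ratio. Prefill is compute-bound, so it needs a separate estimate.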
LeBit•42m ago
I’m an idiot who is unable to project myself into situations I’ve never experienced before.

So, I always thought local LLMs were toys not worth pursuing.

Only once I tried something decent like Gemma 4 31B and Qwen 3.6 27B did I realize how incredibly useful they are.

You stop fearing you are sharing sensitive information.

You stop fearing you will run out of tokens.

You stop worrying about the availability of the remote AI.

Local LLMs are extremely valuable.

bityard•34m ago
*for many tasks
Aurornis•35m ago
I have an M5 MacBook Pro and I also have a separate GPU setup for running models. The difference in speed is significant. It's not just token generation speed, but time to first token (prompt processing).

The M5 hardware is amazing for what it is, but GPUs are still so much faster.

Running the models on the GPU box also means I can use the laptop on my lap instead of turning it into a hot plate.

boredatoms•23m ago
The standalone Mini/Studio is better if you don't want to have a constantly hot laptop.

Get a regular laptop and use the network to access the LLM

xela79•51m ago
did he call Qwen a SOTA model?
mft_•41m ago
No, he’s running GLM 5.2, which is closer to SOTA.
zackify•50m ago
You can get amazing local STT using Parakeet, which can use as little as 600MB of VRAM. It's as good as or better than Whisper large-v3.
kgeist•43m ago
>$40k gets you almost-Opus

GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k).

They suggest using this modified model:

>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.

I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I heard the consensus is you shouldn't go below 8 bit for coding.

Also, it's not clear what is left of the available context when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context.

P.S. Found it in the repos: 240k context.
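
A back-of-the-envelope sketch of the memory budget in question. The 594B parameter count comes from the model description quoted above, and 96 GB per card follows from the 384GB total mentioned upthread for four cards. The average of 4.5 bits per weight is an assumption (NVFP4 carries per-block scale factors, and the Int8 mix would push the real average higher), so treat the leftover figure as optimistic.

```python
def weights_gb(params_billion: float, avg_bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (decimal) for a quantized model."""
    return params_billion * avg_bits_per_weight / 8


TOTAL_VRAM_GB = 4 * 96   # four cards, 96 GB each (384GB total, per the thread)
PARAMS_BILLION = 594     # pruned model size from the quoted description
AVG_BITS = 4.5           # assumption: 4-bit values plus scale-factor overhead

weights = weights_gb(PARAMS_BILLION, AVG_BITS)
leftover = TOTAL_VRAM_GB - weights

print(f"weights:  ~{weights:.0f} GB")
print(f"leftover: ~{leftover:.0f} GB for KV cache, activations, and runtime overhead")
```

How many tokens of context fit in the leftover depends on the model's attention layout and the KV-cache precision, neither of which is given here.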

amelius•39m ago
How does this work with scaling?

I assume you can then somehow run several hundreds of prompts concurrently?

api•39m ago
Apple M series chips deserve a mention as another option, especially since you get a whole Mac laptop or desktop workstation too.

They have unified memory and respectable inference performance, and in some configurations they can be cheaper than video cards, especially if you buy a used or refurbished older-gen high-end M series with a lot of RAM.

I've read that Apple plans to offer more RAM in all its models once the RAM bottleneck passes, and that future M series GPUs and NPUs will be even better for local inference, so I expect Apple to become a serious offering for local inference and AI research workstations.

And what about AMD and Intel Arc GPUs? They don't get as much love but I've heard they can be compelling for certain shapes of a local LLM configuration.

At this point, though, I think we may be in a "renter's market" for LLM compute. If you want privacy, it might be better to rent GPU time in raw form or use spot pricing at various providers. It probably only makes sense to build if you have extreme privacy/security needs or just want to do it because it's cool.

wxw•39m ago
I agree that local LLMs are the likely future and worth investing in… but at $40k for possible-SOTA right now, this isn’t worth it for the average consumer.

I’m pretty bullish that Apple will deliver something very competitive for the average consumer in the next couple years.

Aurornis•37m ago
I play with local LLMs a lot. I've spent more on hardware than I should. I'm friends with a local group of people who have spent a lot more than I have.

The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.

Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long context coding tasks and the quality will be noticeably worse. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit models, 8-bit quants, and sometimes even the full 16-bit source.

This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.

The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However they're not actually running GLM-5.2, they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.

Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...

CamperBob2•12m ago
All very true. Right now, running GLM 5.2 at full BF16 precision needs 1.5 TB of VRAM. You can't run this locally at a usable speed for less than $250K or so, and frankly I'd be surprised if it could be done for less than $500K.

The best NVFP4 quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps output) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.

It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.

Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.

turova•7m ago
For Qwen3.6-27B you can also run the q4 variant with full ~250K context on one 3090. It's fast enough to not be frustrating, so the speed gains with 2x 3090s wouldn't be worth it to me. Running a q6 on 2x 3090s at half the speed with a smaller context is an option, but you're really not going to compete with SOTA models there anyway, so unless you already have 2x 3090s, I would say 1 is the best investment given current prices. It's good enough to do a lot, especially with a well-configured harness.
Aurornis•7m ago
> It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers.

The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter. Compare apples to apples.

You will almost certainly never break even compared to paying per token.

Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.
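
A minimal sketch of that break-even comparison. The $50K hardware cost is the low end of the estimate upthread and 75-100 tps is the output rate quoted for a larger rig, so both are borrowed loosely. The $3.00 per million tokens is a hypothetical placeholder, not a real OpenRouter price. The function name is made up, and electricity, batching across concurrent requests, and resale value are all ignored.

```python
SECONDS_PER_DAY = 86_400


def breakeven_days(hardware_usd: float, api_usd_per_mtok: float,
                   tokens_per_second: float) -> float:
    """Days of nonstop single-stream generation before local hardware
    costs less than paying an API per token."""
    tokens_to_breakeven = hardware_usd / api_usd_per_mtok * 1_000_000
    return tokens_to_breakeven / tokens_per_second / SECONDS_PER_DAY


HARDWARE_USD = 50_000      # low end of the $50-55K estimate upthread
API_USD_PER_MTOK = 3.00    # hypothetical price, substitute a real quote

for tps in (75, 100):
    days = breakeven_days(HARDWARE_USD, API_USD_PER_MTOK, tps)
    print(f"{tps} tok/s: ~{days:,.0f} days ({days / 365:.1f} years) of 24/7 use")
```

Serving many requests at once raises aggregate throughput and shortens the payback period, so the result is most sensitive to how heavily the box is actually used.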
