Recent changes in the core llama.cpp code, the Linux kernel, the new "open" NVIDIA driver, and CUDA 13 have finally enabled similar behavior on supported Linux-based operating systems. I've tested the compilation on two distros and have confirmed that Unified/Heterogeneous Memory Management is finally working!
https://github.com/ggml-org/llama.cpp
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
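For anyone who wants to reproduce the compilation, this is roughly the standard CMake flow from the llama.cpp build docs with the CUDA backend switched on; it's a sketch, and it assumes you already have the CUDA 13 toolkit (nvcc) and a recent CMake on your path:

```shell
# Clone and build llama.cpp with the CUDA backend enabled.
# Requires the CUDA toolkit (nvcc) and CMake to be installed first.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```

The binaries (llama-cli, llama-server, etc.) land in `build/bin/` afterward.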
A note of caution though: NVIDIA's setup guide does not account for Secure Boot. Ubuntu and RHEL have pre-signed drivers in their repos. Otherwise you have to remember to run mokutil to enroll a signing key before rebooting. And for those of you who don't care if a software supply chain runs through a sewer, there's stuff like RPM Fusion and the AUR, of course. Another thing to note: Ubuntu 26.04 LTS, releasing next month (April 2026), is supposed to have an even easier way of installing a full CUDA dev environment on Linux. I'm really eager to try it out then.
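For the Secure Boot case, the enrollment step looks something like the following. This is a sketch, and the key path is an assumption (it's the default DKMS key location on Ubuntu-family systems; other distros put it elsewhere):

```shell
# Enroll the module-signing key (MOK) so Secure Boot will load a
# locally built NVIDIA kernel module. Path below is Ubuntu's DKMS
# default -- adjust for your distro.
sudo mokutil --import /var/lib/shim-signed/mok/MOK.der
# mokutil prompts for a one-time password; on the next reboot the
# MOK manager screen asks for it again to confirm the enrollment.
sudo reboot
```

If you skip this before rebooting, the freshly built module simply refuses to load and you're left staring at a software-rendered desktop wondering what went wrong.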
But yeah, bottom line is you apparently can now use unified memory functionality similar to macOS on Linux for AI inference. Combined with the right CLI flags and a sparse-activated model (e.g. Qwen 3.5 35B A3B), AI is fully capable on something equivalent to what you'd see advertised as a "gaming" computer on the shelf at a Best Buy or Costco. Think RTX 3060, i5/Ryzen 5, 32 GB RAM (DDR4 or DDR5), 500-700 W power supply, either air cooled or using a closed-loop liquid cooler (no city water supply needed). And with llama.cpp's built-in server (or your own code), you could arguably have your team's own private AI hub running on a box in a closet somewhere. And it's only going to get better from here. Who knows - we might go to CPU-only soon with the right math work.
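As a sketch of what "the right CLI flags" can look like: llama.cpp's CUDA backend gates unified memory behind the GGML_CUDA_ENABLE_UNIFIED_MEMORY environment variable, and the built-in server takes the usual offload and context flags. The model path and context size here are illustrative, not recommendations:

```shell
# Serve a model over HTTP with CUDA unified memory enabled, letting
# weights that don't fit in VRAM spill into system RAM.
# -ngl 99 offloads all layers to the GPU; -c sets the context window;
# the model path is hypothetical.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server \
  -m ./models/model.gguf -ngl 99 -c 8192 \
  --host 0.0.0.0 --port 8080
```

Binding to 0.0.0.0 is what makes it a "team hub": anyone on the LAN can point an OpenAI-compatible client at port 8080 on that box in the closet.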
Sorry Sam. Sorry Elon. Sorry Mark. Sorry Dario and Daniela. You're all history. AI has been freed - figuratively, socially, and financially. Enjoy tokenmaxxing with your local on-device setups for $0/month, everyone!
---
I also want to point out that it's entirely coincidental that I found this out on the day of a major inference-framework security breach (LiteLLM). Hope they pull through alright. As of this writing, I am unaware of any such issues in the llama.cpp project.