
Run a 1T parameter model on a 32 GB Mac by streaming tensors from NVMe

https://github.com/t8/hypura
72•tatef•1h ago

Comments

marksully•1h ago
Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.
causal•54m ago
Yeah, the title comes from nowhere in the link. No doubt it's possible, but all that matters is speed, and we learn nothing of that here...
zozbot234•57m ago
It will be interesting to compare this to https://news.ycombinator.com/item?id=47476422 and https://news.ycombinator.com/item?id=47490070 . Very similar design except that this is apparently using mmap, which according to the earlier experiment incurs significant overhead.
salynchnew•35m ago
It was written by an LLM, so... yeah.
jeffybefffy519•14m ago
Except this isn't using heavily quantised versions of the model, which would reduce quality.
Insanity•52m ago
This is a pretty cool project! Essentially it's like using swap memory to extend your RAM, but in a 'smart' way so you don't overload the NVMe unnecessarily.

I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.

zozbot234•47m ago
This is not putting any stress or wear on the NVMe, it's a pure read workload.
embedding-shape•45m ago
> but in a 'smart' way so you don't overload the NVMe unnecessarily

"overloading NVMe"? What is that about? First time I've heard anything about it.

> because putting a ton of stress on your NVMe during generation

Really shouldn't "stress your NVMe", something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations "hurt" the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.

Insanity•38m ago
I had assumed heat generation on the controller if it's continuously reading. But maybe it's not actually bad.
monksy•39m ago
There needs to be something like this for Ollama. At the moment Ollama has a lot of flaws that prevent it from getting great performance (my understanding is better GPU/CPU splits, etc.). But Ollama is the only way to host an LLM and have it switch out on demand. Sigh.
rubiquity•33m ago
llama.cpp and llama-swap do this better than Ollama and with far more control.
zozbot234•27m ago
Ollama has very substandard support for mmap at present, which hurts inference with larger models. There are some recent pull requests in flight that should help address this to at least some extent https://github.com/ollama/ollama/pull/14525 https://github.com/ollama/ollama/pull/14134 https://github.com/ollama/ollama/pull/14864 but progress seems to be stalling. Their support for recent Qwen models seems to also have some bespoke incompatibilities with llama.cpp, which doesn't help matters; it's difficult to test the same model with both.
baq•34m ago
Intel Optane rolling in its grave.
liuliu•30m ago
Still have 4 brand new ones in my storage unit. Just in case of moments like these.

Joke aside (I do have them, though!), I don't think Optane is that much use (not to mention mine are only 256 GiB each). It's a useful legacy crutch if you have legacy software that isn't designed to issue multiple reads/writes in parallel. If you do, it's really not faster than NVMe, especially the modern ones.

zozbot234•21m ago
It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert layers immediately after routing); it's the wear-out resistance, which opens up the possibility of storing the KV-cache (including the "linear" KV-cache of recent Qwen models, which is not append-only as it was with the pure attention model) and maybe even per-layer activations, though those have the least use given how ephemeral they are.
speedgoose•15m ago
Is it too late for Intel to bring them back to life?
c0balt•9m ago
Yes, their NAND division has been sold; it is now mostly under Solidigm. Maybe Solidigm could bring it back, but it seems unlikely (given the previous commercial failure).
0ptan3•13m ago
pmem
moffkalast•8m ago
Wouldn't be Intel if they didn't quit halfway through on a good thing.

Still, couldn't one get a RAID 0 card with four drives to saturate an x16 link? That's already the max one could push through PCIe anyhow.
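Back-of-the-envelope on the RAID 0 idea, assuming PCIe 4.0 numbers (the per-lane throughput below is an assumed figure, not from the repo):

```python
# Rough check: can four striped NVMe drives saturate a PCIe 4.0 x16 link?
lane_gbs = 1.97            # assumed usable PCIe 4.0 throughput per lane, GB/s
x16_bw = 16 * lane_gbs     # host link ceiling, ~31.5 GB/s
drive_bw = 4 * lane_gbs    # one x4 NVMe drive, ~7.9 GB/s ceiling
array_bw = 4 * drive_bw    # RAID 0 across four drives

print(array_bw, x16_bw)    # four x4 drives roughly match the x16 link
```

So in principle the striping works out exactly: four x4 drives consume the same 16 lanes the host link provides, before controller and filesystem overhead.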

nullbyte•25m ago
I am curious how the TPS compares vs default OS virtual memory paging
anshulbasia27•19m ago
OS paging would be significantly worse here. The kernel's page fault handler is reactive: it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch. You stall on every fault, wait for the 4 KB/16 KB page to load, then resume. With 80 layers of dense FFN streaming, that's thousands of cold faults per token.

What makes this approach faster is that the model's access pattern is completely deterministic during inference. You know exactly which tensors are needed next because transformer layers execute sequentially, so you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal. The OS page cache can't do that; it has no concept of "layer N+1 comes after layer N."

For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one, then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than expert 7. The neuron cache here is basically a domain-specific replacement policy.
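The double-buffered streaming idea above can be sketched in a few lines. This is not the project's actual code; the layer layout and compute step are hypothetical stand-ins:

```python
# Sketch of double-buffered layer streaming: issue one large sequential
# read for layer N+1 on a background thread while layer N is being
# computed, instead of taking thousands of reactive page faults.
import concurrent.futures
import io

def load_layer(f, offset, size):
    # One big sequential read per layer.
    f.seek(offset)
    return f.read(size)

def run_layers(f, layout, compute):
    # layout: list of (offset, size) per transformer layer, in execution order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, f, *layout[0])
        for i in range(len(layout)):
            weights = pending.result()                     # wait for layer i
            if i + 1 < len(layout):
                pending = pool.submit(load_layer, f, *layout[i + 1])
            compute(weights)                               # overlaps the next read

# Demo with an in-memory "weight file" of three 4-byte layers.
buf = io.BytesIO(b"aaaabbbbcccc")
seen = []
run_layers(buf, [(0, 4), (4, 4), (8, 4)], seen.append)
```

The single worker thread keeps the reads ordered, which is what makes the NVMe see a sequential stream rather than random faults.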
zozbot234•18m ago
> The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch.

man 2 madvise
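For what the terse reply is pointing at: the kernel can be told what you're about to read. A minimal sketch using Python's `mmap.madvise` wrapper (availability of `MADV_WILLNEED` is platform-dependent; the file here is a throwaway stand-in for a weight file):

```python
# Hint the kernel to prefetch a region of an mmap'd file before it is read,
# so the read does not stall on cold page faults.
import mmap
import tempfile

def prefetch(mm, offset, length):
    # madvise requires a page-aligned start; round the offset down.
    page = mmap.PAGESIZE
    start = (offset // page) * page
    mm.madvise(mmap.MADV_WILLNEED, start, length + (offset - start))

# Demo against a 4-page temporary file.
f = tempfile.TemporaryFile()
f.write(b"x" * (4 * mmap.PAGESIZE))
f.flush()
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
prefetch(mm, 100, 1000)
data = mm[100:104]
```

`MADV_WILLNEED` is only a hint, though; it does not guarantee the pages arrive before you touch them, which is roughly EnPissant's objection below.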

EnPissant•14m ago
That assumes you have significant work to do between fetches (so you can prefetch while using the current data). With LLM decode you don't.
EnPissant•16m ago
You do not provide any comparison to llama.cpp with mmap.

You do not explain how any kind of predictor can work for MoE experts.

You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).

amelius•11m ago
This is <1 tok/s for the 40GB model.

Come on, "Run" is not the right word. "Crawl" is.

speedgoose•10m ago
I wonder how many minutes per token on GLM 5.
vicchenai•9m ago
The practical question is whether the read pattern is sequential enough to actually saturate NVMe bandwidth, or if the attention-layer access pattern ends up random enough to kill throughput. Sequential reads on a decent NVMe get you 5-7 GB/s; random reads drop to maybe 500 MB/s depending on queue depth.

For a 1T model you'd need to stream something like 2 TB of weights per forward pass at fp16. Even at peak sequential that's 300+ seconds per token, which is... not great for interactive use, but maybe fine for batch inference where you don't care about latency.

Still a cool proof of concept though. The gap between 'can run' and 'runs usefully' is where things get interesting.
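The arithmetic in the parent comment, spelled out (the bandwidth figures are the comment's own rough numbers, not measurements):

```python
# Streaming bottleneck estimate: a dense forward pass over a 1T-parameter
# model at fp16 must read every weight once from disk.
params = 1e12
bytes_per_param = 2                      # fp16
weight_bytes = params * bytes_per_param  # ~2 TB per forward pass

seq_bw = 6e9                             # ~6 GB/s peak sequential NVMe read
rand_bw = 0.5e9                          # ~500 MB/s random-read worst case

seconds_seq = weight_bytes / seq_bw      # ~333 s/token, best case
seconds_rand = weight_bytes / rand_bw    # ~4000 s/token if reads go random
```

Which is why the sequential-vs-random question dominates everything else here: it's roughly an order of magnitude between the two cases.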

erikcw•7m ago
Simon Willison wrote a good post about Dan Woods’ work on “Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally”.

[0] https://simonwillison.net/2026/Mar/18/llm-in-a-flash/


FreeCAD 1.1.0 Released

https://wiki.freecad.org/Release_notes_1.1
1•yehoshuapw•54s ago•0 comments

The hardest question to answer about AI-fueled delusions

https://www.technologyreview.com/2026/03/23/1134527/the-hardest-question-to-answer-about-ai-fuele...
1•Brajeshwar•1m ago•0 comments

Association between Covid-19 vaccination and sudden death

https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1004924
1•Anon84•2m ago•0 comments

Show HN: Mantyx – A platform to orchestrate, manage, and share your agents

https://mantyx.io/
1•grillorafael•2m ago•0 comments

App that uses a hidden sensor in Macs to turn typing force into keyboard sounds

https://www.haptyk.com/
1•olvvier•3m ago•1 comments

We built an observability database for agents, not humans

https://blog.firetiger.com/we-built-an-observability-database-for-agents-not-humans/
3•matsur•4m ago•0 comments

A map showing all startups hiring in Bangalore

https://www.blrstartuparena.com/map
1•astroanax•5m ago•0 comments

WebKit Features for Safari 26.4

https://webkit.org/blog/17862/webkit-features-for-safari-26-4/
1•dfabulich•7m ago•0 comments

Cover Image Skill for OpenClaw, Claude Code

https://www.npmjs.com/package/blog-cover-image-cli
1•sam_josh1•7m ago•0 comments

We Logged 341 EV Charging Sessions. 4 in 10 Had Problems

https://www.evcourse.com/ev-charging-reliability-europe
1•userium•9m ago•0 comments

ZML/v2: Frontier performance through composability

https://zml.ai/posts/zml-v2/
1•steeve•10m ago•0 comments

Postgraphile v5 Released

https://postgraphile.org/news/2026-03-24-v5-published/
3•purge•10m ago•2 comments

AI Agents Gone Rogue

https://www.osohq.com/developers/ai-agents-gone-rogue
1•forks•11m ago•0 comments

Show HN: The / marketplace, an open-source ChatGPT Checkout

https://marketplace.openship.org
1•theturtletalks•11m ago•0 comments

My 2c on the AI/GenAI/LLM bubble

https://riffraff.info/2026/03/my-2c-on-the-ai-genai-llm-bubble/
1•speckx•12m ago•0 comments

Country that put backdoors in Cisco routers to spy on world bans foreign routers

https://www.theregister.com/2026/03/24/fcc_foreign_routers/
11•beardyw•12m ago•1 comments

Global Carmakers Retreat En Masse from Electric Vehicle Plans

https://www.ft.com/content/1198863d-4974-4c4d-be5f-9e7152045b26
1•karakoram•13m ago•1 comments

The AI Industry Is Lying to You

https://www.wheresyoured.at/the-ai-industry-is-lying-to-you/
2•spking•13m ago•0 comments

Pebble Time 2 enters mass production

https://repebble.com/blog/pebble-time-2-is-in-mass-production
1•smig0•14m ago•0 comments

Arm AGI CPU

https://newsroom.arm.com/blog/introducing-arm-agi-cpu
2•RealityVoid•14m ago•1 comments

Systemd has not implemented age verification

https://blog.bofh.it/debian/id_473
2•edward•14m ago•0 comments

How to Build a PMF Machine

https://speedrun.substack.com/p/how-to-build-a-pmf-machine
1•babelfish•16m ago•0 comments

Fitbit Data Sheds Light on Best Time to Exercise

https://nautil.us/fitbit-data-sheds-light-on-best-time-to-exercise-1279140
1•Brajeshwar•17m ago•0 comments

OpenWonton: Nomad-Compatible Workload Orchestrator

https://github.com/openwonton/openwonton
1•InitEnabler•18m ago•0 comments

Richland Correctional Institution rehabilitates animals and people

https://www.ashlandsource.com/2025/10/17/prison-opossums-how-richland-correctional-institution-is...
1•pavel_lishin•19m ago•0 comments

Who Makes What, and Where with the US ISP CPE Supply Chain

https://www.senki.org/operators-security-toolkit/us-isp-cpe-supply-chain/
1•speckx•19m ago•0 comments

The US bans all new foreign-made network routers

https://www.engadget.com/big-tech/the-us-bans-all-new-foreign-made-network-routers-223622966.html
4•ZunarJ5•19m ago•2 comments

Why Performance Reviews Need a Makeover

https://www.ft.com/content/c3d40d72-3c91-4dbe-9c48-a8f2940cc147
1•karakoram•20m ago•1 comments

PBMs Extract $30B/Year from Drug Prices (Data Analysis)

https://andrewrexroad.substack.com/p/the-middlemen
1•rexroad•20m ago•0 comments