frontpage.

Made with ♥ by @iamnishanth

Open Source @Github


Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM

https://github.com/danveloper/flash-moe
69•mft_•1h ago

Comments

homarp•55m ago
/r/localllama discussion: https://old.reddit.com/r/LocalLLaMA/comments/1rxmmu5/running...
harshhhhhhhhh•55m ago
Seems promising; this is the way. Can someone benchmark this?
frwickst•54m ago
I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command: ./infer --prompt "Explain quantum computing" --tokens 100

MacBook Pro M5 Pro (64GB RAM)

logicallee•25m ago
can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.
frwickst•21m ago
Since the output is quite long, here is a link: https://pastebin.com/k76wiVGP
hrimfaxi•10m ago
Why does this G character appear to prefix most of the output? ("Ġlike")
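A likely explanation, assuming the tool prints raw GPT-2-style byte-level BPE tokens: that tokenizer remaps every byte to a printable code point, and the space byte lands on U+0120 "Ġ", so any token that starts a new word carries a leading "Ġ". A minimal sketch of that mapping:

```python
def bytes_to_unicode():
    # GPT-2-style byte-to-unicode table: printable ASCII and two Latin-1
    # ranges are kept as-is; every other byte (including space, 0x20) is
    # remapped to a visible code point starting at U+0100.
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(0xA1, 0xAC + 1)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift invisible bytes into a printable range
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

table = bytes_to_unicode()
print(table[0x20])       # space -> "Ġ" (U+0120)
print(table[ord("l")])   # printable ASCII stays as-is -> "l"
```

If that is what's happening, decoding tokens back through the byte table (rather than printing the token strings directly) would remove the "Ġ" prefixes.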
rvz•54m ago
The technical write-up is great, but Mac users should not get too excited just yet about running 300B+ parameter models locally, as the TPS isn't that good.

>...at 4.4+ tokens/second

And that is with 4-bit quantization.

> The entire 209GB model streams from SSD through a custom Metal compute pipeline.

This is my main problem.

If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

Can't imagine using this long term right now, but improvements will follow. Still a great write-up.

Roxxik•45m ago
Does an SSD meaningfully degrade by read only workloads?
JSR_FDED•40m ago
Nope, reads don’t cause wear
etiam•35m ago
> If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

How sure are you about that? I've never looked closely at how a large mixture-of-experts LLM switches between expert modules, but if the use stays on roughly the same topic (as it often would when editing the same codebase), I wouldn't be surprised if the changes in expert composition were fairly rare and fairly small, and to the extent switching happens, it causes repeated reads from the flash disk rather than writes.

frotaur•19m ago
Afaik the experts are not usually very interpretable, and I'd generally be surprised if at least one didn't change every token. I don't know what happens in practice, but I know that at least during training, nothing is done to minimize the number of expert switches between tokens.
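For reference, a generic top-k MoE router sketch (not this repo's code; the expert count, k, and hidden size here are made up) shows why the active expert set can plausibly change on every token: each token's hidden state is scored against all experts and the top-k win, with no continuity constraint between tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d = 64, 4, 32
router_w = rng.standard_normal((d, n_experts))  # router projection

def route(hidden):
    # Score this token's hidden state against every expert and
    # activate the k highest-scoring ones.
    logits = hidden @ router_w
    return set(int(i) for i in np.argsort(logits)[-k:])

a = route(rng.standard_normal(d))  # experts for token 1
b = route(rng.standard_normal(d))  # experts for token 2
print(len(a), len(a & b))  # k experts each; the overlap between tokens varies
```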
hrmtst93837•31m ago
If you want decent throughput and do not care about burning SSD write cycles on a box that was never meant to act like a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I wouldn't expect Apple's warranty team to be much help.
K0balt•7m ago
Is it doing a bunch of ssd writes?
JSR_FDED•45m ago
This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?
rado•40m ago
MacBook Pro M5 Pro and M5 Max have such SSD speed
Roxxik•17m ago
IO is very bursty in these setups. When the router results are in you can start loading experts from SSD. In this brief moment the SSD is saturated.

Outside of that the SSD is idling.

Table 3 shows, for K=4 experts, an IO of 943 MB/Tok at 3.15 Tok/s, giving an average IO of 2970 MB/s, far below what the SSD could do.

I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors parallelizing compute with IO.

Not sure if this works on Mac, I only tested my larger than RAM setup on Linux with io_uring O_DIRECT reads and I saw that about 20% of total reads do finish while my fused upgate matmul is already running.

Edit: Typos
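The averaging step in that comment checks out; a one-liner using the Table 3 numbers quoted above:

```python
# Average SSD bandwidth implied by Table 3's K=4 figures (quoted above).
io_per_token_mb = 943    # MB read from SSD per generated token
tokens_per_sec = 3.15    # generation rate
avg_bw = io_per_token_mb * tokens_per_sec
print(f"{avg_bw:.0f} MB/s")  # -> 2970 MB/s average, vs ~15 GB/s peak cited elsewhere
```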

bertili•39m ago
Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?
K0balt•4m ago
My thoughts exactly. Something like this could make it so that modest GPU capacity, like a pair of 3090s , and lots of RAM could make big inference more practical for personal labs
pdyc•34m ago
Impressive. I wish someone would take a stab at using this technique on mobile GPUs; even if it doesn't use storage it would still be a win. I am running llama.cpp on an Adreno 830 with OpenCL and I am getting a pathetic 2-3 t/s for output tokens.
vilequeef•29m ago
Why so much RAM?
vilequeef•14m ago
Oh Mac, unified. Sometimes it takes a downvote
zozbot234•17m ago
The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box?

Hawaii tests asphalt made with recycled plastics and fishing nets

https://phys.org/news/2026-03-hawaii-asphalt-recycled-plastics-fishing.html
1•Brajeshwar•1m ago•0 comments

The Canadian Caper

https://en.wikipedia.org/wiki/Canadian_Caper
1•debo_•2m ago•0 comments

A Case Against Currying

https://emi-h.com/articles/a-case-against-currying.html
1•emih•6m ago•0 comments

New Open Source from Non-Traditional Builder

1•BrainDAnderson•9m ago•0 comments

Convincing Is Not Persuading

https://blog.alaindichiappari.dev/p/convincing-is-not-persuading
2•alainrk•10m ago•0 comments

Psychosis-as-a-Service

https://tikiver.se/posts/psychosis-as-a-service/
1•news_hacker•11m ago•0 comments

Apple's intentional crippling of Mobile Safari continues

https://pwa.gripe/
4•xd1936•15m ago•0 comments

CFO-stack: double-entry accounting setup for codex/Claude, inspired by gstack

https://github.com/MikeChongCan/cfo-stack
1•imWildCat•15m ago•0 comments

Turns out your coffee addiction may be doing your brain a favor

https://www.theregister.com/2026/03/21/turns_out_your_coffee_addiction/
2•Bender•17m ago•1 comments

We keep finding the raw material of DNA in asteroids–what's it telling us?

https://arstechnica.com/science/2026/03/we-keep-finding-the-raw-material-of-dna-in-asteroids-what...
1•Bender•18m ago•0 comments

15 years of building a lucid dreaming device: from EEG to machine vision

https://www.inspec.me/history
1•MichaelCoder•18m ago•0 comments

Tell HN: macOS supports instant snapshot rollbacks

2•concinds•19m ago•0 comments

Mining the Deep Ocean

https://knowablemagazine.org/content/article/physical-world/2026/deep-sea-mining-debate-critical-...
1•Brajeshwar•19m ago•0 comments

GT255: ICBM Test Launch Verifies Multiple Reentry Vehicle and System Reliability

https://www.afgsc.af.mil/News/Article-Display/Article/4420558/gt-255-icbm-test-launch-verifies-mu...
1•Bender•19m ago•0 comments

I spent the last 6 months learning how to automate my boring tasks with

1•farahkassbi•21m ago•0 comments

The Dude

https://yusufaytas.com/the-dude/
2•yusufaytas•23m ago•0 comments

Show HN: I replaced every function in a codebase with English – it still works

https://tril.cc
2•kulesh•24m ago•1 comments

Why a Child's Birth Month Could Play a Major Role in Their Mental Health

https://studyfinds.com/why-childs-birth-month-could-play-major-role-in-mental-health/
1•akyuu•26m ago•0 comments

Power Causes Brain Damage

https://www.theatlantic.com/magazine/archive/2017/07/power-causes-brain-damage/528711/
3•andsoitis•27m ago•1 comments

Cppsp v1.5.2 OOP system –Derive and Extension

https://github.com/user19870/cppsp
1•user19870•28m ago•0 comments

Looking at Unity made me understand the point of C++ coroutines

https://mropert.github.io/2026/03/20/unity_cpp_coroutines/
1•fanf2•28m ago•0 comments

Horsehair: The Stuff of Early and Modern Luxury Mattresses

https://www.beds.org/blog/horsehair-the-stuff-of-early-and-modern-luxury-mattresses/
1•thunderbong•29m ago•0 comments

Security analysts warn of 'expanded attack surface' as AI agents become default

https://www.cryptopolitan.com/analysts-warn-of-attack-ai-agents/
1•adrianwaj•29m ago•2 comments

Steve-eval – getting AI to write like me

https://stevekrouse.com/eval
1•stevekrouse•30m ago•1 comments

Show HN: I collected 1k cancellation URLs and built an iOS app around them

https://apps.apple.com/us/app/subscriptioncat-sub-tracker/id6760429188
1•hiroshichan•31m ago•0 comments

Show HN: AgentVerse – Open social network for AI agents (Mar 2026)

https://nickakre.github.io/agentverse-social/
2•nickakre•32m ago•0 comments

Achieving Zero Bugs: Rust, Specs, and AI Coding

https://www.borg.org/?p=1472
1•vinhnx•34m ago•0 comments

Creating a DAW in Rust

https://whoisryosuke.com/blog/2026/creating-a-daw-in-rust/
3•vinhnx•36m ago•0 comments

Ask HN: 35, 0 CS background, built real apps with AI, need suggestions

1•bond_builds•39m ago•1 comments

CERN eggheads burn AI into silicon to stem data deluge

https://www.theregister.com/2026/03/22/cern_eggheads_burn_ai_into/
1•Brajeshwar•40m ago•0 comments