Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM

69•mft_•1h ago

Comments

homarp•55m ago

/r/localllama discussion: https://old.reddit.com/r/LocalLLaMA/comments/1rxmmu5/running...

harshhhhhhhhh•55m ago

seems promising, this is the way, can someone benchmark this?

frwickst•54m ago

I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command: ./infer --prompt "Explain quantum computing" --tokens 100

MacBook Pro M5 Pro (64GB RAM)

logicallee•25m ago

can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.

frwickst•21m ago

Since the output is quite long, here is a link: https://pastebin.com/k76wiVGP

hrimfaxi•10m ago

Why does this G character appear to prefix most of the output? ("Ġlike")

rvz•54m ago

The technical write up is great, but Mac users should not get too excited just yet about running 300B+ parameter models locally, as the TPS isn't that good.

>...at 4.4+ tokens/second

And that is the speed even with 4-bit quantization.

> The entire 209GB model streams from SSD through a custom Metal compute pipeline.

This is my main problem.

If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

Can't imagine using this in the long term right now, but improvements will follow. Still a great write up anyways.

Roxxik•45m ago

Does an SSD meaningfully degrade by read only workloads?

JSR_FDED•40m ago

Nope, reads don’t cause wear

etiam•35m ago

> If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

How sure are you about that? I've never looked closely at how a large LLM with a mixture-of-experts architecture switches between expert modules, but if usage stays on roughly the same topic (as it often would when editing the same codebase), I wouldn't be surprised if changes in the set of active experts are fairly rare and fairly small, and to the extent they happen, what they tend to cause is repeated reads from the flash disk rather than writes.

frotaur•19m ago

Afaik the experts are not usually very interpretable, and I would generally be surprised if at least one did not change every token. I don't know what happens in practice, but I know that at least during training, nothing is done to minimize the number of expert switches between tokens.

hrmtst93837•31m ago

If you want decent throughput and do not care about burning SSD write cycles on a box that was never meant to act like a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I wouldn't expect Apple's warranty team to be much help.

K0balt•7m ago

Is it doing a bunch of ssd writes?

JSR_FDED•45m ago

This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?

rado•40m ago

MacBook Pro M5 Pro and M5 Max have such SSD speed

Roxxik•17m ago

IO is very bursty in these setups. When the router results are in you can start loading experts from SSD. In this brief moment the SSD is saturated.

Outside of that the SSD is idling.

Table 3 shows, for K=4 experts, an IO of 943 MB/Tok at 3.15 Tok/s, giving an average IO of 2970 MB/s, far below what the SSD could do.
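
The arithmetic behind that average can be checked in a few lines. This is only a sketch: the 943 MB/Tok and 3.15 Tok/s figures are the ones quoted above, while the 15 GB/s peak is an assumption taken from the number mentioned elsewhere in this thread, not a measured value.

```python
# Figures quoted in the comment above (Table 3, K=4 experts).
mb_per_token = 943         # MB read from SSD per generated token
tokens_per_second = 3.15   # reported decode speed

# Average read rate is simply bytes per token times tokens per second.
average_mb_per_s = mb_per_token * tokens_per_second

# ASSUMPTION: peak sequential read rate, taken from the ~15 GB/s
# figure mentioned earlier in the thread.
assumed_peak_mb_per_s = 15_000
utilization = average_mb_per_s / assumed_peak_mb_per_s

print(f"average read rate: {average_mb_per_s:.0f} MB/s")
print(f"share of assumed peak: {utilization:.0%}")
```

Under that assumed peak, the average works out to roughly a fifth of what the SSD could deliver, which fits the description of short saturated bursts followed by idle time.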

I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors parallelizing compute with IO.

Not sure if this works on Mac. I only tested my larger-than-RAM setup on Linux with io_uring O_DIRECT reads, and I saw that about 20% of total reads finish while my fused upgate matmul is already running.
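
The general idea of overlapping reads with compute can be sketched as below. This is not the io_uring/O_DIRECT setup described above and not the project's Metal pipeline: it swaps in a plain background thread with `os.pread` (Unix only), and the file layout, `CHUNK` size, and all function names are hypothetical stand-ins.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB per hypothetical "expert" blob


def read_expert(fd, index):
    # os.pread reads at an explicit offset, so no shared file position is used.
    return os.pread(fd, CHUNK, index * CHUNK)


def compute(blob):
    # Stand-in for a matmul: a cheap checksum over a sample of the bytes.
    return sum(blob[::4096])


def run_pipeline(path, expert_ids):
    """Read expert i+1 on a worker thread while computing on expert i."""
    results = []
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(read_expert, fd, expert_ids[0])
            for next_id in expert_ids[1:]:
                blob = pending.result()                          # wait for current read
                pending = pool.submit(read_expert, fd, next_id)  # start next read
                results.append(compute(blob))                    # compute while it runs
            results.append(compute(pending.result()))
    finally:
        os.close(fd)
    return results


def make_demo_file(n_experts):
    # Demo data: expert i is CHUNK bytes, each with value i.
    f = tempfile.NamedTemporaryFile(delete=False)
    for i in range(n_experts):
        f.write(bytes([i]) * CHUNK)
    f.close()
    return f.name


demo_path = make_demo_file(4)
checksums = run_pipeline(demo_path, [2, 0, 3, 1])
os.remove(demo_path)
print(checksums)
```

A single worker is enough here because only one read is in flight at a time; a real setup would likely keep several reads queued and bypass the page cache.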

Edit: Typos

bertili•39m ago

Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?

K0balt•4m ago

My thoughts exactly. Something like this could make it so that modest GPU capacity, like a pair of 3090s, and lots of RAM could make big inference more practical for personal labs.

pdyc•34m ago

Impressive. I wish someone would take a stab at using this technique on mobile GPUs; even if it does not use storage, it would still be a win. I am running llama.cpp on an Adreno 830 with OpenCL and I am getting a pathetic 2-3 t/s for output tokens.

vilequeef•29m ago

Why so much RAM?

vilequeef•14m ago

Oh Mac, unified. Sometimes it takes a downvote

zozbot234•17m ago

The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box?
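
The three sizes in that question follow from the page-table geometry. A minimal sketch of the arithmetic, assuming 8-byte table entries and the contiguous-run lengths used by Linux on arm64 with a 16k granule (128 PTEs, 32 PMDs); whether macOS exposes any of this is exactly the open question. The 209 GB model size is the figure quoted earlier in the thread, treated here as decimal gigabytes.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

base_page = 16 * KIB
entry_bytes = 8                               # ASSUMPTION: 8-byte table entries
entries_per_table = base_page // entry_bytes  # one table fills one page

# ASSUMPTION: contiguous-hint run lengths as in Linux arm64, 16k granule.
cont_ptes = 128
cont_pmds = 32

cont_pte_size = cont_ptes * base_page           # contiguous run of base pages
pmd_block_size = entries_per_table * base_page  # one PMD-level block mapping
cont_pmd_size = cont_pmds * pmd_block_size      # contiguous run of PMD blocks

model_bytes = 209 * 10**9  # model size quoted in the thread, assumed decimal GB

for name, size in [
    ("base page", base_page),
    ("CONT PTE", cont_pte_size),
    ("PMD block", pmd_block_size),
    ("CONT PMD", cont_pmd_size),
]:
    mappings = -(-model_bytes // size)  # ceiling division
    print(f"{name:>9}: {size // KIB:>8} KiB, {mappings:>10} mappings for the model")
```

Each step up cuts the number of mappings needed to cover the model by a large factor, which is why the per-page overhead would be expected to shrink if such mappings were available.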
