frontpage.

Made with ♥ by @iamnishanth

Open Source @Github


Bored of eating your own dogfood? Try smelling your own farts

https://shkspr.mobi/blog/2026/03/bored-of-eating-your-own-dogfood-try-smelling-your-own-farts/
51•ColinWright•48m ago•5 comments

Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM

https://github.com/danveloper/flash-moe
115•mft_•2h ago•38 comments

Hormuz Minesweeper – Are you tired of winning?

https://hormuz.pythonic.ninja/
413•PythonicNinja•5h ago•249 comments

A Case Against Currying

https://emi-h.com/articles/a-case-against-currying.html
11•emih•57m ago•9 comments

25 Years of Eggs

https://www.john-rush.com/posts/eggs-25-years-20260219.html
111•avyfain•3d ago•37 comments

Project Nomad – Knowledge That Never Goes Offline

https://www.projectnomad.us
16•jensgk•1h ago•3 comments

More common mistakes to avoid when creating system architecture diagrams

https://www.ilograph.com/blog/posts/more-common-diagram-mistakes/
25•billyp-rva•2h ago•9 comments

A Review of Dice That Came with the White Castle

https://boardgamegeek.com/thread/3533812/a-review-of-dice-that-came-with-the-white-castle
18•doener•3d ago•2 comments

Node.js worker threads are problematic, but they work great for us

https://www.inngest.com/blog/node-worker-threads
21•goodoldneon•3d ago•9 comments

Convincing Is Not Persuading

https://blog.alaindichiappari.dev/p/convincing-is-not-persuading
5•alainrk•1h ago•3 comments

Revise – An AI Editor for Documents

https://revise.io
7•artursapek•32m ago•3 comments

My first patch to the Linux kernel

https://pooladkhay.com/posts/first-kernel-patch/
147•pooladkhay•2d ago•25 comments

Brute-Forcing My Algorithmic Ignorance with an LLM in 7 Days

http://blog.dominikrudnik.pl/my-google-recruitment-journey-part-1
10•qikcik•1h ago•2 comments

Tinybox – A powerful computer for deep learning

https://tinygrad.org/#tinybox
531•albelfio•17h ago•304 comments

The three pillars of JavaScript bloat

https://43081j.com/2026/03/three-pillars-of-javascript-bloat
381•onlyspaceghost•11h ago•221 comments

Some things just take time

https://lucumr.pocoo.org/2026/3/20/some-things-just-take-time/
763•vaylian•23h ago•244 comments

$ teebot.dev – from terminal to tee in 6 seconds

https://teebot.dev
7•foxpress•1h ago•6 comments

How We Synchronized Editing for Rec Room's Multiplayer Scripting System

https://www.tyleo.com/blog/how-we-synchronized-editing-for-rec-rooms-multiplayer-scripting-system
6•tyleo•1h ago•4 comments

Professional video editing, right in the browser with WebGPU and WASM

https://tooscut.app/
307•mohebifar•16h ago•111 comments

Chest Fridge (2009)

https://mtbest.net/chest-fridge/
141•wolfi1•12h ago•77 comments

'Miracle': Europe reconnects with lost spacecraft

https://phys.org/news/2026-03-miracle-europe-reconnects-lost-spacecraft.html
59•vrganj•3h ago•24 comments

Turns out your coffee addiction may be doing your brain a favor

https://www.theregister.com/2026/03/21/turns_out_your_coffee_addiction/
13•Bender•1h ago•2 comments

Vatican Rebukes Peter Thiel's Antichrist Lectures in Rome

https://www.thenerdreich.com/peter-thiels-antichrist-circus-smacked-down-in-rome/
68•vrganj•4h ago•42 comments

Windows native app development is a mess

https://domenic.me/windows-native-dev/
61•domenicd•4h ago•59 comments

Floci – A free, open-source local AWS emulator

https://github.com/hectorvent/floci
229•shaicoleman•16h ago•66 comments

HopTab – free, open-source macOS app switcher and tiler that replaces Cmd+Tab

https://www.royalbhati.com/hoptab
62•robhati•7h ago•18 comments

Electronics for Kids, 2nd Edition

https://nostarch.com/electronics-for-kids-2e
221•0x54MUR41•3d ago•47 comments

The IBM scientist who rewrote the rules of information just won a Turing Award

https://www.ibm.com/think/news/ibm-scientist-charles-bennett-turing-award
3•rbanffy•2h ago•0 comments

Boomloom: Think with your hands

https://www.theboomloom.com
146•rasengan0•1d ago•15 comments

Bayesian statistics for confused data scientists

https://nchagnet.pages.dev/blog/bayesian-statistics-for-confused-data-scientists/
150•speckx•3d ago•42 comments

Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM

https://github.com/danveloper/flash-moe
113•mft_•2h ago

Comments

homarp•1h ago
/r/localllama discussion: https://old.reddit.com/r/LocalLLaMA/comments/1rxmmu5/running...
harshhhhhhhhh•1h ago
Seems promising, this is the way. Can someone benchmark this?
frwickst•1h ago
I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command:

./infer --prompt "Explain quantum computing" --tokens 100

MacBook Pro M5 Pro (64GB RAM)

logicallee•1h ago
can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.
frwickst•1h ago
Since the output is quite long, here is a link: https://pastebin.com/k76wiVGP
hrimfaxi•1h ago
Why does this G character appear to prefix most of the output? ("Ġlike")
kgeist•30m ago
The original tokens have Ġ instead of space. I had this issue too when writing an inference engine for Qwen. You have to "normalize" those special characters.
frwickst•29m ago
Most likely it's a tokenizer artifact (https://github.com/huggingface/transformers/issues/4786). The output is not being properly decoded in this case; the Ġ should just be a space.
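The fix kgeist describes can be sketched in a few lines, assuming GPT-2/Qwen-style byte-level BPE, where a leading space is encoded as the printable character Ġ (U+0120):

```python
# Byte-level BPE (GPT-2/Qwen style) maps the space byte 0x20 to the printable
# character Ġ (U+0120); a decoder must map it back before displaying text.
def normalize_bpe_text(text: str) -> str:
    return text.replace("\u0120", " ")

print(normalize_bpe_text("I\u0120like\u0120quantum\u0120computing"))
# -> I like quantum computing
```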
j45•41m ago
Appreciate the data point. M5 Max would also be interesting to see once available in desktop form.
rvz•1h ago
The technical write-up is great, but Mac users should not get too excited just yet about running 300B+ parameter models locally, as the TPS isn't that good.

>...at 4.4+ tokens/second

And that is the speed even with 4-bit quantization.

> The entire 209GB model streams from SSD through a custom Metal compute pipeline.

This is my main problem.

If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

Can't imagine using this in the long term right now, but improvements will follow. Still a great write-up anyway.

Roxxik•1h ago
Does an SSD meaningfully degrade by read only workloads?
JSR_FDED•1h ago
Nope, reads don’t cause wear
zozbot234•18m ago
No appreciable wear of course, but read disturb (requiring sporadic rewrites) becomes more of an issue as NAND fabrication advances.
etiam•1h ago
> If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

How sure are you about that? I've never looked closely at how a large mixture-of-experts LLM switches between expert modules, but if usage stays on roughly the same topic (as it often would when editing the same codebase), I wouldn't be surprised if the changes in expert composition were fairly rare and fairly small. And to the extent switching does happen, it causes repeated reads from the flash disk rather than writes.

frotaur•1h ago
Afaik the experts are not usually very interpretable, and I'd generally be surprised if at least one didn't change every token. I don't know what happens in practice, but at least during training, nothing is done to minimize the number of expert switches between tokens.
hrmtst93837•1h ago
If you want decent throughput and do not care about burning SSD write cycles on a box that was never meant to act like a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I wouldn't expect Apple's warranty team to be much help.
K0balt•58m ago
Is it doing a bunch of ssd writes?
Wowfunhappy•53m ago
Eh. I mean, 4 tokens a second works fine if you're patient. Go do something else while you wait.

I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.

Also, reading data doesn't cause SSD wear.

JSR_FDED•1h ago
This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?
rado•1h ago
The MacBook Pro M5 Pro and M5 Max have SSDs that fast.
selimthegrim•44m ago
I have an MBP M4 Pro and a WD Black SN850x in an external TB5 enclosure and I easily get 6-7 GB/s
Roxxik•1h ago
IO is very bursty in these setups. When the router results are in you can start loading experts from SSD. In this brief moment the SSD is saturated.

Outside of that the SSD is idling.

Table 3 shows, for K=4 experts, an IO of 943 MB/tok at 3.15 tok/s, giving an average IO of 2970 MB/s, far below what the SSD could do.

I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors, parallelizing compute with IO.

Not sure if this works on Mac; I only tested my larger-than-RAM setup on Linux with io_uring O_DIRECT reads, and I saw that about 20% of total reads finish while my fused up/gate matmul is already running.

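The average-bandwidth figure in Roxxik's comment follows directly from the two Table 3 numbers quoted above:

```python
# Reproducing the average-IO figure from the Table 3 numbers quoted above.
mb_per_token = 943       # expert IO per generated token (K=4 experts)
tokens_per_s = 3.15      # measured decode speed
avg_io_mb_s = mb_per_token * tokens_per_s
print(round(avg_io_mb_s))  # -> 2970, well under the SSD's burst bandwidth
```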

zozbot234•43m ago
The github page mentions that you can't overlap SSD traffic and GPU compute on Apple Silicon; you get heavy contention for the shared hardware resources.
bertili•1h ago
Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?
K0balt•54m ago
My thoughts exactly. Something like this could make it so that modest GPU capacity, like a pair of 3090s, plus lots of RAM could make big inference more practical for personal labs.
zozbot234•45m ago
Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on GPU, since decode is not compute-limited and the CPU-GPU transfer involves overhead. It's best to use the GPU for speeding up the shared part of the model.
pdyc•1h ago
Impressive. I wish someone would take a stab at using this technique on mobile GPUs; even if it doesn't use storage, it would still be a win. I'm running llama.cpp on an Adreno 830 with OpenCL and getting a pathetic 2-3 t/s for output tokens.
vilequeef•1h ago
Why so much RAM?
vilequeef•1h ago
Oh Mac, unified. Sometimes it takes a downvote
zozbot234•1h ago
The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box? Alternatively, one might use a simple mmap and then something like posix_fadvise to set up prefetching of the data.
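For reference, the plain-mmap baseline zozbot234 contrasts against can be sketched as below (the file is a stand-in for a model shard, and `madvise` flag availability is platform-dependent; this is not the project's actual Metal pipeline):

```python
import mmap
import os
import tempfile

# Sketch of the naive approach: map a (stand-in) weights file read-only and
# hint the kernel to prefetch it, rather than faulting one page at a time.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(b"expert-weights" * 1024)    # dummy payload standing in for a shard

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mmap, "MADV_WILLNEED"):   # madvise flags vary by platform
        mm.madvise(mmap.MADV_WILLNEED)   # kick off readahead for the mapping
    head = bytes(mm[:14])
    mm.close()

print(head)  # -> b'expert-weights'
```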
lostmsu•51m ago
How large is the KV cache?
xbar•13m ago
0.1 GB per full-attention layer and "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention." So, 1.5 GB.
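The estimate above, spelled out (both figures are the ones quoted from the model description, not independently verified):

```python
# KV cache estimate: only full-attention layers keep a conventional KV cache;
# the GatedDeltaNet (linear attention) layers do not.
total_layers = 60
linear_attention_layers = 45          # GatedDeltaNet layers
gb_per_full_attention_layer = 0.1     # figure quoted above
kv_cache_gb = (total_layers - linear_attention_layers) * gb_per_full_attention_layer
print(round(kv_cache_gb, 1))  # -> 1.5
```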
spwa4•39m ago
Does this mean it should be possible to load up a system with ~10 SSDs (which seems to be at least the number of active experts) to get 40 tok/s even on truly gigantic models?
zozbot234•28m ago
SSD bandwidth will ultimately be limited by the amount of PCIe lanes you have available (for something other than the Apple Silicon internal storage). So the approach has inherent limitations. You can of course scale out to multiple systems to get more throughput.

You can use this approach with Intel Optane, which is wearout-resistant unlike NAND and can thus substitute for RAM. Last I checked, it was available quite cheap on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is around ~$0.1/GB but quite wearout-prone with heavy writes.)
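A back-of-envelope model of spwa4's question, under the optimistic assumption that decode is purely SSD-bandwidth-bound (the per-SSD bandwidth is an assumed figure; the PCIe-lane and compute limits mentioned above would cap it in practice):

```python
# Idealized scaling: tokens/s = aggregate SSD read bandwidth / IO per token.
mb_per_token = 943     # expert IO per token, from the Table 3 figure quoted earlier
per_ssd_mb_s = 7000    # assumed sequential read bandwidth per NVMe SSD

for n_ssds in (1, 4, 10):
    tok_per_s = n_ssds * per_ssd_mb_s / mb_per_token
    print(f"{n_ssds} SSDs: ~{tok_per_s:.1f} tok/s")
```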

tarruda•10m ago
Note that this is not the only way to run Qwen 3.5 397B on consumer devices; there are excellent ~2.5 BPW quants available that make it viable for 128GB devices.

I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

    mmlu: 87.86%
    gpqa diamond: 82.32%
    gsm8k: 86.43%
    ifeval: 75.90%

More details of my experience:

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

Overall an excellent model to have for offline inference.