
TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS

https://github.com/SharpAI/SwiftLM
55•aegis_camera•2h ago

Comments

aegis_camera•2h ago
We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.
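For readers unfamiliar with Lloyd-Max quantization, here is a minimal NumPy sketch of the idea behind such a codebook — the function names, the 16-level choice, and the Gaussian stand-in data are illustrative assumptions, not code from the repo or the paper:

```python
import numpy as np

def lloyd_max_codebook(x, levels=16, iters=50):
    """Fit a Lloyd-Max scalar quantizer: alternate between
    partitioning samples by nearest codeword and moving each
    codeword to the centroid (mean) of its cell."""
    # Initialize codewords on evenly spaced quantiles of the data.
    codebook = np.quantile(x, np.linspace(0, 1, levels + 2)[1:-1])
    for _ in range(iters):
        # Optimal decision boundaries lie midway between codewords.
        edges = (codebook[:-1] + codebook[1:]) / 2
        idx = np.digitize(x, edges)
        for k in range(levels):
            cell = x[idx == k]
            if cell.size:
                codebook[k] = cell.mean()
    return codebook

def quantize(x, codebook):
    edges = (codebook[:-1] + codebook[1:]) / 2
    idx = np.digitize(x, edges)   # 4-bit indices for 16 levels
    return idx, codebook[idx]     # stored index, dequantized value

rng = np.random.default_rng(0)
kv = rng.standard_normal(10_000).astype(np.float32)  # stand-in for KV values
cb = lloyd_max_codebook(kv, levels=16)
idx, deq = quantize(kv, cb)
```

Storing small integer indices plus a compact codebook instead of 16-bit floats is where the cache compression comes from; per the description above, the repo performs the dequantization side of this inside fused Metal shaders rather than in Python.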

SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or Watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.
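The real implementation streams expert pages into GPU buffers at NVMe speed; as a toy illustration of the mechanism (memory-map the weight file, fault in only the top-k experts' byte ranges, and let the OS page cache keep hot experts resident), here is a Python sketch where the expert count, block sizes, and file path are all made up:

```python
import mmap
import numpy as np

# Toy layout: E experts stored as contiguous byte blocks in one
# weight file. These counts and sizes are illustrative only.
E, EXPERT_BYTES = 8, 4096

def write_fake_weights(path):
    with open(path, "wb") as f:
        for e in range(E):
            f.write(bytes([e]) * EXPERT_BYTES)

def load_topk_experts(path, router_logits, k=2):
    """Read only the byte ranges of the top-k experts. Slicing the
    mmap faults in just those pages from disk; the OS page cache
    keeps recently used (hot) experts resident automatically."""
    top = np.argsort(router_logits)[-k:]
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        experts = {int(e): mm[e * EXPERT_BYTES:(e + 1) * EXPERT_BYTES]
                   for e in top}
        mm.close()
    return experts

write_fake_weights("/tmp/experts.bin")
logits = np.array([0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.9, 0.4])
active = load_topk_experts("/tmp/experts.bin", logits, k=2)  # experts 1 and 3
```

The design point this mirrors is delegating hot-expert caching to the page cache instead of an application-level cache, so resident memory stays bounded by the active expert set rather than the full weight file.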

By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.

Also tested Qwen 4B on iPhone 13 Pro.

Code and implementation details: https://github.com/SharpAI/SwiftLM

altruios•1h ago
What tokens/s are you getting with a 122B MoE model in this setup? I didn't see any benchmarks in the benchmarks section of the README.
gigatexal•50m ago
Yeah, this I'd like to see added to the README.
aegis_camera•31m ago
I'll add more details. We just wired up the pipeline on both macOS and iOS.
anemll•16m ago
Check it out, you might be able to speed it up using this https://github.com/Anemll/anemll-flash-mlx https://x.com/anemll/status/2038684375425200360
Aurornis•1h ago
Although I'm interested in both topics (KV compression and attempts to stream MoE models from storage) this is at least the 10th vibecoded project on this topic I've seen today alone across HN, Twitter, and some subreddits I visit.

At least this one gave credit to the upstream projects which it used as a reference.

The llama.cpp project is also getting a wave of vibecoded PRs that are very clearly being produced by pointing Claude at the repo and the original paper and having it produce something.

Almost none of these attempts contain information that really matters, like actual benchmark tests with different KV quantization levels (not just perplexity or KLD).

_zoltan_•1h ago
"vibe coded" is NOT the bad thing you think it is.

Going from paper to implementation from scratch in half an hour or so is great.

mjr00•58m ago
> "vibe coded" is NOT the bad thing you think it is.

It's not inherently bad in the same way that a first draft of a novel is not inherently bad.

But if someone asked me to read their novel and it was a first draft that they themselves had clearly not bothered reading or editing, I'd tell them to fuck off.

sumeno•12m ago
At least in the novel example the author had the decency to write what they're asking you to read.

These are more like sending someone a LMGTFY link they didn't ask for and expecting them to read all the results. Just a complete lack of awareness and respect for the maintainers.

brokencode•57m ago
That’s a starting spot, but how about some testing and benchmarks?

Where’s the value added if the person just tells Claude to do it and then submits a PR?

The maintainers may as well vibe code it themselves if that’s all the work the would-be contributor is going to put into it.

yieldcrv•50m ago
if it works it works

we live in a wholly unoptimized world because the available resources have been so high, while the benefits of optimizing have been so low. that has flipped now and there are tons of low hanging fruit to optimize.

I agree that benchmarks would be great, but that's only relevant to this one topic, not the overall agentic-coded pull request concept itself.

jmalicki•44m ago
It's relevant in that it's an example that people are doing the easy part - the coding - and skipping the hard part - the benchmarking and proving it works and provides value.

A PR without evidence that it works, or of the benefits the new feature would bring, is kind of worthless.

sumeno•15m ago
> if it works it works

If it works in one case that doesn't mean it works consistently or well in the general case

I've made lots of things with Claude Code that just work... until I do things in a slightly different order and the whole thing explodes

pqtyw•11m ago
It might work, but what's the point of sharing it if anyone can do the same in those 30 minutes with minimal effort?
sroussey•42m ago
The authors of the project have CC as well, so doing this is just eating their time.
aegis_camera•35m ago
Yes, this took time to test :)
simonw•21m ago
Sure, but the problem is when you take that half hour of work and share it with other people without making clear how much effort has gone into it.

Software is valuable if it has been tested and exercised properly by other people. I don't care if you vibe coded it provided you then put the real work in to verify that it actually works correctly - and then include the proof that you've done that when you start widely sharing it with the world.

Right now it's impossible to tell which of these projects implementing the paper are worth spending time with.

pqtyw•12m ago
If there's nothing valuable it contributes, though? I.e., if it's not a novel paper, then the only value is whatever you personally learn from it.
Aurornis•5m ago
> Going from paper to implementation from scratch in half an hour or so is great.

This repo isn’t showing that at all. Scroll to the bottom of the README and you’ll see the other project it was based on.

boogerlad•1h ago
Does this use anything from the flash-moe project?

https://github.com/Alexintosh/flash-moe

aegis_camera•27m ago
Yes, this is a reference project; the main difference is we don't use OS swap (it introduces latency; will add https://github.com/danveloper/flash-moe to the original references as well).
vessenes•1h ago
I like this idea on expert streaming. I've been poking around fairly thoroughly at the same idea - can we fix a set of experts? When can we fix them? How long is the top-k selection "good" for in terms of number of forward passes?

One thing I've turned up in smaller models, and am winding my way toward verifying in larger ones, is that if you train the MoE model from scratch with this kind of knockout / subset of experts baked in, then you get significantly better loss outcomes. In small models, it's actually better than training an MoE without conditioning on a reduced set of experts per pass.

Anyway, pretty cool. There's some Pareto-optimal curve based on memory bandwidth, amount of GPU / unified RAM and inference compute times for streaming stuff in.
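One crude way to put a number on "how long is the top-k selection good for" is to measure the overlap between consecutive passes' expert sets. A toy sketch with simulated, slowly drifting router logits (everything here is a hypothetical stand-in for real router outputs):

```python
import numpy as np

def topk_set(logits, k):
    return set(np.argsort(logits)[-k:].tolist())

def topk_overlap(logits_per_pass, k=2):
    """Fraction of experts shared between consecutive passes'
    top-k selections: 1.0 means a fixed expert subset would
    still have been exactly right on the next pass."""
    sets = [topk_set(l, k) for l in logits_per_pass]
    return [len(a & b) / k for a, b in zip(sets, sets[1:])]

rng = np.random.default_rng(1)
base = rng.standard_normal(8)
# Simulated router logits that drift a little more on each pass.
passes = [base + 0.1 * t * rng.standard_normal(8) for t in range(5)]
overlaps = topk_overlap(passes, k=2)
```

Run on real router logits, the point at which these overlaps fall off would suggest how many forward passes a fixed expert set stays "good" for.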

robotswantdata•45m ago
Feels 100% vibe coded in a bad way.

Llama.cpp already has KV compression and one of the turbo quant PRs will get merged at some point.

If you don’t care about the fancy 3 bit, the q8 KV compression is good enough! Don’t bother with q4

  ./build/bin/llama-server -m model.gguf \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    -c 65536

Etc

xiphias2•28m ago
Another project without running real benchmarks. It's very easy to generate tokens, it's much harder to solve tasks locally.
simonw•17m ago
I couldn't get the downloadable binary to work, or the binary I compiled myself:

  ./SwiftLM \
    --model mlx-community/Qwen3.5-122B-A10B-4bit \
    --stream-experts \
    --port 5413
Error:

  [SwiftLM] Loading model: mlx-community/Qwen3.5-122B-A10B-4bit
  [SwiftLM] Enabled Async SSD Streaming on directory: e9c67b08899964be5fdd069bb1b4bc8907fe68f5
  [SwiftLM]  Memory strategy: FULL GPU (69.6GB model, 133.4GB available)
  [SwiftLM] Download: [===================>] 100% ⠋ (66395.4 MB / 66395.4 MB) | Speed: 0.0 MB/s      
  MLX error: Failed to load the default metallib. library not found library not found library not found library not found  at /Users/runner/work/SwiftLM/SwiftLM/LocalPackages/mlx-swift/Source/Cmlx/mlx-c/mlx/c/stream.cpp:115
aegis_camera•15m ago
Let me check - I had seen a metallib error during development.
aegis_camera•5m ago
Please let me know if this fixes the issue:

  git clone https://github.com/SharpAI/SwiftLM   # no --recursive needed
  cd SwiftLM
  swift build -c release

  # Copy metallib next to the binary (one-time step)
  cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
    .build/release/

gervwyk•7m ago
Anyone else looking at these developments and thinking that local LLMs are the future? So many advantages over remote, and the hardware is just not there yet, but another leap like Apple Silicon and the tech is there.

Of course large corps will have fancy proprietary models, but for everyday queries and tasks, local feels like a huge win that's just slightly out of reach.

Am I missing something fundamental?

Artemis II astronauts arrive at launch pad 39B in an astrovan

https://techfixated.com/artemis-ii-astronauts-arrive-at-launch-pad-39b-in-an-astrovan/
25•benlarweh•36m ago•7 comments

You're still signing data structures the wrong way

https://blog.foks.pub/posts/domain-separation-in-idl/
14•malgorithms•28m ago•4 comments

EmDash – a spiritual successor to WordPress that solves plugin security

https://blog.cloudflare.com/emdash-wordpress/
332•elithrar•4h ago•231 comments

Ask HN: Who is hiring? (April 2026)

139•whoishiring•5h ago•114 comments

AI for American-produced cement and concrete

https://engineering.fb.com/2026/03/30/data-center-engineering/ai-for-american-produced-cement-and...
94•latchkey•3h ago•73 comments

Windows 95 defenses against installers that overwrite a file with an older one

https://devblogs.microsoft.com/oldnewthing/20260324-00/?p=112159
17•michelangelo•3d ago•2 comments

Show HN: Git bayesect – Bayesian Git bisection for non-deterministic bugs

https://github.com/hauntsaninja/git_bayesect
52•hauntsaninja•3d ago•6 comments

StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)

https://app.uniclaw.ai/arena?tab=costEffectiveness&via=hn
96•skysniper•4h ago•36 comments

Show HN: Zerobox – Sandbox any command with file, network, credential controls

https://github.com/afshinm/zerobox
56•afshinmeh•2d ago•60 comments

CERN levels up with new superconducting karts

https://home.cern/news/news/engineering/cern-levels-new-superconducting-karts
363•fnands•12h ago•80 comments

An Introduction to Writing Systems and Unicode

https://r12a.github.io/scripts/tutorial/part2
32•mariuz•3d ago•6 comments

Apple at 50

https://www.apple.com/
53•janandonly•1h ago•28 comments

Show HN: Real-time dashboard for Claude Code agent teams

https://github.com/simple10/agents-observe
55•simple10•3h ago•18 comments

The OpenAI Graveyard: All the Deals and Products That Haven't Happened

https://www.forbes.com/sites/phoebeliu/2026/03/31/openai-graveyard-deals-and-products-havent-happ...
165•dherls•4h ago•129 comments

The AI Marketing BS Index

https://bastian.rieck.me/blog/2026/bs/
61•speckx•2h ago•6 comments

NASA Artemis II moon mission live launch broadcast

https://plus.nasa.gov/scheduled-video/nasas-artemis-ii-crew-launches-to-the-moon-official-broadcast/
235•apitman•3h ago•136 comments

Is BGP safe yet?

https://isbgpsafeyet.com/
208•janandonly•7h ago•72 comments

Random numbers, Persian code: A mysterious signal transfixes radio sleuths

https://www.rferl.org/a/mystery-numbers-station-persian-signal-iran-war/33700659.html
90•thinkingemote•8h ago•90 comments

Ada and Spark on ARM Cortex-M – A Tutorial with Arduino and Nucleo Examples

http://inspirel.com/articles/Ada_On_Cortex.html
46•swq115•4d ago•13 comments

Ukrainian Drone Holds Position for 6 Weeks

https://defenceleaders.com/news/ukrainian-combat-robot-holds-frontline-position-for-six-weeks-in-...
81•AftHurrahWinch•2h ago•48 comments

Wasmer (YC S19) Is Hiring – Rust and DevRel Positions

https://www.workatastartup.com/companies/wasmer
1•syrusakbary•8h ago

Intuiting Pratt Parsing

https://louis.co.nz/2026/03/26/pratt-parsing.html
126•signa11•2d ago•42 comments

Claude Wrote a Full FreeBSD Remote Kernel RCE with Root Shell (CVE-2026-4747)

https://github.com/califio/publications/blob/main/MADBugs/CVE-2026-4747/write-up.md
213•ishqdehlvi•14h ago•95 comments

Consider the Greenland Shark (2020)

https://www.lrb.co.uk/the-paper/v42/n09/katherine-rundell/consider-the-greenland-shark
74•mooreds•5d ago•30 comments

Randomness on Apple Platforms (2024)

https://blog.xoria.org/randomness-on-apple-platforms/
44•surprisetalk•5d ago•1 comments

Show HN: CLI to order groceries via reverse-engineered REWE API (Haskell)

https://github.com/yannick-cw/korb
183•wazHFsRy•2d ago•79 comments

A dot a day keeps the clutter away

https://scottlawsonbc.com/post/dot-system
525•scottlawson•23h ago•156 comments

Claude Code Unpacked : A visual guide

https://ccunpacked.dev/
981•autocracy101•15h ago•351 comments

Chess in SQL

https://www.dbpro.app/blog/chess-in-pure-sql
169•upmostly•3d ago•42 comments