
TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS

https://github.com/SharpAI/SwiftLM
43•aegis_camera•1h ago

Comments

aegis_camera•1h ago
We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.
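The actual implementation (V3 codebooks, dequantization fused into Metal shaders) lives in the repo; as a rough illustration of the Lloyd-Max idea behind it, here is a minimal scalar-quantizer sketch in Python. The function names and the 16-level (4-bit) choice are illustrative, not taken from the project:

```python
import numpy as np

def lloyd_max_codebook(samples, levels=16, iters=50):
    """Train a 1-D Lloyd-Max quantizer: alternate between setting the
    decision boundaries to midpoints between codewords, and setting each
    codeword to the mean of the samples that fall in its cell."""
    # initialize codewords at evenly spaced sample quantiles
    codebook = np.quantile(samples, np.linspace(0.0, 1.0, levels))
    for _ in range(iters):
        bounds = (codebook[:-1] + codebook[1:]) / 2   # midpoint boundaries
        cells = np.digitize(samples, bounds)          # assign samples to cells
        for k in range(levels):
            members = samples[cells == k]
            if members.size:                          # leave empty cells unchanged
                codebook[k] = members.mean()
        codebook.sort()
    return codebook

def quantize(x, codebook):
    """Map each value to the index of its nearest codeword."""
    bounds = (codebook[:-1] + codebook[1:]) / 2
    return np.digitize(x, bounds).astype(np.uint8)

def dequantize(idx, codebook):
    """Look the codewords back up (the step the project fuses into Metal)."""
    return codebook[idx]
```

Storing 4-bit indices instead of fp16 values is where a roughly 4x compression of the KV cache comes from; the paper's per-channel codebooks and packing are omitted here.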

SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or Watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.
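The real pipeline streams pages to the GPU via Metal, but the core trick — leave the weights on disk, fault in only the routed experts, and let the OS page cache handle hot-expert reuse — can be sketched with mmap in Python. The file layout (contiguous, fixed-size expert blocks) and all names here are hypothetical:

```python
import mmap
import numpy as np

def open_weights(path):
    """Memory-map the weight file read-only. Nothing is copied into RAM
    up front; pages are faulted in from NVMe only when sliced, and the
    OS page cache keeps recently used (hot) experts resident."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def load_experts(mm, expert_ids, expert_bytes, base_offset=0):
    """Materialize only the routed (top-k) experts for this forward pass.
    Hypothetical layout: expert e lives at base_offset + e * expert_bytes."""
    out = {}
    for e in sorted(expert_ids):
        off = base_offset + e * expert_bytes
        # slicing the mmap is what triggers the page-in from disk
        out[e] = np.frombuffer(mm[off:off + expert_bytes], dtype=np.float16)
    return out
```

Because only the top-k experts are touched per pass, resident memory stays proportional to the active expert set rather than the full ~60 GB file.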

By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.

Also tested Qwen 4B on iPhone 13 Pro.

Code and implementation details: https://github.com/SharpAI/SwiftLM

altruios•35m ago
What tokens/s are you getting with the 122B MoE model in this setup? I didn't see any numbers in the benchmarks section of the README.md.
gigatexal•12m ago
Yeah, this I'd like to see added to the README.
Aurornis•35m ago
Although I'm interested in both topics (KV compression and attempts to stream MoE models from storage) this is at least the 10th vibecoded project on this topic I've seen today alone across HN, Twitter, and some subreddits I visit.

At least this one gave credit to the upstream projects which it used as a reference.

The llama.cpp project is also getting a wave of vibecoded PRs that are very clearly being produced by pointing claude at the repo and the original paper and having it produce something.

Almost none of these attempts contain information that really matters, like actual benchmark tests with different KV quantization levels (not just perplexity or KLD).

_zoltan_•27m ago
"vibe coded" is NOT the bad thing you think it is.

Going from paper to implementation from scratch in half an hour or so is great.

mjr00•21m ago
> "vibe coded" is NOT the bad thing you think it is.

It's not inherently bad in the same way that a first draft of a novel is not inherently bad.

But if someone asked me to read their novel and it was a first draft that they themselves had clearly not bothered reading or editing, I'd tell them to fuck off.

brokencode•19m ago
That’s a starting spot, but how about some testing and benchmarks?

Where’s the value added if the person just tells Claude to do it and then submits a PR?

The maintainers may as well vibe code it themselves if that’s all the work the would-be contributor is going to put into it.

yieldcrv•12m ago
if it works it works

we live in a wholly unoptimized world because the available resources have been so high while the benefits of optimizing have been so low. That has flipped now, and there are tons of low-hanging fruit to optimize.

I agree that benchmarks would be great, but that's only relevant to this one topic, not to the overall agentic-coded pull request concept itself.

jmalicki•7m ago
It's relevant in that it's an example of people doing the easy part (the coding) and skipping the hard part (benchmarking, and proving it works and provides value).

A PR without evidence that it works, or any estimate of the benefits the new feature would bring, is kind of worthless.

sroussey•4m ago
The authors of the project have Claude Code as well, so doing this just eats their time.
boogerlad•35m ago
Does this use anything from the flash-moe project?

https://github.com/Alexintosh/flash-moe

vessenes•28m ago
I like this idea on expert streaming. I've been poking around fairly thoroughly at the same idea: can we fix a set of experts? When can we fix them? How long is the top-k selection "good" for, in terms of number of forward passes?
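One way to make the "how long is the selection good for" question concrete is to track the overlap of the routed expert sets across consecutive forward passes. A minimal sketch (names and the Jaccard-overlap metric are my choices, not anything from the projects discussed):

```python
import numpy as np

def topk_experts(router_logits, k):
    """Indices of the k highest-scoring experts for one forward pass."""
    return set(np.argpartition(router_logits, -k)[-k:].tolist())

def selection_stability(logits_per_pass, k):
    """Jaccard overlap of the routed expert set between consecutive
    forward passes: 1.0 means the top-k set did not change at all,
    0.0 means it was completely replaced."""
    sets = [topk_experts(l, k) for l in logits_per_pass]
    return [len(a & b) / len(a | b) for a, b in zip(sets, sets[1:])]
```

A long run of high overlaps would suggest a fixed expert subset (and hence fewer SSD page-ins) stays "good" for many passes.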

One thing I've turned up in smaller models, and am winding my way toward verifying in larger ones, is that if you train the MoE model from scratch with this kind of knockout / subset of experts baked in, you get significantly better loss outcomes. In small models, it's actually better than training an MoE without conditioning on a reduced set of experts per pass.

Anyway, pretty cool. There's some Pareto-optimal curve based on memory bandwidth, amount of GPU / unified RAM and inference compute times for streaming stuff in.

robotswantdata•8m ago
Feels 100% vibe coded in a bad way.

llama.cpp already has KV compression, and one of the TurboQuant PRs will get merged at some point.

If you don’t care about the fancy 3-bit, the q8 KV compression is good enough! Don’t bother with q4.

./build/bin/llama-server -m model.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 65536

Etc
