
Show HN: OS Megakernel that matches M5 Max Tok/W at 2x the Throughput on RTX 3090

https://github.com/Luce-Org/luce-megakernel
4•GreenGames•1h ago
Hey there, we fused all 24 layers of Qwen3.5-0.8B (a hybrid DeltaNet + Attention model) into a single CUDA kernel launch and made it open source for everyone to try.

On an RTX 3090 power-limited to 220W:
- 411 tok/s vs 229 tok/s on M5 Max (1.8x)
- 1.87 tok/J, beating M5 Max efficiency
- 1.55x faster decode than llama.cpp on the same GPU
- 3.4x faster prefill

The RTX 3090 launched in 2020. Everyone calls it power-hungry. It isn't; the software is. The conventional wisdom says: NVIDIA is fast but thirsty, Apple Silicon is slow but sips power. Pick a side.

With stock frameworks, the numbers back that up:

Setup | tok/s | Power | tok/J
RTX 3090 (llama.cpp) | 267 | 350W | 0.76
M5 Max (LM Studio) | 229 | ~130W | 1.76
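The efficiency column is just throughput divided by draw; a quick sketch to sanity-check the table above (the tok/s and wattage figures are the table's own):

```python
def tok_per_joule(tok_per_s, watts):
    # watts = joules/second, so (tok/s) / (J/s) gives tok/J
    return tok_per_s / watts

assert round(tok_per_joule(267, 350), 2) == 0.76  # RTX 3090, llama.cpp
assert round(tok_per_joule(229, 130), 2) == 1.76  # M5 Max, LM Studio
```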

Case closed. Except the 3090 has 936 GB/s of bandwidth and 142 TFLOPS of FP16 compute, yet llama.cpp extracts only 267 tok/s from it. That ratio is absurd.

Traditional inference dispatches one kernel per operation. For 24 layers, that's roughly 100 launches per token. Every boundary means:
- Return control to the CPU
- Dispatch the next kernel
- Re-fetch weights from global memory
- Synchronize threads
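A back-of-envelope estimate of what those boundaries cost (the ~5 µs per-launch overhead is an illustrative assumption, not a figure from the post; the 100 launches and 267 tok/s are):

```python
launches_per_token = 100       # ~100 kernel launches per token (from the post)
launch_overhead_us = 5.0       # ASSUMED CPU->GPU dispatch cost per launch
token_budget_ms = 1000 / 267   # llama.cpp decode: 267 tok/s -> ~3.75 ms/token

overhead_ms = launches_per_token * launch_overhead_us / 1000
print(f"dispatch overhead: {overhead_ms:.2f} ms of {token_budget_ms:.2f} ms per token")
# Dispatch alone eats a double-digit share of the token budget,
# before counting the weights re-fetched at every kernel boundary.
```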

Why had nobody done this yet? Qwen3.5-0.8B isn't a vanilla transformer. It alternates:
- 18 DeltaNet layers: linear attention with a learned recurrence
- 6 full-attention layers: standard MHA

This hybrid pattern is where frontier models are heading: Qwen3-Next, Kimi Linear, all of them. DeltaNet scales linearly with context length instead of quadratically.
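The linear scaling comes from the recurrence carrying a fixed-size state instead of a growing KV cache. An illustrative delta-rule decode step in NumPy (a simplified sketch, not the repo's kernel; `beta` and the key normalization are assumptions):

```python
import numpy as np

def deltanet_decode_step(S, q, k, v, beta):
    # Delta-rule update: overwrite the value stored under key k with v,
    # gated by beta. S is a fixed (d_k, d_v) matrix -- constant memory
    # per step, which is why cost is linear in context length.
    S = S + beta * np.outer(k, v - k @ S)
    return S, q @ S  # state, output

d_k, d_v = 8, 8
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(100):  # 100 decode steps; the state never grows
    q, k, v = rng.standard_normal((3, d_k))
    k = k / np.linalg.norm(k)  # normalized keys keep the recurrence stable
    S, o = deltanet_decode_step(S, q, k, v, beta=0.5)
```

Contrast with full attention, where each step attends over all previous keys, so per-token cost and memory grow with the context.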

It's new, and nobody has shipped a fused kernel for it. MLX doesn't have DeltaNet kernels at all. llama.cpp supports it generically. Everyone else is waiting. The 267 tok/s wasn't a hardware ceiling, it was the software ceiling for a brand-new architecture.

We wrote a single CUDA kernel that runs the entire forward pass in one dispatch. Data stays in registers and shared memory as it flows through the network. Zero CPU round-trips, zero redundant memory fetches.

- 82 blocks x 512 threads, all SMs occupied
- BF16 weights and activations, FP32 accumulation
- DeltaNet recurrence runs in warp-cooperative FP32 registers
- Full attention fuses QKV, RoPE, causal softmax, and output projection
- Cooperative grid sync replaces kernel launches between layers
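The grid-sync pattern in the last bullet can be mimicked on a CPU: instead of one launch per layer, each "block" loops over all layers itself and synchronizes at a barrier. A Python-threads analogy of the persistent-kernel structure (an analogy only, not CUDA and not the repo's code):

```python
import threading

NUM_BLOCKS, NUM_LAYERS = 4, 24
barrier = threading.Barrier(NUM_BLOCKS)  # stands in for cooperative grid sync
done = []

def persistent_block(block_id):
    # One "launch": the block walks all 24 layers itself, never
    # returning control to the host between layers.
    for layer in range(NUM_LAYERS):
        # ... this block's share of layer `layer` would run here ...
        barrier.wait()  # every block finishes the layer before the next starts
    done.append(block_id)

threads = [threading.Thread(target=persistent_block, args=(i,))
           for i in range(NUM_BLOCKS)]
for t in threads: t.start()
for t in threads: t.join()
```

The barrier plays the role the per-layer kernel boundary used to play, but without a CPU round-trip or a cold re-fetch of weights.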

Results on the same RTX 3090, same model, same weights:

Setup | Prefill (pp520) | Decode (tg128)
Megakernel | 37,800 tok/s | 413 tok/s
llama.cpp BF16 | 11,247 tok/s | 267 tok/s
PyTorch + HF | 7,578 tok/s | 108 tok/s

Then we turned the power down. Fewer wasted cycles mean less heat, so we swept nvidia-smi -pl:

Power limit | Clock | Draw | tok/s | tok/J | Notes
420W (stock) | 1980 MHz | 314W | 433 | 1.38 | baseline
300W | 1935 MHz | 299W | 432 | 1.44 | -5% power, 99.8% speed
220W | 1635 MHz | 220W | 411 | 1.87 | -30% power, 95% speed
150W | 405 MHz | 150W | 194 | 1.29 | clock cliff, too aggressive
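The sweep rows are internally consistent; a quick check of the tok/J column and the 220W sweet spot against the 420W baseline (all numbers from the table above):

```python
rows = {  # power limit -> (measured draw W, tok/s)
    "420W": (314, 433),
    "300W": (299, 432),
    "220W": (220, 411),
    "150W": (150, 194),
}
for name, (draw, tps) in rows.items():
    print(f"{name}: {tps / draw:.2f} tok/J")

base_draw, base_tps = rows["420W"]
draw, tps = rows["220W"]
assert round(tps / base_tps, 2) == 0.95   # 95% of the throughput...
assert round(draw / base_draw, 2) == 0.70 # ...for 70% of the measured draw
```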

At 220W we hit the sweet spot: 95% of the throughput for 70% of the power. Tighter execution converts almost directly into saved watts. Measurement: NVML energy counters for NVIDIA, powermetrics for Apple Silicon, matching Hazy Research's Intelligence Per Watt methodology. Accelerator power only, not wall draw.

Without the megakernel the 3090 barely edges out a laptop chip. With it, a five-year-old GPU beats Apple's latest on throughput, matches it on efficiency, and costs a quarter as much. The NVIDIA vs Apple efficiency gap isn't silicon. It's software.

Try it:

git clone https://github.com/Luce-Org/luce-megakernel.git
cd luce-megakernel
pip install -e .
python bench_pp_tg.py

Requires: NVIDIA Ampere+ (tested on 3090), CUDA 12+, PyTorch 2.0+, ~1.5GB VRAM.

Code is open source (MIT): https://github.com/Luce-Org/luce-megakernel

Let us know if you have any feedback.

Comments

emanuele-em•1h ago
Really cool to see someone actually prove that the NVIDIA vs Apple efficiency gap is mostly a software problem. A 2020 GPU matching M5 Max tok/J at 1.8x the throughput just by fusing all 24 layers into one persistent kernel is a strong result. The DVFS sweep losing only 5% between 420W and 220W is surprising. Have you looked at what this would take on Hopper with TMA?

Show HN: MCP-fence – MCP firewall I built and tried to break (6 audit rounds)

https://www.npmjs.com/package/mcp-fence
1•yjcho9317•53s ago•0 comments

Our servers are experiencing high traffic please try again in a minute

https://discuss.ai.google.dev/t/how-to-resolve-this-issue-our-servers-are-experiencing-high-traff...
1•maarut•1m ago•0 comments

Show HN: I've built a hermes agent helper website

https://hermes-agent.us
1•mixfox•2m ago•0 comments

The Oldschool PC Font Pack

https://int10h.org/oldschool-pc-fonts/
1•petercooper•3m ago•0 comments

Are file systems all you need?

https://onyx.app/blog/file-search-vs-hybrid-search
1•Weves•4m ago•0 comments

Cyclotron: The Streaming Multiprocessor Abstraction Is Broken [pdf]

https://capra.cs.cornell.edu/latte26/paper/latte26-final28.pdf
1•matt_d•6m ago•0 comments

The Worst of Us

https://www.ianbetteridge.com/the-worst-of-us/
1•speckx•7m ago•0 comments

Wamp, WinAmp style native audio player for macOS

https://github.com/wishval/wamp
1•vnorilo•9m ago•0 comments

Easy Management

https://easy-manage-biz.com
1•charlmarajh•11m ago•0 comments

Tom Brady becomes 'chief wellness officer' at GLP-1 weight-loss shot company

https://www.independent.co.uk/news/world/americas/tom-brady-emed-weightloss-shot-company-b2899015...
2•randycupertino•13m ago•0 comments

AI doesn't know how to interact with touchscreens

https://blog.allada.com/give-an-llm-an-api-and-itll-thrive-give-it-a-touchscreen-and-it-struggles/
1•allada•13m ago•0 comments

Juan Benet Podcast Episode 1: Max Hodak, Founder and CEO of Science Corp

https://www.juanbenetpodcast.com/p/max-hodak-restoring-sight-growing
1•nettynol•14m ago•0 comments

Show HN: My Hyperliquid Trading Terminal

https://www.aulico.com
1•rovinarov•14m ago•0 comments

Show HN: TUI-use: Let AI agents control interactive terminal programs

https://github.com/onesuper/tui-use
3•dreamsome•14m ago•0 comments

Show HN: I bootstrapped a foundational text-to-speech model from scratch

https://tontaube.ai/
1•vincenttjona•15m ago•0 comments

Space Propulsion Made Easy: Eat Beans

https://www.npr.org/sections/krulwich/2010/09/16/129908529/space-propulsion-made-easy-eat-beans
1•thunderbong•15m ago•0 comments

Show HN: One click to deploy AI platforms and other open source tools

https://hyp.app
2•dashtio•16m ago•0 comments

Pgfmt – a PostgreSQL specific SQL formatter

https://github.com/gmr/pgfmt
2•whalesalad•18m ago•0 comments

Akamai: AI bot traffic surged 300% in 2025, hitting publishers hardest

https://www.akamai.com/resources/state-of-the-internet/publishing-ai-botnet-report
1•speckx•18m ago•0 comments

AI Experience Engineering

https://raqibul.com/writing/ai-experience-engineering
1•raqib-hayder•19m ago•0 comments

Improving LLM citation accuracy with agentic highlighting tools for local files

https://old.reddit.com/r/LLMDevs/comments/1sfd6ga/annotation_update_just_pushed_improved_note/
1•ieuanking•20m ago•0 comments

Next Grok model training with 10T parameter model

https://twitter.com/i/status/2041754402239975479
2•ramshanker•20m ago•2 comments

Bonsai 8B: a 1-bit LLM that fits in 1.15GB

https://firethering.com/bonsai-8b-1bit-llm/
4•steveharing1•21m ago•1 comments

AI agents as CRDT peers – building collaborative AI with Yjs

https://electric-sql.com/blog/2026/04/08/ai-agents-as-crdt-peers-with-yjs
2•samwillis•22m ago•0 comments

Confidential Inference

https://confidentialinference.net/
1•rzk•22m ago•0 comments

OneLivePage

https://www.onelive.page/
1•erii•22m ago•1 comments

A New Jersey Teen Finds Treasure, and More, in Abandoned Storage Units

https://www.nytimes.com/2026/03/31/style/new-jersey-teen-storage-units.html
5•bookofjoe•23m ago•1 comments

Taskmaster

1•mangoshakeboss•24m ago•0 comments

Show HN: I quit my job to sell garlic online

https://kylebenzle.com/demeter
1•WWIII_Historian•25m ago•0 comments

Browser, editor, and terminal. One app

https://glassapp.dev
2•mooreds•25m ago•0 comments