
Compiling LLMs into a MegaKernel: A path to low-latency inference

https://zhihaojia.medium.com/compiling-llms-into-a-megakernel-a-path-to-low-latency-inference-cf7840913c17
114•matt_d•4h ago

Comments

NitroPython•3h ago
Ollama integration?
baq•3h ago
Next step - compile straight to verilog so I can buy some LLMs on aliexpress
bigcat12345678•2h ago
https://riscv.org/blog/2021/02/hardware-description-language... That was one of the promising ideas before AI & GPUs came onto the scene. As CPUs stagnated, people naturally wanted to further optimize the middle layers of software and hardware.

But I suspect GPU-style parallel computing is going to dominate accelerated computing.

General-purpose CPUs are going to stay on as the little brain that orchestrates GPUs.

Ideas of compiling software directly to hardware might never become mainstream.

baq•2h ago
I'm thinking more like pseudointellect over serial to attach a $3 esp32 to. Since it's basically tokens in, tokens out, let's just cut the unnecessary parts out. It's like querying the cloud models, except it's your silicon you personally soldered to the esp so nobody will break your home assistant with a system prompt update or a fine tuning run.
mycall•18m ago
> General purpose CPUs are going to stay to become the little brain that orchestrates GPUs

Brings deterministic compute to the nondeterministic.

scotty79•3h ago
> Traditional LLM systems often rely on sequences of GPU kernel launches and external communication calls, resulting in underutilized hardware.

What? Why? This seems like an obvious optimization if it's possible.

shawntan•2h ago
Systems might want to anticipate changes in LLM architectures (even small changes can make a big difference kernel wise), so it's good to not "bake" too much in ahead of time.

That said, at some point it just depends where the costs lie, and it might make sense to hire some GPU engineers to do what they did here for whatever architecture you're optimising for.

Not as low-hanging as you might imagine.

catlifeonmars•2h ago
From the article

> Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.

So my naive assumption is that yes it is obvious, but nontrivial.

saagarjha•1h ago
Your naive assumption is the right one. It’s quite hard to do this. Even doing it automatically like it’s done here runs into problems with trying to figure out data dependencies and synchronization across nontrivial computation.
liuliu•2h ago
It really is not obvious. These launches are asynchronous, and data movement / computation are overlapped properly through CUDA APIs. Even the per-kernel launch cost has been reduced with the introduction of CUDA graphs.

The CUDA programming model relies on each kernel being computationally expensive to make sense, and that is not true for LLM token generation. And we are talking about network evaluation at more than 1000 times per second, whereas previously, outside of recommendation systems, the network evaluation we looked at was ~100 per second at most.

Also, nobody remembers Alex's "One Weird Trick" paper, which sliced matmul into pieces to overlap device-to-device transfer vs. computation. That was 10 years ago.
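The launch-overhead economics above can be made concrete with a back-of-envelope sketch. All numbers below are illustrative assumptions (typical per-kernel launch cost of a few microseconds, a guessed kernel count per decode step), not measurements from the article:

```python
# Back-of-envelope: how much of a decode step can kernel-launch overhead eat?
# All numbers are illustrative assumptions, not measurements from the post.
LAUNCH_OVERHEAD_US = 5.0    # rough CPU-side cost to launch one kernel
KERNELS_PER_STEP = 100      # kernels per decoding step (a few per transformer layer)
STEP_BUDGET_US = 1000.0     # 1 ms per step ~= 1000 tokens/sec decode

launch_us = LAUNCH_OVERHEAD_US * KERNELS_PER_STEP
fraction = launch_us / STEP_BUDGET_US
print(f"launch overhead: {launch_us:.0f} us/step, "
      f"{fraction:.0%} of the {STEP_BUDGET_US:.0f} us budget")
# -> launch overhead: 500 us/step, 50% of the 1000 us budget
```

With kernels this small, even modest per-launch costs become a first-order term, which is why fusing everything into one kernel pays off at high tokens-per-second rates.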

delusional•2h ago
In the common case where the processor dispatching those kernel calls is much faster than the kernel calls themselves, you're not likely to see a meaningful increase in throughput.

What you need to do first is get really optimized kernels (since that makes the dispatching relatively more expensive) and THEN this becomes worth doing. People who are really good at writing optimized GPU kernels are just not that easy to get a hold of right now.

bytepoet•2h ago
This is very cool. I enjoyed going through the writeup and GitHub README.

I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.

I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:

FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667

zhihaojia•1h ago
Thanks for reading the post and the GitHub README. Supporting training is definitely feasible, but the benefit may not be as significant as for low-latency inference, since training generally involves much larger kernels, making kernel launch overhead less significant.

Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!

liuliu•2h ago
The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one.

That being said, these one-persistent-kernel-per-SM designs remind me of Larrabee, and now I'm wondering what the world would look like if we had just taken the traditional process-thread-SIMD path rather than the CUDA path.

kp1197•2h ago
After working pretty closely with vLLM and SGLang over the past few months, this is EXACTLY what I envisioned a successor project would look like: analyzing an operation dependency graph and then fusing (or, at a minimum, scheduling tasks smarter). Congrats to the team.
zhihaojia•1h ago
Thanks a lot for your positive feedback! We believe that MPK can enhance existing LLM serving systems, especially for low-latency LLM serving. We are very excited about the opportunity to collaborate with others in this direction.
skavi•2h ago
Does anyone have an intuition on why this offers significant gains over CUDA Graphs? The CPU launch cost of a graph is tiny, which implies most of the work has been offloaded to the GPU's own scheduler. I'd expect that some I/O marshalling at kernel boundaries could be avoided with megakernels. Maybe some loop fusion? Are there any more interesting optimizations they enable?
refulgentis•2h ago
You've hit the nail on the head. The CPU launch cost of a pre-compiled CUDA graph is tiny.

CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.

With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished. The dependency is at the kernel level. With a megakernel, they break these ops into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as that slice is computed by a few SMs, while other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the dependency unit is the entire kernel.
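The kernel-level vs. task-level dependency difference described above can be sketched as a toy timeline. The tile names and timings are hypothetical; the real scheduler described in the article runs on-GPU:

```python
# Toy timeline: a matmul produces 4 output tiles, finishing at t=10,20,30,40.
# An AllReduce consumes those tiles.
matmul_tile_done = {f"mm_tile{i}": (i + 1) * 10 for i in range(4)}

# Kernel-level dependency (CUDA Graphs): the AllReduce kernel cannot start
# until the WHOLE matmul kernel has finished, i.e. until the LAST tile is done.
kernel_level_start = max(matmul_tile_done.values())

# Task-level dependency (megakernel): each AllReduce task starts as soon as
# the one tile it needs is ready, overlapping with the rest of the matmul.
task_level_starts = dict(matmul_tile_done)

print("kernel-level AllReduce start:", kernel_level_start)                   # t=40
print("first task-level AllReduce start:", min(task_level_starts.values()))  # t=10
```

The gap between t=40 and t=10 is the pipelining opportunity: communication for early tiles hides behind computation of later ones.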

saagarjha•1h ago
> The CPU launch cost of a graph is tiny

Absolutely not; it’s comparable to the launch overhead of a kernel.

flakiness•1h ago
This project is from CMU. Hazy Research at Stanford talked about the megakernel too: https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles

Good to see the competition in this area.

(Edited): Related paper covering the larger "mirage" project, but this doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751

zhihaojia•1h ago
This is the writer of the blog post. You are right that Stanford's work is a parallel effort. The main difference is that our focus is on compilation: making it easier to generate megakernels automatically.
olivia111•1h ago
really cool. would love to try it for our 3b model.
olivia111•1h ago
any detailed tutorial about how to use it?
zhihaojia•1h ago
The github repo includes a tutorial for using MPK: https://github.com/mirage-project/mirage/tree/mpk
fxtentacle•46m ago
Isn’t fusing ops at a fine-grained level also the core benefit of JAX over TensorFlow? How does this work compare to JAX?
bdbenton5255•44m ago
Certainly an important discovery for utilizing these models on scaled hardware. This approach could also be applied beyond LLMs to other types of neural networks. That would be an interesting space to explore.
