But I suspect GPU-style parallel computing is going to dominate accelerated computing.
General-purpose CPUs are going to stick around as the little brain that orchestrates the GPUs.
The idea of a direct software-to-hardware transition might never become mainstream.
The CPU brings deterministic compute to sit alongside the indeterministic kind.
If that was going to happen, it would have happened.
CPUs are genuinely good at what they do, and "what they do" is a lot of tasks that GPUs are actually terrible at. If all we had were GPUs in the world and someone invented a CPU, we'd hail them as a genius. A lot of people seem to think that GPUs are just "better", just ambiently better at everything, but that's lightyears from the truth. They are quite spectacularly terrible at a lot of very common tasks. There are many very good reasons that GPUs are still treated as accelerators for CPUs and not vice versa.
This is "gaming PC" territory, not "space heater". I mean people already have PS5's and whatnot in their homes.
The hundreds of gigabytes thing exists because the big cloud LLM providers went down the increasing parameter count path. That way is a dead end and we've reached negative returns already.
Prompt engineering + finetunes is the future, but you need developer brains for that, not TFLOPs.
You can just about run a 32B (at Q4/Q5 quantization) on 24GB. Running anything higher (such as the increasingly common 70B models, or higher still if you want to run something like Llama 4 or DeepSeek) means splitting the model between VRAM and system RAM. -- But yes, anything 24B or lower you can run comfortably, including enough capacity for the context.
If you have other models -- such as text-to-speech, speech recognition, etc. -- then those are going to take up VRAM for both the model and during processing/generation. That affects the size of LLM you can run.
Anything that overflows VRAM is going to slow down the response time drastically.
"Space heater" is determined by computational horsepower rather than available RAM.
How big a context window do you want? Last I checked that was very expensive in terms of RAM and having a large one was highly desirable.
Large contexts are very important, but they are cheap in terms of RAM compared to the cost of increasing the parameter count.
Training is the thing that costs the most in terms of power/memory/energy, often requiring months of running multiple (likely 4-8) A100/H100 GPUs on the training data.
Performing inference is cheaper as you can 1) keep the model loaded in VRAM, and 2) run it on a single H100. With its 80GB capacity you would need two to run a 70B model at FP16, or one at FP8. For 32B models and lower you could run them on a single H100. Then you only need 1 or 2 GPUs to handle the request.
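For anyone who wants to check the arithmetic, here is a rough host-side sketch of the weight-memory math (my own illustration, not a real sizing tool; the KV cache, activations, and framework overhead all need extra headroom on top):

    // Back-of-the-envelope weight-memory estimate for the numbers above.
    #include <cstdio>

    int main() {
        const double h100_gb = 80.0;                  // HBM per H100
        const double params_b[] = {32.0, 70.0};       // model sizes in billions of parameters
        const double bytes_per_param[] = {2.0, 1.0};  // FP16 = 2 bytes, FP8 = 1 byte

        for (double p : params_b)
            for (double b : bytes_per_param) {
                double weights_gb = p * b;            // ~1 GB per billion params per byte
                int gpus = (int)((weights_gb + h100_gb - 1.0) / h100_gb);
                printf("%2.0fB params at %.0f byte(s)/param: ~%3.0f GB of weights -> %d x H100\n",
                       p, b, weights_gb, gpus);
            }
        return 0;
    }

That gives ~140 GB (two H100s) for a 70B model at FP16, ~70 GB (one) at FP8, and well under 80 GB for 32B and below.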
ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.
I think the sweet spot will be when CPUs have support for high-throughput matrix operations, similar to the SIMD operations. That way the system will benefit from being able to use system memory [1] and not have another chip/board consuming power. -- IIUC, things are already moving in that direction for consumer devices.
[1] This will allow access to large amounts of memory without having to chain multiple GPUs. That will make it possible to run the larger models at higher precisions more efficiently and process the large amount of training data efficiently.
Right but at that point you're describing an H100 plus an additional ASIC plus presumably a CPU and some RAM. Or a variant of an H100 with some specialized ML functions baked in. Both of those just sound like a regular workstation to me.
Inference is certainly cheaper but getting it running quickly requires raw horsepower (thus wattage, thus heat dissipation).
Regarding CPUs there's a severe memory bandwidth issue. I haven't kept track of the extreme high end hardware but it's difficult to compete with GPUs on raw throughput.
More seriously, isn't that pretty much what all those AI hardware startups have already been doing for a while now?
What? Why? This seems like an obvious optimization if it's possible.
That said, at some point it just depends where the costs lie, and it might make sense to hire some GPU engineers to do what they did here for whatever architecture you're optimising for.
Not as low-hanging as you might imagine.
> Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.
So my naive assumption is that yes it is obvious, but nontrivial.
The CUDA programming model relies on each kernel being computationally expensive enough to be worth launching, and that isn't true for LLM token generation. And we are talking about network evaluation at more than 1000 times per second, whereas previously, outside of recommendation systems, the network evaluation rates we looked at were ~100 per second at most.
Also, nobody remembers Alex's "One Weird Trick" paper, which slices matmuls into pieces to overlap device-to-device transfers with computation. That was 10 years ago.
There has been some niche research on parallel token generation lately, though...
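Roughly, the trick looks like this (my own sketch of the general slice-and-overlap pattern, not the paper's exact scheme; all names and the toy kernel are made up):

    // Split the data into chunks; copy chunk k on one stream while the compute
    // stream works on chunks that have already arrived.
    #include <cuda_runtime.h>

    __global__ void scale_chunk(float* buf, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;   // stand-in for the per-chunk matmul slice
    }

    void sliced_overlap(const float* src, float* staging, int chunks, size_t chunk_elems) {
        cudaStream_t copy_s, compute_s;
        cudaStreamCreate(&copy_s);
        cudaStreamCreate(&compute_s);

        for (int k = 0; k < chunks; ++k) {
            float* chunk = staging + (size_t)k * chunk_elems;
            cudaEvent_t ready;
            cudaEventCreate(&ready);

            // Stage chunk k (direction inferred via unified addressing) while
            // earlier chunks are still being computed on the other stream.
            cudaMemcpyAsync(chunk, src + (size_t)k * chunk_elems,
                            chunk_elems * sizeof(float), cudaMemcpyDefault, copy_s);
            cudaEventRecord(ready, copy_s);

            // Compute on chunk k only once its copy has landed.
            cudaStreamWaitEvent(compute_s, ready, 0);
            scale_chunk<<<(chunk_elems + 255) / 256, 256, 0, compute_s>>>(chunk, chunk_elems);
            cudaEventDestroy(ready);  // resources freed once the event completes
        }
        cudaStreamSynchronize(compute_s);
        cudaStreamDestroy(copy_s);
        cudaStreamDestroy(compute_s);
    }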
What you need to do first is get really optimized kernels (since that makes the dispatching relatively more expensive) and THEN this becomes worth doing. People who are really good at writing optimized GPU kernels are just not that easy to get a hold of right now.
I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.
I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:
FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!
I look forward to following mirage development.
That being said, this one persistent kernel per SM reminds me of Larrabee, and now I'm wondering what the world would look like if we had just taken the traditional process-thread-SIMD path rather than the CUDA path.
CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.
With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished. The dependency is at the kernel level. With a megakernel, they break these ops into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as that slice is computed by a few SMs, while other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the dependency unit is the entire kernel.
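To make that concrete, here is a minimal sketch of the persistent-kernel pattern (my own illustration, not MPK's actual runtime; the Task layout, queue, and dependency counters are invented):

    // Each resident block loops, claims fine-grained tasks from a global queue,
    // and waits on a per-task dependency counter before running the task.
    struct Task {
        int op;          // e.g. 0 = matmul tile, 1 = allreduce slice
        int arg;         // which tile/slice this task covers
        int* deps_left;  // number of producer tasks that have not finished yet
    };

    __device__ void run_task(const Task& t) { /* dispatch on t.op, use t.arg */ }

    __global__ void megakernel(Task* tasks, int num_tasks, int* next) {
        __shared__ int my_task;
        while (true) {
            if (threadIdx.x == 0)
                my_task = atomicAdd(next, 1);      // block leader claims the next task
            __syncthreads();
            if (my_task >= num_tasks) return;      // queue drained: the block exits

            if (threadIdx.x == 0)                  // wait until all producers finished
                while (atomicAdd(tasks[my_task].deps_left, 0) > 0) { }
            __syncthreads();

            run_task(tasks[my_task]);              // the whole block executes one task
            __syncthreads();                       // finish before the leader claims again
            // A real runtime would now decrement deps_left on every consumer of this task.
        }
    }

The point is that an AllReduce task's deps_left can hit zero as soon as the few matmul-tile tasks it needs have finished, without waiting for a kernel boundary.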
I suppose I could look through the code of this project, but I'd hate to have to detangle that from the compiler infrastructure.
You might have a look at cooperative groups, also things like cuda::pipeline in libcudacxx to handle asynchronous and pipelined memory traffic, plus most of the block/warp CUB primitives, and then move on up to cuFFTDx, cuBLASDx and now cuSolverDx as the starting toolbox for your fused-kernel journey.
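For example, a grid-wide barrier from cooperative groups already lets you put two dependent "ops" in one launch (a toy sketch, assuming the kernel is launched with cudaLaunchCooperativeKernel and a grid small enough to be co-resident):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void fused_two_ops(float* x, int n) {
        cg::grid_group grid = cg::this_grid();
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n) x[i] = 2.0f * x[i];   // "op 1"
        grid.sync();                     // every block finishes op 1 before op 2 starts
        if (i < n) x[i] = x[i] + 1.0f;   // "op 2" safely reads op 1's full output
    }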
Absolutely not; it’s comparable to the launch overhead of a kernel.
What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.
I'm not sure it applies so well in LLMs though (should read the paper...).
Good to see the competition in this area.
(Edited): Related paper covering the larger "mirage" project, but this doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751
MPK takes a different approach where instead of incrementally fusing local operators, it decomposes operators into a task graph and builds a runtime system within a single kernel to execute all tasks specified in the task graph.
> Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.
https://news.ycombinator.com/item?id=44111673
I find it curious that fundamentals of the CUDA programming model (eg kernel launches) are being subverted in favor of fine grained task based parallelism that ends up using the hardware more effectively. Makes me wonder if CUDA has been holding us back in some ways.
What are the chances we see your work land in PyTorch as an experimental backend?
Awesome stuff thanks for sharing.
P.S. minor typo, your first two paragraphs under part 1 are nearly identical.
I completely agree that CUDA can be a limiting factor, especially for latency-sensitive workloads. As GPUs are becoming larger and faster, it's increasingly difficult to write standalone kernels that fully utilize hardware resources—particularly when optimizing for low latency with small batch sizes.
> What are the chances we see your work land in PyTorch as an experimental backend?
We're definitely excited about that direction. We believe MPK can help PyTorch support megakernel generation, and we’re actively exploring how to make that happen. Stay tuned!
> P.S. minor typo, your first two paragraphs under part 1 are nearly identical.
Thanks for pointing it out--I meant to remove the duplicate paragraph when finalizing the post.
Thank you!
And unlike a lot of research, the code actually runs well. I can reproduce the results using Modal GPUs, leaving the code here: https://github.com/mirage-project/mirage/pull/327/files
Triton + FlashInfer: Prompt length 39, generate length 264, per-token latency 19.189573345762312 ms
MPK: Prompt length 39, generate length 334, per-token latency 7.71875 ms
1. How fine grain is each task? In a traditional matrix multiplication kernel, for example, each thread block is responsible for a small output tile of the resulting matrix. In Mirage's mega kernel, would there correspondingly be a task for each small output tile?
2. How does the Mirage compiler form the task graph? Does it have domain knowledge of every operator's data flow at the granularity of individual elements? Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself the output of a prior matmul (+ nonlinearity), the dependees would be all of the output tile tasks corresponding to those M_BLOCK rows of the operator that produced A?
2. TL;DR: MPK automatically analyzes inter-task dependencies by tracking the input and output tensors associated with each task. Longer version: MPK uses imap, omap, and fmap (see Section 2 of the Mirage paper) to determine each task’s input and output tensors. A dependency is introduced between task A and task B if A produces any tensor elements that B consumes—that is, if A's outputs overlap with B's inputs.
> Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself the output of a prior matmul (+ nonlinearity), the dependees would be all of the output tile tasks corresponding to those M_BLOCK rows of the operator that produced A?
Exactly. In this case, all output tile tasks that consume those M_BLOCK rows of A will depend on all tasks responsible for producing the corresponding parts of A in the previous operator.
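A host-side toy version of that overlap rule (my own simplification; the real analysis works over imap/omap/fmap index mappings across full tensors, not just row ranges):

    #include <vector>

    struct TileTask {
        int out_begin, out_end;    // rows of the shared tensor this task writes
        int in_begin, in_end;      // rows of the shared tensor this task reads
        std::vector<int> deps;     // producer tasks this task must wait for
    };

    // Consumer task b depends on producer task a iff the rows a writes intersect
    // the rows b reads.
    void add_dependencies(const std::vector<TileTask>& producers,
                          std::vector<TileTask>& consumers) {
        for (auto& b : consumers)
            for (size_t ai = 0; ai < producers.size(); ++ai) {
                const TileTask& a = producers[ai];
                bool overlap = a.out_begin < b.in_end && b.in_begin < a.out_end;
                if (overlap) b.deps.push_back((int)ai);
            }
    }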
I tried to be smart and cache intermediate results that were shared by multiple kernels.
When the results were published I was stumped to see that others were orders of magnitude faster than me.
Turns out they didn't bother with caching at all. The overhead of recalculating everything a thousand times was tiny compared to the overhead of doing roundtrips through RAM.
I assume it's the same thing here. By compiling into MegaKernels, layer boundaries are squashed. There likely will be _more_ calculations and less shared intermediate results. But overall it's still a win due to less memory roundtrips.
There has to be a sweet spot, especially for convolution networks. No idea if the MegaKernel takes this into account.
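Something like the difference between these two variants (illustrative only, not the actual contest code):

    __device__ float intermediate(float x) { return x * x + 1.0f; }  // cheap to recompute

    // Variant A: the intermediate round-trips through DRAM between two launches.
    __global__ void stage1(const float* in, float* tmp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = intermediate(in[i]);        // write to global memory
    }
    __global__ void stage2(const float* tmp, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * tmp[i];              // read it back from global memory
    }

    // Variant B: fused, the intermediate never leaves registers, even if more than
    // one consumer ends up recomputing it.
    __global__ void fused(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * intermediate(in[i]); // recompute, skip the DRAM trip
    }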
We do some on-the-fly optimizations as well (like compiling into CUDA graphs or fusing together calls), which for some inference engines ends up resulting in faster token throughput too.
1. For tasks like autocomplete, keyword routing, or voice transcription, what would the latency and power savings look like on an ASIC vs. even a megakernel GPU setup? Would that justify a fixed-function approach in edge devices or embedded systems?
2. ASICs obviously kill retraining, but could we envision a hybrid setup where a base model is hardwired and a small, soft, learnable module (e.g., LoRA-style residual layers) runs on a general-purpose co-processor?
3. Would the transformer’s fixed topology lend itself to spatial reuse in ASIC design, or is the model’s size (e.g. GPT-3-class) still prohibitive without aggressive weight pruning or quantization?
Maybe we should think of them like transistors? Right now, we are at the point where we have a room-sized computer that can do multiplication from punch card input.
It is fun to imagine what we could do if we ran, say, 1 million coordinated o3-pro queries at once.