
Compiling LLMs into a MegaKernel: A path to low-latency inference

https://zhihaojia.medium.com/compiling-llms-into-a-megakernel-a-path-to-low-latency-inference-cf7840913c17
150•matt_d•6h ago

Comments

NitroPython•5h ago
Ollama integration?
baq•5h ago
Next step: compile straight to Verilog so I can buy some LLMs on AliExpress.
bigcat12345678•4h ago
https://riscv.org/blog/2021/02/hardware-description-language... That was one of the promising ideas before AI and GPUs came onto the scene. As CPUs stagnated, people naturally wanted to further optimize the middle layers of software and hardware.

But I suspect GPU-style parallel computing is going to dominate accelerated computing.

General-purpose CPUs will stay on as the little brain that orchestrates the GPUs.

The idea of compiling software directly to hardware may never become mainstream.

baq•4h ago
I'm thinking more like a pseudointellect over serial to attach a $3 ESP32 to. Since it's basically tokens in, tokens out, let's just cut out the unnecessary parts. It's like querying the cloud models, except it's your silicon that you personally soldered to the ESP, so nobody will break your home assistant with a system prompt update or a fine-tuning run.
mycall•2h ago
> General purpose CPUs are going to stay to become the little brain that orchestrates GPUs

Brings deterministic compute to the indeterministic.

anitil•39m ago
I mean.... LLM-in-a-box would actually be pretty neat! I'm looking at some air-gapped work coming up and having something like that would be quite handy
fc417fc802•22m ago
Isn't that easily accomplished by setting up a local deployment and then yanking the network cable? Anything that can quickly run a capable LLM is going to be a pretty beefy box though. More like LLM in an expensive space heater.
fc417fc802•26m ago
Because training costs weren't high enough already, let's add mask costs on top.

More seriously, isn't that pretty much what all those AI hardware startups have already been doing for a while now?

scotty79•4h ago
> Traditional LLM systems often rely on sequences of GPU kernel launches and external communication calls, resulting in underutilized hardware.

What? Why? This seems like an obvious optimization if it's possible.

shawntan•4h ago
Systems might want to anticipate changes in LLM architectures (even small changes can make a big difference kernel-wise), so it's good not to "bake" too much in ahead of time.

That said, at some point it just depends on where the costs lie, and it might make sense to hire some GPU engineers to do what they did here for whatever architecture you're optimising for.

Not as low-hanging as you might imagine.

catlifeonmars•4h ago
From the article

> Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.

So my naive assumption is that yes it is obvious, but nontrivial.

saagarjha•2h ago
Your naive assumption is the right one. It’s quite hard to do this. Even doing it automatically like it’s done here runs into problems with trying to figure out data dependencies and synchronization across nontrivial computation.
liuliu•4h ago
It really is not obvious. These launches are asynchronous, and data movement and computation are properly overlapped through the CUDA APIs. Even the per-kernel launch cost has been reduced with the introduction of CUDA graphs.

The CUDA programming model relies on each kernel being computationally expensive enough to make sense, and that is no longer true for LLM token generation. We are talking about network evaluations at more than 1,000 per second, whereas previously, outside of recommendation systems, network evaluation ran at ~100 per second at most.

Also, nobody remembers Alex Krizhevsky's "One Weird Trick" paper, which sliced the matmul into pieces to overlap device-to-device transfers with computation. That was 10 years ago.
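
For reference, a minimal sketch of the CUDA graph mechanism mentioned above: capture a fixed launch sequence once, then replay it with a single launch. The decode_step_a/decode_step_b kernels are hypothetical stand-ins for a decoder's per-token launch sequence, and the snippet assumes CUDA 12's three-argument cudaGraphInstantiate.

    #include <cuda_runtime.h>

    // Hypothetical stand-ins for two stages of a decoder's per-token work.
    __global__ void decode_step_a(float *x) { /* ... */ }
    __global__ void decode_step_b(float *x) { /* ... */ }

    void decode_with_graph(float *x, int steps) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Record the launch sequence without executing it.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        decode_step_a<<<80, 256, 0, stream>>>(x);
        decode_step_b<<<80, 256, 0, stream>>>(x);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature

        // One launch per token replays the whole recorded DAG.
        for (int i = 0; i < steps; ++i)
            cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);
    }

Even then, each cudaGraphLaunch still costs on the order of a single kernel launch, which is exactly the floor a megakernel tries to remove.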

delusional•4h ago
In the common case, where the processor dispatching those kernel calls is much faster than the kernel calls themselves, you're not likely to see a meaningful increase in throughput.

What you need first is really optimized kernels (since that makes dispatching relatively more expensive), and THEN this becomes worth doing. People who are really good at writing optimized GPU kernels are just not that easy to get hold of right now.

bytepoet•4h ago
This is very cool. I enjoyed going through the writeup and GitHub README.

I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.

I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:

FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667

zhihaojia•3h ago
Thanks for reading the post and GitHub README. Supporting training is definitely feasible, but the benefit may not be as significant as it is for low-latency inference, since training generally involves much larger kernels, making kernel launch overhead less significant.

Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!

bytepoet•41m ago
Thanks for the input. It's very helpful to know.

I look forward to following Mirage's development.

ActorNightly•1h ago
Personally I think it's a bit of a waste to invest time in gradient-training optimizations. A lot of real-world training tasks have values that are discrete in nature, which can't be trained with gradients.
liuliu•4h ago
The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one.

That being said, this one persistent kernel per SM reminds me of Larrabee, and now I'm wondering what the world would look like if we had just taken the traditional process-thread-SIMD path rather than the CUDA path.

kp1197•4h ago
After working pretty closely with vLLM and SGLang over the past few months, this is EXACTLY what I had envisioned a successor project would look like - analyzing an operation dependency graph and then fusing (or, at a minimum, scheduling tasks smarter). Congrats to the team.
zhihaojia•3h ago
Thanks a lot for your positive feedback! We believe that MPK can enhance existing LLM serving systems, especially for low-latency LLM serving. We are very excited about the opportunity to collaborate with others in this direction.
skavi•4h ago
Does anyone have an intuition for why this offers significant gains over CUDA Graphs? The CPU launch cost of a graph is tiny, which implies most of the work has been offloaded to the GPU's own scheduler. I'd expect that some I/O marshalling at kernel boundaries could be avoided with megakernels. Maybe some loop fusion? Are there any more interesting optimizations they enable?
refulgentis•3h ago
You've hit the nail on the head. The CPU launch cost of a pre-compiled CUDA graph is tiny.

CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.

With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished. The dependency is at the kernel level. With a megakernel, they break these ops into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as that slice is computed by a few SMs, while other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the dependency unit is the entire kernel.
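
A toy illustration of that fine-grained dependency idea (not MPK's actual code): producer blocks publish per-slice flags as they finish, and a consumer block picks up each slice as soon as its flag flips, while the other producers are still running. All names are made up, and it assumes the whole grid is co-resident on the GPU, as persistent kernels require.

    #include <cuda_runtime.h>

    #define NUM_SLICES 64

    __device__ int slice_ready[NUM_SLICES];  // 0 = pending, 1 = published

    __device__ void compute_slice(float *buf, int s) { /* one matmul slice */ }
    __device__ void consume_slice(float *buf, int s) { /* e.g. reduce it   */ }

    __global__ void megakernel(float *buf) {
        if (blockIdx.x < NUM_SLICES) {
            // Producer role: compute one slice, then publish it.
            compute_slice(buf, blockIdx.x);
            __threadfence();   // each thread makes its writes visible device-wide
            __syncthreads();   // whole block has fenced before we publish
            if (threadIdx.x == 0)
                atomicExch(&slice_ready[blockIdx.x], 1);
        } else {
            // Consumer role: start on each slice as soon as it is ready,
            // without waiting for the full "matmul" to finish.
            for (int s = 0; s < NUM_SLICES; ++s) {
                if (threadIdx.x == 0)
                    while (atomicAdd(&slice_ready[s], 0) == 0) { /* spin */ }
                __syncthreads();
                consume_slice(buf, s);
            }
        }
    }

Launched as, say, megakernel<<<NUM_SLICES + 1, 256>>>(buf), with the extra block acting as the consumer. With kernel-level dependencies, the consumer could not even be launched until every producer block had exited.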

saagarjha•2h ago
> The CPU launch cost of a graph is tiny

Absolutely not; it’s comparable to the launch overhead of a kernel.

flakiness•3h ago
This project is from CMU. Hazy Research at Stanford has talked about megakernels too: https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles

Good to see the competition in this area.

(Edited): Related paper covering the larger Mirage project, though it doesn't cover the megakernel approach: https://arxiv.org/abs/2405.05751

zhihaojia•3h ago
This is the writer of the blog post. You are right that Stanford's work is a parallel effort. The main difference is that our focus is on compilation: making it easier to generate megakernels automatically.
zekrioca•30m ago
And their focus is...?
sigbottle•1h ago
Hazy Research also has ThunderKittens, a pretty cool library. There's a lot of effort, it seems, to really formalize, pipeline, and divide-and-conquer within the current NVIDIA GPU model to maximize GPU efficiency, and to write compilers/DSLs for these things.
olivia111•3h ago
really cool. would love to try it for our 3b model.
olivia111•3h ago
any detailed tutorial about how to use it?
zhihaojia•3h ago
The github repo includes a tutorial for using MPK: https://github.com/mirage-project/mirage/tree/mpk
fxtentacle•2h ago
Isn’t fusing ops at a fine-grained level also the core benefit of JAX over TensorFlow? How does this work compare to JAX?
zhihaojia•1h ago
JAX's operator fusion (https://apxml.com/courses/advanced-jax/chapter-2-optimizing-...) can fuse a few local operators (e.g., a matmul and elementwise computation) into a single kernel. But JAX's approach cannot fuse an entire LLM with hundreds of operators into a single kernel, because many operators involve loop transformations.

MPK takes a different approach: instead of incrementally fusing local operators, it decomposes operators into a task graph and builds a runtime system within a single kernel to execute all tasks specified in the task graph.
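
As a rough mental model of that in-kernel runtime (illustrative only, not Mirage's actual code), each block repeatedly claims a task from a global queue, spins until the task's producer count reaches zero, executes it, and then decrements its successors' counts:

    #include <cuda_runtime.h>

    enum TaskType { MATMUL_TILE, ATTENTION_TILE, ALLREDUCE_CHUNK };

    struct Task {
        int type;
        int deps_remaining;   // producers left; 0 means ready to run
        int successors[4];    // indices of dependent tasks
        int num_successors;
    };

    __global__ void task_runtime(Task *tasks, int num_tasks, int *next) {
        __shared__ int claimed;
        while (true) {
            // Lane 0 claims the next task index for the whole block.
            if (threadIdx.x == 0) claimed = atomicAdd(next, 1);
            __syncthreads();
            int t = claimed;              // private copy, safe across iterations
            if (t >= num_tasks) return;

            // Wait until every producer of this task has signaled.
            if (threadIdx.x == 0)
                while (atomicAdd(&tasks[t].deps_remaining, 0) > 0) { /* spin */ }
            __syncthreads();

            switch (tasks[t].type) {
                case MATMUL_TILE:     /* run one matmul tile    */ break;
                case ATTENTION_TILE:  /* run one attention tile */ break;
                case ALLREDUCE_CHUNK: /* push one comm chunk    */ break;
            }

            // Whole block must finish before successors are signaled.
            __syncthreads();
            __threadfence();
            if (threadIdx.x == 0)
                for (int i = 0; i < tasks[t].num_successors; ++i)
                    atomicSub(&tasks[tasks[t].successors[i]].deps_remaining, 1);
        }
    }

This sketch assumes tasks are enqueued in topological order and the grid has enough blocks that producers are always in flight; a real runtime would use ready queues rather than blind spinning.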

bdbenton5255•2h ago
Certainly an important development for utilizing these models on scaled hardware. This approach could be applied beyond LLMs to other types of neural networks. That would be an interesting space to explore.
zhihaojia•45m ago
Thanks for the feedback! Yes, we believe the approach is general and applicable to other ML workloads.
tuananh•1h ago
If you want to try it on a 5090: it's not supported yet.

> Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.

zhihaojia•47m ago
The task implementations used by MPK are currently optimized for A100. While the Mirage compiler can generate task implementations for other architectures such as Hopper and Blackwell, we haven't integrated everything together yet. This is at the very top of our to-do list. Stay tuned!
qihqi•1h ago
Probably should make this into a backend for torch.compile.
zhihaojia•43m ago
Yes, it would be a lot of fun if MPK could enable torch.compile to generate megakernels. Torch-generated kernels are currently too slow for latency-sensitive workloads.

Show HN: I wrote a new BitTorrent tracker in Elixir

https://github.com/Dahrkael/ExTracker
77•dahrkael•2h ago•2 comments

Literate programming tool for any language

https://github.com/zyedidia/Literate
40•LorenDB•3h ago•22 comments

Curved-Crease Sculpture

https://erikdemaine.org/curved/
156•wonger_•11h ago•25 comments

Andrej Karpathy: Software in the era of AI [video]

https://www.youtube.com/watch?v=LCEmiRjPEtQ
1110•sandslash•1d ago•620 comments

Infinite Mac OS X

https://blog.persistent.info/2025/03/infinite-mac-os-x.html
4•kristianp•1h ago•0 comments

Show HN: A DOS-like hobby OS written in Rust and x86 assembly

https://github.com/krustowski/rou2exOS
147•krustowski•11h ago•35 comments

Show HN: EnrichMCP – A Python ORM for Agents

https://github.com/featureform/enrichmcp
86•bloppe•7h ago•25 comments

How OpenElections uses LLMs

https://thescoop.org/archives/2025/06/09/how-openelections-uses-llms/index.html
84•m-hodges•9h ago•33 comments

Extracting memorized pieces of books from open-weight language models

https://arxiv.org/abs/2505.12546
47•fzliu•3d ago•39 comments

Show HN: RM2000 Tape Recorder, an audio sampler for macOS

https://rm2000.app
29•marcelox86•2d ago•8 comments

Guess I'm a Rationalist Now

https://scottaaronson.blog/?p=8908
220•nsoonhui•15h ago•633 comments

Homegrown Closures for Uxn

https://krzysckh.org/b/Homegrown-closures-for-uxn.html
70•todsacerdoti•7h ago•9 comments

Octobass

https://www.atlasobscura.com/places/octobass
6•keepamovin•3d ago•0 comments

String Interpolation in C++ Using Glaze Stencil/Mustache

https://stephenberry.github.io/glaze/stencil-mustache/
16•npalli•3d ago•0 comments

Show HN: Unregistry – “docker push” directly to servers without a registry

https://github.com/psviderski/unregistry
630•psviderski•1d ago•139 comments

DNA floating in the air tracks wildlife, viruses, even drugs

https://www.sciencedaily.com/releases/2025/06/250603114822.htm
72•karlperera•3d ago•59 comments

Show HN: Claude Code Usage Monitor – real-time tracker to dodge usage cut-offs

https://github.com/Maciek-roboblog/Claude-Code-Usage-Monitor
195•Maciej-roboblog•15h ago•108 comments

Giant, All-Seeing Telescope Is Set to Revolutionize Astronomy

https://www.science.org/content/article/giant-all-seeing-telescope-set-revolutionize-astronomy
4•gammarator•2h ago•0 comments

What would a Kubernetes 2.0 look like

https://matduggan.com/what-would-a-kubernetes-2-0-look-like/
141•Bogdanp•13h ago•221 comments

We Can Just Measure Things

https://lucumr.pocoo.org/2025/6/17/measuring/
61•tosh•2d ago•47 comments

Testing a Robust Netcode with Godot

https://studios.ptilouk.net/little-brats/blog/2024-10-23_netcode.html
31•smig0•2d ago•9 comments

Flowspace (YC S17) Is Hiring Software Engineers

https://flowspace.applytojob.com/apply/6oDtY2q6E9/Software-Engineer-II
1•mrjasonh•8h ago

Star Quakes and Monster Shock Waves

https://www.caltech.edu/about/news/star-quakes-and-monster-shock-waves
34•gmays•2d ago•4 comments

Public/protected/private is an unnecessary feature

https://catern.com/private.html
51•PaulHoule•2d ago•51 comments

Visual History of the Latin Alphabet

https://uclab.fh-potsdam.de/arete/en
109•speckx•2d ago•65 comments

Munich from a Hamburger's perspective

https://mertbulan.com/2025/06/14/munich-from-a-hamburgers-perspective/
105•toomuchtodo•4d ago•88 comments

From LLM to AI Agent: What's the Real Journey Behind AI System Development?

https://www.codelink.io/blog/post/ai-system-development-llm-rag-ai-workflow-agent
120•codelink•15h ago•37 comments

Posit floating point numbers: thin triangles and other tricks (2019)

http://marc-b-reynolds.github.io/math/2019/02/06/Posit1.html
46•fanf2•10h ago•33 comments

Getting Started Strudel

https://strudel.cc/workshop/getting-started/
140•rcarmo•3d ago•50 comments