But I suspect GPU-style parallel computing is going to dominate accelerated computing.
General-purpose CPUs are going to stick around as the little brain that orchestrates the GPUs.
The idea of a direct software-to-hardware transition might never become mainstream.
The CPU brings deterministic compute to sit alongside the indeterministic kind.
If that was going to happen, it would have happened.
CPUs are genuinely good at what they do, and "what they do" is a lot of tasks that GPUs are actually terrible at. If all we had were GPUs in the world and someone invented a CPU, we'd hail them as a genius. A lot of people seem to think that GPUs are just "better", just ambiently better at everything, but that's lightyears from the truth. They are quite spectacularly terrible at a lot of very common tasks. There are many very good reasons that GPUs are still treated as accelerators for CPUs and not vice versa.
This is "gaming PC" territory, not "space heater". I mean people already have PS5's and whatnot in their homes.
The hundreds of gigabytes thing exists because the big cloud LLM providers went down the increasing parameter count path. That way is a dead end and we've reached negative returns already.
Prompt engineering + finetunes is the future, but you need developer brains for that, not TFLOPs.
You can just about run a 32B (at Q4/Q5 quantization) on 24GB. Running anything higher (such as the increasingly common 70B models, or higher still if you want to run something like Llama 4 or DeepSeek) means splitting the model between VRAM and system RAM. -- But yes, anything 24B or lower you can run comfortably, including enough capacity for the context.
If you have other models -- such as text-to-speech, speech recognition, etc. -- then those are going to take up VRAM for both the model and during processing/generation. That affects the size of LLM you can run.
Anything that overflows VRAM is going to slow down the response time drastically.
"Space heater" is determined by computational horsepower rather than available RAM.
How big a context window do you want? Last I checked that was very expensive in terms of RAM and having a large one was highly desirable.
Large contexts are very important, but they are cheap in terms of RAM compared to the cost of increasing the parameter count.
Training is the thing that costs the most in terms of power/memory/energy, often requiring months of running multiple (likely 4-8) A100/H100 GPUs on the training data.
Performing inference is cheaper as you can 1) keep the model loaded in VRAM, and 2) run it on a single H100. With its 80GB capacity you would need two to run a 70B model at FP16, or one at FP8. For 32B models and lower you could run them on a single H100. Then you only need 1 or 2 GPUs to handle the request.
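For anyone who wants to check the arithmetic, here is a rough host-side sketch of the weight-memory math (my own illustration, not a real sizing tool; the KV cache, activations, and framework overhead all need extra headroom on top):

    // Back-of-the-envelope weight-memory estimate for the numbers above.
    #include <cstdio>

    int main() {
        const double h100_gb = 80.0;                  // HBM per H100
        const double params_b[] = {32.0, 70.0};       // model sizes in billions of parameters
        const double bytes_per_param[] = {2.0, 1.0};  // FP16 = 2 bytes, FP8 = 1 byte

        for (double p : params_b)
            for (double b : bytes_per_param) {
                double weights_gb = p * b;            // ~1 GB per billion params per byte
                int gpus = (int)((weights_gb + h100_gb - 1.0) / h100_gb);
                printf("%2.0fB params at %.0f byte(s)/param: ~%3.0f GB of weights -> %d x H100\n",
                       p, b, weights_gb, gpus);
            }
        return 0;
    }

That gives ~140 GB (two H100s) for a 70B model at FP16, ~70 GB (one) at FP8, and well under 80 GB for 32B and below.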
ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.
I think the sweet spot will be when CPUs have support for high-throughput matrix operations, similar to the SIMD operations. That way the system will benefit from being able to use system memory [1] and not have another chip/board consuming power. -- IIUC, things are already moving in that direction for consumer devices.
[1] This will allow access to large amounts of memory without having to chain multiple GPUs. That will make it possible to run the larger models at higher precisions more efficiently and process the large amount of training data efficiently.
Right but at that point you're describing an H100 plus an additional ASIC plus presumably a CPU and some RAM. Or a variant of an H100 with some specialized ML functions baked in. Both of those just sound like a regular workstation to me.
Inference is certainly cheaper but getting it running quickly requires raw horsepower (thus wattage, thus heat dissipation).
Regarding CPUs there's a severe memory bandwidth issue. I haven't kept track of the extreme high end hardware but it's difficult to compete with GPUs on raw throughput.
More seriously, isn't that pretty much what all those AI hardware startups have already been doing for a while now?
What? Why? This seems like an obvious optimization if it's possible.
That said, at some point it just depends where the costs lie, and it might make sense to hire some GPU engineers to do what they did here for whatever architecture you're optimising for.
Not as low-hanging as you might imagine.
> Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.
So my naive assumption is that yes it is obvious, but nontrivial.
The CUDA programming model relies on each kernel being computationally expensive enough to be worth launching, and that isn't true for LLM token generation. And we are talking about network evaluation at more than 1000 times per second, whereas previously, outside of recommendation systems, the network evaluation rates we looked at were ~100 per second at most.
Also, nobody remembers Alex's "One Weird Trick" paper, which slices matmuls into pieces to overlap device-to-device transfers with computation. That was 10 years ago.
There has been some niche research on parallel token generation lately, though...
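Roughly, the trick looks like this (my own sketch of the general slice-and-overlap pattern, not the paper's exact scheme; all names and the toy kernel are made up):

    // Split the data into chunks; copy chunk k on one stream while the compute
    // stream works on chunks that have already arrived.
    #include <cuda_runtime.h>

    __global__ void scale_chunk(float* buf, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;   // stand-in for the per-chunk matmul slice
    }

    void sliced_overlap(const float* src, float* staging, int chunks, size_t chunk_elems) {
        cudaStream_t copy_s, compute_s;
        cudaStreamCreate(&copy_s);
        cudaStreamCreate(&compute_s);

        for (int k = 0; k < chunks; ++k) {
            float* chunk = staging + (size_t)k * chunk_elems;
            cudaEvent_t ready;
            cudaEventCreate(&ready);

            // Stage chunk k (direction inferred via unified addressing) while
            // earlier chunks are still being computed on the other stream.
            cudaMemcpyAsync(chunk, src + (size_t)k * chunk_elems,
                            chunk_elems * sizeof(float), cudaMemcpyDefault, copy_s);
            cudaEventRecord(ready, copy_s);

            // Compute on chunk k only once its copy has landed.
            cudaStreamWaitEvent(compute_s, ready, 0);
            scale_chunk<<<(chunk_elems + 255) / 256, 256, 0, compute_s>>>(chunk, chunk_elems);
            cudaEventDestroy(ready);  // resources freed once the event completes
        }
        cudaStreamSynchronize(compute_s);
        cudaStreamDestroy(copy_s);
        cudaStreamDestroy(compute_s);
    }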
What you need to do first is get really optimized kernels (since that makes the dispatching relatively more expensive) and THEN this becomes worth doing. People who are really good at writing optimized GPU kernels are just not that easy to get a hold of right now.
I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.
I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:
FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!
I look forward to following mirage development.
That being said, this one persistent kernel per SM reminds me of Larrabee, and now I'm wondering what the world would look like if we had just taken the traditional process-thread-SIMD path rather than the CUDA path.
CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.
With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished. The dependency is at the kernel level. With a megakernel, they break these ops into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as that slice is computed by a few SMs, while other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the dependency unit is the entire kernel.
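To make that concrete, here is a minimal sketch of the persistent-kernel pattern (my own illustration, not MPK's actual runtime; the Task layout, queue, and dependency counters are invented):

    // Each resident block loops, claims fine-grained tasks from a global queue,
    // and waits on a per-task dependency counter before running the task.
    struct Task {
        int op;          // e.g. 0 = matmul tile, 1 = allreduce slice
        int arg;         // which tile/slice this task covers
        int* deps_left;  // number of producer tasks that have not finished yet
    };

    __device__ void run_task(const Task& t) { /* dispatch on t.op, use t.arg */ }

    __global__ void megakernel(Task* tasks, int num_tasks, int* next) {
        __shared__ int my_task;
        while (true) {
            if (threadIdx.x == 0)
                my_task = atomicAdd(next, 1);      // block leader claims the next task
            __syncthreads();
            if (my_task >= num_tasks) return;      // queue drained: the block exits

            if (threadIdx.x == 0)                  // wait until all producers finished
                while (atomicAdd(tasks[my_task].deps_left, 0) > 0) { }
            __syncthreads();

            run_task(tasks[my_task]);              // the whole block executes one task
            __syncthreads();                       // finish before the leader claims again
            // A real runtime would now decrement deps_left on every consumer of this task.
        }
    }

The point is that an AllReduce task's deps_left can hit zero as soon as the few matmul-tile tasks it needs have finished, without waiting for a kernel boundary.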
I suppose I could look through the code of this project, but I'd hate to have to detangle that from the compiler infrastructure.
You might have a look at cooperative groups, also things like cuda::pipeline in libcudacxx to handle asynchronous and pipelined memory traffic, plus most of the block/warp CUB primitives, and then move on up to cuFFTDx, cuBLASDx and now cuSolverDx as the starting toolbox for your fused-kernel journey.
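For example, a grid-wide barrier from cooperative groups already lets you put two dependent "ops" in one launch (a toy sketch, assuming the kernel is launched with cudaLaunchCooperativeKernel and a grid small enough to be co-resident):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void fused_two_ops(float* x, int n) {
        cg::grid_group grid = cg::this_grid();
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n) x[i] = 2.0f * x[i];   // "op 1"
        grid.sync();                     // every block finishes op 1 before op 2 starts
        if (i < n) x[i] = x[i] + 1.0f;   // "op 2" safely reads op 1's full output
    }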
Absolutely not; it’s comparable to the launch overhead of a kernel.
What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.
I'm not sure it applies so well in LLMs though (should read the paper...).
Good to see the competition in this area.
(Edited): Related paper covering the larger "mirage" project, but this doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751
MPK takes a different approach where instead of incrementally fusing local operators, it decomposes operators into a task graph and builds a runtime system within a single kernel to execute all tasks specified in the task graph.
> Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.
https://news.ycombinator.com/item?id=44111673
I find it curious that fundamentals of the CUDA programming model (eg kernel launches) are being subverted in favor of fine grained task based parallelism that ends up using the hardware more effectively. Makes me wonder if CUDA has been holding us back in some ways.
What are the chances we see your work land in PyTorch as an experimental backend?
Awesome stuff thanks for sharing.
P.S. minor typo, your first two paragraphs under part 1 are nearly identical.
I completely agree that CUDA can be a limiting factor, especially for latency-sensitive workloads. As GPUs are becoming larger and faster, it's increasingly difficult to write standalone kernels that fully utilize hardware resources—particularly when optimizing for low latency with small batch sizes.
> What are the chances we see your work land in PyTorch as an experimental backend?
We're definitely excited about that direction. We believe MPK can help PyTorch support megakernel generation, and we’re actively exploring how to make that happen. Stay tuned!
> P.S. minor typo, your first two paragraphs under part 1 are nearly identical.
Thanks for pointing it out--I meant to remove the duplicate paragraph when finalizing the post.
Thank you!
And unlike a lot of research, the code actually runs well. I can reproduce the results using Modal GPUs, leaving the code here: https://github.com/mirage-project/mirage/pull/327/files
Triton + FlashInfer: Prompt length 39, generate length 264, per-token latency 19.189573345762312 ms
MPK: Prompt length 39, generate length 334, per-token latency 7.71875 ms
1. How fine grain is each task? In a traditional matrix multiplication kernel, for example, each thread block is responsible for a small output tile of the resulting matrix. In Mirage's mega kernel, would there correspondingly be a task for each small output tile?
2. How does the Mirage compiler form the task graph? Does it have domain knowledge of every operator's data flow at the granularity of individual elements? Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself the output of a prior matmul (+ nonlinearity), the dependees would be all of the output tile tasks corresponding to those M_BLOCK rows of the operator that produced A?
2. TL;DR: MPK automatically analyzes inter-task dependencies by tracking the input and output tensors associated with each task. Longer version: MPK uses imap, omap, and fmap (see Section 2 of the Mirage paper) to determine each task’s input and output tensors. A dependency is introduced between task A and task B if A produces any tensor elements that B consumes—that is, if A's outputs overlap with B's inputs.
> Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself the output of a prior matmul (+ nonlinearity), the dependees would be all of the output tile tasks corresponding to those M_BLOCK rows of the operator that produced A?
Exactly. In this case, all output tile tasks that consume those M_BLOCK rows of A will depend on all tasks responsible for producing the corresponding parts of A in the previous operator.
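A host-side toy version of that overlap rule (my own simplification; the real analysis works over imap/omap/fmap index mappings across full tensors, not just row ranges):

    #include <vector>

    struct TileTask {
        int out_begin, out_end;    // rows of the shared tensor this task writes
        int in_begin, in_end;      // rows of the shared tensor this task reads
        std::vector<int> deps;     // producer tasks this task must wait for
    };

    // Consumer task b depends on producer task a iff the rows a writes intersect
    // the rows b reads.
    void add_dependencies(const std::vector<TileTask>& producers,
                          std::vector<TileTask>& consumers) {
        for (auto& b : consumers)
            for (size_t ai = 0; ai < producers.size(); ++ai) {
                const TileTask& a = producers[ai];
                bool overlap = a.out_begin < b.in_end && b.in_begin < a.out_end;
                if (overlap) b.deps.push_back((int)ai);
            }
    }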
I tried to be smart and cache intermediate results that were shared by multiple kernels.
When the results were published I was stumped to see that others were orders of magnitude faster than me.
Turns out they didn't bother with caching at all. The overhead of recalculating everything a thousand times was tiny compared to the overhead of doing roundtrips through RAM.
I assume it's the same thing here. By compiling into MegaKernels, layer boundaries are squashed. There likely will be _more_ calculations and less shared intermediate results. But overall it's still a win due to less memory roundtrips.
There has to be a sweet spot, especially for convolution networks. No idea if the MegaKernel takes this into account.
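Something like the difference between these two variants (illustrative only, not the actual contest code):

    __device__ float intermediate(float x) { return x * x + 1.0f; }  // cheap to recompute

    // Variant A: the intermediate round-trips through DRAM between two launches.
    __global__ void stage1(const float* in, float* tmp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = intermediate(in[i]);        // write to global memory
    }
    __global__ void stage2(const float* tmp, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * tmp[i];              // read it back from global memory
    }

    // Variant B: fused, the intermediate never leaves registers, even if more than
    // one consumer ends up recomputing it.
    __global__ void fused(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * intermediate(in[i]); // recompute, skip the DRAM trip
    }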
We do some on-the-fly optimizations as well (like compiling into CUDA graphs or fusing together calls), which for some inference engines ends up resulting in faster token throughput too.
1. For tasks like autocomplete, keyword routing, or voice transcription, what would the latency and power savings look like on an ASIC vs. even a megakernel GPU setup? Would that justify a fixed-function approach in edge devices or embedded systems?
2. ASICs obviously kill retraining, but could we envision a hybrid setup where a base model is hardwired and a small, soft, learnable module (e.g., LoRA-style residual layers) runs on a general-purpose co-processor?
3. Would the transformer’s fixed topology lend itself to spatial reuse in ASIC design, or is the model’s size (e.g. GPT-3-class) still prohibitive without aggressive weight pruning or quantization?
Maybe we should think of them like transistors? Right now, we are at the point where we have a room-sized computer that can do multiplication from punch card input.
It is fun to imagine what we could do if we ran, say, 1 million coordinated o3-pro queries at once.