Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels

https://gimletlabs.ai/blog/ai-generated-metal-kernels

186•nserrino•3d ago

Comments

pbronez•3d ago

This is pretty cool.

I initially thought they were writing custom kernels for proprietary models like GPT-5. They aren't - they're using proprietary models to write kernels for a set of ~250 open Pytorch modules.

simlevesque•3d ago

Are these kernel available ? I'd love to try that !

magicalist•2d ago

Feels like they should have released some code, yeah, but gpt5 success rate was high enough that it looks like you can just pass the kernels they got from https://github.com/ScalingIntelligence/KernelBench/ to gpt5 (with up to five rounds of feeding back compilation/correctness errors back to the model) and get the results yourself.

earthnail•3d ago

This is amazing. I wouldn't have thought that AI is this good in niche topics. Very impressive experiment and write up.

Still, I can't help but think we should bet on sth like Mojo instead for the long run.

ipsum2•3d ago

Mojo is a terrible language and its main feature (GPU acceleration through Mojo max) is closed source and requires a commercial license to be purchased.

earthnail•2d ago

Why is it a terrible language? Genuinely curious question.

rovr138•2d ago

A language where a main feature, Gpu acceleration, is being behind a commercial license is a pretty big issue.

Not on language design, but in the ecosystem and other areas.

moelf•2d ago

>I can't help but think we should bet on sth like Mojo instead for the long run.

JAX or Julia hopefully.

nikolayasdf123•3d ago

> non 100% correctness of kernels

wouldn't model not work properly if kernels are even slightly off?

wasn't kernels a part of training stack for models? am I missing anything?

ymsodev•3d ago

The article is referring to GPU compute kernel (https://en.wikipedia.org/wiki/Compute_kernel), not the term kernel used in ML/NN/etc.

saagarjha•2d ago

…aren't they the same thing

ymsodev•23h ago

They're not, but I also misunderstood the original question, they're referring to the correct definition of kernel. I thought they were confusing the GPU kernel with https://en.wikipedia.org/wiki/Kernel_method or https://en.wikipedia.org/wiki/Kernel_(image_processing)

arjvik•2d ago

I believe their speedup is computed _assuming they can easily fix the correctness bugs in the kernels_.

In practice, with slight differences the model will feel almost lobotomized.

turbo_wombat•3d ago

They are comparing unoptimized PyTorch inference, something you would never deploy on a device, to a model with custom kernels.

Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.

Generally, PyTorch inference is meant to be used during the training process, and when running metrics, not when deploying. When deployed, you should export to ONNX, and then compile the ONNX to the native format of the device.

If you aren't familiar with the pipeline for ML deployment, this is the equivalent of comparing interpreted code to compiled code.

nserrino•2d ago

PyTorch is the baseline because that's what people prototype in, and the most common reference point. The aim here is to show that you can start from prototype code and automatically produce lower-level kernels (in this case Metal) that are more usable in real deployments, without additional work from the developer. Frontier models are capable at generating efficient Metal kernels automatically/immediately, and will only get better. We expect to see significant improvements as we refine the approach, but it's enough to show this seems to be a tractable problem for AI.

CapsAdmin•2d ago

I have never really worked with pytorch professionally, but it feels to me a lot of the open source, especially generative oriented projects, just use pytorch like this. It makes hacking on the models a whole lot easier.

comfyui is a good example of a project like this.

airforce1•2d ago

> and then compile the ONNX to the native format of the device.

I'm assuming you are talking about https://github.com/onnx/onnx-mlir?

In your experience, how much faster is a "compiled" onnx model vs. using an onnx runtime?

dapperdrake•2d ago

For other people reading this:

Back in the day TensorFlow had tfdeploy which compiled TensorFlow terms into NumPy matrix operations. Our synthetic tests saw speedups of factor 50.

yieldcrv•2d ago

> Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.

But that’s the thing, I wouldn’t write a custom kernel before AI

I don't do that level of development or operate at that part of the stack but I’m very experienced in software development

AI significantly augments my skillsets in this area

am17an•2d ago

The point is those kernels exist already, you can just use them off the shelf. In the case where you're trying to write a production grade kernel without operating at that part of the stack... well good luck with that.

spott•1d ago

vLLM is a LLM model serving framework written using raw PyTorch.

ONNX doesn’t support a bunch of operations that PyTorch does (it isn’t always possible to convert a PyTorch model to ONNX).

Torchserve runs raw PyTorch.

Generally speaking, PyTorch is pretty well optimized. For Mac it has been historically ignored, so the kernels for MPS were all missing or just bad, but on CUDA and Linux they are pretty good.

Tiberium•2d ago

It'd be curious to see how those AI generated kernels compare to kernels generated by https://github.com/tinygrad/tinygrad

xiphias2•2d ago

As they wrote most of the wins are because of fusion and TimyGrad started to have fusion optimizations in the last few weeks.

GeoHot didn't want to make it only FlashAttention specific, he worked on FlashAttenrion being automatically generated by the optimizer. It's going well

magicalist•2d ago

This is cool it works so well (less the performance claims and more the correct translations).

Not to take away from the nice writeup, but for anyone not getting far enough into the writeup, this is essentially taking https://github.com/ScalingIntelligence/KernelBench and seeing if it can generate Metal kernels in addition to the CUDA kernels the benchmark is written for. The dataset was released in November 2024, it looks like, with a paper on arXiv in February and a bunch of discussion at the time[1], so worth keeping likelihood of inclusion in training data in mind when comparing models.

The different levels are interesting. Level 1 and 3 are successfully (5-shot) translated to Metal kernels by gpt5 97% and 88% of the time , but in both cases, the majority of generated kernels are slower than the reference compiled pytorch versions. The speculation about more simple op fusion opportunities in the Level 2 kernels vs the very simple Level 1 kernels and the complex architecture Level 3 kernels seems plausible. From the KernelBench paper, it looks like Level 2 kernels were mostly automatically generated from randomly picking operators and then getting an LLM to generate a kernel combining them, while Level 1 were mostly hand written and Level 3 came from well-known ML architectures.

The swarm part seemed a bit of a stretch. They fired off requests to 8 different models to do the translation, and the "supervisor" benchmarked the returned kernels and picked the fastest one. Technically a swarm, I guess, but feels like we're devaluing the term :)

The correctness testing used made my eye twitch a bit:

> We tested the generated kernel's output against the default implementation's output on 100 random inputs. We set a 0.01 tolerance for both relative and absolute. Let a be the generated kernel output, and b be the reference kernel output. Outputs were considered equal if for every element in the output, absolute(a - b) ≤ (atol + rtol absolute(b)) held true.*

For a numerical kernel, this seems way too loose, but turns out those bounds come straight from KernelBench, which only tested for correctness on 5 random inputs by default in their harness, not the 100 they used here. KernelBench mentions the clear tradeoff they get between how strictly they define correctness and kernel performance, but for Level 1 kernels in particular, which are really just single operations, it seems like the bounds should be multiple orders of magnitude smaller to ensure robust translation. For instance, the all 0s "optimization" mentioned in the writeup allowing for trivially translating the kernel looks like it's due to those loose tolerances[2] and KernelBench was looking to make the evaluation more robust.

[1] Like https://metr.org/blog/2025-02-14-measuring-automated-kernel-...

[2] https://github.com/ScalingIntelligence/KernelBench/pull/25

formalsystem•2d ago

I work on PyTorch and there are many things that make me suspicious about these results. My TL;DR is unless we get a zip file of all the kernels with how they're benchmarked results like this are almost impossible to verify

1. I don't have an M4 but I have an M1 Pro and I tried running the claimed 18x speedup VisionAttention attention example and I get close to identical runtimes. This example has more issues the main optimization the LLM is doing is a fusion and so not comparing to torch.compile is a bit sus. The numerics are off as well and I suspect the atols were way too big. Finally MultiHeadAttention is a deprecated API so using neither SDPA or torch.compile is a weird choice

2. In general 18x (and even some 100x speedups claimed near the end) are just a smell that some kernel is incorrect, the typical way you can get speedups like this is you don't warmup or you forget to synchronize. PyTorch has a lot of benchmarking footguns which is why sharing the exact eval scripts is helpful

3. Speaking of footguns, the shapes I saw in the examples were tiny, in that regime you're more often measuring noise as the primary bottleneck is not compute or memory but overhead

4. Generating many random shapes is also not so safe, some input distributions can make certain kernels trivial for example torch.randn() by default generates samples from a normal distribution with mean 0 and variance 1 and so if you take the mean of a large vector you're almost guaranteed to just get 0 esp if your tolerance is too high

5. KernelBench levels measure vastly different things and if you want to compare to PyTorch operators you want to focus on Level 1, Level 2 is fusions and so the right baseline is torch.compile and more reliable on nightlies. The Mamba 2 example (which I didn't run) also acknowledges that the primary thing it does is fusions which assuming everything is correct would still be strange to baseline vs eager

So please for everyone's sanity if you find a kernel that's 10-100x faster please share the exact code and benchmarking methodology to your smartest performance friends, you should be extremely skeptical of such results often you can discard some numbers based on a simple speed of light analysis. We all desperately want faster kernels but to get them we have to be really fanatical about correctness.

nserrino•2d ago

Hey, thanks for the thoughtful comments. A lot of big claims have been made in this area so skepticism is the right default reaction. tl;dr: agree that we should provide the kernels and benchmark suite so this can be evaluated by others, will follow up with that.

A few clarifications:

1. Baselines - We didn't compare to torch.compile because as of PyTorch 2.7, torch.compile doesn't support the MPS backend, and we ran into some issues on many of the problems when using it. GitHub issue: https://github.com/pytorch/pytorch/issues/150121. Once it's supported, it will be the obvious baseline.

2. Methodology - We followed KernelBench’s protocol to establish a baseline on Metal, adding more correctness checks. Warmup and synchronization were done. We recognize the limitations here and are expanding the validation suite.

3. Optimizations - Right now most of the optimizations are fusions, but there is some use of Metal-specific primitives/optimizations. We expect as we make the supervisor more sophisticated, the novelty of the optimized kernels will also increase.

Overall the goal here is to get some % of the benefit of a human expert in kernel engineering, without developer effort. Compiler-based optimizations are great, but hand-tuned implementations are still common for performance-critical models. The hope is that we can automate some of that process.

rovr138•2d ago

Any insight on the torch.compile issue posted?

Like you said, would be great to test against it.

syntaxing•2d ago

This is gonna be a silly question but what does “kernel” mean in this context. I thought it meant like a Linux kernel module but doesn’t seem to be?

tuckerman•2d ago

A kernel is low level function that is going to run in parallel on your accelerator (hopefully efficiently). You will have various matmuls, convolutions, etc.

If you search CUDA kernel you’d find examples but the term was used in HPC before as well.

drdrey•2d ago

https://en.wikipedia.org/wiki/Compute_kernel

Show HN: Greppers – fast CLI cheat sheet with instant copy and shareable search

Oldest recorded transaction

Qwen3 30B A3B Hits 13 token/s on 4xRaspberry Pi 5

We hacked Burger King: How auth bypass led to drive-thru audio surveillance

The maths you need to start understanding LLMs

Using Claude Code SDK to reduce E2E test time

Anthropic agrees to pay $1.5B to settle lawsuit with book authors

Processing Piano Tutorial Videos in the Browser

The World War Two bomber that cost more than the atomic bomb

AI surveillance should be banned while there is still time

Why language models hallucinate

Europe enters the exascale supercomputing league with Jupiter

The life-changing Sarah Paine framework

Baby's first type checker

Normalization of deviance (2015)

Rug pulls, forks, and open-source feudalism

Our love letter to Internet Relay Chat [video]

GigaByte CXL memory expansion card with up to 512GB DRAM

Speeding up Unreal Editor launch by not spawning unused tooltips

AI hype is crashing into reality. Stay calm

Kenvue stock drops on report RFK Jr will link autism to Tylenol during pregnancy

Video Game Blurs (and how the best one works)

996

A Software Development Methodology for Disciplined LLM Collaboration

The repercussions of missing an Ampersand in C++ and Rust

Purposeful animations

The Universe Within 12.5 Light Years

Novel hollow-core optical fiber transmits data faster with record low loss

Patterns, Predictions, and Actions – A story about machine learning

GLM 4.5 with Claude Code