
AI Is Writing Its Own Kernels, and They Are 17x Faster

https://adrs-ucb.notion.site/autocomp
62•accheng•2h ago

Comments

qat321•2h ago
I wonder if these results extend beyond AWS Trainium?
taqpos•2h ago
This post unintentionally highlights exactly why NVIDIA is untouchable. If you need a farm of H100s running GPT-5 just to figure out how to program Amazon's Trainium chip efficiently, the hardware abstraction is fundamentally broken.
CobbledSteel•1h ago
I'd argue the logic goes the other way: if all it takes to get highly performant kernels is to rent a GPU farm, that seems to undercut the years and millions of engineering hours required to build NVIDIA's SW infrastructure. High hopes for smaller players now.
archipelago123•42m ago
The fact that nobody cared to optimize kernels for these hardware platforms proves Nvidia's CUDA moat, especially now that squeezing out performance has become so important for serving inference. Hardware ISA is broken => nobody knows how to program the hardware => unoptimized kernels => nobody will use your hardware. Also, bad baselines present easy opportunities for LLMs to optimize against. Indeed, the kernel that achieved the 17x speedup seems to be a conv1d, which AWS could not care less about optimizing.
pos456•2h ago
Calling beam search 'AI' is doing a lot of heavy lifting here. This is just superoptimization with a very expensive heuristic function.
jryio•32m ago
That's correct. However, as other commenters have noted, doing this by hand is extremely challenging for human engineers working on tensor kernels.

The expense calculation might be

expense of improvement = (time taken per optimization step * cost of unit time) / (speedup - 1)

The expensive heuristic function saves wall time while also being cheaper per unit of compute time. And as the paper shows, the speedup gained per unit of time, multiplied by the unit cost of time, is large.
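
In code, with purely hypothetical numbers (nothing here is from the paper):

    # Back-of-the-envelope version of the formula above; every number
    # below is made up for illustration.
    def expense_of_improvement(step_seconds, cost_per_second, speedup):
        # cost paid per optimization step, amortized over the gain
        return (step_seconds * cost_per_second) / (speedup - 1)

    # e.g. 60s per step on $2/hr of compute, chasing a 17x kernel:
    print(expense_of_improvement(60, 2 / 3600, 17))  # ~0.002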

greeravoctado•6m ago
Usually the rate of overall improvement for this type of optimization is lower than the Moore's-law rate of improvement, and thus not worth the company's investment. 17x micro-benchmarks don't count. Real improvements come from architectural changes, for example: MoE, speculative multi-token prediction, etc.
igorpcosta•2h ago
Very interesting research; keen to collab with you folks. I've been building a few experiments for old GTX GPUs to extend their lifetime while matching token performance for Smol models. igor [] autohand.ai, let's chat.
quc1k•1h ago
I really appreciate the focus on interpretability. Usually, super-optimizers give you a blob of assembly that runs fast but is impossible to debug or maintain. By forcing the model to output a natural language 'Plan' first, you essentially get documentation for free. If the code breaks later, you can look at the plan to understand why the loop was unrolled or why the memory was laid out that way. That makes this actually usable in a production CI/CD pipeline, unlike most black-box ML optimizations.
kap901•21m ago
manually writing tiling logic for systolic arrays is the absolute worst. if this actually works it saves me so much headache.
pakt1•1h ago
Trainium has always been a black box to me compared to GPUs. Seeing an automated tool reverse-engineer the best way to use the VectorEngine vs the TensorEngine is fascinating. It reveals just how much performance is left on the table by standard compilers.
matll•1h ago
As someone who spent the better part of last year trying to hand-tune kernels for a niche accelerator (not Trainium, but similar vibe), this honestly looks like a dream.

The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend 3 days debugging a race condition just to find out you got a 2% speedup.

If this tool can actually handle the 'grunt work' of generating the tiling logic and memory moves based on a high-level plan, that's a game changer. I don't even care about the 17x number as much as I care about the '0 to 1' speed. Getting any performant kernel running on new hardware usually takes weeks. If this cuts it down to a few hours of LLM churning, that's huge for the industry.
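
For anyone who hasn't done this work, the shape of the problem is roughly the following toy Python stand-in. There's no real DMA engine or scratchpad here, just the loop structure that makes it so easy to step on your own toes:

    import numpy as np

    # Toy double-buffered tile loop. On real hardware the prefetch is an
    # async DMA and the buffer swap needs explicit synchronization;
    # that's exactly where the 3-day race conditions live.
    def process(tile):
        return tile * 2  # stand-in for the actual kernel math

    def tiled_kernel(data, tile_size):
        tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
        out = []
        buf = tiles[0]  # "DMA in" the first tile
        for i in range(len(tiles)):
            nxt = tiles[i + 1] if i + 1 < len(tiles) else None  # prefetch next tile
            out.append(process(buf))  # compute would overlap the prefetch
            buf = nxt  # buffer swap; on real HW this needs a barrier
        return np.concatenate(out)

    print(tiled_kernel(np.arange(8), 4))  # [ 0  2  4  6  8 10 12 14]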

simonw•53m ago
Optimization work sounds like it might be a really good fit for coding agents. If you can provide a robust test which "proves" the implementation works, the actual work of increasing its performance is the kind of thing a coding agent could run in a loop, testing each optimization to see if the tests still pass and it runs faster.
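
A minimal sketch of that loop; run_tests, benchmark, and propose_patch are hypothetical stand-ins for the real harness:

    import random

    # Hypothetical stand-ins: a correctness suite, a timer, and a model
    # proposing a patched kernel.
    def run_tests(kernel):
        return True

    def benchmark(kernel):
        return random.random()

    def propose_patch(kernel):
        return kernel + " (tweaked)"

    def optimize(kernel, iterations=100):
        best, best_time = kernel, benchmark(kernel)
        for _ in range(iterations):
            candidate = propose_patch(best)  # ask the agent for one optimization
            if not run_tests(candidate):     # reject anything that breaks the tests
                continue
            t = benchmark(candidate)
            if t < best_time:                # keep only measured wins
                best, best_time = candidate, t
        return best

    print(optimize("baseline kernel"))
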
whynotmaybe•29m ago
But we might end up with "works on my infrastructure" optimizations that would be hard to reproduce.

Like that research that evolved an FPGA where some unconnected parts were crucial for the expected behaviour.

https://www.eetimes.com/whatever-happened-to-evolvable-hardw...

mholm•23m ago
Adding a few diverse hardware environments for testing would mitigate this. Many companies wouldn't have any issue with infrastructure-specific optimizations either. (Part of) DeepSeek's big advantage over their Chinese competitors was their intelligent use of the hardware, after all.
chanwutk•1h ago
Very interesting read!
melissapan•1h ago
ADRS <> Compiler: what if your “compiler” could think?
dksgmlwo•1h ago
Fascinating. Having worked as a kernel engineer before, I know how impactful it is to reduce the initial exploration overhead. It can save a huge amount of the grunt work engineers typically have to do.
maltese669•1h ago
ngl letting AI fiddle with the kernel sounds scary but the results are really impressive
yrh•1h ago
Interesting read. I think the more "whitebox" approach with a laid-out menu to choose from makes the resulting kernel more trustworthy, although it does raise the question of whether occasionally going outside the predefined optimization steps might yield insights.
measurablefunc•1h ago
I wonder if this type of work can be applied towards translating kernels between GPU vendors, e.g. CUDA → AMD. Does anyone know if that's possible or whether that kind of problem is AGI-complete?
UncleOxidant•51m ago
It seems like it could be possible now with a bit of work. I don't think that it would require AGI. Didn't AMD have (or fund) something like this and then decide not to pursue it further recently? It was called HIP. There's also ZLUDA https://www.blopig.com/blog/2024/03/an-open-source-cuda-for-...
measurablefunc•40m ago
Very interesting.
jryio•27m ago
There's a higher level of abstraction

https://www.modular.com/mojo

measurablefunc•15m ago
So if CUDA could be ported to Mojo w/ AI, then it would basically be available for any GPU/accelerator vendor. Seems like the right kind of approach towards making CUDA a non-issue.
jryio•1h ago
paper: https://arxiv.org/abs/2505.18574
syngrog66•59m ago
AI has told me that Biden was preparing for his upcoming debate with Trump. It told me that in May 2025.

AI has told me it's not raining in my city and that in fact there was a 0% chance of it that day. As I was looking out my open front door watching a heavy downpour.

DroneBetter•29m ago
that is an indictment of the implementations, not the fundamental limits of the architecture; most commercial LLMs now have web-searching available by default and can do both of those things, but couldn't when they were confined to the user's prompt and their training data (which was often not quite contemporary, until recently)
comrade1234•58m ago
This is completely believable and you should invest in this technology.
DroneBetter•41m ago
I can't tell whether you're trying to convince humans, parody someone who might be, or give superficial sentiment for automated traders' webscrapers to be influenced by
oceansky•21m ago
I think he's just being extremely ironic, meaning the exact opposite of what he actually says.
cornonthecobra•15m ago
or they left the /s off and it's a remark about how the fine article sounds more like hype-machine emesis than legitimate, substantive research
UncleOxidant•52m ago
Was in a startup where we were trying to do this (our tagline was "using AI to make AI run faster and more efficiently"). But we ran out of funding at the end of '22 :(

We were just a little early, I think.

accheng•41m ago
Interesting, did you have any learnings that would apply to this problem now?
karek•45m ago
usually i scroll past these 'LLM optimizes code' posts bc 99% of them are just finding basic peephole optimizations that -O3 would've caught anyway. but looking at the conv1d example in the blog, this is actually doing real architectural changes.

the 'dropout' on the optimization menu is a pretty neat hack. kinda reminds me how i work when im stuck... 'ok what if i dont unroll this loop, what else can i do?'. forces the search out of local minima. nice to see an AI tool designed around verification (the simulator loop) rather than just hoping the llm guesses right on the first shot.
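
im guessing the mechanic looks something like this (menu entries invented, not from the blog):

    import random

    # rough guess at the menu 'dropout': hide a random subset of
    # optimizations each round so the search can't keep reaching for the
    # same local-minimum move.
    MENU = ["unroll", "tile", "vectorize", "swap_engines", "reorder_dma"]

    def sample_menu(drop_rate=0.4):
        kept = [opt for opt in MENU if random.random() > drop_rate]
        return kept or random.sample(MENU, 1)  # never hand back an empty menu

    print(sample_menu())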

bvcasd•42m ago
having an agent that looks at the error + the isa spec and tries a fix automatically is worth its weight in gold. turns a frustrating 'read the docs for 2 hours' session into a 5 min background task. that's the kind of QoL stuff that actually gets devs to adopt this. how close is this to being used in production?
dataeaa•36m ago
Crazy that it beat the hand-tuned Amazon kernels. Really shows how early we still are with these software stacks.

What are the risks of using these kinds of tools though? Did you get any tricky/silent bugs you had to manually fix?

bgwalter•32m ago
So, Trainium is an architecture that requires brute force to write software for.

Maybe if we invest $100 trillion in data centers, we can rewrite the Linux Kernel in Malbolge.

mavt6•31m ago
Love the concept of using AI to make the hardware run AI faster. Feels like we're finally closing the loop on this stuff!
jryio•30m ago
Chris Lattner of Apple's Swift and Tesla fame is running a company entirely predicated on this, but at the deterministic language-design level rather than the inference level.

https://www.modular.com/mojo

If a beam search with an iterative plan-and-execute phase is more effective than better tooling in a deterministic programming language, then this will clearly take the lead.

accheng•24m ago
Thanks for the link! I'm not familiar with the company, but it reminds me of the whole formal-methods debate in distributed systems. Sure, writing TLA+ specs is the 'correct' deterministic way to build a Raft implementation, but in reality everyone just writes messy Go/Java and patches bugs as they pop up because it's faster.
maven5t•19m ago
tried using NKI a few months ago and the docs were rough. having the LLM just figure it out from the ISA spec is honestly genius
dfdsfds•17m ago
Very impressive results! Will be curious to see how correctness is guaranteed and what kind of failures are normal from the LLM-generated code

Nano Banana Pro

https://blog.google/technology/ai/nano-banana-pro/
782•meetpateltech•9h ago•494 comments

Android and iPhone users can now share files, starting with the Pixel 10

https://blog.google/products/android/quick-share-airdrop/
377•abraham•7h ago•258 comments

New Glenn Update

https://www.blueorigin.com/news/new-glenn-upgraded-engines-subcooled-components-drive-enhanced-pe...
89•rbanffy•3h ago•37 comments

New OS aims to provide (some) compatibility with macOS

https://github.com/ravynsoft/ravynos
101•kasajian•4h ago•38 comments

Over-Regulation Is Doubling the Cost by Peter Reinhardt

https://rein.pk/over-regulation-is-doubling-the-cost
23•bilsbie•1h ago•16 comments

FEX-emu – run x86 applications on ARM64 Linux devices

https://fex-emu.com/
22•open-paren•1w ago•3 comments

Data-at-Rest Encryption in DuckDB

https://duckdb.org/2025/11/19/encryption-in-duckdb
107•chmaynard•5h ago•15 comments

GitHut – Programming Languages and GitHub (2014)

https://githut.info/
43•tonyhb•3h ago•16 comments

NTSB Preliminary Report – UPS Boeing MD-11F Crash [pdf]

https://www.ntsb.gov/Documents/Prelimiary%20Report%20DCA26MA024.pdf
119•gregsadetsky•6h ago•143 comments

The Lions Operating System

https://lionsos.org
106•plunderer•6h ago•20 comments

Readonly Characters Are a Big Deal

https://matklad.github.io/2025/11/10/readonly-characters.html
25•vinhnx•1w ago•2 comments

Okta's NextJS-0auth troubles

https://joshua.hu/ai-slop-okta-nextjs-0auth-security-vulnerability
205•ramimac•2d ago•73 comments

Microsoft makes Zork open-source

https://opensource.microsoft.com/blog/2025/11/20/preserving-code-that-shaped-generations-zork-i-i...
388•tabletcorry•6h ago•168 comments

Launch HN: Poly (YC S22) – Cursor for Files

39•aabhay•6h ago•41 comments

Free interactive tool that shows you how PCIe lanes work on motherboards

https://mobomaps.com
132•tagyro•1d ago•23 comments

Show HN: F32 – An Extremely Small ESP32 Board

https://github.com/PegorK/f32
172•pegor•1d ago•26 comments

Adversarial poetry as a universal single-turn jailbreak mechanism in LLMs

https://arxiv.org/abs/2511.15304
229•capgre•12h ago•120 comments

Prozac 'no better than placebo' for treating children with depression, experts say

https://www.theguardian.com/society/2025/nov/20/prozac-no-better-than-placebo-for-treating-childr...
11•pseudolus•33m ago•1 comments

Run Docker containers natively in Proxmox 9.1 (OCI images)

https://raymii.org/s/tutorials/Finally_run_Docker_containers_natively_in_Proxmox_9.1.html
90•jandeboevrie•3h ago•27 comments

Show HN: My hobby OS that runs Minecraft

https://astral-os.org/posts/2025/10/31/astral-minecraft.html
114•avaliosdev•3d ago•15 comments

OOP is shifting between domains, not disappearing

https://blog.jsbarretto.com/post/actors
49•ibobev•4h ago•95 comments

Kagi Assistants

https://blog.kagi.com/kagi-assistants
119•ingve•4h ago•60 comments

Interactive World History Atlas Since 3000 BC

http://geacron.com/home-en/
282•not_knuth•14h ago•126 comments

Freer Monads, More Extensible Effects (2015) [pdf]

https://okmij.org/ftp/Haskell/extensible/more.pdf
74•todsacerdoti•9h ago•16 comments

What's in a Passenger Name Record (PNR)? (2013)

https://hasbrouck.org/articles/PNR.html
53•rzk•4d ago•13 comments

Mozilla says it's finally done with Onerep

https://krebsonsecurity.com/2025/11/mozilla-says-its-finally-done-with-two-faced-onerep/
96•todsacerdoti•5h ago•57 comments

Red Alert 2 in web browser

https://chronodivide.com/
386•nsoonhui•12h ago•126 comments

Two recently found works of J.S. Bach presented in Leipzig [video]

https://www.youtube.com/watch?v=4hXzUGYIL9M#t=15m19s
88•Archelaos•3d ago•69 comments

France is taking state actions against GrapheneOS?

https://grapheneos.social/@GrapheneOS/115584160910016309
124•gabrielgio•1h ago•55 comments

Go Cryptography State of the Union

https://words.filippo.io/2025-state/
119•ingve•7h ago•46 comments