The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend 3 days debugging a race condition just to find out you got a 2% speedup.
If this tool can actually handle the 'grunt work' of generating the tiling logic and memory moves from a high-level plan, that's a game changer. I don't even care about the 17x number as much as I care about the '0 to 1' speed: getting any performant kernel running on new hardware usually takes weeks. If this cuts that down to a few hours of LLM churning, it's huge for the industry.
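For anyone who hasn't done this by hand: the race-prone pattern is usually double-buffered tiling, where you kick off the async DMA for tile i+1 while the compute engine is still working on tile i. A toy sketch in plain Python of where it goes wrong if you drop a wait (dma_start/dma_wait are made-up stand-ins here, not any real accelerator API):

    import numpy as np

    TILE = 128

    def dma_start(dst, src):
        # Stand-in for an async DMA call; here it just copies eagerly.
        dst[...] = src
        return object()  # fake completion handle

    def dma_wait(handle):
        # Stand-in: a real kernel must block here before reading the buffer.
        pass

    def tiled_sum(x):
        # Two scratchpad buffers so compute on tile i overlaps the load of tile i+1.
        scratch = [np.empty(TILE, x.dtype), np.empty(TILE, x.dtype)]
        n_tiles = len(x) // TILE
        acc = 0.0
        handle = dma_start(scratch[0], x[:TILE])  # prefetch tile 0
        for i in range(n_tiles):
            dma_wait(handle)  # skip this and you read a half-filled buffer -> race
            cur = scratch[i % 2]
            if i + 1 < n_tiles:
                handle = dma_start(scratch[(i + 1) % 2], x[(i + 1) * TILE:(i + 2) * TILE])
            acc += float(cur.sum())  # "compute" on the current tile
        return acc

    print(tiled_sum(np.arange(1024, dtype=np.float32)))

The bookkeeping is trivial in a toy like this; it stops being trivial when you have a dozen buffers, multiple DMA queues, and a profiler telling you the engine stalled somewhere.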
qat321•1h ago
charleshong•59m ago
See our paper: https://arxiv.org/abs/2505.18574
And our prior blog posts: https://charleshong3.github.io/blog/
gfhsad•54m ago
charleshong•52m ago
Also, the 17x came from a pretty obscure fusion optimization that isn't called out anywhere in the documentation (we had to run the profiler to see what was actually going on). Wouldn't be surprised if whoever at AWS wrote the kernel didn't know about it.
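For readers outside the kernel world: "fusion" here just means combining two logically separate ops into one pass over the data so the intermediate never round-trips through slower memory. A generic NumPy toy of the idea, not the actual optimization the profiler surfaced:

    import numpy as np

    x = np.random.rand(1024, 1024).astype(np.float32)
    w = np.random.rand(1024, 1024).astype(np.float32)

    def unfused(x, w):
        # Matmul materializes the full intermediate, then a second pass
        # re-reads all of it to apply bias + ReLU.
        y = x @ w
        return np.maximum(y + 1.0, 0.0)

    def fused(x, w, tile=256):
        # "Fused": the epilogue is applied tile by tile while each block is
        # still hot, so the full intermediate never exists as a whole array.
        out = np.empty((x.shape[0], w.shape[1]), dtype=np.float32)
        for i in range(0, x.shape[0], tile):
            blk = x[i:i + tile] @ w
            out[i:i + tile] = np.maximum(blk + 1.0, 0.0)
        return out

    assert np.allclose(unfused(x, w), fused(x, w), rtol=1e-3, atol=1e-3)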
snklt•43m ago