
AI Is Writing Its Own Kernels, and They Are 17x Faster

https://adrs-ucb.notion.site/autocomp
62•accheng•2h ago

Comments

qat321•2h ago
I wonder if these results extend beyond AWS Trainium?
taqpos•2h ago
This post unintentionally highlights exactly why NVIDIA is untouchable. If you need a farm of H100s running GPT-5 just to figure out how to program Amazon's Trainium chip efficiently, the hardware abstraction is fundamentally broken.
CobbledSteel•1h ago
I'd argue the logic goes the other way: if all it takes to get highly performant kernels is to rent a GPU farm, that seems to undercut the years and millions of engineering hours required to build NVIDIA's SW infrastructure. High hopes for smaller players now.
archipelago123•42m ago
The fact that nobody cared to optimize kernels for these hardware platforms proves Nvidia's CUDA moat, especially now that squeezing out performance has become so important for serving inference. Hardware ISA is broken => nobody knows how to program the hardware => unoptimized kernels => nobody will use your hardware. Also, bad baselines present easy opportunities for LLMs to optimize against. Indeed, the kernel that achieved the 17x speedup seems to be a conv1d, which AWS could not care less about optimizing.
pos456•2h ago
Calling beam search 'AI' is doing a lot of heavy lifting here. This is just superoptimization with a very expensive heuristic function.
jryio•32m ago
That's correct. However, as other commenters have noted, doing this by hand is extremely challenging for human engineers working on tensor kernels.

The expense calculation might be

expense of improvement = (time taken per optimization step * cost of unit time) / (speedup - 1)

The expensive heuristic function saves wall time while also being cheaper per unit of compute time. And as the paper shows, the speedup gained per unit of time, multiplied by the unit cost of time, is large.
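
In code, with purely hypothetical numbers (nothing here is from the paper):

    # Back-of-the-envelope version of the formula above; every number
    # below is made up for illustration.
    def expense_of_improvement(step_seconds, cost_per_second, speedup):
        # cost paid per optimization step, amortized over the gain
        return (step_seconds * cost_per_second) / (speedup - 1)

    # e.g. 60s per step on $2/hr of compute, chasing a 17x kernel:
    print(expense_of_improvement(60, 2 / 3600, 17))  # ~0.002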

greeravoctado•6m ago
Usually the rate of overall improvement for this type of optimization is lower than the Moore's-law rate of improvement, and thus not worth the company's investment. 17x micro-benchmarks don't count. Real improvements come from architectural changes, for example: MoE, speculative multi-token prediction, etc.
igorpcosta•2h ago
Very interesting research; keen to collab with you folks. I've been building a few experiments for old GTX GPUs to extend their lifetime while matching token performance for Smol models. igor [] autohand.ai, let's chat.
quc1k•1h ago
I really appreciate the focus on interpretability. Usually, super-optimizers give you a blob of assembly that runs fast but is impossible to debug or maintain. By forcing the model to output a natural language 'Plan' first, you essentially get documentation for free. If the code breaks later, you can look at the plan to understand why the loop was unrolled or why the memory was laid out that way. That makes this actually usable in a production CI/CD pipeline, unlike most black-box ML optimizations.
kap901•21m ago
manually writing tiling logic for systolic arrays is the absolute worst. if this actually works it saves me so much headache.
pakt1•1h ago
Trainium has always been a black box to me compared to GPUs. Seeing an automated tool reverse-engineer the best way to use the VectorEngine vs the TensorEngine is fascinating. It reveals just how much performance is left on the table by standard compilers.
matll•1h ago
As someone who spent the better part of last year trying to hand-tune kernels for a niche accelerator (not Trainium, but similar vibe), this honestly looks like a dream.

The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend 3 days debugging a race condition just to find out you got a 2% speedup.

If this tool can actually handle the 'grunt work' of generating the tiling logic and memory moves based on a high-level plan, that's a game changer. I don't even care about the 17x number as much as I care about the '0 to 1' speed. Getting any performant kernel running on new hardware usually takes weeks. If this cuts it down to a few hours of LLM churning, that's huge for the industry.
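
For anyone who hasn't done this work, the shape of the problem is roughly the following toy Python stand-in. There's no real DMA engine or scratchpad here, just the loop structure that makes it so easy to step on your own toes:

    import numpy as np

    # Toy double-buffered tile loop. On real hardware the prefetch is an
    # async DMA and the buffer swap needs explicit synchronization;
    # that's exactly where the 3-day race conditions live.
    def process(tile):
        return tile * 2  # stand-in for the actual kernel math

    def tiled_kernel(data, tile_size):
        tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
        out = []
        buf = tiles[0]  # "DMA in" the first tile
        for i in range(len(tiles)):
            nxt = tiles[i + 1] if i + 1 < len(tiles) else None  # prefetch next tile
            out.append(process(buf))  # compute would overlap the prefetch
            buf = nxt  # buffer swap; on real HW this needs a barrier
        return np.concatenate(out)

    print(tiled_kernel(np.arange(8), 4))  # [ 0  2  4  6  8 10 12 14]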

simonw•53m ago
Optimization work sounds like it might be a really good fit for coding agents. If you can provide a robust test which "proves" the implementation works, the actual work of increasing its performance is the kind of thing a coding agent could run in a loop, testing each optimization to see if the tests still pass and it runs faster.
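
A minimal sketch of that loop; run_tests, benchmark, and propose_patch are hypothetical stand-ins for the real harness:

    import random

    # Hypothetical stand-ins: a correctness suite, a timer, and a model
    # proposing a patched kernel.
    def run_tests(kernel):
        return True

    def benchmark(kernel):
        return random.random()

    def propose_patch(kernel):
        return kernel + " (tweaked)"

    def optimize(kernel, iterations=100):
        best, best_time = kernel, benchmark(kernel)
        for _ in range(iterations):
            candidate = propose_patch(best)  # ask the agent for one optimization
            if not run_tests(candidate):     # reject anything that breaks the tests
                continue
            t = benchmark(candidate)
            if t < best_time:                # keep only measured wins
                best, best_time = candidate, t
        return best

    print(optimize("baseline kernel"))
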
whynotmaybe•29m ago
But we might end up with "works on my infrastructure" optimizations that would be hard to reproduce.

Like that research that evolved an FPGA where some unconnected parts were crucial for the expected behaviour.

https://www.eetimes.com/whatever-happened-to-evolvable-hardw...

mholm•23m ago
Adding a few diverse hardware environments for testing would mitigate this. Many companies wouldn't have any issue with infrastructure-specific optimizations either. (Part of) DeepSeek's big advantage over their Chinese competitors was their intelligent use of the hardware, after all.
chanwutk•1h ago
Very interesting read!
melissapan•1h ago
ADRS <> Compiler: what if your “compiler” could think?
dksgmlwo•1h ago
Fascinating. Having worked as a kernel engineer before, I know how impactful it is to reduce the initial exploration overhead. It can save a huge amount of the grunt work engineers typically have to do.
maltese669•1h ago
ngl letting AI fiddle with the kernel sounds scary but the results are really impressive
yrh•1h ago
Interesting read. I think the more "whitebox" approach with a laid-out menu to choose from makes the resulting kernel more trustworthy, although it does raise the question of whether occasionally going outside the predefined optimization steps might yield insights.
measurablefunc•1h ago
I wonder if this type of work can be applied towards translating kernels between GPU vendors, e.g. CUDA → AMD. Does anyone know if that's possible or whether that kind of problem is AGI-complete?
UncleOxidant•51m ago
It seems like it could be possible now with a bit of work. I don't think that it would require AGI. Didn't AMD have (or fund) something like this and then decide not to pursue it further recently? It was called HIP. There's also ZLUDA https://www.blopig.com/blog/2024/03/an-open-source-cuda-for-...
measurablefunc•40m ago
Very interesting.
jryio•27m ago
There's a higher level of abstraction

https://www.modular.com/mojo

measurablefunc•15m ago
So if CUDA could be ported to Mojo w/ AI, then it would basically be available for any GPU/accelerator vendor. Seems like the right kind of approach towards making CUDA a non-issue.
jryio•1h ago
paper: https://arxiv.org/abs/2505.18574
syngrog66•59m ago
AI has told me that Biden was preparing for his upcoming debate with Trump. It told me that in May 2025.

AI has told me it's not raining in my city and that in fact there was a 0% chance of it that day. As I was looking out my open front door watching a heavy downpour.

DroneBetter•29m ago
that is an indictment of the implementations, not the fundamental limits of the architecture; most commercial LLMs now have web-searching available by default and can do both of those things, but couldn't when they were confined to the user's prompt and their training data (which was often not quite contemporary, until recently)
comrade1234•58m ago
This is completely believable and you should invest in this technology.
DroneBetter•41m ago
I can't tell whether you're trying to convince humans, parody someone who might be, or give superficial sentiment for automated traders' webscrapers to be influenced by
oceansky•21m ago
I think he's just being extremely ironic, meaning the exact opposite of what he actually says.
cornonthecobra•15m ago
or they left the /s off and it's a remark about how the fine article sounds more like hype-machine emesis than legitimate, substantive research
UncleOxidant•52m ago
Was in a startup where we were trying to do this (our tagline was "using AI to make AI run faster and more efficiently"). But we ran out of funding at the end of '22 :(

We were just a little early, I think.

accheng•41m ago
Interesting, did you have any learnings that would apply to this problem now?
karek•45m ago
usually i scroll past these 'LLM optimizes code' posts bc 99% of them are just finding basic peephole optimizations that -O3 would've caught anyway. but looking at the conv1d example in the blog, this is actually doing real architectural changes.

the 'dropout' on the optimization menu is a pretty neat hack. kinda reminds me how i work when im stuck... 'ok what if i dont unroll this loop, what else can i do?'. forces the search out of local minima. nice to see an AI tool designed around verification (the simulator loop) rather than just hoping the llm guesses right on the first shot.
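
im guessing the mechanic looks something like this (menu entries invented, not from the blog):

    import random

    # rough guess at the menu 'dropout': hide a random subset of
    # optimizations each round so the search can't keep reaching for the
    # same local-minimum move.
    MENU = ["unroll", "tile", "vectorize", "swap_engines", "reorder_dma"]

    def sample_menu(drop_rate=0.4):
        kept = [opt for opt in MENU if random.random() > drop_rate]
        return kept or random.sample(MENU, 1)  # never hand back an empty menu

    print(sample_menu())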

bvcasd•42m ago
having an agent that looks at the error + the isa spec and tries a fix automatically is worth its weight in gold. turns a frustrating 'read the docs for 2 hours' session into a 5 min background task. that's the kind of QoL stuff that actually gets devs to adopt this. how close is this to being used in production?
dataeaa•36m ago
Crazy that it beat the hand-tuned Amazon kernels. Really shows how early we still are with these software stacks.

What are the risks of using these kinds of tools though? Did you get any tricky/silent bugs you had to manually fix?

bgwalter•32m ago
So, Trainium is an architecture that requires brute force to write software for.

Maybe if we invest $100 trillion in data centers, we can rewrite the Linux Kernel in Malbolge.

mavt6•31m ago
Love the concept of using AI to make the hardware run AI faster. Feels like we're finally closing the loop on this stuff!
jryio•30m ago
Chris Lattner of Apple's Swift and Tesla fame is running a company entirely predicated on this, but at the deterministic language-design level rather than the inference level.

https://www.modular.com/mojo

If a beam search with an iterative plan-and-execute phase is more effective than better tooling in a deterministic programming language, then this will clearly take the lead.

accheng•24m ago
Thanks for the link! I'm not familiar with the company, but it reminds me of the whole formal-methods debate in distributed systems. Sure, writing TLA+ specs is the 'correct' deterministic way to build a Raft implementation, but in reality everyone just writes messy Go/Java and patches bugs as they pop up because it's faster.
maven5t•19m ago
tried using NKI a few months ago and the docs were rough. having the LLM just figure it out from the ISA spec is honestly genius
dfdsfds•17m ago
Very impressive results! Will be curious to see how correctness is guaranteed and what kind of failures are normal from the LLM-generated code

Nano Banana Pro

https://blog.google/technology/ai/nano-banana-pro/
782•meetpateltech•9h ago•494 comments

Android and iPhone users can now share files, starting with the Pixel 10

https://blog.google/products/android/quick-share-airdrop/
377•abraham•7h ago•258 comments

New Glenn Update

https://www.blueorigin.com/news/new-glenn-upgraded-engines-subcooled-components-drive-enhanced-pe...
89•rbanffy•3h ago•37 comments

New OS aims to provide (some) compatibility with macOS

https://github.com/ravynsoft/ravynos
101•kasajian•4h ago•38 comments

Over-Regulation Is Doubling the Cost by Peter Reinhardt

https://rein.pk/over-regulation-is-doubling-the-cost
23•bilsbie•1h ago•16 comments

FEX-emu – run x86 applications on ARM64 Linux devices

https://fex-emu.com/
22•open-paren•1w ago•3 comments

Data-at-Rest Encryption in DuckDB

https://duckdb.org/2025/11/19/encryption-in-duckdb
107•chmaynard•5h ago•15 comments

GitHut – Programming Languages and GitHub (2014)

https://githut.info/
43•tonyhb•3h ago•16 comments

NTSB Preliminary Report – UPS Boeing MD-11F Crash [pdf]

https://www.ntsb.gov/Documents/Prelimiary%20Report%20DCA26MA024.pdf
119•gregsadetsky•6h ago•143 comments

The Lions Operating System

https://lionsos.org
106•plunderer•6h ago•20 comments

Readonly Characters Are a Big Deal

https://matklad.github.io/2025/11/10/readonly-characters.html
25•vinhnx•1w ago•2 comments

Okta's NextJS-0auth troubles

https://joshua.hu/ai-slop-okta-nextjs-0auth-security-vulnerability
205•ramimac•2d ago•73 comments

Microsoft makes Zork open-source

https://opensource.microsoft.com/blog/2025/11/20/preserving-code-that-shaped-generations-zork-i-i...
388•tabletcorry•6h ago•168 comments

Launch HN: Poly (YC S22) – Cursor for Files

39•aabhay•6h ago•41 comments

Free interactive tool that shows you how PCIe lanes work on motherboards

https://mobomaps.com
132•tagyro•1d ago•23 comments

Show HN: F32 – An Extremely Small ESP32 Board

https://github.com/PegorK/f32
172•pegor•1d ago•26 comments

Adversarial poetry as a universal single-turn jailbreak mechanism in LLMs

https://arxiv.org/abs/2511.15304
229•capgre•12h ago•120 comments

Prozac 'no better than placebo' for treating children with depression, experts say

https://www.theguardian.com/society/2025/nov/20/prozac-no-better-than-placebo-for-treating-childr...
11•pseudolus•33m ago•1 comments

Run Docker containers natively in Proxmox 9.1 (OCI images)

https://raymii.org/s/tutorials/Finally_run_Docker_containers_natively_in_Proxmox_9.1.html
90•jandeboevrie•3h ago•27 comments

Show HN: My hobby OS that runs Minecraft

https://astral-os.org/posts/2025/10/31/astral-minecraft.html
114•avaliosdev•3d ago•15 comments

OOP is shifting between domains, not disappearing

https://blog.jsbarretto.com/post/actors
49•ibobev•4h ago•95 comments

Kagi Assistants

https://blog.kagi.com/kagi-assistants
119•ingve•4h ago•60 comments

Interactive World History Atlas Since 3000 BC

http://geacron.com/home-en/
282•not_knuth•14h ago•126 comments

Freer Monads, More Extensible Effects (2015) [pdf]

https://okmij.org/ftp/Haskell/extensible/more.pdf
74•todsacerdoti•9h ago•16 comments

What's in a Passenger Name Record (PNR)? (2013)

https://hasbrouck.org/articles/PNR.html
53•rzk•4d ago•13 comments

Mozilla says it's finally done with Onerep

https://krebsonsecurity.com/2025/11/mozilla-says-its-finally-done-with-two-faced-onerep/
96•todsacerdoti•5h ago•57 comments

Red Alert 2 in web browser

https://chronodivide.com/
386•nsoonhui•12h ago•126 comments

Two recently found works of J.S. Bach presented in Leipzig [video]

https://www.youtube.com/watch?v=4hXzUGYIL9M#t=15m19s
88•Archelaos•3d ago•69 comments

France is taking state actions against GrapheneOS?

https://grapheneos.social/@GrapheneOS/115584160910016309
124•gabrielgio•1h ago•55 comments

Go Cryptography State of the Union

https://words.filippo.io/2025-state/
119•ingve•7h ago•46 comments