Show HN: Luminal – Open-source, search-based GPU compiler

https://github.com/luminal-ai/luminal
64•jafioti•5h ago
Hi HN, I’m Joe. My friends Matthew, Jake and I are building Luminal (https://luminalai.com/), a GPU compiler for automatically generating fast GPU kernels for AI models. It uses search-based compilation to achieve high performance.

We take high level model code, like you'd have in PyTorch, and generate very fast GPU code. We do that without using LLMs or AI - rather, we pose it as a search problem. Our compiler builds a search space, generates millions of possible kernels, and then searches through it to minimize runtime.

You can try out a demo in `demos/matmul` on Mac to see how Luminal takes a naive operation, represented in our IR of 12 simple operations, and compiles it to an optimized, tensor-core-enabled Metal kernel. Here’s a video showing how: https://youtu.be/P2oNR8zxSAA
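For a sense of how small an IR like that can be, here is an illustrative Rust sketch of a 12-op primitive set in this spirit (the op names are an assumption based on the project's docs, not a verbatim copy of Luminal's definitions):

```rust
/// Illustrative 12-op primitive IR in the spirit of Luminal's
/// (op names are an assumption, not copied from the source).
enum PrimOp {
    // Unary
    Log2,
    Exp2,
    Sin,
    Sqrt,
    Recip,
    // Binary
    Add,
    Mul,
    Mod,
    LessThan,
    // Reductions over a single axis
    SumReduce(usize),
    MaxReduce(usize),
    // Materialize a strided/broadcast view into contiguous memory
    Contiguous,
}
```

Everything higher-level (matmul, softmax, attention) is then expressed as a graph of these primitives.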

Our approach differs significantly from traditional ML libraries in that we ahead-of-time compile everything, generate a large search space of logically-equivalent kernels, and search through it to find the fastest kernels. This allows us to leverage the Bitter Lesson to discover complex optimizations like Flash Attention entirely automatically without needing manual heuristics. The best rule is no rule, the best heuristic is no heuristic, just search everything.
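The selection step is conceptually just "measure everything, keep the fastest". A deliberately naive Rust sketch of that idea (the function is hypothetical; the real search uses e-graphs and much smarter pruning than brute force):

```rust
use std::time::Instant;

/// Naive outline of search-based kernel selection: every candidate is
/// logically equivalent, so time each one and keep the fastest.
fn pick_fastest<K>(candidates: &[K], run: impl Fn(&K)) -> Option<usize> {
    let mut best: Option<(usize, u128)> = None;
    for (i, kernel) in candidates.iter().enumerate() {
        let start = Instant::now();
        run(kernel); // compile + launch the candidate on the target device
        let elapsed = start.elapsed().as_nanos();
        if best.map_or(true, |(_, t)| elapsed < t) {
            best = Some((i, elapsed));
        }
    }
    best.map(|(i, _)| i)
}
```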

We’re working on bringing CUDA support up to parity with Metal, adding more flexibility to the search space, adding full-model examples (like Llama), and adding very exotic hardware backends.

We aim to radically simplify the ML ecosystem while improving performance and hardware utilization. Please check out our repo: https://github.com/luminal-ai/luminal and I’d love to hear your thoughts!

Comments

AkashKarnatak•4h ago
Very cool project. Earlier tinygrad used to have ~25 ops but now it has grown to 86, and I believe that's primarily to support hardware features like tensor cores and TMA. I don't think luminal supports tensor cores as of now. How do you think the ops will evolve as the library matures?
jafioti•4h ago
we do support tensor cores, but the ops are only part of the search space, so there's virtually no overhead for them. the frontend and main IR is only 12 ops, and we can add hardware-specific ops into the search space with only a bit of code in the codegen pass to support them.
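As a hypothetical sketch of that layering (names invented for illustration, reusing the PrimOp enum sketched above): the frontend IR stays at 12 ops, and hardware-specific ops only appear as extra candidates in the backend search space:

```rust
/// Hypothetical backend op set: the frontend never grows beyond its 12
/// primops; tensor-core style ops exist only as search-space candidates.
enum MetalSearchOp {
    /// A lowered frontend primop (see the PrimOp sketch above)
    Prim(PrimOp),
    /// Illustrative tensor-core tile op the codegen pass knows how to emit
    SimdgroupMatmul { tile_m: usize, tile_n: usize, tile_k: usize },
}
```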
diggan•4h ago
> Luminal can run Q8 Llama 3 8B on M-series Macbooks at 15-25 tokens per second. The goal is to become the fastest ML framework for any model on any device.

Great that some numbers are provided, but in isolation I'm not sure what they tell us. It would be helpful to also share what tok/s you'd get with llama.cpp or something else on the same hardware, so we can actually understand if it's faster or not :) Including prompt-processing speed would be a bonus!

jafioti•3h ago
a lot of the search is still being optimized, so we don't match super hand-optimized kernels like llama.cpp has, and we def don't match their tps yet. i want to make a perf tracking page to see improvements over time and prevent regressions
Alifatisk•4h ago
So wait, am I understanding this correctly?

Instead of applying just predetermined optimization rules or patterns, the compiler formulates the problem as searching through many possible configurations or versions of the code. Each possible version can have different arrangements, tiling sizes, thread block configurations, memory access patterns, and instruction sequences, right?

And from my understanding, the “search space” is just a collection of all potential versions of the code (kernels) that the compiler can generate from the original input. So for example, the space might include

- Different ways to partition workloads among GPU threads and blocks

- Varying memory access strategies (using shared memory, global memory)

- Various instruction-level optimizations or reordering

- Alternative loop unroll factors or vectorization strategies

The compiler then programmatically produces a large number of candidate kernels by combining different optimizations and configurations. Among these millions of candidates, the compiler tries to find the one that performs best.
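As a toy illustration of how quickly those dimensions multiply (all names and values hypothetical):

```rust
/// Toy count of candidate kernels for a single fused op (hypothetical values).
fn count_candidates() -> usize {
    let tile_sizes = [8, 16, 32, 64];
    let thread_blocks = [64, 128, 256, 512, 1024];
    let unroll_factors = [1, 2, 4, 8];
    let memory_strategies = 2; // stage through shared memory vs. read global
    // 4 * 5 * 4 * 2 = 160 variants for one op, before any algebraic
    // rewrites; composing a few ops multiplies these counts together.
    tile_sizes.len() * thread_blocks.len() * unroll_factors.len() * memory_strategies
}
```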

In that case, can the compiler print out which GPU configuration works best for that computer? And will that configuration be applicable to all computers with the same setup?

This is such an interesting technique.

jakestevens2•3h ago
Your description is exactly right. We create a search space of all possible kernels and find the best ones based on runtime. The best heuristic is no heuristic.

This obviously creates a combinatorial problem that we mitigate with smarter search.

The kernels are run on the computer the compiler is running on. Since runtime is our gold standard, it will search for the best configuration for your hardware target. As long as the setup is mostly the same, the optimizations should carry over, yes.

UncleOxidant•3h ago
How long does this typically take? It sounds time-consuming. Also, it seems like this could be similar to doing a GA (genetic algorithm)?
jakestevens2•3h ago
That depends on the model architecture and how it was written, since that informs the size of the search space.

The typical range is 10 mins to 10 hours. It won't be fast but you only have to do it once and then those optimizations are set for every forward pass.

jakestevens2•3h ago
You can also set a time budget for how long you'd like the search to run, to avoid wasting time on diminishing returns.
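A minimal sketch of what a budgeted search loop can look like (an illustration, not Luminal's actual API):

```rust
use std::time::{Duration, Instant};

/// Time-budgeted search sketch: keep timing candidates until the budget
/// runs out, tracking the best measured runtime seen so far.
fn search_with_budget(
    budget: Duration,
    mut next_candidate_runtime_ns: impl FnMut() -> Option<u128>,
) -> Option<u128> {
    let deadline = Instant::now() + budget;
    let mut best: Option<u128> = None;
    while Instant::now() < deadline {
        // each call benchmarks one more candidate; None means space exhausted
        let Some(runtime) = next_candidate_runtime_ns() else { break };
        if best.map_or(true, |b| runtime < b) {
            best = Some(runtime);
        }
    }
    best
}
```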
aleinin•4h ago
Cool project! How do you think about targeting hardware-specific ISAs directly? There’s an interesting paper from Citadel (https://arxiv.org/pdf/1804.06826) that highlights inefficiencies in nvcc for the Volta architecture. Do you see Luminal’s search-based paradigm eventually extending beyond outperforming handwritten kernels, towards actually competing with NVIDIA’s compiler optimizations at the PTX level?
jafioti•3h ago
yep! currently we're emitting cuda / metal, but once the search is better, i want to directly emit ptx / low-level asm on other hardware.
dvdplm•3h ago
This is very cool. Do you have any advice on papers to read to understand the details of search based compilation a bit more?
jafioti•3h ago
a lot of the ideas luminal is built on are here: https://arxiv.org/abs/2304.04332
UncleOxidant•3h ago
When you say (in the video) that you can target more exotic hardware, what about things like FPGA accelerators (maybe taking advantage of TVM's FPGA backend)?

Also, what about CUDA alternatives like ROCm?

matthewjgunton•2h ago
Yup. We are totally hardware agnostic.
matthewjgunton•2h ago
i should add this applies to the language too. we currently support Metal (Apple's language) and CUDA, with extensions planned for others.
efnx•3h ago
Cool! How is this project different from the tuning process in TVM?
jafioti•2h ago
basically autotuning on steroids. instead of searching single dimensions of optimization (tile sizing, etc.) we search through full algebraic rewrites (like rewriting softmax to online softmax) and various loop / tiling structures in the same unified search space.
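For readers unfamiliar with that rewrite: naive softmax needs separate passes to find the max and build the normalizer, while online softmax keeps a running max and rescales the running sum in a single pass. A sketch of this standard technique in plain Rust (not generated kernel code):

```rust
/// Naive softmax: one pass for the max, then passes for exp/sum/normalize.
fn softmax_naive(x: &[f32]) -> Vec<f32> {
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let sum: f32 = x.iter().map(|v| (v - max).exp()).sum();
    x.iter().map(|v| (v - max).exp() / sum).collect()
}

/// Online softmax: a single pass keeps a running max and rescales the
/// running sum whenever a new max appears (the rewrite behind Flash Attention).
fn softmax_online(x: &[f32]) -> Vec<f32> {
    let (mut max, mut sum) = (f32::NEG_INFINITY, 0.0f32);
    for &v in x {
        let new_max = max.max(v);
        sum = sum * (max - new_max).exp() + (v - new_max).exp();
        max = new_max;
    }
    x.iter().map(|v| (v - max).exp() / sum).collect()
}
```

The two functions are algebraically equivalent, but the single-pass form fuses far better into attention kernels; surfacing that equivalence automatically is exactly the kind of rewrite a runtime-driven search can find without a hand-written rule.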
sakras•2h ago
I see you guys are using Egg/Egglog! I've been mildly interested in egraphs for quite a while, glad to see they're gaining traction!
PoignardAzur•23m ago
Right, my first thought when reading the blurb was "kinda sounds like e-graphs?"
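For the curious, the egg crate demos the core idea in a few lines: rewrites populate a single e-graph that stores all equivalent forms compactly, and an extractor then picks the best representative. A minimal sketch using egg's documented API, with AstSize as a stand-in cost function where a kernel search would use measured runtime:

```rust
use egg::{rewrite as rw, *};

fn main() {
    // Algebraic rewrites; each union is shared across the whole e-graph.
    let rules: &[Rewrite<SymbolLang, ()>] = &[
        rw!("commute-mul"; "(* ?a ?b)" => "(* ?b ?a)"),
        rw!("mul-one";     "(* ?a 1)" => "?a"),
        rw!("add-zero";    "(+ ?a 0)" => "?a"),
    ];
    let start: RecExpr<SymbolLang> = "(+ (* x 1) 0)".parse().unwrap();
    let runner = Runner::default().with_expr(&start).run(rules);
    // Extract the cheapest equivalent expression (here: smallest AST).
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("best: {best} (cost {cost})"); // prints: best: x (cost 1)
}
```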
helltone•26m ago
I have a background in program analysis, but I'm less familiar with the kind of kernels you are optimising.

- Can you give some more insight on why 12 ops suffice for representing your input program?

- With such a small number of ops, isn't your search space full of repeat patterns? I understand the will to have no predefined heuristics, but it seems that learning some heuristics/patterns would massively help reduce the space.

jafioti•23m ago
we're just optimizing linear algebra, which is mostly made up of patterns of simple ops. for instance, matmul is just broadcasted multiply -> sum reduce.

the search does common subexpression elimination by default. if two patterns are unioned in the search space, that union applies to every occurrence of the pattern at the same time, so the e-graph representation helps keep the search space smaller.
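Concretely, an (m, k) x (k, n) matmul can be phrased as broadcasting both operands to (m, k, n), multiplying elementwise, and sum-reducing over the shared k axis. A plain-Rust sketch of that decomposition:

```rust
/// Matmul of an (m, k) and a (k, n) matrix, written as the two primops a
/// search-based compiler would see: broadcasted Mul, then SumReduce over k.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            // broadcasted multiply a[i, p] * b[p, j], then sum reduce over p
            out[i * n + j] = (0..k).map(|p| a[i * k + p] * b[p * n + j]).sum();
        }
    }
    out
}
```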

helltone•12m ago
Right, I think I see it.

This is insanely cool.

But then there are performance tradeoffs in reusing intermediates vs recomputing them, which I think you can't represent.

Some of these may affect numerical stability, btw. See e.g. https://herbie.uwplse.org/

There is so much potential in this project.

fancyfredbot•11m ago
This is a good idea. Do you use a cost model for the search or are you actually executing kernels? What kind of heuristics do you use to avoid the search space becoming intractable?

Show HN: I was curious about spherical helix, ended up making this visualization

https://visualrambling.space/moving-objects-in-3d/
510•damarberlari•7h ago•97 comments

Zedless: Zed fork focused on privacy and being local-first

https://github.com/zedless-editor/zed
265•homebrewer•2h ago•116 comments

Introduction to AT Protocol

https://mackuba.eu/2025/08/20/introduction-to-atproto/
55•psionides•2h ago•25 comments

Show HN: PlutoPrint – Generate Beautiful PDFs and PNGs from HTML with Python

https://github.com/plutoprint/plutoprint
14•sammycage•51m ago•2 comments

Gemma 3 270M re-implemented in pure PyTorch for local tinkering

https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/12_gemma3
250•ModelForge•7h ago•41 comments

Visualizing GPT-OSS-20B embeddings

https://melonmars.github.io/LatentExplorer/embedding_viewer.html
34•melonmars•3d ago•17 comments

Coris (YC S22) Is Hiring

https://www.ycombinator.com/companies/coris/jobs/rqO40yy-ai-engineer
1•smaddali•29m ago

An Update on Pytype

https://github.com/google/pytype
105•mxmlnkn•4h ago•37 comments

Pixel 10 Phones

https://blog.google/products/pixel/google-pixel-10-pro-xl/
258•gotmedium•4h ago•459 comments

Launch HN: Channel3 (YC S25) – A database of every product on the internet

64•glawrence13•5h ago•39 comments

Lean proof of Fermat's Last Theorem [pdf]

https://imperialcollegelondon.github.io/FLT/blueprint.pdf
36•ljlolel•3h ago•23 comments

OPA maintainers and Styra employees hired by Apple

https://blog.openpolicyagent.org/note-from-teemu-tim-and-torin-to-the-open-policy-agent-community-2dbbfe494371
95•crcsmnky•5h ago•38 comments

Sequoia backs Zed

https://zed.dev/blog/sequoia-backs-zed
221•vquemener•9h ago•149 comments

Gouach wants you to insert and pluck the cells from its Infinite e-bike battery

https://arstechnica.com/gadgets/2025/05/gouach-wants-you-to-insert-and-pluck-the-cells-from-its-infinite-e-bike-battery/
12•pabs3•2d ago•2 comments

Learning about GPUs through measuring memory bandwidth

https://www.evolvebenchmark.com/blog-posts/learning-about-gpus-through-measuring-memory-bandwidth
22•JasperBekkers•6h ago•3 comments

Closer to the Metal: Leaving Playwright for CDP

https://browser-use.com/posts/playwright-to-cdp
111•gregpr07•5h ago•84 comments

Linear scan register allocation on SSA

https://bernsteinbear.com/blog/linear-scan/
12•surprisetalk•3d ago•1 comment

Why are anime catgirls blocking my access to the Linux kernel?

https://lock.cmpxchg8b.com/anubis.html
106•taviso•6h ago•140 comments

Tidewave Web: in-browser coding agent for Rails and Phoenix

https://tidewave.ai/blog/tidewave-web-phoenix-rails
246•kieloo•11h ago•47 comments

AWS in 2025: Stuff you think you know that's now wrong

https://www.lastweekinaws.com/blog/aws-in-2025-the-stuff-you-think-you-know-thats-now-wrong/
199•keithly•5h ago•113 comments

Show HN: Anchor Relay – A faster, easier way to get Let's Encrypt certificates

https://anchor.dev/relay
47•geemus•5h ago•43 comments

Improvements to OCaml code editing: the basics of a refactor engine

https://tarides.com/blog/2025-08-20-internship-report-refactoring-tools-coming-to-merlin/
87•nukifw•7h ago•16 comments

Show HN: Bizcardz.ai – Custom metal business cards

https://github.com/rhodey/bizcardz.ai
16•rhodey•3h ago•17 comments

How to Think About GPUs

https://jax-ml.github.io/scaling-book/gpus/
338•alphabetting•2d ago•104 comments

Mirrorshades: The Cyberpunk Anthology (1986)

https://www.rudyrucker.com/mirrorshades/HTML/
121•keepamovin•13h ago•67 comments

Show HN: Nestable.dev – local whiteboard app with nestable canvases, deep links

https://nestable.dev/about
18•anorak27•3h ago•7 comments

Show HN: What country you would hit if you went straight where you're pointing

https://apps.apple.com/us/app/leascope/id6608979884
48•brgross•6h ago•30 comments

The Rise and Fall of Music Ringtones: A Statistical Analysis

https://www.statsignificant.com/p/the-rise-and-fall-of-music-ringtones
31•gmays•2d ago•39 comments

Best Options for Using AI in Chip Design

https://semiengineering.com/best-options-for-using-ai-in-chip-design/
29•rbanffy•5h ago•7 comments