The core motivation is that modern AI stacks increasingly depend on hand-optimized kernels (GEMM, attention, reductions, fused ops), but writing and tuning them for each hardware target (NVIDIA GPUs, AMD GPUs, custom accelerators like MTIA) does not scale.
KernelEvolve treats kernel programming as a search + evolution problem:
• An LLM generates candidate kernels (e.g., Triton-like code)
• Kernels are compiled, benchmarked, and validated on real hardware
• Performance feedback is used to evolve better variants over many iterations
• The system scales evaluation across large fleets and multiple accelerator types
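To make the loop concrete, here is a minimal sketch of what such a generate → evaluate → select cycle might look like. All of the helpers (`llm.generate`, `llm.mutate`, `evaluate_on_hardware`) are illustrative placeholders of our own, not the paper's actual API:

    # Illustrative sketch of an LLM-driven kernel evolution loop.
    # The helper objects are hypothetical stand-ins for the
    # generate -> compile/benchmark/validate -> select cycle described above.
    import random

    def evolve_kernel(task_spec, llm, evaluate_on_hardware,
                      population_size=8, generations=20):
        # Seed the population with independent LLM-generated candidates.
        population = [llm.generate(task_spec) for _ in range(population_size)]

        best_src, best_latency = None, float("inf")
        for _ in range(generations):
            scored = []
            for src in population:
                # Compile, validate against a reference, and time on real hardware.
                ok, latency_ms = evaluate_on_hardware(src, task_spec)
                if ok:
                    scored.append((latency_ms, src))
                    if latency_ms < best_latency:
                        best_src, best_latency = src, latency_ms

            if not scored:
                # Every candidate failed to compile or validate; restart fresh.
                population = [llm.generate(task_spec) for _ in range(population_size)]
                continue

            # Keep the fastest candidates and ask the LLM to mutate them,
            # feeding the measured latency back into the prompt.
            scored.sort()
            parents = [src for _, src in scored[: population_size // 2]]
            population = [
                llm.mutate(random.choice(parents),
                           feedback=f"best latency so far: {scored[0][0]:.3f} ms")
                for _ in range(population_size)
            ]

        return best_src, best_latency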
Unlike one-shot code generation, KernelEvolve continuously improves kernels using closed-loop, hardware-in-the-loop feedback, and can discover non-obvious optimizations that rival or exceed expert-written code.
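For context, the hardware-in-the-loop step for a single candidate typically amounts to: compile the kernel, check its output against a reference implementation, and time it on the device. The sketch below uses real Triton/PyTorch APIs (`triton.testing.do_bench`, `torch.allclose`) but is our own illustration, not code from the paper; `candidate_add` stands in for an LLM-generated kernel.

    # Validating and benchmarking one candidate Triton kernel against a
    # PyTorch reference; the measured latency becomes the fitness signal.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def candidate_add(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x + y, mask=mask)

    def evaluate(n=1 << 20, block=1024):
        x = torch.randn(n, device="cuda")
        y = torch.randn(n, device="cuda")
        out = torch.empty_like(x)
        grid = (triton.cdiv(n, block),)

        def run():
            candidate_add[grid](x, y, out, n, BLOCK=block)

        run()
        # Correctness gate: reject candidates that don't match the reference.
        correct = torch.allclose(out, x + y, rtol=1e-5, atol=1e-5)
        # Fitness signal: runtime in milliseconds measured on the actual device.
        latency_ms = triton.testing.do_bench(run)
        return correct, latency_ms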
In the paper we describe:
• The agent architecture and search space design
• How we scale kernel evaluation efficiently across heterogeneous accelerators
• Case studies showing performance gains over hand-tuned baselines
• Practical lessons from deploying this system in production ML workloads
Paper (arXiv): https://arxiv.org/abs/2512.23236 (66 pages)
LinkedIn: https://www.linkedin.com/posts/gangliao_excited-to-share-our-recent-work-on-kernelevolve-activity-7411781675740897280-AQth?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAzsrfsBRed-BvPAGqq9FgvVZ-v6F-sG4SM
We’d love feedback from folks working on compilers, kernels, ML systems, or agentic approaches to code generation.