The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend 3 days debugging a race condition just to find out you got a 2% speedup.
If this tool can actually handle the 'grunt work' of generating the tiling logic and memory moves from a high-level plan, that's a game changer. I don't even care about the 17x number as much as I care about the '0 to 1' speed: getting any performant kernel running on new hardware usually takes weeks. If this cuts that down to a few hours of LLM churning, it's huge for the industry.
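For anyone who hasn't done this by hand: the race-prone pattern is usually double-buffered tiling, where you kick off the async DMA for tile i+1 while the compute engine is still working on tile i. A toy sketch in plain Python of where it goes wrong if you drop a wait (dma_start/dma_wait are made-up stand-ins here, not any real accelerator API):

    import numpy as np

    TILE = 128

    def dma_start(dst, src):
        # Stand-in for an async DMA call; here it just copies eagerly.
        dst[...] = src
        return object()  # fake completion handle

    def dma_wait(handle):
        # Stand-in: a real kernel must block here before reading the buffer.
        pass

    def tiled_sum(x):
        # Two scratchpad buffers so compute on tile i overlaps the load of tile i+1.
        scratch = [np.empty(TILE, x.dtype), np.empty(TILE, x.dtype)]
        n_tiles = len(x) // TILE
        acc = 0.0
        handle = dma_start(scratch[0], x[:TILE])  # prefetch tile 0
        for i in range(n_tiles):
            dma_wait(handle)  # skip this and you read a half-filled buffer -> race
            cur = scratch[i % 2]
            if i + 1 < n_tiles:
                handle = dma_start(scratch[(i + 1) % 2], x[(i + 1) * TILE:(i + 2) * TILE])
            acc += float(cur.sum())  # "compute" on the current tile
        return acc

    print(tiled_sum(np.arange(1024, dtype=np.float32)))

The bookkeeping is trivial in a toy like this; it stops being trivial when you have a dozen buffers, multiple DMA queues, and a profiler telling you the engine stalled somewhere.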
qat321•1h ago
charleshong•59m ago
See our paper: https://arxiv.org/abs/2505.18574
And our prior blog posts: https://charleshong3.github.io/blog/
gfhsad•54m ago
charleshong•52m ago
Also, the 17x came from a pretty obscure fusion optimization that isn't called out anywhere in the documentation (we had to run the profiler to see what was actually going on). Wouldn't be surprised if whoever at AWS wrote the kernel didn't know about it.
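For readers outside the kernel world: "fusion" here just means combining two logically separate ops into one pass over the data so the intermediate never round-trips through slower memory. A generic NumPy toy of the idea, not the actual optimization the profiler surfaced:

    import numpy as np

    x = np.random.rand(1024, 1024).astype(np.float32)
    w = np.random.rand(1024, 1024).astype(np.float32)

    def unfused(x, w):
        # Matmul materializes the full intermediate, then a second pass
        # re-reads all of it to apply bias + ReLU.
        y = x @ w
        return np.maximum(y + 1.0, 0.0)

    def fused(x, w, tile=256):
        # "Fused": the epilogue is applied tile by tile while each block is
        # still hot, so the full intermediate never exists as a whole array.
        out = np.empty((x.shape[0], w.shape[1]), dtype=np.float32)
        for i in range(0, x.shape[0], tile):
            blk = x[i:i + tile] @ w
            out[i:i + tile] = np.maximum(blk + 1.0, 0.0)
        return out

    assert np.allclose(unfused(x, w), fused(x, w), rtol=1e-3, atol=1e-3)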
snklt•43m ago