(I work at doubleAI.) The benchmark numbers (90% wins, a 2.24x average speedup, and ~15x on key attention workloads) are one part of the story. The other part: AI-generated kernels can pass simple verifiers with room to spare and still break in production. Our writeup walks through one such case, a fused embedding-gradient + RMSNorm backward kernel. We took the fastest (highest-scoring) submission for this task among those from other competitors, dropped it into our training code, and... everything broke (the loss diverged). The bug turned out to be subtle and hard to pin down: merely changing the input distribution or the optimiser was enough to hide it. More details are in our blog post.
We find these kinds of bugs very interesting: defending against humans is not the same problem as defending against agents. As models become more powerful, they also get better at subverting verifiers by taking the 'path of least resistance'. Stronger verifiers are therefore becoming increasingly necessary, which is something we at doubleAI specialise in. Our previous blog posts discuss this as well.
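To make the input-distribution point concrete, here is a minimal NumPy sketch of a verifier that checks a candidate against a float64 reference under several input scales. It is an illustration only: it covers just the RMSNorm backward part (not the fused embedding gradient), and `rmsnorm_backward_no_eps` is a hypothetical bug invented for the example, not the bug described in the writeup. All function names, the tolerance, and the chosen scales are assumptions.

```python
import numpy as np

EPS = 1e-6  # assumed epsilon; real models may use a different value


def rmsnorm_backward_ref(x, g, dy, eps=EPS):
    """Gradient of y = x * g / sqrt(mean(x^2) + eps) w.r.t. x, computed in float64."""
    x, g, dy = (a.astype(np.float64) for a in (x, g, dy))
    r = 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    gy = dy * g
    return r * gy - x * r**3 * np.mean(gy * x, axis=-1, keepdims=True)


def rmsnorm_backward_no_eps(x, g, dy):
    """Hypothetical buggy candidate: silently drops eps from the denominator."""
    return rmsnorm_backward_ref(x, g, dy, eps=0.0)


def verify(candidate, seed=0, rows=64, dim=256, rtol=1e-3):
    """Return {distribution name: passed?} for a candidate backward function."""
    rng = np.random.default_rng(seed)
    scales = {"unit": 1.0, "large": 1e3, "tiny": 1e-3}
    results = {}
    for name, scale in scales.items():
        x = rng.standard_normal((rows, dim)) * scale
        g = rng.standard_normal(dim)
        dy = rng.standard_normal((rows, dim))
        want = rmsnorm_backward_ref(x, g, dy)
        got = candidate(x, g, dy)
        # Max absolute error, normalised by the largest reference magnitude.
        err = np.max(np.abs(got - want)) / (np.max(np.abs(want)) + 1e-30)
        results[name] = bool(err <= rtol)
    return results


if __name__ == "__main__":
    print("reference:", verify(rmsnorm_backward_ref))
    print("no-eps   :", verify(rmsnorm_backward_no_eps))
```

In this sketch the eps-free candidate matches the reference on unit-scale and large inputs, where eps is negligible, and only diverges when the activations are small enough for eps to matter. A verifier that samples a single distribution would wave it through.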
laginimaineb•22m ago
Happy to answer technical questions.