Am I reading this wrong, or does this only support FP16 inputs, and compare its performance against an FP32 solver?
bgwalter•38m ago
> To validate kernel correctness, we need to compare its output to a reference correct kernel with the same inputs.
No, you need a numerical proof, which you don't have.
krapht•26m ago
This is a standard which few kernels will ever meet. I'd say requiring a numerical proof is the same as requiring no proof at all - because it won't ever happen unless you're validating silicon or something equally expensive.
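For what it's worth, the reference-comparison approach quoted above usually looks something like the sketch below. This is only an illustration, not the paper's harness: `to_fp16`, `matmul`, `all_close`, the 4x4 size, and the tolerances are all made up for the example, and the "candidate" is just a matmul that rounds its accumulator to FP16 after every step, checked against a float64 reference on the same FP16 inputs.

```python
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def matmul(a, b, round_fn=lambda v: v):
    """Naive matmul; round_fn is applied to the accumulator after each step."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc = round_fn(acc + a[i][p] * b[p][j])
            out[i][j] = acc
    return out


def all_close(candidate, reference, rtol, atol):
    """Elementwise check: |c - r| <= atol + rtol * |r| for every entry."""
    return all(
        abs(c - r) <= atol + rtol * abs(r)
        for c_row, r_row in zip(candidate, reference)
        for c, r in zip(c_row, r_row)
    )


def random_fp16_matrix(rng, rows, cols):
    """Matrix of values in [-1, 1] that are exactly representable in FP16."""
    return [[to_fp16(rng.uniform(-1.0, 1.0)) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
a = random_fp16_matrix(rng, 4, 4)
b = random_fp16_matrix(rng, 4, 4)

reference = matmul(a, b)                    # float64 accumulation
candidate = matmul(a, b, round_fn=to_fp16)  # simulated FP16 accumulation

# Tolerances are assumptions chosen for this toy size, not a general rule.
ok = all_close(candidate, reference, rtol=1e-2, atol=1e-2)
```

A check like this only shows agreement on the inputs you sampled, which is the gap the parent comment is pointing at. It is still what most kernel test suites do in practice.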
j2kun•25m ago
They claim the algorithm "discovered" the new techniques, but the methods described in section 5 do not seem all that novel to me. It smells like it could be "laundering" the literature [1] and reshuffling existing techniques. This is not inherently a bad thing, but I would hope that if it is borrowing existing techniques, the appropriate citation would eventually make it into this paper.
In the future, we will all be Jürgen Schmidhuber. :-)
alyxya•6m ago
There generally aren't new techniques when optimizing something ubiquitous. Instead, there are a lot of ways to apply existing techniques to create new and better results. Most ideas are built on top of the same foundational principles.
alyxya•9m ago
The chart confused me because I expected to see raw performance numbers for CUDA-L2 next to the others, but instead it shows CUDA-L2's speedup percentage over each of them. In a sense, the bar chart inverts the picture for torch.matmul and cuBLAS: a taller bar means that baseline is slower relative to CUDA-L2. 0% on the bar chart would only mean equal performance.
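To make the arithmetic concrete, here is a small sketch of how a speedup percentage relates to raw timings. It assumes speedup is defined as baseline time divided by candidate time, minus one; the chart may use a different definition, and the function names and timings below are hypothetical.

```python
def speedup_percent(baseline_ms, candidate_ms):
    """Speedup of the candidate over the baseline, as a percentage.

    Assumed definition: (baseline_time / candidate_time - 1) * 100.
    """
    return (baseline_ms / candidate_ms - 1.0) * 100.0


def baseline_relative_throughput(speedup_pct):
    """Baseline throughput as a fraction of the candidate's throughput."""
    return 1.0 / (1.0 + speedup_pct / 100.0)


# Hypothetical timings, purely for illustration.
equal = speedup_percent(1.0, 1.0)    # same time, so no speedup
faster = speedup_percent(1.2, 1.0)   # baseline takes 20% longer
```

Under that definition a 25% bar means the baseline runs at 1 / 1.25 of the candidate's throughput, so the bars tell you how far behind each baseline is rather than how fast anything is in absolute terms.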
stonogo•41m ago