The video also showcases a much more impressive branchless speedup: computing CRC checksums. Doing this naïvely with an if-statement for each bit is ~10x slower than doing it branchless with a single bitwise operation for each bit. The author of the article should consider showcasing this too, since it's a lot more impressive than the measly 1.2x speedup highlighted in the article.
When you get to think about branchless programming, especially for SIMD optimizations in the real world, you always learn a lot and it’s as if you get a +1 level on your algorithmic skills. The hardest part then is make sure the tricks are clearly laidout so that someone else can take it from here next time
sfilmeyer•13m ago