100x would be a 9900% speed boost, while a 100% speed boost would mean it's 2x as fast.
Which one is it?
It’s a bit ironic that for over a decade everybody was on x86 so SIMD optimizations could have a very wide reach in theory, but the extension architectures were pretty terrible (or you couldn’t count on the newer ones being available). And now that you finally can use the new and better x86 SIMD, you can’t depend on x86 ubiquity anymore.
Modern encoders also have better scaling across threads, though not infinite. I was on an embedded project a few years ago where we spent a lot of time trying to get the SoC's video encoder working reliably, until someone ran ffmpeg and we realized we could just use several of the CPU cores for a better result anyway.
The devil is in the details: microbenchmarks typically call the same function a million times in a loop, so everything gets cached, reducing the overhead to sheer CPU cycles.
But that’s not how it’s actually used in the wild. It might be called once in a sea of many many other things.
You can at least go out of your way to create a massive test region of memory to prevent the cache from being so hot, but I doubt they do that.
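To illustrate the idea (a sketch, not ffmpeg's actual checkasm harness; the function and buffer sizes here are made up): the hot variant hammers one small buffer that stays resident in cache, while the cold-ish variant strides through a region much larger than any L3, so successive calls mostly miss.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy kernel standing in for the function under test. */
static uint8_t max_u8(const uint8_t *p, size_t n) {
    uint8_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] > m) m = p[i];
    return m;
}

/* Hot variant: hammer one small buffer so it stays cached. */
uint8_t bench_hot(const uint8_t *buf, size_t n, int iters) {
    uint8_t acc = 0;
    for (int i = 0; i < iters; i++)
        acc |= max_u8(buf, n);
    return acc;
}

/* Cold-ish variant: walk a region much larger than L3 so each
   call sees mostly uncached lines. */
uint8_t bench_cold(const uint8_t *big, size_t total, size_t n, int iters) {
    uint8_t acc = 0;
    size_t off = 0;
    for (int i = 0; i < iters; i++) {
        acc |= max_u8(big + off, n);
        off += n;
        if (off + n > total)
            off = 0;
    }
    return acc;
}
```

Timing each variant separately (e.g. with clock_gettime) would show the gap between "sheer CPU cycles" and memory-bound reality.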
Have you used ISPC, and what are your thoughts on it?
I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand, as regular compilers suck at auto-vectorizing, especially as this has never been the case with GPU kernels.
No professional kernel writer uses auto-vectorization.
> I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand
You feel it's ridiculous because you've been sold a myth/lie (abstraction). In reality the details have always mattered.
Do you have any wisdom you can share about techniques or references you can point to?
thus sayeth the lord.
praise the lord!
I care more about the outcome than the underlying semantics; to me that's kind of a given.
https://thesofproject.github.io/latest/introduction/index.ht...
Clearly I'm wrong on this; I should probably properly learn assembly at some point...
To a first approximation, modern compilers can’t vectorize loops beyond the most trivial (say a dot product), and even that you’ll have to ask for (e.g. gcc -O3, which in other cases is often slower than -O2). So for mathy code like this they can easily be a couple dozen times behind in performance compared to wide vectors (AVX/AVX2 or AVX-512), especially when individual elements are small (like the 8-bit ones here).
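For reference, the "trivial" case looks something like this (a sketch; the name is just for illustration). With gcc -O3, or -O2 plus -ftree-vectorize, this kind of loop typically does get vectorized: independent iterations, a single reduction, and restrict pointers so the compiler can rule out aliasing.

```c
#include <stddef.h>
#include <stdint.h>

/* A loop compilers can auto-vectorize: no cross-iteration
   dependencies besides the reduction, and restrict rules out
   aliasing between the inputs. */
int32_t dot_i16(const int16_t *restrict a,
                const int16_t *restrict b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

Add a branch, a mixed-width store, or a pointer the compiler can't prove non-aliasing, and the vectorizer usually gives up.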
Very tight scalar code, on modern superscalar CPUs... You can outcode a compiler by a meaningful margin, sometimes (my current example is a 40% speedup). But you have to be extremely careful (think dependency chains and execution port loads), and the opportunity does not come often (why are you writing scalar code anyway?..).
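As a sketch of the dependency-chain point (not the 40% case mentioned above, just the general technique): a single accumulator serializes every add on the previous one's latency, while splitting the sum into several independent accumulators lets a superscalar core keep multiple adds in flight.

```c
#include <stddef.h>
#include <stdint.h>

/* One chain: each add must wait for the previous result. */
uint64_t sum_serial(const uint32_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Four independent chains: the core can overlap the adds,
   bounded by add throughput instead of add latency. */
uint64_t sum_unrolled(const uint32_t *p, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i + 0];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; i++) /* tail */
        s0 += p[i];
    return s0 + s1 + s2 + s3;
}
```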
[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346725.h...
[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346726.h...
I was commenting there with some suggested changes, and you can find more performance comparisons [0].
For example, with a small adjustment to the C and compiling it for AVX512:
after (gcc -ftree-vectorize -march=znver4)
detect_range_8_c: 285.6 ( 1.00x)
detect_range_8_avx2: 256.0 ( 1.12x)
detect_range_8_avx512: 107.6 ( 2.65x)
Also, I argued that it may be a little misleading to post a comparison without stating the compiler and flags used for said comparison [1].

P.S. There is related work to enable -ftree-vectorize by default [2]
[0] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346813.h...
[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346794.h...
[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346439.h...
And it's 100x because a) min/max have single instructions in SIMD vs cmp+cmov in scalar and b) it's operating in u8 precision so each AVX512 instruction does 64x min/max. So unlike the unoptimized scalar that has a throughput under 1 byte per cycle, the AVX512 version can saturate L1 and L2 bandwidth. (128B and 64B per cycle on Zen 5.)
But, this kernel is operating on an entire frame; if you have to go to L3 because it's more than a megapixel then the gain should halve (depending on CPU, but assuming Zen 5), and the gain decreases even more if the frame isn't resident in L3.
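A scalar sketch of that kind of u8 range kernel (hypothetical, not ffmpeg's actual detect_range code): each compare-and-select below is what the scalar build spends a cmp+cmov on, and what maps to a single pminub/pmaxub-style instruction once vectorized, 64 bytes at a time with AVX512.

```c
#include <stddef.h>
#include <stdint.h>

/* Find the smallest and largest sample over a plane of u8 pixels.
   Each ?: is a min/max that a vectorizer can turn into a single
   packed-byte min/max instruction. */
void u8_range(const uint8_t *p, size_t n, uint8_t *lo, uint8_t *hi) {
    uint8_t mn = 255, mx = 0;
    for (size_t i = 0; i < n; i++) {
        mn = p[i] < mn ? p[i] : mn;
        mx = p[i] > mx ? p[i] : mx;
    }
    *lo = mn;
    *hi = mx;
}
```

If lo/hi come back as 16/235 (limited range) versus 0/255 (full range), that tells the caller which color range the frame actually uses.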
Random example: https://stackoverflow.com/questions/71343461/how-does-gcc-no...
The code in question was called quadrillions of times, so this actually mattered.
And the function in question is specifically for the color range part.
Are you saying that it's run once during a conversion as part of the process? Or that it's a specific flag that you give, it then runs this function, and returns output on the console?
(Either of those would be a one-time affair, so would likely result in close to zero speed improvement in the real world).
So you wouldn’t ever run this.