100x would be a 9900% speed boost, while a 100% speed boost would mean it's 2x as fast.
Which one is it?
It’s a bit ironic that for over a decade everybody was on x86 so SIMD optimizations could have a very wide reach in theory, but the extension architectures were pretty terrible (or you couldn’t count on the newer ones being available). And now that you finally can use the new and better x86 SIMD, you can’t depend on x86 ubiquity anymore.
Modern encoders also have better scaling across threads, though not infinite. I was on an embedded project a few years ago where we spent a lot of time trying to get the SoC’s video encoder working reliably, until someone ran ffmpeg and we realized we could just use several of the CPU cores for a better result anyway.
The devil is in the details: microbenchmarks typically call the same function a million times in a loop, so everything stays cached and the overhead is reduced to sheer CPU cycles.
But that’s not how it’s actually used in the wild. It might be called once in a sea of many many other things.
You can at least go out of your way to create a massive test region of memory to prevent the cache from being so hot, but I doubt they do that.
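For what it's worth, a minimal sketch of that idea (names and sizes are mine, not from any real harness): cycle through a working set much larger than the last-level cache so the kernel rarely gets its data for free.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-in for whatever function the benchmark exercises. */
    extern void kernel_under_test(const uint8_t *src, size_t len);

    void bench_cold(size_t iters) {
        const size_t working_set = (size_t)512 << 20;  /* 512 MiB, well past typical L3 */
        const size_t chunk = (size_t)64 << 10;         /* 64 KiB handed to each call */
        uint8_t *buf = malloc(working_set);
        if (!buf) return;
        memset(buf, 1, working_set);                   /* fault the pages in first */
        size_t off = 0;
        for (size_t i = 0; i < iters; i++) {
            kernel_under_test(buf + off, chunk);       /* each call sees a fresh region */
            off += chunk;
            if (off + chunk > working_set)
                off = 0;                               /* wrap around the big buffer */
        }
        free(buf);
    }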
Have you used ISPC, and what are your thoughts on it?
I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand, as regular compilers suck at auto-vectorizing, especially as this has never been the case with GPU kernels.
No professional kernel writer uses auto-vectorization.
> I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand
You feel it's ridiculous because you've been sold a myth/lie (abstraction). In reality the details have always mattered.
If you both are vectorizing the same thing there might not be much difference.
If both are not vectorizing there might not be much difference, but with ISPC you can easily make sure that it does use vectorization and the best instruction set for your CPU.
void vector_add(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

However, if you build your software from kernels like that, you leave a lot of performance on the table. For example, each core of my Zen 4 CPU at base frequency can add FP32 numbers with AVX1 or AVX-512 at 268 GB/sec, which results in 806 GB/sec total bandwidth for your kernel with two inputs and one output.
However, dual-channel DDR5 memory in my computer can only deliver 83 GB/sec bandwidth shared across all CPU cores. That’s an order of magnitude difference for a single threaded program, and almost 2 orders of magnitude difference when computing something on the complete CPU.
Even worse, the difference between compute and memory widens over time. The next generation Zen 5 CPUs can add numbers twice as fast per cycle if using AVX-512.
For this reason, ideally you want your kernels to do much more work with the numbers loaded from memory. That’s why efficient compute kernels are often way more complicated than the for loop in your example. Sadly, it seems modern compilers can only reliably autovectorize very simple loops.
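To make "more work per loaded byte" concrete, here is a hedged sketch of mine (not from any particular codebase): a per-element polynomial does about 8 FLOPs per 4 bytes loaded, instead of 1 FLOP per 12 bytes of traffic for the add above, so the same memory bandwidth feeds far more arithmetic.

    #include <stddef.h>

    /* Degree-4 polynomial per element. Coefficients are arbitrary placeholders. */
    void poly_eval(const float *x, float *y, size_t n) {
        const float c0 = 1.0f, c1 = 0.5f, c2 = 0.25f, c3 = 0.125f, c4 = 0.0625f;
        for (size_t i = 0; i < n; i++) {
            float t = x[i];
            /* Horner's scheme: 4 multiplies + 4 adds per element. */
            y[i] = c0 + t * (c1 + t * (c2 + t * (c3 + t * c4)));
        }
    }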
But you're right. It's hard to come up with enough computing to interleave with the actual expensive part, which is accessing memory. Even L2 cache isn't really fast enough to not be a bottleneck for typical vectorized operations.
If you look at TPU architectures, the general pattern is: fast, local, large L1-cache-grade memory, preferably multi-ported or multi-banked, plus compute (whatever). The important bit being the memory architecture, not the compute. Plus blisteringly fast communication between cores, over which results get streamed at speeds that are still not really fast enough.
Interestingly, I sat in on architecture meetings three decades ago, where Intel architects were privately telling us: "compute doesn't matter any more; it's all about memory speed".
On a Zen 4 CPU, I believe the typical latency of L1D is 4 cycles. However, if you (or your compiler) write AVX code which adds these floats, it will still bottleneck on memory even if both inputs are in L1D cache. Each Zen 4 core can sustain two vaddps instructions per cycle, two loads per cycle, and one store per cycle. Due to the load and store bottlenecks, that kernel will only do one vaddps per cycle, i.e. it will waste 50% of the theoretically available compute power.
> Intel architects were privately telling us: "compute doesn't matter any more; it's all about memory speed"
To be fair, that’s only true for automatically vectorized code, kernels like in the GP’s example. With sufficient effort spent on software development, for some practical problems it’s possible to write code which does saturate compute.
An example of such a problem is multiplication of large matrices. A carefully written, manually vectorized implementation should bottleneck on compute, not memory, because theoretically required memory bandwidth scales as N^2, while theoretically required FLOPs scale as N^3, where N is the size of the matrix.
That’s precisely what many BLAS libraries are doing under the hood. For the same reason GPU vendors report ridiculously high numbers of theoretical TFlops when multiplying low precision matrices with these special AI blocks, wmma/mfma instructions on AMD, tensor cores on nVidia.
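As a rough illustration of why the N^3/N^2 ratio helps, here is a naive cache-blocked sketch of mine, not what an actual BLAS kernel looks like (real ones add packing and vectorized, register-blocked microkernels):

    #include <stddef.h>

    enum { BLK = 64 };  /* tile size; tune so three BLK x BLK tiles fit in cache */

    /* C += A * B, all row-major n x n; assumes n is a multiple of BLK and that
       C was zero-initialized by the caller. Each tile of A and B is reused BLK
       times after being pulled into cache, which is what moves the bottleneck
       from memory bandwidth toward compute. */
    void matmul_blocked(const float *A, const float *B, float *C, size_t n) {
        for (size_t ii = 0; ii < n; ii += BLK)
            for (size_t kk = 0; kk < n; kk += BLK)
                for (size_t jj = 0; jj < n; jj += BLK)
                    for (size_t i = ii; i < ii + BLK; i++)
                        for (size_t k = kk; k < kk + BLK; k++) {
                            float a = A[i * n + k];
                            for (size_t j = jj; j < jj + BLK; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }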
As an interesting point of reference, the last time I did Windows kernel development (which was admittedly not recently), code running in ring 0 was not allowed to access SIMD registers because they weren't saved and restored during kernel-mode context switches. Not sure if that's still the case, but it probably is. Context switches are a LOT faster if you don't have to save and restore SIMD registers.
Do you have any wisdom you can share about techniques or references you can point to?
And they will generally produce better assembler than I can for all but the weird bit-twiddly Intel special-purpose SIMD instructions. I'm actually professionally good at it. But the GCC instruction scheduler seems to be consistently better than I am at instruction scheduling. GCC and Clang actually have detailed models of the execution pipelines of all x64 and aarch64 processors, which they use to perform instruction scheduling. Truly an amazing piece of work. So their instruction scheduling is -- as far as I can tell without very expensive Intel and ARM profiling tools -- immaculate across a wide variety of processors and architectures.
And if it vectorizes on x64, it will probably vectorize equally well on aarch64 ARM Neon as well. And if you want optimal instruction scheduling for an Arm Cortex A72 processor (a pi 4), well, there's a switch for that. And for an A76 (a Pi 5). And for an Intel N100.
The basics:
- You need -ffast-math and -O3 (or the MSVC equivalent).
- You need to use and understand the "restrict" keyword (or the closest compiler equivalent). If the compiler has to guard against aliasing of input and output arrays, you'll get performance-impaired code. (See the sketch after this list.)
- You can reasonably expect any math in a for loop to be vectorized provided there aren't dependencies between iterations of the loop.
- Operations on arrays and matrices whose size is fixed at compile time are significantly better than operations that operate on matrices and arrays whose size is determined at runtime. (Dealing with the raggedy non-x4 tail end of arrays is easier if the shape is known at compile time).
- Trust the compiler. It will do some pretty extraordinary things (e.g. generating 7 bit-twiddly SIMD instructions to vectorize atanf(float4)). But verify. Make sure it is doing what you expect.
- the cycle is: compile the C/C++ code; disassemble to make sure it did vectorize properly (and that it doesn't have checks for aliased pointers); profile; repeat until done, or forever, whichever comes first.
- Even for fairly significant compute, execution is likely to be dominated by memory reads and writes (hopefully cached memory reads and writes). My neural net code spends about 80% of its time waiting for reads and writes to various cache levels. (And 15% of its time doing properly vectorized atanf operations, some significant portion of which involves memory reads and writes that spill L1 cache.) The FFT code spends pretty close to 100% of its time waiting for memory reads and writes (even with pivots that improve cache behavior). The actual optimization effort was determining (at compile time) when to do pivot rounds to improve use of L1 and L2 cache. I would expect this to be generally true. You can do a lot of compute in the time it takes for an L1 cache miss to execute.
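The sketch mentioned above for the restrict point -- a loop shaped like this (mine, hypothetical) is about as much as I'd expect any compiler to vectorize reliably:

    #include <stddef.h>

    /* restrict promises dst, x and y never alias, so the compiler can vectorize
       without emitting runtime overlap checks. Build with something like
       gcc -O3 -ffast-math -march=native (per the flags above). */
    void saxpy(float *restrict dst, const float *restrict x,
               const float *restrict y, float a, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a * x[i] + y[i];   /* no dependency between iterations */
    }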
I can't think of why ISPC would do better than GCC, given what GCC actually does. My suspicion is that ISPC was a technology demonstration rather than a real product, produced at a time when major compilers had no support at all for SIMD.
vmovdqu xmm1, xmmword ptr [rdx]
vmovdqu xmm0, xmmword ptr [rdx + 16]
mov rax, rdi
vmovd ecx, xmm1
or byte ptr [rsi], cl
vpextrb ecx, xmm1, 1
or byte ptr [rsi + 1], cl
vpextrb ecx, xmm1, 2
or byte ptr [rsi + 2], cl
vpextrb ecx, xmm1, 3
or byte ptr [rsi + 3], cl
vpextrb ecx, xmm1, 4
or byte ptr [rsi + 4], cl
vpextrb ecx, xmm1, 5
or byte ptr [rsi + 5], cl
...
vpextrb ecx, xmm0, 15
or byte ptr [rsi + 31], cl
vmovups ymm0, ymmword ptr [rsi]
vmovups ymmword ptr [rdi], ymm0
instead of a vorps.

xor eax, eax
.L2:
vmovdqu xmm1, XMMWORD PTR [rdx+rax]
vinsertf128 ymm1, ymm1, XMMWORD PTR [rdx+16+rax], 0x1
vmovdqu xmm0, XMMWORD PTR [rsi+rax]
vinsertf128 ymm0, ymm0, XMMWORD PTR [rsi+16+rax], 0x1
vorps ymm0, ymm0, ymm1
vmovdqu XMMWORD PTR [rdi+rax], xmm0
vextractf128 XMMWORD PTR [rdi+16+rax], ymm0, 0x1
add rax, 32
cmp rax, 4096
jne .L2
vzeroupper
ret

Their optimizations also do not necessarily carry across compilers.
If you are in gamedev, and are targeting multiple platforms, with multiple CPU architectures and compiler vendors, and your game relies on a particular function being 20x faster than the scalar version, then failing to vectorize is a blocker bug.
Imo due to this, autovectorization is more of a nice surprise than something you can rely on.
As for fragility, if it's an obvious candidate for a SIMD loop, all the compilers I have worked with so far will obviously auto-vectorize it.
What pushed me to the point of no going back: checking to see how they were doing it and finding complete models of the complete execution pipelines of literally hundreds of processors, and realizing that the reason their code does such good instruction scheduling was that they had a full model of the execution pipeline! In both GCC and Clang sources. How long has this stuff been around for? A decade and a half? Two? The Compiler Kiddies needed SOMETHING to keep them occupied and employed for 20 years. And auto-vectorization was it. A major industry-wide initiative, specifically to address compiler auto-vectorization. AMAZING stuff. (Well. That and continuous never-ending C++ standards. But a LOT of auto-vectorization. And instruction scheduling.)
If you look at heavily-optimized SIMD code side-by-side with the equivalent heavily-optimized scalar code, they are often almost entirely unrelated implementations. That's the part compilers can't do.
Note that I use SIMD heavily in a lot of domains that aren't just brute-forcing a lot of straightforward numerics. If you are just brute-forcing numerics, auto-vectorization works pretty well. For example, I have a high-performance I/O scheduler that is almost pure AVX-512, and the compiler can't vectorize any of that.
An ideal scatter-read or gather-store instruction should take time proportional to the number of cache lines that it touches. If all of the lane accesses are sequential and cache line aligned it should take the same amount of time as an aligned vector load or store. If the accesses have high cache locality such that only two cache lines are touched, it should cost exactly the same as loading those two cache lines and shuffling the results into place. That isn't what we have on x86-AVX512. They are microcoded with inefficient lane-at-a-time implementations. If you know that there is good locality of reference in the access, then it can be faster to hand-code your own cache line-at-a-time load/shuffle/masked-merge loop than to rely on the hardware. This makes me sad.
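To make the cache-line-at-a-time idea concrete, here is a hedged AVX-512F sketch of mine for the degenerate case where all 16 int32 indices are known to fall inside one 64-byte line:

    #include <immintrin.h>
    #include <stdint.h>

    /* What the hardware gives you: a microcoded gather, roughly lane-at-a-time. */
    __m512i gather_hw(const int32_t *base, __m512i idx) {
        return _mm512_i32gather_epi32(idx, base, 4);
    }

    /* What you'd like when locality is known: load the single cache line once,
       then shuffle the elements into their lanes in registers. */
    __m512i gather_one_line(const int32_t *line_start, __m512i idx_within_line) {
        __m512i line = _mm512_loadu_si512((const void *)line_start);
        return _mm512_permutexvar_epi32(idx_within_line, line);
    }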
ISPC's varying variables have no way to declare that they are sequential among all lanes. Therefore, without extensive inlining to expose the caller's access pattern, it issues scatters and gathers at the drop of a hat. You might like to write your program with a naive x[y] (x a uniform pointer, y a varying index) in a subroutine, but ISPC's language cannot infer that y is sequential along lanes. So, you have to carefully re-code it to say that y is actually a uniform offset into the array, and write x[y + programIndex]. Error-prone, yet utterly essential for decent performance. I resorted to munging my naming conventions for such indexes, not unlike the Hungarian notation of yesteryear.
Rewriting critical data structures in SoA format instead of AoS format is non-trivial, and a prerequisite to get decent performance from ISPC. You cannot "just" replace some subroutines with ISPC routines, you need to make major refactorings that touch the rest of the program as well. This is neutral in an ISPC-versus-intrinsics (or even ISPC-versus-GPU) shootout, but it is worth mentioning only to point out that ISPC is also not a silver bullet in this regards, either.
Non-minor nit: The ISPC math library gives up far too much precision by default in the name of speed. Fortunately, Sleef is not terribly difficult to integrate and use for the 1-ulp max rounding error that I've come to expect from a competent libm.
Another: The ISPC calling convention adheres rather strictly to the C calling convention... which doesn't provide any callee-saved vector registers, not even for the execution mask. So if you like to decompose your program across multiple compilation units, you will also notice much more register save and restore traffic than you would like or expect.
I want to like it, I can get some work done in it, and I did get significant performance improvements over scalar code when using it. But the resulting source code and object code are not great. They are merely acceptable.
thus sayeth the lord.
praise the lord!
So OP is correct. The 100x speedup is according to some misleading microbenchmark. The reason is that that transform is a huge amount of code and, as OP said, this will blow out the code cache, while the amount of data you’re processing results in a blowout of the data cache. The net overall improvement might be 1%, if even that.
Honestly though, nobody who has any idea how anything works would have expected ffmpeg to suddenly unearth a 100x speedup for everything. That's why the devs did not clarify this right away. It's too laughable of an assumption.
So just to summarize: either a 100x or a 100% speedup (depending on which source):
- comparing hand-coded assembler vs. unoptimized C code.
- on a function that was poorly written in the first place.
- in code that's so rarely used that nobody could be bothered to fix it for decades.
- and even then, a tiny function whose cost was only about 2% of the overall CPU cost of performing the obsolete task that nobody cared enough about to fix.
- so basically code that fails the profile before optimize rule, and should never have been optimized in the first place.
I think that covers it.
I care more about the outcome than the underlying semantics; to me that's kind of a given.
> They would later go on to elaborate that the functionality, which might enjoy a 100% speed boost, depending upon your system, was “an obscure filter.”
However, to be fair, they communicate this stuff very clearly.
Basically, you'd do a block design (https://en.wikipedia.org/wiki/Blocking_(statistics)): on any random hardware you have, you run both versions back to back (or even better, interleaved), and note down the performance.
The idea is that the differences between machines themselves and anything else running on them are noise, and you are trying to design your experiments in such a way that the noise affects both arms of the experiment in the same way, at least statistically.
Downsides: you have to do more runs and do more statistics to deal with the noise.
Upside: you can use any old hardware you have access to, even if it's not dedicated. And the numbers are arguably going to be more representative of real conditions, and not just a pristine lab environment.
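A minimal sketch of what interleaved runs could look like (my own, POSIX timing, with hypothetical baseline()/optimized() stand-ins for the two implementations under test):

    #include <stdio.h>
    #include <time.h>

    extern void baseline(void);    /* hypothetical: the old implementation */
    extern void optimized(void);   /* hypothetical: the new implementation */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        enum { ROUNDS = 1000 };
        double t_base = 0.0, t_opt = 0.0;
        for (int r = 0; r < ROUNDS; r++) {
            /* Alternate the two arms so thermal drift, background load, etc.
               hit both roughly equally. */
            double t0 = now_sec(); baseline();  double t1 = now_sec();
            double t2 = now_sec(); optimized(); double t3 = now_sec();
            t_base += t1 - t0;
            t_opt  += t3 - t2;
        }
        printf("baseline %.3fs, optimized %.3fs, ratio %.2fx\n",
               t_base, t_opt, t_base / t_opt);
        return 0;
    }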
https://thesofproject.github.io/latest/introduction/index.ht...
Clearly I'm wrong on this; I should probably properly learn assembly at some point...
To a first approximation, modern compilers can’t vectorize loops beyond the most trivial (say a dot product), and even that you’ll have to ask for (e.g. gcc -O3, which in other cases is often slower than -O2). So for mathy code like this they can easily be a couple dozen times behind in performance compared to wide vectors (AVX/AVX2 or AVX-512), especially when individual elements are small (like the 8-bit ones here).
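For reference, the kind of loop I mean by "the most trivial" (my sketch): a plain dot product, which GCC/Clang will only vectorize once -O3 and -ffast-math let them reorder the floating-point reduction.

    #include <stddef.h>

    /* Vectorizes only when the compiler is allowed to reorder the FP reduction,
       e.g. gcc -O3 -ffast-math (or clang with the same flags). */
    float dot(const float *restrict a, const float *restrict b, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += a[i] * b[i];
        return acc;
    }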
Very tight scalar code, on modern superscalar CPUs... You can outcode a compiler by a meaningful margin, sometimes (my current example is a 40% speedup). But you have to be extremely careful (think dependency chains and execution port loads), and the opportunity does not come often (why are you writing scalar code anyway?..).
[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346725.h...
[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346726.h...
I commented there with some suggested changes, and you can find more performance comparisons [0].
For example, with a small adjustment to the C and compiling it for AVX-512:
after (gcc -ftree-vectorize --march=znver4)
detect_range_8_c: 285.6 ( 1.00x)
detect_range_8_avx2: 256.0 ( 1.12x)
detect_range_8_avx512: 107.6 ( 2.65x)
Also I argued that it may be a little bit misleading to post a comparison without stating the compiler and flags used for said comparison [1].

P.S. There is related work to enable -ftree-vectorize by default [2]
[0] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346813.h...
[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346794.h...
[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346439.h...
I mean, I love ffmpeg, I use it a lot and it's fantastic for my needs, but I've found their public persona often misleading and well, this just confirms my bias.
> We made a 100x improvement over incredibly unoptimized C by writing heavily specific cpu instructions that the compiler cannot use because we don't allow it!
2x is still an improvement, but way less outstanding than they want to publicize it as, just because they used assembly.

At my day job I have a small pile of code I'm responsible for which is a giant pile of intrinsics. We compile with GCC and MSVC. We have one function that is just a straight function. There are no loops, there is one branch. There is nothing that isn't either a vector intrinsic or an address calculation, which I'm pretty sure is simple enough that it can be folded into x86's fancy inline memory address calculation thingie. There's basically nothing for the compiler to do except translate vector intrinsics and address calculations from C into assembly.
The code when run in GCC is approximately twice as fast as MSVC.
The compiler is also responsible for register allocation, and MSVC is criminally insane at it.
One of these days I'll have to rewrite this function in assembly, just to get MSVC up to GCC speeds, but frankly I don't want to.
First check that you’re passing the correct flags to the VC++ compiler and linker: an optimized release configuration, the correct /arch switch, and ideally LTCG. BTW if you’re using the cmake build system it’s rather hard to do; cmake’s support for the VC++ compiler is not great.
Another thing: VC++ doesn’t like emitting giant functions. A possible reason for the 2x difference is that VC++ failed to inline stuff, and your function ended up with calls to other functions instead of being a single large one. Note the compiler supports the __forceinline keyword to work around that.
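For reference, this is the sort of invocation I mean (illustrative only; pick the /arch that matches your target):

    cl /O2 /fp:fast /arch:AVX2 /GL kernel.c /link /LTCG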
And it's 100x because a) min/max have single instructions in SIMD vs cmp+cmov in scalar and b) it's operating in u8 precision so each AVX512 instruction does 64x min/max. So unlike the unoptimized scalar that has a throughput under 1 byte per cycle, the AVX512 version can saturate L1 and L2 bandwidth. (128B and 64B per cycle on Zen 5.)
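Roughly, the hot loop looks like this (my sketch with AVX-512BW intrinsics, not the actual ffmpeg code; assumes n is a nonzero multiple of 64 to keep it short):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Running min/max over n bytes, 64 lanes per instruction. */
    void minmax_u8(const uint8_t *src, size_t n, uint8_t *out_min, uint8_t *out_max) {
        __m512i vmin = _mm512_set1_epi8((char)0xFF);
        __m512i vmax = _mm512_setzero_si512();
        for (size_t i = 0; i < n; i += 64) {
            __m512i v = _mm512_loadu_si512((const void *)(src + i));
            vmin = _mm512_min_epu8(vmin, v);  /* one instruction, 64 byte lanes */
            vmax = _mm512_max_epu8(vmax, v);
        }
        /* Reduce the 64 lanes down to single bytes. */
        uint8_t lanes_min[64], lanes_max[64], mn = 0xFF, mx = 0;
        _mm512_storeu_si512((void *)lanes_min, vmin);
        _mm512_storeu_si512((void *)lanes_max, vmax);
        for (int k = 0; k < 64; k++) {
            if (lanes_min[k] < mn) mn = lanes_min[k];
            if (lanes_max[k] > mx) mx = lanes_max[k];
        }
        *out_min = mn;
        *out_max = mx;
    }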
But, this kernel is operating on an entire frame; if you have to go to L3 because it's more than a megapixel then the gain should halve (depending on CPU, but assuming Zen 5), and the gain decreases even more if the frame isn't resident in L3.
And... yep, the benchmark is on 256x16 frames. [1]
[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346729.h...
Random example: https://stackoverflow.com/questions/71343461/how-does-gcc-no...
The code in question was called quadrillions of times, so this actually mattered.
And the function in question is specifically for the color range part.
Are you saying that it's run once during a conversion as part of the process? Or that it's a specific flag that you give, it then runs this function, and returns output on the console?
(Either of those would be a one-time affair, so would likely result in close to zero speed improvement in the real world).
So you wouldn’t ever run this.
In later times, we developed DVDs. DVDs were digital, but they still had to encode data that would ultimately be sent across an analog cable to an analog television that would display the analog signal. So DVDs used values darker than 16 (out of 255) to denote blacker-than-black. This digital signal would be decoded to an analog signal and sent directly onto the wire. So while DVDs are ostensibly 8 bits per channel of color, it's more like 7.9 bits per channel. This is also true for BluRay and HDMI.
In more recent times, we've decided we want that extra 0.1 bits back. Some codecs will encode video that uses the full range of 0-255 and signal it in-band.
The problem is that ... sometimes people do a really bad job of telling the codec whether the signal range is 0-255 or 16-255. And it really does make a difference. Sometimes you'll be watching a show or movie or whatever and the dark parts will be all fucked up. There are several reasons this can happen, one of which is that the black level is wrong.
It looks like this function scans frames to determine whether all the pixels are in the 16-255 or 0-255 range. If a codec can be sure that the pixel values are 16-255, it can save some bits while encoding. But I could be wrong.
I do video stuff at my day job, and much to my own personal shame, I do not handle black levels correctly.
Also as an aside, "limited" is even more limited than 16-255, it's limited on the top end also: max white is 235, and the color components top out at 240.
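For the curious, the usual 8-bit limited-to-full expansion looks roughly like this (my sketch; constants come from the nominal 16-235 luma range):

    #include <stdint.h>

    /* Nominal 8-bit limited-range luma spans 16..235 (219 code values);
       full range is 0..255. Values outside 16..235 get clamped. */
    static inline uint8_t limited_to_full_y(uint8_t y) {
        int v = ((int)y - 16) * 255 / 219;
        if (v < 0)   v = 0;      /* blacker than black */
        if (v > 255) v = 255;    /* whiter than white */
        return (uint8_t)v;
    }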
The project as a whole is also utterly fascinating, if you find the idea of pulling an analog RF signal from a laser and then doing software ADC interesting.
[0]: https://github.com/happycube/ld-decode/wiki/ld-analyse#under...
* not 100x
Text: "... this boost is only seen in an obscure filter", "... up to ... %"
[expletives omitted]
That doesn't imply he sped up the program. Ironically, speeding up parts of the code may decrease overall performance due to resource contention.
As the saying goes. There are lies, damned lies, statistics, and then benchmarking.