For example, suppose an add operation has a latency of one unit in some silicon process. To add reduce a 32 element vector, you'll have a five deep tree, which means your operation has a latency of five units. You can pipeline this, but you can't solve the fact that this operation has a 5x higher latency than the non-reduce operations.
But in general building code around reductions isn't really a thing you'd ideally do; they're necessarily gonna have higher latency / lower throughput / take more silicon compared to avoiding them where possible, best to leave reducing to a single element to the loop tail.
Flaw1:fixed width
I prefer fixed width as it makes the code simpler to write, size is known as compile time so we know the size of our structures. Swizzle algorithms are also customized based on the size.
Flaw2:pipelining
no CPU I care about is in order so mostly irrelevant, and even scalar instructions are pipelined
Flaw3: tail handling
I code with SIMD as the target, and have special containers that pad memory to SIMD width, no need to mask or run a scalar loop. I copy the last valid value into the remaining slots so it doesn't cause any branch divergence.
I don't understand the push for variable width SIMD. Possibly due to ignorance but I think it's an abstraction that can be specialized for different hardware so the similar tradeoffs between low level languages and high level languages apply. Since I already have to be aware of hardware level concepts such as 256bit shuffle not working across 128bit lanes and different instructions having very different performance characteristics on different CPUs I'm already knee deep in hardware specifics. While in general I like abstractions I've largely given up waiting for a 'sufficiently advanced compiler' that would properly auto-vectorize my code. I think AGI AI is more likely to happen sooner. At a guess it seems to be that SIMD code could work on GPUs but GPU code has different memory access costs so the code there would also be completely different.
So my view is either create a much better higher level SIMD abstraction model with a sufficiently advanced compiler that knows all the tricks or let me work closely at the hardware level.
As an outsider who doesn't really know what is going on it does worry me a bit that it appears that WASM is pushing for variable width SIMDs instead of supporting ISAs generally supported by CPUs. I guess it's a portability vs performance tradeoff - I worry that it may be difficult to make variable as performant as fixed width and would prefer to deal with portability by having alternative branches at code level.
  >> Finally, any software that wants to use the new instruction set needs to be rewritten (or at least recompiled). What is worse, software developers often have to target several SIMD generations, and add mechanisms to their programs that dynamically select the optimal code paths depending on which SIMD generation is supported.
So variable width SIMD solves this by making any module using it valid regardless of whether the target supports 512-bit vectors, and the VM 'just' has to solve the problem of generating good code.
Personally I think this is a terrible way to do things and there should have just been a feature detection system, but the horse fled the barn on that one like a decade ago.
Wouldn’t that be suboptimal if/when CPUs that support 1024-bit vectors come along?
> Variable-length vectors, on the other hand, are a very challenging problem for compiler devs. You tend to get worse code out than if you just statically picked a size, even if it's not the native size.
Why would it be challenging? You could statically pick a size on a system with variable-length vectors, too. How would that be worse code?
If you know you're engineering for 16-byte vectors you can 'just' align all your data to 16 bytes. And if you know you have 8 vector registers where 4 of them are non-volatile you can design around that too. But without information like that you have to be defensive, like aligning all your data to 128 bytes instead Just In Case (heaven forbid native vectors get bigger than that), minimizing the number of registers you use to try and avoid stack spills, etc. (I mention this because WASM also doesn't expose any of this information.)
It's true that you could just design for a static size on a system with variable-length vectors. I suspect you'd see a lot of people do that, and potentially under-utilize the hardware's capabilities. Better than nothing, at least!
Is that likely or on anyone's roadmap? It makes a little less sense than 512 bits, at least for Intel, since their cache lines are 64 bytes i.e. 512 bits. Any more than that and they'd have to mess with multiple cache lines all the time, not just on unaligned accesses. And they'd have to support crossing more than 2 cache lines on unaligned accesses. They increase the cache line size too, but that seems terrible for compatibility since a lot of programs assume it's a compile time constant (and it'd have performance overhead to make it a run-time value). Somehow it feels like this isn't the way to go, but hey, I'm not a CPU architect.
Do you have examples for problems that are easier to solve in fixed-width SIMD?
I maintain that most problems can be solved in a vector-length-agnostic manner. Even if it's slightly more tricky, it's certainly easier than restructuring all of your memory allocations to add padding and implementing three versions for all the differently sized SIMD extensions your target may support. And you can always fall back to using a variable-width SIMD ISA in a fixed-width way, when necessary.
There's other kinds of interesting things you can do with vectors that aren't improved by dynamic-length vectors. Something like abseil's hash table, which uses vector code to efficiently manage the occupancy bitmap. Dynamic vector length doesn't help that much in that case, particularly because the vector length you can parallelize over is itself intrinsically low (if you're scanning dozens of elements to find an empty slot, something is wrong). Vector swizzling is harder to do dynamically, and in general, at high vector factors, difficult to do generically in hardware, which means going to larger vectors (even before considering dynamic sizes), vectorization is trickier if you have to do a lot of swizzling.
In general, vector-length-agnostic is really only good for SIMT-like codes, which you can express the vector body as more or less independent f(index) for some knowable-before-you-execute-the-loop range of indices. Stuff like DAXPY or BLAS in general. Move away from this model, and that agnosticism becomes overhead that doesn't pay for itself. (Now granted, this kind of model is a large fraction of parallelizable code, but it's far from all of it).
> Something like abseil's hash table
If I remember this correctly, the abseil lookup does scale with vector length, as long as you use the native data path width. (albeit with small gains) There is a problem with vector length agnostic handling of abseil, which is the iterator API. With a different API, or compilers that could eliminate redundant predicated load/stores, this would be easier.
> good for SIMT-like codes
Certainly, but I've also seen/written a lot of vector length agnostic code using shuffles, which don't fit into the SIMT paradigm, which means that the scope is larger than just SIMT.
---
As a general comparison, take AVX10/128, AVX10/256 and AVX10/512, overlap their instruction encodings, remove the few instructions that don't make sense anymore, and add a cheap instruction to query the vector length. (probably also instructions like vid and viota, for easier shuffle synthesization) Now you have a variable-length SIMD ISA that feels familiar.
The above is basically what SVE is.
For very many cases, writing the code once for an 'unknown to the programmer' vector length indeed works.
One example that doesn't work so well is a sorting network; its size depends on the vector length. (I see you mention this below.)
As mentioned, last time I tried vqsort for RVV it was surprisingly slow.
I tried to replicate it yesterday, but noticed that vqsort is now disabled for RVV: https://github.com/google/highway/blob/400fbf20f2e40b984be12...
Does highway support sorting networks for non-128-bit vector registers?
When I tried to compile it for AVX512, the BaseCase seems to only use xmm registers: https://godbolt.org/z/qr9xoTGKn
Yes, the issue with the sorting network is that it is limited to 16x16 to reduce code explosion. With uint16_t, XMM are sufficient for the 8-column case; your Godbolt link does have some YMM for the 16-column case. When changing the type to sort to uint32_t, we see ZMM as expected.
It has a few more instructions then the VLS version, but the critical dependency chain is the same.
It's also slightly less optimal on x86, because it alway uses lane crossing permutes. For AVX512 that is 5 out of 15 permutations that are vperm, but could've been vshuf. (if the loop isn't unrolled and optimized by the compiler)
I wasn't able to figure out how to implement the multi vector register sort in a VLA way.
Yes, the 2D aspect of the sorting network complicates things. Transposing is already harder to make VLA and fusing it with the other shuffles certainly doesn't help.
The benefit of fixed width is that optimal data structure and algorithm design on various microarchitectures is dependent on explicitly knowing the register width. SIMD widths aren’t not perfectly substitutable in practice, there is more at play than stride size. You can also do things like explicitly combine separate logic streams in a single SIMD instruction based on knowing the word layout. Compilers don’t do this work in 2025.
The argument for vector width agnostic code seems predicated on the proverbial “sufficiently advanced compiler”. I will likely retire from the industry before such a compiler actually exists. Like fusion power, it has been ten years away my entire life.
A SIMD ISA having a fixed size or not is orthogonal to autovectorization. E.g. I've seen a bunch of cases where things get autovectorized for RVV but not for AVX512. The reason isn't fixed vs variable, but rather the supported instructions themselves.
There are two things I'd like from a "sufficiently advanced compiler”, which are sizeless struct support and redundant predicated load/store elimination. Those don't fundamentally add new capabilities, but makes working with/integrating into existing APIs easier.
> All the complexity and cost is in specializing for the capabilities of the underlying SIMD ISA, not the width.
Wow, it almost sounds like you could take basically the same code and run it with different vector lengths.
> The benefit of fixed width is that optimal data structure and algorithm design on various microarchitectures is dependent on explicitly knowing the register width
Optimal to what degree? Like sure, fixed-width SIMD can always turn your pointer increments from a register add to an immediate add, so it's always more "optimal", but that sort of thing doesn't matter.
The only difference you usually encounter when writing variable instead of fixed size code is that you have to synthesize your shuffles outside the loop. This usually just takes a few instructions, but loading a constant is certainly easier.
It isn’t “same instruction but wider or narrower” or anything that can be trivially autovectorized, it is “different algorithm design”. Compilers are not yet rewriting data structures and algorithms based on microarchitecture.
I write a lot of SIMD code, mostly for database engines, little of which is trivial “processing a vector of data types” style code. AVX512 in particular is strong enough of an ISA that it is used in all kinds of contexts that we traditionally wouldn’t think of as a good for SIMD. You can build all kinds of neat quasi-scalar idioms with it and people do.
Afaik mojo allows for this with the autotuning capability and metaprogramming
Regular expression matching and encryption come to mind.
That's probably true. Last time I looked at it, it seemed like parts of vectorscan could be vectorized VLA, but from my, very limited, understanding of the main matching algorithm, it does seem to require specialization on vector length.
It should be possible to do VLA in some capacity, but it would probably be slower and it's too much work to test.
> encryption
From the things I've looked at, it's mixed.
E.g. chacha20 and poly1305 vectorize well in a VLA scheme: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/chacha2..., https://camel-cdr.github.io/rvv-bench-results/bpi_f3/poly130...
Keccak on the other hand was optimized for fast execution on scalar ISAs with 32 GPRs. This is hard to vectorize in general, because GPR "moves" are free and liberally applied.
Another example where it's probably worth specializing is quicksort, specifically the leaf part.
I've written a VLA version, which uses bitonic sort to sort within vector registers. I wasn't able to meaningfully compare it against a fixed size implementation, because vqsort was super slow when I tried to compile it for RVV.
- Register width: we somewhat maxed out at 512 bits, with Intel going back to 256 bits for non-server CPUs. I don't see larger widths on the horizon (even if SVE theoretically supports up to 2048 bits, I don't know any implementation with ~~>256~~ >512 bits). Larger bit widths are not beneficial for most applications and the few applications that are (e.g., some HPC codes) are nowadays served by GPUs.
- The post mentions available opcode space: while opcode space is limited, a reasonably well-designed ISA (e.g., AArch64) has enough holes for extensions. Adding new instructions doesn't require ABI changes, and while adding new registers requires some kernel changes, this is well understood at this point.
- "What is worse, software developers often have to target several SIMD generations" -- no way around this, though, unless auto-vectorization becomes substantially better. Adjusting the register width is not the big problem when porting code, making better use of available instructions is.
- "The packed SIMD paradigm is that there is a 1:1 mapping between the register width and the execution unit width" -- no. E.g., AMD's Zen 4 does double pumping, and AVX was IIRC originally designed to support this as well (although Intel went directly for 256-bit units).
- "At the same time many SIMD operations are pipelined and require several clock cycles to complete" -- well, they are pipelined, but many SIMD instructions have the same latency as their scalar counterpart.
- "Consequently, loops have to be unrolled in order to avoid stalls and keep the pipeline busy." -- loop unroll has several benefits, mostly to reduce the overhead of the loop and to avoid data dependencies between loop iterations. Larger basic blocks are better for hardware as every branch, even if predicted correctly, has a small penalty. "Loop unrolling also increases register pressure" -- it does, but code that really requires >32 registers is extremely rare, so a good instruction scheduler in the compiler can avoid spilling.
In my experience, dynamic vector sizes make code slower, because they inhibit optimizations. E.g., spilling a dynamically sized vector is like a dynamic stack allocation with a dynamic offset. I don't think SVE delivered any large benefits, both in terms of performance (there's not much hardware with SVE to begin with...) and compiler support. RISC-V pushes further into this direction, we'll see how this turns out.
Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.
Also there are processors with larger vector length, e.g. 1024-bit: Andes AX45MPV, SiFive X380, 2048-bit: Akeana 1200, 16384-bit: NEC SX-Aurora, Ara, EPI
> no way around this
You rarely need to rewrite SIMD code to take advantage of new extensions, unless somebody decides to create a new one with a larger SIMD width. This mostly happens when very specialized instructions are added.
> In my experience, dynamic vector sizes make code slower, because they inhibit optimizations.
Do you have more examples of this?
I don't see spilling as much of a problem, because you want to avoid it regardless, and codegen for dynamic vector sizes is pretty good in my experience.
> I don't think SVE delivered any large benefits
Well, all Arm CPUs except for the A64FX were build to execute NEON as fast as possible. X86 CPUs aren't built to execute MMX or SSE or the latest, even AVX, as fast as possible.
Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve
> Performance was a lot better than I expected, giving between 14 and 63% uplift. Larger block sizes benefitted the most, as we get higher utilization of the wider vectors and fewer idle lanes.
> I found the scale of the uplift somewhat surprising as Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so in terms of data-width the two should work out very similar.
256 and 512 bits are the only reasonable widths. 256 bit AVX2 is what, 13 or 14 years old now.
https://sourceware.org/bugzilla/show_bug.cgi?id=29611
https://developercommunity.visualstudio.com/t/Crash-in-Windo...
How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.
> Do you have more examples of this?
I ported some numeric simulation kernel to the A64Fx some time ago, fixing the vector width gave a 2x improvement. Compilers probably/hopefully have gotten better in the mean time and I haven't redone the experiments since then, but I'd be surprised if this changed drastically. Spilling is sometimes unavoidable, e.g. due to function calls.
> Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve
I was specifically referring to dynamic vector sizes. This experiment uses sizes fixed at compile-time, from the article:
> For the astcenc implementation of SVE I decided to implement a fixed-width 256-bit implementation, where the vector length is known at compile time.
The NEC is an attached accelerator, but IIRC it can run an OS in host mode. It's hard to tell how the others perform, because most don't have hardware available yet or only they and partner companies have access. It's also hard to compare, because they don't target the desktop market.
> I ported some numeric simulation kernel to the A64Fx some time ago, fixing the vector width gave a 2x improvement.
Oh, wow. Was this autovectorized or handwritten intrinsics/assembly?
Any chance it's of a small enough scope that I could try to recreate it?
> I was specifically referring to dynamic vector sizes.
Ah, sorry, yes you are correct. It still shows that supporting VLA mechanisms in an ISA doesn't mean it's slower for fixed-size usage.
I'm not aware of any proper VLA vs VLS comparisons. I benchmarked a VLA vs VLS mandelbrot implementation once where there was no performance difference, but that's a too simple example.
This is a wrong approach. You should be writing you code in a high-level language like this:
    x = sum i for 1..n: a[i] * b[i]
I don't understand what is the advantage of writing the SIMD code manually. At least have a LLM write it if you don't like my imaginary high-level vector language.
In practice, though, the cases that compilers can successfully autovectorize are very limited relative to the total problem space that SIMD is solving. Plus, if I rely on that, it leaves me vulnerable to regressions in the compiler vectorizer.
Ultimately for me, I would rather write the implementation myself and know what is being generated versus trying to write high-level code in just the right way to make the compiler generate what I want.
Just to address this, it's pretty evident why scalar values have stabilized at 64-bit and vectors at ~512 (though there are larger implementations). Tell someone they only have 256 values to work with and they immediately see the limit, it's why old 8-bit code wasted so much time shuffling carries to compute larger values. Tell them you have 65536 values and it alleviates a large set of that problem, but you're still going to hit limits frequently. Now you have up to 4294967296 values and the limits are realistically only going to be hit in computational realms, so bump it up to 18446744073709551615. Now even most commodity computational limits are alleviated and the compiler will handle the data shuffling for larger ones.
There was naturally going to be a point where there was enough static computational power on integers that it didn't make sense to continue widening them (at least, not at the previous rate). The same goes for vectorization, but in even more niche and specific fields.
No, it actually is super common in hpc code. If you unroll a loop N times you need N times as many registers. For normal memory-bound code I agree with you, but most hpc kernels will exploit as much of the register file as they can for blocking/tiling.
I think they would benefit from having 4 vtype registers, though. It's wasted scalar space, but how often do you actually rotate between 4 different vector types in main loop bodies. The answer is pretty rarely. And you'd greatly reduce the swapping between vtypes when. I think they needed to find 1 more bit but it's tough the encoding space isn't that large for rvv which is a perk for sure
Can't wait to seem more implementions of rvv to actually test some of it's ideas
Will be interesting to see if longer encodings for RVV with encoded vtype or whatever ever materialize.
But any core I can think of as 'high-performance' is OOO.
Just before libraries for training neural nets on GPUs became available I worked on a product that had a SIMD based neural network trainer that was written in hand-coded assembly. We were a generation behind in our AVX instructions so we gave up half of the performance we could have got, but that was the least of the challenges we had to overcome to get the product in front of customers. [1]
My software-centric view of Intel's problems is that they've been spending their customers and shareholders money to put features in chips that are fused off or might as well be fused off because they aren't widely supported in the industry. And that they didn't see this as a problem and neither did their enablers in the computing media and software industry. Just for example, Apple used to ship the MKL libraries which like a turbocharger for matrix math back when they were using Intel chips. For whatever reason, Microsoft did not do this with Windows and neither did most Linux distributions so "the rest of us" are stuck with a fraction of the performance that we paid for.
AMD did the right thing in introducing double pumped AVX-512 because at least assembly language wizards have some place where their code runs and the industry gets closer to the place where we can count on using an instruction set defined 12 years ago.
[1] If I'd been tasked with updating the to next generation I would have written a compiler (if I take that many derivatives by hand I'll get one wrong.) My boss would have ordered me not to, I would have done it anyway and not checked it in.
Baffling that MS didn’t use it. They have a pretty close relationship…
Agree that they are sort of going after hard-to-use niche features nowadays. But I think it is just that the real thing we want—single threaded performance for branchy code—is, like, incredibly difficult to improve nowadays.
https://lemire.me/blog/2023/08/12/transcoding-utf-8-strings-...
and web browsers at the very least spent a lot of cycles on decoding HTML and Javascript which is UTF-8 encoded. It turns out AVX-512 is good at a lot of things you wouldn't think SIMD would be good at. Intel's got the problem that people don't want to buy new computers because they don't see much benefit from buying a new computer, but a new computer doesn't have the benefit it could have because of lagging software support, and the software support lags because there aren't enough new computers to justify the work to do the software support. Intel deserves blame for a few things, one of which is that they have dragged their feet at getting really innovative features into their products while turning people off with various empty slogans.
They really do have a new instruction set that targets plain ordinary single threaded branchy code
https://www.intel.com/content/www/us/en/developer/articles/t...
they'll probably be out of business before you can use it.
Unless if said optimization on parsing runs at the very core of JS.
It might be fast, but it's not a UTF-8 decoder. It's a transcoder to a fixed, and very limited, target encoding.
Even the Steam Hardware Survey, which is skewed toward upper end hardware, only shows 16% availability of baseline AVX-512, compared to 94% for AVX2.
The downside is that AMD also increased the latency of all formerly cheap integer vector ops. This removes one of the main advantages against NEON, which historically has had richer operations but worse latencies. That's one thing I hope Intel doesn't follow.
Also interesting is that Intel's E-core architecture is improving dramatically compared to the P-core, even surpassing it in some cases. For instance, Skymont finally has no penalty for denormals, a long standing Intel weakness. Would not be surprising to see the E-core architecture take over at some point.
yeah, that's crazy to me. Intel has been so completely discunctional for the last 15 years. I feel like you couldn't have a clearer sign of "we have 2 completely separate teams that are competing with each other and aren't allowed to/don't want to talk to each other". it's just such a clear sign that the chicken is running around headless
The downfall of Pentium 4 was that they had been stuffing things into longer and longer pipes to keep up the frequency race(with horrible branch latencies as a result). They scaled it all away by "resetting" to the P3/P-M/Core architecture and scaling up from that again.
Pipes today are even _longer_ and if E-cores has shorter pipes at a similar frequency then "regular" JS,Java,etc code will be far more performant even if you lose a bit of perf for "performance" cases where people vectorize (Did the HPC computing crowd influence Intel into a ditch AGAIN? wouldn't be surprising!).
The P-cores have their warts, but are still much more well-rounded than the P4 was.
I'm not sure why OS kernels couldn't have become partners in CPU capability queries (where a program starting execution could request a CPU core with 'X' such as AVX-512F, for example) -- but without that the whole P-core/E-core hybrid concept was DOA for capabilities which were not least-common denominators. If I had to guess, marketing got ahead of engineering and testing on that one.
This is also the same reason that having AVX-512 only on the P-cores wouldn't have worked, even with thread director support. It would only take one small routine in a common location to push most threads off the P-cores.
I'm of the opinion that Intel's hybrid P/E-arch has been mostly useless anyway and only good for winning benchmarks. My current CPU has a 6P4E configuration and the scheduler hardly uses the E-cores at all unless forced, plus performance was better and more stable with the E-cores disabled.
AVX-512 should be just fine via intrinsics/high-level vector types, not different from AVX2 in this regard.
But it also has a bunch of specialized instructions that can boost performance beyond just the 2x width. One of them is VPCOMPRESSB, which accelerates compact encoding of sparse data. Another is GF2P8AFFINEQB, which is targeted at specific encryption algorithms but can also be abused for general bit shuffling. Algorithms like computing a histogram can benefit significantly, but it requires reshaping the algorithm around very particular and peculiar intermediate data layouts that are beyond the transformations a compiler can do. This doesn't literally require assembly language, though, it can often be done with intrinsics.
I think this may be domain-specific. I help maintain several open-source audio libraries, and wind up being the one to review the patches when people contribute SIMD for some specific ISA, and I think without exception they always get the tail handling wrong. Due to other interactions it cannot always be avoided by padding. It can roughly double the complexity of the code [0], and requires a disproportionate amount of thinking time vs. the time the code spends running, but if you don't spend that thinking time you can get OOB reads or writes, and thus CVEs. Masked loads/stores are an improvement, but not universally available. I don't have a lot of concrete suggestions.
I also work with a lot of image/video SIMD, and this is just not a problem, because most operations happen on fixed block sizes, and padding buffers is easy and routine.
I agree I would have picked other things for the other two in my own top-3 list.
[0] Here is a fun one, which actually performs worst when len is a multiple of 8 (which it almost always is), and has 59 lines of code for tail handling vs. 33 lines for the main loop: https://gitlab.xiph.org/xiph/opus/-/blob/main/celt/arm/celt_...
Traditionally we’ve worked around this with pretty idiomatic hacks that efficiently implement “masked load” functionality in SIMD ISAs that don’t have them. We could probably be better about not making people write this themselves every time.
Beyond that -- unit testing. I don't see enough of it for vectorized routines. SIMD widths are small enough that you can usually just test all possible offsets right up against a guard page and brute force verify that no overruns occur.
What I wanted is to write code in a more high-level language like this. For example, to compute a scalar product of a and b you write:
    1..n | a[$1] * b[$1] | sum
    x = sum for i in 1 .. n: a[i] * b[i]
Of course, the compiler vectorizing code when it can as a general optimization is still useful, but when it's critical that some operations must be vectorized, explicit SIMD structures seem nice to have.
I think SWAR-C nailed the syntax (a vector ?: operator, for example).
Another reason to prefer fixed width, compilers may pass vectors to functions in SIMD registers. When register size is unknown at compile time, they have to pass data in memory. For complicated SIMD algorithms the performance overhead gonna be huge.
[1] from CDC STAR-100 and followons like the CDC Cyber 180/990, Cyber 200 series & ETA-10.
For example, if we didn't settle on executing compiled machine code exactly as-is, and had a instruction-updating pass (less involved than a full VM byte code compilation), then we could adjust SIMD width for existing binaries instead of waiting decades for a new baseline or multiversioning faff.
Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.
Is that not where we're already going with the GPGPU trend? The big catch with GPU programming is that many useful routines are irreducibly very branchy (or at least, to an extent that removing branches slows them down unacceptably), and every divergent branch throws out a huge chunk of the GPU's performance. So you retain a traditional CPU to run all your branchy code, but you run into memory-bandwidth woes between the CPU and GPU.
It's generally the exception instead of the rule when you have a big block of data elements upfront that can all be handled uniformly with no branching. These usually have to do with graphics, physical simulation, etc., which is why the SIMT model was popularized by GPUs.
Even if you have a branchy divide-and-conquer problem ideal for diverging threads, you'll get hit by a relatively high overhead of distributing work across threads, false sharing, and stalls from cache misses.
My hot take is that GPUs will get more features to work better on traditionally-CPU-problems (e.g. AMD Shader Call proposal that helps processing unbalanced tree-structured data), and CPUs will be downgraded to being just a coprocessor for bootstrapping the GPU drivers.
Apple tried something like this: they collected the LLVM bitcode of apps so that they could recompile and even port to a different architecture. To my knowledge, this was done exactly once (watchOS armv7->AArch64) and deprecated afterwards. Retargeting at this level is inherently difficult (different ABIs, target-specific instructions, intrinsics, etc.). For the same target with a larger feature set, the problems are smaller, but so are the gains -- better SIMD usage would only come from the auto-vectorizer and a better instruction selector that uses different instructions. The expectable gains, however, are low for typical applications and for math-heavy programs, using optimized libraries or simply recompiling is easier.
WebAssembly is a higher-level, more portable bytecode, but performance levels are quite a bit behind natively compiled code.
Do people that say these things have literally any experience of merit?
> For example, if we didn't settle on executing compiled machine code exactly as-is, and had a instruction-updating pass
You do understand that at the end of the day, hardware is hard (fixed) and software is soft (malleable) right? There will be always be friction at some boundary - it doesn't matter where you hide the rigidity of a literal rock, you eventually reach a point where you cannot reconfigure something that you would like to. And also the parts of that rock that are useful are extremely expensive (so no one is adding instruction-updating pass silicon just because it would be convenient). That's just physics - the rock is very small but fully baked.
> we could have had every instruction SIMDified
Tell me you don't program GPUs without telling me. Not only is SIMT a literal lie today (cf warp level primitives), there is absolutely no reason to SIMDify all instructions (and you better be a wise user of your scalar registers and scalar instructions if you want fast GPU code).
I wish people would just realize there's no grand paradigm shift that's coming that will save them from the difficult work of actually learning how the device works in order to be able to use it efficiently.
Users are much more forgiving about software that runs a bit slower than software that doesn't run at all. ~95% of x86_64 CPUs have AVX2 support, but compiling binaries to unconditionally rely on it makes the remaining users complain. If it was merely slower on potato hardware, it'd be an easier tradeoff to make.
This is the norm on GPUs thanks to shader recompilation (they're far from optimal for all hardware, but at least get to use the instruction set of the HW they're running on, instead of being limited to the lowest common denominator). On CPUs it's happening in limited cases: Zen 3 added AVX-512 by executing two 256-bit operations serially, and plenty of less critical instructions are emulated in microcode, but it's done by the hardware, because our software isn't set up for that.
Compilers already need to make assumptions about pipeline widths and instruction latencies, so the code is tuned for specific CPU vendors/generations anyway, and that doesn't get updated. Less explicitly, optimized code also makes assumptions about cache sizes and compute vs memory trade-offs. Code may need L1 cache of certain size to work best, but it still runs on CPUs with a too-small L1 cache, just slower. Imagine how annoying it would be if your code couldn't take advantage of a larger L1 cache without crashing on older CPUs. That's where CPUs are with SIMD.
> compiled machine code exactly as-is, and had a instruction-updating pass
implies there should be silicon that implements the instruction-updating - what else would be "executing" compiled machine code other than the machine itself...........
What I'm suggesting is adding a translation/fixup step after loading a binary, before the code is executed, to make it more tolerant to hardware changes. It doesn’t have to be full abstract portable bytecode compilation, and not even as involved as PTX to SASS, but more like a peephole optimizer for the same OS on the same general CPU architecture. For example, on a pre-AVX2 x86_64 CPU, the OS could scan for AVX2 instructions and patch them to do equivalent work using SSE or scalar instructions. There are implementation and compatibility issues that make it tricky, but fundamentally it should be possible. Wilder things like x86_64 to aarch64 translation have been done, so let's do it for x86_64-v4 to x86_64-v1 too.
Scalar code should be unrolled by the compiler to the SIMD word width to expose potential parallelism. But other than that, correctly predicted branches are free, and so is loop instruction overhead on modern wide-dispatch processors. For example, even running a maximally efficient AVX512 kernel on a zen5 machine that dispatches 4 EUs and some load/stores and calculates 2048 bits in the vector units every cycle, you still have a ton of dispatch capacity to handle the loop overhead in the scalar units.
The cost of unrolling is decreased code density and reduced effectiveness of the instruction / uOp cache. I wish Clang in particular would stop unrolling the dang vector loops.
There are some cases where useful code density goes up.
Ex: unroll the Goertzel algorithm by a even number, and suddenly the entire delay line overhead evaporates.
They don’t always do that well when you need a reduction in that loop, e.g. you are searching for something in memory, or computing dot product of long vectors.
Reductions in the loop form a continuous data dependency chain between loop iteration, which prevents processors from being able to submit instructions for multiple iterations of the loop. Fixable with careful manual unrolling.
I cannot agree because in an unrolled loop you have less counter increment instructions.
Unrolling is definitely needed for properly scheduling and pipelining SIMD code even on OoO cores. Remember that an OoO core cannot reorder dependent instructions, so the dependencies need to be manually broken, for example by adding additional accumulators, which in turn requires additional unrolling, this is especially important on SIMD code which typically is non-branchy with long dependency chains.
Unrolling trivial loops to remove loop counter overhead hasn't been productive for quite a whole now but unfortunately it's still the default for many compilers.
I look at SIMD as the same idea as any other aspect of the x86 instruction set. If you are directly interacting with it, you should probably have a good reason to be.
I primarily interact with these primitives via types like Vector<T> in .NET's System.Numerics namespace. With the appropriate level of abstraction, you no longer have to worry about how wide the underlying architecture is, or if it even supports SIMD at all.
I'd prefer to let someone who is paid a very fat salary by a F100 spend their full time job worrying about how to emit SIMD instructions for my program source.
It requires new opcodes. It does not strictly require new encodings. Several new encodings are legacy compatible and can encode previous generations vector instructions.
> so the architecture must provide enough SIMD registers to avoid register spilling.
Or the architecture allows memory operands. The great joy of basic x86 encoding is that you don't actually need to put things in registers to operate on them.
> Usually you also need extra control logic before the loop. For instance if the array length is less than the SIMD register width, the main SIMD loop should be skipped.
What do you want? No control overhead or the speed enabled by SIMD? This isn't a flaw. This is a necessary price to achieve the efficiency you do in the main loop.
That's just spilling with fewer steps. The executed uops should be the same.
Another way to say this is it's "more efficient."
> The executed uops should be the same.
And "more densely coded."
vaddps zmm1,zmm0,ZMMWORD PTR [r14]
takes six bytes to encode:
62 d1 7c 48 58 0e
In SVE and RVV a load+add takes 8 bytes to encode.
That's... 1 register saved, out of 16 (or 32 on AVX-512). Perhaps useful sometimes, but far from a particularly significant aspect spill-wise.
And doing that means you lose the ability to move the load earlier (perhaps not too significant on OoO hardware, but still a consideration; while reorder windows are multiple hundreds of instructions, the actual OoO limit is scheduling queues, which are frequently under a hundred entries, i.e. a couple dozen cycles worth of instructions, at which point the ≥4 cycle latency of a load is not actually insignificant. And putting the load directly in the arith op is the worst-case scenario for this)
Arm’s SVE, and RISC-V’s vector extension are all vector-length-agnostic. RISC-V’s implementation is particularly nice, you only have to compile for one code path (unlike avx with the need for fat-binary else/if trees).
In contrast, the systems that have flexible widths have never taken off. It's seemingly much harder to design a programming language for a flexible width SIMD.
2. Not a problem for GPUs. It should be noted that kernels allocate custom amounts of registers: one kernel may use 56 registers, while another kernel might use 200 registers. All GPUs will run these two kernels simultaneously (256+ registers per CU or SM is commonly supported, so both 200+56 registers kernels can run together).
3. Not a problem for GPUs or really any SIMD in most cases. Tail handling is O(1) problem in general and not a significant contributor to code length, size, or benchmarks.
Overall utilization issues are certainly a concern. But in my experience this is caused by branching most often. (Branching in GPUs is very inefficient and forces very low utilization).
It's absolutely a significant contributor to code size (..in scenarios where vectorized code in general is a significant contributor to code size, which admittedly is only very-specialized software).
If(my lane is beyond the buffer) then (exec flag off, do not store my lane).
Which in practice should be a simple vcompress instruction (AVX512 register) and maybe a move afterwards??? I admit that I'm not an AVX512 expert but it doesn't seem all that difficult with vcompress instructions + execmask.
Doing the tail separately but with masking SIMD is an improvement over a scalar loop perf-wise (..perhaps outside of the case of 1 or 2 elements, which is a realistic situation for a bunch of loops too), but it'll still add a double-digit percentage to code size over just a plain SIMD loop without tail handling.
And this doesn't help pre-AVX-512, and AVX-512 isn't particularly widespread (AVX2 does have masked load/store with 32-/64-bit granularity, but not 8-/16-bit, and the instrs that do exist are rather slow on AMD (e.g. unconditional 12 cycles/instr throughput for masked-storing 8 32-bit elements); SSE has none, and ARM NEON doesn't have any either (and ARM SVE isn't widespread either, incl. not supported on apple silicon))
(don't need vcompress, plain masked load/store instrs do exist in AVX-512 and are sufficient)
Keep a register with the values IdxAdjustment = [0, 1, 2, 3, 4, 5, 6, 7].
ExecutionMask = (Broadcast(CurIdx) + IdxAdjustment) < Length
Keep looping while(any(vector) < Length), which is as simple as "while(exec_mask != 0)".
I'm not seeing this take up any "extra" instructions at all. You needed the while() loop after all. It costs +1 Vector Register (IdxAdjustment) and a kMask by my count.
> And this doesn't help pre-AVX-512, and AVX-512 isn't particularly widespread
AVX512 is over 10 years old now. And the premier SIMD execution instruction set is CUDA / NVidia, not AVX512.
AVX512 is now available on all AMD CPUs and has been for the last two generations. It is also available on a select number of Intel CPUs. There is also AMD RDNA, Intel Xe ISAs that could be targeted.
> instrs that do exist are rather slow on AMD (e.g. unconditional 12 cycles/instr throughput for masked-storing 8 32-bit elements);
Okay, I can see that possibly being an issue then.
EDIT: AMD Zen5 Optimization Manual states Latency1 and throughput 2-per-clocktick, while Intel's Skylake documentation of https://www.intel.com/content/www/us/en/docs/intrinsics-guid... states Latency5 Throughput 1-per-clock-tick.
AMD Zen5 seems to include support to vmovdqu8 (its in the optimization guide .xlsx sheet with latencies/throughputs, also as 1-latency / 4-throughput). This includes vmovdqu8 (
I'm not sure if the "mask" register changes the instruction. I'll do some research to see if I can verify your claim (I don't have my Zen5 computer built yet... but its soon).
And those two instrs are vector instrs, i.e. competing with execution units for the actual thing you want to compute, whereas scalar instrs have at least some independent units that allow avoiding desiring infinite unrolling.
And if your loop is processing 32-bit (or, worse, smaller) elements, those indices, if done as 64-bit, as most code will do, will cost even more.
AVX-512 might be 10 years old, but Intel's latest (!) cores still don't support it on hardware with E-cores, so still a decade away from being able to just assume it exists. Another thread on this post mentioned that Intel has shipped hardware without AVX/AVX2/FMA as late as 2021 even.
> Okay, I can see that possibly being an issue then.
To be clear, that's only the AVX2 instrs; AVX-512 masked loads/stores are fast (..yes, even on Zen 4 where the AVX-512 masked loads/stores are fast, the AVX2 ones that do an equivalent amount of work (albeit taking the mask in a different register class) are slow). uops.info: https://uops.info/table.html?search=maskmovd%20m256&cb_lat=o...
Intel also has AVX-512 masked 512-bit 8-bit-elt stores at half the throughput of unmasked for some reason (not 256-bit or ≥16-bit-elt though; presumably culprit being the mask having 64 elts): https://uops.info/table.html?search=movdqu8%20m512&cb_lat=on...
And masked loads use some execution ports on both Intel and AMD, eating away from throughput of the main operation. All in all just not implemented for being able to needlessly use masked loads/stores in hot loops.
Overall, I agree that AVX and Neon have their warts and performance issues. But they're like 15+ years old now and designed well before GPU Compute was possible.
> using gathers/scatters for those would be stupid and slow
This is where CPUs are really bad. GPUs will coalesce gather/scatters thanks to __shared__ memory (with human assistance of course).
But also the simplest of load/store patters are auto-detected and coalesced. So a GPU programmer doesn't have to worry about SIMD lane load/store (called vgather in AVX512) being slower. It's all optimized to hell and back.
Having a full lane-to-lane crossbar and supporting high performance memory access patterns needs to be a priority moving forward.
A messy thing with memory performance on CPUs is that either you share the same cache hardware between scalar and vector, thereby significantly limiting how much latency you can trade for throughput, or you have to add special vector L1 cache, which is a ton of mess and silicon area; never mind uses of SIMD that are latency-sensitive, e.g. SIMD hashmap probing, or small loops.
I guess you don't necessarily need that for just detecting patterns in gather indices, but nothing's gonna get a gather of consecutive 8-bit elts via 64-bit indices to not perform much slower than a single contiguous load, and 8-bit elts are quite important on CPUs for strings & co.
Three Fundamental Flaws of SIMD - https://news.ycombinator.com/item?id=28114934 - Aug 2021 (20 comments)
I found that a lot of the custom simd cores I've written for simply cannot issue instructions fast enough risvc. Or when they it's in quick bursts and than increments and loop controls that leave the engine idling for more than you'd like.
Better dual issue helps but when you have seperate vector queue you are sending things to it's not that much to add increments into vloads and vstores
I think this is not important anymore because modern architectures allow to add offset to register value so you can write something like, using weird RISC-V syntax for addition:
    ld r2, 0(r1)
    ld r3, 4(r1)
    ld r4, 8(r1)
With vector SIMD you don't know the register size beforehand and therefore have to maintain and increment counters, adding extra unnecessary instructions, reducing total performance. With packed SIMD you can issue several loads immediately without dependencies, and if you look at code examples, you can see that the x86 code is more dense and uses a sequence of unrolled SIMD instructions without any extra instructions which is more efficient. While RISC-V has 4 SIMD instructions and 4 instructions dealing with counters per loop iteration, i.e. it wastes 50% of command issue bandwidth and you cannot load next block until you increment the counter.
The article mentions that you have to recompile packed SIMD code when a new architecture comes out. Is that really a problem? Open source software is recompiled every week anyway. You should just describe your operations in a high level language that gets compiled to assembly for all supported architectures.
So as a conclusion, it seems that Vector SIMD is optimized for manually-written assembly and closed-source software while Packed SIMD is made for open-source software and compilers and is more efficient. Why RISC-V community prefers Vector architecture, I don't understand.
Despite being potentially compiled recently, anything from most Linux package managers, and whatever precompiled downloadable executables, even if from open-source code, still targets the 20-year-old SSE2 baseline, wasting the majority of SIMD resources available on modern (..or just not-extremely-ancient) CPUs (unless you're looking at the 0.001% of software that bothers with dynamic dispatch; but for that approach just recompiling isn't enough, you also need to extend the dispatched target set).
RISC-V RVV's LMUL means that you get unrolling for free, as each instruction can operate over up to 8 registers per operand, i.e. essentially "hardware 8x unrolling", thereby making scalar overhead insignificant. (probably a minor nightmare from the silicon POV, but perhaps not in a particularly limiting way - double-pumping has been done by x86 many times so LMUL=2 is simple enough, and at LMUL=4 and LMUL=8 you can afford to decode/split into ups at 1 instr/cycle)
ARM SVE can encode adding a multiple of VL in load/store instructions, allowing manual unrolling without having to actually compute the intermediate sizes. (hardware-wise that's an extremely tiny amount of overhead, as it's trivially mappable to an immediate offset at decode time). And there's an instruction to bump a variable by a multiple of VL.
And you need to bump pointers in any SIMD regardless; the only difference is whether the bump size is an immediate, or a dynamic value, and if you control the ISA you can balance between the two as necessary. The packed SIMD approach isn't "free" either - you need hardware support for immediate offsets in SIMD load/store instrs.
Even in a hypothetical non-existent bad vector SIMD ISA without any applicable free offsetting in loads/stores and a need for unrolling, you can avoid having a dependency between unrolled iterations by precomputing "vlen*2", "vlen*3", "vlen*4", ... outside of the loop and adding those as necessary, instead of having a strict dependency chain.
There is an enormous quantity of SIMD code in the world that isn't SAXPY, and doesn't stay neatly in lane. Instead it's things like "base64 encode this data" or "unpack and deinterleave this 4:2:2 pixel data, apply a colorspace conversion as a 3x3 sparse matrix and gamma adjustment in 16Q12 fixed-point format, resize and rotate by 15˚ with three shear operations represented as a linear convolution with a sinc kernel per row," or "extract these fields from this JSON data". All of which _can totally be done_ with a well-designed vector ISA, but the comparison doesn't paint nearly as rosy of a picture. The reality is that you really want a mixture of ideas that come from fixed-width SIMD and ideas that come from the vector world (which is roughly what people actually shipping hardware have been steadily building over the last two decades, implementing more support for unaligned access, predication, etc, while the vector ISA crowd writes purist think pieces)
I think there is a way: vary register size per CPU, but also add an instruction to retrieve register size. Then, code using the vector unit will sometimes have to dynamically allocate a buffer for intermediate values, but it would allow for software to run across CPUs with different vector lengths. Does anybody know whether any architecture does this?
I remember seeing presentations of extensions to AVX (during probably a supercomputing related event in Spain years ago ?) that some complex, matrix to matrix instructions could have data dependent execution time, in addition to possible hardware register size dependencies.
In some contexts, and for overall security, this could be very problematic. Has this been discussed?
pkhuong•6mo ago
camel-cdr•6mo ago
Most problems don't require this: E.g. your basic penalizable math stuff, unicode conversion, base64 de/encode, json parsing, set intersection, quicksort*, bigint, run length encoding, chacha20, ...
And if you run into a problem that benefits from knowing the SIMD width, then just specialize on it. You can totally use variable-length SIMD ISA's in a fixed-length way when required. But most of the time it isn't required, and you have code that easily scales between vector lengths.
*quicksort: most time is spent partitioning, which is vector length agnostic, you can handle the leafs in a vector length agnostic way, but you'll get more efficient code if you specialize (idk how big the impact is, in vreg bitonic sort is quite efficient).