The messy reality of SIMD (vector) functions

https://johnnysswlab.com/the-messy-reality-of-simd-vector-functions/

73•mfiguiere•7h ago

Comments

exDM69•4h ago

This right here illustrates why I think there should be better first class SIMD in languages and why intrinsics are limited.

When using GCC/clang SIMD extensions in C (or Rust nightly), the implementation of sin4f and sin8f are line by line equal, with the exception of types. You can work around this with templates/generics.

The sin function is entirely basic arithmetic operations, no fancy instructions are needed (at least for the "computer graphics quality" 32 bit sine function I am using).
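A minimal sketch of that point, assuming the GCC/clang vector extensions in C. The names `DEFINE_SIN_APPROX`, `sin_approx4` and `sin_approx8` are hypothetical, and the polynomial x - x^3/6 only stands in for "basic arithmetic"; it is not the sine routine described above. A macro plays the role of templates/generics, so the body is written once and only the vector type changes:

```c
#include <assert.h>
#include <string.h>

/* GCC/clang vector extension types: 4 and 8 lanes of 32-bit float. */
typedef float f32x4 __attribute__((vector_size(16)));
typedef float f32x8 __attribute__((vector_size(32)));

/* Stamps out one in-place function per vector type. The body is the same
 * for every width; only VEC differs. */
#define DEFINE_SIN_APPROX(NAME, VEC)                                    \
    static void NAME(float *data, size_t n)                             \
    {                                                                   \
        const size_t lanes = sizeof(VEC) / sizeof(float);               \
        assert(n % lanes == 0); /* no tail handling in this sketch */   \
        for (size_t i = 0; i < n; i += lanes) {                         \
            VEC x;                                                      \
            memcpy(&x, data + i, sizeof x); /* unaligned load */        \
            VEC y = x - (x * x * x) * (1.0f / 6.0f);                    \
            memcpy(data + i, &y, sizeof y); /* unaligned store */       \
        }                                                               \
    }

DEFINE_SIN_APPROX(sin_approx4, f32x4)
DEFINE_SIN_APPROX(sin_approx8, f32x8)
```

The functions take pointers rather than vectors by value so that the 32-byte type never crosses a function boundary, which avoids ABI warnings on targets without 256-bit registers.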

Contrast this with intrinsics where the programmer needs to explicitly choose the mm128 or mm256 instruction even for trivial stuff like addition and other arithmetic.

Similarly, a 4x4 matrix multiplication function is the exact same code for 64 bit double and 32 bit float if you're using built in SIMD. A bit of generics and no duplication is needed. Whereas intrinsics again need two separate implementations.
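As a rough illustration of the matrix case, again assuming GCC/clang vector extensions in C, with a macro standing in for generics. The type and function names (`mat4f`, `mat4d`, `mat4_mul_f32`, `mat4_mul_f64`) are made up for this sketch, and the row-times-matrix formulation is just one possible layout:

```c
#include <assert.h>

typedef float  f32x4 __attribute__((vector_size(16)));
typedef double f64x4 __attribute__((vector_size(32)));

/* Row-major 4x4 matrices, one vector per row. */
typedef struct { f32x4 row[4]; } mat4f;
typedef struct { f64x4 row[4]; } mat4d;

/* out = a * b. Each output row is a linear combination of the rows of b,
 * weighted by the scalars in the matching row of a. */
#define DEFINE_MAT4_MUL(NAME, MAT)                                      \
    static void NAME(MAT *out, const MAT *a, const MAT *b)              \
    {                                                                   \
        assert(out != a && out != b); /* no in-place multiply */        \
        for (int i = 0; i < 4; i++) {                                   \
            out->row[i] = a->row[i][0] * b->row[0]                      \
                        + a->row[i][1] * b->row[1]                      \
                        + a->row[i][2] * b->row[2]                      \
                        + a->row[i][3] * b->row[3];                     \
        }                                                               \
    }

DEFINE_MAT4_MUL(mat4_mul_f32, mat4f)
DEFINE_MAT4_MUL(mat4_mul_f64, mat4d)
```

The `scalar * vector` products rely on the compiler broadcasting the scalar across the lanes, which both GCC and recent clang do for these vector types in C.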

I understand that there are cases where intrinsics are required, or can deliver better performance but both C/C++ and Rust have zero cost fallback to intrinsics. You can "convert" between f32x4 and mm128 at zero cost (no instructions emitted, just compiler type information).
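A sketch of what such a fallback can look like in C, assuming GCC or clang. The helper names `rcp_approx4` and `normalize_weights` are hypothetical. On x86 the cast between the generic vector type and `__m128` is a pure type change, because both are 16-byte float vectors; other targets take the plain division path:

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

typedef float f32x4 __attribute__((vector_size(16)));

/* Approximate reciprocal of each lane. */
static f32x4 rcp_approx4(f32x4 v)
{
#if defined(__SSE__)
    /* The casts only reinterpret the type. RCPPS is an approximation with
     * roughly 12 bits of precision. */
    return (f32x4)_mm_rcp_ps((__m128)v);
#else
    /* Portable fallback: exact per-lane division. */
    return 1.0f / v;
#endif
}

/* Generic vector code that happens to call the CPU-specific helper:
 * scale the weights so that they sum to (approximately) one. */
static f32x4 normalize_weights(f32x4 w)
{
    float sum = w[0] + w[1] + w[2] + w[3];
    assert(sum > 0.0f);
    f32x4 total = {sum, sum, sum, sum};
    return w * rcp_approx4(total);
}
```

Only `rcp_approx4` knows about the instruction set; everything that calls it stays in portable vector code.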

I do use some intrinsics in my SIMD code this way (rsqrt, rcp, ...). The CPU specific code is just a few percent of the overall lines of code, and that's for Arm and x86 combined.

The killer feature is that my code will compile into x86_64/SSE and Aarch64/neon. And I can use wider vectors than the CPU actually supports, the compiler knows how to break it down to what the target CPU supports.

I'm hoping that Rust std::simd would get stabilized soon, I've used it for many years and it works great. And when it doesn't I have a zero cost fallback to intrinsics.

Some very respected people have the opinion that std::simd or its C equivalent suffer from a "least common denominator problem". I don't disagree with the issue but I don't think it really matters when we have a zero cost fallback available.

camel-cdr•3h ago

My personal gripe with Rust's std::simd in its current form is that it makes writing portable SIMD hard while making non-portable SIMD easy. [0]

> the implementation of sin4f and sin8f are line by line equal, with the exception of types. You can work around this with templates/generics

This is true, I think most SIMD algorithms can be written in such a vector length-agnostic way, however almost all code using std::simd specifies a specific lane count instead of using the native vector length. This is because the API favors the use of fixed-size types (e.g. f32x4), which are exclusively used in all documentation and example code.

If I search github for `f32x4 language:Rust` I get 6.4k results, with `"Simd<f32," language:Rust NOT "Simd<f32, 2" NOT "Simd<f32, 4" NOT "Simd<f32, 8"` I get 209.

I'm not even aware of a way to detect the native vector length using std::simd. You have to use the target-feature or multiversion crate, as shown in the last part of the rust-simd-book [1]. Well, kind of like that, because their suggestion uses "suggested_vector_width", which doesn't exist. I could only find a suggested_simd_width.

Searching for "suggested_simd_width language:Rust", we are now down to 8 results, 3 of which are from the target-feature/multiversion crates.

---

What I'm trying to say is that, while being able to specify a fixed SIMD width can be useful, the encouraged default should be "give me a SIMD vector of the specified type corresponding to the SIMD register size". If your problem can only be solved with a specific vector length, great, then hard-code the lane count, but otherwise don't.

See [0] for more examples of this.

[0] https://github.com/rust-lang/portable-simd/issues/364#issuec...

[1] https://calebzulawski.github.io/rust-simd-book/4.2-native-ve...

exDM69•2h ago

The binary portability issue is not specific to Rust or std::simd. You would have to solve the same problems even if you use intrinsics or C++. If you use 512-bit vectors in the code, you will need to check if the CPU supports it or add multiversioning dispatch or you will get a SIGILL.
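For illustration, a minimal sketch of such a check in C, assuming GCC or clang on x86-64 (other targets fall through to the baseline). The function names are hypothetical, and AVX2 is used instead of AVX-512 only to keep the example short; the pattern is the same:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*add_fn)(float *dst, const float *a, const float *b, size_t n);

/* Baseline version: compiled for the default target. */
static void add_baseline(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Same source, but the compiler may use AVX2 instructions inside this one
 * function. Calling it on a CPU without AVX2 can raise SIGILL. */
__attribute__((target("avx2")))
static void add_avx2(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
#endif

/* Pick an implementation based on what the running CPU reports. */
static add_fn select_add(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return add_avx2;
#endif
    return add_baseline;
}

static void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    select_add()(dst, a, b, n);
}
```

The selection happens per call to `add_arrays` here, which is fine for a whole-array operation but is exactly the kind of dispatch you would want to keep out of an inner loop.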

I have written both type generic (f32 vs f64) and width generic (f32x4 vs f32x8) SIMD code with Rust std::simd.

And I agree it's not very pretty. I had to resort to having a giant where clause for the generic functions, explicitly enumerating the required std::ops traits. C++ templates don't have this particular issue, and I've used those for the same purpose too.

But even though the implementation of the generic functions is quite ugly indeed, using the functions once implemented is not ugly at all. It's just the "primitive" code that is hairy.

I think this was a huge missed opportunity in the core language, there should've been a core SIMD type with special type checking rules (when unifying) for this.

However, I still think std::simd is miles better than intrinsics for 98% of the SIMD code I write.

The other 1% (times two for two instruction sets) is just as bad as it is in any other language with intrinsics.

The native vector width and target-feature multiversioning dispatch are quite hairy. Adding some dynamic dispatch in the middle of your hot loops can also have disastrous performance implications because they tend to kill other optimizations and make the cpu do indirect jumps.

Have you tried just using the widest possible vector size? e.g. f64x64 or something like it. The compiler can split these to the native vector width of the compiler target. This happens at compile time so it is not suitable if you want to run your code on CPUs with different native SIMD widths. I don't have this problem with the hardware I am targeting.

Rust std::simd docs aren't great and there have been some breaking changes in the few years I've used it. There is certainly more work on that front. But it would be great if at least the basic stuff would get stabilized soon.

MangoToupe•2h ago

> there should've been a core SIMD type with special type checking rules

What does this mean?

dzaima•57m ago

Overly-wide vectors I'd say are a pretty poor choice in general.

If you're using shuffles at times, you must use native-width vectors to be able to apply them.

If you're doing early-exit loops, you also want the vector width to be quite small to not do useless work.

f64x64 is presumably an exaggeration, but an important note is that overly long vectors will result in overflowing the register file and thus will make tons of stack spills. A single f64x64 takes up the entire AVX2 or ARM NEON register file! There's not really much room for a "widest" vector - SSE only has a tiny 2048-bit register file, the equivalent of just four AVX-512 registers, 1/8th of its register file.
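The arithmetic behind those numbers can be written down as compile-time checks. This is only a sketch; the constant names and the `registers_needed` helper are invented for the example, and the register counts are the architectural ones for 64-bit x86 and AArch64:

```c
#include <assert.h>

enum {
    F64X64_BITS      = 64 * 64,  /* 64 lanes of 64-bit double      */
    SSE_FILE_BITS    = 16 * 128, /* x86-64 SSE: 16 XMM registers   */
    AVX2_FILE_BITS   = 16 * 256, /* AVX2: 16 YMM registers         */
    NEON_FILE_BITS   = 32 * 128, /* AArch64 NEON: 32 V registers   */
    AVX512_FILE_BITS = 32 * 512  /* AVX-512: 32 ZMM registers      */
};

/* One f64x64 value is as large as the whole AVX2 or NEON register file. */
_Static_assert(F64X64_BITS == AVX2_FILE_BITS, "f64x64 fills AVX2");
_Static_assert(F64X64_BITS == NEON_FILE_BITS, "f64x64 fills NEON");

/* The SSE file is 2048 bits: four 512-bit registers, 1/8 of AVX-512. */
_Static_assert(SSE_FILE_BITS == 4 * 512, "SSE file is four ZMM registers");
_Static_assert(SSE_FILE_BITS * 8 == AVX512_FILE_BITS, "1/8 of AVX-512");

/* Number of hardware registers one wide vector occupies (rounded up). */
static int registers_needed(int vector_bits, int register_bits)
{
    assert(register_bits > 0);
    return (vector_bits + register_bits - 1) / register_bits;
}
```

For example, `registers_needed(F64X64_BITS, 128)` comes out at 32, i.e. twice the number of XMM registers, so a single such value cannot stay in registers on SSE.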

And then there's the major problem that using fixed-width vectors will end up very badly for scalable vector architectures, i.e. ARM SVE and RISC-V RVV; of course not a big issue if you do a native build or do dynamic dispatch, but SVE and RVV are specifically made such that you do not have to do a native build nor duplicate code for different hardware vector widths.

And for things that don't do fancy control flow or use specialized instructions, autovectorization should cover you pretty well anyway; if you have some gathers or many potentially-aliasing memory ranges, on clang & gcc you can _Pragma("clang loop vectorize(assume_safety)") _Pragma("GCC ivdep") to tell the compiler to ignore aliasing and vectorize anyway.
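A small sketch of how those pragmas are typically placed, assuming clang or GCC. The function name `gather_accumulate` is hypothetical. Each compiler gets only its own pragma here, which avoids unknown-pragma warnings; the caller is the one promising that `dst` does not overlap `table` or `idx`:

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] += table[idx[i]]: an indexed (gather) load that the compiler
 * cannot prove to be free of aliasing on its own. */
static void gather_accumulate(float *dst, const float *table,
                              const int *idx, size_t n)
{
    assert(dst != NULL && table != NULL && idx != NULL);
#if defined(__clang__)
#pragma clang loop vectorize(assume_safety)
#elif defined(__GNUC__)
#pragma GCC ivdep
#endif
    for (size_t i = 0; i < n; i++)
        dst[i] += table[idx[i]];
}
```

The pragmas are hints, not guarantees: the compiler is allowed to vectorize, but whether it does still depends on the target and optimization level.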

exDM69•29m ago

> f64x64 is presumably an exaggeration

IIRC, 64 elements wide vectors are the widest that LLVM can work with. It will happily compile code that uses wider vectors than the target CPU has and split accordingly.

That doesn't necessarily make it a good idea.

Autovectorization works great for simple stuff and has improved a lot in the past decade (e.g. SIMD gather loads).

It doesn't work great for things like converting a matrix to quaternion (or vice versa), and then doing that in a loop. But if you write the inner primitive ops with SIMD you get all the usual compiler optimizations in the outer loop.

You should not unroll the outer loop like in the Quake 3 days. The compiler knows better how many times it should be unrolled.

I chose this example because I recently ported the Quake 3 quaternion math routines to Rust for a hobby project. It was a lot faster than the unrolled original (thanks to LLVM, same would apply to Clang).

MangoToupe•2h ago

> portable SIMD

this seems like an oxymoron

mort96•4m ago

Why? There's a ton of commonality between SIMD instruction sets. As long as you have a model for SIMD where you can avoid hard-coding a vector width and you use a relatively constrained set of vector instructions, there's no fundamental reason why the same source code shouldn't be able to compile down to AVX-512, AVX2, SSE, ARM NEON, ARM SVE and RVV instructions. For most use cases, we're doing the same basic set of operations: copy data from memory into vector registers (maybe with some transformation, like copy 8-bit ints from memory into a vector of u16), do math operations on vector registers, copy back to memory.
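That load / widen / compute / store pattern can be sketched without naming any instruction set, assuming GCC 9 or later or clang (for `__builtin_convertvector`). The function name `square_u8_to_u16` is made up, and this sketch does hard-code 8 lanes for brevity, so it shows the portable operations rather than a width-agnostic model:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char  u8x8  __attribute__((vector_size(8)));
typedef unsigned short u16x8 __attribute__((vector_size(16)));

/* Reads 8-bit samples, widens them to 16 bits, squares them, and stores
 * the 16-bit results. 255 * 255 = 65025 still fits in 16 bits. */
static void square_u8_to_u16(unsigned short *dst, const unsigned char *src,
                             size_t n)
{
    assert(n % 8 == 0); /* no tail handling in this sketch */
    for (size_t i = 0; i < n; i += 8) {
        u8x8 narrow;
        memcpy(&narrow, src + i, sizeof narrow);          /* load   */
        u16x8 wide = __builtin_convertvector(narrow, u16x8); /* widen  */
        u16x8 squared = wide * wide;                      /* math   */
        memcpy(dst + i, &squared, sizeof squared);        /* store  */
    }
}
```

The compiler chooses the widening and multiply instructions for whatever target it is building for.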

MangoToupe•2h ago

> first class SIMD in languages

People have said this for longer than I've been alive. I don't think it's a meaningful concept.

Sharlin•1h ago

It’s such a ridiculous situation we’re in. Just about every consumer CPU of the past 20 years packs an extra order of magnitude or two of punch for data processing workloads, but to not let it go to waste you have to resort to writing your inner loops using low-level nonportable intrinsics that are just a step above assembly. Or pray that the gods of autovectorization are on your side.

MangoToupe•1h ago

Well, yea. You need to describe your data flow in a way the CPU can take advantage of it. Compilers aren't magic.

Sharlin•1h ago

That’s like saying that you have to describe your data flow in terms of gotos because the CPU doesn’t understand for loops and compilers aren’t magic. I don’t mean that autovectorization should just work (tm), I just mean that reasonable portable SIMD abstractions should not be this hard.

Earw0rm•1h ago

There are different ways of approaching it, which have different performance consequences. Which is why accelerated libraries are common, but if you want accelerated primitives, you kinda have to roll your own.

tomsmeding•27m ago

> I just mean that reasonable portable SIMD abstractions should not be this hard.

Morally, no, it really ought to not be this hard, we need this. Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. X86 and ARM have completely different sets of things that they have instructions for, and even within the X86 family, even within a particular product class, things are inconsistent:

- On normal words, one has lzcnt (leading-zero count) and tzcnt (trailing-zero count), but on SIMD vectors there is only lzcnt. And you get lzcnt only on AVX512, the latest-and-greatest in X86.

- You have horizontal adds (adding adjacent cells in a vector) for 16-bit ints, 32-bit ints, floats and doubles, and saturating horizontal add for 16-bit ints. https://www.intel.com/content/www/us/en/docs/intrinsics-guid... Where are horizontal adds for 8-bit or 64-bit ints, or any other saturating instructions?

- Since AVX-512 filled up a bunch of gaps in the instruction set, you have absolute value instructions on 8, 16, 32 and 64 bit ints in 128, 256 and 512 bit vectors. But absolute value on floats only exists on 512-bit vectors.

These are just the ones that I could find now; there are more. With this kind of inconsistency, any portable SIMD abstraction will be difficult to efficiently compile for the majority of CPUs, negating part of the advantage.
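As an aside on the float absolute value case: a common way to get it without a dedicated instruction is to clear the sign bit with a bitwise AND. A minimal sketch with GCC/clang vector extensions (the name `abs4` is hypothetical, and this is an illustration of the workaround, not something the list above proposes):

```c
#include <assert.h>

typedef float f32x4 __attribute__((vector_size(16)));
typedef int   i32x4 __attribute__((vector_size(16)));

_Static_assert(sizeof(f32x4) == sizeof(i32x4),
               "the bit cast below needs equally sized vector types");

/* Absolute value of each lane: reinterpret the floats as integers,
 * mask off bit 31 (the IEEE 754 sign bit), and reinterpret back. */
static f32x4 abs4(f32x4 v)
{
    const i32x4 mask = {0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff};
    return (f32x4)((i32x4)v & mask);
}
```

A portable abstraction has to know about this kind of substitution for every gap in every instruction set, which is part of why efficient lowering is hard.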

jltsiren•37m ago

Adding parallelism is much easier on the hardware side than the software side. We've kind of figured out the easy cases, such as independent tasks with limited interactions with each other, and made them practical for the average developer. But nobody understands the harder cases (such as SIMD) well enough to create useful abstractions that don't constrain hardware development too much.

the__alchemist•25m ago

Evidence points towards the autovectorization gods being dead, false, or too weak. I hear, but don't believe their prophets.

exDM69•53m ago

Here we are discussing the merits of built in SIMD facilities of not one but two programming languages. Waiting for the Zig guys to chime in to make it a three.

Four if you include LLVM IR (I don't).

No reason to be dismissive about it.

the__alchemist•27m ago

Here is how I would have it:

Instead of writing:

  fn do_athing(val: f32x69) -> f32x69 {
  }

you do:

  fn do_athing(val: fx) -> fx {
  }

And it just works. f32, f64, as wide as your CPU supports.

ozgrakkurt•1h ago

std::simd in rust had atrocious compile time last time I tried it, is it fixed already?

the__alchemist•36m ago

I've been dodging the f32/f64-specificity in rust using macros, but I don't love it. I do it because I'm not sure what else to do.

I think `core::simd` is probably not near, but 512-bit AVX SIMD will be out in a month or two! So, you could use f64x8, f32x16.

I've built my own f32xY, Vec3xY etc structs, since I want to use stable rust, but don't want to wait for core::simd. These all have syntaxes that mimic core::simd. I set up pack/unpack functions too, but they are still a hassle compared to non-SIMD operations.

This begs the question: if core::simd doesn't end up ideal (for flexibility etc.), how far can we get with a thin-wrapper lib? The ideal API (imo) is transparent, and supports the widest instructions available on your architecture, falling back to smaller ones or non-SIMD.

It also begs the question of whether you should stop worrying and love the macro... We are talking f32/f64. Different widths. Different architectures. And now conflated with len 3 vecs, len 4 vecs etc. (where each vector/tensor item is a SIMD intrinsic). How does every vector library (not in the SIMD sense) handle this? Macros. This leads me to think we macro this too.

kookamamie•4h ago

I don't think the native C++, even when bundled with OMP, goes far enough.

In my experience, ISPC and Google's Highway project lead to better results in practice - this is mostly due to their dynamic dispatching features.

William_BB•3h ago

Could you elaborate on the dynamic dispatching features a bit more? Is that for portability?

camel-cdr•2h ago

Here is an example using google highway: https://godbolt.org/z/Y8vsonTb8

See how the code has only been written once, but multiple versions of the same functions were generated targeting different hardware features (e.g. SSE, AVX, AVX512). Then `HWY_DYNAMIC_DISPATCH` can be used to dynamically call the fastest one matching your CPU at runtime.

William_BB•1h ago

Thank you so much, this explains it well. I was initially afraid that the dispatch would be costly, but from what I understand it's (almost) zero cost after the first call.
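One generic way to get that "pay once" behaviour is a function pointer that initially points at a resolver and overwrites itself on the first call. The sketch below is plain C and is not meant to describe Highway's actual mechanism; all names are hypothetical, the resolver always picks the generic version, and the counter exists only to show that resolution runs once:

```c
#include <assert.h>
#include <stddef.h>

typedef long (*sum_fn)(const int *data, size_t n);

static int resolve_calls = 0; /* instrumentation for this sketch only */

static long sum_generic(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

static long sum_resolve(const int *data, size_t n);

/* Starts out pointing at the resolver. */
static sum_fn sum_impl = sum_resolve;

/* First call lands here: choose an implementation, remember it, and
 * forward the call. A real resolver would query CPU features here. */
static long sum_resolve(const int *data, size_t n)
{
    resolve_calls++;
    sum_impl = sum_generic;
    return sum_impl(data, n);
}

/* Public entry point: after the first call this is one indirect call. */
static long sum(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    return sum_impl(data, n);
}
```

This version is not thread-safe as written; it only illustrates why the steady-state cost is a single indirect call.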

I only code for x86 with vectorclass library, so I never had to worry about portability. In practice, is it really possible to write generic SIMD code like the example using Highway? Or could you often find optimization opportunities if you targeted a particular architecture?

ashvardanian•55m ago

You can go quite far with such libraries if you only perform data-parallel numerics on the CPU. However, if you work on complex algorithms or exotic data structures, there's almost always more upside in avoiding them and writing specialized code for each platform of interest.

jeffreygoesto•33m ago

Nice. First time that I saw this dynamic dispatch was in FFTW.

ashvardanian•57m ago

Here's an explanation from one of my repos: <https://github.com/ashvardanian/simsimd?tab=readme-ov-file#d...>

dwattttt•4h ago

> Function calls also have that negative property that the compiler doesn’t know what happens after calling them so it needs to assume the worst happens. And by the worst, it has to assume that the function can change any memory location and optimize for such a case. So it omits many useful compiler optimizations.

This is not the case in C. It might be technically possible for a function to modify any memory, but it wouldn't be legal, and compilers don't need to optimise for the illegal cases.

RossBencina•4h ago

Sounds like the author hasn't heard of full program optimisation. EDIT: except they explicitly mention LTO near the end.

mattmaynes•2h ago

This is where the power and expressiveness of kdb+ shines. It has SIMD primitives out of the box and can optimize your code based on data types to take advantage of it. https://kx.com/blog/what-makes-time-series-database-kdb-so-f...

MangoToupe•2h ago

Time series is vector processing on easy mode, though. The hard part is applying SIMD to problems that aren't shaped to be easily processed in parallel.

jgalt212•2h ago

Fine. What's the canonical or most basic example of where SIMD should be applied, but isn't because it's too tricky to do so?

In our shop, we never look to vectorize any function or process unless it's called inside a loop many times.

MangoToupe•1h ago

> What's the canonical or most basic example of where SIMD should be applied, but isn't because it's too tricky to do so?

There is none. That's a contradiction in terms. SIMD either fits the shape or it doesn't.

krapht•56m ago

Variable length parallelism is hard. You can go to highload.fun (SIMD competition site) for problems that are only parallelized after significant effort.

Try problem #1, parsing numbers.

ashvardanian•1h ago

Text processing. Loading/branching/storing content 1 byte at a time is the CPU's worst nightmare, but most text processing is quite tricky in SIMD.
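For contrast, here is one of the easy cases: counting occurrences of a byte, 16 bytes per step, using GCC/clang vector extensions. The function name `count_byte` is hypothetical. This is only a sketch of the simplest shape of SIMD text processing; anything where the meaning of a byte depends on its neighbours (quoting, escapes, variable-length tokens) is where it gets tricky:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char u8x16 __attribute__((vector_size(16)));
typedef signed char   i8x16 __attribute__((vector_size(16)));

static size_t count_byte(const char *text, size_t len, char target)
{
    assert(text != NULL || len == 0);

    /* Broadcast the target byte into every lane. */
    u8x16 needle = {0};
    for (int lane = 0; lane < 16; lane++)
        needle[lane] = (unsigned char)target;

    size_t count = 0, i = 0;
    for (; i + 16 <= len; i += 16) {
        u8x16 chunk;
        memcpy(&chunk, text + i, sizeof chunk);
        /* Lane-wise compare: -1 where equal, 0 elsewhere. */
        i8x16 hits = (i8x16)(chunk == needle);
        for (int lane = 0; lane < 16; lane++)
            count += (size_t)(-hits[lane]);
    }
    /* Scalar tail for the last len % 16 bytes. */
    for (; i < len; i++)
        count += (text[i] == target);
    return count;
}
```

The per-lane summation loop is written naively for clarity; a tuned version would accumulate the compare masks in a vector and reduce less often.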
