> Yes, it is absolutely key to build your app as ARM, not to rely on Windows ARM emulation.
Using AVX2 and relying on an emulator are contradictory goals. Sure, there could be a better emulator, or hardware designed to match (both Apple and Microsoft exploit the similar register structures of ARM64 and x86_64), but that means increased complexity and reduced reliability / predictability.
The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support? It's not like Intel/AMD came up with these extensions for x86 yesterday; AVX2 is over a decade old.
https://ashvardanian.com/posts/aws-graviton-checksums-on-neo...
Graviton3 has 256-bit SVE vector registers but only four 128-bit SIMD execution units, because NEON needs to be fast.
Intel previously was in such a dominant market position that they could require all performance-critical software to be rewritten thrice.
You can treat both SVE and RVV as regular fixed-width SIMD ISAs.
"runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV and SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128. They also have good predication support (like AVX-512), which allows them to masked of elements after certain point.
If you want to emulate AVX2 with SVE or RVV, you might require that the hardware has a native vector length >=256 bits, and then always mask off the lanes beyond 256 bits, so the same code works on any native vector length >=256.
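The masking idea can be modeled in portable C. Below is a minimal sketch (names and lane counts are mine, not from any real ISA): a hypothetical 512-bit machine runs a loop written against a fixed 256-bit cap, and a per-lane predicate simply keeps the upper lanes switched off, the way an SVE/RVV predicate or AVX-512 mask would.

```c
#include <stddef.h>
#include <stdint.h>

#define NATIVE_LANES 16  /* pretend 512-bit hardware: 16 x 32-bit lanes */
#define CAP_LANES     8  /* code targets a fixed 256-bit ISA: 8 x 32-bit */

/* One "vector iteration" per outer step: all NATIVE_LANES nominally
 * execute, but the predicate masks off lanes >= CAP_LANES as well as
 * lanes that run past the end of the array (the loop tail). */
void add_u32_capped(const uint32_t *a, const uint32_t *b,
                    uint32_t *c, size_t n) {
    for (size_t i = 0; i < n; i += CAP_LANES) {
        for (size_t lane = 0; lane < NATIVE_LANES; lane++) {
            int active = lane < CAP_LANES && i + lane < n; /* predicate */
            if (active)
                c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```

The same binary logic works unchanged whether `NATIVE_LANES` is 8, 16, or 32, which is the point of the "mask off everything past 256 bits" trick.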
The main cores in the K3 have 256-bit vectors with two 128-bit wide execution units, and two separate 128-bit wide vector load/store units.
See also: https://forum.spacemit.com/uploads/short-url/60aJ8cYNmrFWqHn...
But yes, RVV already has more diverse vector width hardware than SVE.
Also, it doesn't just speed up vector math. Compilers these days with knowledge of these extensions can auto-vectorize your code, so it has the potential to speed up every for-loop you write.
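To make the auto-vectorization point concrete, here is the kind of plain loop compilers routinely vectorize on their own (a standard saxpy-style kernel; no intrinsics anywhere):

```c
#include <stddef.h>

/* GCC and Clang auto-vectorize this loop at -O2/-O3 when targeting
 * AVX2 (e.g. -march=x86-64-v3), SVE, or RVV: the compiler emits
 * vector loads, fused multiply-adds, and stores by itself. */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

You can confirm what the compiler did with `-fopt-info-vec` (GCC) or `-Rpass=loop-vectorize` (Clang).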
So operations that are not performance critical and are needed once or twice every hour? Are you sure you don't want to include a dedicated cluster of RTX 6090 Ti GPUs to speed them up?
In Apple's case, they have both the GPU and the NPU to fall back on, and a more closed/controlled ecosystem that breaks backwards compatibility every few years anyway. But Qualcomm is not so lucky; Windows is far more open and far more backwards compatible. I think the bet is that there are enough users who don't need/care about that, but I would question why they would even want Windows in the first place, when macOS, ChromeOS, or even GNU/Linux are available.
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
My experience is that trying to get benefits from the vector extensions is incredibly hard and the use cases are very narrow. Having them in a standard BLAS implementation, sure, but outside of that I think they are not worth the effort.
SIMD is not limited to mathy linear algebra things anymore. Did you know that lookup tables can be accelerated with AVX2? A lot of branchy code can be vectorized nowadays using scatter/gather/shuffle/blend/etc. instructions. The benefits vary, but can be significant. I think a view of SIMD as just being a faster/wider ALU is out of date.
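As one concrete example of a vectorized lookup table: the well-known nibble-LUT popcount (due to Muła et al.) uses `_mm256_shuffle_epi8` as a 16-entry table lookup performed on 32 bytes at once. A sketch with AVX2 intrinsics (the function name is mine):

```c
#include <immintrin.h>
#include <stdint.h>

/* Total popcount of 32 bytes: each byte is split into two nibbles,
 * and each nibble indexes a 16-entry bit-count table via shuffle. */
__attribute__((target("avx2")))
uint64_t popcount32(const uint8_t *p) {
    /* popcount of 0..15, repeated in both 128-bit lanes */
    const __m256i lut = _mm256_setr_epi8(
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i v  = _mm256_loadu_si256((const __m256i *)p);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
    /* 32 parallel table lookups per shuffle */
    __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                  _mm256_shuffle_epi8(lut, hi));
    /* horizontal byte sum via sum-of-absolute-differences vs zero */
    __m256i sad = _mm256_sad_epu8(cnt, _mm256_setzero_si256());
    uint64_t out[4];
    _mm256_storeu_si256((__m256i *)out, sad);
    return out[0] + out[1] + out[2] + out[3];
}
```

The same shuffle-as-table-lookup pattern shows up in base64 codecs, hex parsing, and UTF-8 validation, none of which look like "math" at all.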
iberator•1h ago
Most of the world lives on $300 per month.
SecretDreams•1h ago
Maybe you're thinking of avx512 or avx10?
jorvi•50m ago
AVX512 is so kludgy that it often hurts performance: the extreme power requirements trigger thermal throttling.
kimixa•41m ago
But there's plenty in AVX512 that really helps real algorithms outside the 512-bit registers. I think it would be perceived very differently if it had initially shipped as the new instructions on the same 256-bit registers (i.e. AVX10) and then been extended to 512 as the transistor/power budgets allowed. AVX512 just tied too many things together too early, rather than arriving as incremental extensions.
otherjason•9m ago
AVX512 leading to thermal throttling is a common myth that, from what I can tell, traces its origins to a blog post about clock throttling on a particular set of low-TDP SKUs from the first generation of Xeon CPUs that supported it (Skylake-X), released back in 2017: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
The results were disputed shortly afterward by well-known SIMD authors who were unable to reproduce them: https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-i...
In practice, this has not been an issue for a long time, if ever; clock frequency scaling for AVX modes has been continually improved in subsequent Intel CPU generations (and even more so in AMD Zen 4/5 once AVX512 support was added).
winstonwinston•16m ago
Edit: Furthermore, I think none of these low-budget CPUs supported AVX2 until Tiger Lake released in 2020.
jsheard•20m ago
Other Settings > AVX2 > 95.11% supported (+0.30% this month)