> Yes, it is absolutely key to build your app as ARM, not to rely on Windows ARM emulation.
Using AVX2 and relying on an emulator are contradictory goals. Sure, there could be a better emulator, or hardware designed to match (both Apple and Microsoft exploit the similar register structures of ARM64 and x86_64), but that means increased complexity and reduced reliability / predictability.
The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support? It's not like Intel/AMD came up with these extensions for x86 yesterday; AVX2 is over a decade old.
https://ashvardanian.com/posts/aws-graviton-checksums-on-neo...
Graviton3 has 256-bit SVE vector registers but only four 128-bit SIMD execution units, because NEON needs to be fast.
Intel previously was in such a dominant market position that they could require all performance-critical software to be rewritten thrice.
You can treat both SVE and RVV as regular fixed-width SIMD ISAs.
"runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV and SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128. They also have good predication support (like AVX-512), which allows them to masked of elements after certain point.
If you want to emulate AVX2 with SVE or RVV, you might require that the hardware has a native vector length >=256 bits, and then always mask off the lanes beyond 256 bits, so the same code works on any native vector length >=256.
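The masking idea can be modeled in portable C. Below is a minimal sketch (names and lane counts are mine, not from any real ISA): a hypothetical 512-bit machine runs a loop written against a fixed 256-bit cap, and a per-lane predicate simply keeps the upper lanes switched off, the way an SVE/RVV predicate or AVX-512 mask would.

```c
#include <stddef.h>
#include <stdint.h>

#define NATIVE_LANES 16  /* pretend 512-bit hardware: 16 x 32-bit lanes */
#define CAP_LANES     8  /* code targets a fixed 256-bit ISA: 8 x 32-bit */

/* One "vector iteration" per outer step: all NATIVE_LANES nominally
 * execute, but the predicate masks off lanes >= CAP_LANES as well as
 * lanes that run past the end of the array (the loop tail). */
void add_u32_capped(const uint32_t *a, const uint32_t *b,
                    uint32_t *c, size_t n) {
    for (size_t i = 0; i < n; i += CAP_LANES) {
        for (size_t lane = 0; lane < NATIVE_LANES; lane++) {
            int active = lane < CAP_LANES && i + lane < n; /* predicate */
            if (active)
                c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```

The same binary logic works unchanged whether `NATIVE_LANES` is 8, 16, or 32, which is the point of the "mask off everything past 256 bits" trick.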
The main cores in the K3 have 256-bit vectors with two 128-bit wide execution units, and two separate 128-bit wide vector load/store units.
See also: https://forum.spacemit.com/uploads/short-url/60aJ8cYNmrFWqHn...
But yes, RVV already has more diverse vector width hardware than SVE.
Also, it doesn't just speed up vector math. Compilers these days with knowledge of these extensions can auto-vectorize your code, so it has the potential to speed up every for-loop you write.
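To make the auto-vectorization point concrete, here is the kind of plain loop compilers routinely vectorize on their own (a standard saxpy-style kernel; no intrinsics anywhere):

```c
#include <stddef.h>

/* GCC and Clang auto-vectorize this loop at -O2/-O3 when targeting
 * AVX2 (e.g. -march=x86-64-v3), SVE, or RVV: the compiler emits
 * vector loads, fused multiply-adds, and stores by itself. */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

You can confirm what the compiler did with `-fopt-info-vec` (GCC) or `-Rpass=loop-vectorize` (Clang).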
So operations that are not performance critical and are needed once or twice every hour? Are you sure you don't want to include a dedicated cluster of RTX 6090 Ti GPUs to speed them up?
In Apple's case, they have both the GPU and the NPU to fall back on, and a more closed/controlled ecosystem that breaks backwards compatibility every few years anyway. But Qualcomm is not so lucky; Windows is far more open and far more backwards compatible. I think the bet is that there are enough users who don't need/care about that, but I would question why they would even want Windows in the first place, when macOS, ChromeOS, or even GNU/Linux are available.
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
My experience is that trying to get benefits from the vector extensions is incredibly hard and the use cases are very narrow. Having them in a standard BLAS implementation, sure, but outside of that I think they are not worth the effort.
SIMD is not limited to mathy linear algebra things anymore. Did you know that lookup tables can be accelerated with AVX2? A lot of branchy code can be vectorized nowadays using scatter/gather/shuffle/blend/etc. instructions. The benefits vary, but can be significant. I think a view of SIMD as just being a faster/wider ALU is out of date.
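As one concrete example of a vectorized lookup table: the well-known nibble-LUT popcount (due to Muła et al.) uses `_mm256_shuffle_epi8` as a 16-entry table lookup performed on 32 bytes at once. A sketch with AVX2 intrinsics (the function name is mine):

```c
#include <immintrin.h>
#include <stdint.h>

/* Total popcount of 32 bytes: each byte is split into two nibbles,
 * and each nibble indexes a 16-entry bit-count table via shuffle. */
__attribute__((target("avx2")))
uint64_t popcount32(const uint8_t *p) {
    /* popcount of 0..15, repeated in both 128-bit lanes */
    const __m256i lut = _mm256_setr_epi8(
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i v  = _mm256_loadu_si256((const __m256i *)p);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
    /* 32 parallel table lookups per shuffle */
    __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                  _mm256_shuffle_epi8(lut, hi));
    /* horizontal byte sum via sum-of-absolute-differences vs zero */
    __m256i sad = _mm256_sad_epu8(cnt, _mm256_setzero_si256());
    uint64_t out[4];
    _mm256_storeu_si256((__m256i *)out, sad);
    return out[0] + out[1] + out[2] + out[3];
}
```

The same shuffle-as-table-lookup pattern shows up in base64 codecs, hex parsing, and UTF-8 validation, none of which look like "math" at all.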
iberator•1h ago
Most of the world lives on $300 per month.
SecretDreams•1h ago
Maybe you're thinking of avx512 or avx10?
jorvi•50m ago
AVX512 is so kludgy that it often hurts performance: the extreme power requirements trigger thermal throttling.
kimixa•41m ago
But there's plenty in AVX512 that really helps real algorithms outside the 512-bit registers. I think it would be perceived very differently if it had initially shipped as the new instructions on the same 256-bit registers (i.e. AVX10) and then been extended to 512 as the transistor/power budgets allowed. AVX512 just tied too many things together too early, rather than arriving as incremental extensions.
otherjason•9m ago
AVX512 leading to thermal throttling is a common myth that, from what I can tell, traces its origins to a blog post about clock throttling on a particular set of low-TDP SKUs from the first generation of Xeon CPUs that supported it (Skylake-X), released back in 2017: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
The results were disputed shortly afterward by well-known SIMD authors who were unable to reproduce them: https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-i...
In practice, this has not been an issue for a long time, if ever; clock frequency scaling for AVX modes has been continually improved in subsequent Intel CPU generations (and even more so in AMD Zen 4/5 once AVX512 support was added).
winstonwinston•16m ago
Edit: Furthermore, I think none of these low-budget CPUs supported AVX2 until Tiger Lake released in 2020.
jsheard•20m ago
Other Settings > AVX2 > 95.11% supported (+0.30% this month)