[x86] AI Compute Extensions (ACE) Specification

https://x86ecosystem.org/resource/ai-compute-extensions-ace-specification/

19•matt_d•2h ago

Comments

dgoldstein0•1h ago

So how does this differ from available sse / avx instructions already in most x64 machines?

dmitrygr•1h ago

This also adds new registers to operate on (more state): at least 1 KB more (512 bits x 16).

anematode•1h ago

One thing that stuck out to me is that it deals with a lot more data formats, in particular low-precision formats like FP4, FP6 and FP8. Manipulating those formats can take a lot of annoying effort; in general, x86 (until AVX-512, at least) has unconvincing support for so-called "lane-crossing" instructions that move data across 16-byte boundaries within a vector. So you can imagine that unpacking, e.g., tightly packed 7-bit data to 8-bit data is a real slog.
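To make the "slog" concrete, here is a rough scalar sketch of the 7-bit to 8-bit unpacking mentioned above. It assumes values are packed LSB-first with no padding, which is only one possible layout; `pack7` and `unpack7` are hypothetical helper names, not anything from the ACE spec. A SIMD version has to do the same bit shuffling across byte (and lane) boundaries, which is where the pain comes from.

```python
def pack7(values):
    """Pack 7-bit values into bytes, LSB-first, with no padding between values."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for v in values:
        acc |= (v & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # flush the final partial byte
    return bytes(out)


def unpack7(packed, count):
    """Unpack `count` 7-bit values (one per output byte) from an LSB-first stream."""
    out = []
    acc = 0
    nbits = 0
    pos = 0
    for _ in range(count):
        while nbits < 7:
            acc |= packed[pos] << nbits  # pull in another byte when we run short
            pos += 1
            nbits += 8
        out.append(acc & 0x7F)
        acc >>= 7
        nbits -= 7
    return out


values = [0, 1, 127, 64, 5, 99, 100, 3]
packed = pack7(values)          # 8 values * 7 bits = 56 bits = 7 bytes
restored = unpack7(packed, len(values))
```

Note that every output value except the first straddles a byte boundary in the packed stream, so no single byte-granular shuffle can do the job.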

I can already immediately think of a use case for vunpackb in some of the stuff I'm working on, where we'd like to efficiently unpack weights from the high half of a vector.

Separately, adding all signed–unsigned variants of the VNNI dot product instructions is a welcome (albeit niche) change. There was an annoying divergence here between major ISAs: x86 added vpdpbusd, which computes a dot product between u8 and i8, while ARM added vdotq, which computes a dot product either between u8 and u8 elements, or i8 and i8. So for broad compatibility, you generally had to restrict one of your inputs to [0,127]. This difference shows in the design of (for example) WASM relaxed SIMD, where the result of i32x4.relaxed_dot_i8x16_i7x16_add_s is implementation-defined if you exceed the [0,127] range. ARM later added mixed-sign variants, and now x86 completes the set.
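A small scalar model of why the [0,127] restriction works, as I understand it. The function names are made up for illustration, and this ignores 32-bit wraparound and saturation entirely; it only models how the same input bytes get interpreted under the two sign conventions.

```python
def as_i8(b):
    """Reinterpret a byte value 0..255 as a signed 8-bit integer."""
    return b - 256 if b >= 128 else b


def dot_u8_i8(acc, a, b):
    """Mixed-sign style: bytes of `a` are unsigned, bytes of `b` are signed."""
    return acc + sum(x * as_i8(y) for x, y in zip(a, b))


def dot_i8_i8(acc, a, b):
    """Same-sign style: both inputs are read as signed bytes."""
    return acc + sum(as_i8(x) * as_i8(y) for x, y in zip(a, b))


# With `a` restricted to [0,127], both interpretations give the same result.
a_safe = bytes([1, 127, 0, 50])
b = bytes([255, 2, 128, 10])       # read as signed: -1, 2, -128, 10
same_mixed = dot_u8_i8(0, a_safe, b)
same_signed = dot_i8_i8(0, a_safe, b)

# Once `a` has a byte with the high bit set, the two diverge.
a_bad = bytes([200, 0, 0, 0])
b_one = bytes([1, 0, 0, 0])
diff_mixed = dot_u8_i8(0, a_bad, b_one)    # 200 read as unsigned
diff_signed = dot_i8_i8(0, a_bad, b_one)   # 200 read as -56
```

The 7-bit range is exactly the set of byte values whose signed and unsigned readings coincide, which is why portable code had to stay inside it.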

adrian_b•14m ago

This is not a vector extension (like Intel AVX/AVX-512 or Arm SVE), but a matrix extension (like Intel AMX or Arm SME or the "tensor" operations of NVIDIA GPUs).

Some of the latest generations of Intel server CPUs with P-cores already have the AMX matrix extension, which can be used to implement fast AI inference.

AMD has not implemented AMX yet, and probably they will not implement it, because this new "AI Compute Extension", which has been defined by Intel and AMD together, is an alternative to AMX.

Matrix extensions are more efficient for AI inference than vector extensions, because they reduce the ratio between memory accesses and computation operations.
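A back-of-the-envelope sketch of that ratio, under simplifying assumptions of my own: accumulators stay in registers, only operand loads are counted, and a "matrix" step is modeled as an n x n outer-product update. The function names are hypothetical and nothing here is taken from the ACE or AMX documentation.

```python
def macs_per_load_vector(n):
    """Elementwise fused multiply-add on n lanes: load n + n operands, do n MACs."""
    loads = 2 * n
    macs = n
    return macs / loads


def macs_per_load_outer_product(n):
    """Outer-product update of an n x n accumulator tile: load n + n operands, do n*n MACs."""
    loads = 2 * n
    macs = n * n
    return macs / loads


n = 16
vector_ratio = macs_per_load_vector(n)          # 0.5 MACs per element loaded
matrix_ratio = macs_per_load_outer_product(n)   # 8.0 MACs per element loaded
```

In this model the vector ratio is constant while the matrix ratio grows linearly with the tile dimension, so larger tiles need proportionally less memory traffic per operation.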

sorenjan•1h ago

AVX512 isn't available on most new CPUs. I'm guessing ACE will only be available on server CPUs for at least a couple of years after launch?

deadmutex•25m ago

> AVX512 isn't available on most new CPUs

Please define new. Also, I think AMD uses very similar cores in server and client. So, disabling AVX512 may be an Intel thing (my guess is that they can easily move threads between E & P cores).

murderfs•7m ago

They didn't disable it at first on their client CPUs, and it resulted in code randomly crashing depending on whether ifunc resolvers first ran on a big core or a little core.

It's pretty surprising that multiple CPU vendors have run into issues like this (some more than once, fucking Samsung), when it's pretty much the first thing that anyone on the toolchain side of things asks when they hear about heterogeneous cores on a CPU.

BobbyTables2•46m ago

Thank $ALL_DEITIES that the TCG wasn’t involved!
