From Rust to reality: The hidden journey of fetch_max

https://questdb.com/blog/rust-fetch-max-compiler-journey/

259•bluestreak•4mo ago

Comments

IshKebab•4mo ago

Yeah this comes from ARM and AXI, which has atomic max (and min, add, set, clear and xor). I assume ARM has all the corresponding instructions. RISC-V also has all of these in Zaamo.

yshui•4mo ago

That's a cool find. I wonder if LLVM also does the other way around operation, where it pattern matches handwritten CAS loops and transform them into native ARM64 instructions.

jerrinot•4mo ago

That's a very good question. A proper compiler engineer would know, but I will do my best to find something and report back.

Edit: I could not find any pass with a pattern matching to replace CAS loops. The closest thing I could find is this pass: https://github.com/llvm/llvm-project/blob/06fb26c3a4ede66755... I reckon one could write a similar pass to recognize CAS idioms, but its usefulness would be probably rather limited and not worth the effort/risks.

Arnavion•4mo ago

I checked Godbolt, with RISC-V instead of ARM since I'm more familiar with that, and it doesn't look like it.

https://gcc.godbolt.org/z/b5s4WjnTG

(amomax is the atomic fetch-max instruction. lr and sc are load-reserved and store-conditional instructions; sc is like a regular store except it only succeeds if the address was not modified since the previous lr that accessed it. IOW the assembly is basically one-to-one with the C source.)

tialaramex•4mo ago

The term of art for this technique is "idiom recognition" and it's proper ancient, like, APL compilers did have some idiom recognition 50+ years ago.

An example you'll see in say a modern C compiler is that if you write the obvious loop to calculate how many bits are set in an int, the actual machine code on a brand new CPU should be a single population count instruction, C provides neither intrinsics (like Rust) not a dedicated "popcount" feature, so you can't write that but it's obviously what you want here and yup an optimising C compiler will do that.

However, LLVM is dealing with an IR generated by other compiler folk so I think it probably has less use for idiom recognition. Clang would do the recognition and lower to the same LLVM IR as Rust does for its intrinsic population count core::intrinsics::ctpop so the LLVM backend doesn't need to spot this. I might be wrong, but I think that's how it works.

toth•4mo ago

> An example you'll see in say a modern C compiler is that if you write the obvious loop to calculate how many bits are set in an int, the actual machine code on a brand new CPU should be a single population count instruction, C provides neither intrinsics (like Rust) not a dedicated "popcount" feature, so you can't write that but it's obviously what you want here and yup an optimising C compiler will do that.

C compilers definitely have intrinsics for this, for GCC for instance it is `__builtin_popcount`.

And apparently it has even standard language support for it since C23, it's `stdc_count_ones` [1] and in C++ you have `std::popcount` [2]

[1] https://en.cppreference.com/w/c/numeric/bit_manip.html [2] https://en.cppreference.com/w/cpp/numeric/popcount.html

tialaramex•4mo ago

The existence of platform specific hacks is not interesting. In reality what happens is that software which has at any point cared about being portable doesn't use them.

But yes stdc_count_ones is indeed the intrinsic you'd want here, and only a few years after I stopped writing C, so thanks for mentioning that.

std::popcount is C++ but it's also kinda miserable that it took until C++ 20 and yet they still only landed the unsigned integer types, even though C++ 20 also insists the signed integers have two's complement representation, so the signed integers do have these desirable properties in fact but you can't use that.

toth•4mo ago

Maybe we are just quibbling over semantics but the compiler intrinsic here is '__builtin_popcount'. 'stdc_count_ones' is a standard library element that presumably will be implemented using the intrinsic.

And FWIW all major C/C++ have for a long time have had a an intrinsic for this. In clang it even has the same name, Visual Studio it's something like just '_popcount'. So it has long been easy to roll your own macro that works everywhere.

tialaramex•4mo ago

Yes, just semantics. But I don't think I can agree that because you could have ensured this works portably people actually did. That's not been my experience.

Yesterday I watched that "Sea of Thieves" C++ 14 to C++ 20 upgrade story on Youtube, that feels much more like what I've seen - code that shouldn't have worked but it did, kept alive by people whose priority is a working game.

gpderetta•4mo ago

__builtin_popcount is not platform specific.

tialaramex•4mo ago

OK, sure, vendor specific then. C23 does not promise this incantation, it's presumably a GCCism.

articulatepang•4mo ago

> In reality what happens is that software which has at any point cared about being portable doesn't use them.

I don't think this generalization is actually true. Fast portable software compiles conditionally based on the target platform, picking the fast platform-specific intrinsic, and falls back to a slow but guaranteed portable software implementation. This pattern is widespread in numerical linear algebra, media codecs, data compressors, encryption, graphics, etc.

jerrinot•4mo ago

Hi, author here. My superpower is spending unreasonable amounts of time researching things with no practical purpose. Occasionally I blog about it - as a warning to others.

Ethee•4mo ago

It's these kinds of posts that I appreciate reading the most, so thank you for sharing!

owls-on-wires•4mo ago

“…no practical purpose” Nonsense, I learned something about compilation today. Thank you for sharing.

trws•4mo ago

I liked the article. I saw your PS that we added it to the working draft for c++26, we also made it part of OpenMP as of 5.0 I think. It’s sometimes a hardware atomic like on arm, but what made the case was that it’s common to implement it sub-optimally even on x86 or LL-SC architectures. Often the generic cas loop gets used, like in your lambda example, but it lacks an early cutout since you can ignore any input value that’s on the wrong side of the op by doing a cheap atomic read or just cutting out of the loop after the first failed CAS if the read back shows it can’t matter. Also can benefit from using slightly different memory orders than the default on architectures like ppc64. It’s a surprisingly useful op to support that way.

If this kind of thing floats your boat, you might be interested in the non-reading variants of these as well. Mostly for things like add, max, etc but some recent architectures actually offer alternate operations to skip the read-back. The paper calls them “atomic reduction operations” https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p31...

anematode•4mo ago

Curious: even with hardware atomics, wouldn't it be a good idea to first perform a non-atomic load to check for whether the store might be necessary (which would require the cache line to be locked), then only run the atomic max if it might change the value?

adgjlsfhk1•4mo ago

This depends heavily on what concurrency optimizations your processor implements (and unfortunately this is the sort of thing that doesn't get doccumented and is somewhat hard to test).

anematode•4mo ago

I did a little unscientific test here on an Apple M4 Pro with n threads spamming atomic operations with pseudorandom values on one memory location (the worst case). Used inline asm to make sure there was no funny business going on.

  atomic adds
  n = 1 ->  333e6 adds/second
  n = 2 ->  174e6
  n = 4 ->   95e6
  n = 8 ->   63e6

  atomic maxs
  n = 1 ->  161e6 maxs/second
  n = 2 ->   59e6
  n = 4 ->   39e6
  n = 8 ->   27e6

  atomic maxs with preceding check
  n = 1 ->  929e6 maxs/second
  n = 2 -> 1541e6
  n = 4 -> 3494e6
  n = 8 -> 5985e6

So evidently the M4 doesn't do this optimization. Of course if your distribution is different you'd get different results, and this level of contention is unrealistic, but I don't see why you'd EVER not do a check before running atomic max. I also find it interesting that atomic max is significantly slower than atomic add

thequux•4mo ago

I think that this can change the semantics though; with the preceding check you can miss the shared variable being decremented from another thread. In some cases, such as if the shared value is monotonic, this is done, but not in the general case.

anematode•4mo ago

With a relaxed ordering I'm not sure if that's right, since the ldumax would have no imposed ordering relation with the (atomic) decrement on another thread and so could very well have operated on the old value obtained by the non-atomic load

ibraheemdev•4mo ago

It does make a difference of course if you're running fetch_max from multiple threads, adding a load fast-path introduces a race condition.

masklinn•4mo ago

Does it tho? Assuming no torn reads/writes at those sizes, given the location should be strictly increasing are there situations where you could read a higher-than-stored value which would cause skipping a necessary update?

Afaik on all of x86, arm, and riscv an atomic load of a word sized datum is just a regular load.

gpderetta•4mo ago

It doesn't need to be strictly increasing some other thread could be making other arbitrary operations. Still even in that case, as Dylan16807 pointed out, it likely doesn't matter.

masklinn•4mo ago

> It doesn't need to be strictly increasing some other thread could be making other arbitrary operations

We're talking about collating a maximum, by definition every write to that is an increase.

gpderetta•4mo ago

If you are implementing a library function atomic<T>::fetch_max, you cannot assume that every other thread is also performing a fetch_max on that object. There might be little reason for it, but other operations are allowed so the the sequence of modifications might not be strictly increasing (but then again, it doesn't matter for this specific optimization).

gpderetta•4mo ago

All operations on a single memory location are always totally ordered in a CC system, no matter how relaxed the memory model is.

Also am I understanding it correctly that n is the number of threads in your example? Don't you find it suspicious that the number of operations goes up as the thread count goes up?

edit: ok, you are saying that under heavy contention the check avoids having to do the store at all. This is racy, and whether this is correct or not, would be very application specific.

edit2: I thought about this a bit, and I'm not sure i can come up with a scenario where the race matters...

edit3: ... as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.

Dylan16807•4mo ago

> as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.

What assumes that?

If your early read gives you a higher number, quitting out immediately is the same as doing the max that same nanosecond. You avoid setting a variable to the same value it already is. Doing or not doing that write shouldn't affect other atomics users, should it?

In general, I should be able to add or remove as many atomic(x=x) operations as I want without changing the result, right?

And if your early read is lower then you just do the max and having an extra read is harmless.

The only case I can think of that goes wrong is the read (and elided max) happening too early in relation to accesses to other variables, but we're assuming relaxed memory order here so that's explicitly acceptable.

gpderetta•4mo ago

Yes, probably you are right: a load that finds a larger value is equivalent to a max. As the max wouldn't store any value in this case, also it wouldn't introduce any synchronization edge.

A load that finds a smaller value is trickier to analyze, but i think you are just free to ignore it and just proceed with the atomic max. An underlying LL/SC loop to implement a max operation might spuriously fail anyway.

edit: here is another argument in favour: if your only atomic RMW is a cas, to implement X.atomic_max(new) you would:

  1: expected <- X 
  2: if new < expected: done
  3: else if X.cas(expected, y): done
     else goto 2 # expected implicitly refreshed

So a cas loop would naturally implement the same optimization (unless it starts with a random expected), so the race is benign.

adwn•4mo ago

Yes, this can make sense if

- the value is often doesn't require an update, and

- there's contention on the cache line, i.e., at least two cores frequently read or write that cache line.

But there are important details to consider:

1) The probing load must be atomic. Both the compiler and the processor in general are allowed to split non-atomic loads into two or more partial loads. Only atomic loads – even with relaxed ordering – are guaranteed to not return intermediate or mixed values from other atomic stores.

2) If the ordering on the read part of the atomic read-modify-write operation is not relaxed, the probing load must reflect this. For example, an acq-rel RMW op would require an acquire ordering on the probing read.

anematode•4mo ago

Thanks for your insights. (2) makes sense to me, but for (1), on ARM64 can an aligned 64-bit store really tear in a 64-bit non-atomic load? The spec says "A write that is generated by a store instruction that stores a single general-purpose register and is aligned to the size of the write in the instruction is single-copy atomic" (B2.2.1)

adwn•4mo ago

> […] on ARM64 […]

Well, if you target a specific architecture, then of course you can assume more guarantees than in general, portable code. And in general, a processor might distinguish between non-atomic and relaxed-atomic reads and writes – in theory.

But more important, and relevant in practice, is the behavior of the compiler. C, C++, and Rust compilers are allowed to assume that non-atomic reads aren't influenced by concurrent writes, so the compiler is allowed to split non-atomic reads into smaller reads (unlikely) or even optimize the reads away if it can prove that the memory location isn't written to by the local thread (more likely).

anematode•4mo ago

Sure, no doubt a non-atomic load would be dangerous to write in C, C++, or Rust rather than in assembly

SkiFire13•4mo ago

> but it lacks an early cutout since you can ignore any input value that’s on the wrong side of the op by doing a cheap atomic read or just cutting out of the loop after the first failed CAS if the read back shows it can’t matter.

I believe this is a bit trickier than that, you would also need at least some kind of atomic barrier to preserve the ordering semantics of the successful update case.

ajayka•4mo ago

Great article! Did you end up hiring the candidate?!

xarope•4mo ago

looks around room, heads nodding.

Ah, a magician. welcome.

michalsustr•4mo ago

Thank you for sharing, loved the article!

tux3•4mo ago

This blog sent me into a memory models rabbit hole again. Each time I end up feeling like I'm finally starting to get it, only for a 6 line litmus test with 4 loads and 2 stores to send me crashing back down.

It makes me feel a little better reading about the history of memory models in CPUs. If this stuff wasn't intuitive to Intel either, I'm at least in good company in being confused (https://research.swtch.com/hwmm#path_to_x86-tso)

I actually knew about fetch_max from "implementing" the corresponding instruction (risc-v amomax), but I haven't done any of the fun parts yet since my soft-CPU still only has a single core.

jamesmunns•4mo ago

If you haven't seen it, Mara Bos' "Rust Atomics and Locks"[0] is an excellent book on this topic, even if you aren't particularly interested in Rust.

[0]: https://marabos.nl/atomics/

tux3•4mo ago

Thank you, it looks lovely!

orlp•4mo ago

Aarch64 does indeed have a proper atomic max, but even on x86-64 you can get a wait-free atomic max as long as you only need to support integers up to 64. In that case you can simply do a `lock or` with 1 << i as your maximum. You can even support larger sizes by using multiple registers, e.g. four 64-bit registers for a u8 maximum.

In most cases it's even better to just store a maximum per thread separately and loop over all threads once to compute the current maximum if you really need it.

jerrinot•4mo ago

That’s a neat trick, albeit with limited applicability given the very narrow range. Thanks for sharing!

MountainTheme12•4mo ago

Only slightly related, but GPUs also have such instructions (exposed as InterlockedMax in HLSL and atomicMax in GLSL and CUDA).

minedwiz•4mo ago

Did he get the job?

brcmthrowaway•4mo ago

Wasn't a culture fit

ShroudedNight•4mo ago

Was this compiled at O0? The generated code looks unnecessarily long-winded - at the very least I would expect the match jump table to get culled to only the Relaxed implementation.

ambicapter•4mo ago

> Note we did not ask rustc to optimize the code. If we did, the compiler would generate more efficient assembly: No spills to the stack, fewer jumps, no dispatch on memory ordering, etc. But I wanted to keep the output as close to the original IR as possible to make it easier to follow.

RTFA

ShroudedNight•4mo ago

I did, however that call out did, admittedly, slip past me

ambicapter•4mo ago

No worries, I felt like an ass using the idiom but it was so literally appropriate.

delifue•4mo ago

When reading I expected it to mention that each thread maintain thread local max and periodically sync to a global atomic can improve performance

jerrinot•4mo ago

I expect candidates to suggest similar optimisations, but I felt it was unnecessary for the article itself.

anematode•4mo ago

Great article :)

markcjeffrey•4mo ago

Related: fetch_max is an instance of what the following SPAA 2013 paper calls an atomic "priority update" or atomic "write-with-max". This type of atomic operation can have much lower contention than its counterparts like atomic increment.

https://doi.org/10.1145/2486159.2486189 https://jshun.csail.mit.edu/contention.pdf

Jweb_Guru•4mo ago

One of the most practically important papers out there, I wish it were better known (but fortunately I think the "right" people know about it).

Arnavion•4mo ago

>Hold on. This wasn't a wrapper around a loop pattern - this was a first-class atomic operation, sitting right there next to fetch_add and fetch_or. Java doesn't have this. C++ doesn't have this. How could Rust just... have this?

C++26 (work-in-progress) does have std::atomic<T>::fetch_max . Not implemented in any toolchains yet, though.

https://en.cppreference.com/w/cpp/atomic/atomic/fetch_max

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p04...

bilkow•4mo ago

That info is included later in the article:

> PS: After conducting this journey I learned that C++26 adds fetch_max too!

vips7L•4mo ago

Fun read! Makes me realize I should probably go reread Java Concurrency in Practice.

TuxSH•4mo ago

Somewhat related: I find annoying that C++ doesn't have fetch_update and that Rust's fetch_update doesn't support LL/SC.

Rust fetch_update uses the lowest common denominator, CAS, regardless of platform: https://godbolt.org/z/ncssGnsfx (see the call __aarch64_cas8_acq_rel). In hot loops this can mean double-digit perf loss.

gpderetta•4mo ago

It is very hard to support LL/SC in generalized user code as the specific rules of what cause an LL lease to fail are generally non-portable (possibly not even within an architecture).

It could be implemented with a CAS fallback of course, but it seems a performance trap.

You could add the logic to the compiler to detect which specific code sequences are LL/SC safe, but at that point just providing built-ins for the most common operations is simpler.

Al Lowe on model trains, funny deaths and working with Disney

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The AI boom is causing shortages everywhere else

Reinforcement Learning from Human Feedback

The Waymo World Model

Start all of your commands with a comma (2009)

Selection Rather Than Prediction

Vocal Guide – belt sing without killing yourself

Speed up responses with fast mode

France's homegrown open source online office suite

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Coding agents have replaced every framework I used

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

Software factories and the agentic moment

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Making geo joins faster with H3 indexes

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

Ga68, a GNU Algol 68 Compiler

Show HN: If you lose your memory, how to regain access to your computer?

An Update on Heroku

Show HN: I spent 4 years building a UI design tool with only the features I use

Al Lowe on model trains, funny deaths and working with Disney

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The AI boom is causing shortages everywhere else

Reinforcement Learning from Human Feedback

The Waymo World Model

Start all of your commands with a comma (2009)

Selection Rather Than Prediction

Vocal Guide – belt sing without killing yourself

Speed up responses with fast mode

France's homegrown open source online office suite

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Coding agents have replaced every framework I used

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

Software factories and the agentic moment

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Making geo joins faster with H3 indexes

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

Ga68, a GNU Algol 68 Compiler

Show HN: If you lose your memory, how to regain access to your computer?

An Update on Heroku

Show HN: I spent 4 years building a UI design tool with only the features I use

From Rust to reality: The hidden journey of fetch_max

Comments