
Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
250•theblazehen•2d ago•84 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
23•AlexeyBrin•1h ago•1 comment

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
705•klaussilveira•15h ago•206 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
967•xnx•21h ago•558 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
67•jesperordrup•6h ago•28 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
7•onurkanbkrc•43m ago•0 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
135•matheusalmeida•2d ago•35 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
43•speckx•4d ago•34 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
68•videotopia•4d ago•6 comments

ga68, the GNU Algol 68 Compiler – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
13•matt_d•3d ago•2 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
39•kaonwarb•3d ago•30 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
45•helloplanets•4d ago•46 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
237•isitcontent•16h ago•26 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
237•dmpetrov•16h ago•126 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
340•vecti•18h ago•147 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
506•todsacerdoti•23h ago•247 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
389•ostacke•21h ago•97 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
303•eljojo•18h ago•188 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
361•aktau•22h ago•186 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
428•lstoll•22h ago•284 comments

Cross-Region MSK Replication: K2K vs. MirrorMaker2

https://medium.com/lensesio/cross-region-msk-replication-a-comprehensive-performance-comparison-o...
3•andmarios•4d ago•1 comment

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
71•kmm•5d ago•10 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
23•bikenaga•3d ago•11 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
25•1vuio0pswjnm7•2h ago•14 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
96•quibono•4d ago•22 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
270•i5heu•18h ago•219 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
34•romes•4d ago•3 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1079•cdrnsf•1d ago•461 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
64•gfortaine•13h ago•30 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
306•surprisetalk•3d ago•44 comments

Myths Programmers Believe about CPU Caches (2018)

https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/
156•whack•3mo ago

Comments

yohbho•3mo ago
2018, discussed on HN in 2023: https://news.ycombinator.com/item?id=36333034

  Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)
 176 points by whack on June 14, 2023 | hide | past | favorite | 138 comments
breppp•3mo ago
Very interesting. I am hardly an expert, but this gives the impression that, if not for that meddling software, we would all live in a synchronous world.

This ignores store buffers and, consequently, memory fencing, which is the basis for the nightmarish std::memory_order, the worst API documentation you will ever meet.

jeffbee•3mo ago
If the committee had any good taste at all they would have thrown DEC AXP overboard before C++11, which would have cut the majority of the words out of the specification for std::memory_order. It was only the obsolete and quite impossible-to-program Alpha CPU that required the ultimate level of complexity.
gpderetta•3mo ago
If Alpha never existed, memory_order_consume would still exist. But we have done this discussion before.
QuaternionsBhop•3mo ago
Since the CPU does cache coherency transparently, perhaps there should be some way for an application to promise that it is well-behaved in order to gain access to a lower-level, non-transparent instruction set and manage cache coherency manually from the application level. Or perhaps applications can never be trusted with that level of control over the hardware. The MESI model reminded me of Rust's ownership and borrowing. The same pattern appears in OpenGL vs. Vulkan drivers: implicit sync vs. explicit sync. Yet another example is the cache-management work involved in squeezing maximum throughput out of CUDA on an enterprise GPU.
cpgxiii•3mo ago
There are some knobs that newer processors give for cache control, mostly to partition or reserve cache space to improve security or reduce cache contention between processes.

Actual manual cache management is way too much of an implementation detail for a general-purpose CPU to expose; doing so would deeply tie code to a specific set of processor behavior. Cache sizes and even hierarchies change often between processor generations, and some internal cache behavior has changed within a generation as a result of microcode and/or hardware steppings. Actual cache control would be like MIPS exposing delay slots but so much worse (at least older delay slots really only turn into performance issues, older cache control would easily turn into correctness issues).

Really the only way to make this work is for the final compilation/"specialization" step to occur on the specific device in question, like with a processor using binary translation (e.g. Transmeta, Nvidia Denver) or specialization (e.g. Mill) or a system that effectively enforces runtime compilation (e.g. runtime shader/program compilation in OpenGL and OpenCL).

groundzeros2015•3mo ago
Did this article share more than one myth?

The reason programmers don't believe in cache coherency is that they have experienced a closely related phenomenon: memory reordering. This requires you to use a memory fence when accessing a value shared between multiple cores, as if the caches needed to be synchronized.

Lwerewolf•3mo ago
I'm pretty sure that most cases of x86 reordering issues are a matter of the compiler reordering things, which (afaik) isn't solved with just "volatile". Caveat: I haven't dealt with this (multicore sync without OS primitives in general) for over a year.
nly•3mo ago
x86 has a Total Store Order (TSO) memory model, which effectively means (in a mental model where only one shared-memory operation happens at a time and completes before the next) that stores are queued while loads can execute immediately, even if stores are still sitting in the store buffer.

On a single core a load can be served from the store buffer (the queue), but other cores can't see those stores yet, which is where all the inconsistencies come from.

fer•3mo ago
The volatile keyword in Java exists literally to make the compiler aware that it must insert memory barriers. That only guarantees consistent reads and writes; it doesn't make code thread-safe (e.g. a write that depends on a previous read), which is what atomics are for.

Also, it's not only compilers that reorder things: most processors nowadays do OoOE, so even if the order produced by the compiler is perfect in theory, different latencies for different instruction operands may lead the CPU to execute later instructions earlier so as not to stall.

zozbot234•3mo ago
Note that this is only true of the Java/C# volatile keyword. The volatile keyword in C/C++ is solely about direct access in hardware to memory-mapped locations, for such purposes as controlling external devices; it is entirely unrelated to the C11 memory model for concurrency, does not provide the same guarantees, and should never be used for that purpose.
igtztorrero•3mo ago
In Go there are sync.Mutex, sync/atomic, and channels to create this fence and prevent data races. I prefer sync.Mutex.

Does anyone understand how Go handles the CPU cache?

groundzeros2015•3mo ago
Yes. Locks will use a memory fence. More advanced programs will need a fence without locking.
daemontus•3mo ago
I may be completely out of line here, but isn't the story on ARM very, very different? I vaguely recall that the whole point of having stuff like weak atomics is that on x86 they don't do anything, but on ARM they are essential for cache coherency and memory ordering? But then again, I may just be conflating memory ordering with coherency.
jeffbee•3mo ago
Well, since this is a thread about how programmers use the wrong words to model how they think a CPU cache works, I think it bears mentioning that you've used "atomics" here to mean something irrelevant. It is not true that x86 atomics do nothing. Atomic instructions or, on x86, their prefix, make a naturally non-atomic operation such as a read-modify-write atomic. The ARM ISA actually lacked such a facility until ARMv8.1.

The instructions to which you refer are not atomics, but rather instructions that influence the ordering of loads and stores. x86 has total store ordering by design. On ARM, the program has to use LDAR/STLR to establish ordering.

phire•3mo ago
Everything it says about cache coherency is exactly the same on ARM.

Memory ordering has nothing to do with cache coherency, it's all about what happens within the CPU pipeline itself. On ARM reads and writes can become reordered within the CPU pipeline itself, before they hit the caches (which are still fully coherent).

ARM still has strict memory ordering for code within a single core (some older processors do not), but the writes from one core might become visible to other cores in the wrong order.

gpderetta•3mo ago
You are getting downvoted, but you are of course correct.
gpderetta•3mo ago
Normally when talking about relaxed memory models, full cache coherency is still assumed. For example the C++11 memory model cannot be implemented on a non-cache-coherent system, at least not without massive performance penalties.
ashvardanian•3mo ago
Here's my favorite practically applicable cache-related fact: even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines. Some thread-pool implementations like CrossBeam in Rust and my ForkUnion in Rust and C++, explicitly document that and align objects to 128 bytes [1]:

  /**
   *  @brief Defines variable alignment to avoid false sharing.
   *  @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
   *  @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
   *
   *  The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
   *
   *  @code{.cpp}
   *  #if defined(__cpp_lib_hardware_interference_size)
   *  static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
   *  #else
   *  static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
   *  #endif
   *  @endcode
   *
   *  That however results into all kinds of ABI warnings with GCC, and suboptimal alignment choice,
   *  unless you hard-code `--param hardware_destructive_interference_size=64` or disable the warning
   *  with `-Wno-interference-size`.
   */
  static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I & others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...

Sesse__•3mo ago
This makes attempts at cargo-culting __attribute__((aligned(64))) without benchmarking even more hilarious. :-)
rnrn•3mo ago
It’s not a cargo cult if the actions directly cause cargo to arrive based on well understood mechanics.

Regardless of whether it would be better in some situations to align to 128 bytes, 64 bytes really is the cache line size on all common x86 cpus and it is a good idea to avoid threads modifying the same cacheline.

Sesse__•3mo ago
It indeed isn't, but I've seen my share of systems where nobody checked if cargo arrived. (The code was checked in without any benchmarks done, and after many years, it was found that the macros used were effectively no-ops :-) )
rnrn•3mo ago
> even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines

I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.

Yes, there can be additional destructive interference effects at that granularity, but that’s due to prefetching (of two cachelines with coherency managed independently) rather than having coherency operating on one 128 byte granule

AFAIK 64 bytes is still the correct granule for avoiding false sharing, with two cores modifying two consecutive cachelines having way less destructive interference than two cores modifying one cacheline.

j_seigh•3mo ago
A coherent cache is transparent to the memory model. So if someone trying to explain the memory model and ordering mentioned the cache as affecting the memory model, it was pretty much a sign they didn't fully understand what they were talking about.
wpollock•3mo ago
This post, along with the tutorial links it and the comments contain, provides good insight into caches, coherence, and related topics. I would like to add a link that I feel is also very good, maybe better:

<https://marabos.nl/atomics/hardware.html>

While the book this chapter is from is about Rust, this chapter is pretty much language-agnostic.

mannyv•3mo ago
The author talks about Java and cache coherency but ignores the fact that the JVM implementation sits above the processor's caches and, per the spec, can have different caches per thread.

This means you need to synchronize every shared access, whether it's a read or a write. In hardware systems you can cheat because a write usually performs a write-through; in a JVM that's not the case.

It's been a long time since I had to think about this, but it bit us pretty hard when we found that.