
Atlas: Manage your database schema as code

https://github.com/ariga/atlas
1•quectophoton•2m ago•0 comments

Geist Pixel

https://vercel.com/blog/introducing-geist-pixel
1•helloplanets•5m ago•0 comments

Show HN: MCP to get latest dependency package and tool versions

https://github.com/MShekow/package-version-check-mcp
1•mshekow•13m ago•0 comments

The better you get at something, the harder it becomes to do

https://seekingtrust.substack.com/p/improving-at-writing-made-me-almost
2•FinnLobsien•14m ago•0 comments

Show HN: WP Float – Archive WordPress blogs to free static hosting

https://wpfloat.netlify.app/
1•zizoulegrande•16m ago•0 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
1•melvinzammit•16m ago•0 comments

Sony BMG copy protection rootkit scandal

https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootkit_scandal
1•basilikum•19m ago•0 comments

The Future of Systems

https://novlabs.ai/mission/
2•tekbog•19m ago•1 comments

NASA now allowing astronauts to bring their smartphones on space missions

https://twitter.com/NASAAdmin/status/2019259382962307393
2•gbugniot•24m ago•0 comments

Claude Code Is the Inflection Point

https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point
3•throwaw12•25m ago•1 comments

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

https://github.com/microclaw/microclaw
1•everettjf•26m ago•2 comments

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

https://github.com/AleatorAI/OMNI-BLAS
1•LowSpecEng•26m ago•1 comments

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

https://codemanship.wordpress.com/2026/01/05/the-ai-ready-software-developer-conclusion-same-game...
1•lifeisstillgood•28m ago•0 comments

AI Agent Automates Google Stock Analysis from Financial Reports

https://pardusai.org/view/54c6646b9e273bbe103b76256a91a7f30da624062a8a6eeb16febfe403efd078
1•JasonHEIN•32m ago•0 comments

Voxtral Realtime 4B Pure C Implementation

https://github.com/antirez/voxtral.c
2•andreabat•34m ago•1 comments

I Was Trapped in Chinese Mafia Crypto Slavery [video]

https://www.youtube.com/watch?v=zOcNaWmmn0A
2•mgh2•40m ago•0 comments

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

https://www.cbp.gov/newsroom/stats/reported-employee-arrests
1•ludicrousdispla•42m ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•47m ago•1 comments

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

https://github.com/thealidev/VectorVision-SVGV
1•thealidev•49m ago•0 comments

Study of 150 developers shows AI generated code no harder to maintain long term

https://www.youtube.com/watch?v=b9EbCb5A408
1•lifeisstillgood•49m ago•0 comments

Spotify now requires premium accounts for developer mode API access

https://www.neowin.net/news/spotify-now-requires-premium-accounts-for-developer-mode-api-access/
1•bundie•52m ago•0 comments

When Albert Einstein Moved to Princeton

https://twitter.com/Math_files/status/2020017485815456224
1•keepamovin•53m ago•0 comments

Agents.md as a Dark Signal

https://joshmock.com/post/2026-agents-md-as-a-dark-signal/
2•birdculture•55m ago•0 comments

System time, clocks, and their syncing in macOS

https://eclecticlight.co/2025/05/21/system-time-clocks-and-their-syncing-in-macos/
1•fanf2•56m ago•0 comments

McCLIM and 7GUIs – Part 1: The Counter

https://turtleware.eu/posts/McCLIM-and-7GUIs---Part-1-The-Counter.html
2•ramenbytes•59m ago•0 comments

So whats the next word, then? Almost-no-math intro to transformer models

https://matthias-kainer.de/blog/posts/so-whats-the-next-word-then-/
1•oesimania•1h ago•0 comments

Ed Zitron: The Hater's Guide to Microsoft

https://bsky.app/profile/edzitron.com/post/3me7ibeym2c2n
2•vintagedave•1h ago•1 comments

UK infants ill after drinking contaminated baby formula of Nestle and Danone

https://www.bbc.com/news/articles/c931rxnwn3lo
1•__natty__•1h ago•0 comments

Show HN: Android-based audio player for seniors – Homer Audio Player

https://homeraudioplayer.app
3•cinusek•1h ago•2 comments

Starter Template for Ory Kratos

https://github.com/Samuelk0nrad/docker-ory
1•samuel_0xK•1h ago•0 comments

Myths Programmers Believe about CPU Caches (2018)

https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/
156•whack•3mo ago

Comments

yohbho•3mo ago
2018, discussed on HN in 2023: https://news.ycombinator.com/item?id=36333034

  Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)
  176 points by whack on June 14, 2023 | 138 comments
breppp•3mo ago
Very interesting. I am hardly an expert, but this gives the impression that if it weren't for that meddling software, we would all live in a synchronous world.

This ignores store buffers, and consequently memory fencing, which is the basis for the nightmarish std::memory_order, the worst API documentation you will ever meet.

jeffbee•3mo ago
If the committee had had any good taste at all, they would have thrown DEC AXP overboard before C++11, which would have cut the majority of the words out of the specification for std::memory_order. It was only the obsolete and nearly impossible-to-program Alpha CPU that required the ultimate level of complexity.
gpderetta•3mo ago
If Alpha never existed, memory_order_consume would still exist. But we have done this discussion before.
QuaternionsBhop•3mo ago
Since the CPU does cache coherency transparently, perhaps there should be some way for an application to promise that it is well-behaved in exchange for access to a lower-level, non-transparent instruction set that manages cache coherency manually from the application level. Or perhaps applications can never be trusted with that level of control over the hardware. The MESI model reminded me of Rust's ownership and borrowing. The same pattern appears in OpenGL vs. Vulkan drivers: implicit sync vs. explicit sync. Yet another example is the cache-management work involved in squeezing maximum throughput out of CUDA on an enterprise GPU.
cpgxiii•3mo ago
There are some knobs that newer processors give for cache control, mostly to partition or reserve cache space to improve security or reduce cache contention between processes.

Actual manual cache management is far too much of an implementation detail for a general-purpose CPU to expose; doing so would deeply tie code to a specific set of processor behaviors. Cache sizes and even hierarchies change often between processor generations, and some internal cache behavior has changed within a generation as a result of microcode updates and/or hardware steppings. Exposed cache control would be like MIPS exposing delay slots, but much worse: stale delay-slot assumptions really only turn into performance issues, whereas stale cache-control assumptions would easily turn into correctness issues.

Really the only way to make this work is for the final compilation/"specialization" step to occur on the specific device in question, like with a processor using binary translation (e.g. Transmeta, Nvidia Denver) or specialization (e.g. Mill) or a system that effectively enforces runtime compilation (e.g. runtime shader/program compilation in OpenGL and OpenCL).

groundzeros2015•3mo ago
Did this article share more than one myth?

The reason programmers don't believe in cache coherency is that they have experienced a closely related phenomenon: memory reordering. This requires you to use a memory fence when accessing a value shared between multiple cores, as if the caches needed to be synchronized.

Lwerewolf•3mo ago
I'm pretty sure that most cases of x86 reordering issues are a matter of the compiler reordering things, which isn't (AFAIK) solved with just "volatile". Caveat: I haven't dealt with this (multicore sync without OS primitives in general) for over a year.
nly•3mo ago
x86 has a Total Store Order (TSO) memory model, which effectively means (in a mental model where only 1 shared memory operation happens at once and completes before the next) stores are queued but loads can be executed immediately even if stores are queued in the store buffer.

On a single core a load can be served from the store buffer (queue), but other cores can't see those stores yet, which is where all the inconsistencies come from.

fer•3mo ago
The volatile keyword in Java exists precisely to make the Java compiler insert memory barriers. That only guarantees consistent reads and writes; it doesn't make compound operations thread-safe (i.e. write-after-read), which is what atomics are for.

Also, it's not only compilers that reorder things; most processors nowadays do out-of-order execution. Even if the order produced by the compiler is perfect in theory, different latencies for different instruction operands may cause later instructions to execute earlier so the CPU doesn't stall.

zozbot234•3mo ago
Note that this is only true of the Java/C# volatile keyword. The volatile keyword in C/C++ is solely about direct access in hardware to memory-mapped locations, for such purposes as controlling external devices; it is entirely unrelated to the C11 memory model for concurrency, does not provide the same guarantees, and should never be used for that purpose.
igtztorrero•3mo ago
In Go there are sync.Mutex, sync/atomic, and channels to create this fence and prevent data races. I prefer sync.Mutex.

Does anyone understand how Go handles the CPU cache?

groundzeros2015•3mo ago
Yes. Locks will use a memory fence. More advanced programs will need fence without locking.
daemontus•3mo ago
I may be completely out of line here, but isn't the story on ARM very, very different? I vaguely recall that the whole point of having things like weak atomics is that on x86 they don't do anything extra, but on ARM they are essential for memory ordering? But then again, I may just be conflating memory ordering and coherency.
jeffbee•3mo ago
Well, since this is a thread about how programmers use the wrong words to model how they think a CPU cache works, I think it bears mentioning that you've used "atomics" here to mean something irrelevant. It is not true that x86 atomics do nothing. Atomic instructions or, on x86, their prefix, make a naturally non-atomic operation such as a read-modify-write atomic. The ARM ISA actually lacked such a facility until ARMv8.1.

The instructions to which you refer are not atomics, but rather instructions that influence the ordering of loads and stores. x86 has total store ordering by design. On ARM, the program has to use LDAR/STLR to establish ordering.

phire•3mo ago
Everything it says about cache coherency is exactly the same on ARM.

Memory ordering has nothing to do with cache coherency; it's all about what happens inside the CPU pipeline. On ARM, reads and writes can be reordered within the pipeline, before they ever hit the caches (which are still fully coherent).

ARM still has strict memory ordering for code within a single core (some older processors do not), but the writes from one core might become visible to other cores in the wrong order.

gpderetta•3mo ago
you are getting downvoted, but you are of course correct.
gpderetta•3mo ago
Normally when talking about relaxed memory models, full cache coherency is still assumed. For example the C++11 memory model cannot be implemented on a non-cache-coherent system, at least not without massive performance penalties.
ashvardanian•3mo ago
Here's my favorite practically applicable cache-related fact: even on x86, on recent server CPUs, cache-coherency protocols may operate at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines. Some thread-pool implementations, like Crossbeam in Rust and my ForkUnion in Rust and C++, explicitly document that and align objects to 128 bytes [1]:

  /**
   *  @brief Defines variable alignment to avoid false sharing.
   *  @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
   *  @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
   *
   *  The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
   *
   *  @code{.cpp}
   *  #if defined(__cpp_lib_hardware_interference_size)
   *  static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
   *  #else
   *  static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
   *  #endif
   *  @endcode
   *
   *  That however results into all kinds of ABI warnings with GCC, and suboptimal alignment choice,
   *  unless you hard-code `--param hardware_destructive_interference_size=64` or disable the warning
   *  with `-Wno-interference-size`.
   */
  static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I & others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...

Sesse__•3mo ago
This makes attempts of cargo-culting __attribute__((aligned(64))) without benchmarking even more hilarious. :-)
rnrn•3mo ago
It’s not a cargo cult if the actions directly cause cargo to arrive based on well understood mechanics.

Regardless of whether it would be better in some situations to align to 128 bytes, 64 bytes really is the cache line size on all common x86 cpus and it is a good idea to avoid threads modifying the same cacheline.

Sesse__•3mo ago
It indeed isn't, but I've seen my share of systems where nobody checked if cargo arrived. (The code was checked in without any benchmarks done, and after many years, it was found that the macros used were effectively no-ops :-) )
rnrn•3mo ago
> even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines

I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.

Yes, there can be additional destructive interference effects at that granularity, but that’s due to prefetching (of two cachelines with coherency managed independently) rather than having coherency operating on one 128 byte granule

AFAIK 64 bytes is still the correct granule for avoiding false sharing, with two cores modifying two consecutive cachelines having way less destructive interference than two cores modifying one cacheline.

j_seigh•3mo ago
A coherent cache is transparent to the memory model. So if someone trying to explain the memory model and ordering mentions the cache as affecting the memory model, it's pretty much a sign they don't fully understand what they're talking about.
wpollock•3mo ago
This post, along with the tutorial links it and the comments contain, provides good insight into caches, coherence, and related topics. I would like to add a link that I feel is also very good, maybe better:

<https://marabos.nl/atomics/hardware.html>

While the book this chapter is from is about Rust, this chapter is pretty much language-agnostic.

mannyv•3mo ago
The author talks about Java and cache coherency but ignores the fact that the JVM implementation sits above the processor's caches and, per the spec, can maintain different caches per thread.

This means you need to synchronize every shared access, whether it's a read or a write. In hardware you can cheat because a write usually behaves as a write-through. In a JVM that's not the case.

It's been a long time since I had to think about this, but it bit us pretty hard when we found that.