An RW lock could be implemented using an array with one slot per reader and proper padding to ensure each slot is in its own cache line (so reads/writes to different slots don't invalidate each other's CPU caches).
For a read lock: each task acquires the lock for its own slot.
For a write lock: acquire the locks from the leftmost slot to the rightmost. Writers can starve readers: while a writer moves left to right and blocks on an in-flight reader at some slot, it already holds all the earlier slots, so new readers on those slots are blocked too.
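A minimal sketch of that scheme in Rust (names, the slot count, and the 64-byte line size are all assumptions; real implementations like Linux's big-reader locks or BRAVO are more involved):

```rust
use std::sync::RwLock;

const SLOTS: usize = 8;

// Pad each slot to its own cache line (64 bytes is a common but not
// universal line size) so readers on different cores don't bounce
// the same line between caches.
#[repr(align(64))]
struct Slot(RwLock<()>);

struct BigRwLock {
    slots: Vec<Slot>,
}

impl BigRwLock {
    fn new() -> Self {
        BigRwLock {
            slots: (0..SLOTS).map(|_| Slot(RwLock::new(()))).collect(),
        }
    }

    // Reader: take the read side of one slot only, e.g. picked by
    // hashing the thread id. Readers on different slots never contend.
    fn read<R>(&self, slot: usize, f: impl FnOnce() -> R) -> R {
        let _g = self.slots[slot % SLOTS].0.read().unwrap();
        f()
    }

    // Writer: acquire every slot left to right. While it waits on an
    // in-flight reader at slot k it already holds slots 0..k, which is
    // where the reader starvation described above comes from.
    fn write<R>(&self, f: impl FnOnce() -> R) -> R {
        let guards: Vec<_> = self.slots.iter().map(|s| s.0.write().unwrap()).collect();
        let r = f();
        drop(guards);
        r
    }
}

fn main() {
    let lock = BigRwLock::new();
    let got = lock.write(|| 1) + lock.read(3, || 41);
    assert_eq!(got, 42);
    println!("ok");
}
```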
I do not know how Rust RW locks are implemented.
In any case, optimizing this well would require a lot more knowledge of what’s going on under the hood. What are the keys? Can the entire map be split into several maps? Can a reader hold the rwlock across multiple lookups? Is a data structure using something like RCU an option?
One example is folly::SharedMutex, which is very battle-tested: https://uvdn7.github.io/shared-mutex/
There are more sophisticated techniques such as RCU or hazard pointers that make synchronization overhead almost negligible for readers, but they generally require designing the algorithms around them and are not drop-in replacements for a simple mutex, so a good RW mutex implementation is a reasonable default.
Done well, this pattern gives you nearly free reads and cheap writes, sometimes cheaper than a lock.
For frequent writes, a good RWLock is often better, since RCU can degrade rapidly and badly under write contention.
whizzter•1h ago
hansvm•1h ago
We also tossed in an A/B system, so reads aren't delayed even while writes are happening; they just get stale data (also fine for our purposes).
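The A/B idea can be sketched as a double buffer: readers always see the published copy, the writer prepares the spare copy and then flips an atomic index. This single-writer sketch is illustrative only (the `&mut self` on `write` means Rust's borrow rules serialize it against readers; a concurrent version needs interior mutability plus some reclamation scheme):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Two copies of the data; `live` is the index readers should use.
// Readers never block; they may briefly see the stale copy.
struct DoubleBuffer<T> {
    bufs: [T; 2],
    live: AtomicUsize,
}

impl<T: Clone> DoubleBuffer<T> {
    fn new(v: T) -> Self {
        DoubleBuffer { bufs: [v.clone(), v], live: AtomicUsize::new(0) }
    }

    fn read(&self) -> &T {
        &self.bufs[self.live.load(Ordering::Acquire)]
    }

    // Single writer: copy current state into the spare buffer,
    // mutate it, then publish by flipping the index.
    fn write(&mut self, f: impl FnOnce(&mut T)) {
        let live = self.live.load(Ordering::Relaxed);
        let spare = 1 - live;
        self.bufs[spare] = self.bufs[live].clone();
        f(&mut self.bufs[spare]);
        self.live.store(spare, Ordering::Release);
    }
}

fn main() {
    let mut db = DoubleBuffer::new(vec![1, 2, 3]);
    db.write(|v| v.push(4));
    assert_eq!(db.read(), &vec![1, 2, 3, 4]);
    println!("ok");
}
```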
the_duke•39m ago
It's essentially just an atomic pointer that can be swapped out.
[1] https://docs.rs/arc-swap/latest/arc_swap/
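A toy sketch of that swappable-pointer idea using only the standard library (this version deliberately leaks old snapshots instead of freeing them, because a reader may still hold a reference; safely reclaiming old values is exactly the hard part that arc-swap solves with `Arc`):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Readers load the pointer and dereference; writers swap in a new
// heap-allocated snapshot. No locks on either path.
struct Snapshot<T> {
    ptr: AtomicPtr<T>,
}

impl<T> Snapshot<T> {
    fn new(v: T) -> Self {
        Snapshot { ptr: AtomicPtr::new(Box::into_raw(Box::new(v))) }
    }

    fn load(&self) -> &T {
        // Sound here only because old snapshots are never freed.
        unsafe { &*self.ptr.load(Ordering::Acquire) }
    }

    fn store(&self, v: T) {
        let _old = self.ptr.swap(Box::into_raw(Box::new(v)), Ordering::AcqRel);
        // Deliberately leak _old: a concurrent reader may still be using it.
    }
}

fn main() {
    let cfg = Snapshot::new(String::from("v1"));
    cfg.store(String::from("v2"));
    assert_eq!(cfg.load().as_str(), "v2");
    println!("ok");
}
```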
gpderetta•56m ago
A specific microarchitecture might alleviate this a bit with lower latency cross-core communication, but the solution (using a single naive RW lock to protect the cache) is inherently non-scalable.
PunchyHamster•46m ago