Spinlocks vs. Mutexes: When to Spin and When to Sleep

https://howtech.substack.com/p/spinlocks-vs-mutexes-when-to-spin

84•birdculture•2mo ago

Comments

Tom1380•2mo ago

I didn't know about using alignment to avoid cache bouncing. Fascinating stuff

bitexploder•2mo ago

Yep. Super important in lock free synchronization primitives like ring buffers. Cache line padded atomics are really cool :)
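To make the padding idea concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer whose head and tail indices sit on separate cache lines. The names `SpscRing`, `kCacheLine`, and `spsc_demo` are made up for illustration, and the 64-byte line size is an assumption rather than something guaranteed on every CPU.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>
#include <thread>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// Hypothetical single-producer/single-consumer ring buffer.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");

public:
    // Call from the producer thread only.
    bool try_push(const T& v) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Call from the consumer thread only.
    std::optional<T> try_pop() {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // release the slot
        return v;
    }

private:
    // Each index gets its own cache line, so the producer's writes to tail_
    // and the consumer's writes to head_ do not invalidate the same line.
    alignas(kCacheLine) std::atomic<std::size_t> head_{0};
    alignas(kCacheLine) std::atomic<std::size_t> tail_{0};
    alignas(kCacheLine) std::array<T, N> buf_{};
};

// Pushes 0..count-1 from one thread, pops them on another,
// and returns the sum observed by the consumer.
long long spsc_demo(int count) {
    SpscRing<int, 1024> ring;
    long long sum = 0;
    std::thread consumer([&] {
        for (int received = 0; received < count;) {
            if (auto v = ring.try_pop()) {
                sum += *v;
                ++received;
            } else {
                std::this_thread::yield();
            }
        }
    });
    for (int i = 0; i < count; ++i) {
        while (!ring.try_push(i)) std::this_thread::yield();
    }
    consumer.join();
    return sum;
}
```

Removing the two `alignas` specifiers on the indices leaves the code correct; only the cache behaviour changes, which is why the effect is easy to miss without measuring.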

staticfloat•2mo ago

I love that this article includes a test program at the bottom to allow you to verify its claims.

markisus•2mo ago

Where do lock free algorithms fall in this analysis?

raggi•2mo ago

Some considerations can be similar, but the total set of details is different.

It also depends which lock free solutions you're evaluating.

Some are higher order spins (more similar high level problems), others have different secondary costs (such as copying). A common overlap is the inter-core, inter-thread, and inter-package side effects of synchronization points, for a lot of stuff with a strong atomic in the middle that'll be stuff like sync instruction costs, pipeline impacts of barriers/fences, etc.

bluecalm•2mo ago

In a very basic case lock free data structures make threads race instead of spin. A thread makes their copy of a part of list/tree/whatever it needs updating, introduces changes to that copy and then tries to substitute their own pointer for the data structure pointer if it hasn't changed in the meantime (there is a CPU atomic instruction for that). If the thread fails (someone changed the pointer in the meantime) it tries again.
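The retry pattern described above can be sketched with a push-only linked stack. This is an illustration, not a production structure: `PushOnlyStack` and `cas_push_demo` are hypothetical names, and popping is left out on purpose because it needs safe memory reclamation and ABA handling.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Hypothetical push-only stack: each push races to swing the head pointer.
class PushOnlyStack {
public:
    ~PushOnlyStack() {
        Node* n = head_.load(std::memory_order_relaxed);
        while (n) {
            Node* next = n->next;
            delete n;
            n = next;
        }
    }

    void push(int value) {
        // Snapshot the current head into the new node.
        Node* node = new Node{value, head_.load(std::memory_order_relaxed)};
        // Install the node only if head is still the snapshot. On failure,
        // compare_exchange_weak writes the current head into node->next,
        // so the loop simply retries with a fresh snapshot.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    // Only meaningful once all pushing threads have finished.
    std::size_t size() const {
        std::size_t n = 0;
        for (Node* p = head_.load(std::memory_order_acquire); p; p = p->next) ++n;
        return n;
    }

private:
    std::atomic<Node*> head_{nullptr};
};

// Runs `threads` workers that each push `per_thread` values; returns the node count.
std::size_t cas_push_demo(int threads, int per_thread) {
    PushOnlyStack stack;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&stack, per_thread] {
            for (int i = 0; i < per_thread; ++i) stack.push(i);
        });
    }
    for (auto& th : pool) th.join();
    return stack.size();
}
```

No push is ever lost, but under contention a thread may go around the loop several times, which is the wasted work the rest of the thread argues about.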

adrian_b•2mo ago

There are several classes of lock free algorithms with different properties.

Lock free algorithms for read-only access to shared data structures seldom have disadvantages (the exception is when the shared data structure is modified extremely frequently by writers, so the readers never succeed in reading it between changes), so for read-only access they are typically the best choice.
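One scheme with this property is a sequence lock, sketched below under the assumption of a single writer. The comment above does not name a specific algorithm, so treat the choice of a seqlock, and the names `SeqLockPair` and `seqlock_demo`, as illustrative. Readers never write to shared memory; they just retry if a write overlapped their read.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical single-writer sequence lock protecting a pair of values.
class SeqLockPair {
public:
    // Single writer only: an odd sequence number means "write in progress".
    void write(long x, long y) {
        const unsigned s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        a_.store(x, std::memory_order_relaxed);
        b_.store(y, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);
    }

    // Readers retry until they observe the same even sequence number
    // before and after reading the payload.
    std::pair<long, long> read() const {
        for (;;) {
            const unsigned s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1u) continue;  // writer is mid-update
            const long x = a_.load(std::memory_order_relaxed);
            const long y = b_.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq_.load(std::memory_order_relaxed) == s1) return {x, y};
        }
    }

private:
    std::atomic<unsigned> seq_{0};
    std::atomic<long> a_{0};
    std::atomic<long> b_{0};
};

// The writer publishes (i, 2*i) pairs; the reader counts inconsistent pairs.
long seqlock_demo(long writes) {
    SeqLockPair p;
    long torn = 0;
    std::thread reader([&] {
        for (;;) {
            auto [x, y] = p.read();
            if (y != 2 * x) ++torn;
            if (x == writes) break;  // saw the final write
        }
    });
    for (long i = 1; i <= writes; ++i) p.write(i, 2 * i);
    reader.join();
    return torn;
}
```

The failure mode mentioned above shows up in the `read` loop: if the writer updates continuously, a reader can keep retrying without ever completing.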

On the other hand lock free algorithms for read/write access to shared data structures must be used with great caution, because they frequently have a higher cost than using mutual exclusion. Such lock free algorithms are based on the optimistic assumption that your thread will complete the access before the shared structure is accessed by another competing thread. Whenever this assumption fails (which will happen when there is high contention) the transaction must be retried, which will lead to much more wasted work than the work that is wasted in a spinlock.

Lock free algorithms for read/write access are normally preferable only when it is certain that there is low contention for the shared resource, but in that case also a spinlock may waste negligible time.

The term "lock-free" is properly applied only to the access methods based on optimistic access instead of mutual exclusion (which uses locks).

However, there is a third kind of access methods, which use neither optimistic access nor mutual exclusion with locks, so some authors may conflate such methods together with the lock-free algorithms based on optimistic access.

This third kind of access method for shared data has properties very different from the other two kinds, so it is better considered separately. These methods are based on partitioning the shared data structure between the threads that access it concurrently, so they are applicable only to certain kinds of data structures, mainly arrays and queues. Nevertheless, almost any kind of inter-thread communication can be reorganized around message queues and shared buffers, so most applications that use either mutual exclusion with locks or lock-free optimistic access can be rewritten to use concurrent accesses to dynamically partitioned shared data structures. There the access is deterministic, unlike with lock-free optimistic algorithms, and there are no explicit locks. The partitioning itself must be done with atomic instructions (usually atomic fetch-and-add), which contain implicit locks, but these are equivalent to extremely short locked critical sections.
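A minimal sketch of that partitioning idea: workers claim distinct array slots with an atomic fetch-and-add and then write to their slot with no further synchronization. The function names `partitioned_fill` and `unclaimed_slots` are invented for this example.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Fills `slots` entries using `threads` workers. Each slot ends up holding
// the id of the worker that claimed it; -1 would mean "never claimed".
std::vector<int> partitioned_fill(std::size_t slots, int threads) {
    std::vector<int> out(slots, -1);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (;;) {
                // fetch_add hands every caller a unique index: no retry loop
                // and no explicit lock, just one atomic read-modify-write.
                const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
                if (i >= slots) break;
                out[i] = t;  // this slot belongs to this thread alone
            }
        });
    }
    for (auto& th : pool) th.join();  // join makes all writes visible
    return out;
}

// Counts slots that no worker claimed.
std::size_t unclaimed_slots(const std::vector<int>& v) {
    return static_cast<std::size_t>(std::count(v.begin(), v.end(), -1));
}
```

Unlike a compare-and-swap loop, every `fetch_add` succeeds on the first attempt, which is the determinism the comment refers to.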

gpderetta•2mo ago

Typically lock-free algorithms do not retry when a read happens concurrently with a write[1], so they scale very well for read mostly data or when data is partitioned between writers.

Scalability is always going to be poor when writers attempt to modify the same object, no matter the solution you implement.

[1] of course you could imagine a lock-free algorithm where reads actually mutate, but that's not common.

menaerus•2mo ago

> Scalability is always going to be poor when writers attempt to modify the same object, no matter the solution you implement.

MVCC.

gpderetta•2mo ago

Well, yes, that's one way of avoiding mutating the same object of course.

menaerus•2mo ago

Technically it is not because eventually it will be mutated, and that's one way of achieving the scalability in multiple writers scenario.

charleslmunger•2mo ago

>Critical section under 100ns, low contention (2-4 threads): Spinlock. You’ll waste less time spinning than you would on a context switch.

If your sections are that short then you can use a hybrid mutex and never actually park. Unless you're wrong about how long things take, in which case you'll save yourself.

>alignas(64) in C++

    std::hardware_destructive_interference_size

Exists so you don't have to guess, although in practice it'll basically always be 64.
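A small sketch of using the constant, assuming a C++17 toolchain. Not every standard library ships it, so this version checks the feature-test macro and otherwise falls back to 64, which is an assumption. `kDestructive`, `PaddedCounter`, and `adjacent_stride` are illustrative names.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructive = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kDestructive = 64;  // fallback assumption, not a guarantee
#endif

// One counter per cache line, so neighbouring counters do not false-share.
struct alignas(kDestructive) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(alignof(PaddedCounter) == kDestructive);
static_assert(sizeof(PaddedCounter) % kDestructive == 0);

// Distance in bytes between two adjacent array elements.
std::size_t adjacent_stride() {
    PaddedCounter pair[2];
    return static_cast<std::size_t>(reinterpret_cast<char*>(&pair[1]) -
                                    reinterpret_cast<char*>(&pair[0]));
}
```

The value is fixed at compile time, so it reflects what the compiler assumes about the target rather than what the machine running the binary actually has.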

The code samples also don't obey the basic best practices for spinlocks for x86_64 or arm64. Spinlocks should perform a relaxed read in the loop, and only attempt a compare and set with acquire order if the first check shows the lock is unowned. This avoids hammering the CPU with cache coherency traffic.

Similarly the x86 PAUSE instruction isn't mentioned, even though it exists specifically to signal spin sections to the CPU.
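As a rough sketch of both points, here is a test-and-test-and-set spinlock that spins on relaxed loads and issues PAUSE via `_mm_pause()` on x86-64. The names `TtasSpinlock`, `cpu_relax`, and `ttas_demo` are hypothetical, and the `yield` fallback for other architectures is an assumption, not a recommendation.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

// Spin-wait hint: PAUSE on x86-64, a plain yield elsewhere (assumed fallback).
inline void cpu_relax() {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause();
#else
    std::this_thread::yield();
#endif
}

// Hypothetical test-and-test-and-set spinlock.
class TtasSpinlock {
public:
    void lock() {
        for (;;) {
            bool expected = false;
            // Attempt the acquiring compare-and-set only if a cheap relaxed
            // read says the lock looks free.
            if (!locked_.load(std::memory_order_relaxed) &&
                locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
                return;
            }
            // Otherwise wait on read-only loads, which can be served from the
            // local cache instead of forcing ownership of the line each time.
            while (locked_.load(std::memory_order_relaxed)) cpu_relax();
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Increments a plain counter under the lock from several threads.
long ttas_demo(int threads, int iters) {
    TtasSpinlock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<TtasSpinlock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```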

Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated nonpreemptable cases; use a hybrid mutex. Spinning for consumer threads can be done in specialty exclusive thread per core cases where you want to minimize wakeup costs, but that's not the same as a spinlock which would cause any contending thread to spin.

magicalhippo•2mo ago

> Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated nonpreemptable cases; use a hybrid mutex

Yeah, pure spinlocks in user-space programs are a big no-no in my book. With a hybrid mutex, if you're on the happy path it costs you nothing extra in terms of performance, and if you for some reason slide off the happy path you have a sensible fallback.

raggi•2mo ago

> Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated nonpreemptable cases; use a hybrid mutex. Spinning for consumer threads can be done in specialty exclusive thread per core cases where you want to minimize wakeup costs, but that's not the same as a spinlock which would cause any contending thread to spin.

Very much this. Spins benchmark well but scale poorly.

charleshn•2mo ago

> std::hardware_destructive_interference_size Exists so you don't have to guess, although in practice it'll basically always be 64.

Unfortunately it's not quite true, due to e.g. spatial prefetching [0]. See e.g. Folly's definition [1].

[0] https://community.intel.com/t5/Intel-Moderncode-for-Parallel...

[1] https://github.com/facebook/folly/blob/d2e6fe65dfd6b30a9d504...

saagarjha•2mo ago

> std::hardware_destructive_interference_size

Of course, this is just the number the compiler thinks is good. It’s not necessarily the number that is actually good for your target machine.

surajrmal•2mo ago

Hybrid locks are also bad for overall system performance by maximizing local application performance. There is a reason default lock implementations from OS don't spin even a little bit.

menaerus•2mo ago

> There is a reason default lock implementations from OS don't spin even a little bit.

glibc pthread mutex uses a user-space spinlock to mitigate the syscall cost for uncontended cases.

charleslmunger•2mo ago

That depends on your workload. If you're making a game that's expected to use near 100% of system resources, or a real time service pinned to specific cores, your local application is the overall system.

surajrmal•1mo ago

Totally agree. However it's important to differentiate those workloads from the average workload which is to participate in a larger system.

nly•2mo ago

GNU libc posix mutexes do spin...

surajrmal•1mo ago

And I think it's a poor choice that causes worse system performance. Android's bionic doesn't spin, nor does Windows or Fuchsia. Avoiding the syscall overhead is generally detrimental to overall system performance, especially when the CPU load is high.

imtringued•2mo ago

This is nonsense. If the lock hasn't been acquired, you don't spin to begin with and if the lock has been acquired and the lock is being released shortly after, the spinning avoids a context switch. If the maximum number of retries has been reached, the thread was going to sleep anyway and starts scheduling the next thread (which was only delayed by the few attempted spins). This means in the worst case the next spin will only happen once all the other queued up threads have had their turn and that's assuming you're immediately running into another acquired lock.

surajrmal•1mo ago

It makes the worst case sufficiently bad and unfair that it makes things worse overall. If the lock is contended by a thread with higher priority, then the blocking thread will have its priority increased. Now if the thread that ends up getting the lock is one spinning on it rather than the actual high-priority one, this will repeat, leading to large latency for the high-priority thread and a lot of misallocated CPU time for a lower-priority thread.

Spinning on a CAS is far more expensive than spinning on most other instructions, as it affects all cores that may try to access that cache line, which may include things other than the lock itself.

Also consider how the system acts under high CPU load. You will end up with threads holding locks while not running, so most of the time that you miss the lock you spin all 100 times. This just exacerbates the CPU load issues even more. Hybrid locks are only helpful under lower CPU load.

nly•2mo ago

The PAUSE instruction isn't actually as good as it used to be. In, iirc, Skylake Intel massively increased the latency to improve utilisation under hyperthreading. The latency of this instruction is now really high.

Most people using spinlocks really care about latency, and many will have hyperthreading disabled to reduce jitter

SkiFire13•2mo ago

If the PAUSE instruction is too fast doesn't that kinda defeat its purpose?

menaerus•2mo ago

Yeah, I think so too now that I read some documentation about it. It appears that the main issue with the spinlock pattern is that it incurs "a severe performance penalty when exiting the [spinlock] loop because it [CPU] detects a possible memory order violation." [0].

~10 years ago, on Haswell, it took ~9 cycles to retire, and from Skylake onward, with some exceptions, it takes an order of magnitude more: ~140 cycles.

These numbers alone suggest that it really messes hard with the CPU pipeline, perhaps BP (?) or speculative execution (?) or both (?) such that it will basically force the CPU to flush the whole pipeline. This is at least how I read this. I will remember this instruction as the "damage control" instruction from now on.

[0] https://www.felixcloutier.com/x86/pause

nly•1mo ago

Not sure if you'll see this now, but the actual reason you want to use it is as a speculation barrier and a hint to various predictors.

Lfence is the better choice these days.

menaerus•2mo ago

Some things from the article are debatable for sure, and some are maybe missing, like the PAUSE instruction you mention, which I also wasn't aware of, but generally speaking I thought it was really good content. Lean system engineering skills applied to real-world problems. I especially appreciated the examples of large-scale infra codebases doing it in practice.

EdSchouten•2mo ago

I don’t understand why I would need to care about this. Can’t my operating system and/or pthread library sort this out by itself?

senderista•2mo ago

Pretty much, given that any decent pthreads implementation will offer an adaptive mutex. Unless you really need a mutex the size of a single bit or byte (which likely implies false sharing), there's little reason to ever use a pure spinlock, since a mutex with adaptive spinning (up to context switch latency) gives you the same performance for short critical sections without the disastrous worst-case behavior.
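To illustrate the shape of such a mutex, here is a deliberately simplified spin-then-block wrapper around `std::mutex`. Real adaptive mutexes are implemented inside the threading library and tune their spin budget; the class name `SpinThenBlockMutex`, the fixed default limit of 100, and `hybrid_demo` are assumptions made for this sketch.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical hybrid lock: a bounded number of try_lock attempts,
// then an ordinary blocking lock.
class SpinThenBlockMutex {
public:
    explicit SpinThenBlockMutex(int spin_limit = 100) : spin_limit_(spin_limit) {}

    void lock() {
        // Fast path and short waits: poll without parking the thread.
        // A production version would also issue a CPU spin hint here.
        for (int i = 0; i < spin_limit_; ++i) {
            if (m_.try_lock()) return;
        }
        // Spin budget exhausted: block in the normal way, so a long hold
        // costs a sleep rather than unbounded spinning.
        m_.lock();
    }

    bool try_lock() { return m_.try_lock(); }
    void unlock() { m_.unlock(); }

private:
    std::mutex m_;
    int spin_limit_;
};

// Increments a plain counter under the hybrid lock from several threads.
long hybrid_demo(int threads, int iters) {
    SpinThenBlockMutex mu;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<SpinThenBlockMutex> guard(mu);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The spin limit is the knob the surrounding discussion disagrees about: set too high it behaves like a pure spinlock under load, set to zero it is just a plain mutex.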

nly•2mo ago

Some people don't want to block for a microsecond when their lock goes 1ns over your adaptive mutex's spin deadline. That kind of jitter is unacceptable.

senderista•2mo ago

I assume those people are already running 1 pinned thread/core and have no issues with unbounded spinning in the first place. In which case, go nuts.

baobun•2mo ago

General heuristics only get you so far and at the limit come with their own overhead compared to what you can do with a tailored solution with knowledge about your usage and data access patterns. The cases where this makes a practical difference for higher-level apps are rare but they exist.

haileys•2mo ago

Please just don't use spinlocks in userland code. It's really not the appropriate mechanism.

Your code will look great in your synthetic benchmarks and then it will end up burning CPU for no good reason in the real world.

nly•2mo ago

Burning CPU is preferable in some industries where latency is all that matters.

bob1029•2mo ago

Gaming and high frequency trading are the most obvious examples where this is desirable.

If you adjust the multimedia timer to its highest resolution (1ms on windows), sleeping is still a non-starter. Even if the sleep was magically 0ms whenever needed, you still have risk of context switching wrecking your cache and jacking up memory bandwidth utilization.

masklinn•2mo ago

Even outside of such cases, if your contention is low and your critical section short, spinning a few rounds to avoid a syscall is likely to be a gain not just in terms of latency but also in terms of cycles wasted.

imtringued•2mo ago

Ok, but you do realize that you're now deep in the realm of real time Linux and you're supposed to allocate entire CPU cores to individual processes?

What I'm trying to express here is that the spinlock isn't some special tool that you pull out of the toolbox to make something faster and call it a day.

It's like a cryogenic superconductor that requires extreme caution to use properly. It's something you avoid doing because it's a pain in the ass.

gpderetta•2mo ago

Exclusively allocating a cpu to a specific thread is not exactly rocket science and it is a fairly mundane task.
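For reference, a Linux-only sketch of the pinning half of that task, using the glibc extensions `pthread_getaffinity_np` and `pthread_setaffinity_np`. The helper name is hypothetical, and the code assumes a glibc-style environment where `_GNU_SOURCE` is defined (as g++ does by default).

```cpp
#include <pthread.h>
#include <sched.h>

// Pins the calling thread to the first CPU in its current affinity mask.
// Returns the chosen CPU id, or -1 on failure.
int pin_current_thread_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (pthread_getaffinity_np(pthread_self(), sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);  // mask containing only this CPU
        if (pthread_setaffinity_np(pthread_self(), sizeof(one), &one) != 0) {
            return -1;
        }
        return cpu;
    }
    return -1;  // empty mask: nothing to pin to
}
```

Pinning only restricts where this thread may run. Keeping other threads off that core is a separate step, done at the system level (for example with cpusets or isolated CPUs).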

gebdev•2mo ago

Loved this article. It showed how lacking my knowledge is in how operating systems implement concurrency primitives. It motivated me to do a bunch of research and learn more.

Notably the claim about how atomic operations clear the cache line in every cpu. Wow! Shared data can really be a performance limitation.

jcalvinowens•2mo ago

> The Linux kernel learned this the hard way. Early 2.6 kernels used spinlocks everywhere, wasting 10-20% CPU on contended locks because preemption would stretch what should’ve been 100ns holds into milliseconds. Modern kernels use mutexes for most subsystems.

That's not accurate: the scalability improvements in Linux are a result of broadly eliminating serialization, not something as trivial as using a different locking primitive. The BKL didn't go away until 2.6.37! As much as "spinlock madness" might make a nice little story, it's just simply not true.
