Spinning around: Please don’t – Common problems with spin locks

https://www.siliceum.com/en/blog/post/spinning-around/

157•bdash•1w ago

Comments

gafferongames•1w ago

Great article! Thanks for posting this.

CamperBob2•1w ago

Sheesh. Can something this complicated ever truly be said to work?

nh23423fefe•1w ago

Isn't it the opposite? The complication is evidence of function. The simple code doesn't work.

kelnos•1w ago

That assertion feels suspiciously like a logical fallacy.

maxbond•1w ago

Not really. If the solution has less complexity than is inherent in the problem, it can't possibly work. If the solution has complexity equal to or greater than the complexity inherent in the problem, it may work. So if you see complex code handling many different edge cases, you can take that as an indicator the author understood the problem. That doesn't mean they do understand or that the solution does work; only that you have more confidence than you did initially.

It's a weak signal but the reasoning is sound.

pyrolistical•1w ago

Everything should be made as simple as possible, but not simpler.

Code has a minimum complexity to solve the problem

pixl97•1w ago

Not really. A different place to look for this is in chemical reactions and things biological life does.

You may have some simple chemical life needs, and life may have some other simple chemical it can use to get the needed simple chemical, but the processing steps are complex and limited by physics themselves. Evolution almost always finds a path of using the minimum activation energy to let these reactions occur. Trying to make the process simpler just doesn't get you what you need.

bluGill•1w ago

You can limit yourself to the performance of a 1mhz 6502 with no OS if you don't like it. Even MSDos on a 8086 with 640K ram allows for things that require complexity of this type (not spin locks, but the tricks needed to make "terminate stay resident" work are evil in a similar way)

yjftsjthsd-h•1w ago

I don't think that's fair. You can go fast, just not more than one task at a time.

bluGill•1w ago

Modern CPUs (since around 2000) go faster in large part because they have multiple cores that can do more than one thing in a time. If your program needs to go faster using more cores is often your best answer and then you will need these tricks. (SIMD or the GPU are also common answers that might or might not be better for your problem)

yjftsjthsd-h•1w ago

Modern CPUs can do 4-5 GHz singled threaded. (Sometimes you can even get a higher clock speed by disabling other cores.) This somewhat outpaces "a 1mhz 6502" even without parallelization.

bluGill•1w ago

They can, but nobody runs a single process on such CPUs. They run some form of OS which implements spinlock, mutexes, and all these other complex things.

I suppose someplace someone is running an embedded system without an OS on such a processor - but I'd expect they are still using extra cores and so have all of the above tricks someplace.

pjmlp•1w ago

I never get the single threaded assertions regarding CPU performance, it is mostly useless in the day of premptive scheduling in modern OSes.

Yes it matters on MS-DOS like OS design, like some embedded deployments and that is about it.

It is even impossible to guarantee a process doesn't get rescheduled into another CPU with the performance impact it entails, unless the process explicitly sets its CPU affinity.

bluGill•1w ago

If you don't allow complev things like spinlocks then all that is left is single thread performance.

pjmlp•1w ago

Except that ignores the amount of times the OS preempts the thread, or moves it into another CPU trashing all the cache contents in the process, and related NUMA patterns.

The way it is measured, is mostly ideal, assuming that threads run to completion without any of those side effects taking place.

adrr•1w ago

OS kernel runqueue is using a spinlock to schedule everything. So it works. Should you ever use a spinlock in application code? No. Let the OS via the synchronization primitives in whatever language your app is in.

direwolf20•1w ago

Yes, if you're careful. Actually careful, not pretend careful. Which is pretty normal in C and C++.

lmm•1w ago

Probably not, not without formal verification which is usually lacking.

Everyone's computers hang or get slow some of the time. Probably all of our locks have bugs in them, but good luck getting to the bottom of that, right now the industry is barely capable of picking a sorting algorithm that actually works.

imtringued•1w ago

It works if there is no scheduler, or you tell the scheduler what you're doing.

Turns out the first scenario is rare outside of embedded or OS development. The second scenario defeats the purpose because you're doing the same thing a mutex would be doing. It's not like mutexes were made slow on purpose to bully people. They're actually pretty fast.

jcranmer•1w ago

The basic rule of writing your own cross-thread datastructures like mutexes or condition variables is... don't, unless you have very good reason not to. If you're in that rare circumstance where you know the library you're using isn't viable for some reason, then the next best rule is to use your OS's version of a futex as the atomic primitive, since it's going to solve most of the pitfalls for you automatically.

The only time I've manually written my own spin lock was when I had to coordinate between two different threads, one of which was running 16-bit code, so using any library was out of the question, and even relying on syscalls was sketchy because making sure the 16-bit code is in the right state to call a syscall itself is tricky. Although in this case, since I didn't need to care about things like fairness (only two threads are involved), the spinlock core ended up being simple:

    "thunk_spin:",
        "xchg cx, es:[{in_rv}]",
        "test cx, cx",
        "jnz thunk_has_data",
        "pause",
        "jmp thunk_spin",
    "thunk_has_data:",

kccqzy•1w ago

Another time when writing a quick and dirty spinlock is reasonable is inside a logging library. A logging library would normally use a full-featured mutex, but what if we want the mutex implementation to be able to log? Say the mutex can log that it is non recursive yet the same thread is acquiring it twice; or that it has detected a deadlock. The solution is to introduce a special subset of the logging library to use a spinlock.

wizzwizz4•1w ago

I'm not sure how a spinlock solves this problem. Wouldn't that just cause the process to hang busy?

direwolf20•1w ago

Only until the other thread leaves the logger

wizzwizz4•1w ago

Oh, I see: the spinlock is for logging the deadlocks of other mutices, not for magically remediating deadlocks.

fasterik•1w ago

As always: use standard libraries first, profile, then write your own if the data indicate that it's necessary. To your point, the standard library probably already uses the OS primitives under the hood, which themselves do a short userspace spin-wait and then fall back to a kernel wait queue on contention. If low latency is a priority, the latter might be unacceptable.

The following is an interesting talk where the author used a custom spinlock to significantly speed up a real-time physics solver.

Dennis Gustafsson – Parallelizing the physics solver – BSC 2025 https://www.youtube.com/watch?v=Kvsvd67XUKw

Lectem•1w ago

> which themselves do a short userspace spin-wait and then fall back to a kernel wait queue on contention.

Yes, but sadly not all implementations... The point remains that you should prefer OS primitives when you can, profile first, reduce contention, and then only, maybe, if you reeeally know what you're doing, on a system you mostly know and control, then perhaps you may start doing it yourself. And if you do, the fallback under contention must be the OS primitive

wallstop•1w ago

I wrote my own spin lock library over a decade ago in order to learn about multi threading, concurrency, and how all this stuff works. I learned a lot!

squirrellous•1w ago

Another somewhat known case of a spinlock is in trading, where for latency purposes the OS scheduler is essentially bypassed by core isolation and thread pinning, so there’s nothing better for the CPU to do than spinning.

imtringued•1w ago

This is the primary use case for spinlocks, which is why the vast majority of developers shouldn't use them. When you use a spinlock, you're dedicating an entire CPU core to the thread or else it doesn't work in terms of correctness or performance.

If you want scheduling, then the scheduler needs to be aware of task dependencies and you must accept that your task will be interrupted.

When a lock is acquired on resource A by the first thread, the second thread that tries to acquire A will have a dependency on the release of A, meaning that it can only be scheduled after the first thread has left the critical section. With a spinlock, the scheduler is not informed of the dependency and thinks that the spinlock is performing real work, which is why it will reschedule waiting threads even if resource A has not been released yet.

If you do thread pinning and ensure there are less threads than CPU cores, but still have other threads be scheduled on those cores, it might still work, but the latency benefits are most likely gone.

jeffbee•1w ago

"Unfair" paragraph is way too short. This is the main problem! The outlier starvation you get from contended spinlocks is extraordinary and, hypothetically, unbounded.

tialaramex•1w ago

Well, you need to have specified what you actually want. "Fair" sounds like it's just good, but it's expensive, so unless you know that you need it, which probably means knowing why, you probably don't want to pay the price.

Stealing is an example of an unfairness which can significantly improve overall performance.

a-dub•1w ago

i always got the sense that spinlocks were about maximum portability and reliability in the face of unreliable event driven approaches. the dumb inefficient thing that makes the heads of the inexperienced explode, but actually just works and makes the world go 'round.

rdtsc•1w ago

> Notice that in the Skylake Client microarchitecture the RDTSC instruction counts at the machine’s guaranteed P1 frequency independently of the current processor clock (see the INVARIANT TSC property), and therefore, when running in Intel® Turbo-Boost-enabled mode, the delay will remain constant, but the number of instructions that could have been executed will change.

rdtsc may execute out of order, so sometimes an lfence (previously cpuid) can be used and there is also rdtscp

See https://github.com/torvalds/linux/blob/master/arch/x86/inclu...

And just because rdtsc is constant doesn't mean the processor clock will be constant that could be fluctuating.

Lectem•1w ago

The issue with that is that a load fence may be very detrimental to perf. It doesn't really matter if rdtsc executes out of order in this code anyway, and there is no need for sync between cores.

rdtsc•1w ago

You could first measure the perf impact of the fence instruction and then subtract that out? But yeah I guess it may not matter much for quick and dirty calibration loop.

I found somewhere (https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpu...) that the pause instruction had this wild cycle difference between different CPU and it caused some grief, I had no idea. I stopped doing low level coding a while back.

pizlonator•1w ago

TFA lists WebKit as a project that "does it wrong".

The author should read https://webkit.org/blog/6161/locking-in-webkit/ so that they understand what they are talking about.

WebKit does it right in the sense that:

- It as an optimal amount of spinning

- Threads wait (instead of spinning) if the lock is not available immediately-ish

And we know that the algorithms are optimal based on rigorous experiments.

bbri06•1w ago

This is an incredible blog post. Super educational, and I think directly applicable to my work. Thanks for sharing!

Lectem•1w ago

The author (me) actually read this long ago

> - It as an optimal amount of spinning

No it isn't, it has a fixed number of yields, which has a very different duration on various CPUs

> Threads wait (instead of spinning) if the lock is not available immediately-ish

They use parking lots, which is one way to do futew (in fact, WaitOnAddress is implemented similarly). And no if you read the code, they do spin. Worse, they actually yield the thread before properly parking.

Phelinofist•1w ago

I guess you mean this regarding spin locks? https://web.archive.org/web/20250219201712/https://www.intel...

The direct link to Intel 404s.

jacobp100•1w ago

The guy you relied to wrote the locking code. If you’re so certain they’re doing it wrong, would it not be easier to just prove it? It’s only one file, and they already have benchmarking set up

Lectem•1w ago

I mean my "No it isn't, it has a fixed number of yields, which has a very different duration on various CPUs" can be verified directly by having a look at the table in my article showing different timings for pause.

For the yield part, I already linked to the part that shows that. Yes it doesn't call yield if it sees others are parked, but on quick lock/unlock of threads it happens that it sees nobody parked and fails, yielding directly to the OS. This is not frequent, but frequent enough that it can introduce delay issues.

arghwhat•1w ago

For reference, golang's mutex also spins by up to 4 times before parking the goroutine on a semaphore. A lot less than the 40 times in the webkit blogpost, but I would definitely consider spinning an appropriate amount before sleeping to be common practice for a generic lock. Granted, as they have a userspace scheduler things do differ a bit there, but most concepts still apply.

https://github.com/golang/go/blob/2bd7f15dd7423b6817939b199c...

pizlonator•1w ago

> No it isn't, it has a fixed number of yields, which has a very different duration on various CPUs

You say this with zero data.

I know that yielding 40 times is optimal for WebKit because I measured it. In fact it was re-measured many times because folks like you would doubt that it could’ve optimal, suggest something different, and then again the 40 yields would be shown to be optimal.

> And no if you read the code, they do spin. Worse, they actually yield the thread before properly parking.

Threads wait if the lock is not available immediately-ish.

Yes, they spin by yielding. Spinning by pausing or doing anything else results in worse performance. We measured this countless times.

I think the mistake you’re making is that you’re imagining how locks work. Whereas what I am doing is running rigorous experiments that involved putting WebKit through larger scale tests

Lectem•1w ago

>You say this with zero data.

Or so you assume

> Spinning by pausing or doing anything else results in worse performance. We measured this countless times.

And I've seen the issue in hundreds of captures using a profiler. I suppose we just have a different definitions of the what "worse performance" is.

> Whereas what I am doing is running rigorous experiments that involved putting WebKit through larger scale tests

Or perhaps the fish was drown in the stats, or again different metrics.

pizlonator•1w ago

> Or so you assume

You're not including data in your discussion of this topic. Your post included zero data.

My post on WTF locks has tons of data.

So, I'm not assuming; I'm observing.

> And I've seen the issue in hundreds of captures using a profiler. I suppose we just have a different definitions of the what "worse performance" is.

Nobody cares what you saw in the profiler.

What matters is the performance users experience.

By any metric of observable performance, yielding is the optimal way of spinning.

ablob•1d ago

>> No it isn't, it has a fixed number of yields, which has a very different duration on various CPUs

> You say this with zero data.

Wouldn't the null hypothesis be that the same program behaves differently on different CPUs? Is "different people require different amounts of time to run 100m" a statement that requires data?

spacechild1•1w ago

Nice article! Yes, using spinlocks in normal userspace applications is not recommended.

One area where I found spinlocks to be useful is in multithreaded audio applications. Audio threads are not supposed to be preempted by other user space threads because otherwise they may not complete in time, leading to audio glitches. The threads have a very high priority (or have a special scheduling policy) and may be pinned to different CPU cores.

For example, multiple audio threads might read from the same sample buffer, whose content is occasionally modified. In that case, you could use a reader-writer-spinlock where multiple readers would be able to progress in parallel without blocking each other. Only a writer would block other threads.

What would be the potential problems in that scenario?

Lectem•1w ago

I've heard of issues on Arm devices with properly isolated cores (only one thread allowed, interrupts disabled) because the would interact with other threads using such a spinlock, threads which were not themselves isolated. The team replaced it all with a futex and it ended up working better in the end. Sadly this happened while I was under another project so I don't have the details, but this can be problematic in audio too. To avoid the delay of waking up thread you can actually wake them a tiny bit early and then spin (not on a lock), since you know work is incoming.

spacechild1•1w ago

For task queues we would use a lockfree queue, wake up the threads once at the beginning of the audio callback and then spin while waiting for tasks, just as you described.

My example above was rather about the DSP graphs themselves that are computed in parallel. These require to access to shared resources like audio buffers, but under no circumstance should they give up their timeslice and yield back to the scheduler. That's why we're using reader-writer spinlocks to synchronize access to these resources. I really don't see any other practical alternative... Any ideas?

Lectem•1w ago

I suppose you need to be able to read data from the buffers to know what parts of the graph to cull? Is computing the graph really long or the graph needs update mid execution? If you really have nothing else to do on those threads/cores, spinning might actually be the solution(considering a high sampling rate). I'd still fallback to the OS after a certain amount of time, as it would mean you failed to meet the deadline anyway. I would also reduce as much as possible the need for writes to synchronized resources where possible, so that you can just read values knowing no writes can happen during your multiple reads.

m-schuetz•1w ago

Recently implemented a fixed-size memory pool with spinlocks and now I'm wondering - how would one implement them without a spinlock?

Edit: Maybe I'm confusing terminology. What I'm doing is looping until other threads returned memory, but I'm also doing a short sleep during each loop iteration.

Lectem•1w ago

That's a spin loop ;)

m-schuetz•1w ago

I see, thanks.

horizion2025•1w ago

My concurrency knowledge is a bit rusty but aren't spinlocks only supposed to be used for very brief waits like in the hundreds of cycles (or situations where you can't block... like internal o/s scheduling structures in SMP setups)? If so how much does all this back off and starvation of higher priority threads even matter? If it is longer then you should use a locking primitive (except for in those low level os structures!) where most of the things discussed are not an issue. Would love to hear the use cases where spin locks are needed in eg user space, I dont doubt they occur.

Lectem•1w ago

That's how they are supposed to work indeed! But spin locks aren't the only spin loops you may find, and allocator for example do spin. And for example under an allocation heavy code (that you should avoid too, but happens due to 3rd parties in real life), this can trigger contention, so you need contention to not be the worse type of contention.

imtringued•1w ago

How can you guarantee that the OS doesn't preempt your thread in the middle of the spinlock? Suddenly your 100 cycle spinlock turns into millions or billions of wasted cycles, because the other threads that are trying to acquire the same lock are spinning and didn't bother informing the OS scheduler that they need the thread that is holding the spinlock, which also didn't inform the OS, to finish its business ASAP.

fsckboy•1w ago

what is/are the thread synchronization protocol called which is the equivalent to ethernet's CSMA? there's no "carrier sensing", but instead "who won or mistakes were made" sensing. or is that just considered a form of spinlock? (you're not waiting for a lock, you perform your operation then see if it worked; though you could make the operation be "acquire lock" in which case it's a spinlock)

kobebrookskC3•1w ago

isn't that more like optimistic concurrency control?

Animats•1w ago

I struggled with this in Wine. "malloc" type memory allocation involves at least two levels of spinlocks. When you do a "realloc", the spinlocks are held during the copying operation. If you use Vec .push in Rust, you do a lot of reallocs. In a heavily multithreaded program, this can knock performance down by more than two orders of magnitude. It's hard to reproduce this with a simple program; it takes a lot of concurrency to hit futex congesion.

Real Windows, and Linux, don't have this problem. Only Wine's "malloc" in a DLL, which does.

Bug reports resulted in finger-pointing and denial.[1] "Unconfirmed", despite showing debugger output.

[1] https://bugs.winehq.org/show_bug.cgi?id=54979

rendaw•1w ago

Reading the bug report, I don't see any denial. The maintainers are pretty clear that they acknowledge the issue, but don't know how to fix it.

Animats•1w ago

Yes, although it took a while to get there. This confirms the OP's line "Spinning around: Please don't". You can get huge performance hits that are hard to fix. Huge.

electroglyph•1w ago

Microsoft's rwlock implementation was borked up until sometime last year iirc. this stuff is difficult to do correctly

foldr•1w ago

> The code is not thread-safe as, if multiple threads attempt to use this lock, we could read invalid values of isLocked (in theory, and on a CPU where tearing could happen on its word size).

The issue isn’t just tearing but also memory order. On some architectures you can read a valid but out of date value in Thread A after Thread B has updated that value. (Memory order is mentioned later in the article, to be fair.)

DoNotNotify is now Open Source

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

Haskell for all: Beyond agentic coding

Matchlock: Linux-based sandboxing for AI agents

SectorC: A C Compiler in 512 bytes (2023)

LLMs as the new high level language

The Architecture of Open Source Applications (Volume 1) Berkeley DB

Moroccan sardine prices to stabilise via new measures: officials

Software factories and the agentic moment

Speed up responses with fast mode

LineageOS 23.2

Modern and Antique Technologies Reveal a Dynamic Cosmos

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Wood Gas Vehicles: Firewood in the Fuel Tank (2010)

Vocal Guide – belt sing without killing yourself

uLauncher

First Proof

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Substack confirms data breach affects users’ email addresses and phone numbers

Start all of your commands with a comma (2009)

Al Lowe on model trains, funny deaths and working with Disney

The AI boom is causing shortages everywhere else

LLMs as Language Compilers: Lessons from Fortran for the Future of Coding

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Where did all the starships go?

Show HN: A luma dependent chroma compression algorithm (image compression)

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

The Scriptovision Super Micro Script video titler is almost a home computer

DoNotNotify is now Open Source

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

Haskell for all: Beyond agentic coding

Matchlock: Linux-based sandboxing for AI agents

SectorC: A C Compiler in 512 bytes (2023)

LLMs as the new high level language

The Architecture of Open Source Applications (Volume 1) Berkeley DB

Moroccan sardine prices to stabilise via new measures: officials

Software factories and the agentic moment

Speed up responses with fast mode

LineageOS 23.2

Modern and Antique Technologies Reveal a Dynamic Cosmos

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Wood Gas Vehicles: Firewood in the Fuel Tank (2010)

Vocal Guide – belt sing without killing yourself

uLauncher

First Proof

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Substack confirms data breach affects users’ email addresses and phone numbers

Start all of your commands with a comma (2009)

Al Lowe on model trains, funny deaths and working with Disney

The AI boom is causing shortages everywhere else

LLMs as Language Compilers: Lessons from Fortran for the Future of Coding

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Where did all the starships go?

Show HN: A luma dependent chroma compression algorithm (image compression)

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

The Scriptovision Super Micro Script video titler is almost a home computer

Spinning around: Please don’t – Common problems with spin locks

Comments