This is where tree data structures win out. Git, Subversion, and pure functional languages don’t replace the entire data structure. They make a new element and attach it to the tree by creating a new version of the intended parent, then a new version of its parent, on up to the root element.
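A minimal sketch of that path copying, assuming a toy persistent binary search tree in Rust (the `Node` type and `insert` function are illustrative, not from any crate discussed here): an insert allocates new nodes only along the root-to-leaf path and shares every untouched subtree with the previous version.

```rust
use std::sync::Arc;

#[derive(Clone)]
struct Node {
    key: i64,
    left: Option<Arc<Node>>,
    right: Option<Arc<Node>>,
}

/// Returns the root of a *new* tree version; the old root stays valid.
fn insert(node: Option<&Arc<Node>>, key: i64) -> Arc<Node> {
    match node {
        None => Arc::new(Node { key, left: None, right: None }),
        Some(n) => {
            // Copy only this node; untouched children are shared via Arc.
            let mut copy = Node::clone(n);
            if key < n.key {
                copy.left = Some(insert(n.left.as_ref(), key));
            } else if key > n.key {
                copy.right = Some(insert(n.right.as_ref(), key));
            }
            // Each ancestor on the path gets its own new version, up to the root.
            Arc::new(copy)
        }
    }
}
```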
So somewhere out there is space for a hybrid of ArcSwap and DashMap that uses sharded copy-on-write. These guys probably don’t need it given their low write rate, but shards should help with write contention in other use cases.
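A rough sketch of what such a hybrid could look like (hypothetical types, not an existing crate): each shard holds an `ArcSwap`'d immutable `HashMap`, so reads stay lock-free while writers copy, and contend on, only their own shard.

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use arc_swap::ArcSwap;

struct Shard<K, V> {
    map: ArcSwap<HashMap<K, V>>,
    write_lock: Mutex<()>, // serializes writers within this shard only
}

struct ShardedCowMap<K, V> {
    shards: Vec<Shard<K, V>>,
}

impl<K: Hash + Eq + Clone, V: Clone> ShardedCowMap<K, V> {
    fn new(n_shards: usize) -> Self {
        let shards = (0..n_shards)
            .map(|_| Shard {
                map: ArcSwap::from_pointee(HashMap::new()),
                write_lock: Mutex::new(()),
            })
            .collect();
        Self { shards }
    }

    fn shard_for(&self, key: &K) -> &Shard<K, V> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[h.finish() as usize % self.shards.len()]
    }

    /// Lock-free read: grab the shard's current snapshot and look up the key.
    fn get(&self, key: &K) -> Option<V> {
        self.shard_for(key).map.load().get(key).cloned()
    }

    /// Copy-on-write insert: clone one shard's map, mutate the copy, publish it.
    fn insert(&self, key: K, value: V) {
        let shard = self.shard_for(&key);
        let _guard = shard.write_lock.lock().unwrap();
        let mut new_map = (*shard.map.load_full()).clone();
        new_map.insert(key, value);
        shard.map.store(Arc::new(new_map));
    }
}
```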
Left-right is an option which, IIRC, avoids the CAS loop, since it’s a single-writer primitive.
I would love, though, to have a similar persistent structure where the node pointers at each level were thread-sharded instead of using e.g. atomics. A mutation at a branch could record that the mutation is "just" for thread ID X, the thread that made it. At some later point a "commit" could reconcile the storage to global, maybe.
For in-memory problems there are B-trees, a solution to many of these problems that’s older than most of HN’s userbase, myself included. Then you still have the sharding, which I did in fact point out would be part of a better hybrid solution.
Making sure the counter variable is sitting alone on its own cache line would completely remove the excessive ping-pong behaviour. The cache line with the counter would, of course, still ping-pong as it's mutated, but not excessively.
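For instance, in Rust the alignment can be declared directly (an illustrative sketch; `crossbeam_utils::CachePadded` does the same thing off the shelf):

```rust
use std::sync::atomic::AtomicU64;

/// Force the counter onto its own 64-byte cache line so neighbouring
/// fields can't false-share with it. The counter's own line still bounces
/// between cores as it's mutated, but nothing else rides along with it.
#[repr(align(64))]
struct PaddedCounter {
    value: AtomicU64,
}
```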
All threads repeatedly CASing on the same memory location is enough to bring any CPU to its knees.
The critical part of the fix is that readers no longer have to modify any shared memory location; instead, reclamation of old values is deferred.
Well if they were mutating the ref counter I’d call them writers not readers. =P
But yes. Many people incorrectly assume that writing an atomic_int from many threads is super fast. I mean it’s atomic! It couldn’t be a smaller atom of work! Alas this is incredibly incorrect. Even atomic ints can be extremely slow.
*to contended shared memory
And the way you solve this is by using a separate counter for each core, each padded out to its own cache line so no two of them ever share one. The downside is that you lose the ability to read the counter atomically - but if the counter is incrementing fast enough that you need per-CPU counters, a point-in-time value is basically meaningless anyway.
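A sketch of that in Rust (illustrative names; `CachePadded` from `crossbeam_utils` handles the spacing): each thread or core bumps its own slot, and a read just sums the slots for an approximate total.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use crossbeam_utils::CachePadded;

struct ShardedCounter {
    // One padded counter per slot (e.g. one per core) so increments
    // from different cores never touch the same cache line.
    slots: Vec<CachePadded<AtomicU64>>,
}

impl ShardedCounter {
    fn new(n_slots: usize) -> Self {
        Self {
            slots: (0..n_slots).map(|_| CachePadded::new(AtomicU64::new(0))).collect(),
        }
    }

    /// `shard` would typically be a thread or CPU id.
    fn add(&self, shard: usize, n: u64) {
        self.slots[shard % self.slots.len()].fetch_add(n, Ordering::Relaxed);
    }

    /// Only a point-in-time approximation, as noted above.
    fn read(&self) -> u64 {
        self.slots.iter().map(|s| s.load(Ordering::Relaxed)).sum()
    }
}
```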
[1] True no-writes-on-shared-locking RWMutexes do exist, but they are significantly more complex and require epoch detection like ArcSwap does, so there is no significant advantage.
But for multiple-reader, exclusive-writer use this is never going to work. Or at least not without extra silicon. Which maybe should be a thing. The problem, of course, is that a multi-word atomic sum instruction would actually have to cross cache lines to avoid false sharing. So you'd end up with counters on contiguous cache lines but occupying 64 bytes apiece, which is a lot.
It would almost call for a special region of memory with different cache-coherency behavior, and I can't see that being easy, fast, or backward compatible, which is maybe why we don't do it.
It blows me away that an issue like this could take weeks to track down. If I were in any leadership position at this company, heads would be rolling over the lack of telemetry or domain knowledge for these systems.
Concurrent programming has been mainstream for some time now, but I don't think the level of expertise of most engineers has kept up. That becomes most apparent when software starts hitting concurrency pitfalls: performance problems, deadlocks, UAFs, and so on...
Sometimes you can't have the domain knowledge you'd want beforehand. The reasons vary a great deal. For example, the software in question might be commercial, and you might not have an alternative vendor to switch to. Other times your "bus factor" might have dropped to uncomfortably small values through no fault of anyone in the organization: people leaving too quickly for reasons having nothing to do with the way the org is run, people dying, headcount and budgets not allowing for remediating the situation, or simply having too many critical systems to keep every bus factor wide at all times.
To be fair I thought about it for a long time first. But nowadays these techniques are well known, so now it's easier to "copy" the thinking others have done, and that thinking is the hard part.
The set function will not invalidate any values seen by get that might still be referenced, but the new value will be seen by all subsequent gets.
Old values will be destructed when no longer referenced.
To read-copy-update you'll need to take a lock (one that readers don't take), get the current value (it will still be current, since you're locking out other writers), copy it, modify the copy, then set the new value.
You don't have to read-copy-update if the new values don't depend on the preceding values. So my API does not force read-copy-update, but you can do RCU with it.
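As a Rust-flavoured sketch of that sequence (using `ArcSwap` as the publishing mechanism rather than the C API described here; the `RcuCell` type is illustrative): the mutex is only ever taken by writers, and readers just load the current snapshot.

```rust
use std::sync::{Arc, Mutex};
use arc_swap::ArcSwap;

struct RcuCell<T> {
    current: ArcSwap<T>,
    writer_lock: Mutex<()>, // readers never touch this
}

impl<T: Clone> RcuCell<T> {
    fn new(value: T) -> Self {
        Self { current: ArcSwap::from_pointee(value), writer_lock: Mutex::new(()) }
    }

    /// Lock-free read of the current version.
    fn get(&self) -> Arc<T> {
        self.current.load_full()
    }

    /// Read-copy-update: lock out other writers, copy the current value,
    /// modify the copy, then publish it. Readers keep whatever Arc they
    /// already hold; the old version is freed once the last one drops it.
    fn update(&self, mutate: impl FnOnce(&mut T)) {
        let _g = self.writer_lock.lock().unwrap();
        let mut copy = (*self.current.load_full()).clone();
        mutate(&mut copy);
        self.current.store(Arc::new(copy));
    }
}
```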
Well, first of all reads are real cheap, so that's not a big concern. In some systems readers don't ever turn around to write, so that's ok, and in others they don't modify the current value. E.g., in one system I've built we use TinyCDB files and when new files are renamed into place we open those and set the handle on one of these thread-safe variables -- no read-modify involved here.
> And if one chooses a data structure for which copy isn't a cheap operation, this can be slower than classic RCU.
Either way it's RCU. It's best to publish immutable data with RCU and design data structures where copy-on-write modifications don't have to copy the whole data structure but rather just small parts of it. E.g., in a tree you should only have to copy the interior nodes from the one you're modifying to the root.
One can always build a hash table where each cell is one of these thread-safe variables (or plain old RCU), though one would have to make these thread-safe variables lighter weight to make having lots of them realistic (currently my implementation has a pthread_key_t per variable, which does not scale).
In TFA ArcSwap is essentially indistinguishable from a read-only hash table published on something like one of my thread-safe variables, and for TFA this worked better than the available alternatives in spite of the copy-on-write burden.
BTW, I built this thing originally to allow global configuration to be shared with otherwise-shared-nothing threads and allow the service to be reconfigurable at run-time. Without something like this (or plain old RCU) your options are limited to things like using read-write locks (which are way worse than anything you can imagine being bad about RCU used with immutable data structures). I was very displeased with read-write locks when I needed to solve this particular problem at Sun, but I didn't build this thing until after I left Sun (Oracle).
Nowadays RCU-ish libraries are pretty widely available. OpenSSL 3.3 (or was it 3.4?) has something like this after the threaded performance debacles of 3.0. In particular OpenSSL has an RCU-ish hash table. You might be interested in it. IIRC (but it's been a while since I looked) it uses one hazard-pointer-per-thread to acquire a stable reference to a key or value long enough to then incref a refcount -- pretty clever, and that's essentially the way to make a "thread-safe variable" light-weight enough to be able to use one for each slot in a hash table.
The stuff about "debt" is about the design of ArcSwap, which uses "debt" instead of "hazard pointers". The idea is that every thread keeps track of the last version of the hashmap it has seen, and this is "debt" (references that need to be released eventually); this debt is a thread-local object linked with all the other threads' debt, which essentially allows garbage collection of "debt" (debt collection?), or something like that. See https://github.com/vorner/arc-swap/blob/master/src/docs/inte...
I've implemented something similar myself. Instead of implementing user-land RCU (which is what perhaps I should have done) I implemented a "thread-safe variable" where reads are lock-less and wait-free, and writers pay all the costs. My thread-safe variable API is real simple and intuitive: `create(value_destructor) -> thread_safe_var`, `set(thread_safe_var, value)`, `get(thread_safe_var) -> value`, `release(thread_safe_var)`, and `destroy(thread_safe_var)` (presumably only during an atexit() handler or shared object destructor call, if at all). I have two implementations, one with a linked list of per-thread hazard pointers, and the other with a tuple of {next/current, current/previous} "cells" and a protocol that gives readers time to read the cells, figure out which is "current" and then dereference and incref a refcount.
Regardless of how one implements this data structure, the key concept is that you have to atomically compose two distinct atomic operations: a read, and a dereference to atomically increment a reference count -- this is remarkably tricky!
When you have a GC it's really easy: just read -- there is no reference count to increment.
With a linked list of per-thread hazard pointers, all you have to do is read the thing, write it into your thread-local hazard pointer, then read again, looping until the second read matches the first. Having the value thus safely recorded in the reader's thread-local hazard pointer then allows for notional garbage collection: any thread that would release an old version of the value has to check all the hazard pointers to see that it's not in use, which amounts to a small garbage collection.
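In code, that read/publish/re-read loop looks roughly like this (a generic sketch with raw atomics, not the author's implementation; `shared` and `my_hazard` are hypothetical names):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Acquire a stable pointer to the current value: publish a candidate in
/// this thread's hazard slot, then re-check that it's still current.
fn hazard_acquire<T>(shared: &AtomicPtr<T>, my_hazard: &AtomicPtr<T>) -> *mut T {
    loop {
        let p = shared.load(Ordering::Acquire);
        // Publish so that no reclaimer will free `p` from here on...
        my_hazard.store(p, Ordering::SeqCst);
        // ...but only if it was still the current value when we published.
        if shared.load(Ordering::Acquire) == p {
            return p; // safe to dereference until my_hazard is cleared
        }
        // The value may already have been retired; try again.
    }
}
```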
I especially like the hazard pointer approach.
See also MVars (Haskell), transactional memory, RCU, etc.
Thundering herds will do that. Queuing theory. Once the queue builds up it takes far longer to clear than intuition suggests.
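For a rough sense of why (standard single-queue math, nothing specific to this incident): in an M/M/1 queue with arrival rate λ and service rate μ, the mean time a request spends in the system is W = 1/(μ - λ), and a backlog of N requests drains at only μ - λ per unit time, so as utilization λ/μ approaches 1 both quantities blow up hyperbolically rather than linearly.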