This is where tree data structures win out. Git, Subversion, and pure functional languages don’t replace the entire data structure. They make a new element and attach it to the tree by creating a new version of the intended parent, then a new version of its parent, on up to the root element.
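A minimal sketch of that path copying, assuming a toy persistent binary search tree in Rust (the `Node` type and `insert` function are illustrative, not from any crate discussed here): an insert allocates new nodes only along the root-to-leaf path and shares every untouched subtree with the previous version.

```rust
use std::sync::Arc;

#[derive(Clone)]
struct Node {
    key: i64,
    left: Option<Arc<Node>>,
    right: Option<Arc<Node>>,
}

/// Returns the root of a *new* tree version; the old root stays valid.
fn insert(node: Option<&Arc<Node>>, key: i64) -> Arc<Node> {
    match node {
        None => Arc::new(Node { key, left: None, right: None }),
        Some(n) => {
            // Copy only this node; untouched children are shared via Arc.
            let mut copy = Node::clone(n);
            if key < n.key {
                copy.left = Some(insert(n.left.as_ref(), key));
            } else if key > n.key {
                copy.right = Some(insert(n.right.as_ref(), key));
            }
            // Each ancestor on the path gets its own new version, up to the root.
            Arc::new(copy)
        }
    }
}
```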
So somewhere out there is space for a hybrid of ArcSwap and DashMap that uses sharded copy-on-write. These guys probably don’t need it given their low write rate, but shards should help with write contention in other use cases.
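A rough sketch of what such a hybrid could look like (hypothetical types, not an existing crate): each shard holds an `ArcSwap`'d immutable `HashMap`, so reads stay lock-free while writers copy, and contend on, only their own shard.

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use arc_swap::ArcSwap;

struct Shard<K, V> {
    map: ArcSwap<HashMap<K, V>>,
    write_lock: Mutex<()>, // serializes writers within this shard only
}

struct ShardedCowMap<K, V> {
    shards: Vec<Shard<K, V>>,
}

impl<K: Hash + Eq + Clone, V: Clone> ShardedCowMap<K, V> {
    fn new(n_shards: usize) -> Self {
        let shards = (0..n_shards)
            .map(|_| Shard {
                map: ArcSwap::from_pointee(HashMap::new()),
                write_lock: Mutex::new(()),
            })
            .collect();
        Self { shards }
    }

    fn shard_for(&self, key: &K) -> &Shard<K, V> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[h.finish() as usize % self.shards.len()]
    }

    /// Lock-free read: grab the shard's current snapshot and look up the key.
    fn get(&self, key: &K) -> Option<V> {
        self.shard_for(key).map.load().get(key).cloned()
    }

    /// Copy-on-write insert: clone one shard's map, mutate the copy, publish it.
    fn insert(&self, key: K, value: V) {
        let shard = self.shard_for(&key);
        let _guard = shard.write_lock.lock().unwrap();
        let mut new_map = (*shard.map.load_full()).clone();
        new_map.insert(key, value);
        shard.map.store(Arc::new(new_map));
    }
}
```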
Left-right is an option which, IIRC, avoids the CAS loop, since it’s a single-writer primitive.
I would love, though, to have a similar persistent structure where the node pointers at each level were thread-sharded instead of using e.g. atomics. A mutation at a branch could record that the mutation is "just" for thread ID X, the thread that made it. At some later point a "commit" could reconcile the storage to global, maybe.
For in-memory problems there are B-trees, a solution to many of these problems that’s older than most of HN’s userbase, myself included. Then you still have the sharding, which I did in fact point out would be part of a better hybrid solution.
Making sure the counter variable is sitting alone on its own cache line would completely remove the excessive ping-pong behaviour. The cache line with the counter would, of course, still ping-pong as it's mutated, but not excessively.
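For instance, in Rust the alignment can be declared directly (an illustrative sketch; `crossbeam_utils::CachePadded` does the same thing off the shelf):

```rust
use std::sync::atomic::AtomicU64;

/// Force the counter onto its own 64-byte cache line so neighbouring
/// fields can't false-share with it. The counter's own line still bounces
/// between cores as it's mutated, but nothing else rides along with it.
#[repr(align(64))]
struct PaddedCounter {
    value: AtomicU64,
}
```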
All threads repeatedly CASing on the same memory location is enough to bring any CPU to its knees.
The critical part of the fix is that readers no longer have to modify any shared memory location; instead, reclamation of old values is deferred.
Well if they were mutating the ref counter I’d call them writers not readers. =P
But yes. Many people incorrectly assume that writing an atomic_int from many threads is super fast. I mean it’s atomic! It couldn’t be a smaller atom of work! Alas this is incredibly incorrect. Even atomic ints can be extremely slow.
*to contended shared memory
And the way you solve this is by using a separate counter for each core, each padded out to its own cache line so no two of them ever share one. The downside is that you lose the ability to read the counter atomically - but if the counter is incrementing fast enough that you need per-CPU counters, a point-in-time value is basically meaningless anyway.
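A sketch of that in Rust (illustrative names; `CachePadded` from `crossbeam_utils` handles the spacing): each thread or core bumps its own slot, and a read just sums the slots for an approximate total.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use crossbeam_utils::CachePadded;

struct ShardedCounter {
    // One padded counter per slot (e.g. one per core) so increments
    // from different cores never touch the same cache line.
    slots: Vec<CachePadded<AtomicU64>>,
}

impl ShardedCounter {
    fn new(n_slots: usize) -> Self {
        Self {
            slots: (0..n_slots).map(|_| CachePadded::new(AtomicU64::new(0))).collect(),
        }
    }

    /// `shard` would typically be a thread or CPU id.
    fn add(&self, shard: usize, n: u64) {
        self.slots[shard % self.slots.len()].fetch_add(n, Ordering::Relaxed);
    }

    /// Only a point-in-time approximation, as noted above.
    fn read(&self) -> u64 {
        self.slots.iter().map(|s| s.load(Ordering::Relaxed)).sum()
    }
}
```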
[1] True no-writes-on-shared-locking RWMutexes do exist, but they are significantly more complex and require epoch detection like ArcSwap does, so there is no significant advantage.
But for multiple-reader, exclusive-writer use this is never going to work. Or at least not without extra silicon. Which maybe should be a thing. The problem, of course, is that a multi-word atomic sum instruction would actually have to cross cache lines to avoid false sharing. So you'd end up with counters on contiguous cache lines but occupying 64 bytes apiece, which is a lot.
It would almost call for a special region of memory with different cache-coherency behavior, and I can't see that being easy, fast, or backward compatible, which is maybe why we don't do it.
It blows me away that an issue like this could take weeks to track down. If I were in any leadership position at this company, heads would be rolling over the lack of telemetry or domain knowledge for these systems.
Concurrent programming has been mainstream for some time now, but I don't think the level of expertise of most engineers has kept up. That becomes most apparent when software starts hitting concurrency pitfalls: performance problems, deadlocks, UAFs, and so on...
Sometimes you can't have the domain knowledge you'd want beforehand. The reasons vary a great deal. For example, the software in question might be commercial, and you might not have an alternative vendor to switch to. Other times your "bus factor" might have dropped to uncomfortably small values through no fault of anyone in the organization: people leaving too quickly for reasons having nothing to do with the way the org is run, people dying, headcount and budgets not allowing for remediating the situation, or simply having too many critical systems to keep every bus factor wide at all times.
To be fair I thought about it for a long time first. But nowadays these techniques are well known, so now it's easier to "copy" the thinking others have done, and that thinking is the hard part.
The set function will not invalidate any values seen by get that might still be referenced, but the new value will be seen by all subsequent gets.
Old values will be destructed when no longer referenced.
To read-copy-update you'll need to take a lock (one that readers don't take), get the current value (it will still be current, since you're locking out other writers), copy it, modify the copy, then set the new value.
You don't have to read-copy-update if the new values don't depend on the preceding values. So my API does not force read-copy-update, but you can do RCU with it.
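As a Rust-flavoured sketch of that sequence (using `ArcSwap` as the publishing mechanism rather than the C API described here; the `RcuCell` type is illustrative): the mutex is only ever taken by writers, and readers just load the current snapshot.

```rust
use std::sync::{Arc, Mutex};
use arc_swap::ArcSwap;

struct RcuCell<T> {
    current: ArcSwap<T>,
    writer_lock: Mutex<()>, // readers never touch this
}

impl<T: Clone> RcuCell<T> {
    fn new(value: T) -> Self {
        Self { current: ArcSwap::from_pointee(value), writer_lock: Mutex::new(()) }
    }

    /// Lock-free read of the current version.
    fn get(&self) -> Arc<T> {
        self.current.load_full()
    }

    /// Read-copy-update: lock out other writers, copy the current value,
    /// modify the copy, then publish it. Readers keep whatever Arc they
    /// already hold; the old version is freed once the last one drops it.
    fn update(&self, mutate: impl FnOnce(&mut T)) {
        let _g = self.writer_lock.lock().unwrap();
        let mut copy = (*self.current.load_full()).clone();
        mutate(&mut copy);
        self.current.store(Arc::new(copy));
    }
}
```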
Well, first of all reads are real cheap, so that's not a big concern. In some systems readers don't ever turn around to write, so that's ok, and in others they don't modify the current value. E.g., in one system I've built we use TinyCDB files and when new files are renamed into place we open those and set the handle on one of these thread-safe variables -- no read-modify involved here.
> And if one chooses a data structure for which copy isn't a cheap operation, this can be slower than classic RCU.
Either way it's RCU. It's best to publish immutable data with RCU and design data structures where copy-on-write modifications don't have to copy the whole data structure but rather just small parts of it. E.g., in a tree you should only have to copy the interior nodes from the one you're modifying to the root.
One can always build a hash table where each cell is one of these thread-safe variables (or plain old RCU), though one would have to make these thread-safe variables lighter weight to make having lots of them realistic (currently my implementation has a pthread_key_t per variable, which does not scale).
In TFA ArcSwap is essentially indistinguishable from a read-only hash table published on something like one of my thread-safe variables, and for TFA this worked better than the available alternatives in spite of the copy-on-write burden.
BTW, I built this thing originally to allow global configuration to be shared with otherwise-shared-nothing threads and allow the service to be reconfigurable at run-time. Without something like this (or plain old RCU) your options are limited to things like using read-write locks (which are way worse than anything you can imagine being bad about RCU used with immutable data structures). I was very displeased with read-write locks when I needed to solve this particular problem at Sun, but I didn't build this thing until after I left Sun (Oracle).
Nowadays RCU-ish libraries are pretty widely available. OpenSSL 3.3 (or was it 3.4?) has something like this after the threaded performance debacles of 3.0. In particular OpenSSL has an RCU-ish hash table. You might be interested in it. IIRC (but it's been a while since I looked) it uses one hazard-pointer-per-thread to acquire a stable reference to a key or value long enough to then incref a refcount -- pretty clever, and that's essentially the way to make a "thread-safe variable" light-weight enough to be able to use one for each slot in a hash table.
The stuff about "debt" is about the design of ArcSwap, which uses "debt" instead of "hazard pointers". The idea is that every thread keeps track of the last version of the hashmap it has seen, and this is "debt" (references that need to be released eventually); this debt is a thread-local object linked with all the other threads' debt, which essentially allows garbage collection of "debt" (debt collection?), or something like that. See https://github.com/vorner/arc-swap/blob/master/src/docs/inte...
I've implemented something similar myself. Instead of implementing user-land RCU (which is what perhaps I should have done) I implemented a "thread-safe variable" where reads are lock-less and wait-free, and writers pay all the costs. My thread-safe variable API is real simple and intuitive: `create(value_destructor) -> thread_safe_var`, `set(thread_safe_var, value)`, `get(thread_safe_var) -> value`, `release(thread_safe_var)`, and `destroy(thread_safe_var)` (presumably only during an atexit() handler or shared object destructor call, if at all). I have two implementations, one with a linked list of per-thread hazard pointers, and the other with a tuple of {next/current, current/previous} "cells" and a protocol that gives readers time to read the cells, figure out which is "current" and then dereference and incref a refcount.
Regardless of how one implements this data structure, the key concept is that you have to atomically compose two distinct atomic operations: a read, and a dereference to atomically increment a reference count -- this is remarkably tricky!
When you have a GC it's really easy: just read -- there is no reference count to increment.
With a linked list of per-thread hazard pointers, all you have to do is read the thing, write it into your thread-local hazard pointer, then read again, looping until the second read matches the first. Having the value thus safely recorded in the reader's thread-local hazard pointer then allows for notional garbage collection: any thread that would release an old version of the value has to check all the hazard pointers to see that it's not in use, which amounts to a small garbage collection.
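In code, that read/publish/re-read loop looks roughly like this (a generic sketch with raw atomics, not the author's implementation; `shared` and `my_hazard` are hypothetical names):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Acquire a stable pointer to the current value: publish a candidate in
/// this thread's hazard slot, then re-check that it's still current.
fn hazard_acquire<T>(shared: &AtomicPtr<T>, my_hazard: &AtomicPtr<T>) -> *mut T {
    loop {
        let p = shared.load(Ordering::Acquire);
        // Publish so that no reclaimer will free `p` from here on...
        my_hazard.store(p, Ordering::SeqCst);
        // ...but only if it was still the current value when we published.
        if shared.load(Ordering::Acquire) == p {
            return p; // safe to dereference until my_hazard is cleared
        }
        // The value may already have been retired; try again.
    }
}
```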
I especially like the hazard pointer approach.
See also MVars (Haskell), transactional memory, RCU, etc.
Thundering herds will do that. Queuing theory. Once the queue builds up it takes far longer to clear than intuition suggests.
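For a rough sense of why (standard single-queue math, nothing specific to this incident): in an M/M/1 queue with arrival rate λ and service rate μ, the mean time a request spends in the system is W = 1/(μ - λ), and a backlog of N requests drains at only μ - λ per unit time, so as utilization λ/μ approaches 1 both quantities blow up hyperbolically rather than linearly.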