I finally bit the bullet and learned Rust over the last few months and ended up really liking it, but I have to admit that it's a bit lower level than I generally work in.
I have generally avoided locks by making very liberal use of Tokio channels, though that isn't for performance reasons or anything: I just find locks really hard to reason about for anything but extremely trivial use cases, and channels are a more natural abstraction for me.
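Something like this is the shape I usually end up with (a minimal sketch, assuming the `tokio` crate with its default runtime features; the names are made up):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // One task owns the state; everyone else sends it messages
    // instead of sharing a counter behind a Mutex.
    let (tx, mut rx) = mpsc::channel::<u64>(32);

    let owner = tokio::spawn(async move {
        let mut total = 0u64;
        while let Some(n) = rx.recv().await {
            total += n;
        }
        total
    });

    for i in 0..10 {
        let tx = tx.clone();
        tokio::spawn(async move {
            tx.send(i).await.unwrap();
        });
    }
    drop(tx); // last sender gone = channel closed = owner's loop ends

    println!("total = {}", owner.await.unwrap());
}
```

No lock to reason about: the invariant is just "only the owner task touches `total`".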
I've never really considered what goes into these lock-free structures, but that might be one of my next "unemployment projects" after I finish my current one.
> Don't Communicate by Sharing Memory; Share Memory by Communicating
> In distributed systems there is no real shared state (imagine one machine in the USA, another in Sweden): where is the shared state? In the middle of the Atlantic? Shared state breaks the laws of physics. State changes are propagated at the speed of light — we always know how things were at a remote site, not how they are now. What we know is what they last told us. If you make a software abstraction that ignores this fact you'll be in trouble.
He wrote this to me in 2014, and it has really informed how I think about these things.
Even if you choose to close a channel because it's useful to you, it's not necessarily shared state. In a lot of cases, closing a channel behaves just like a message in its queue.
Additionally, the safe way to close a channel is for the writer to close it. If you have multiple writers, you have to either synchronise them or not close the channel at all.
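In Rust's std channels that rule is baked in: there is no explicit close, the receiver just sees the channel as closed once every `Sender` has been dropped, so the multi-writer case needs no extra coordination. A minimal sketch:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    let writers: Vec<_> = (0..3)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(format!("from writer {id}")).unwrap())
        })
        .collect();
    drop(tx); // keep only the writers' handles alive

    // Iteration ends once the last writer's Sender is dropped:
    // "close" is just the refcount on the senders hitting zero.
    for msg in rx {
        println!("{msg}");
    }
    for w in writers {
        w.join().unwrap();
    }
}
```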
Orthogonality needs to be valid for all subspaces.
That's all well and good until you realize you are reimplementing a slow, buggy version of MESI in software.
Proper concurrency control is the key. Shared memory vs message passing is incidental and application specific.
Is it though? Aren't atomic load/store instructions the actually important thing? I know the type system ensures that `AtomicUsize` can only be accessed using atomic instructions, but saying it's being watched by the CPU is inaccurate.
https://en.wikipedia.org/wiki/Load-link/store-conditional
It's not "24/7" but it is "watching" in some sense of the word. So not entirely unfair.
The edgy tone sounds like an LLM to me.
Other than that, I take the other side. I've read (and subsequently never finished) dozens of programming books because they are so god-awfully boring. This writing style, perhaps dialed back a little, helps keep my interest. I like the feel of a zine where it's as technical as a professional write-up but far less formal.
I often find learning through analogy useful anyway and the humor helps a lot too.
This has no unsafe code. It's all done with compare-and-swap. There is locking here, but it's down at the hardware level, within the compare-and-swap instruction. This is cleaner and more portable than relying on cross-CPU ordering of operations.
[1] https://github.com/John-Nagle/rust-vulkan-bindless/blob/main...
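For anyone who hasn't seen the pattern, the core is just a retry loop around compare-exchange. An illustrative sketch, not the code from the linked repo:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Lock-free increment: keep retrying until our view of the value
// wasn't changed by another thread between the load and the swap.
fn increment(counter: &AtomicUsize) -> usize {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,  // ordering on success
            Ordering::Relaxed, // ordering on failure (we just retry)
        ) {
            Ok(prev) => return prev,
            Err(actual) => current = actual, // lost the race; retry with the fresh value
        }
    }
}

fn main() {
    let c = AtomicUsize::new(0);
    increment(&c);
    assert_eq!(c.load(Ordering::Relaxed), 1);
}
```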
How would this compare to the lock-free abstractions that come with e.g. the `java.util.concurrent` package? It has a lot of useful primitives and data structures. I expect the memory overhead is probably worse for those.
Support for this is one of the big reasons Java and the JVM have been a popular choice for companies building middleware and data-processing frameworks for the last few decades. Exactly the kind of stuff that the author of this article is proposing you could build with this. Things like Kafka, Lucene, Spark, Hadoop, Flink, Beam, etc.
Indeed; normally we call it the system allocator.
A good system allocator will use per-thread or per-CPU free lists so that it doesn't need to do CAS loops for every allocation, though; at the very least it will use hashed pools to reduce contention (rough sketch below).
Of course, if the mutex array is doing a linear scan to find the insertion point, that would explain the difference, but: a) I can't see the code for the alternative, and b) there is no reason why the mutex variant can't use a free list.
Remember:
- Lock-free doesn't automatically mean faster (still, it has other properties that might be desirable even if slower)
- Never trust a benchmark you didn't falsify yourself.
[1] when uncontended; when contended, cache-coherence costs will dominate everything else, lock-free or not.
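The per-thread free-list idea, as a minimal sketch (fixed 4 KiB pages; names made up):

```rust
use std::cell::RefCell;

// Each thread recycles its own buffers, so the hot path never touches
// an atomic or a lock; CAS would only show up when a thread's local
// list is empty and it has to fall back to the shared allocator.
thread_local! {
    static FREE_LIST: RefCell<Vec<Box<[u8; 4096]>>> = RefCell::new(Vec::new());
}

fn alloc_page() -> Box<[u8; 4096]> {
    FREE_LIST
        .with(|fl| fl.borrow_mut().pop())
        .unwrap_or_else(|| Box::new([0u8; 4096]))
}

fn free_page(page: Box<[u8; 4096]>) {
    FREE_LIST.with(|fl| fl.borrow_mut().push(page));
}

fn main() {
    let p = alloc_page();
    free_page(p); // goes back to this thread's list, no synchronisation
    let _reused = alloc_page();
}
```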
As the benchmark is very dependent on contention, it would give very different results if the threads are scheduled serially as opposed to running truly concurrently (for example, using a spin lock would be awful if running on a single core).
So again, you need to be very careful to understand what you are actually testing.
That benchmark is something I should have added more alternatives to.
A lock-free queue by itself isn't very useful. You need a polling strategy that doesn't involve a busy loop. If you use mutexes and condvars, you've basically turned it into a lock-based queue. Eventcounts work much better.
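Roughly like this, a simplified eventcount sketch (names made up; a production version would also skip the mutex in `notify` when no one is waiting):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};

// The trick: consumers grab a "ticket" BEFORE re-checking the queue and
// only sleep if nothing was published since that ticket. The queue ops
// stay lock-free; the lock is only on the slow path where we must sleep.
struct EventCount {
    count: AtomicUsize,
    lock: Mutex<()>,
    cond: Condvar,
}

impl EventCount {
    fn new() -> Self {
        Self { count: AtomicUsize::new(0), lock: Mutex::new(()), cond: Condvar::new() }
    }

    // Consumer, step 1: take a ticket, then go re-check the queue.
    fn prepare_wait(&self) -> usize {
        self.count.load(Ordering::Acquire)
    }

    // Consumer, step 2: sleep only if no notify happened since `ticket`.
    fn wait(&self, ticket: usize) {
        let mut guard = self.lock.lock().unwrap();
        while self.count.load(Ordering::Acquire) == ticket {
            guard = self.cond.wait(guard).unwrap();
        }
    }

    // Producer: publish, then wake sleepers.
    fn notify(&self) {
        self.count.fetch_add(1, Ordering::Release);
        let _guard = self.lock.lock().unwrap();
        self.cond.notify_all();
    }
}

fn main() {
    let ec = Arc::new(EventCount::new());
    let flag = Arc::new(AtomicUsize::new(0)); // stand-in for a lock-free queue

    let (ec2, flag2) = (ec.clone(), flag.clone());
    std::thread::spawn(move || {
        flag2.store(1, Ordering::Release); // "push"
        ec2.notify();
    });

    loop {
        if flag.load(Ordering::Acquire) == 1 { break; } // "try_pop" succeeded
        let ticket = ec.prepare_wait();
        if flag.load(Ordering::Acquire) == 1 { break; } // re-check before sleeping
        ec.wait(ticket);
    }
    println!("got the item without spinning");
}
```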
If I run more threads than CPUs and enough work so I get time slice ends, I get about 1160 nsecs avg enq/deq for mutex version, and about 146 nsecs for eventcount version.
Timings will vary based on how many threads you use and CPU affinity that takes your hardware thread/core/cache layout into consideration. I have a 13th-gen i5 that runs this slower than my 10th-gen i5 because of the former's efficiency cores, even though it is supposedly faster.
And yes, a queue is a poster child for cache contention problems, un enfant terrible. I tried a back-off strategy at one point but it didn't help any.
I rewrote the entire queue with lock-free CAS to manage insertions/removals on the list and we finally got some better numbers. But not always! We found it worked best either as a single thread, or during massive contention. With a normal load it wasn't really much better.
The article doesn't go into details, but this is a subtle way to mess up when writing lock-free data structures:
1. Sometimes "lock-free" actually means using lower-level primitives that use locks internally but don't expose them, with fewer caveats than using them at a higher level. For example, compare-and-set instructions offered by CPUs, which may use bus locks internally but don't expose them to software.
2. Depending on the lower-level implementation, a simple lock may not be enough. For example, in a multi-CPU system with weaker cache coherency, a simple lock will not get rid of outdated copies of data (in caches, queues, ...). Here I write "simple" lock because some concepts of a lock, such as Java's "synchronized" statement, bundle the actual lock together with guaranteed cache synchronization, whether that happens in hardware or software.
The CPU clock itself can be thought of as a kind of lock.
- you don't reallocate the array
- you don't allow updating/removing previously inserted values
In essence it becomes a log, a Vec<OnceCell<T>> or a Vec<UnsafeCell<Option<T>>>. Works well, but only for a bounded array, so applications like messaging or inter-thread communication are not a perfect fit.
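A minimal sketch of that bounded-log shape, using `OnceLock` (the `Sync` flavour of `OnceCell`); the names are made up:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Fixed-size slab of write-once cells plus an atomic cursor: push claims
// an index, readers see a slot only once its value is fully published.
struct BoundedLog<T> {
    slots: Vec<OnceLock<T>>,
    next: AtomicUsize,
}

impl<T> BoundedLog<T> {
    fn new(capacity: usize) -> Self {
        Self {
            slots: (0..capacity).map(|_| OnceLock::new()).collect(),
            next: AtomicUsize::new(0),
        }
    }

    // Claim the next slot; None once the log is full.
    fn push(&self, value: T) -> Option<usize> {
        // Relaxed is fine here: OnceLock::set is the publication point.
        let i = self.next.fetch_add(1, Ordering::Relaxed);
        let slot = self.slots.get(i)?;
        let _ = slot.set(value); // can't already be set: each index is claimed once
        Some(i)
    }

    fn get(&self, i: usize) -> Option<&T> {
        self.slots.get(i)?.get()
    }
}

fn main() {
    let log = BoundedLog::new(8);
    let i = log.push("hello").unwrap();
    assert_eq!(log.get(i), Some(&"hello"));
}
```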
It's a fixed-size vector that can be read at the same time it's being written to. It's not a common need.
Also, I don't feel like that's true: Rust has the exact same non-blocking IO primitives in the stdlib as any other systems language: O_NONBLOCK, multiplexers, and so on. Combined with async/await syntax sugar for concurrency and backends like Tokio, I'm not sure how you end up at "Rust IO is still blocking".
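To make that concrete, the blocking/non-blocking switch is literally one call on the socket in std (hedged sketch; a real program would park on epoll/kqueue instead of spinning):

```rust
use std::io::{ErrorKind, Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    let mut stream = TcpStream::connect("example.com:80")?;

    // Do the write while still in blocking mode so it completes in one call.
    stream.write_all(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")?;

    stream.set_nonblocking(true)?; // sets O_NONBLOCK on Unix

    let mut buf = [0u8; 1024];
    loop {
        match stream.read(&mut buf) {
            Ok(0) => break, // peer closed the connection
            Ok(n) => println!("read {n} bytes"),
            // The non-blocking part: the kernel returns immediately
            // and we decide what to do, rather than the thread parking.
            Err(e) if e.kind() == ErrorKind::WouldBlock => {
                std::thread::yield_now(); // placeholder for a multiplexer wait
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```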
For a systems language it's great to finally have a lock-free library for some containers, but that's still not safe.
That's probably the most scary sentence in the whole article.
I wonder if this is connected to that Rust optimisation bounty post we saw the other day, where they couldn't get the safe Rust decoder closer than 5% to their C implementation. Maybe that's just the cost of safety.
r3tr0•1mo ago
I used humor and analogies in the article not just to be entertaining, but to make difficult concepts like memory ordering and atomics more approachable and memorable.
tombert•1mo ago
Have you looked at CTries before? They're pretty interesting, and I think are probably the future of this space.
bobbyraduloff•1mo ago
My suggestion to OP: this was interesting material, but ChatGPT made it hard to read. Use your own words to explain it. Most people interested in this deeply technical content would rather read your prompt than the output.
zbentley•1mo ago
Who knows, maybe someone accidentally over-weighted my writing by a factor of a trillion in ChatGPT’s training set?
r3tr0•1mo ago
The core ideas, jokes, code, and analogies are 100% mine.
Human chaos. Machine polish.