Everything goes cleanly away when there are no more waiters, and the kernel never even sees a mutex where there's no contention.
I would be interested in a technical deep dive of how the kernel manages these in a performant way, however.
EDIT: TIL about futex2 as well: https://docs.kernel.org/userspace-api/futex2.html
To prevent that, many operating systems allocate these 'queue objects' whenever threads are created and will attach a pointer to it from the thread object. Whenever a thread then stumbles upon a contended lock, it will effectively 'donate' this queue object to that lock, meaning that every lock having one or more waiters will have a linked list of 'queue objects' attached to it. When threads are woken up, they will each take one of those objects with them on the way out. But there's no guarantee that they will get their own queue object back; they may get shuffled! So by the time a thread terminates, it will free one of those objects, but that may not necessarily be the one it created.
I think the first operating system to use this method was Solaris. There they called these 'queue objects' turnstiles. The BSDs adopted the same approach, and kept the same name.
https://www.oreilly.com/library/view/solaristm-internals-cor...
https://www.bsdcan.org/2012/schedule/attachments/195_locking...
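A userspace-flavored sketch of that donation scheme, with made-up names (QueueObject, Lock, Thread); the real versions live in the kernel, hold scheduler locks, and handle priority inheritance, but the donate-on-block / take-any-on-wake shape is the same:

```cpp
#include <mutex>
#include <vector>

struct QueueObject {
    // In a real kernel this holds the list of blocked threads, a spinlock, etc.
};

struct Thread {
    QueueObject* owned;   // allocated when the thread is created
};

struct Lock {
    std::mutex internal;                 // stands in for the kernel's own locking
    std::vector<QueueObject*> donated;   // queue objects lent by blocked threads
};

// Thread t hits a contended lock: it donates its queue object to the lock.
void block_on(Lock& l, Thread& t) {
    std::lock_guard<std::mutex> g(l.internal);
    l.donated.push_back(t.owned);
    t.owned = nullptr;
    // ...actually putting the thread to sleep is elided...
}

// Thread t is woken: it leaves with *some* queue object,
// not necessarily the one it donated.
void on_wake(Lock& l, Thread& t) {
    std::lock_guard<std::mutex> g(l.internal);
    t.owned = l.donated.back();
    l.donated.pop_back();
}
```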
The book is quite clearly about concurrency in general, and not for a specific platform. The author of this article has set up a straw man to facilitate the writing and marketing of an otherwise moderately interesting article on futexes.
Personally I find the approach taken by this article more than a little distasteful - presenting from a point of exaggerated conflict is both tiresome and likely to confuse. This article could easily have been written from the perspective "what TAoMP doesn't tell you" and in that vein be taken a lot more collaboratively.
Of course it doesn't escape me that this blog is new, this article was posted by Phil, and Phil has promoted one of their other articles before.
So in no way was it meant to be a strawman around a "hey, learn about the futex!" post (as evidenced by the other complaints at the end about things that are lacking). The fact is, I was disappointed enough with the book that I put aside another post I was writing for it.
But as for Phil, we did work together several years ago, and he reads my stuff. I didn't just start writing, and have never had problems finding an audience in the past, Phil or not.
I appreciate that not everyone loves my style of humor, but I know when I read things with similar styles, it keeps me more engaged with the material.
Still, I am not trying to make jokes at the expense of actual people, so I'll take the note and try to avoid that, thanks.
There are definitely user-land TCP/IP implementations, thread implementations (Go's being particularly notable) and even file systems (FUSE).
You're 100% right that there are plenty of other considerations, often positive for lifting things out, like minimization of ring 0 attack surface.
If you can stay in user space, or stay in kernel space, that is likely to perform better than bouncing back and forth.
This is why sendfile is great; rather than storage -> kernel memory -> user memory -> kernel memory -> network, you get storage -> kernel memory -> network. That's two fewer copies, and fewer context switches.
[1] https://cis.temple.edu/~giorgio/cis307/readings/futex.pdf
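For instance, a minimal (error handling trimmed) use of Linux's sendfile(2) to push a whole file out a socket without the data ever entering user memory:

```cpp
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

// Send an entire regular file `in_fd` to a connected socket `sock_fd`.
ssize_t send_whole_file(int sock_fd, int in_fd) {
    struct stat st;
    if (fstat(in_fd, &st) < 0) return -1;

    off_t offset = 0;
    ssize_t total = 0;
    while (offset < st.st_size) {
        // Data flows storage -> kernel -> network; no copy into user space.
        ssize_t n = sendfile(sock_fd, in_fd, &offset, st.st_size - offset);
        if (n <= 0) return -1;
        total += n;
    }
    return total;
}
```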
This is such a frustrating stance that most standards have, honestly. "Well, obviously we can't expect the OS/language implementers to be able to reliably implement feature X ― let's just leave it to the application programmer to deal with; they are, after all, expected to have great skill sets and could easily work around it". Or rather, "well, we can't force feature X on the people who will actually implement the standard (they are the members of this very committee, after all), but we can trivially force the downstream users to cope with the feature's absence, because seriously, what can those losers do? Switch vendors?".
The standard didn't say "you must implement std::unordered_map as a hash table with chained buckets and extra memory allocations", but it did specify several guarantees that make it very difficult to implement hash tables with open addressing.
Every constraint that you specify potentially locks out a better implementation.
For recursive rwlocks, there's a lot of ways to implement them. Do you want to lock out high performance implementations that do less error checking, for example?
Open addressing means that the address of map[thing] could change on insert, which means iterators and pointers into the map can be invalidated by an insert.
The C++11 standard guarantees this won't happen for unordered_map. But that forces slower implementations.
And now people rely upon the standard so we can't change it. At best we do fast_unordered_map or unordered_map2 with different guarantees.
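For concreteness, this is the guarantee in question: pointers and references into the map survive arbitrary inserts (even across rehashes), which is what rules out straightforward open addressing.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> m;
    m[1] = "one";
    std::string* p = &m[1];   // pointer into the map

    for (int i = 2; i < 10000; ++i)
        m[i] = "x";           // lots of inserts, several rehashes

    assert(p == &m[1]);       // still valid: the element never moved
    assert(*p == "one");
}
```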
The most stupid thing about std::unordered_map is that it was standardized in 2011, so it isn't from 1998 like much of the C++ standard library; it's newer, and yet apparently nothing was learned.
And no, I don't want "high performance" lock implementations that regularly deadlock the whole process (deadlocked processes are not very performant) unless I wrap every single one of the several hundred uses of them in a test with dubiously-predictable branches, or worse, that just completely break the lock invariants (e.g., the lock is now permanently unlocked, even after a lock() on it succeeds). It really is not that important how fast you can do the wrong thing.
It would be better if the spec simply said it was disallowed.
On paper, unordered_map sounds great. It lists all the admirable properties you would theoretically want in a hashtable. Then in practice when you go to implement it, you realize that you've painted yourself into a garbage fire, as the saying goes.
I suppose this is a failing of the design by committee method, where the committee isn't directly responsible for implementation either before or during standard writing.
I hate it, but it's true
There's been a nice stream of improvements to futex2 since.
NUMA support (finally landing!), https://www.phoronix.com/news/FUTEX2-NUMA-Small-Futex https://www.phoronix.com/news/FUTEX2-Improvements-Linux-6.16 (see also this fantastic recent submission on NUMA in general, absolutely critical performance stuff, https://news.ycombinator.com/item?id=44936575)
io_uring support in 6.7 (2024), with a nice write-up on it speeding up PostgreSQL AIO: https://www.phoronix.com/news/IO_uring-FUTEX-Linux-6.7
Small requeue and single wait additions in 6.7, https://www.phoronix.com/news/Linux-6.7-Locking-FUTEX2
The linux equivalent of WFMO is select/poll/epoll.
> People often describe the futex as, "Wait on memory address". That overlooks the notification side, but it’s a much more apt name, and why Windows’ name for this API (WaitOnAddress) is superior API naming (to be fair, they did have a decade to think about the name).
The difference between an Address and an Object feels pretty abstract to me. The API surface otherwise feels extremely similar. So I'm not sure that there's a ton of ground to stand on for the distinction you are trying to draw. Your assertions could use some argumentation to back them up.
From the Futex2 pull requests in 5.16:
> Add a new sys_futex_waitv() syscall which allows to wait on multiple futexes. The main use case is emulating Windows' WaitForMultipleObjects which allows Wine to improve the performance of Windows Games.
https://github.com/digital-fabric/uringmachine/blob/main/ext...
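For reference, a rough sketch of what a futex_waitv call looks like (Linux >= 5.16; struct and flag names per the uapi headers as I understand them, error handling omitted):

```cpp
#include <linux/futex.h>    // struct futex_waitv, FUTEX_32
#include <sys/syscall.h>    // SYS_futex_waitv
#include <unistd.h>
#include <atomic>
#include <cstdint>
#include <ctime>

// Wait until any of the given 32-bit futex words changes from its current
// value. Returns the index of the woken futex, or -1 on error/timeout.
long wait_any(std::atomic<uint32_t>* words[], size_t n) {
    if (n == 0 || n > 128) return -1;   // the kernel caps the vector size
    futex_waitv waiters[128] = {};
    for (size_t i = 0; i < n; ++i) {
        waiters[i].uaddr = reinterpret_cast<uint64_t>(words[i]);
        waiters[i].val   = words[i]->load();   // expected value; kernel rechecks it
        waiters[i].flags = FUTEX_32;
    }
    return syscall(SYS_futex_waitv, waiters, n, 0, nullptr, CLOCK_MONOTONIC);
}
```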
While WaitForMultipleObjects was an advantage of Windows NT over UNIX, it was nothing new. IBM PL/I had an equivalent function already in 1965, almost 30 years before Windows NT.
The "wait" function of IBM PL/I was actually the model for the UNIX "wait", but the UNIX function was extremely simplified and much weaker than its model, like it was also the case with the many features inherited by UNIX from Multics. Unfortunately, many decades had to pass until the descendants of UNIX began to gain features comparable in sophistication with those of the ancestors of UNIX.
However the Microsoft WaitForSingleObject and WaitForMultipleObjects did not have an efficient implementation, which is why they had to add WaitOnAddress, the equivalent of Linux futex.
It is true, however, that the Linux futex had and still has some annoying limitations, like the futex value being only 32 bits instead of 64, and the fact that it is possible to wait on only a single event. Using atomic bit operations on the futex value it is actually possible to wait on multiple events, though not in the most efficient way; and this is exactly where the 32-bit size of the futex value becomes annoying.
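A rough sketch of that bit-twiddling workaround, assuming a single waiter (multiple waiters would need per-waiter masks); Linux-specific, error handling omitted:

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>
#include <climits>
#include <cstdint>

static long futex(void* addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

std::atomic<uint32_t> events{0};   // one bit per event source

void signal_event(unsigned bit) {
    events.fetch_or(1u << bit, std::memory_order_release);
    futex(&events, FUTEX_WAKE, INT_MAX);
}

// Returns a mask of the events observed (and clears them).
uint32_t wait_for_any_event() {
    for (;;) {
        uint32_t seen = events.exchange(0, std::memory_order_acquire);
        if (seen) return seen;
        // Sleep only if the word is still 0; otherwise the kernel returns
        // immediately and we re-check.
        futex(&events, FUTEX_WAIT, 0);
    }
}
```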
Therefore the work that attempts to combine the advantages of "futex" with some of the advantages of WaitForMultipleObjects is very welcome.
However, this does not ape Windows; it just reimplements techniques that are much older than Microsoft itself, and that were well known more than half a century ago.
I agree; Linux still only supporting 32-bit futexes is a bit baffling. The only reason the width matters is the race-condition check, but that's a huge reason. I'd want the option to wait on values as wide as whatever the underlying hardware supports, at least!
WaitForSingleObject/WaitForMultipleObjects wait on kernel objects, similar to poll. WaitOnAddress is lightweight synchronization, equivalent to futex. Windows doesn't have something like WaitForMultipleAddresses. futex_waitv was added because Wine implements NT kernel objects in userspace, and there were some semantic differences that made it hard to represent them as eventfds.
You can't really use semaphores to implement things that mutexes or semaphores can't already cover, so their overall utility is limited compared to futexes, which you can use to build condvars and other primitives too.
A futex doesn't need any kernel initialization, and from the kernel's perspective it doesn't exist at all when there are no waiters.
(See also CRITICAL_SECTION, which originally always had a kernel object created with it, but was later changed to create one on demand, falling back to a keyed event on failure. SRWLock only uses keyed events. A keyed event needs just one object per process, and otherwise consumes kernel resources only if there are waiters.)
Given that primitive, the Benaphore is a good way to use it, like if you've got a 1930s refrigerator and you've got a clever technique to reduce frost build-up - a modern fridge has a smarter controller and so it'll just defrost itself anyway automatically, no sweat. The Benaphore is thus redundant today - like that anti-frost technique for your 90 year old fridge.
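For anyone who hasn't seen one, a minimal Benaphore sketch over a POSIX semaphore; the atomic counter is the fast path, and the semaphore is only touched under contention:

```cpp
#include <atomic>
#include <semaphore.h>

class Benaphore {
    std::atomic<int> count{0};
    sem_t sem;
public:
    Benaphore()  { sem_init(&sem, 0, 0); }
    ~Benaphore() { sem_destroy(&sem); }

    void lock() {
        // Uncontended: count goes 0 -> 1 and we never touch the kernel.
        if (count.fetch_add(1, std::memory_order_acquire) > 0)
            sem_wait(&sem);   // someone holds it; sleep on the semaphore
    }
    void unlock() {
        // If anyone queued up behind us, hand the lock over via the semaphore.
        if (count.fetch_sub(1, std::memory_order_release) > 1)
            sem_post(&sem);
    }
};
```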
Follow it up with something appropriate to the language you're using, like C++ Concurrency in Action for C++ (much of it transfers to other languages).
It's just that the book ignores the very real changes that are at the core of modern concurrency (including async), and I don't believe that's a positive.
And while it did mention recursion, I looked really hard for it to cover things like shared lock-cleanup on `fork()` or thread crash, and a number of other important (often encountered) real-world concerns, and didn't find anything.
That doesn't help if the entire process dies for any reason and you want to clean up the locks. The solution to that is called "robust" locks. You can register a list of held futexes with the kernel using sys_set_robust_list, and when the thread dies the kernel will set a specific bit in each entry and wake a waiter if there is one.
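From userspace you normally reach this through "robust" pthread mutexes, which glibc backs with that robust list. A rough sketch of the recovery flow:

```cpp
#include <pthread.h>
#include <cerrno>

pthread_mutex_t mtx;

void init_robust_mutex() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    // Often combined with PTHREAD_PROCESS_SHARED for cross-process locks.
    pthread_mutex_init(&mtx, &attr);
    pthread_mutexattr_destroy(&attr);
}

bool lock_and_maybe_recover() {
    int rc = pthread_mutex_lock(&mtx);
    if (rc == EOWNERDEAD) {
        // The previous owner died while holding the lock.
        // ...repair the protected data here...
        pthread_mutex_consistent(&mtx);   // otherwise the mutex becomes unusable
    }
    return rc == 0 || rc == EOWNERDEAD;
}
```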
My biggest worry with that kind of thing is that the lock was guarding something which is now in an inconsistent state.
Without thoroughly understanding how/why the particular thread crashed, there's no guarantee that the data is in any sort of valid or recoverable state. In that case, crashing the whole app is absolutely a better thing to do.
It's really cool that the capabilities exist to do cleanup/recovery after a single thread crashed. But I think (off-the-cuff guess) that 95% of engineers won't know how to properly utilize robust locks with robust data structures, 4% won't have the time to engineer (including documentation) that kind of solution, and the last 1% are really really well-paid (or, should be) and would find better ways to prevent the crash from happening in the first place.
In a multi-threaded context, memory reads and writes can be reordered by hardware. It gets more complicated with caches. Imagine that core 1 writes to some address at (nearly) the same time that core 2 reads from it. Does core 2 read the old value or the new one? Especially if they don't share the same cache: core 1 might "write" to a given address, but the write only lands in core 1's cache and is merely "scheduled" to be written out to main memory. Meanwhile, core 2 later tries to read that address; it's not in its cache, so it pulls from main memory before core 1's cache has flushed. As far as core 2 is concerned, the write happened after its read, even though physically the write finished in core 1 before core 2's read instruction may even have started.
A memory barrier tells the hardware to ensure that what comes before in program order also "happens before" (or after) a given write to the same address. It's often (but not always) implemented as a cache and memory synchronization across cores.
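A minimal C++ illustration of that guarantee, where the release store "publishes" the data and the paired acquire load ensures the reader sees it:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void writer() {
    data = 42;                                      // plain write
    ready.store(true, std::memory_order_release);   // publish
}

void reader() {
    while (!ready.load(std::memory_order_acquire))  // pairs with the release
        ;                                           // spin until published
    assert(data == 42);   // guaranteed: the write to data happens-before this read
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}
```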
I found Fedor Pikus's CppCon 2017 presentation [1] to be informative, and Michael Wong's 2015 presentation [0] filled in some of the gaps.
C++, being a generic language targeting many hardware implementations, provides much more detailed concepts for memory ordering [2], which is important for hardware that has more granularity in barrier types than what most people are used to with x86-derived memory models.
[0]: https://www.youtube.com/watch?v=DS2m7T6NKZQ
[1]: https://www.youtube.com/watch?v=ZQFzMfHIxng
[2]: https://en.cppreference.com/w/cpp/atomic/memory_order.html
If you mean a barrier in terms of a memory "fence", that's an event on CPUs whereby any pending memory operations that have been pipelined but not yet committed are forced to commit and complete before continuing. It's usually only relevant to a single core, but fences are used to make sure that other cores will see the same memory values and that your pending writes are reflected (or, conversely, to make sure your own core sees reads from other cores as fresh as possible before the actual read op).
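If it helps, here's the same publish/observe pattern written with explicit fences in portable C++ (std::atomic_thread_fence standing in for the raw CPU fence instruction); a sketch, not tuned for any particular ISA:

```cpp
#include <atomic>

int payload = 0;
std::atomic<int> flag{0};

void producer() {
    payload = 7;
    std::atomic_thread_fence(std::memory_order_release);  // commit prior writes first
    flag.store(1, std::memory_order_relaxed);
}

int consumer() {
    while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire);  // don't hoist later reads above
    return payload;   // guaranteed to see 7
}
```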
"The requeue-once rule is enforced by only allowing requeueing to the futex previously passed to futex_wait_requeue_pi as uaddr2, so it's not possible to requeue from A to B, then from B to C - but it is possible to requeue from B to B.
When this happens, if (!q.rt_waiter) passes, so rt_mutex_finish_proxy_lock is never called. (Also, AFAIK, free_pi_state is never called, which is true even without this weird requeue; in the case where futex_requeue calls requeue_pi_wake_futex directly, pi_state will sit around until it gets cleaned up in exit_pi_state_list when the thread exits. This is not a vulnerability.) futex_wait_requeue_pi exits, and various pointers to rt_waiter become dangling. "
afr0ck•3h ago
The whole point is that implementing a mutex requires doing things that only the privileged OS kernel can do (e.g. efficiently blocking/unblocking processes). Therefore, for systems like Linux, it made sense to combine the features for a fast implementation.
viega•3h ago
And that can absolutely save a bunch of system calls, especially vs. polling mixed with `sleep()` or similar.
ajross•3h ago
It does not, in fact the two are fundamentally inseparable and the state of the memory address must be treated atomically with the waiting state. The magic of futex is that you can use a hardware atomic operation (c.f. lock cmpxchg on x86) to get the lock in the common/uncontended case, but if you have to wait you need to tell the kernel both that you need to wait and the address on which you're waiting, so it can use the same hardware interlocks along with its own state locking to put you to sleep race-free.
viega•3h ago
It's true you could use it that way, but it's not the way it's meant to be used, defeating the purpose by requiring a system call even for uncontended locks.
ajross•1h ago
It just doesn't "allocate" it on its own and lets the process use its own mapped memory. But to pretend that it doesn't have to do any work or that the memory is "separated" betrays some misunderstandings about what is actually happening here.
viega•39m ago
It does not care how you use the queue, at all. It doesn't have to be done with a locking primitive, whatsoever. You absolutely can use the exact same mechanism to implement a thread pool with a set of dormant threads, for instance.
The state check in the basic futex is only done to avoid a race condition. None of the logic of preventing threads from entering critical sections is in the purview of the kernel, either. That's all application-level.
And most importantly, no real lock uses a futex for the locking parts. As mentioned in the article, typically a mutex will directly try to acquire the lock with an atomic operation, like an atomic fetch-and-or, fetch-and-add, or even compare-and-swap.
A single atomic op, even if you go for full sequential consistency (which comes w/ full pipeline stalls), is still a lot better than a trip into the kernel when you can avoid it.
Once again, I'm not saying you couldn't use the futex state check to decide what's locked and what's not. I'm saying nobody should, and it was never the intent.
The intent from the beginning was to separate out the locking from the waiting, and I think that's pretty clear in the original futex paper (linked to in my article).
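To make that separation concrete, here is a sketch of the classic futex-backed mutex, roughly the shape popularized by Drepper's "Futexes Are Tricky" paper (0 = unlocked, 1 = locked, 2 = locked with possible waiters); Linux-specific, error handling omitted:

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>
#include <cstdint>

static long futex(void* addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

class Mutex {
    std::atomic<uint32_t> state{0};   // 0 unlocked, 1 locked, 2 locked w/ waiters
public:
    void lock() {
        uint32_t c = 0;
        // Fast path: 0 -> 1 with a single atomic op; the kernel never hears about it.
        if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
            return;
        // Contended: advertise waiters (state = 2), then sleep in the kernel.
        if (c != 2)
            c = state.exchange(2, std::memory_order_acquire);
        while (c != 0) {
            futex(&state, FUTEX_WAIT, 2);   // kernel rechecks state == 2 to avoid races
            c = state.exchange(2, std::memory_order_acquire);
        }
    }

    void unlock() {
        // Fast path: 1 -> 0 means nobody was waiting, so no syscall at all.
        if (state.fetch_sub(1, std::memory_order_release) != 1) {
            state.store(0, std::memory_order_release);
            futex(&state, FUTEX_WAKE, 1);   // hand off to one waiter
        }
    }
};
```

All the actual lock logic is in the atomics; the futex calls only park and wake threads.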
viega•3h ago
The point of the article anyway is that it's inexcusable to have a modern concurrency textbook and not cover the futex, since it's at the core of any efficient primitive on modern hardware.
koverstreet•3h ago
io_uring is supposed to be about solving this, but it's quite the kitchen sink so I have no idea how complete it is on the "arbitrary syscall async" front.
ajross•3h ago
In point of fact futex is really not a particularly simple syscall and has a lot of traps, see the man page. But the core idea is indeed "not that deep".
viega•3h ago
Many things are obvious in hindsight, but there was plenty of time for other people to do the same thing; it's not like we didn't know SysV semaphores didn't scale well.
"ad hoc" feels like an opinion here. My opinion is that when separation of concerns leads to big gains like the futex did, that's elegant, and an achievement. No need to diminish the good work done!
ajross•1h ago
Also, while googling for some examples for you I was reminded of this LWN article from a few years back that details some of the issues: https://lwn.net/Articles/823513/
viega•30m ago
The fact that Linux has extended it in so many ways is, in fact, a testament to how impactful the futex concept has been to the world of concurrency.
The fact that it's also at the core of other OSes' primitives does as well. On the macOS side, at least, those primitives have much simpler APIs. For instance, here's the main wait function:
`extern int os_sync_wait_on_address(void * addr, uint64_t value, size_t size, os_sync_wait_on_address_flags_t flags);`
There's also one with a timeout.
The wake side is equally simple, with two calls, one to wake one thread, one to wake all threads. No other number matters, so it's a great simplification in my view.
Your fundamental point is that the futex is actually a pretty unimportant construct. Clearly I don't agree, and it's okay not to agree, but I really am struggling to see your counter-argument.
If futexes aren't important to good locks, then, if modern OSes all felt compelled to jettison the futex for some reason, you'd have pthread implementations do ... what exactly??