I'm trying to understand why all command line tools don't use io_uring.
As an example, all my NVMe drives on USB 3.2 Gen 2 only reach 740MB/s peak.
If I use tools with aio or io_uring I get 1005MB/s.
I know I may not be copying many files simultaneously every time, but I guess the queue-depth strategies and the fewer locks also help.
But I agree. It would be cool if it were transparent, but this is actually what a bunch of io_uring runtimes do, using epoll as a fallback (e.g. monoio in Rust).
Besides, io_uring is not yet stable, and for all we know it may be replaced in 10 years by yet another mechanism that takes advantage of even newer hardware. So simply waiting for io_uring to prove it is here to stay is a very viable strategy. Besides, in 10 years we may have tools/AI that will do the rewrite automatically...
The *context() family of formerly-POSIX functions (clownishly deprecated as “use pthreads instead”) is essentially a full implementation of stackful coroutines. Even the arguable design botch of them preserving the signal mask (the reason why they aren’t the go-to option even on Linux) is theoretically fixable on the libc level without system calls, it’s just a lot of work and very few can be bothered to do signals well.
As far as stackless coroutines, there’s a wide variety of libraries used in embedded systems and such (see the recent discussion[1] for some links), which are by necessity awkward enough that I don’t see any of them becoming broadly accepted. There were also a number of language extensions, among which I’d single out AC[2] (from the Barrelfish project) and CPC[3]. I’d love for, say, CPC to catch on, but it’s been over a decade now.
[1] https://news.ycombinator.com/item?id=44546640
Because it's fairly new. The coreutils package which contains the ls command (and the three earlier packages which were merged to create it) is decades old; io_uring appeared much later. It will take time for the "shared ring buffer" style of system call to win over traditional synchronous system calls.
strace -c ls gave me this (columns: % time, seconds, usecs/call, calls, errors):

    100.00    0.002709    13    198     5    total

strace -c eza gave me this:

    100.00    0.006125    12    476    48    total

strace -c lsr gave me this:

    100.00    0.001277    33     38          total
So, looking at the number of syscalls in the calls column:
198 : ls
476 : eza
38 : lsr
A meaningful difference indeed!
Over 12x fewer system calls = others wait less for the kernel to handle their system calls
That isn't how it works. There isn't a fixed syscall budget distributed among running programs. Internally, the kernel is taking many of the same locks and resources to satisfy io_uring requests as ordinary syscall requests.
Also, more fs-related system calls mean fewer available kernel threads to process these system calls, e.g. XFS can parallelize mutations only up to its number of allocation groups (agcount).
Again, this just isn't true. The same "stat" operations are being performed one way or another.
> Also, more fs-related system calls mean less available kernel threads to process these system calls.
Generally speaking sync system calls are processed in the context of the calling (user) thread. They don't consume kernel threads generally. In fact the opposite is true here -- io_uring requests are serviced by an internal kernel thread pool, so to the extent this matters, io_uring requests consume more kernel threads.
Again, it just is true.
More fs-related operations mean fewer kthreads available for others. More syscalls mean more OS overhead. It's that simple.
Theoretically "intr" mounts allowed signals to interrupt operations waiting on a hung remote server, but Linux removed the option long ago[1] (FreeBSD still supports it)[2]. "soft" might be the only workaround on Linux.
[1]: https://man7.org/linux/man-pages/man5/nfs.5.html
[2]: https://man.freebsd.org/cgi/man.cgi?query=mount_nfs&sektion=...
I don't know how io_uring solves this - does it return an error if the underlying NFS call times out? How long do you wait for a response before giving up and returning an error?
I don't agree that it was a reasonable tradeoff. Making an unreliable system emulate a reliable one is the very thing I find to be a bad idea. I don't think this is unique to NFS, it applies to any network filesystem you try to present as if it's a local one.
> What does vi do when the server hosting the file you're editing stop responding? None of these tools have that kind of error handling.
That's exactly why I don't think it's a good idea to just pretend a network connection is actually a local disk. Because tools aren't set up to handle issues with it being down.
Contrast it with approaches where the client is aware of the network connection (like HTTP/GRPC/etc)... the client can decide for itself how long it should retry failed requests, whether it should bubble up failures to the caller, or work "offline" until it gets an opportunity to resync, etc. With NFS the syscall just hangs forever by default.
Distributed systems are hard, and NFS (and other similar network filesystems) just pretend it isn't hard at all, which is great until something goes wrong, and then the abstraction leaks.
(Also I didn't say io_uring solves this, but I'm curious as to whether its performance would be any better than blocking calls.)
Yes, lsr also colors the output, but it doesn't know as many things as eza does.
For example, .opus will show up as a music icon and with the right color (green-ish in my case?) in eza, whereas it shows up as any normal file in lsr.
Really no regrets though; it's quite easy to patch, I think, but yes, this is rock solid and really fast, I must admit.
Can you please create more such things for cat and other system utilities too?
Also love that it's using tangled.sh, which uses atproto; kinda interesting too.
I also like that it's written in Zig, which imo feels much easier for me to touch as a novice than Rust (sry Rustaceans).
Then all the software wanting to use io_uring wouldn't need to write their low-level things twice.
... no. It's just not interesting or particularly valuable to optimize ls, and Jens probably just used it as a demo and didn't want to keep it around.
These are actual discovered vulnerabilities, typically assigned CVEs and often exploited in sandbox escapes or privilege escalations:

1. CVE-2021-3491 (kernel 5.11+)
Type: Privilege escalation
Mechanism: Failure to check CAP_SYS_ADMIN before registering io_uring restrictions allowed unprivileged users to bypass sandboxing.
Impact: Bypass of security policy mechanisms.

2. CVE-2022-29582
Type: Use-after-free (UAF)
Mechanism: io_uring allowed certain memory structures to be freed and reused improperly.
Impact: Local privilege escalation.

3. CVE-2023-2598
Type: Race condition
Mechanism: A race in the io_uring timeout code could lead to memory corruption.
Impact: Arbitrary code execution or kernel crash.

4. CVE-2022-2602, CVE-2022-1116, etc.
Type: UAF and out-of-bounds access
Impact: Escalation from containers or sandboxed processes.

5. Exploit Tooling
Tools like io_uring_shock and custom kernel exploits often target io_uring in container-escape scenarios (esp. with Docker or LXC).
Implicit Vulnerabilities (Architectural and Latent Risks)
These are not necessarily exploitable today, but reflect deeper systemic design risks or assumptions.
1. Shared Memory Abuse
io_uring uses shared rings (memory-mapped via mmap) between kernel and user space.
Risk: If ring buffer memory management has reference count bugs, attackers could force races, data corruption, or misuse stale pointers.
2. User-Controlled Kernel Pointers
Some features allow user-specified buffers, SQEs, and CQEs to reference arbitrary memory (e.g. via IORING_OP_PROVIDE_BUFFERS, IORING_OP_MSG_RING).
Risk: Incomplete validation could allow crafting fake kernel structures or triggering speculative attacks.
3. Speculative Execution & Side Channels
Since io_uring relies on pre-submitted work queues and long-lived kernel threads, it opens timing side channels.
Risk: Predictable scheduling or timing leaks, esp. combined with hardware speculation (Spectre-class).
4. Bypassing seccomp or AppArmor Filters
io_uring operations can effectively batch or obscure syscall behavior.
Example: A program restricted from calling sendmsg() directly might still use io_uring to perform similar actions.
Risk: Policy enforcement tools become less effective, requiring explicit io_uring filtering.
5. Poor Auditability
The batched and asynchronous nature makes logging or syscall audit trails incomplete or confusing.
Risk: Harder for defenders or monitoring tools to track intent or detect misuse in real time.
6. Ring Reuse + Threaded Offload
With IORING_SETUP_SQPOLL or IORING_SETUP_IOPOLL, I/O workers can run in kernel threads detached from user context.
Risk: Desynchronized security context can lead to privileged operations escaping sandbox context (e.g., post-chroot but pre-fork).
7. File Descriptor Reuse and Lifecycle Mismatch
Some operations in io_uring rely on fixed file descriptors or registered files. Race conditions with FD reuse or closing can cause inconsistencies.
Risk: UAF, type confusion, or logic bombs triggered by kernel state confusion.
Emerging Threat Vectors
eBPF + io_uring
Some exploits chain io_uring with eBPF to do arbitrary memory reads or writes. e.g., io_uring to perform controlled allocations, then eBPF to read or write memory.
io_uring + userfaultfd
Combining userfaultfd with io_uring allows very fine-grained control over page faults during I/O — great for fuzzing, also for exploit primitives.
A bit off-topic too, but I'm new to Zig and curious. This here:

```
const allocator = sfb.get();
var cmd: Command = .{ .arena = allocator };
```

means that all allocations need to be written with an allocator in mind? I.e. does one have to pick an allocator for each memory allocation? Or is there a default one?

There are cases where you do want to change your code based on the expectation that you will be provided a special kind of allocator (e.g. arenas), but that's a more niche thing, and in any case it all comes together pretty well in practice.
But yes, there is a default allocator, std.heap.page_allocator
You should basically only use the page allocator if you're writing another allocator.
Nit: an allocator is not a "memory model", and I very much want the memory model to not change under my feet.
In libraries. If you're just writing a final product, it's totally fine to pick one and use it everywhere.
> std.heap.page_allocator
Strongly recommend against using this allocator as a "default": it will take a trip to kernel-land on each allocation.
https://www.gnu.org/prep/maintain/maintain.html#Copyright-Pa...
Good luck getting that upstreamed and accepted. The more foundational the tools (and GNU coreutils definitely is foundational), the more difficult that process will be.
Releasing a standalone utility makes iteration much faster, partially because one is not bound to the release cycles of distributions.
Which certainly is a valid way of prioritizing. Similarly, distros/users may prioritize stability, which means the theoretical improvement would now be stuck in not-used-land. The value of software appears when it's run, not when it's written.
Have you ever tried to contribute to open source projects?
The question was why someone writing software would take the route likely to end in rejection/failure. I don't know about you, but if I write software, I am not going to write it for a project whose managers will make it difficult for my PR to be accepted, and where it's 99% likely it never will be.
I will always contribute to the project likely to appreciate my work and incorporate it.
I'll share an anecdote: I got involved with a project, filed a couple PRs that were accepted (slowly), and then I talked about refactoring something so it could be tested better and wasn't so fragile and tightly coupled to IO. "Sounds great" was the response.
So I did the refactor. Filed a PR and asked for code review. The response was (after a long time waiting) "thanks but no, we don't want this." PR closed. No feedback, nothing.
I don't even use the software anymore. I certainly haven't tried to fix any bugs. I don't like being jerked around by management, especially when I'm doing it for free.
(For the record, I privately forked the code and run my own version that is better because by refactoring and then writing tests, I discovered a number of bugs I couldn't be arsed to file with the original project.)
Yes, and it was often painful enough to make me consider very carefully whether I want to bother contributing. I can only imagine how terrible the experience must be at a core utility such as ls.
> The question was why wouldn't someone writing software not take the route likely to end in rejection/failure
Obviously they wouldn't - in my comment I assumed that the lsr author aimed for providing a better ls for people and tried to offer a perspective with a different definition of what success is.
> I don't like being jerked around by management, especially when I'm doing it for free
I get that. The older OSS projects become, the more they fossilize too, and that makes it more annoying to contribute. But you can try to see it from the maintainers' perspective too: they have actual people relying on the program being stable and are often also not paid. No one is forcing you to contribute to their project, but if you don't want to deal with existing maintainers, you won't have their users enjoying your patchset. Know what you want to achieve and act accordingly, is all I'm trying to say.
Newer ones can be just as braindead, if they came out of some commercial entity. CLAs and such.
How `more` became `less`.
The name of 'more' was from paging - rather than having text scroll off the screen, it would show you one page, then ask if you wanted to see 'more' and scroll down.
'less' is a joke by the less authors. 'less is more' etc.
* https://freshports.org/sysutils/most/
* https://ftp.netbsd.org/pub/pkgsrc/current/pkgsrc/misc/most/i...
* https://packages.debian.org/sid/most
One can even get pg still, with Illumos-based systems, even though it was actually taken out of the SUS years ago. This goes to show that what's standard is not the same as what exists, of course.
* https://illumos.org/man/1/pg
* https://pubs.opengroup.org/onlinepubs/9699919799.2008edition...
Plus, since I actually took stevie and screen and others from comp.sources.unix and worked on them, and wasn't able to even send my improvements to M. Salz or the original authors at all, from my country, I can attest that contributing improvements had hurdles just as large to overcome back then as there exist now. They're just different.
Still, I have yet to come across tests that simulate a typical real-life application workload.
I've heard of fio but have yet to check how exactly it works and whether it can simulate a real-life application workload.
it's a good first approximation to test the cartesian product of
- sequential/random
- reads/writes
- in arbitrary sizes
- with arbitrarily many workers
- with many different backends to perform such i/o including io_uring
and its reporting is solid and thorough
implementing the same for your specific workload is often not trivial at all
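To give a flavor of those knobs, a minimal fio job file for a sequential-read test through io_uring might look like this (the file path, size, block size, and queue depth here are arbitrary example values, not recommendations):

```ini
[seqread]
ioengine=io_uring        ; use the io_uring backend instead of sync/libaio
rw=read                  ; sequential reads (randread for random)
bs=128k                  ; block size per request
size=1g                  ; total I/O per job
iodepth=32               ; outstanding requests kept in flight
numjobs=1                ; number of worker threads/processes
filename=/tmp/fio-testfile
```

Run it with `fio jobfile.fio`; swapping `ioengine` between `sync`, `libaio`, and `io_uring` on the same job is an easy way to see how much the submission mechanism alone matters.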
I remember getting into a situation during the ext2 and spinning-rust days where production directories had 500k files. ls processes were slow enough to overload everything. ls -f saved me there.
And filesystems got a lot better at lots of files. What filesystem was used here?
It's interesting how well busybox fares, it's written for size not speed iirc?
Two points are not enough to say it's sublinear. It might very well be some constant factor that becomes less and less important the bigger the linear factor becomes.
Or in other words 10000n+C < 10000(n+C)
But for lsr, it's 9.34. The other tools have factors close to 10.09 or higher. Since ls has to sort its output (unless -f is specified), I'd not be too surprised by a little superlinearity.
https://docs.google.com/spreadsheets/d/1EAYua3B3UeTGBtAejPw2...
Most of the coreutils are not fast enough to actually utilize modern SSDs.
Thanks for the comment, didn't know that!
Yes I know uring is an async interface, but it’s trivial to implement sync behavior on top of a single chain of async send-wait pairs, like doing a simple single threaded “conversational” implementation of a network protocol.
It wouldn’t make a difference in most individual cases but overall I wonder how big a global speed boost you’d get by removing a ton of syscalls?
Or am I failing to understand something about the performance nuances here?
- Start some sort of async executor thread to service the io_uring requests/responses
- Make it so every call to "normal" syscalls causes the calling thread to sleep until the result is available (that's 1 syscall)
- When the executor thread gets a result, have it wake up the original thread (that's another syscall)
So you're basically turning 1 syscall into 2 in order to emulate the legacy syscalls.
io_uring only makes sense if you're already async. Emulating sync on top of async is nearly always a terrible idea.
io_uring requires API changes because you don't call it like the old read(2)'s "please fill this buffer". You maintain a pool of buffers that belong to the ring, and reads take buffers from the pool. You consume the data from the buffer and return it to the pool.
With the older style, you're required to maintain O(pending_reads) buffers. With the io_uring style, you have a pool of O(num_reads_completing_at_once) buffers (I assume with backpressure, but I haven't actually checked).
Locales also bring in a lot more complicated sorting - so that could be a factor also.
I'm curious how lsr compares to bfs -ls for example. bfs only uses io_uring when multiple threads are enabled, but maybe it's worth using it even for bfs -j1
Currently downloading zig to build it.