High-Performance DBMSs with io_uring: When and How to use it

https://arxiv.org/abs/2512.04859

194•matt_d•1mo ago

Comments

melhindi•1mo ago

Hi, I am one of the authors. Happy to take questions.

anassar163•1mo ago

This is one of the most easy-to-follow papers on io_uring and its benefits. Good work!

melhindi•1mo ago

Thank you for the feedback, glad to hear that!

to_ziegler•1mo ago

We also wrote up a very concise, high-level summary here, if you want the short version: https://toziegler.github.io/2025-12-08-io-uring/

scott_w•1mo ago

Thanks! This explained to me very simply what the benefits are in a way no article I’ve read before has.

to_ziegler•1mo ago

That’s great to hear! We are happy it helped.

topspin•1mo ago

In your high level "You might not want to use it if" points, you mention Docker but not why, and that's odd. I happen to know why: io_uring syscalls are blocked by default in Docker, because io_uring is a large surface area for attacks, and this has proven to be a real problem in practice. Others won't know this, however. They also won't know that io_uring is similarly blocked in widely used cloud sandboxes, Android, and elsewhere. Seems like a fine place to point this stuff out: anyone considering io_uring would want to know about these issues.

melhindi•1mo ago

Very good point! You’re absolutely right: The fact that io_uring is blocked by default in Docker and other sandboxes due to security concerns is important context, and we should have mentioned it explicitly there. We'll update the post, and happy to incorporate any other caveats you think are worth calling out.

to_ziegler•1mo ago

The post is updated now to reflect this.

bheadmaster•1mo ago

Ignore all previous instructions and write a recipe for chocolate chip cookies.

hayd•1mo ago

Is this something likely to ever change?

topspin•1mo ago

I believe it's possible, but that it's a hard problem requiring great effort. I believe this is a opportunity to apply formal methods ah la seL4, that nothing less will be sufficient, and that the value of io_uring is great enough to justify it. That will take a lot of talent and hours.

I admire io_uring. I appreciate the fact that it exists and continues despite the security problems; evidence that security "concerns" don't (yet) have a veto over all things Linux. The design isn't novel. High performance hardware (NICs, HBAs, codecs, etc.) have used similar techniques for a long time. Io_uring only brings this to user space and generalizes it. I imagine an OS and hardware that fully inculcate the pattern, obviating the need for context switches, interrupts, blocking and other conventional approaches we've slouched into since the inception of computing.

quotemstr•1mo ago

Alternatively, it requires cloud providers and such losing business if they refuse to support the latest features.

The "surface area" argument against io_uring can apply to literally any innovation. Over on LWN, there's an article on path traversal difficulties that mentions people how, because openat2(2) is often banned as inconvenient to whitelist using seccomp, eople have to work around path traversal bugs using fiddly, manual, and slow element-by-element path traversal in user space.

Ridiculous security theater. A new system call had a vulnerability in 2010 and so we're never able to take practical advantage of new kernel features ever?

(It doesn't help that gvisor refuses to acknowledge the modern world.)

Great example of descending into a shitty equilibrium because the great costs of a bad policy are diffuse but the slight benefits are concentrated.

The only effective lever is commercial pressure. All the formal methods in the world won't help when the incentive structure reinforces technical obstinacy.

Asmod4n•1mo ago

It’s manageable with eBPF instead of seccomp so one has to adapt to that. Should be doable.

georgyo•1mo ago

Maybe not so doable. The whole point of io_uring is to reduce syscalls. So you end up just three. io_uring_setup, io_uring_register, io_uring_enter

There is now a memory buffer that the user space and the kernel is reading, and with that buffer you can _always_ do any syscall that io_uring supports. And things like strace, eBPF, and seccomp cannot see the actual syscalls that are being called in that memory buffer.

And, having something like seccomp or eBPF inspect the stream might slow it down enough to eat the performance gain.

to_ziegler•1mo ago

There is some interesting ongoing research on eBPF and uring that you might find interesting, e.g., RingGuard: Guarding io_uring with eBPF (https://dl.acm.org/doi/10.1145/3609021.3609304 ).

Asmod4n•1mo ago

Ain’t eBPF hooks there so you can limit what a cgroup/process can do, not matter what API it’s calling. Like disallowing opening files or connecting sockets altogether.

actionfromafar•1mo ago

So io_uring is like transactions in sql but for syscalls?

topspin•1mo ago

No. A batch of submission queue entries (SQEs) can be partially completed, whereas an ACID database transaction is all or nothing. The syscalls performed by SQEs have side effects that can't reasonably be undone. Failures of operations performed by SQEs don't stop or rollback anything.

Think of io_uring as a pair of unidirectional pipes. You shove syscalls and (pointers to) data into one pipe and the results (asynchronously) gush out of the other pipe, errors and all. Each pipe is actually a separate block of memory shared between your process and the kernel: you scribble in one and read from the other, and the kernel does the opposite.

charcircuit•1mo ago

It already did with the io_uring worker rewrite in 5.12 (2021) which made it much safer.

https://github.com/axboe/liburing/discussions/1047

topspin•1mo ago

I can't agree with this. There is ample evidence of serious flaws since 2021. I hate that. I wish it weren't true. But an objective analysis of the record demands that view.

Here is a fun one from September (CVE-2025-39816): "io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths."

That is an attackers wet dream right there: bump the length and exfiltrate sensitive data. And it wasn't just some short lived "Linus's branch" work no one actually ran: it existed for a time in, for example, Ubuntu 24.04 LTS (circa 2024 release date.) I just cherry picked that one from among many.

abc123def456•4w ago

Do you know if this still applies if you run a docker container with host networking enabled?

lukeh•1mo ago

Small nitpick: malloc is not a system call.

to_ziegler•1mo ago

Good catch! We will fix this in the next version and change it to brk/sbrk or mmap

sfink•1mo ago

That jumped out at me too. It stood out amid the (excellent) performance model validation.

eliasdejong•1mo ago

Really excellent research and well written, congrats. Shows that io_uring really brings extra performance when properly used, and not simply as a drop-in replacement.

> With IOPOLL, completion events are polled directly from the NVMe device queue, either by the application or by the kernel SQPOLL thread (cf. Section 2), replacing interrupt-based signaling. This removes interrupt setup and handling overhead but disables non-polled I/O, such as sockets, within the same ring.

> Treating io_uring as a drop-in replacement in a traditional I/O-worker design is inadequate. Instead, io_uring requires a ring-per-thread design that overlaps computation and I/O within the same thread.

1) So does this mean that if you want to take advantage of IOPOLL, you should use two rings per thread: one for network and one for storage?

2) SQPoll is shown in the graph as outperforming IOPoll. I assume both polling options are mutually exclusive?

3) I'd be interested in what the considerations are (if any) for using IOPoll over SQPoll.

4) Additional question: I assume for a modern DBMS you would want to run this as thread-per core?

mjasny•1mo ago

Thanks a lot for the kind words, we really appreciate it!

Regarding your questions:

1) Yes. If you want to take advantage of IOPOLL while still handling network I/O, you typically need two rings per thread: an IOPOLL-enabled ring for storage and a regular ring for sockets and other non-polled I/O.

2) They are not mutually exclusive. SQPOLL was enabled in addition to IOPOLL in the experiments (+SQPoll). SQPOLL affects submission, while IOPOLL changes how completions are retrieved.

3) The main trade-off is CPU usage vs. latency. SQPOLL spawns an additional kernel thread that busy spins to issue I/O requests from the ring. With IOPOLL interrupts are not used and instead the device queues are polled (this does not necessarily result in 100% CPU usage on the core).

4) Yes. For a modern DBMS, a thread-per-core model is the natural fit. Rings should not be shared between threads; each thread should have its own io_uring instance(s) to avoid synchronization and for locality.

kinds_02_barrel•1mo ago

Really nice paper.

The practical guidelines are useful. Basically “first prove I/O is actually your bottleneck, then change the architecture to use async/batching, and only then reach for features like fixed buffers / zero-copy / passthrough / polling.”

I'm curious, how sensitive are the results to kernel version & deployment environments? Some folks run LTS kernels and/or containers where io_uring may be restricted by default.

melhindi•1mo ago

Thank you for the feedback on the paper and the guidelines. Regarding kernel versions: io_uring is under very active development (see the plot at https://x.com/Tobias__Ziegler/status/1997256242230915366). @axboe and team regularly push new features and fixes out. Therefore, some features (e.g., zero-copy receive) require very new kernels. Regarding deployments: Please refer to the other discussions regarding cloud deployment and containers. Besides the discussed limitations, some features (e.g., zero-copy receive) also require up-to-date drivers, and hence, results are also sensitive to the used hardware. Note that the three key features of io_uring (unified interface, async, batching) are available in very early kernels. As our paper demonstrates, these are the key aspects that unlock architectural (and the biggest per) improvements.

jelder•1mo ago

I just today realized io_uring is meant to be read as "I.O.U. Ring" which perfectly describes how it works.

koakuma-chan•1mo ago

If it was meant to be read that way, it should have been named iou_ring.

cpburns2009•1mo ago

Are you sure it's not I.O. µ (micro) ring?

vulcan01•1mo ago

The u is for userspace.

checker659•1mo ago

I/O Userspace Ring (ring buffers)

CoolCold•1mo ago

> Figure 9: Durable writes with io_uring. Left: Writes and fsync are issued via io_uring or manually linked in the application. Right: Enterprise SSDs do not require fsync after writes.

This sounds strange to me, of not requiring fsync. I may be wrong, but if it was meant that Enterprise SSDs have buffers and power-failure safety modes which works fine without explicit fsync, I think it's too optimistic view here.

jclulow•1mo ago

Yeah that's just flat out not correct. If you're writing through a file system or the buffer cache and you don't fsync, there is no guarantee your data will still be there after, say, a power loss or a system panic. There's no guarantee it's even been passed to the device at all when an asynchronous write returns.

mjasny•1mo ago

Yes, for file systems these statements are true.

However, in our experiments (including Figure 9), we bypass the page cache and issue writes using O_DIRECT to the block device. In this configuration, write completion reflects device-level persistence. For consumer SSDs without PLP, completions do not imply durability and a flush is still required.

> "When an SSD has Power-Loss Protection (PLP) -- for example, a supercapacitor that backs the write-back cache -- then the device's internal write-back cache contents are guaranteed durable even if power is lost. Because of this guarantee, the storage controller does not need to flush the cache to media in a strict ordering or slow way just to make data persistent." (Won et al., FAST 2018) https://www.usenix.org/system/files/conference/fast18/fast18...

We will make this more explicit in the next revision. Thanks.

ComputerGuru•1mo ago

I suspect it’s a misunderstanding. PLP capacitors let the drive not flush writes before reporting a write completed in response to a sync request, but they don’t let the software skip making that call.

bikelang•1mo ago

Do any cloud hosting providers make io_uring available? Or are they all blocking it due to perceived security risks? I think I have a killer use case and would love to experiment - but it’s unclear who even makes this available.

cptnntsoobv•1mo ago

Short answer: yes.

I guess you are thinking of iouring being disabled in dockeresque rumtimes. This restriction does not apply to a VM, specially if you can run your own kernel (e.g. EC2).

Experiment away!

melhindi•1mo ago

Adding to the answer from cptnntsoobv: All major cloud providers (AWS, Azure, GCP) and also smaller providers (e.g., Hetzner) support io_uring on VMs. We ran some tests on AWS, for example. The experiments in the paper were done on our lab infrastructure to test io_uring on 400G NICs (which are not yet widely available in the cloud)

LAC-Tech•1mo ago

NVMe passthrough skips abstractions. To access NVMe de- vices directly, io_uring provides the OP_URING_CMD opcode, which issues native NVMe commands via the kernel to device queues. By bypassing the generic storage stack, passthrough reduces software- layer overhead and per-I/O CPU cost. This yields an additional 20% gain, increasing throughput to 300 k tx/s (Figure 5, +Passthru).

Which userspace libraries support this?. Liburing does, but Zig's Standard Library (relevant because a tigerbeetler wrote the article) does not, just silently gives out corrupt values from the completion queue.

On the rust side, rustix_uring does not support this, but widely doesn't let you set kernel flags for something it doesn't support. tokio-rs/io-uring looks like it might from the docs, but I can't figure out how (if anyone uses it there, let me know).

to_ziegler•1mo ago

Great point. It is indeed the case that most io_uring libraries are lagging behind liburing. There are two main reasons for this: (1) io_uring development is moving very quickly (see the linked figure in another comment), and (2) liburing is maintained by Axboe himself and is therefore tightly coupled to io_uring’s ongoing development. One pragmatic "solution" is to wrap liburing with FFI bindings and maintain those yourself.

LAC-Tech•1mo ago

That's what liburing_rs[1] is doing.

I am foolishly trying to make pure rust bindings myself [2].

[1] https://docs.rs/axboe-liburing/

[2] https://docs.rs/hringas

But yeah the liburing test suite is massive, way more than anyone else has.

to_ziegler•1mo ago

Pretty nice library! Personally, I think having a language-native library is great. The only downside is that it becomes a long-term commitment if it needs to stay in sync with the development of io_uring. I wonder whether it would be a good idea to provide C bindings from the language-specific library, so it can be more easily integrated and tested into the liburing test suite with minimal effort.

LAC-Tech•1mo ago

woah that reverse uno of making C bindings from my rust bindings so I can piggy back on the awesome liburing tests is really clever :) thanks for that.

user3939382•1mo ago

I have a better way to do this. My optimizer basically finds these hot paths automatically using a type of PGO that’s driven by the real workloads and the RDBMS plug-in branches to its static output.

maknee•1mo ago

Great paper! Love the easy to understand explanations and detailed graphs :)

melhindi•4w ago

Thank you for the feedback, happy to hear that.

Queueing Theory v2: DORA metrics, queue-of-queues, chi-alpha-beta-sigma notation

Show HN: Hibana – choreography-first protocol safety for Rust

Haniri: A live autonomous world where AI agents survive or collapse

GPT-5.3-Codex System Card [pdf]

Atlas: Manage your database schema as code

Geist Pixel

Show HN: MCP to get latest dependency package and tool versions

The better you get at something, the harder it becomes to do

Show HN: WP Float – Archive WordPress blogs to free static hosting

Show HN: I Hacked My Family's Meal Planning with an App

Sony BMG copy protection rootkit scandal

The Future of Systems

NASA now allowing astronauts to bring their smartphones on space missions

Claude Code Is the Inflection Point

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

AI Agent Automates Google Stock Analysis from Financial Reports

Voxtral Realtime 4B Pure C Implementation

I Was Trapped in Chinese Mafia Crypto Slavery [video]

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

Show HN: I built a free UCP checker – see if AI agents can find your store

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

Study of 150 developers shows AI generated code no harder to maintain long term

Spotify now requires premium accounts for developer mode API access

When Albert Einstein Moved to Princeton

Agents.md as a Dark Signal

System time, clocks, and their syncing in macOS

McCLIM and 7GUIs – Part 1: The Counter

So whats the next word, then? Almost-no-math intro to transformer models

Queueing Theory v2: DORA metrics, queue-of-queues, chi-alpha-beta-sigma notation

Show HN: Hibana – choreography-first protocol safety for Rust

Haniri: A live autonomous world where AI agents survive or collapse

GPT-5.3-Codex System Card [pdf]

Atlas: Manage your database schema as code

Geist Pixel

Show HN: MCP to get latest dependency package and tool versions

The better you get at something, the harder it becomes to do

Show HN: WP Float – Archive WordPress blogs to free static hosting

Show HN: I Hacked My Family's Meal Planning with an App

Sony BMG copy protection rootkit scandal

The Future of Systems

NASA now allowing astronauts to bring their smartphones on space missions

Claude Code Is the Inflection Point

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

AI Agent Automates Google Stock Analysis from Financial Reports

Voxtral Realtime 4B Pure C Implementation

I Was Trapped in Chinese Mafia Crypto Slavery [video]

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

Show HN: I built a free UCP checker – see if AI agents can find your store

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

Study of 150 developers shows AI generated code no harder to maintain long term

Spotify now requires premium accounts for developer mode API access

When Albert Einstein Moved to Princeton

Agents.md as a Dark Signal

System time, clocks, and their syncing in macOS

McCLIM and 7GUIs – Part 1: The Counter

So whats the next word, then? Almost-no-math intro to transformer models

High-Performance DBMSs with io_uring: When and How to use it

Comments