
io_uring, libaio performance across Linux kernels and an unexpected IOMMU trap

https://blog.ydb.tech/how-io-uring-overtook-libaio-performance-across-linux-kernels-and-an-unexpected-iommu-trap-ea6126d9ef14
26•tanelpoder•3h ago

Comments

eivanov89•3h ago
Dear folks, I'm the author of that post.

A short summary below.

We ran fio benchmarks comparing libaio and io_uring across kernels (5.4 -> 7.0-rc3). The most surprising part wasn't the io_uring gains (~2x), but a ~30% regression caused by the IOMMU being enabled by default between releases.
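A comparison like this can be sketched with fio job files along these lines (hypothetical parameters; the device path and the exact options used in the post may differ):

```ini
; 4K random writes, one job per engine; stonewall serializes the runs
[global]
filename=/dev/nvme0n1   ; assumed device path
direct=1                ; bypass the page cache to measure the I/O path
rw=randwrite
bs=4k
iodepth=32
runtime=60
time_based=1
numjobs=1               ; single job, as in the post (latency-bound)

[libaio-run]
ioengine=libaio

[io_uring-run]
stonewall               ; wait for the libaio run to finish first
ioengine=io_uring
```

Running `fio jobfile.fio` then reports IOPS and latency percentiles per engine.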

Happy to share more details about the setup or help reproduce the results.

jcalvinowens•10m ago
Thanks for sharing this.

Was the IOMMU using strict or lazy invalidation? I think lazy is the default, but I'm not sure how long that's been true.
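For reference, on recent kernels the per-group domain type in sysfs shows whether the default DMA domain is strict or lazy (a quick check, assuming the usual sysfs layout):

```shell
# Strict invalidation shows as "DMA", lazy (flush-queue) as "DMA-FQ".
if ls /sys/kernel/iommu_groups/*/type >/dev/null 2>&1; then
    sort /sys/kernel/iommu_groups/*/type | uniq -c
else
    echo "no IOMMU groups (IOMMU off or not present)"
fi
# The mode can also be forced at boot with iommu.strict=1 (strict)
# or iommu.strict=0 (lazy) on the kernel command line.
```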

hcpp•3h ago
Why was 4K random write chosen as the main workload, and would the conclusion change with sequential I/O?
eivanov89•3h ago
That's a popular DBMS pattern. We chose writes over reads because, on many NVMe devices, writes are faster and it is easier to measure software latency.

I'd guess that in the case of sequential I/O the result would be similar. However, with larger blocks and fewer IOPS the difference might be smaller.

menaerus•50m ago
So perhaps a mixed read+write workload would be more interesting, no? Write-only is characteristic of ingestion workloads. That said, the libaio vs io_uring difference is interesting. Did you perhaps run a perf profile to understand where the differences are coming from? My gut feeling is that it's not necessarily an artifact of less context-switching with io_uring but something else.
eivanov89•35m ago
There are a couple of challenges with mixed read+write workloads on NVMe.

In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior.

For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.
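A related stand-in for an in-memory device (simpler to set up than ublk, though not equivalent) is the null_blk module, which emulates a block device with no real media. A minimal sketch, assuming root and the module available; the parameters are illustrative:

```shell
# Create one 4 GiB memory-backed emulated block device (/dev/nullb0).
if [ "$(id -u)" -eq 0 ] && modprobe null_blk nr_devices=1 gb=4 memory_backed=1 2>/dev/null; then
    ls -l /dev/nullb0   # the device node appears once the module loads
else
    echo "needs root and the null_blk module"
fi
```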

As for profiling: we didn’t run perf, so the following is my educated guess:

1. With libaio, control structures are copied as part of submission/completion. io_uring avoids some of this overhead via shared rings and pre-registered resources.

2. In our experience (in YDB), AIO syscall latency tends to be less predictable, even when well-tuned.

3. Although we report throughput, the setup is effectively latency-bound (single fio job). With more concurrency, libaio might catch up.

We intentionally used a single job because we typically aim for one thread per disk (two at most if polling is enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable.

tanelpoder•2h ago
Do I understand correctly that it's the interrupt-based I/O completion workloads that suffered from IOMMU overhead in your tests?

IOMMU may induce some interrupt remapping latency; I'd be interested in seeing:

1) interrupt counts (normalized to IOPS) from /proc/interrupts

2) "hardirqs -d" (bcc-tools) output for IRQ handling latency histograms

3) perf record -g output to see if something inside interrupt handling codepath takes longer (on bare metal you can see inside hardirq handler code too)

Would be interesting to see if with IOMMU each interrupt handling takes longer on CPU (or is the handling time roughly the same, but interrupt delivery takes longer). There may be some interrupt coalescing thing going on as well (don't know exactly what else gets enabled with IOMMU).
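The three measurements above could be collected roughly like this (a command sketch; tool availability, paths, and privileges vary, and `hardirqs` here is the bcc-tools version):

```shell
# 1) NVMe interrupt counts; sample twice and diff to normalize per IOPS
grep -i nvme /proc/interrupts || echo "no nvme interrupts listed"

# 2) and 3) need root and the tools installed, so guard them
if [ "$(id -u)" -eq 0 ]; then
    # IRQ handling latency distribution over a 10-second window
    { command -v hardirqs >/dev/null && hardirqs -d 10 1; } || true
    # system-wide profile with call graphs, including hardirq handlers
    { command -v perf >/dev/null && perf record -g -a -o /tmp/perf.data -- sleep 2; } || true
fi
```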

Since interrupts are raised "randomly", independently of whatever your app/kernel code is running on CPUs, it's a bit harder to visualize total interrupt overhead in something like flamegraphs, as the interrupt activity is all over the place in the chart. I used the flamegraph search/highlight feature to visually identify how much time the interrupt detours took during stress test execution.

Example here (scroll down a little):

https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usag...

eivanov89•2h ago
Unfortunately, we don't have proper measurements for IOPOLL mode with and without the IOMMU, because initially we didn't configure IOPOLL properly. However, I bet that mode is affected as well, because the disk still has to write through the IOMMU.

You suggest some very interesting measurements. I'll keep them in mind and try them during the next experiments. Wish I had read this before, so I could have applied it during the past runs :)

tanelpoder•2h ago
Yeah you'd still have the IOMMU DMA translation, but would avoid the interrupt overhead...
eivanov89•1h ago
BTW, the whole situation with IRQ accounting disabled reminds me of the -fomit-frame-pointer case. For a long time there was no practical performance reason, but the option was used anyway... making stacks slower and harder to build, both for perf analysis and for stack unwinding in languages like C++.

After careful reading I'm surprised how the small IRQ squares add up to 30%. I should search for interrupts when I inspect our flamegraphs next time.

tanelpoder•1h ago
I was doing over 11M IOPS during that test ;-)

Edit: I wrote about that setup and other Linux/PCIe root complex topology issues I hit back in 2021:

https://news.ycombinator.com/item?id=25956670

eivanov89•1h ago
That's super hot. Especially the update with the 37M IOPS reference. Might be very useful for my next tasks related to a setup with 6 NVMe disks:

1. Get all disks saturated through the network (including RDMA usage).

2. Play with io_uring to share a polling thread. Currently, no luck: if I share a kernel poller between two devices, the improvement is just +30% (at the cost of 1 core). Considering alternative schemes now.

LiteLLM Python package compromised by supply-chain attack

https://github.com/BerriAI/litellm/issues/24512
549•theanonymousone•4h ago•222 comments

No Terms. No Conditions

https://notermsnoconditions.com
67•bayneri•1h ago•12 comments

Hypothesis, Antithesis, Synthesis

https://antithesis.com/blog/2026/hegel/
60•alpaylan•1h ago•26 comments

LaGuardia pilots raised safety alarms months before deadly runway crash

https://www.theguardian.com/us-news/2026/mar/24/laguardia-airplane-pilots-safety-concerns-crash
134•m_fayer•1h ago•93 comments

Run a 1T parameter model on a 32gb Mac by streaming tensors from NVMe

https://github.com/t8/hypura
27•tatef•56m ago•11 comments

WolfGuard: WireGuard with FIPS 140-3 cryptography

https://github.com/wolfssl/wolfguard
13•789c789c789c•1h ago•3 comments

Show HN: Gemini can now natively embed video, so I built sub-second video search

https://github.com/ssrajadh/sentrysearch
37•sohamrj•1h ago•11 comments

Nanobrew: The fastest macOS package manager compatible with brew

https://nanobrew.trilok.ai/
85•syrusakbary•5h ago•47 comments

Testing the Swift C compatibility with Raylib (+WASM)

https://carette.xyz/posts/swift_c_compatibility_with_raylib/
19•LucidLynx•2d ago•6 comments

Microsoft's "Fix" for Windows 11: Flowers After the Beating

https://www.sambent.com/microsofts-plan-to-fix-windows-11-is-gaslighting/
720•h0ek•7h ago•534 comments

Tony Hoare and His Imprint on Computer Science

https://cacm.acm.org/blogcacm/tony-hoare-and-his-imprint-on-computer-science/
9•matt_d•3d ago•2 comments

Debunking Zswap and Zram Myths

https://chrisdown.name/2026/03/24/zswap-vs-zram-when-to-use-what.html
118•javierhonduco•6h ago•28 comments

Ripgrep is faster than grep, ag, git grep, ucg, pt, sift (2016)

https://burntsushi.net/ripgrep/
239•jxmorris12•10h ago•98 comments

Secure Domain Name System (DNS) Deployment 2026 Guide [pdf]

https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-81r3.pdf
65•XzetaU8•4h ago•3 comments

curl > /dev/sda: How I made a Linux distro that runs wget | dd

https://astrid.tech/2026/03/24/0/curl-to-dev-sda/
113•astralbijection•6h ago•44 comments

Opera: Rewind The Web to 1996 (Opera at 30)

https://www.web-rewind.com
157•thushanfernando•9h ago•91 comments

Box of Secrets: Discreetly modding an apartment intercom to work with Apple Home

https://www.jackhogan.me/blog/box-of-secrets/
236•jackhogan11•1d ago•86 comments

Apple Business

https://www.apple.com/newsroom/2026/03/introducing-apple-business-a-new-all-in-one-platform-for-b...
88•soheilpro•1h ago•72 comments

Log File Viewer for the Terminal

https://lnav.org/
261•wiradikusuma•11h ago•41 comments

LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?

https://dnhkng.github.io/posts/rys-ii/
61•realberkeaslan•6h ago•19 comments

So where are all the AI apps?

https://www.answer.ai/posts/2026-03-12-so-where-are-all-the-ai-apps.html
207•tanelpoder•2h ago•229 comments

The Jellies That Evolved a Different Way to Keep Time

https://www.quantamagazine.org/the-jellies-that-evolved-a-different-way-to-keep-time-20260320/
15•jyunwai•4d ago•5 comments

MSA: Memory Sparse Attention

https://github.com/EverMind-AI/MSA
69•chaosprint•3d ago•5 comments

iPhone 17 Pro Demonstrated Running a 400B LLM

https://twitter.com/anemll/status/2035901335984611412
686•anemll•1d ago•310 comments

Autoresearch on an old research idea

https://ykumar.me/blog/eclip-autoresearch/
406•ykumards•22h ago•89 comments

NanoClaw Adopts OneCLI Agent Vault

https://nanoclaw.dev/blog/nanoclaw-agent-vault/
88•turntable_pride•4h ago•25 comments

BIO – The Bao I/O Co-Processor

https://www.crowdsupply.com/baochip/dabao/updates/bio-the-bao-i-o-co-processor
73•hasheddan•2d ago•18 comments

FCC updates covered list to include foreign-made consumer routers

https://www.fcc.gov/document/fcc-updates-covered-list-include-foreign-made-consumer-routers
432•moonka•19h ago•286 comments

The bridge to wealth is being pulled up with AI

https://danielhomola.com/m%20&%20e/ai/your-bridge-to-wealth-is-being-pulled-up/
224•dankai•2h ago•246 comments