I guess that in the case of sequential I/O the result would be similar. However, with larger blocks and fewer IOPS, the difference might be smaller.
In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior.
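For the preconditioning step, fio has built-in steady-state detection that can drive the device into a steady state before measuring. A minimal sketch of such a job, where the device name, block size, and thresholds are my assumptions rather than anything from the original runs:

```ini
; hypothetical preconditioning job: stop once all IOPS samples in a 30s
; window stay within 2% of the window mean (fio steady-state detection);
; /dev/nvme0n1 is an assumed device -- adjust for your setup
[precondition]
filename=/dev/nvme0n1
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=32
time_based=1
runtime=20m
steadystate=iops:2%
steadystate_duration=30
```

With this, the 20-minute runtime is only an upper bound; fio exits earlier once the steady-state criterion is met.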
For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.
As for profiling: we didn’t run perf, so the following is my educated guess:
1. With libaio, control structures are copied as part of submission/completion. io_uring avoids some of this overhead via shared rings and pre-registered resources.
2. In our experience (in YDB), AIO syscall latency tends to be less predictable, even when well-tuned.
3. Although we report throughput, the setup is effectively latency-bound (single fio job). With more concurrency, libaio might catch up.
We intentionally used a single job because we typically aim for one thread per disk (two at most if polling enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable.
IOMMU may induce some interrupt remapping latency. I'd be interested in seeing:
1) interrupt counts (normalized to IOPS) from /proc/interrupts
2) "hardirqs -d" (bcc-tools) output for IRQ handling latency histograms
3) perf record -g output to see if something inside interrupt handling codepath takes longer (on bare metal you can see inside hardirq handler code too)
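The first of these can be scripted directly. A minimal sketch, assuming a Linux box; the `sleep` stands in for the actual fio run, and the resulting per-vector deltas can then be divided by fio's reported IOPS:

```shell
# Snapshot /proc/interrupts before and after the workload and print how
# many times each numeric IRQ vector fired in between.
snap() {
    awk '/^ *[0-9]+:/ {s=0; for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) s+=$i; print $1, s}' \
        /proc/interrupts | sort
}
snap > /tmp/irq_before
sleep 1          # in a real run, the fio job executes here
snap > /tmp/irq_after
# join on the IRQ number and print the delta for each vector that fired
join /tmp/irq_before /tmp/irq_after | awk '$3 > $2 {print $1, $3 - $2}'
```

Filtering the `awk` pattern down to lines matching your NVMe device's IRQ names makes the output less noisy on busy machines.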
Would be interesting to see if with IOMMU each interrupt handling takes longer on CPU (or is the handling time roughly the same, but interrupt delivery takes longer). There may be some interrupt coalescing thing going on as well (don't know exactly what else gets enabled with IOMMU).
Since interrupts are raised "randomly", independently from whatever your app/kernel code is running on CPUs, it's a bit harder to visualize total interrupt overhead in something like flamegraphs, as the interrupt activity is all over the place in the chart. I used flamegraph search/highlight feature to visually identify how much time the interrupt detours took during stress test execution.
Example here (scroll down a little):
https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usag...
You suggest very interesting measurements. I will keep them in mind and try them during the next experiments. Wish I had read this before, so I could have applied it during the past runs :)
After careful reading I'm surprised how the small IRQ squares add up to 30%. I should search for interrupts when I inspect our flamegraphs next time.
Edit: I wrote about that setup and other Linux/PCIe root complex topology issues I hit back in 2021:
eivanov89•3h ago
A short summary below.
We ran fio benchmarks comparing libaio and io_uring across kernels (5.4 -> 7.0-rc3). The most surprising part wasn't the io_uring gains (~2x), but a ~30% regression caused by IOMMU being enabled by default between releases.
Happy to share more details about the setup or to reproduce the results.
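The engine comparison boils down to flipping one fio option between runs. A sketch of such a job, where the device, queue depth, and runtime are my assumptions and not the original benchmark's parameters:

```ini
; hypothetical comparison job -- swap the ioengine between runs
[global]
filename=/dev/nvme0n1
direct=1
rw=randread
bs=4k
iodepth=128
numjobs=1
time_based=1
runtime=60

[randread]
ioengine=libaio
; for the io_uring run, use instead:
;   ioengine=io_uring
;   registerfiles=1
;   fixedbufs=1
```

The `registerfiles` and `fixedbufs` options enable io_uring's pre-registered file descriptors and buffers, which is where part of its advantage over libaio's per-call copying comes from.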
jcalvinowens•10m ago
Was the iommu using strict or lazy invalidation? I think lazy is the default but I'm not sure how long that's been true.
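One way to check, assuming a kernel that exposes per-group domain types in sysfs: the default domain type reads as "DMA-FQ" for lazy (flush-queue) invalidation and plain "DMA" for strict.

```shell
# Count IOMMU groups by default domain type; prints nothing if the
# machine has no IOMMU groups (IOMMU off or not present).
if [ -d /sys/kernel/iommu_groups ]; then
    cat /sys/kernel/iommu_groups/*/type 2>/dev/null | sort | uniq -c
fi
# To force a mode at boot: iommu.strict=1 (strict) or iommu.strict=0 (lazy)
# on the kernel command line.
```

Strict mode flushes the IOTLB on every unmap, so it would plausibly amplify exactly the kind of per-I/O overhead seen in the regression.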