io_uring, libaio performance across Linux kernels and an unexpected IOMMU trap

https://blog.ydb.tech/how-io-uring-overtook-libaio-performance-across-linux-kernels-and-an-unexpected-iommu-trap-ea6126d9ef14

7•tanelpoder•1h ago

Comments

eivanov89•1h ago

Dear folks, I'm the author of that post.

A short summary below.

We ran fio benchmarks comparing libaio and io_uring across kernels (5.4 -> 7.0-rc3). The most surprising part wasn’t io_uring gains (~2x), but a ~30% regression caused by IOMMU being enabled by default between releases.

Happy to share more details about the setup or help reproduce the results.
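For anyone who wants to check whether their own box is in the "IOMMU enabled" situation before comparing numbers, here is a minimal sketch. It is not from the post. It assumes the usual kernel command-line switches (`intel_iommu=`, `amd_iommu=`, `iommu=`, `iommu.passthrough=`) and the `/sys/kernel/iommu_groups` directory; the helper names are made up for illustration.

```python
import os

# Prefixes of kernel command-line options that commonly control the IOMMU.
# Assumption: this list is illustrative, not exhaustive.
IOMMU_PREFIXES = ("iommu=", "iommu.", "intel_iommu=", "amd_iommu=")


def iommu_cmdline_options(cmdline):
    """Return the IOMMU-related tokens found in a kernel command line string."""
    return [tok for tok in cmdline.split() if tok.startswith(IOMMU_PREFIXES)]


def iommu_groups_present(path="/sys/kernel/iommu_groups"):
    """Return True if the IOMMU groups directory exists and is non-empty."""
    try:
        return len(os.listdir(path)) > 0
    except OSError:
        # Directory missing or unreadable: treat as "no groups visible".
        return False
```

On a live system you would pass the contents of `/proc/cmdline` to `iommu_cmdline_options`. An empty result only means nothing was set explicitly, so the kernel default applies, which is exactly the case that changed between releases in the post.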

hcpp•1h ago

Why was 4K random write chosen as the main workload, and would the conclusion change with sequential I/O?

eivanov89•1h ago

That's a popular DBMS pattern. We chose writes over reads because on many NVMe devices writes are faster, which makes it easier to measure software latency.

I guess the result would be similar for sequential I/O. However, with larger blocks and fewer IOPS, the difference might be smaller.

tanelpoder•51m ago

I understand that it's the interrupt-based I/O completion workloads that suffered from IOMMU overhead in your tests?

IOMMU may induce some interrupt remapping latency. I'd be interested in seeing:

1) interrupt counts (normalized to IOPS) from /proc/interrupts

2) "hardirqs -d" (bcc-tools) output for IRQ handling latency histograms

3) perf record -g output to see if something inside interrupt handling codepath takes longer (on bare metal you can see inside hardirq handler code too)

Would be interesting to see if with IOMMU each interrupt handling takes longer on CPU (or is the handling time roughly the same, but interrupt delivery takes longer). There may be some interrupt coalescing thing going on as well (don't know exactly what else gets enabled with IOMMU).
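As a rough illustration of item 1 (interrupt counts normalized to IOPS), here is a minimal sketch that diffs two snapshots of `/proc/interrupts` and divides by the number of completed I/Os. The function names, the `"nvme"` substring filter, and the idea of taking the I/O count from the benchmark's own report are assumptions made for the example, not something from the post.

```python
def parse_interrupts(text, name_filter="nvme"):
    """Sum per-CPU interrupt counts over all /proc/interrupts rows
    whose line contains name_filter (for example NVMe queue IRQs)."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row looks like: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        if name_filter not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ label ("33:"); the next ncpus fields are counts.
        counts = fields[1:1 + ncpus]
        total += sum(int(c) for c in counts if c.isdigit())
    return total


def interrupts_per_io(before, after, completed_ios, name_filter="nvme"):
    """Interrupts raised per completed I/O between two snapshots."""
    delta = parse_interrupts(after, name_filter) - parse_interrupts(before, name_filter)
    return delta / completed_ios
```

To use it, read `/proc/interrupts` right before and right after the run and pass the total number of I/Os the benchmark reports as completed. Comparing the ratio with and without IOMMU would show whether the number of interrupts changes, or only the cost of each one.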

Since interrupts are raised "randomly", independently from whatever your app/kernel code is running on CPUs, it's a bit harder to visualize total interrupt overhead in something like flamegraphs, as the interrupt activity is all over the place in the chart. I used the flamegraph search/highlight feature to visually identify how much time the interrupt detours took during stress test execution.

Example here (scroll down a little):

https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usag...

eivanov89•36m ago

Unfortunately, we don't have proper measurements for IOPOLL mode with and without IOMMU, because initially we didn't configure IOPOLL properly. However, I bet this mode is affected as well, because the disk still has to write through the IOMMU.

You suggest some very interesting measurements. I will keep them in mind and try them in the next experiments. I wish I had read this earlier so I could have applied it to the past runs :)

tanelpoder•30m ago

Yeah you'd still have the IOMMU DMA translation, but would avoid the interrupt overhead...
