Presumably the intent record is large (containing the key-value data) while the completion record is tiny (containing just the index of the intent record). Is the point that the completion record write is guaranteed to be atomic because it fits in a disk sector, while the intent record doesn't?
Write intent record (async)
Perform operation in memory
Write completion record (async)
*Wait for intent and completion to be flushed to disk*
Return success to client
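The recovery rule implied by the steps above can be sketched as follows. This is a hedged illustration, not the blog post's actual code: record names, the JSON framing, and the `intent`/`completion`/`recover` helpers are my assumptions. Large intent records carry the key/value payload, tiny completion records carry only the intent's sequence number, and replay applies an operation only when both records made it to disk.

```python
import json

def intent(seq, key, value):
    # Large record: carries the actual key/value payload.
    return json.dumps({"type": "intent", "seq": seq, "key": key, "value": value})

def completion(seq):
    # Tiny record: references the intent by sequence number.
    return json.dumps({"type": "completion", "seq": seq})

def recover(log_lines):
    intents, completed = {}, set()
    for line in log_lines:
        rec = json.loads(line)
        if rec["type"] == "intent":
            intents[rec["seq"]] = rec
        else:
            completed.add(rec["seq"])
    state = {}
    for seq in sorted(intents):      # replay in log order
        if seq in completed:         # drop ops lacking a completion record
            state[intents[seq]["key"]] = intents[seq]["value"]
    return state

# Simulate a crash after intent 2 was flushed but its completion was not:
log = [intent(1, "a", "x"), completion(1), intent(2, "b", "y")]
print(recover(log))  # → {'a': 'x'}  (op 2 is silently dropped on replay)
```

Note that this shows exactly the concern raised in the thread: an operation whose completion record never reached disk is dropped on replay, regardless of what the client was told.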
"Wait for intent and completion to be flushed to disk"

If you wait for both to complete, then how can it be faster than doing a single IO?

"During recovery, I only apply operations that have both intent and completion records. This ensures consistency while allowing much higher throughput."
Does this mean a client could receive a success response for a request and yet, if the system crashed immediately afterwards, the replay wouldn't necessarily include that request?
How does that not violate ACID?
So I fail to see how the two async writes are any guarantee at all. It sounds like they just happen to provide better consistency than the one async write because it forces an arbitrary amount of time to pass.
Seems like OP’s async approach removes that, so there’s no durability guarantee, so why even maintain a WAL to begin with?
So there is no guarantee that operations are committed, but by virtue of their not being acknowledged to the application (asynchronous), the recovery replay will be consistent.
I could see it being problematic for any data where the order of operations matters, but that's the trade-off for performance. This does seem to be an improvement in ensuring asynchronous IO always results in a consistent recovery.
Yup. OP says "the intent record could just be sitting in a kernel buffer", but then the exact same issue applies to the completion record. So confirmation to the client cannot be issued until the completion record has been written to durable storage. Not really seeing the point of this blogpost.
I always use this approach for crash-resistance:
- Append to the data (WAL) file normally.
- Have a separate small file that holds a hash + length describing the valid WAL state.
- First append to WAL file.
- Start an fsync call on the WAL file; in parallel, create a new hash/length file under a different name and fsync it.
- Rename the hash/length file over the real one, since rename is atomic.
- Update in-memory state to reflect the files and return from the write function call.
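The steps above can be sketched roughly as follows. File names, record framing, and the helper names are my assumptions, and the two fsyncs run sequentially here for simplicity, whereas the recipe above starts the WAL fsync and the sidecar write in parallel:

```python
import hashlib
import os

def append_record(wal_path, sidecar_path, payload: bytes):
    # 1. Append to the WAL file and fsync it.
    with open(wal_path, "ab") as wal:
        wal.write(payload)
        wal.flush()
        os.fsync(wal.fileno())
        length = wal.tell()
    # 2. Write a fresh hash/length sidecar under a temporary name, fsync it.
    with open(wal_path, "rb") as wal:
        digest = hashlib.sha256(wal.read(length)).hexdigest()
    tmp = sidecar_path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(f"{digest} {length}\n".encode())
        f.flush()
        os.fsync(f.fileno())
    # 3. rename() is atomic, so readers always observe a consistent
    #    (hash, length) pair describing a fully-written WAL prefix.
    os.rename(tmp, sidecar_path)

def recover_length(wal_path, sidecar_path):
    # On recovery, trust only the WAL prefix the sidecar vouches for.
    with open(sidecar_path, "rb") as f:
        digest, length = f.read().split()
    with open(wal_path, "rb") as wal:
        data = wal.read(int(length))
    assert hashlib.sha256(data).hexdigest() == digest.decode()
    return int(length)
```

(For full durability of the rename itself, an fsync of the containing directory would also be needed; that step is omitted above.)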
Curious if anyone knows tradeoffs between this and doing double WAL. Maybe doing fsync on everything is too slow to maintain fast writes?
I learned about the append/rename approach from these, in case anyone is interested:
- https://discuss.hypermode.com/t/making-badger-crash-resilien...
- https://research.cs.wisc.edu/adsl/Publications/alice-osdi14....
I think this database doesn't have durability at all.
During recovery, since the server applies only the operations that have both records, you can fail to recover an operation that was already acknowledged as successful to the client.
The problem with naive async I/O, in a database context at least, is that you lose the durability guarantee that makes databases useful. When a client receives a success response, their expectation is that the data will survive a system crash. But with async I/O, by the time you send that response, the data might still be sitting in kernel buffers, not yet written to stable storage.
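The durable alternative is to block the acknowledgement on the flush. A minimal sketch, with illustrative names of my own (this is not anyone's actual API): success is externalized to the client only after `fsync` confirms the write reached stable storage.

```python
import os
import tempfile

def handle_write(wal_fd: int, record: bytes) -> bool:
    os.write(wal_fd, record)
    os.fsync(wal_fd)   # block until the kernel confirms durability
    return True        # only now is it safe to acknowledge the client

# Usage: append one record to a WAL file and ack only after the fsync.
wal_path = os.path.join(tempfile.mkdtemp(), "wal")
fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
assert handle_write(fd, b"op-1\n")
os.close(fd)
```

In practice the per-request fsync cost is usually amortized with group commit: many concurrent requests share a single fsync, so throughput recovers without giving up the durability guarantee.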
Shouldn't you just tie the successful response to a successful fsync?
Async or sync, I'm not sure what's different here.
For example, we never externalize commits without full fsync, to preserve durability [0].
Further, the motivation for TigerBeetle having both a prepare WAL and a header WAL is different: not performance (we get performance elsewhere, through batching) but correctness, cf. "Protocol-Aware Recovery for Consensus-Based Storage" [1].
Finally, TigerBeetle's recovery is more intricate; we do all this to survive TigerBeetle's storage fault model. You can read the actual code here [2], and Kyle's Jepsen report on TigerBeetle also provides an excellent overview [3].
[0] https://www.youtube.com/watch?v=tRgvaqpQPwE
[1] https://www.usenix.org/system/files/conference/fast18/fast18...
[2] https://github.com/tigerbeetle/tigerbeetle/blob/main/src/vsr...