Presumably the intent record is large (containing the key-value data) while the completion record is tiny (containing just the index of the intent record). Is the point that the completion record write is guaranteed to be atomic because it fits in a disk sector, while the intent record doesn't?
Write intent record (async)
Perform operation in memory
Write completion record (async)
*Wait for intent and completion to be flushed to disk*
Return success to client
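The recovery rule implied by the steps above can be sketched as follows. This is a hedged illustration, not the blog post's actual code: record names, the JSON framing, and the `intent`/`completion`/`recover` helpers are my assumptions. Large intent records carry the key/value payload, tiny completion records carry only the intent's sequence number, and replay applies an operation only when both records made it to disk.

```python
import json

def intent(seq, key, value):
    # Large record: carries the actual key/value payload.
    return json.dumps({"type": "intent", "seq": seq, "key": key, "value": value})

def completion(seq):
    # Tiny record: references the intent by sequence number.
    return json.dumps({"type": "completion", "seq": seq})

def recover(log_lines):
    intents, completed = {}, set()
    for line in log_lines:
        rec = json.loads(line)
        if rec["type"] == "intent":
            intents[rec["seq"]] = rec
        else:
            completed.add(rec["seq"])
    state = {}
    for seq in sorted(intents):      # replay in log order
        if seq in completed:         # drop ops lacking a completion record
            state[intents[seq]["key"]] = intents[seq]["value"]
    return state

# Simulate a crash after intent 2 was flushed but its completion was not:
log = [intent(1, "a", "x"), completion(1), intent(2, "b", "y")]
print(recover(log))  # → {'a': 'x'}  (op 2 is silently dropped on replay)
```

Note that this shows exactly the concern raised in the thread: an operation whose completion record never reached disk is dropped on replay, regardless of what the client was told.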
"Wait for intent and completion to be flushed to disk"

If you wait for both to complete, then how can it be faster than doing a single IO?

"During recovery, I only apply operations that have both intent and completion records. This ensures consistency while allowing much higher throughput."
Does this mean a client could receive a success response for a request and yet, if the system crashed immediately afterwards, the replay wouldn't necessarily include that request?
How does that not violate ACID?
So I fail to see how the two async writes are any guarantee at all. It sounds like they just happen to provide better consistency than the one async write because it forces an arbitrary amount of time to pass.
Seems like OP’s async approach removes that, so there’s no durability guarantee, so why even maintain a WAL to begin with?
So there is no guarantee that operations are committed, but by virtue of their not being acknowledged to the application (asynchronous), the recovery replay will be consistent.
I could see it being problematic for any data where the order of operations matters, but that's the trade-off for performance. This does seem to be an improvement in ensuring asynchronous IO always results in a consistent recovery.
Yup. OP says "the intent record could just be sitting in a kernel buffer", but then the exact same issue applies to the completion record. So confirmation to the client cannot be issued until the completion record has been written to durable storage. Not really seeing the point of this blogpost.
I always use this approach for crash-resistance:
- Append to the data (WAL) file normally.
- Have a separate small file that holds a hash + length describing the valid WAL state.
- First append to WAL file.
- Start an fsync call on the WAL file; in parallel, create a new hash/length file under a different name and fsync it.
- Rename the hash/length file over the real one, since rename is atomic.
- Update in-memory state to reflect the files and return from the write function call.
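The steps above can be sketched roughly as follows. File names, record framing, and the helper names are my assumptions, and the two fsyncs run sequentially here for simplicity, whereas the recipe above starts the WAL fsync and the sidecar write in parallel:

```python
import hashlib
import os

def append_record(wal_path, sidecar_path, payload: bytes):
    # 1. Append to the WAL file and fsync it.
    with open(wal_path, "ab") as wal:
        wal.write(payload)
        wal.flush()
        os.fsync(wal.fileno())
        length = wal.tell()
    # 2. Write a fresh hash/length sidecar under a temporary name, fsync it.
    with open(wal_path, "rb") as wal:
        digest = hashlib.sha256(wal.read(length)).hexdigest()
    tmp = sidecar_path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(f"{digest} {length}\n".encode())
        f.flush()
        os.fsync(f.fileno())
    # 3. rename() is atomic, so readers always observe a consistent
    #    (hash, length) pair describing a fully-written WAL prefix.
    os.rename(tmp, sidecar_path)

def recover_length(wal_path, sidecar_path):
    # On recovery, trust only the WAL prefix the sidecar vouches for.
    with open(sidecar_path, "rb") as f:
        digest, length = f.read().split()
    with open(wal_path, "rb") as wal:
        data = wal.read(int(length))
    assert hashlib.sha256(data).hexdigest() == digest.decode()
    return int(length)
```

(For full durability of the rename itself, an fsync of the containing directory would also be needed; that step is omitted above.)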
Curious if anyone knows tradeoffs between this and doing double WAL. Maybe doing fsync on everything is too slow to maintain fast writes?
I learned about the append/rename approach from these, in case anyone is interested:
- https://discuss.hypermode.com/t/making-badger-crash-resilien...
- https://research.cs.wisc.edu/adsl/Publications/alice-osdi14....
I think this database doesn't have durability at all.
During recovery, since the server applies only the operations that have both records, you can fail to recover an operation that was already acknowledged as successful to the client.
The problem with naive async I/O, in a database context at least, is that you lose the durability guarantee that makes databases useful. When a client receives a success response, their expectation is that the data will survive a system crash. But with async I/O, by the time you send that response, the data might still be sitting in kernel buffers, not yet written to stable storage.
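The durable alternative is to block the acknowledgement on the flush. A minimal sketch, with illustrative names of my own (this is not anyone's actual API): success is externalized to the client only after `fsync` confirms the write reached stable storage.

```python
import os
import tempfile

def handle_write(wal_fd: int, record: bytes) -> bool:
    os.write(wal_fd, record)
    os.fsync(wal_fd)   # block until the kernel confirms durability
    return True        # only now is it safe to acknowledge the client

# Usage: append one record to a WAL file and ack only after the fsync.
wal_path = os.path.join(tempfile.mkdtemp(), "wal")
fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
assert handle_write(fd, b"op-1\n")
os.close(fd)
```

In practice the per-request fsync cost is usually amortized with group commit: many concurrent requests share a single fsync, so throughput recovers without giving up the durability guarantee.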
Shouldn't you just tie the successful response to a successful fsync?
Async or sync, I'm not sure what's different here.
For example, we never externalize commits without full fsync, to preserve durability [0].
Further, the motivation for TigerBeetle having both a prepare WAL and a header WAL is different: not performance (we get performance elsewhere, through batching) but correctness, cf. "Protocol-Aware Recovery for Consensus-Based Storage" [1].
Finally, TigerBeetle's recovery is more intricate; we do all this to survive TigerBeetle's storage fault model. You can read the actual code here [2], and Kyle's Jepsen report on TigerBeetle also provides an excellent overview [3].
[0] https://www.youtube.com/watch?v=tRgvaqpQPwE
[1] https://www.usenix.org/system/files/conference/fast18/fast18...
[2] https://github.com/tigerbeetle/tigerbeetle/blob/main/src/vsr...