If you care about WAL write/commit latency, you could provision a smallish EBS io2 Block Express volume (with provisioned IOPS) just for your WAL files, while the rest of your data still resides on cheaper EBS storage. And you might not even need to hugely overprovision your WAL device's IOPS, as databases can batch commit writes for multiple transactions.
But the main point is that once your WAL files are on a completely separate block device from all the other datafile I/O, they won't suffer from the various read & write I/O bursts that can happen during regular database activity. On Oracle databases, I put controlfiles on these separate devices too, as they are on the critical path during redo log switches...
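As a rough sketch of what that can look like in practice (device names and paths below are made up, and the server has to be stopped for the move; for a brand-new cluster, initdb --waldir does the same thing without the symlink):

    # format and mount the dedicated WAL device (names are just examples)
    mkfs.xfs /dev/nvme1n1
    mkdir -p /pgwal
    mount /dev/nvme1n1 /pgwal
    chown postgres:postgres /pgwal

    # with PostgreSQL stopped, move pg_wal there and leave a symlink behind
    systemctl stop postgresql
    mv /var/lib/pgsql/data/pg_wal /pgwal/pg_wal
    ln -s /pgwal/pg_wal /var/lib/pgsql/data/pg_wal
    systemctl start postgresql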
One thing that I try to achieve anyway is to spread out and smooth the database checkpoint & fsync activity over time via the database checkpointing parameters, so you don't get huge "I/O storms" every 15 minutes, just steady writing of dirty buffers going on all the time. So even if all your files are stored on the same block device, you're less likely to see your WAL writes waiting behind 50,000 checkpoint write requests issued just before them.
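The settings I mean are roughly these (values are purely illustrative, not recommendations):

    # spread each checkpoint's writes over ~90% of the checkpoint interval
    psql -c "ALTER SYSTEM SET checkpoint_timeout = '15min'"
    psql -c "ALTER SYSTEM SET checkpoint_completion_target = 0.9"
    # size WAL so checkpoints are triggered by time, not by running out of WAL space
    psql -c "ALTER SYSTEM SET max_wal_size = '32GB'"
    psql -c "SELECT pg_reload_conf()"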
    open_datasync     249.578 ops/sec     4007 usecs/op
    fdatasync         608.573 ops/sec     1643 usecs/op
open_datasync (i.e. O_DSYNC) ends up as FUA writes, fdatasync() as a plain write followed by a cache flush. On just about anything else, a single FUA write is either the same speed as a write + fdatasync, or considerably faster.
This is pretty annoying, as using O_DSYNC is a lot more suitable for concurrent WAL writes, but because Samsung SSDs are widespread, changing the default would regress performance substantially for a good number of users.
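You can check how the two methods behave on your own storage with pg_test_fsync and, if open_datasync clearly wins there, override the default yourself; a sketch (the test file path is just an example, and it should sit on the same device as the WAL):

    # compare fsync methods against a file on the WAL device
    pg_test_fsync -f /pgwal/pg_test_fsync.tmp

    # only if open_datasync was clearly faster on this storage
    psql -c "ALTER SYSTEM SET wal_sync_method = 'open_datasync'"
    psql -c "SELECT pg_reload_conf()"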
singron•1d ago
tanelpoder•1d ago
I tend to set up a small but completely separate block device (usually on enterprise SAN storage or a cloud block store) just for WAL/redo logs, so that there's a different device with its own queue for them. That way, when a big database checkpoint or fsync happens against the datafiles, the thousands of concurrently submitted I/O requests won't get in the way of WAL writes that still need to complete fast. I've done something similar in the past with separate filesystem journal devices too (for niche use cases...)
Edit: Another use case for this is that ZFS users can put the ZIL on low-latency devices, while keeping the main storage on lower cost devices.
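For example (pool and device names are made up):

    # add a small low-latency device as a separate ZFS intent log (SLOG)
    zpool add tank log /dev/nvme0n2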
natmaka•1d ago
I'm not sure about this, as that separate device might contribute more to the total (aggregated) work as a member of a single pool (a RAID made of all available non-spare devices) used by the PostgreSQL server.
It seems to me that in most cases the most efficient setup, even when trying hard to reduce the maximum latency (and therefore sacrificing some throughput), is a single pool AND adequate I/O scheduling enforcing a "max latency" parameter.
If, during peaks of activity, your WAL-dedicated device isn't permanently at 100% usage while the data pool is, then dedicating it may (overall) bump up the max latency and reduce throughput.
Tweaking some parameters (bgwriter, full_page_writes, wal_compression, wal_writer_delay, max_wal_senders, wal_level, wal_buffers, wal_init_zero...) with respect to the usage profile (max tolerated latency, OLTP, OLAP, proportion of SELECTs and INSERTs/UPDATEs, I/O subsystem characteristics and performance, kernel parameters...) is key.
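As a sketch only, tweaking a few of these could look like this (illustrative values; the right ones depend entirely on the workload and PostgreSQL version):

    psql -c "ALTER SYSTEM SET wal_compression = 'lz4'"      # PostgreSQL 15+; older versions only take on/off
    psql -c "ALTER SYSTEM SET wal_writer_delay = '200ms'"
    psql -c "ALTER SYSTEM SET bgwriter_lru_maxpages = 400"
    psql -c "ALTER SYSTEM SET wal_buffers = '64MB'"         # needs a server restart to take effect
    psql -c "SELECT pg_reload_conf()"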
tanelpoder•1d ago
All this depends on what kind of storage backend you're on: local consumer SSDs with just one NVMe namespace each, local SSDs with multiple namespaces (each with its own queues), or a full-blown enterprise storage backend where you have no idea what's really going on in the backend :-)
[1]: https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-th...
Edit: Note that I wasn't proposing using an entire physical disk device (or multiple ones) for the low-latency files, just a part of one. Local enterprise-grade SSDs support multiple namespaces (with their own internal queues), so you can carve out just 1% of a drive for separate I/O processing. And with enterprise SAN arrays (or cloud elastic block store offerings) this works too: you don't know how many physical disks are involved in the backend anyway, but at your host OS level you get a separate I/O queue that is not gonna be full of thousands of checkpoint writes.
fendale•1d ago
What do you mean by namespaces here? Are they created by having different partitions or LVM volumes? Since you mentioned that consumer-grade SSDs only have a single namespace, I'm guessing this is something that needs some config when mounting the drive?
tanelpoder•23h ago
Namespaces show up in Linux as separate block devices, all backed by the same physical SSD:

    /dev/nvme0n1  /dev/nvme0n2  /dev/nvme0n3  ...
Consumer disks support only a single namespace, as far as I've seen. Different namespaces give you extra flexibility; I think some drives even support different sector sizes for different namespaces.
So under the hood you'd still be using the same NAND storage, but the controller can now process incoming I/Os with awareness of which "logical device" they came from. So even if your data volume has managed to submit a burst of 1000 in-flight I/O requests via its namespace, the controller can still pick the latest I/Os from the other (redo volume) namespaces to be served as well, without having to serve that whole burst first.
So you can create a high-priority queue by using multiple namespaces on the same device. It's like logical partitioning of the SSD's I/O handling capability, not physical partitioning of disk space like OS "fdisk"-level partitioning would be. OS "fdisk" partitioning or LVM mapping is not related to NVMe namespaces at all.
Also, I'm not an NVMe SSD expert, but this is my understanding and my test results agree so far.
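For reference, creating an extra namespace with nvme-cli looks roughly like this (sizes and IDs are made up, and drive/firmware support varies):

    # how many namespaces does the controller support? (the "nn" field)
    nvme id-ctrl /dev/nvme0 | grep "^nn "

    # create a small namespace (sizes are in logical blocks) and attach it
    nvme create-ns /dev/nvme0 --nsze=2097152 --ncap=2097152 --flbas=0
    nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=1
    nvme ns-rescan /dev/nvme0    # /dev/nvme0n2 should now show up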
fendale•23h ago
tanelpoder•22h ago