For security, I implemented Tailscale, which adds only ~5ms of latency while completely eliminating public network exposure - a worthwhile tradeoff for the significant security benefits.
My optimization approach includes:
- Workload-specific configuration generated via PGTune (https://pgtune.leopard.in.ua/)
- Real-time performance monitoring with PgHero for identifying bottlenecks
- Automated VACUUM ANALYZE operations scheduled via pg_cron targeting write-heavy tables, which prevents performance degradation and helps me sleep soundly
- A custom CLI utility I built for ZSTD-compressed backups that achieves impressive compression ratios while maintaining high throughput, with automatic S3 uploading (the general shape is sketched below): https://github.com/overflowy/pgbackup
This setup has been remarkably stable and performant, handling our workloads with substantial headroom for growth.
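The general shape of that backup pipeline is roughly the following (a rough sketch with made-up database and bucket names, not the actual tool):

    # custom-format dump with built-in compression disabled; zstd does the compression instead
    pg_dump -Fc -Z0 mydb \
      | zstd -T0 -3 -c \
      | aws s3 cp - "s3://my-backups/mydb/$(date +%F).dump.zst"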
You are just wrapping pg_dump, which is not a full-featured backup solution. Great for a snapshot...
Use some of the existing tools and you get point-in-time recovery, easy restores to hot standbys for replication, a good failover story, backup rotations, etc.
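A minimal sketch of what wiring one of those tools in looks like, using pgBackRest's documented archive commands as the example (the stanza name is made up):

    # postgresql.conf
    wal_level = replica        # the default since v10
    archive_mode = on
    archive_command = 'pgbackrest --stanza=main archive-push %p'
    # only consulted during a point-in-time restore
    restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'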
I have been using Barman indirectly through CloudNativePG at my latest company, but don't have the operational experience to speak on it yet.
When it's set up properly, it's solid as a rock. I'd really recommend you check it out again; it likely solves everything you did more elegantly, and also covers a ton of things you didn't think of. Been there, done that :)
Why don't you set the autovacuum_analyze_scale_factor parameter (or autovacuum_analyze_threshold) per table and let autovacuum handle this?
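Something along these lines usually makes the scheduled job unnecessary (the table name and thresholds are only illustrative):

    -- analyze after ~1% of rows change, vacuum after ~2%, instead of the 10%/20% defaults
    ALTER TABLE orders SET (
        autovacuum_analyze_scale_factor = 0.01,
        autovacuum_vacuum_scale_factor  = 0.02
    );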
Note that we do not offload IO to workers when doing I/O that the caller will synchronously wait for, just when the caller actually can do IO asynchronously. That reduces the need to avoid the offload cost.
It turns out, as some of the results in Lukas' post show, that the offload to the worker is often actually beneficial particularly when the data is in the kernel page cache - it parallelizes the memory copy from kernel to userspace and postgres' checksum computation. Particularly on Intel server CPUs, which have had pretty mediocre per-core memory bandwidth in the last ~ decade, memory bandwidth turns out to be a bottleneck for page cache access and checksum computations.
Edit: Fix negation
Explicitly a goal.
You can turn it on today, with a bunch of caveats (via debug_io_direct=data). If you have the right workload - e.g. read-only with lots of seqscans, bitmap index scans, etc. - you can see rather substantial perf gains. But it'll still suck in many cases in 18.
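For the adventurous, the knob looks like this (a sketch; it's a developer-level setting, and the shared_buffers value is just a placeholder):

    # postgresql.conf
    debug_io_direct = 'data'   # bypass the kernel page cache for data files
    shared_buffers = '64GB'    # placeholder: it now has to hold your working set on its own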
We need at least:
- AIO writes in checkpointer, bgwriter and backend buffer replacement (think bulk loading data with COPY)
- readahead support in a few more places, most crucially index range scan (works out ok today if the heap is correlated with the index, sucks badly otherwise)
EDIT: Formatting
On Intel it's also not hard to simply saturate the per-core memory bandwidth with modern storage HW. This matters most prominently for writes by the checkpointing process, which needs to compute data checksums given the current postgres implementation (if enabled). But even for reads it can be a bottleneck, e.g. when prewarming the buffer pool after a restart.
If it's interesting, I can dig up the reproducer I had at some point.
Is there a page anywhere that collects these sorts of "turn the whole hardware security layer off" switches that can be flipped to get better throughput out of modern x86 CPUs, when your system has no real attack surface to speak of (e.g. air-gapped single-tenant HPC)?
Aside from a few problems in specific Linux kernel versions, it works great.
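The catch-all on the Linux side is the combined mitigations= kernel parameter (a sketch; the exact GRUB workflow varies by distro):

    # /etc/default/grub, then regenerate the config (update-grub / grub2-mkconfig)
    GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"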
They will eventually figure out using b-trees for tables too.
InnoDB uses a clustered index approach. The primary key index is a B-tree. The actual table data is stored in the leaf nodes of this B-tree. Secondary indexes point to the primary key.
One is not better than the other in general terms. InnoDB's clustered B-tree approach shines when:
- You frequently access data in primary key order
- Your workload has many range scans on the primary key
- You need predictable performance for primary key lookups
- Your data naturally has a meaningful ordering that matches your access patterns
PostgreSQL's heap approach excels when:
- You frequently update non-key columns (fewer page splits/less reorganization)
- You have many secondary indexes (they're smaller because they don't embed the primary key)
- Your access patterns vary widely and don't follow one particular field
- You need faster table scans when indexes aren't applicable
I personally find PostgreSQL's approach more flexible for complex analytical workloads with unpredictable access patterns, while InnoDB's clustered approach feels more optimized for OLTP workloads with predictable key-based access patterns. The "better" system depends entirely on your specific workload, data characteristics, and access patterns.
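For what it's worth, the closest Postgres gets to a clustered layout is a one-off physical reorder that is not maintained on later writes (the table and index names are made up):

    -- rewrite the heap in index order once; subsequent inserts/updates won't keep that order
    CLUSTER orders USING orders_pkey;
    ANALYZE orders;  -- refresh planner statistics after the rewrite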
PostgreSQL only has heaps for tables, and various other data structures for indices, with b-tree being the default for indices.
There are many cases where having b-trees for tables would make performance better.
GP is basically poking fun at PG for still having nothing but heaps for tables.
Having the ability to use something LSM-like as a storage engine would be great - and potentially allow better compression than what we currently get with TOAST - which is not a lot... PG doesn't even have out-of-the-box page compression...
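To be fair, since v14 there is a per-column choice of TOAST compression algorithm, but it only applies to large values, not whole pages (the names here are made up):

    -- use lz4 instead of the default pglz for one wide column
    ALTER TABLE events ALTER COLUMN payload SET COMPRESSION lz4;
    -- or change the default for newly created columns
    SET default_toast_compression = 'lz4';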
I always wonder how many issues "arbitrary" cloud limits on things like this end up causing. I'm sure that async IO is very helpful anyway, but I bet on a 1 million IOPS NVMe it is nowhere near as important.
We're effectively optimising critical infrastructure tech for ~2010 hardware, because that's when big cloud got going and there have been so few price reductions since then relative to the underlying hardware costs.
Obviously a consumer NVMe is not "enterprise", but my point is that we're 3+ orders of magnitude apart in performance per dollar between cheap consumer hardware and very expensive "enterprise" AWS/big cloud offerings.
It's worth noting that even on really fast local NVMe drives the new asynchronous I/O work delivers performance benefits, since it's so much more efficient at issuing I/Os and reducing syscall overhead (for io_uring).
Andres Freund (one of the principal authors of the new functionality) did a lot of benchmarking on local NVMe drives during development. Here is one mailinglist thread I could find that shows a 2x and better benefit with the patch set at the time: https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboi...
In the face of sustained writes? For how long?
I would assume, and it might be a poor assumption, that NVMe controllers don't pull in files, but rather blocks, so even if you had a database that exceeded cache size, in theory if the active blocks of that database did not exceed cache size, it could be "indefinitely" cached for a read-only pattern.
I don't think sustained reads are a problem? Benchmarks like CrystalDiskMark do a full-disk random read test; they're designed to bust through the cache afaik. 7.2 GB/s of 4K reads would translate to 1.8M IOPS. Even if this is massively optimistic, you need to slash a lot of zeroes/orders of magnitude to get down to 20k IOPS, which you will also pay >$100/mo for.
https://azure.microsoft.com/en-us/pricing/details/managed-di...
Increasing vCPUs also opened up more disk slots, to try to improve the situation with disk striping.
Which means you can count on it about as much as a server of your own that you couldn't repair.
I know a database company that uses instance storage as the primary storage. It’s common.
The per-disk performance is still nothing to write home about, and 8 actually fast disks would blow this instance type out of the water.
But consider the counterfactual: Non-realized customers because AWS certified solutions architect(tm) software couldn't deliver the price/perf they would have needed.
At $work this is a very real problem because a software system was built on api gateway, lambdas, sqs and a whole bunch of other moving pieces (serverless! scalable! easy compliance!) that combined resulted in way too much latency to meet a client's goal.
Actually, AWS is so expensive, hosting everything we ran on Hetzner there would have simply depleted our funding, and the company would not exist anymore.
The IOPS figure usually hides the fact that it is not a single I/O that is really fast, but a collection of them.
More IOPS is generally best achieved by reducing the latency of a single operation, and that average latency is what actually contributes to the "fast query" experience, because the next I/O often branches off the result of the last one (like an index or filter lookup).
As more and more disk-to-CPU connectivity goes over the network, we can deliver large IOPS numbers even at very high latencies (by spreading the data across hundreds of SSDs and routing it fast), because with network storage we pay a huge latency cost for durability, simply due to location diversification.
Every foot is a nanosecond, approximately.
That the tradeoff is worth it, because you don't need clusters to deal with a bad CPU or two. Stop & start, to fix memory/cpu errors.
The AWS model pushes the latency problem to the customer and we see it in the IOPS measurements, but it is really the latency x queue depth we're seeing not the hardware capacity.
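Rough arithmetic: IOPS ≈ queue depth / latency, so at ~1 ms of network-storage latency you need a queue depth of ~64 just to reach 64k IOPS, while a local NVMe at ~100 µs gets there with a queue depth of ~6 - and a chain of dependent reads only ever experiences the latency, never the headline IOPS number.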
This also applies to services, not only compute. With anything associated with Microsoft Office 365 Exchange, scripts may run 10x slower against the cloud using the MSOnline cmdlets. It's absolute insanity: I used to perform a dump of all mailbox statistics that would take about one hour; against Office 365 it could take almost 24 hours. You have to be careful not to use the same app or service account in multiple places, because the throttle limits are per-account.
You pay 60 times the price for the privilege of being on AWS.
Bandwidth is just their most egregious price difference. The servers are more expensive too. The storage is more expensive (except for Glacier). The serverless platforms are mostly more expensive than using a cheap server.
There are only two AWS products that I understand to have good prices: S3 Glacier (and only if you never restore!), and serverless apps (Lambda / API Gateway) if your traffic is low enough to fit in the Always Free tier. For everything else, it appears you get ripped off by using AWS.
Looking forward to the other places that async I/O can be used in future Postgres releases now that the groundwork is done.
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
https://www.theregister.com/2025/04/29/linux_io_uring_securi...
But most of the "off by default" issues are from ~2023 and not a current concern.
One should evaluate the risk according to their specific use case.
It can be a good idea to disable it if you run untrusted workloads (e.g. other people's containers sharing the same kernel), but if you have a kernel on a machine (virtual or real) dedicated to your own workload you can pretty much keep using io_uring. There are other technologies to enforce security (e.g. SELinux and similar).
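For the untrusted-workload case, newer kernels also have a dedicated switch (a sketch; needs Linux 6.6+):

    # 0 = allowed, 1 = restricted to privileged processes, 2 = disabled entirely
    sysctl -w kernel.io_uring_disabled=2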
But in the case of io_uring, it was outright bypassing other security layers. And while we all like to think we're running trusted services/code, we have to think about supply-chain attacks that may surprise us, or zero days, etc.
Is FreeBSD doing anything significantly different and/or better?
Linux aio (the POSIX one, not the broader concept that now includes io_uring) has been fundamentally bad. Part of the problem is that some of the implementation is delegated to user space in glibc as worker threads. https://lwn.net/Articles/671649/ surveys the issues. I have not done a deep dive into this topic in a long time, but as far as I know the situation never greatly improved. io_uring does not suffer from these problems, although it is a new and non-standard API with the associated pros and cons.
Thomas Munro has gone into some of the benefits and also gaps of FreeBSD's aio(4) vs io_uring here https://wiki.postgresql.org/wiki/FreeBSD/AIO. Notably, because the implementation is in kernel and has received continuous improvement it is gradually removing downsides and there are several users that need it to work well. This document undersells the problems of ZFS though: the ARC is a necromanced buffer cache which Sun ironically worked very hard to remove (http://mcvoy.com/lm/papers/SunOS.vm_arch.pdf) and nobody has fixed this in 20 years. But for UFS or raw block devs or vendor file systems that doesn't matter.
FreeBSD being a complete "src" tree yields some advantages. Concretely, there are some in-tree consumers like ctld (a CAM/iSCSI server) that have been used as vehicles to provide end-to-end implementations of things like NIC offload of complex protocols that play well with backend block devices, such that you can make an offloaded data path that is pumped via asynchronous completion events on both sides (network and block). A related whitepaper: https://www.chelsio.com/wp-content/uploads/resources/T6-100G... but this concept can be extended to iSCSI, NVMeOF etc. It seems NVMeOF work sponsored by Chelsio is yielding a lot of the finishing touches https://papers.freebsd.org/2023/eurobsdcon/baldwin-implement.... I believe my colleagues are also improving aio to further optimize the Netflix case of KTLS NIC with data on disk, but I am not the right person to elaborate on that.
nu11ptr•14h ago
I know Windows has IOCP and also now an IORing implementation of its own (Less familiar with macOS capabilities other than POSIX AIO).
https://learn.microsoft.com/en-us/windows/win32/api/ioringap...
Update: Most of the comments below seem to be missing the fact that Windows now also has an IORing implementation, as I mentioned above. Comparison article here:
https://windows-internals.com/ioring-vs-io_uring-a-compariso...
p_ing•13h ago
Windows even had the concept of io_uring before, but network-only, with Registered I/O back in the Windows 8 (8.1?) days.
Linux still lacks the "all I/O is async" model NT has.
The underlying kernel and executive of Windows aren't primitive pieces of trash. They're quite advanced, ruined by PMs and the Ads division throwing crap on top.
And yes, Windows' I/O Ring is a near 1:1 copy of the Linux implementation, but IOCP/OVERLAPPED I/O preceded it, going back to NT's first release.
This isn't a pissing match, just we all hope that kernel devs learn from each other and take the best ideas. Sometimes we, IT, don't get to choose the OS we run -- it's dictated by the apps the business requires.
p_ing•13h ago
Win32 is also responsible for core Services, which means you can't de-Windows-ify Windows and strip it down to an NT API-only. All other personalities (OS/2, POSIX, SFU) have a dependency on Win32, as well.
You're still running the WindowServer of course; it's part of the Executive.
That said, with a bunch of modifications, NTDEV did get Windows 11 down to its bare minimum, and text-only to boot. So I guess it's technically possible, though not useful.
https://www.youtube.com/watch?v=SL6t_iuitxM
smileybarry•12h ago
Yep, they've replaced nearly every UI with text (the login window is a TUI), though there are still some shell DLLs and the whole thing still uses a window manager. That's honestly for the best, since it allows you to migrate full installations with some UI-based apps to a Core installation with them intact.
> That said, with a bunch of modifications, NTDEV did get Windows 11 down to it's bare minimum, and text only to boot. So I guess it's technically possible, though not useful.
Windows has had a "text mode" since at least Windows XP IIRC, but it's really not that useful, if at all. Even for rescue operations you're better off with Windows PE.
cyberax•12h ago
Windows does OOM far better than Linux because it doesn't really overcommit RAM.
But the CPU _scheduler_ in Linux is far, far, far better than in Windows. Linux can even do hard-realtime, after all.
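(Linux's overcommit is at least tunable; the Postgres docs suggest strict accounting on dedicated DB hosts - a sketch, the ratio is illustrative:)

    # fail allocations up front instead of overcommitting and OOM-killing later
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=90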
nu11ptr•13h ago
And now also a Windows feature, see my comment above for info
lfittl•14h ago
The "worker" method uses a dedicated pool of I/O worker processes that run in the background, and whilst not as performant as io_uring in our benchmark, did clearly outperform the "sync" method (which is the same as what Postgres currently has in 17 and older).
nu11ptr•13h ago
Windows now also has IORing (see my comment above)
macdice•3h ago
Main learnings: the IOCP version can't do asynchronous flush! Which we want. The IoRing version can! But it can't do scatter/gather AKA vector I/O yet! Which is an essential feature for buffer pool implementation. So actually I am basically waiting for IoRing to add support for that before taking it too seriously (I can see they are working on it because the ops are present in an enum, it's just that the build functions are missing).
So my guess is that in a year or so we should be able to run all PostgreSQL disk I/O through IoRing on Windows. Maybe?
Another complication is that it really wants to be multithreaded (consuming completions for I/Os started in another process requires a lot of hoop-jumping; I made it work, but...). This will resolve itself naturally with the ongoing work to make PostgreSQL multithreaded.
The next problem is that IoRing doesn't support sockets! So in future work on async networking (early prototypes exist) we will likely also need IOCP for that part.