Speeding up PostgreSQL dump/restore snapshots

https://xata.io/blog/behind-the-scenes-speeding-up-pgstream-snapshots-for-postgresql

93•tudorg•10h ago

Comments

hadlock•6h ago

One thing that's sorely needed in the official documentation is a "best practice" for backup/restore from "cold and dark" where you lose your main db in a fire and are now restoring from offsite backups for business continuity. Particularly in the 100-2TB range where probably most businesses lie, and backup/restore can take anywhere from 6 to 72 hours, often in less than ideal conditions. Like many things with SQL there are many ways to do it, but an official roadmap for order of operations would be very useful for backup/restore of roles/permissions, schema, etc. You will figure it out eventually, but in my experience the dev and prod db size delta is so large that many things that "just work" at the sub-1GB scale really trip you up over 200-500GB. Finding out you did one step out of order (manually, or via a badly written script) halfway through the restore process can mean hours and hours of rework. Heaven help you if you didn't start a screen session on your EC2 instance when you logged in.
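One possible order of operations for such a cold restore, sketched in Python under stated assumptions: the offsite backup consists of a `globals.sql` file produced by `pg_dumpall --globals-only` (role definitions are cluster-wide and are not part of a `pg_dump` archive) plus a custom-format `pg_dump` archive. The file names, database name, and job count are hypothetical. The sketch only builds the command lines, so the plan can be reviewed before anything is run.

```python
from typing import List


def cold_restore_plan(
    globals_sql: str, archive: str, dbname: str, jobs: int = 4
) -> List[List[str]]:
    """Return the restore commands in the order they would be executed."""
    restore = ["pg_restore", "--exit-on-error", f"--dbname={dbname}"]
    return [
        # 1. Roles first, so later ownership and grants have roles to refer to.
        ["psql", "--dbname=postgres", f"--file={globals_sql}"],
        # 2. Empty target database.
        ["createdb", dbname],
        # 3. pre-data: table definitions, types, functions.
        restore + ["--section=pre-data", archive],
        # 4. data: table contents, loaded with parallel jobs.
        restore + [f"--jobs={jobs}", "--section=data", archive],
        # 5. post-data: indexes, constraints, triggers.
        restore + [f"--jobs={jobs}", "--section=post-data", archive],
    ]
```

Printing each entry with `shlex.join` gives a checklist that can be pasted into a runbook, which is one way to avoid discovering an out-of-order step halfway through.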

forinti•6h ago

If you can have a secondary database (at another site or on the cloud) being updated with streaming replication, you can switch over very quickly and with little fuss.

SoftTalker•6h ago

Which is what you must do if minimizing downtime is critical.

And, of course, your disaster recovery plan is incomplete until you've tested it (at scale). You don't want to be looking up Postgres documentation when you need to restore from a cold backup, you want to be following the checklist you have in your recovery plan and already verified.

zie•2h ago

Sure, but there are lots of failure modes where the failure goes with the streaming replication and all instances are trashed.

bityard•10m ago

There needs to be a DBA version of the saying, "RAID is not a backup"

nijave•5h ago

Ideally you have an off-site replica you fail over to, so you don't need to restore.

pg_restore will handle roles, indexes, etc., assuming you didn't switch the flags around to disable them.

If you're on EC2, hopefully you're using disk snapshots and WAL archiving.

pgwhalen•4h ago

Of course that’s preferable, but OP is specifically asking about the cold restore case, which tends to pose different problems, and is just as important to maintain and test.

Arbortheus•4h ago

Offsite replica is only applicable if the cause is a failure of the primary. What if I’m restoring a backup because someone accidentally dropped the wrong table?

ants_everywhere•3h ago

I would hope dropping a table on a production database is something that is code reviewed

benreesman•3h ago

nah, on a long enough timeline everything will go wrong. blaming the person who managed to drop the table finally is dumb: if you can't fix literally everything that could happen to it, it's not done.

anonymars•3h ago

Isn't the entirety of disaster recovery about situations that aren't supposed to happen?

High availability is different from disaster recovery

WJW•4h ago

> in the 100-2TB range where probably most businesses lie

Assuming you mean that range to start at 100GB, I've worked with databases that size multiple times but as a freelancer it's definitely not been "most" businesses in that range.

zie•2h ago

What we do is automated restores. We have an _hourly and a _daily restore that just happen via shell script.

We encourage staff to play with both, and they can play with impunity since it's a copy that will get replaced soon-ish.

This makes it important that both work reliably, which means we know when our backups stop working.

We haven't had a disaster recovery situation yet (hopefully never), but I feel fairly confident that getting the DB back shouldn't be a big deal.
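A minimal sketch of that kind of scheduled refresh, written in Python rather than shell. Assumptions: the scratch copies are named `<db>_hourly` and `<db>_daily` (only the suffixes come from the comment above), and the backup is a custom-format `pg_dump` archive at a hypothetical path. This is not the commenter's actual script.

```python
import subprocess
from typing import List


def refresh_commands(base_db: str, suffix: str, archive: str) -> List[List[str]]:
    """Build the commands that replace a scratch copy with a fresh restore."""
    if suffix not in ("_hourly", "_daily"):
        raise ValueError(f"unexpected suffix: {suffix!r}")
    target = base_db + suffix
    return [
        # Throw away the previous copy; staff edits are meant to be disposable.
        ["dropdb", "--if-exists", target],
        ["createdb", target],
        # --no-owner keeps the restore independent of the original role names.
        ["pg_restore", "--exit-on-error", "--no-owner", f"--dbname={target}", archive],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command; a non-zero exit raises CalledProcessError."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Run from cron, a failure in `run_all` is the signal that the backups have stopped working, which is the point of the exercise.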

moribunda•6h ago

While these optimizations are solid improvements, I was hoping to see more advanced techniques beyond the standard bulk insert and deferred constraint patterns. These are well-established PostgreSQL best practices - would love to see how pgstream handles more complex scenarios like parallel workers with partition-aware loading, or custom compression strategies for specific data types.

bitbasher•6h ago

pg_bulkload[1] has saved me so much time cold restoring large (1+ TB) databases. It went from 24-72 hours to an hour or two.

I also recommend pg_repack[2] to squash tables on a live system and reclaim disk space. It has saved me so much space.

1: https://ossc-db.github.io/pg_bulkload/pg_bulkload.html

2: https://github.com/reorg/pg_repack

itsthecourier•4h ago

I'm just checking it now

do you export the data with this and then import it in the other db with it?

or do you work with existing postgres backups?

bitbasher•16m ago

There’s a number of options. I mainly work with gzipped CSV dumps that I need to restore.
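For gzipped CSV dumps like these, decompressing on the fly avoids writing the uncompressed file to disk first. A minimal sketch, assuming the data is handed to some `COPY ... FROM STDIN` writer: the `sink` callable is hypothetical and stands in for whatever loader is used. It is not pg_bulkload's interface.

```python
import gzip
from typing import BinaryIO, Callable


def stream_gzipped_csv(
    src: BinaryIO, sink: Callable[[bytes], None], chunk_size: int = 1 << 20
) -> int:
    """Decompress src chunk by chunk, pass each chunk to sink, return bytes sent."""
    total = 0
    with gzip.GzipFile(fileobj=src, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:  # empty read means end of the compressed stream
                break
            sink(chunk)
            total += len(chunk)
    return total
```

Memory use stays bounded by `chunk_size` regardless of how large the dump is.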

jpalawaga•6h ago

Postgres backups are tricky for sure. Even if you have a DR plan you should assume your incremental backups are no good and you need to restore the whole thing from scratch. That’s your real DR SLA.

If things go truly south, just hope you have a read replica you can use as your new master. Most SLAs are not written with 72h+ of downtime. Have you tried the nuclear recovery plan, from scratch? Does it work?
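A rough way to sanity-check that from-scratch figure is to divide the database size by the sustained end-to-end restore throughput. The numbers below are purely hypothetical (2 TB at 50 MB/s); the throughput to plug in is the one measured during a real test restore.

```python
def restore_hours(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Estimate wall-clock restore time in hours from size and throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_bytes / throughput_bytes_per_s / 3600


hours = restore_hours(2e12, 50e6)  # 2 TB at 50 MB/s
print(f"{hours:.1f} h")  # prints "11.1 h"
```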

inslee1•33m ago

Slightly related but how does WAL-G stack up as far as backup/restoration options go for Postgres? https://github.com/wal-g/wal-g
