We recovered from nightmare Postgres corruption on the matrix.org homeserver

https://matrix.org/blog/2025/07/postgres-corruption-postmortem/

18•Arathorn•7h ago

Comments

fowl2•5h ago

Seems like there are a few places where Postgres could benefit from more consistency checks.

Arathorn•4h ago

We could have run with data checksums (https://www.postgresql.org/docs/current/app-pgchecksums.html) turned on, but that slows things down a bunch, and turning them on after the fact would have taken days. Also, it's not clear they would have caught whatever the underlying corruption was here…

anarazel•2h ago

Easier said than done in this case. Actually effective crosschecks preventing this issue from occurring would entail rather massive I/O and CPU amplification in common operations.

anarazel•2h ago

A few questions:

- Are you using pg_repack? I'm fairly sure its logic has some holes - last time I checked its bug tracker listed potential for data corruption that could cause issues like this.

- Have you done OS upgrades? Did affected indexes have any columns affected by collations?

- Have you done analysis on the heap page? E.g. is there any valid data on the page? What is the page's LSN compared to the LSN on index pages pointing to non-existing tuples on the page?
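
For the heap page question, here is a rough sketch of reading the LSN out of a raw page image so it can be compared with an index page. It assumes the standard 24-byte page header layout, a little-endian machine, and a page already dumped to bytes; the function names are made up for illustration. Inside the database, the pageinspect extension's page_header(get_raw_page(...)) reports the same fields.

```python
import struct

# Assumed layout of the 24-byte PostgreSQL page header on a little-endian host:
# pd_lsn (two uint32 halves), pd_checksum, pd_flags, pd_lower, pd_upper,
# pd_special, pd_pagesize_version (all uint16), pd_prune_xid (uint32).
PAGE_HEADER = struct.Struct("<IIHHHHHHI")


def parse_page_header(page: bytes) -> dict:
    """Decode the header fields of one raw page image (hypothetical helper)."""
    (xlogid, xrecoff, checksum, flags, lower, upper,
     special, pagesize_version, prune_xid) = PAGE_HEADER.unpack_from(page)
    return {
        "lsn": (xlogid << 32) | xrecoff,          # 64-bit LSN for comparisons
        "lsn_text": f"{xlogid:X}/{xrecoff:X}",    # conventional hi/lo hex form
        "checksum": checksum,
        "flags": flags,
        "lower": lower,
        "upper": upper,
        "special": special,
        "page_size": pagesize_version & 0xFF00,
        "layout_version": pagesize_version & 0x00FF,
        "prune_xid": prune_xid,
    }


def page_is_zeroed(page: bytes) -> bool:
    """True if the page contains nothing but zero bytes."""
    return page == bytes(len(page))


def heap_lsn_behind_index(heap_page: bytes, index_page: bytes) -> bool:
    """True if the heap page was last modified earlier in the WAL than the index page."""
    return parse_page_header(heap_page)["lsn"] < parse_page_header(index_page)["lsn"]
```

A heap page whose LSN is behind that of the index pages pointing at it is only a hint, not proof of anything; it just narrows down which write to go looking for.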

dap•10m ago

The post appears to conclude that this must be a hardware issue because they have no explanation and PostgreSQL and the kernel are too reliable to have data corruption bugs. I've seen data corruption bugs in both databases and the kernel (as well as CPUs, for that matter), so I'm pretty skeptical of that explanation.

When something "can't happen" in your program, it makes sense to look at the layers below. Unfortunately, this often goes one of two ways: you ask people for help and they tell you that it's never one of the layers below ("it's never a compiler bug") or you stop at the conclusion "well, I guess the layer below [kernel/TCP/database/etc.] gave us corrupted data". The conclusion in this post kind of does both of these things. Of course, sometimes it _is_ a bug in one of those layers. But stopping there is no good either, especially when the application itself is non-trivial and you have no evidence that a lower layer is at fault.

People often treat a hypothesis like "the disk corrupted the data" as unfalsifiable. After the fact, that might be true, given the stack you're using. But that doesn't have to be the case. If you ran into a problem like this on ZFS, for example, you'd have very high confidence about whether the disk was at fault (because it can reliably detect when the disk returns data different from what ZFS wrote to it). I realize a lot goes into choosing a storage stack and maybe ZFS doesn't make sense for them. But if the hypothesis is that such a severe issue resulted from a hardware/firmware failure, I'd look pretty hard at deploying a stack that can reliably identify such failures. At the very least, if you see this again, you'll either know for sure it was the disk or you'll have high confidence that there's a software bug lurking elsewhere. Then you can add similar kinds of verification at different layers of the stack to narrow down the problem. In an ideal world, all the software should be able to help exonerate itself.
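
To make the verification idea concrete, here is a toy sketch of the principle: keep a checksum for every block somewhere other than the block itself, and check it on every read. This is not how ZFS is implemented; the class and its dictionaries are invented for illustration, with one dictionary standing in for the disk and the other for separately stored metadata.

```python
import hashlib


class CorruptBlock(Exception):
    """Raised when a block read back does not match what was written."""


class ChecksummedStore:
    """Toy block store that keeps checksums apart from the data they cover."""

    def __init__(self):
        self.blocks = {}     # block_id -> bytes (stands in for the disk)
        self.checksums = {}  # block_id -> digest (stands in for separate metadata)

    def write(self, block_id, data: bytes) -> None:
        self.blocks[block_id] = bytes(data)
        self.checksums[block_id] = hashlib.sha256(data).digest()

    def read(self, block_id) -> bytes:
        data = self.blocks[block_id]
        # If the "disk" hands back anything other than what was written,
        # the digest stored elsewhere will not match.
        if hashlib.sha256(data).digest() != self.checksums[block_id]:
            raise CorruptBlock(f"checksum mismatch on block {block_id!r}")
        return data
```

With something like this at the bottom of the stack, a read that returns different bytes than were written fails loudly at that layer, instead of surfacing later as an index entry pointing at a tuple that isn't there.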
