- Are you using pg_repack? I'm fairly sure its logic has some holes. Last time I checked, its bug tracker listed potential data-corruption bugs that could cause issues like this.
- Have you done any OS upgrades? Did any of the affected indexes include columns whose ordering depends on a collation (e.g. text columns)?
- Have you done analysis on the heap page? E.g. is there any valid data on the page? What is the page's LSN compared to the LSN on index pages pointing to non-existing tuples on the page?
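
For what it's worth, here is a rough sketch of the LSN comparison asked about above. It assumes you already have the raw 8 kB pages as bytes (for example from pageinspect's `get_raw_page`, or read straight from the relation file), that the machine is little-endian, and that the page uses the standard 24-byte header from `PageHeaderData`. The helper names are made up for illustration; pageinspect's `page_header()` gives you the same fields in SQL.

```python
import struct
from typing import NamedTuple

# Standard PostgreSQL page header (24 bytes). Little-endian is an assumption.
# Fields: pd_lsn (two uint32s), pd_checksum, pd_flags, pd_lower, pd_upper,
# pd_special, pd_pagesize_version (all uint16), pd_prune_xid (uint32).
PAGE_HEADER = struct.Struct("<IIHHHHHHI")
LINE_POINTER_SIZE = 4  # each line pointer (ItemIdData) is 4 bytes


class PageHeader(NamedTuple):
    lsn: int
    checksum: int
    lower: int
    upper: int
    page_size: int
    layout_version: int


def parse_page_header(page: bytes) -> PageHeader:
    """Decode the fixed-size header at the start of a raw page."""
    (xlogid, xrecoff, checksum, _flags, lower, upper,
     _special, size_ver, _prune_xid) = PAGE_HEADER.unpack_from(page)
    return PageHeader(
        lsn=(xlogid << 32) | xrecoff,
        checksum=checksum,
        lower=lower,
        upper=upper,
        page_size=size_ver & 0xFF00,
        layout_version=size_ver & 0x00FF,
    )


def format_lsn(lsn: int) -> str:
    """Render an LSN the way PostgreSQL prints it, e.g. '1/2A'."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def line_pointer_count(header: PageHeader) -> int:
    """Number of line pointers; 0 means the page holds no tuples at all."""
    return max(0, (header.lower - PAGE_HEADER.size) // LINE_POINTER_SIZE)


def heap_older_than_index(heap_page: bytes, index_page: bytes) -> bool:
    """True if the heap page was last modified before the index page."""
    return parse_page_header(heap_page).lsn < parse_page_header(index_page).lsn
```

A heap page whose LSN is well behind the LSNs of the index pages pointing into it would at least be consistent with the heap write having been lost; on its own it doesn't prove anything.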
When something "can't happen" in your program, it makes sense to look at the layers below. Unfortunately, this often goes one of two ways: you ask people for help and they tell you that it's never one of the layers below ("it's never a compiler bug"), or you stop at the conclusion "well, I guess the layer below [kernel/TCP/database/etc.] gave us corrupted data". The conclusion in this post kind of does both. Of course, sometimes it _is_ a bug in one of those layers. But stopping there is no good either, especially when the application itself is non-trivial and you have no evidence that a lower layer is at fault.
People often treat a hypothesis like "the disk corrupted the data" as unfalsifiable. After the fact, that might be true, given the stack you're using. But that doesn't have to be the case. If you ran into a problem like this on ZFS, for example, you'd have very high confidence about whether the disk was at fault (because ZFS can reliably detect when the disk returns data different from what ZFS wrote to it). I realize a lot goes into choosing a storage stack and maybe ZFS doesn't make sense for them. But if the hypothesis is that such a severe issue resulted from a hardware/firmware failure, I'd look pretty hard at deploying a stack that can reliably identify such failures. At the very least, if you see this again, you'll either know for sure it was the disk or you'll have high confidence that there's a software bug lurking elsewhere. Then you can add similar kinds of verification at different layers of the stack to narrow down the problem. In an ideal world, every piece of software in the stack should be able to help exonerate itself.
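
To make the "verification at different layers" idea concrete, here is a toy sketch of end-to-end checksumming: the checksum travels with the reference to a block rather than sitting next to the data, so a read that returns the wrong bytes gets caught instead of trusted. This is only an illustration of the general technique, not how ZFS or PostgreSQL implement it; `ChecksummedStore` and `CorruptBlockError` are hypothetical names, and the dict stands in for a disk.

```python
import hashlib
from typing import Dict, Tuple


class CorruptBlockError(Exception):
    """Raised when a block's contents don't match the checksum in its pointer."""


class ChecksummedStore:
    def __init__(self) -> None:
        self._blocks: Dict[int, bytes] = {}  # stand-in for the disk
        self._next_id = 0

    def write(self, data: bytes) -> Tuple[int, bytes]:
        """Store a block and return a pointer: (block id, SHA-256 of the data)."""
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = data
        return block_id, hashlib.sha256(data).digest()

    def read(self, pointer: Tuple[int, bytes]) -> bytes:
        """Return the block, or raise if the 'disk' handed back different bytes."""
        block_id, expected = pointer
        data = self._blocks[block_id]
        if hashlib.sha256(data).digest() != expected:
            raise CorruptBlockError(f"block {block_id} failed verification")
        return data


store = ChecksummedStore()
ptr = store.write(b"tuple data")
assert store.read(ptr) == b"tuple data"

# Simulate the disk silently returning different bytes for the same block.
store._blocks[ptr[0]] = b"tuple dat\x00"
try:
    store.read(ptr)
except CorruptBlockError as err:
    print("detected:", err)
```

With something like this at the storage layer, a failed check points at the layer below, and a passing check tells you to keep looking above it.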
fowl2•5h ago
Arathorn•4h ago
anarazel•2h ago