Well, this is from Kyle Kingsbury, the Chuck Norris of transactional guarantees. AWS has to reply or clarify, even if it only seems to apply to Multi-AZ clusters. Those are one of the two possibilities for RDS with Postgres: Multi-AZ deployments can have one standby or two standby DB instances, and this finding is for the two-standby configuration. [1]
They make no such promises in their documentation. Their 5,494-page manual on RDS hardly mentions isolation or serializable, except in the documentation of parameters for the different engines.
Nothing on global read consistency for Multi-AZ clusters, because why should they... :-) They talk about semi-synchronous replication, so the writer waits for one standby to confirm the log record, but the two readers can be on different snapshots? [2]
[1] - "New Amazon RDS for MySQL & PostgreSQL Multi-AZ Deployment Option: Improved Write Performance & Faster Failover" - https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-db-clus...
[2] - "Amazon RDS Multi-AZ with two readable standbys: Under the hood" - https://aws.amazon.com/blogs/database/amazon-rds-multi-az-wi...
Well, as a user, I wish they would mention it though. If I migrate to RDS with multi-AZ after coming from plain Postgres (which documents snapshot isolation as a feature), I would probably want to know how the two differ.
> Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation
The way I've threaded this needle, after several frustrating attempts, is to have a policy of titling all reports "Jepsen: <system> <version>". HN is of course welcome to choose their own link text if they prefer a more descriptive, or colorful, phrase. :-)
The fact that the thread is high on HN, plus the GP comment is high in the thread, plus that the audience knows how interesting Jepsen reports get, should be enough to convey the needful.
The Jepsen writeups will surely stand the test of time. Thank you!
You may not be part of that world now, but you can be some day.
EDIT: forgot to say, I had to read 6 or 7 books on Bayesian statistics before I understood the most basic concepts. A few years later I wrote a compiler for a statistical programming language.
I somewhat feel that there was a generation that had it easier, because they were pioneers in a new field, allowing them to become experts quickly, while improving year-on-year, being paid well in the process, and having great network and exposure.
Of course, it can be done, but we should at least acknowledge that sometimes the industry is unforgiving and simply doesn't have on-ramps except for the privileged few.
I don't think so. I've been doing this for nearly 35 years now, and there's always been a lot to learn. Each layer of abstraction developed makes it easier to quickly iterate towards a new outcome faster or with more confidence, but hides away complexity that you might eventually need to know. In a lot of ways it's easier these days, because there's so much information available at your fingertips when you need it, presented in a multitude of different formats. I learned my first programming language by reading a QBasic textbook trying to debug a text-based adventure game that crashed at a critical moment. I had no Internet, no BBS, nobody to help, except my Dad who was a solo RPG programmer who had learned on the job after being promoted from sweeping floors in a warehouse.
The kids might not know this means "IBM mainframe" rather than "role playing game" :)
Essentially: The configuration claims "Snapshot Isolation", which means every transaction looks like it operates on a consistent snapshot of the entire database taken at its starting timestamp. All transactions starting after a transaction commits will see the changes made by that transaction. Jepsen finds that the snapshot a transaction sees doesn't always contain everything that was committed before its starting timestamp. Transactions A and B can both commit their changes, then transactions C and D can start, with C only seeing the change made by A and D only seeing the change made by B.
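To make that concrete, here's a minimal SQL sketch (the table kv and the keys x/y are made up; the point is only the shape of the histories):

CREATE TABLE kv (k text PRIMARY KEY, v int NOT NULL DEFAULT 0);
INSERT INTO kv VALUES ('x', 0), ('y', 0);

-- Transaction A: commits a write to x
BEGIN; UPDATE kv SET v = 1 WHERE k = 'x'; COMMIT;
-- Transaction B: commits a write to y
BEGIN; UPDATE kv SET v = 1 WHERE k = 'y'; COMMIT;

-- Transaction C, started after both commits, observes x = 1 but y = 0
BEGIN; SELECT k, v FROM kv; COMMIT;
-- Transaction D, started after both commits, observes y = 1 but x = 0
BEGIN; SELECT k, v FROM kv; COMMIT;

Under snapshot isolation, C and D must each see a snapshot drawn from a single commit order of A and B, so they cannot disagree about which of the two writes came first; observing both partial states is the "long fork" the report describes.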
Why? Because it has variables and a graph?
What sort of education background do you have?
I’ve repeatedly used ChatGPT and Claude to help me understand papers and to cut through the verbiage to the underlying concepts.
For this particular one, the graph under "Results" is the most approachable portion, I think. (Don't skip the top two sections, though; they're short.) In the graph, each line is a transaction; read them left to right.
Hopefully I get this right, though if I do not, I'm sure someone will correct me. Our database is a set of ordered lists of integers. Something like,
CREATE TABLE test (
id int primary key,
-- (but specifically, this next column holds a list of ints, e.g.,
-- a value might be, '1,8,11'; the list of ints is a comma separated
-- string.)
list text not null
);
The first transaction: a 89 9
This is shorthand; it means "(a)ppend to list #89 the integer 9" (in SQL, crudely this is perhaps something like UPDATE test SET list = CONCAT(list, ',9') WHERE id = 89;
… though we'd need to handle the case where the list doesn't exist yet, turning it into an `INSERT … ON CONFLICT … DO UPDATE …`, so it would get gnarlier; a sketch follows below.[2]); the next: r 90 nil # read list 90; the result is nil
r 89 [4 9] # read list 89; the result is [4, 9]
r 90 nil # read list 90; the result is (still) nil
I assume you can `SELECT` ;) That should provide sufficient syntax for one to understand the remainder. The arrows indicate the dependencies; if you click "read-write dependencies"[1], that page explains them.
Our first transaction appends 9 to list 89. Our second transaction reads that same list and sees that same 9; thus, it must start after the first transaction has committed. The remaining arrows form similar dependencies, and once you take them all into account, they form a cycle, which should feel problematic. Snapshot isolation does not permit such a cycle, so we've observed a contradiction: this system cannot be obeying snapshot isolation. (This is what "To understand why this cycle is illegal…" gets at; it is fairly straightforward. T₁ is the first row in the graph, T₂ the second, and so forth. But it is only straightforward once you've understood the graph, I think.)
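For the curious, the gnarlier upsert alluded to above might look roughly like this; purely a sketch, not necessarily the exact query the test harness uses (see [2] for the real thing):

INSERT INTO test (id, list)
VALUES (89, '9')
ON CONFLICT (id)
DO UPDATE SET list = test.list || ',' || EXCLUDED.list;
-- i.e., create list #89 as '9' if it doesn't exist yet, otherwise append ',9'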
> This is in such a thick academic style that it is difficult to follow what the problem actually might be and how it would impact someone.
I think a lot of this is because it is written with precision, and that precision requires a lot of academic terminology.
Some of it is just syntax peculiar to Jepsen, which I think comes from Clojure, which I think most of us (myself included) are just not familiar with. Hence why I used SQL and comma-sep'd lists in my commentary above; that is likely more widely read. It's a bit rough when you first approach it, but once you get the notation, the payoff is worth it, I guess.
More generally, I think once you grasp the graph syntax & simple operations used here, it becomes easier to read other posts, since they're mostly graphs of transactions that, taken together, make no logical sense at all. Yet they happened!
> This style of writing serves mostly to remind me that I am not a part of the world that writes like this, which makes me a little sad.
I think Jepsen posts, with a little effort, are approachable. This post is a good starter post; normally I'd say Jepsen posts tend to inject faults into the system, as we're testing if the guarantees of the system hold up under stress. This one has no fault injection, though, so it's a bit simpler.
Beware though, that if you learn to read these, that you'll never trust a database again.
[1]: https://jepsen.io/consistency/dependencies
[2]: I think this is it? https://github.com/jepsen-io/postgres/blob/225203dd64ad5e5e4... — but this is pushing the limits of my own understanding.
I chuckled, but (while I don't have links to offer) I could have sworn that some of them actually passed, and a handful of others took the report to heart and fixed the bugs. I similarly recall a product showing up to their Show HN or Launch HN with a Jepsen report in hand; I was especially in awe of that level of maturity (assuming, of course, I'm not hallucinating such a thing).
Am I correct in understanding that AWS is either doing something with the cluster configuration or has added some patches that introduce this behavior?
Specifically here: https://youtu.be/fLqJXTOhUg4?t=434
I also understand there are lots of ways to do Postgres replication in general, with varying results. For instance, here's Bin Wang's report on Patroni: https://www.binwang.me/2024-12-02-PostgreSQL-High-Availabili...
AWS patched Postgres to replicate to two instances and to call it good if one of the two acknowledges the change. When this ack happens is not public information.
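For comparison, the closest built-in analogue in stock PostgreSQL is quorum-based synchronous replication, which also waits for any one of a set of standbys; this is purely illustrative, since whatever AWS actually runs may differ (the standby names here are made up):

-- commit waits until ANY one of the two named standbys confirms
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby1, standby2)';
SELECT pg_reload_conf();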
My personal opinion is that filesystem level replication (think drbd) is the better approach for PostgreSQL. I believe that this is what the old school AWS Multi-AZ instances do. But you get lower throughput and you can't read from the secondary instance.
That's basically what their Aurora variant does. It uses clustered/shared storage then uses traditional replication only for cache invalidation (so replicas know when data loaded into memory/cache has changed on the shared storage)
No, it isn't an issue with single-instance PostgreSQL clusters. Multi-instance PostgreSQL clusters (single primary, plus streaming/physical replicas) are affected.
What they, too, discovered is that PostgreSQL currently doesn't have consistent snapshot behaviour between the primary and replicas. Presumably, read-only transaction T2 was executed on a secondary (replica) node, while T1, T3, and T4 (all modifying transactions) were executed on the primary.
Some background:
Snapshots on secondary PostgreSQL nodes rely on transaction persistence order (the location of the commit record in the WAL) to determine which transactions are visible, while the visibility order on the primary is determined by when the backend that executed the transaction first got notice that the commit was complete (and then got around to marking the transaction as committed). On each side (primary and secondary) the commit order is consistent across the backends connected to that node, but the order may differ somewhat between the primary and the secondary.
There is some work ongoing to improve this, but that's still very much WIP.
My email is aphyr@jepsen.io, if you'd like to drop me a line. :-)
Core of the issue is that on the primary, commit inserts a WAL record, waits for durability, local and/or replicated, and then grabs a lock (ProcArrayLock) to mark itself as no longer running. Taking a snapshot takes that same lock and builds a list of running transactions. WAL insert and marking itself as visible can happen in different order. This causes an issue on the secondary where there is no idea of the apparent visibility order, so visibility order on secondary is strictly based on order of commit records in the WAL.
The obvious fix would be to make visibility happen in WAL order on the primary too. However, there is one feature that makes that complicated. Clients can change the desired durability on a transaction-by-transaction basis. The settings range from confirming the transaction immediately after it is inserted into the WAL stream, through waiting for local durability, all the way up to waiting for it to be visible on synchronous replicas. If visibility happens in WAL order, then an async transaction either has to wait on every higher-durability transaction that comes before it in the WAL stream, or give up on read-your-writes. That's basically where the discussion got stuck, without reaching consensus on which breakage to accept. This same problem is also the main blocker for adopting a logical (or physical) clock based snapshot mechanism.
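Concretely, the per-transaction durability knob is synchronous_commit, which any client can change for its own transactions; a quick sketch of the levels, roughly weakest to strongest:

SET synchronous_commit = off;           -- confirm before the commit record is flushed anywhere
SET synchronous_commit = local;         -- wait for the local WAL flush only
SET synchronous_commit = remote_write;  -- wait for a synchronous standby to receive/write the record
SET synchronous_commit = on;            -- wait for a synchronous standby to flush it to disk
SET synchronous_commit = remote_apply;  -- wait until it is replayed (visible) on synchronous standbys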
By now I'm partial to the option of giving up on read-your-writes, with an opt-in option to see non-durable transactions as an escape hatch for backwards compatibility. Re-purposing SQL read uncommitted isolation level for this sounds appealing, but I haven't checked if there is some language in the standard that would make that a bad idea.
A somewhat related idea is Eventual Durability, where write transactions become visible before they are durable, but read transactions wait for all observed transactions to be durable before committing.
Due to the way PostgreSQL does snapshotting, I don't believe this implies such a read might obtain a nonsense value due to only a portion of the bytes in a multi-byte column type having been updated yet.
It seems like a race condition that becomes eventually consistent. Or did anyone read this as if the later transaction(s) of a "long fork" might never complete under normal circumstances?
I was intuitively wondering the same, but I'm having trouble reasoning about how the post's example with transactions 1, 2, 3, 4 exhibits this behavior. In the example, is transaction 2 the only read-only transaction and therefore the only transaction to read from the read replica? I.e., transactions 1, 3, 4 use the primary and transaction 2 uses the read replica?
Since these happen sequentially, for a single update of `gps_coordinate` you would only expect to be able to observe one of:
1. Nothing updated yet, all columns have the previous value.
2. `gps_coordinate` updated, with `postal_code` and `city` still having the previous values.
3. `gps_coordinate` and `postal_code` updated with `city` still having the previous value.
4. All fields updated.
But the ordering that aphyr proved is possible allows you to see "impossible" states such as
1. `postal_code` updated with `gps_coordinate` and `city` still having the previous values.
2. `city` updated with `gps_coordinate` and `postal_code` still having the previous values.
Basically since these transactions happen in order and depend on one another you would expect that you can only see the "left to right" progression. But actually you can see some subset of the transactions applied even if that isn't a possible logical state of the database.
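A tiny sketch, with a hypothetical locations table and made-up values, just to make the progression concrete:

-- three sequential transactions; each commits before the next begins
BEGIN; UPDATE locations SET gps_coordinate = '52.52,13.40' WHERE id = 1; COMMIT;
BEGIN; UPDATE locations SET postal_code = '10115' WHERE id = 1; COMMIT;
BEGIN; UPDATE locations SET city = 'Berlin' WHERE id = 1; COMMIT;

-- a reader might nonetheless observe the new postal_code together with the
-- old gps_coordinate, a state that never existed on the writer
SELECT gps_coordinate, postal_code, city FROM locations WHERE id = 1;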
I was worried I had made the wrong move upgrading major versions, but it looks like this is not that. This is not a regression, but just a feature request or longstanding bug.
I understand the mission of the Jepsen project but presenting results in this format is misleading and will only sow confusion.
Transaction isolation involves a ton of tradeoffs, and the tradeoffs chosen here may be fine for most use cases. The issues can be easily avoided by doing any critical transactional work against the primary read-write node only, which would be the only typical way in which transactional work would be done against a Postgres cluster of this sort.
Lucky them, there is an automated suite[1] to verify the correct behavior :-D
Is using the words "bug" and "guarantee" throwing casual readers off the mark?
If I was some database vendor that sometimes plays fast and loose (not saying Microsoft is, just an example) and my product is good for 99.95% of use cases and the remainder is exceedingly hard to fix, I'd probably be more likely to pay for Jepsen not to do an analysis, because hiring them would result in people being more likely to leave an otherwise sufficient product due to those faults being brought to light.
So unlikely v17 will make a difference.
From what I understand, Multi-AZ has some setup with multiple semi-synchronous replicas where only one replica needs to acknowledge the transaction.
Aurora doesn't use semi-synchronous replication, but uses clustered/shared storage with a different replication setup for cache invalidation.
Unfortunately reliability is not that high on the priority list here. Keep up the good work!
Multi-AZ instances are a long-standing feature of RDS where the primary DB is synchronously replicated to a secondary DB in another AZ. On failure of the primary, RDS fails over to the secondary.
Multi-AZ clusters have two secondaries, and transactions are synchronously replicated to at least one of them. This is more robust than Multi-AZ instances if a secondary fails or is degraded. It also allows read-only access to the secondaries.
Multi-AZ clusters no doubt have more "magic" under the hood, as it's not a vanilla Postgres feature as far as I'm aware. I imagine this is why they're failing the Jepsen test.
So if snapshot isolation violations are happening inside Multi-AZ clusters, can they happen with a single-region, multiple-read-replica kind of setup as well? But perhaps they're more easily observable in Multi-AZ setups because the lag is higher?
Two replicas in a “semi synchronous” configuration, as AWS calls it, is to my knowledge not available in base Postgres. AWS must be using some bespoke replication strategy, which would have different bugs than synchronous replication and is less battle-tested.
But as nobody except AWS knows the implementation details of RDS, this is all idle speculation that doesn’t mean much.
So Amazon's Multi-AZ clusters seem to replicate changes out of order?
See also my comment https://news.ycombinator.com/item?id=43843790 elsewhere in this thread
There still is a Postgres deficiency that makes something similar to this pattern possible. Non-replicated transactions where the client goes away mid-commit become visible immediately. So in the example, if T1 happens on a partitioned leader and disconnects during commit, T2 also happens on a partitioned node, and T3 and T4 happen later on a new leader, you would also see the same result. However, this does not jibe with the statement that fault injection was not done in this test.
Edit: did not notice the post that this pattern can be explained by inconsistent commit order on replica and primary. Kind of embarrassing given I've done a talk proposing how to fix that.
> We show that Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation
you kind of have to expect people to read that as covering multi-AZ RDS in general.
however, "multi-AZ" has been made ambiguous, because there are now multi-AZ instances and multi-AZ clusters.
...and your multi-AZ "instance", despite being not a multi-AZ "cluster" from AWS's perspective, is still two nodes that are "clustered" together and treated as one logical database from the client connection perspective.
see [0] and scroll down to the "availability and durability" screenshot for an example.
0: https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-db-clus...
And the mere fact that it's hard to reason about these things means that it's hard to avoid problems. Hence, the easiest solution is likely "it may be possible to recover Snapshot Isolation by only using the writer endpoint" for anything where a write is in any way conditional on a read.
Although I'm surprised the "only using the writer endpoint" method wasn't tested, especially in availability loss situations.
- User1 comments
- User2 comments
- User1 checks (in a separate tx) that there's only 1 comment, so User1 gets the badge
- User2 checks the same (in a separate tx) and also sees only 1 comment (his), and also receives the badge.
With Snapshot isolation this isn't possible. At least one of the checks made in a separate tx would see 2 comments.
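Sketched in SQL, with a hypothetical comments table (the schema is invented; what matters is the two independent read-only checks):

-- the two comments are inserted and committed by separate transactions
INSERT INTO comments (post_id, author, body) VALUES (42, 'user1', '...');
INSERT INTO comments (post_id, author, body) VALUES (42, 'user2', '...');

-- each user then runs its own read-only check transaction
BEGIN;
SELECT count(*) FROM comments WHERE post_id = 42;  -- decide whether to award the badge
COMMIT;

-- under snapshot isolation, at most one of the two checks can still see count = 1;
-- with the long-fork anomaly, both checks can see only "their own" comment,
-- and both users get the first-comment badge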
The original article on Parallel Snapshot Isolation is a good read: https://scispace.com/pdf/transactional-storage-for-geo-repli...
Not what an RDBMS stakeholder wants to wake up to on the best of days. I'd imagine there were a couple of emails expressing concern internally.
hats off to aphyr as usual.
(Hats off to aphyr for sure!)
RDBMS - https://en.wikipedia.org/wiki/Relational_database#RDBMS
"Amazon RDS for PostgreSQL is an Amazon Web Services (AWS) service which provides managed instances of the PostgreSQL database. We show that Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation, the strongest consistency model supported across all endpoints. Healthy clusters occasionally allow..."
Direct, to-the-point, unembellished and analogous to how other STEM disciplines share findings. There was a time I liked reading cleverly written blog posts that use memes to explain things, but now I long for the plain and simple.
Anyway, I really enjoyed it, because I like technical writing. I found that if I wrote a deeply technical post, I’d get very few likes and comments – in fact, I even had a Staff Eng tell me I should more narrowly target the audience (you could tag groups as an intended audience; they’d only see the notification if they went to the blog, so it wasn’t intrusive) because most of engineering had no idea what I was talking about.
Then, I made a post about Kubecost (disclaimer: this was in its very early days, long before being acquired by IBM; I have no idea how it performs now, and this should not dissuade you from trying it if you want to) and how in my tests with it, its recommendations were a poor fit, and would have resulted in either minimal savings, or caused container performance issues. The post was still fairly technical, examining CPU throttling, discussing cgroups, etc. but the key difference was memes. People LOVED it.
I later repeated this experiment with something even more technical; IIRC it involved writing some tiny Python external library in C and accessing it with ctypes, and comparing stack vs. heap allocations. Except, I also included memes. Same result, slightly lessened from the FinOps one, but still far more likes and comments than I would expect for something so dry and utterly inapplicable to most people’s day-to-day job.
Like you, I find this trend upsetting, but I also don't know how else to avoid it if you're trying to reach a broader audience. Jepsen, of course, is not, and I applaud them for their rigorous approach and pure writing.
See for example https://aphyr.com/posts/282-call-me-maybe-postgres, which makes heavy use of memes.
1. If your memes were analogies to the dry technical concepts, then the simple, easy to digest analogies were the key here, not the memes themselves.
2. Pictures are worth a thousand words. The more visual you can make your writing the better. Even something as simple as using bullet points instead of dense paragraphs of text works wonders. But the key is to use graphs and illustrations to explain and show concepts wherever possible.
It doesn't feel full of memes, but it does have great illustrations! For example, the visualization of the fragmentation of data for a UUID v4 idb file.
Software developers nowadays barely know about transactions, and definitely not about different transaction models (in my experience). I have even encountered "senior developers" (who are actually so-called "CRUD developers") who are clueless about database transactions. In reality, transactions and transaction models matter a lot for performance and error-free code (at least when you have volumes of traffic and your software solves something non-trivial).
For example: after a lot of analysis, I switched from SQL Server's standard Read Committed to Read Committed Snapshot Isolation in a large project, and the users could not be happier; a lot of locking contention disappeared. No software engineer on that project had any clue about transaction models or locks before I taught them some basics (even though they had used transactions extensively in that project).
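For reference, that switch is a database-level option in SQL Server; a sketch, with a made-up database name (it needs exclusive access to the database to take effect, so plan a maintenance window):

-- make plain READ COMMITTED use row-versioned snapshots instead of shared locks
ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;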
I’ve seen this work out terribly at certain points in my career.
But code generated by an LLM will likely also have bugs that could be fixed with transactions.
Makes me sad, since I work mostly in retail and encounter systems that are infested with race conditions and similar errors: things where these isolation levels would be of great help.
However, it's mostly engineers at startups; I have a very high opinion of typical Oracle/MSSQL developers at BigCos, who at least have their fundamentals right.
Clearly it worked for them, but I spent a few different stints cleaning up after developers who didn't know this sort of thing.
We've worked around it by not touching the hot stove, but it's kind of worrying that there are consistency issues with it.
I’m not well versed in RDS but I believe that clustered is the only way to use it.
I was quite surprised to read that Stripe used MongoDB in the early days and still does today, and I can't imagine the sheer nightmares they must have faced using it all these years.
Literally the only job ad I've seen talking about MongoDB was a job ad for MongoDB itself.
https://web.archive.org/web/20150312112552/http://blog.found...
Now members of the original Foundation team have started Antithesis (https://antithesis.com/) to make it easier for other systems to adopt this sort of testing.