Source: I have operated a large multi-master Postgres cluster.
The problem with the setup is you will have a data corruption issue at some point. It's not an "if", it's a "when". If you don't have a plan to deal with it, then you're hosed.
This is why the parent is turning around the burden of proof. If you can't definitively say why you absolutely need this, and why no other solution will do, then avoid it.
IME it comes down to considering CAP against the business goals, and taking into account how much it will annoy the development team(s).
If you follow "the rules" WRT writes, it may fit the bill, especially these days with beauties like RDS. But then again, Aurora is pretty awesome, and did not exist/mature until ~5 years ago or so.
Definitely more of a wart than a panacea or silver bullet. Even so, I wouldn't dismiss it outright; I'm always keen to compare alternatives.
Overall it sounds like we're in the same camp, heh.
Could you tell us what kind of DB that was, so we can understand whether it is an apples-to-apples comparison to multi-master PG?
Why do you like Aurora? Genuinely curious. Here's my list of pros and cons, after having used Aurora MySQL.
Pro: Buffer pool persistence after restart is admittedly a very cool trick. That's it. That's the pro. The cons are long.
It's slow as hell. I don't know why this comes as a shock to anyone, but it's probably due to my statement answering your other question about a lack of knowledge of computing fundamentals. When your storage lives on 6 nodes spread dozens of miles apart, and you need quorum ack to commit, you're gonna have some pretty horrendous write latency. I have run benchmarks (realistic ones for a workload at a previous employer) comparing Aurora to some 13-year-old Dell servers I have, and the ancient Dells won every time, by a lot. They didn't technically even have node-local storage; they had NVMe drives in a Ceph pool over a Mellanox InfiniBand network.
The re-architecture - for MySQL anyway - required them to lose the change buffer. This buffers writes to secondary indices, which is quite helpful for performance. So now, not only do all writes to indices have to go directly to disk, they have to do so over a long distance, and achieve quorum. Oof.
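To put a rough shape on the quorum-latency point, here is a minimal sketch, not Aurora's actual implementation. It assumes a commit is acknowledged once a write quorum of storage nodes has acked, so the commit latency is simply the k-th fastest ack. The 4-of-6 quorum and every latency figure below are illustrative assumptions, not measurements.

```python
def quorum_commit_latency_ms(ack_latencies_ms, quorum):
    """Latency until `quorum` nodes have acknowledged a write.

    The commit waits for the quorum-th fastest ack, so the result is the
    quorum-th smallest value in `ack_latencies_ms`.
    """
    if not 1 <= quorum <= len(ack_latencies_ms):
        raise ValueError("quorum must be between 1 and the number of nodes")
    return sorted(ack_latencies_ms)[quorum - 1]


# Hypothetical round-trip times (ms) to six storage nodes, two per zone.
remote_nodes = [0.8, 0.9, 1.6, 1.7, 2.4, 2.6]
# Hypothetical single node-local NVMe device.
local_disk = [0.1]

remote = quorum_commit_latency_ms(remote_nodes, quorum=4)
local = quorum_commit_latency_ms(local_disk, quorum=1)
print(f"quorum commit: {remote} ms, local commit: {local} ms")
```

The point of the model is that the two nearest nodes never set the pace: with a 4-of-6 quorum, at least one ack always has to come from a farther zone.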
Various InnoDB parameters that I would like to tune (and know how to do so correctly) are locked away.
I believe that AWS is being deceptive when they tout the ability to have 128 (now 256) TiB of storage. Yes, you can hit those numbers. Good luck operating there, though. Take one of the most common DDL operations performed: secondary index builds. AWS fully knows that this would take forever if written to the cluster volume, so they have a "local storage" drive (which is actually EBS) attached to the instance that's used for temporary storage of things like on-disk temp tables for sorts, and secondary index builds. This drive is sized vaguely proportionally to the size of the instance, and cannot be adjusted. If you have a large table - which you're likely to have if you're operating close to the cluster storage limits - you will likely discover that there isn't enough room on this drive to create an index. Sorry, have fun with that!
Finally, purely on a philosophical level, I find the idea of charging for I/O to be absolutely atrocious. Charge me a flat rate, or at the very least, a rate per byte, or some other unit that's likely to be understood by an average dev. "We charge you per page fetched from disk, except we charge for writes in 4 KiB segments, but sometimes they get batched together" - madness.
IME - both at a place using active-active, and at places that suggested using it - the core issue is developer competency. People in general like to think of themselves as above average in most areas of life (e.g. "I'm an above-average driver"). I'm certainly not excluded from this, but over the last several years, I like to think I've become self-aware enough to understand my own limitations, and to know what I am and am not an expert in.
So, you'll get devs who read some blog posts, and then when the CTO announces that they're going multi-region, they rush forward with the excitement of people not yet hardened by the horrors of distributed systems. They're probably running a distributed monolith, because obviously the original monolith had to be decomposed into microservices for trendy reasons, but since that wasn't done well, they now have a dependency chain, each with its own sub-dependencies.
There is also a general lack of understanding of computing fundamentals in the industry. By fundamentals, I mean knowledge of concepts like latency (and the relative latency of CPU cache levels, RAM, disk, network, etc.), IOPS, etc. People love to believe that these lower-order elements have been abstracted away, but abstractions leak, and then you're stuck. There are also more practical skills that I wrongly assumed were universal, like the ability to profile one's code, read logs, and read technical documentation for the tools you're using.
Finally, there is an overwhelming desire to over-complicate, and to build anew instead of using existing and proven technology. Why run HAProxy when you can build your own little health checker for fun in NodeJS (this actually happened to me)? Sure, we could redesign our schema to have better normalization, and stop using UUIDv4 PKs so our pages aren't scattered all around the B+tree, or we could just rent bigger servers, and add another caching layer.
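For anyone wondering why UUIDv4 PKs scatter pages, here is a crude sketch under stated assumptions: the index is modelled as a sorted Python list, and a "page" is just a fixed-size slice of it (real B+tree leaves split and merge, so treat this as an illustration of locality only). The `page_switches` helper and the page size are hypothetical. It counts how often consecutive inserts land on a different page.

```python
import bisect
import uuid


def page_switches(keys, page_size=100):
    """Count how often consecutive inserts land on a different 'page'.

    A page is modelled as a fixed-size slice of the sorted key list, which
    is only a rough stand-in for a B+tree leaf page.
    """
    index = []
    switches = 0
    last_page = None
    for key in keys:
        pos = bisect.bisect_left(index, key)
        index.insert(pos, key)
        page = pos // page_size
        if last_page is not None and page != last_page:
            switches += 1
        last_page = page
    return switches


n = 2000
sequential = page_switches(range(n))                               # monotonically increasing keys
scattered = page_switches([str(uuid.uuid4()) for _ in range(n)])   # random UUIDv4 keys
print(f"sequential keys: {sequential} page switches")
print(f"UUIDv4 keys:     {scattered} page switches")
```

Sequential keys keep appending to the same tail page until it fills, while random keys jump to a different page on most inserts, which is what hurts cache and buffer pool locality.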
Typically a strongly consistent (CP) system works by having a single elected master, where writes are only ack'd once they're written to a majority of the cluster. The downside is that you need a majority of the cluster working and up to date, and you pay a performance cost for it.
A multi-master system is generally AP: it allows writes to any master node, but has some consensus algorithm that picks winners among conflicting writes. It should be faster and more available, at the cost of potentially lost data.
There are some systems that claim to beat CAP, but they typically come with caveats and require certain assurances. After all, if you ack a write, and then that node blows up, how will it ever sync?
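The majority-ack rule for the CP case can be sketched in a few lines. This is a toy model under loud assumptions: `Replica` and `majority_write` are hypothetical names, there is no leader election, and a failed write is not rolled back on the replicas that did accept it, which a real system would have to handle.

```python
class Replica:
    """Toy replica that either persists an entry or is unreachable."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []

    def append(self, entry):
        """Persist an entry; returns False when the replica is down."""
        if not self.up:
            return False
        self.log.append(entry)
        return True


def majority_write(replicas, entry):
    """Ack the write only if a strict majority of replicas persisted it.

    Note: entries accepted by a minority are NOT cleaned up here.
    """
    acks = sum(1 for replica in replicas if replica.append(entry))
    return acks > len(replicas) // 2


one_down = [Replica("a"), Replica("b"), Replica("c", up=False)]
two_down = [Replica("a"), Replica("b", up=False), Replica("c", up=False)]

ok_with_one_down = majority_write(one_down, "x=1")
ok_with_two_down = majority_write(two_down, "x=1")
print(ok_with_one_down, ok_with_two_down)
```

With one of three replicas down the write is still acked; with two down it is refused, which is the availability cost described above.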
If by “caveats and assurances,” you mean “relax the definitions of CAP,” then yes. CAP, in its strict definition, has been formally proven [0].
> After all, if you ack a write, and then that node blows up, how will it ever sync?
That’s just async replication.
0: https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-...
Could you elaborate on what problems you experienced?
Each node had N replicas running vanilla Postgres attached, which were on EC2s with node-local NVMe drives for higher performance. This was absolutely necessary for the application. There was also a smattering of Aurora Postgres instances attached, which the data folk used for analytics.
In no particular order:
* DDL is a nightmare. BDR by default will replicate DDL statements across the mesh, but the locking characteristics combined with the latency between `ap-southeast-2` and `us-east-1` (for example) meant that we couldn't use it; thus, we had to execute it separately on each node. Also, since the attached Aurora instances were blissfully unaware of anything but themselves, for any table-level operations (e.g. adding a column), we had to execute it on those first, lest we start building up WAL at an uncomfortable pace due to replication errors.
* You know how it's common to run without FK constraints, because "scalability," etc.? Imagine the fun of having devs manage referential integrity combined with eventual consistency across a global mesh.
* Things like maximum network throughput start to become concerns. Tbf, this is more due to modern development's tendency to use JSON everywhere, and to have heavily denormalized tables, but it's magnified by the need to have those changes replicated globally.
* Hiring is _hard_. I can already hear people saying, "well, you were running on bare EC2s," and sure, that requires Linux administration knowledge as a baseline - I promise you, that's a benefit. To effectively manage a multi-master RDBMS cluster, you need to know how to expertly administrate and troubleshoot the RDBMS itself, and to fully understand the implications and effects of some of those settings, you need to have a good handle on Linux. You're also almost certainly going to be doing some kernel parameter tuning. Plus, in the modern tech world, infra is declared in IaC, so you need to understand Terraform, etc. You're probably going to be writing various scripts, so you need to know shell and Python.
There were probably more, but those are the main ones that come to mind.
Can I ask more about this? I assume you created a procedure around making DDL changes to the global cluster... What was that procedure like? What tools did you use (or create) to automate/script this? What failure modes did it encounter?
Relevant excerpt: "pgEdge offers eventual consistency between nodes using a configurable policy (e.g. last-writer-wins) for conflict resolution, along with conflict-free delta apply columns (i.e. CRDTs) for running sum fields. This allows for independent, concurrent and eventually consistent updates across multiple nodes."
Some specific documentation on the subject: https://docs.pgedge.com/spock_ext/conflicts
One of our solutions engineers (Paul Rothrock) created a video on this topic in the last month: https://www.youtube.com/watch?v=prkMkG0SOJE
And if you're interested in more information about conflict management in PostgreSQL clusters in general, this article ("Living on the Edge: Conflict Management and You") from Shaun Thomas is probably useful to check out: https://www.pgedge.com/blog/living-on-the-edge
In the meantime, you can find a lot of information in the official FAQ on how conflict resolution is handled (https://www.pgedge.com/resources/faq), but at-a-glance, "pgEdge offers eventual consistency between nodes using a configurable policy (e.g. last-writer-wins) for conflict resolution, along with conflict-free delta apply columns (i.e. CRDTs) for running sum fields. This allows for independent, concurrent and eventually consistent updates across multiple nodes."
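To illustrate the two policies named in that excerpt, here is a minimal sketch. It is not pgEdge's or Spock's implementation: the `Version` type, the use of `node_id` as a tie-breaker, and the exact delta formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: int
    commit_ts: float  # commit timestamp taken from the origin node's clock
    node_id: str      # hypothetical tie-breaker, not taken from pgEdge docs


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the version with the later timestamp; break ties by node_id."""
    return max(a, b, key=lambda v: (v.commit_ts, v.node_id))


def delta_apply(local_value: int, remote_old: int, remote_new: int) -> int:
    """Apply the remote change as a delta instead of overwriting."""
    return local_value + (remote_new - remote_old)


# Both nodes start from a running sum of 100.
# Node "a" adds 10 (100 -> 110) while node "b" adds 5 (100 -> 105).
from_a = Version(value=110, commit_ts=1.0, node_id="a")
from_b = Version(value=105, commit_ts=2.0, node_id="b")

lww_result = last_writer_wins(from_a, from_b).value  # node a's +10 is lost
on_a = delta_apply(local_value=110, remote_old=100, remote_new=105)
on_b = delta_apply(local_value=105, remote_old=100, remote_new=110)
print(lww_result, on_a, on_b)
```

Under last-writer-wins one of the two increments is silently dropped, whereas applying deltas lets both nodes converge on the same sum regardless of arrival order. That is why a running-sum column wants the delta treatment.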
How do you generate the timestamps for last writer wins? What happens if there is a tie?
Just my 2c: if I see a distributed database, the first question I ask is how it handles distributed transactions. Perhaps this topic should be higher on your FAQ; currently it is the 21st question.
And, one of our solutions engineers (Paul Rothrock) has a video released a month ago on this topic as well: https://www.youtube.com/watch?v=prkMkG0SOJE
Sharing these alongside my other comment in case additional information is helpful :-)
Unfortunately, Bucardo is no longer being updated.
Our goal is simply to support continued innovation of distributed PostgreSQL along with similar tools for enabling high availability / scalability in PG deployments.
So yes, license (and compatibility - see https://pgscorecard.com) are two major differences between pgEdge and CockroachDB.
pgEdge version updates also come in very close alignment with upstream PostgreSQL intentionally to make sure security patches/bugfixes and the latest features get to users ASAP.
CockroachDB != PostgreSQL.
I take great issue with the way CockroachDB marketing seeks to imply compatibility, when in fact what they are promising is wire protocol compatibility (i.e. you can fire up your copy of psql on the CLI and it will connect).
Last time I looked, a great number of primitive, obvious, fundamental, low-hanging fruit were completely absent from CockroachDB, e.g. (IIRC) stored procedures are nowhere to be seen in CockroachDB.
See https://jepsen.io/consistency/models for a classification of consistency models.
verelo•14h ago
Many many years ago I worked on a monitoring tool that itself needed to be highly available, and we needed a solution like this. Ever since that time I've done everything in my power to avoid it.
What are the real world cases you built this for? And how can someone like me who has been bruised by past experiences get comfortable with it?
pgedge_postgres•11h ago
If understanding how conflicts are handled in pgEdge is helpful, here's a link to the docs on the subject: https://docs.pgedge.com/spock_ext/conflicts
And the FAQ also delves into it some: https://www.pgedge.com/resources/faq
jwr•4h ago
They should! Read some of the excellent Jepsen analyses to see how scary things can be: https://jepsen.io/analyses