Show HN: Open source, logical multi-master PostgreSQL replication

151•pgedge_postgres•4mo ago

Comments

verelo•4mo ago

Interesting, i always see attempts to make these types of database tools as super interesting but then I think about all the undocumented edge cases that can come up and they scare me off.

Many many years ago I worked on a monitoring tool that itself needed to be highly available, and we needed a solution like this. Ever since that time I've done everything in my power to avoid it.

What are the real world cases you built this for? And how can someone like me who has been bruised by past experiences get comfortable with it?

victor9000•4mo ago

What failure cases did you encounter?

pgedge_postgres•4mo ago

Getting some examples of real-world cases to share and will comment back with them ASAP; in the meantime, would you mind sharing what undocumented edge cases you came across and what solutions you explored to handle them? It would help with sharing super relevant use cases :-)

verelo•4mo ago

I tried to escape this world as quickly as possible, realizing how horrible it was, but the largest issue I ran into was around IO. Creating an environment that was highly tolerant to fault while having little to no replication delay meant checking in on the master database frequently. Keeping in mind this was around 2010 I found that the IO load on these databases was substantially larger than any database that i had ever worked on before. Things like available file handlers and other related performance problems came up more frequently than I’ve ever experienced before and frankly more frequently than I’ve ever experienced since.

If I was to summarize it, I would just say the performance characteristics were not something I was used to experiencing and often they would surprise me when they occurred, which meant having a good quality of a while for running this application was very challenging.

pgedge_postgres•4mo ago

Just a guess, but some of the undocumented edge cases you saw might be explored in this blog from one of our software engineers, Shaun Thomas. It's all about conflict resolution & avoidance in PostgreSQL, in general: https://www.pgedge.com/blog/living-on-the-edge

If understanding how conflicts are handled in pgEdge is helpful, here's a link to the docs on the subject: https://docs.pgedge.com/spock_ext/conflicts

And the FAQ also delves into it some: https://www.pgedge.com/resources/faq

baq•4mo ago

Typical use case would be a anyone who has global presence, but serves users in particular geos (think AWS): you want a global user database but it’s soooo convenient to be able to join with regional data in a single query.

jwr•4mo ago

> edge cases that can come up and they scare me off

They should! Read some of the excellent Jepsen analyses to see how scary things can be: https://jepsen.io/analyses

vyruss•4mo ago

Local write latency in a geo-distributed database is also important for some use cases.

sgarland•4mo ago

You do not want multi-master. If you think you do, think again.

Source: I have operated a large multi-master Postgres cluster.

bigwheels•4mo ago

I imagined this position would depend almost entirely on the requirements of the project. Are you able to elaborate on why it's a universal "NO" for you?

gtowey•4mo ago

That's just the point, it always sounds like a great idea to people not experienced in database operations.

The problem with the setup is you will have a data corruption issue at some point. It's not an "if" it's a "when". If you don't have a plan to deal with it, then you're hosed.

This is why the parent is turning around the burden of proof. If you can't definitely say why you absolutely need this, and no other solution will do, then avoid it.

bigwheels•4mo ago

Believe it or not, Mrs. Bigwheels is pretty experienced in the database department. I've seen multi-master HA architecture work out great for 10M+ DAU games, and many/most other cases where I wouldn't recommend it- as in it wouldn't even enter my brain, because the tradeoffs are harsh.

IME it comes down to considering CAP against the business goals, and taking into account how much it will annoy the development team(s).

If you follow "the rules" WRT to writes, it may fit the bill. Especially these days with beauties like RDS. But then again, Aurora is pretty awesome, and did not exist/mature until only ~5 years ago or so.

Definitely more of a wart than a pancea or silver bullet. Even still, I wouldn't dismiss outright, always keen to compare alternatives.

Overall it sounds like we're in the same camp, heh.

porridgeraisin•4mo ago

What would you say are the primary tradeoffs?

riku_iki•4mo ago

> I've seen multi-master HA architecture work out great for 10M+ DAU games

could you tell what kind of DB was that so we can understand if it is apple to apple comparison to multi-master PG?

bigwheels•4mo ago

It was a huge cluster (30 large servers in total) of MySQL with the MyISAM engine running in a Master-Master configuration. No foreign-keys allowed, the Apps were responsible for correctly enforcing data constraints.

Nowadays it seems the answer has somehow become "Pay MongoDB a ton of money for a support contract" and call it a day (Fortnite by Epic Games). Let's just say this isn't really my style, but somehow the game does work. To be real with you, keeping track of player scores, doing lobby matchmaking, and storing a few hundred or thousand items is pretty straightforward, even at "high-scale".

riku_iki•4mo ago

> MyISAM engine

MyISAM didn't support ACID transactions, thus can't be apples to apples comparison, its just very different niches.

There are plenty of distributed databases on the market today which could be used if you don't need ACID transactions.

bigwheels•4mo ago

Oops - I wasn't fully awake yet! It was using InnoDB.

MyISAM is mostly a huge "NOPE!" from me -_-

riku_iki•4mo ago

And do you know what was the transaction+replication story there? Were all changes transactionally and synchronously replicated?

n_u•3mo ago

> No foreign-keys allowed

So it was sharded? And no cross-shard queries or txns were allowed?

Or it was replicated? And no FKs were allowed because of consistency problems?

sgarland•4mo ago

> But then again, Aurora is pretty awesome

Why do you like Aurora? Genuinely curious. Here's my list of pros and cons, after having used Aurora MySQL.

Pro: Buffer pool persistence after restart is admittedly a very cool trick. That's it. That's the pro. The cons are long.

It's slow as hell. I don't know why this comes as a shock to anyone, but it's probably due to my statement answering your other question about a lack of knowledge of computing fundamentals. When your storage lives on 6 nodes spread dozens of miles apart, and you need quorum ack to commit, you're gonna have some pretty horrendous write latency. I have run benchmarks (realistic ones for a workload at a previous employer) comparing Aurora to some 13-year old Dell servers I have, and the ancient Dells won every time, by a lot. They didn't technically even have node-local storage; they had NVMe drives in a Ceph pool over a Mellanox Infiniband network.

The re-architecture - for MySQL anyway - required them to lose the change buffer. This buffers writes to secondary indices, which is quite helpful for performance. So now, not only do all writes to indices have to go directly to disk, they have to do so over a long distance, and achieve quorum. Oof.

Various InnoDB parameters that I would like to tune (and know how to do so correctly) are locked away.

I believe that AWS is being deceptive when they tout the ability to have 128 (now 256) TiB of storage. Yes, you can hit those numbers. Good luck operating there, though. Take one of the most common DDL operations performed: secondary index builds. AWS fully knows that this would take forever if written to the cluster volume, so they have a "local storage" drive (which is actually EBS) attached to the instance that's used for temporary storage of things like on-disk temp tables for sorts, and secondary index builds. This drive is sized vaguely proportionally to the size of the instance, and cannot be adjusted. If you have a large table - which you're likely to have if you're operating close to the cluster storage limits - you will likely discover that there isn't enough room on this drive to create an index. Sorry, have fun with that!

Finally, purely on a philosophical level, I find the idea of charging for I/O to be absolutely atrocious. Charge me a flat rate, or at the very least, a rate per byte, or some other unit that's likely to be understood by an average dev. "We charge you per page fetched from disk, except we charge for writes in 4 KiB segments, but sometimes they get batched together" - madness.

sgarland•4mo ago

I replied above with some problems I experienced, but this question is slightly different, so I'll add more here.

IME - both at a place using active-active, and at places that suggested using it - the core issue is developer competency. People in general like to think of themselves as above average in most areas of life (e.g. "I'm an above-average driver"). I'm certainly not excluded from this, but over the last several years, I like to think I've become self-aware enough to understand my own limitations, and to know what I am and am not an expert in.

So, you'll get devs who read some blog posts, and then when the CTO announces that they're going multi-region, they rush forward with the excitement of people not yet hardened by the horrors of distributed systems. They're probably running a distributed monolith, because obviously the original monolith had to be decomposed into micro services for trendy reasons, but since that wasn't done well, they now have a dependency chain, each with its own sub-dependencies.

There is also a general lack of understanding of computing fundamentals in the industry. By fundamentals, I mean knowledge of concepts like latency (and the relative latency of CPU cache levels, RAM, disk, network, etc.), IOPS, etc. People love to believe that these lower-order elements have been abstracted away, but abstractions leak, and then you're stuck. There are also more practical skills that I wrongly assumed were universal, like the ability to profile one's code, read logs, and read technical documentation for the tools you're using.

Finally, there is an overwhelming desire to over-complicate, and to build anew instead of using existing and proven technology. Why run HAProxy when you can build your own little health checker for fun in NodeJS (this actually happened to me)? Sure, we could redesign our schema to have better normalization, and stop using UUIDv4 PKs so our pages aren't scattered all around the B+tree, or we could just rent bigger servers, and add another caching layer.

phs318u•4mo ago

Multi-master can be useful in cases where writes to the data are usually logically grouped by an attribute that correlates to the distribution of masters e.g. sales info by geography. The chances of write conflicts become much smaller (though not zero.

ownagefool•4mo ago

That's not multi-master in the typical sense, it's sharding, and done correctly, you shouldn't have any write conflicts because each shard should be strongly consistent within itself.

Typically a strongly consistent (CP) system works by having a single elected master where writes are only ack'd when they're written to the majority of the cluster. The downside of this system is you need majority of the cluster working and up-to-date and the performance impact of doing this.

A multi-master system is generally ( AP ) allows writes to any master node, but has some consensus algorithm where it picks and chooses winners based on conflicting writes. It should be faster and more available at the cost of potentially lost data.

There are some systems that claim to beat CAP but they typically have caveats and assurances that are required. After-all, if you ack a write, and then that node blows up, how will it ever sync?

sgarland•4mo ago

> There are some systems that claim to beat CAP but they typically have caveats and assurances that are required.

If by “caveats and assurances,” you mean “relax the definitions of CAP,” then yes. CAP, in its strict definition, has been formally proven [0].

> After-all, if you ack a write, and then that node blows up, how will it ever sync?

That’s just async replication.

0: https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-...

ownagefool•4mo ago

No disagreement. I was making space for a system like spanner whilst alluding to the fact it still needs to choose.

jasonthorsness•4mo ago

Agree; the part of the application requiring multi-master semantics is probably a small piece and can be handled outside the database where there is enough domain-specific knowledge that it can be made simpler and more obvious how conflicts for example are avoided or handled.

pgedge_postgres•4mo ago

There's a lot of ways to approach the common problems found when running multi-master / active-active PostgreSQL. (A complete guide on this, specifically using PostgreSQL in general, was written by one of our solutions engineers, Shaun Thomas: https://www.pgedge.com/blog/living-on-the-edge)

Could you elaborate on what problems you experienced?

sgarland•4mo ago

To clarify, I was working with 2nd Quadrant BDR (now Enterprise DB Postgres Distributed), running on some extremely large EC2 instances, in a global mesh - IIRC, five regions. Also in fairness, EDB told us that we were the largest mesh they had seen, and so we frequently ran into fun edge cases.

Each node had N replicas running vanilla Postgres attached, which were on EC2s with node-local NVMe drives for higher performance. This was absolutely necessary for the application. There were also a smattering of Aurora Postgres instances attached, which the data folk used for analytics.

In no particular order:

* DDL is a nightmare. BDR by default will replicate DDL statements across the mesh, but the locking characteristics combined with the latency between `ap-southeast-2` and `us-east-1` (for example) meant that we couldn't use it; thus, we had to execute it separately on each node. Also, since the attached Aurora instances were blissfully unaware of anything but themselves, for any table-level operations (e.g. adding a column), we had to execute it on those first, lest we start building up WAL at an uncomfortable pace due to replication errors.

* You know how it's common to run without FK constraints, because "scalability," etc.? Imagine the fun of having devs manage referential integrity combined with eventual consistency across a global mesh.

* Things like maximum network throughput start to become concerns. Tbf, this is more due to modern development's tendency to use JSON everywhere, and to have heavily denormalized tables, but it's magnified by the need to have those changes replicated globally.

* Hiring is _hard_. I can already hear people saying, "well, you were running on bare EC2s," and sure, that requires Linux administration knowledge as a baseline - I promise you, that's a benefit. To effectively manage a multi-master RDBMS cluster, you need to know how to expertly administrate and troubleshoot the RDBMS itself, and to fully understand the implications and effects of some of those settings, you need to have a good handle on Linux. You're also almost certainly going to be doing some kernel parameter tuning. Plus, in the modern tech world, infra is declared in IaC, so you need to understand Terraform, etc. You're probably going to be writing various scripts, so you need to know shell and Python.

There were probably more, but those are the main ones that come to mind.

asah•4mo ago

"DDL is a nightmare"

Can I ask more about this? I assume you created a procedure around making DDL changes to the global cluster... what was that procedure like? what tools did you use (create) to automate/script this? what failure modes did it encounter?

sgarland•4mo ago

Bold of you to assume it was automated. The process I used was tmux with pane synchronization.

I asked to automate it (probably would've just been a shell script, _maybe_ Python, issuing SQL commands to stdin), but people were afraid of unknown unknowns.

bonesmoses•4mo ago

I don't recall which customer you may have been, but the standard solution to that specific DDL issue with BDR is to use Stream Triggers to enable row versioning. One of the 2ndQuadrant customers used it extensively for multi-region cross-version app schema migrations that could last for months.

Essentially what that boils down to is you create stream triggers that intercept the logical stream and modify it to fit the column orientation by version. So during the transition, the triggers would be deployed to specific nodes while modifications are rolled out. Once everything was on the new version, triggers were all dropped until the next migration.

Spock doesn't have anything like that _yet_, but as you observed, being unable to use DDL replication significantly increases complexity, and tmux is a poor substitute.

sgarland•4mo ago

That is very cool! Was/is it available on BDR v3.6? I think that’s the version I was using.

bonesmoses•3mo ago

Version 3.6 is definitely after we introduced that functionality. It was, alas, rarely deployed though. Usually clients needed a consultant to put together a configuration for them, and most didn't bother.

tonyhart7•4mo ago

how do they resolve write conflict????

pgedge_postgres•4mo ago

The official FAQ has a good amount of info on how conflict resolution is handled (https://www.pgedge.com/resources/faq)!

Relevant excerpt: "pgEdge offers eventual consistency between nodes using a configurable policy (e.g. last-writer-wins) for conflict resolution, along with conflict-free delta apply columns (i.e. CRDTs) for running sum fields. This allows for independent, concurrent and eventually consistent updates across multiple nodes."

Some specific documentation on the subject: https://docs.pgedge.com/spock_ext/conflicts

One of our solutions engineers (Paul Rothrock) created a video on this topic in the last month: https://www.youtube.com/watch?v=prkMkG0SOJE

And if you're interested in more information about conflict management in PostgreSQL clusters in general, this article ("Living on the Edge: Conflict Management and You") from Shaun Thomas is probably useful to check out: https://www.pgedge.com/blog/living-on-the-edge

OsrsNeedsf2P•4mo ago

If both nodes approve an update on the same primary key, what happens? I don't see this crucial detail described in the README

pgedge_postgres•4mo ago

Thanks for pointing out the lack of info on conflict resolution in the README! It's been reported and we'll look at getting that updated ASAP.

In the meantime, you can find a lot of information in the official FAQ on how conflict resolution is handled (https://www.pgedge.com/resources/faq), but at-a-glance, "pgEdge offers eventual consistency between nodes using a configurable policy (e.g. last-writer-wins) for conflict resolution, along with conflict-free delta apply columns (i.e. CRDTs) for running sum fields. This allows for independent, concurrent and eventually consistent updates across multiple nodes."

n_u•4mo ago

Cool project!

How do you generate the timestamps for last writer wins? What happens if there is a tie?

Just my 2c: if I see a distributed database, the first question I ask is how it handles distributed transactions. Perhaps this topic should be higher on your FAQ, currently it is the 21st question.

bonesmoses•4mo ago

There's an option in the Postgres configuration named "track_commit_timestamp" that does this automatically. It's required to be enabled when using LWW as the conflict resolution model. If there's a tie, the node with the highest node number wins.

pgedge_postgres•4mo ago

As a note, there's also specific documentation regarding this: https://docs.pgedge.com/spock_ext/conflicts

And, one of our solutions engineers (Paul Rothrock) has a video released a month ago on this topic as well: https://www.youtube.com/watch?v=prkMkG0SOJE

Sharing these alongside my other comment in case additional information is helpful :-)

bonesmoses•4mo ago

In Postgres, updates contain the entire row, including all column values. Since the Spock extension follows the "Last Write Wins" model by default, one row version will win, while the other is essentially discarded. This is assuming the update happened on each node _before_ the new value was synchronized over, or essentially simultaneously.

You can address this partially using a CRDT such as the Delta Apply functionality for certain columns:

https://docs.pgedge.com/spock_ext/conflicts

That will only work with numeric-type (INT, BIGINT, NUMERIC, etc.) columns, but effectively merges data so updates work cumulatively.

imglorp•4mo ago

Third party, multi master postgres is such an old idea, it was done in Perl...

https://github.com/bucardo/bucardo

pgedge_postgres•4mo ago

We're not claiming to be a new idea, by any means :-)

Unfortunately, Bucardo is no longer being updated.

Our goal is simply to support continued innovation of distributed PostgreSQL along with similar tools for enabling high availability / scalability in PG deployments.

vyruss•4mo ago

True, but Bucardo is trigger-based and does not use WAL-based logical replication, and is unmaintained. There is also a world of difference in performance between them.

philipallstar•4mo ago

I don't see why this matters. Ideas are easy; execution and adoption are hard. Clearly Bucado didn't take off well enough that this is a solved problem.

throwawaygo•4mo ago

pgactive?

vyruss•4mo ago

pgactive has limitations with not supporting DDL, sequence management, column and row filtering, conflict and exception handling, incompatibility with native logical replication, etc. The license is also different (Apache 2.0 for pgactive vs PostgreSQL for Spock). Most importantly, it's not "supported anywhere" by AWS, just on RDS.

pisikesipelgas•4mo ago

Hi, How do You guys resolve the application database DDL issue when multimaster is in use? One node gets updated, DDL is will be replicated (?) to second node, which is used by not-jet-updated application which is not compatible with updated database structure. This problem has bugged me for a while. And second and similar issue with most replication setups is let's take for postgis for example. Again in one node this extension gets updated. Now what? Data will be replicated to node which is not jet updated and cause whole system to be not functional.

baq•4mo ago

It’s an engineering problem: you have to design the system so that it remains functional in this exact scenario - it follows that the system isn’t just code and build artifacts, but also its deployment processes.

pisikesipelgas•4mo ago

Hi, Thanks for the reply. This is what i figured too. So there is essentially no way to achieve this without service downtime when using application which is not written to handle those kind of situations (eg. 3rd party things).

baq•4mo ago

Again an engineering problem. You can deploy with zero downtime and people have been doing this for decades. It takes infrastructure like load balancers, ability to run versions in parallel and runtime support for feature flags, but it’s absolutely doable and ultimately just another day in the office for anyone with global operations. A lot of 3rd party tools actually support these workflows for this exact reason.

qaq•4mo ago

You gonna pause writes for cutover so while not downtime, specifically for postgres load balancers, ability to run versions in parallel not gonna help you there.

foreigner•4mo ago

What are the pros and cons of this compared to CockroachDB?

snthpy•4mo ago

Not the OP nor knowledgeable in this area but I would suspect / hope postgres compatibility as a start. The last time I looked into whether I could use cockroachdb as a backend for my Airflow cluster, it wasn't possible due to compatibility issues.

pgedge_postgres•4mo ago

You're actually 100% correct! CockroachDB is only 57.25% compatible with standard PostgreSQL (according to https://pgscorecard.com, which details the way it comes up with these numbers) whereas we are 100% compatible (and 100% open-source, whereas they are source-available).

znpy•4mo ago

License. CockroachDB moved to a license that I can’t even remember if it’s source-available anymore.

pgedge_postgres•4mo ago

It technically is source-available, as of Nov 2024, anyway: https://news.itsfoss.com/cockcroachdb-no-open-source/

So yes, license (and compatibility - see https://pgscorecard.com) are two major differences between pgEdge and CockroachDB.

pgEdge version updates also come in very close alignment with upstream PostgreSQL intentionally to make sure security patches/bugfixes and the latest features get to users ASAP.

traceroute66•4mo ago

> compared to CockroachDB

CockroachDB != PostgreSQL.

I take great issue with the way CockroachDB marketing seeks to imply compatability, when infact what they are promising is wire protocol compatability (i.e. you can fire up your copy of psql on the CLI and it will connect).

Last time I looked, a great number of primitive, obvious, fundamental, low-hanging fruit were completely absent from CockroachDB, e.g. (IIRC) stored procedures are nowhere to be seen in CockroachDB.

bonesmoses•4mo ago

You can't even run pgbench unaltered on CockroachDB, as simple table structures and indexes are fundamentally different there. It is in no way a compatible product, and never has been.

jwr•4mo ago

Bear in mind this no longer provides the same consistency model as PostgreSQL does. It's not a straightforward extension of the nice serializable world. That might not be what you expect given the name, this does not provide a strict serializable consistency model.

See https://jepsen.io/consistency/models for a classification of consistency models.

DoNotNotify is now Open Source

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

Matchlock: Linux-based sandboxing for AI agents

Haskell for all: Beyond agentic coding

SectorC: A C Compiler in 512 bytes (2023)

LLMs as the new high level language

The Architecture of Open Source Applications (Volume 1) Berkeley DB

Software factories and the agentic moment

Moroccan sardine prices to stabilise via new measures: officials

Speed up responses with fast mode

LineageOS 23.2

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

Modern and Antique Technologies Reveal a Dynamic Cosmos

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Vocal Guide – belt sing without killing yourself

Wood Gas Vehicles: Firewood in the Fuel Tank (2010)

uLauncher

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

First Proof

Substack confirms data breach affects users’ email addresses and phone numbers

Start all of your commands with a comma (2009)

In the Australian outback, we're listening for nuclear tests

Al Lowe on model trains, funny deaths and working with Disney

LLMs as Language Compilers: Lessons from Fortran for the Future of Coding

The AI boom is causing shortages everywhere else

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Show HN: A luma dependent chroma compression algorithm (image compression)

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

DoNotNotify is now Open Source

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

Matchlock: Linux-based sandboxing for AI agents

Haskell for all: Beyond agentic coding

SectorC: A C Compiler in 512 bytes (2023)

LLMs as the new high level language

The Architecture of Open Source Applications (Volume 1) Berkeley DB

Software factories and the agentic moment

Moroccan sardine prices to stabilise via new measures: officials

Speed up responses with fast mode

LineageOS 23.2

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

Modern and Antique Technologies Reveal a Dynamic Cosmos

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Vocal Guide – belt sing without killing yourself

Wood Gas Vehicles: Firewood in the Fuel Tank (2010)

uLauncher

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

First Proof

Substack confirms data breach affects users’ email addresses and phone numbers

Start all of your commands with a comma (2009)

In the Australian outback, we're listening for nuclear tests

Al Lowe on model trains, funny deaths and working with Disney

LLMs as Language Compilers: Lessons from Fortran for the Future of Coding

The AI boom is causing shortages everywhere else

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Show HN: A luma dependent chroma compression algorithm (image compression)

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Show HN: Open source, logical multi-master PostgreSQL replication

Comments