
A race condition in Aurora RDS

https://hightouch.com/blog/uncovering-a-race-condition-in-aurora-rds
97•theanomaly•1h ago

Comments

redwood•1h ago
A good reminder of how the mental model of adding read replicas as a way to scale is a slippery slope. At the end of the day you're scaling only one specific part of your system, with consistency dynamics that are difficult to reason about.
terminalshort•1h ago
Works fine for workloads like:

1. I need to grab some rows from a table

2. Eventual consistency is good enough

And that's a lot of workloads.

candiddevmike•1h ago
As a user, I've come to realize that the situations where I think eventual consistency (or delayed processing) is good enough aren't the same ones the folks developing most products have in mind. Nothing annoys me more than stuff not showing up immediately or having to manually refresh.
darth_avocado•20m ago
Sometimes users want everything to show up immediately, but don't want to pay extra for the feature. Making everything real time is expensive. Eventual consistency is a good thing for most systems.
terminalshort•9m ago
For a workload where you need true read-after-write consistency you can just send those reads to the writer. But even if you don't, there are plenty of workarounds here. You can send a success response to the user when the transaction commits on the writer and update the UI on that response. The only case where this fails is if the user manually reloads the page within the replication-lag window and the request goes to a reader. This should be exceedingly rare in a single-region cluster, and maybe a little less rare in a multi-region setup, but still pretty rare. I almost never see >1s replication lag between regions in my Aurora clusters. There are certainly DB workloads where this won't be true, but if you are running a high-replication-lag cluster, you don't want to use it for this type of UI dependency in the first place.
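
A minimal sketch of that routing, assuming a Python app using psycopg2; the endpoint names, table, and the needs_read_after_write flag are illustrative placeholders rather than anything from the comment:

```python
import psycopg2

# Hypothetical Aurora endpoints; the real names come from your cluster config.
WRITER_DSN = "host=<cluster-writer-endpoint> dbname=app user=app"
READER_DSN = "host=<cluster-reader-endpoint> dbname=app user=app"

def get_connection(needs_read_after_write: bool):
    """Route reads that must see the caller's own writes to the writer;
    everything else can tolerate replication lag and goes to the readers."""
    dsn = WRITER_DSN if needs_read_after_write else READER_DSN
    return psycopg2.connect(dsn)

def create_order(item_id: int) -> int:
    # The write always goes to the writer; return the new id once the commit
    # succeeds and let the UI update from this response instead of re-reading.
    with get_connection(needs_read_after_write=True) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (item_id) VALUES (%s) RETURNING id",
                (item_id,),
            )
            return cur.fetchone()[0]

def list_orders():
    # Browsing can tolerate eventual consistency, so use the reader endpoint.
    with get_connection(needs_read_after_write=False) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT id, item_id FROM orders ORDER BY id DESC LIMIT 50")
            return cur.fetchall()
```

The commit-then-update-the-UI flow in the comment corresponds to rendering from create_order's return value instead of issuing a fresh read.
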
redwood•54m ago
The future you, or a future team member, may struggle to reason about that.
morshu9001•7m ago
That's readonly. RW workloads usually don't tolerate eventual consistency on the thing they're writing.
nijave•42m ago
You can hit the same problems horizontally scaling compute: one instance reads from the DB, a request hits a different instance that updates the DB, and then the original instance writes back and overwrites the change, or makes decisions based on stale data.

More broadly, it's a distributed-systems problem.
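
A compressed illustration of that interleaving, plus one common guard (optimistic locking with a version column); the schema and psycopg2 usage are illustrative assumptions, not something from the comment:

```python
import psycopg2

# Two app instances, A and B, each read the same row and then write it back.
# Without a guard, whichever UPDATE runs last silently overwrites the other.
# conn = psycopg2.connect("host=<endpoint> dbname=app user=app")

def read_profile(conn, user_id):
    with conn.cursor() as cur:
        cur.execute("SELECT bio, version FROM profiles WHERE id = %s", (user_id,))
        return cur.fetchone()  # (bio, version) as seen at read time

def write_profile(conn, user_id, new_bio, version_seen):
    """Optimistic-locking variant: only apply the write if the row is still at
    the version this instance read; otherwise some other instance got there first."""
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE profiles
               SET bio = %s, version = version + 1
             WHERE id = %s AND version = %s
            """,
            (new_bio, user_id, version_seen),
        )
        if cur.rowcount == 0:
            # Stale read detected: re-read and retry (or surface a conflict)
            # instead of clobbering the other instance's change.
            raise RuntimeError("concurrent update detected, retry")
    conn.commit()
```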

gtowey•1h ago
This article seems to indicate that manually triggered failovers will always fail if your application tries to maintain its normal write traffic during that process.

Not that I'm discounting the author's experience, but something doesn't quite add up:

- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

- If they know, how is this not an urgent P0 issue for AWS? It seems like the most basic of usability features is 100% broken.

- Is there something more nuanced to the failure case here, such as whether it depends on in-progress transactions? I can see how the failover might wait for in-flight transactions to close and then hit a timeout, at which point it proceeds with the other part of the failover by accident. That could explain why the issue doesn't seem more widespread.

maherbeg•1h ago
Yeah, I agree, this seems like a pretty critical feature of the Aurora product itself. We saw similar behavior recently with a connection pooler in between, which suggests something is wrong with how they propagate DNS changes during the failover. wtf aws
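
One cheap sanity check for that class of problem, assuming PostgreSQL: pg_is_in_recovery() reports whether the node a connection actually landed on is a replica, which can catch a pool or cached DNS entry still pointing at the old writer. A sketch with placeholder connection details:

```python
import psycopg2

def connected_to_writer(dsn: str) -> bool:
    """True if this DSN currently resolves to a node that accepts writes.
    pg_is_in_recovery() is true on a replica/reader, false on the writer."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            (in_recovery,) = cur.fetchone()
            return not in_recovery

# Example: after triggering a failover, poll this before re-enabling write
# traffic (the DSN is a placeholder for your cluster endpoint).
# connected_to_writer("host=<cluster-endpoint> dbname=app user=app")
```
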
CaptainKanuk•27m ago
Whenever we have to do any type of AWS Aurora or RDS cluster modification in prod we always have the entire emergency response crew standing by right outside the door.

Their docs are not good and things frequently don't behave how you expect them to.

dboreham•55m ago
Although the article has an SEO-optimized vibe, I think it's reasonable to take it as true until refuted. My rule of thumb is that any rarely executed, very tricky operation (e.g. database writer failover) is likely not to work, because there are too many variables in play and way too few opportunities to find and fix bugs. So the overall story sounds very plausible to me. It has a feel of: it doesn't work under continuous heavy write load, in combination with some set of hardware performance parameters that plays badly with some arbitrary timeout. Note that the system didn't actually fail. It just didn't process the failover operation; it reverted to the original configuration and afaics preserved data.
theanomaly•50m ago
I'm surprised this hasn't come up more often too. When we worked with AWS on this, they confirmed there was nothing unique about our traffic pattern that would trigger this issue. We also didn't run into this race condition in any of our other regions running similar workloads. What's particularly concerning is that this seems to be a fundamental flaw in Aurora's failover mechanism that could theoretically affect anyone doing manual failover.
twisteriffic•48m ago
> How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"

nijave•31m ago
fwiw we haven't seen issues doing manual failovers for maintenance using the same/similar procedure described in the article. I imagine there is something more nuanced here, and it's hard to draw too many conclusions without a lot more detail being provided by AWS.
aetherson•29m ago
My experience with AWS is that they are extremely, extremely parsimonious about any information they give out. It is near-impossible to get them to give you any details about what is happening beyond the level of their API. So my gut hunch is that they think that there's something very rare about this happening, but they refuse to give the article writer the information that might or might not help them avoid the bug.
Hovertruck•23m ago
Agreed, we've been running multiple aurora clusters in production for years now and have not encountered this issue with failovers.
kobalsky•5m ago
> - How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

I know that there is no comparison in the user base, but a few years ago I ran into a massive Python + MySQL bug that:

1. made SELECT ... FOR UPDATE fail silently

2. aborted the transaction and set the connection into autocommit mode

This is basically a worst-case scenario in a transactional system.

I was basically screaming like a mad man in the corner but no one seemed to care.

Someone contacted me months later telling me that they experienced the same problem with "interesting" consequences in their system.

The bug was eventually fixed but at that point I wasn't tracking it anymore, I provided a patch when I created the issue and moved on.

https://stackoverflow.com/questions/945482/why-doesnt-anyone...
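
To see why that combination is a worst case, here is the kind of worker pattern such a bug silently breaks; a minimal sketch assuming mysql.connector and an illustrative jobs table, not the code from the original report:

```python
import mysql.connector

# conn = mysql.connector.connect(host="<host>", user="app", database="app")

def claim_next_job(conn):
    """Classic worker pattern: lock a row so only one worker processes it.
    If SELECT ... FOR UPDATE silently fails and the connection drops into
    autocommit, two workers can both 'claim' the same job and nothing ever
    raises an error -- the guarantee just quietly disappears."""
    conn.start_transaction()
    cur = conn.cursor()
    # Intended behavior: this row lock blocks other workers until we commit.
    cur.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1 FOR UPDATE"
    )
    row = cur.fetchone()
    if row is None:
        conn.rollback()
        return None
    (job_id,) = row
    cur.execute("UPDATE jobs SET status = 'running' WHERE id = %s", (job_id,))
    conn.commit()
    return job_id
```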

jansommer•1h ago
People who have experience with Aurora and RDS Postgres: what's your experience in terms of performance? If you don't need multi-AZ and quick failover, can you achieve better performance with RDS and e.g. gp3 at 64,000 IOPS and 3,125 throughput (assuming everything else can deliver that and CPU/memory isn't the bottleneck)? Aurora seems to be especially slow for inserts and also quite expensive compared to what I get with RDS when I estimate things in the calculator. And what's the story on read performance for Aurora vs RDS? There's an abundance of benchmarks showing Aurora is better in terms of performance, but they leave out so much about their RDS config that I have a hard time believing them.
shawabawa3•26m ago
> 3125 throughput

Max throughput on gp3 was recently increased to 2GB/s, is there some way I don't know about of getting 3.125?

jansommer•16m ago
This is super confusing. Check out the RDS Postgres calculator with gp3:

> General Purpose SSD (gp3) - Throughput
> gp3 supports a max of 4000 MiBps per volume

But the docs say 2000. Then there's IOPS... The calculator allows up to 64,000, but on [0], if you expand "Higher performance and throughput" it says

> Customers looking for higher performance can scale up to 80,000 IOPS and 2,000 MiBps for an additional fee.

[0] https://aws.amazon.com/ebs/general-purpose/

nijave•10m ago
RDS PG stripes multiple gp3 volumes, so that's why RDS throughput is higher than a single gp3 volume's.

I think 80k IOPs on gp3 is a newer release so presumably AWS hasn't updated RDS from the old max of 64k. iirc it took a while before gp3 and io2 were even available for RDS after they were released as EBS options

Edit: Presumably it takes some time to do testing/optimizations to make sure their RDS config can achieve the same performance as EBS. Sometimes there are limitations with instance generations/types that also impact whether you can hit maximum advertised throughput

nijave•16m ago
RDS stripes multiple gp3 volumes. The docs say 4 GiB/s per instance is the max for gp3, if I'm looking at the right table.
nijave•19m ago
We've seen better results and lower costs in a 1 writer, 1-2 reader setup on Aurora PG 14. The main advantages are 1) you don't re-pay for storage for each instance -- you pay for cluster storage instead of per-instance storage -- and 2) you no longer need to provision IOPS, and it provides ~80k IOPS.

If you have a PG cluster with 1 writer, 2 readers, 10 TiB of storage, and 16k provisioned IOPS (io1/2 has better latency than gp3), you pay for 30 TiB and 48k PIOPS without redundancy, or 60 TiB and 96k PIOPS with multi-AZ.

With the same Aurora setup you pay for 10 TiB and get multi-AZ essentially for free (assuming the same cluster layout and that you've put the instances in different AZs).

I don't want to figure the exact numbers but iirc if you have enough storage--especially io1/2--you can end up saving money and getting better performance. For smaller amounts of storage, the numbers don't necessarily work out.
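
Spelling out the storage/IOPS multiplication from the comparison above (instance counts and sizes are the ones from the comment; per-GiB prices are left out since they vary by region and over time):

```python
# Classic RDS PG: each instance carries its own copy of storage and its own
# provisioned IOPS; multi-AZ adds a standby (with its own storage) per instance.
instances = 3              # 1 writer + 2 readers
storage_tib = 10
piops = 16_000

rds_storage = instances * storage_tib        # 30 TiB billed
rds_piops = instances * piops                # 48,000 PIOPS billed
rds_storage_ha = rds_storage * 2             # 60 TiB with multi-AZ standbys
rds_piops_ha = rds_piops * 2                 # 96,000 PIOPS with multi-AZ

# Aurora: storage is a cluster-level pool paid for once, no matter how many
# instances attach, and IOPS aren't provisioned separately.
aurora_storage = storage_tib                 # 10 TiB billed

print(rds_storage, rds_piops, rds_storage_ha, rds_piops_ha, aurora_storage)
```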

There's also 2 IO billing modes to be aware of. There's the default pay-per-IO which is really only helpful for extreme spikes and generally low IO usage. The other mode is "provisioned" or "storage optimized" or something where you pay a flat 30% of the instance cost (in addition to the instance cost) for unlimited IO--you can get a lot more IO and end up cheaper in this mode if you had an IO heavy workload before

I'd also say Serverless is almost never worth it. Iirc provisioned instances were ~17% of the cost of serverless. Serverless only works out if you have less than ~4 hours of heavy usage followed by almost complete idle. You can add instances fairly quickly and fail over with minimal downtime (barring running into the bug the article describes, of course) to handle workload spikes using fixed instance sizes without serverless.

jansommer•3m ago
Have you benchmarked your load on RDS? [0] says that IOPS on Aurora are vastly different from actual IOPS. We have just one writer instance and mostly write hundreds of GB in bulk.

[0] https://dev.to/aws-heroes/100k-write-iops-in-aurora-t3medium...

grhmc•58m ago
Yikes! This is exactly the kind of invariant I'd expect Aurora to maintain on my behalf. It is why I pay them so much...
bob1029•52m ago
> Aurora's architecture differs from traditional PostgreSQL in a crucial way: it separates compute from storage.

I find this approach very compelling. MSSQL has a similar thing with their hyperscale offering. It's probably the only service in Azure that I would actually use.

robinduckett•51m ago
Glad to know I’m not crazy.
d1egoaz•16m ago
> AWS has indicated a fix is on their roadmap, but as of now, the recommended mitigation aligns with our solution: use Aurora’s Failover feature on an as-needed basis and ensure that no writes are executed against the DB during the failover.

Is there a case number where we can reach out to AWS regarding this recommendation?

time0ut•10m ago
Wow. This is alarming.

We have done a similar operation routinely on databases under pretty write-intensive workloads (tens of thousands of inserts per second). It is so routine that we have automation to adjust to planned changes in volume, and we do so a dozen times a month or so. It has been very robust for us. Our apps are designed for it and use AWS’s JDBC wrapper.

Just one more thing to worry about I guess…
