
A race condition in Aurora RDS

https://hightouch.com/blog/uncovering-a-race-condition-in-aurora-rds
97•theanomaly•1h ago

Comments

redwood•1h ago
A good reminder of how the mental model of adding read replicas as a way to scale is a slippery slope. At the end of the day you're scaling only one specific part of your system, with consistency dynamics that are difficult to reason about.
terminalshort•1h ago
Works fine for workloads like:

1. I need to grab some rows from a table

2. Eventual consistency is good enough

And that's a lot of workloads.

candiddevmike•1h ago
As a user, I've come to realize that the situations where I think eventual consistency (or delayed processing) is good enough aren't the same ones the folks developing most products have in mind. Nothing annoys me more than stuff not showing up immediately or having to manually refresh.
darth_avocado•20m ago
Sometimes users want everything to show up immediately, but don't want to pay extra for the feature. Making everything real time is expensive. Eventual consistency is a good thing for most systems.
terminalshort•9m ago
For a workload where you need true read-after-write consistency you can just send those reads to the writer. But even if you don't, there are plenty of workarounds here. You can send a success response to the user when the transaction commits on the writer and update the UI on that response. The only case where this fails is if the user manually reloads the page within the replication-lag window and the request goes to a reader. This should be exceedingly rare in a single-region cluster, and maybe a little less rare in a multi-region setup, but still pretty rare. I almost never see >1s replication lag between regions in my Aurora clusters. There are certainly DB workloads where this won't be true, but if you are running a high-replication-lag cluster, you don't want to use it for this type of UI dependency in the first place.
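
A minimal sketch of that routing, assuming a Python app using psycopg2; the endpoint names, table, and the needs_read_after_write flag are illustrative placeholders rather than anything from the comment:

```python
import psycopg2

# Hypothetical Aurora endpoints; the real names come from your cluster config.
WRITER_DSN = "host=<cluster-writer-endpoint> dbname=app user=app"
READER_DSN = "host=<cluster-reader-endpoint> dbname=app user=app"

def get_connection(needs_read_after_write: bool):
    """Route reads that must see the caller's own writes to the writer;
    everything else can tolerate replication lag and goes to the readers."""
    dsn = WRITER_DSN if needs_read_after_write else READER_DSN
    return psycopg2.connect(dsn)

def create_order(item_id: int) -> int:
    # The write always goes to the writer; return the new id once the commit
    # succeeds and let the UI update from this response instead of re-reading.
    with get_connection(needs_read_after_write=True) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (item_id) VALUES (%s) RETURNING id",
                (item_id,),
            )
            return cur.fetchone()[0]

def list_orders():
    # Browsing can tolerate eventual consistency, so use the reader endpoint.
    with get_connection(needs_read_after_write=False) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT id, item_id FROM orders ORDER BY id DESC LIMIT 50")
            return cur.fetchall()
```

The commit-then-update-the-UI flow in the comment corresponds to rendering from create_order's return value instead of issuing a fresh read.
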
redwood•54m ago
The future you, or a future team member, may struggle to reason about that.
morshu9001•7m ago
That's readonly. RW workloads usually don't tolerate eventual consistency on the thing they're writing.
nijave•42m ago
You can hit the same problems horizontally scaling compute: one instance reads from the DB, a request hits a different instance that updates the DB, and then the original instance writes back and overwrites the change, or makes decisions based on stale data.

More broadly, it's a distributed-systems problem.
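
A compressed illustration of that interleaving, plus one common guard (optimistic locking with a version column); the schema and psycopg2 usage are illustrative assumptions, not something from the comment:

```python
import psycopg2

# Two app instances, A and B, each read the same row and then write it back.
# Without a guard, whichever UPDATE runs last silently overwrites the other.
# conn = psycopg2.connect("host=<endpoint> dbname=app user=app")

def read_profile(conn, user_id):
    with conn.cursor() as cur:
        cur.execute("SELECT bio, version FROM profiles WHERE id = %s", (user_id,))
        return cur.fetchone()  # (bio, version) as seen at read time

def write_profile(conn, user_id, new_bio, version_seen):
    """Optimistic-locking variant: only apply the write if the row is still at
    the version this instance read; otherwise some other instance got there first."""
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE profiles
               SET bio = %s, version = version + 1
             WHERE id = %s AND version = %s
            """,
            (new_bio, user_id, version_seen),
        )
        if cur.rowcount == 0:
            # Stale read detected: re-read and retry (or surface a conflict)
            # instead of clobbering the other instance's change.
            raise RuntimeError("concurrent update detected, retry")
    conn.commit()
```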

gtowey•1h ago
This article seems to indicate that manually triggered failovers will always fail if your application tries to maintain its normal write traffic during that process.

Not that I'm discounting the author's experience, but something doesn't quite add up:

- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

- If they know, how is this not an urgent P0 issue for AWS? It seems like the most basic of usability features is 100% broken.

- Is there something more nuanced to the failure case here, such as whether it depends on in-progress transactions? I can see how the failover might wait for in-flight transactions to close and then hit a timeout, at which point it proceeds with the other part of the failover by accident. That could explain why the issue doesn't seem more widespread.

maherbeg•1h ago
Yeah, I agree, this seems like a pretty critical feature of the Aurora product itself. We saw similar behavior recently with a connection pooler in between, which suggests something is wrong with how they propagate DNS changes during the failover. wtf aws
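
One cheap sanity check for that class of problem, assuming PostgreSQL: pg_is_in_recovery() reports whether the node a connection actually landed on is a replica, which can catch a pool or cached DNS entry still pointing at the old writer. A sketch with placeholder connection details:

```python
import psycopg2

def connected_to_writer(dsn: str) -> bool:
    """True if this DSN currently resolves to a node that accepts writes.
    pg_is_in_recovery() is true on a replica/reader, false on the writer."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            (in_recovery,) = cur.fetchone()
            return not in_recovery

# Example: after triggering a failover, poll this before re-enabling write
# traffic (the DSN is a placeholder for your cluster endpoint).
# connected_to_writer("host=<cluster-endpoint> dbname=app user=app")
```
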
CaptainKanuk•27m ago
Whenever we have to do any type of AWS Aurora or RDS cluster modification in prod we always have the entire emergency response crew standing by right outside the door.

Their docs are not good and things frequently don't behave how you expect them to.

dboreham•55m ago
Although the article has an SEO-optimized vibe, I think it's reasonable to take it as true until refuted. My rule of thumb is that any rarely executed, very tricky operation (e.g. database writer failover) is likely not to work, because there are too many variables in play and way too few opportunities to find and fix bugs. So the overall story sounds very plausible to me. It has a feel of: it doesn't work under continuous heavy write load, in combination with some set of hardware performance parameters that plays badly with some arbitrary timeout. Note that the system didn't actually fail. It just didn't process the failover operation; it reverted to the original configuration and afaics preserved data.
theanomaly•50m ago
I'm surprised this hasn't come up more often too. When we worked with AWS on this, they confirmed there was nothing unique about our traffic pattern that would trigger this issue. We also didn't run into this race condition in any of our other regions running similar workloads. What's particularly concerning is that this seems to be a fundamental flaw in Aurora's failover mechanism that could theoretically affect anyone doing manual failover.
twisteriffic•48m ago
> How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"

nijave•31m ago
fwiw we haven't seen issues doing manual failovers for maintenance using the same/similar procedure described in the article. I imagine there is something more nuanced here, and it's hard to draw too many conclusions without a lot more detail being provided by AWS.
aetherson•29m ago
My experience with AWS is that they are extremely, extremely parsimonious about any information they give out. It is near-impossible to get them to give you any details about what is happening beyond the level of their API. So my gut hunch is that they think that there's something very rare about this happening, but they refuse to give the article writer the information that might or might not help them avoid the bug.
Hovertruck•23m ago
Agreed, we've been running multiple aurora clusters in production for years now and have not encountered this issue with failovers.
kobalsky•5m ago
> - How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

I know that there is no comparison in the user base, but a few years ago I ran into a massive Python + MySQL bug that:

1. made SELECT ... FOR UPDATE fail silently

2. aborted the transaction and set the connection into autocommit mode

This is basically a worst-case scenario in a transactional system.

I was basically screaming like a mad man in the corner but no one seemed to care.

Someone contacted me months later telling me that they experienced the same problem with "interesting" consequences in their system.

The bug was eventually fixed but at that point I wasn't tracking it anymore, I provided a patch when I created the issue and moved on.

https://stackoverflow.com/questions/945482/why-doesnt-anyone...
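
To see why that combination is a worst case, here is the kind of worker pattern such a bug silently breaks; a minimal sketch assuming mysql.connector and an illustrative jobs table, not the code from the original report:

```python
import mysql.connector

# conn = mysql.connector.connect(host="<host>", user="app", database="app")

def claim_next_job(conn):
    """Classic worker pattern: lock a row so only one worker processes it.
    If SELECT ... FOR UPDATE silently fails and the connection drops into
    autocommit, two workers can both 'claim' the same job and nothing ever
    raises an error -- the guarantee just quietly disappears."""
    conn.start_transaction()
    cur = conn.cursor()
    # Intended behavior: this row lock blocks other workers until we commit.
    cur.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1 FOR UPDATE"
    )
    row = cur.fetchone()
    if row is None:
        conn.rollback()
        return None
    (job_id,) = row
    cur.execute("UPDATE jobs SET status = 'running' WHERE id = %s", (job_id,))
    conn.commit()
    return job_id
```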

jansommer•1h ago
People who have experience with Aurora and RDS Postgres: what's your experience in terms of performance? If you don't need multi-AZ and quick failover, can you achieve better performance with RDS and e.g. gp3 at 64,000 IOPS and 3,125 throughput (assuming everything else can deliver that and CPU/memory isn't the bottleneck)? Aurora seems to be especially slow for inserts and also quite expensive compared to what I get with RDS when I estimate things in the calculator. And what's the story on read performance for Aurora vs RDS? There's an abundance of benchmarks showing Aurora is better in terms of performance, but they leave out so much about their RDS config that I have a hard time believing them.
shawabawa3•26m ago
> 3125 throughput

Max throughput on gp3 was recently increased to 2GB/s, is there some way I don't know about of getting 3.125?

jansommer•16m ago
This is super confusing. Check out the RDS Postgres calculator with gp3:

> General Purpose SSD (gp3) - Throughput
> gp3 supports a max of 4000 MiBps per volume

But the docs say 2000. Then there's IOPS... The calculator allows up to 64,000, but on [0], if you expand "Higher performance and throughput" it says

> Customers looking for higher performance can scale up to 80,000 IOPS and 2,000 MiBps for an additional fee.

[0] https://aws.amazon.com/ebs/general-purpose/

nijave•10m ago
RDS PG stripes multiple gp3 volumes, so that's why RDS throughput is higher than a single gp3 volume's.

I think 80k IOPs on gp3 is a newer release so presumably AWS hasn't updated RDS from the old max of 64k. iirc it took a while before gp3 and io2 were even available for RDS after they were released as EBS options

Edit: Presumably it takes some time to do testing/optimizations to make sure their RDS config can achieve the same performance as EBS. Sometimes there are limitations with instance generations/types that also impact whether you can hit maximum advertised throughput

nijave•16m ago
RDS stripes multiple gp3 volumes. The docs say 4 GiB/s per instance is the max for gp3, if I'm looking at the right table.
nijave•19m ago
We've seen better results and lower costs in a 1 writer, 1-2 reader setup on Aurora PG 14. The main advantages are 1) you don't re-pay for storage for each instance -- you pay for cluster storage instead of per-instance storage -- and 2) you no longer need to provision IOPS, and it provides ~80k IOPS.

If you have a PG cluster with 1 writer, 2 readers, 10 TiB of storage, and 16k provisioned IOPS (io1/2 has better latency than gp3), you pay for 30 TiB and 48k PIOPS without redundancy, or 60 TiB and 96k PIOPS with multi-AZ.

With the same Aurora setup you pay for 10 TiB and get multi-AZ essentially for free (assuming the same cluster layout and that you've put the instances in different AZs).

I don't want to figure the exact numbers but iirc if you have enough storage--especially io1/2--you can end up saving money and getting better performance. For smaller amounts of storage, the numbers don't necessarily work out.
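
Spelling out the storage/IOPS multiplication from the comparison above (instance counts and sizes are the ones from the comment; per-GiB prices are left out since they vary by region and over time):

```python
# Classic RDS PG: each instance carries its own copy of storage and its own
# provisioned IOPS; multi-AZ adds a standby (with its own storage) per instance.
instances = 3              # 1 writer + 2 readers
storage_tib = 10
piops = 16_000

rds_storage = instances * storage_tib        # 30 TiB billed
rds_piops = instances * piops                # 48,000 PIOPS billed
rds_storage_ha = rds_storage * 2             # 60 TiB with multi-AZ standbys
rds_piops_ha = rds_piops * 2                 # 96,000 PIOPS with multi-AZ

# Aurora: storage is a cluster-level pool paid for once, no matter how many
# instances attach, and IOPS aren't provisioned separately.
aurora_storage = storage_tib                 # 10 TiB billed

print(rds_storage, rds_piops, rds_storage_ha, rds_piops_ha, aurora_storage)
```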

There's also 2 IO billing modes to be aware of. There's the default pay-per-IO which is really only helpful for extreme spikes and generally low IO usage. The other mode is "provisioned" or "storage optimized" or something where you pay a flat 30% of the instance cost (in addition to the instance cost) for unlimited IO--you can get a lot more IO and end up cheaper in this mode if you had an IO heavy workload before

I'd also say Serverless is almost never worth it. Iirc provisioned instances were ~17% of the cost of serverless. Serverless only works out if you have less than ~4 hours of heavy usage followed by almost complete idle. You can add instances fairly quickly and fail over with minimal downtime (barring running into the bug the article describes, of course) to handle workload spikes using fixed instance sizes without serverless.

jansommer•3m ago
Have you benchmarked your load on RDS? [0] says that IOPS on Aurora are vastly different from actual IOPS. We have just one writer instance and mostly write hundreds of GB in bulk.

[0] https://dev.to/aws-heroes/100k-write-iops-in-aurora-t3medium...

grhmc•58m ago
Yikes! This is exactly the kind of invariant I'd expect Aurora to maintain on my behalf. It is why I pay them so much...
bob1029•52m ago
> Aurora's architecture differs from traditional PostgreSQL in a crucial way: it separates compute from storage.

I find this approach very compelling. MSSQL has a similar thing with their hyperscale offering. It's probably the only service in Azure that I would actually use.

robinduckett•51m ago
Glad to know I’m not crazy.
d1egoaz•16m ago
> AWS has indicated a fix is on their roadmap, but as of now, the recommended mitigation aligns with our solution: use Aurora’s Failover feature on an as-needed basis and ensure that no writes are executed against the DB during the failover.

Is there a case number where we can reach out to AWS regarding this recommendation?

time0ut•10m ago
Wow. This is alarming.

We have done a similar operation routinely on databases under pretty write-intensive workloads (tens of thousands of inserts per second). It is so routine that we have automation to adjust to planned changes in volume, and we do so a dozen times a month or so. It has been very robust for us. Our apps are designed for it and use AWS’s JDBC wrapper.

Just one more thing to worry about I guess…
