Not that I'm discounting the author's experience, but something doesn't quite add up:
- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
- If they do know, how is this not an urgent P0 issue for AWS? It seems like one of the most basic usability features is completely broken.
- Is there something more nuanced to the failure case here, such as a dependence on in-progress transactions? I can see how the failover might wait for in-flight transactions to close, hit a timeout, and then accidentally proceed with the other part of the failover anyway. That could explain why the issue doesn't seem more widespread.
Their docs are not good and things frequently don't behave how you expect them to.
If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"
I know that there is no comparison in the user base, but a few years ago I ran into a massive Python + MySQL bug that:
1. made SELECT ... FOR UPDATE fail silently
2. aborted the transaction and set the connection into autocommit mode
This is basically a worst-case scenario in a transactional system.
I was basically screaming like a mad man in the corner but no one seemed to care.
Someone contacted me months later telling me that they experienced the same problem with "interesting" consequences in their system.
The bug was eventually fixed, but by that point I wasn't tracking it anymore; I provided a patch when I created the issue and moved on.
https://stackoverflow.com/questions/945482/why-doesnt-anyone...
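The failure mode described above can be guarded against defensively: check that the connection is still inside a transaction before issuing writes. A minimal sketch, using sqlite3 purely because it's self-contained (it has no SELECT ... FOR UPDATE, and this is not the original MySQL reproduction — the point is the guard, which would have turned the silent fallback into a loud failure):

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we can
# manage transactions explicitly, mirroring a driver-level workflow.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")

conn.execute("BEGIN")
row = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()

# The guard: if the driver had silently dropped us into autocommit
# (as in the Python + MySQL bug above), stop now rather than issue
# writes that commit one statement at a time.
assert conn.in_transaction, "connection silently left its transaction!"

conn.execute("UPDATE accounts SET balance = ? WHERE id = 1", (row[0] - 10,))
conn.execute("COMMIT")
```

With the buggy driver, the assertion would fire immediately after the failed locking read instead of letting later writes commit individually.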
Max throughput on gp3 was recently increased to 2 GB/s; is there some way I don't know about to get 3.125?
> General Purpose SSD (gp3) - Throughput
> gp3 supports a max of 4000 MiBps per volume
But the docs say 2000. Then there's IOPS... The calculator allows up to 64,000, but on [0], if you expand "Higher performance and throughput" it says
> Customers looking for higher performance can scale up to 80,000 IOPS and 2,000 MiBps for an additional fee.
I think 80k IOPS on gp3 is a newer release, so presumably AWS hasn't updated RDS from the old max of 64k. IIRC it took a while before gp3 and io2 were even available for RDS after they were released as EBS options.
Edit: Presumably it takes some time to do testing/optimizations to make sure their RDS config can achieve the same performance as EBS. Sometimes there are limitations with instance generations/types that also impact whether you can hit maximum advertised throughput
If you have a PG cluster with 1 writer, 2 readers, 10 TiB of storage, and 16k provisioned IOPS (io1/2 has better latency than gp3), you pay for 30 TiB and 48k PIOPS without redundancy, or 60 TiB and 96k PIOPS with multi-AZ.
With the same Aurora setup, you pay for 10 TiB and get multi-AZ for free (assuming the same cluster layout and that you've put the instances in different AZs).
I don't want to work out the exact numbers, but IIRC if you have enough storage--especially io1/2--you can end up saving money and getting better performance. For smaller amounts of storage, the numbers don't necessarily work out.
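A back-of-envelope version of the comparison above. The TiB and IOPS figures come from the comment; turning this into dollars would require real AWS list prices, so it only tallies the quantities you're billed for:

```python
storage_tib = 10
piops = 16_000
instances = 3  # 1 writer + 2 readers

# Classic RDS: each instance carries its own copy of storage and PIOPS.
rds_storage = storage_tib * instances      # 30 TiB billed
rds_piops = piops * instances              # 48,000 PIOPS billed

# Multi-AZ doubles every instance's footprint.
rds_storage_multi_az = rds_storage * 2     # 60 TiB billed
rds_piops_multi_az = rds_piops * 2         # 96,000 PIOPS billed

# Aurora: one shared cluster volume, billed once, replicated across AZs.
aurora_storage = storage_tib               # 10 TiB billed
```

The gap (60 TiB billed vs 10 TiB) is why large-storage clusters can come out cheaper on Aurora despite its higher per-GiB rate.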
There are also two IO billing modes to be aware of. The default is pay-per-IO, which is really only helpful for extreme spikes and generally low IO usage. The other mode ("provisioned" or "storage optimized" or something similar) charges a flat 30% of the instance cost (on top of the instance cost) for unlimited IO--you can get a lot more IO and end up cheaper in this mode if you had an IO-heavy workload before.
I'd also say Serverless is almost never worth it. IIRC provisioned instances were ~17% of the cost of serverless, so serverless only works out if you have less than ~4 hours of heavy usage followed by near-total idle. You can add instances fairly quickly and fail over with minimal downtime (barring the bug the article describes, of course...) to handle workload spikes using fixed instance sizes without serverless.
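The ~4-hour figure falls out of the ~17% ratio. A sketch, assuming a provisioned instance costs ~17% of the same capacity on serverless (the comment's recollection, not a quoted AWS price) and that serverless scales to near zero while idle:

```python
PROVISIONED_RATIO = 0.17  # assumed provisioned/serverless hourly cost ratio

def serverless_is_cheaper(busy_hours_per_day: float) -> bool:
    serverless_cost = busy_hours_per_day * 1.0   # pay full rate only while busy
    provisioned_cost = 24 * PROVISIONED_RATIO    # pay ~17% of the rate all day
    return serverless_cost < provisioned_cost
```

With these assumptions the breakeven sits at 24 * 0.17 ≈ 4.1 busy hours per day, matching the "~ <4 hours" rule of thumb above.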
[0] https://dev.to/aws-heroes/100k-write-iops-in-aurora-t3medium...
I find this approach very compelling. MSSQL has a similar thing with their hyperscale offering. It's probably the only service in Azure that I would actually use.
Is there a case number where we can reach out to AWS regarding this recommendation?
We have done a similar operation routinely on databases under pretty write-intensive workloads (tens of thousands of inserts per second). It is so routine that we have automation to adjust to planned changes in volume, and we do so a dozen times a month or so. It has been very robust for us. Our apps are designed for it and use AWS's JDBC wrapper.
Just one more thing to worry about I guess…
1. I need to grab some rows from a table
2. Eventual consistency is good enough
And that's a lot of workloads.
More broadly, it's a distributed-systems problem.