> [Our service can only go down] five minutes and 15 seconds per year.
I don't have much experience in this area, so please correct me if I'm mistaken:
Don't these two quotes together imply that they have failed to deliver on their SLA for the subset of their customers that want their service in us-east-1? I understand the customers won't be mad at them in this case, since us-east-1 itself is down, but I feel like their title is incorrect. Some subset of their service is running on top of AWS. When AWS goes down, that subset of their service is down. When AWS was down, it seems like they were also down for some customers.
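For reference, the arithmetic behind the quoted figure can be checked quickly. This is a minimal sketch that assumes the 5 minutes and 15 seconds refers to a 99.999% availability target over a 365-day year; the quote itself does not state the percentage, and the function names are only illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_seconds(availability: float,
                            period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Allowed downtime in seconds for an availability fraction (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_seconds


def format_duration(seconds: float) -> str:
    """Render a duration as whole minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


budget = downtime_budget_seconds(0.99999)  # assumed "five nines" target
print(format_duration(budget))  # 5m 15s
```

Under that assumption, any regional outage longer than about five minutes uses up the whole yearly budget for the affected customers, which is the crux of the question above.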
We don't actually commit to running infrastructure in one specific AWS region. Customers can't request that the infra runs exactly in us-east-1, but they can request that it runs in "Eastern United States". The problem is that with scenarios that might require VPC peering or low-latency connections, we can't just run the infrastructure in us-east-2 and commit to never having a problem. For the same reason, what happens if us-east-2 were to have an incident?
We have to assume that our customers need it in a relatively close region, and at the same time we need to plan for the contingency that the region goes down.
Then there are the customer's users to think of as well. In some cases, those users might be globally dispersed, even if the customer's infrastructure is in only one major location. So while it would be nice to claim "well, you were also down at that moment", in practice the customer's users will notice, and realistically, we want to make sure we aren't impeding remediation on their side.
That is, even if a customer says "use us-east-1", and then us-east-1 is down, it can't look that way to the customer. This gets a lot more complicated when the services that we are providing may be impacted differently. Consider us-east-1 DynamoDB being down while everything else is still working. Partial failure modes are much harder to deal with.
Truer words were never spoken.
In that case, no matter what we are using there is going to be a critical issue. I think the best I could suggest at that point would be to have records in your zone that round robin different cloud providers, but that comes with its own challenges.
I believe there are some articles around on how AWS plans for failure, where the fallback mechanism actually reduces load on the system rather than making it worse. I think it would require in-depth investigation of the expected failover mode to have a good answer there.
For instance, just to make it more concrete, what sort of failure mode are you expecting to happen with the Route 53 health check? Depending on that there could be different recommendations.
Had no idea that Route 53 had this sort of functionality
(For what it's worth, for some of my services, 200ms is certainly an impact; not as bad as 2 seconds of outage, but still noticeable and reportable.)
This is where the grey failures can come into play. It's really hard, often impossible, to know what the impact of an incident is to a customer without them telling you, even if you know you are having an incident.
In order to know that you are "down", our edge of the HTTP request would need to be able to track requests. For us that is CloudFront, but if there is an issue before that, at DNS, at network level, etc... we just can't know what the actual impact is.
As for measuring how you are down, we can pretty accurately know (when we can know at all) the list of failures that are happening and what the results are.
That's because most components are behind CloudFront in any case. And if CloudFront isn't having a problem, we'll have telemetry that tells us what the HTTP request/response status codes and connection completions look like. Then it's a matter of measuring from our first detection to the actual remediation being deployed (assuming there is one).
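To make that measurement concrete, here is a rough sketch of turning edge telemetry into incident windows. It is not our actual pipeline: the `Bucket` shape, the one-minute bucket size, and the 5% error-rate threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    start: int     # bucket start time, in seconds
    requests: int  # requests seen at the edge in this bucket
    errors: int    # 5xx responses or connections that never completed


def incident_windows(buckets, bucket_seconds=60, threshold=0.05):
    """Merge contiguous failing buckets into (start, end) windows."""
    windows = []
    for b in sorted(buckets, key=lambda b: b.start):
        # A bucket with no requests tells us nothing: traffic may have been
        # lost before the edge (DNS, network), which is the grey-failure case.
        if b.requests == 0 or b.errors / b.requests < threshold:
            continue
        end = b.start + bucket_seconds
        if windows and windows[-1][1] == b.start:
            windows[-1] = (windows[-1][0], end)  # extend the open window
        else:
            windows.append((b.start, end))       # first detection of a new one
    return windows


def total_downtime_seconds(windows):
    """Sum the length of all incident windows."""
    return sum(end - start for start, end in windows)


telemetry = [
    Bucket(0, 100, 0),
    Bucket(60, 100, 50),
    Bucket(120, 100, 80),
    Bucket(180, 100, 1),
    Bucket(240, 100, 10),
]
windows = incident_windows(telemetry)
print(windows)                          # [(60, 180), (240, 300)]
print(total_downtime_seconds(windows))  # 180
```

Note that this only counts what the edge can see. Anything that fails before a request reaches it never shows up in the buckets at all.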
Another thing that helps here is that we have multiple other products that also use Authress, and we can run technology in other regions that can report this information, for those accounts (obviously can't be for all customers), which can help us identify with additional accuracy, but is often unnecessary.
I'll try to add comments and answer questions where I can.
- Warren
Edit: This is a fantastic write-up by the way!