I figured that if a single AZ had an outage, let alone the entire region, I could rest easy knowing that much bigger companies would have bigger problems. It would probably be newsworthy, and when customers emailed in, my excuse would be defensible, since I could send them links to external status pages, news articles, etc.
Whilst this was mostly true, it was still a very unpleasant experience, and my service was hanging by a thread for much of the time. I recently moved an important part of the stack from EC2 to Fargate, split into two services: one running a single task that posts jobs to a queue, and another running many tasks that process jobs from the queue.
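Roughly, the shape of it is something like the sketch below (a minimal illustration only; it assumes SQS and Python/boto3, and the queue URL and handler are made-up names, not my real setup):

```python
# Sketch of the poster/worker split: one task enqueues jobs, many tasks drain the queue.
# Assumes SQS + boto3; QUEUE_URL and process() are hypothetical placeholders.
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical

sqs = boto3.client("sqs", region_name="us-east-1")


def process(job: dict) -> None:
    """Placeholder for whatever the real job handler does."""
    print("processing", job)


def post_job(job: dict) -> None:
    """Poster service (single task): enqueue one job."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job))


def worker_loop() -> None:
    """Worker service (many tasks): long-poll the queue and process jobs."""
    while True:
        resp = sqs.receive_message(
            QueueUrL if False else QUEUE_URL,  # noqa: keep simple
        ) if False else sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            process(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```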
The incident knocked out the job posting service, which would not come back up. Had I left it to AWS to resolve automatically, my service would have been out for maybe 12 hours.
Fortunately the worker tasks were still available and waiting. I tracked down the old "job poster" code that used to run on an EC2 instance, SSHed into an old instance, and "deployed" the code by copying and pasting it onto the server. The service came back up, although I had to edit the code directly on the instance to slow things down: it had only 1 vCPU, an upgrade wasn't possible during the incident, and the Fargate workers would not scale out even when they had too much work.
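The "slow things down" edit was essentially rate-limiting the poster. Something along these lines (again a sketch, not the actual code; the interval is invented, and post_job() is the hypothetical poster call from the sketch above):

```python
# Crude throttle added during the incident: cap the posting rate with a sleep
# so the 1 vCPU box and the fixed pool of workers aren't overwhelmed.
import time

POST_INTERVAL_SECONDS = 0.5  # hypothetical rate chosen to match available capacity


def post_pending_jobs(jobs):
    for job in jobs:
        post_job(job)                      # same poster call as in the sketch above
        time.sleep(POST_INTERVAL_SECONDS)  # rate limit edited in by hand on the server
```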
This was at about 2 or 3 AM my time, and was carried out whilst customers were emailing in and CloudWatch alarms were going off all over the place. Once the service was back up, even with my unnerving, hacky solution, I got a couple of hours' sleep.
What I've learnt:
- When the incident was first reported, I thought it would last 2 hours max. A 12-16 hour disruption to AWS resources is absolutely possible.
- Maybe don't use us-east-1 for future projects, but I'm not convinced there's much logic to this. Despite past issues, it's impossible to predict where an outage will occur, which resources it will affect, or how it will spill over into other regions.
- Think of ways to make my service more portable, to other regions or even other cloud providers, but the motivation to do this will be gone by tomorrow. It's far more valuable for me to focus on customers, new features, etc., than on bomb-proofing the service. I don't write airline or medical software; an outage of my service isn't going to kill anyone, and most users are understanding. I'll accept the hit.