The "Mitigation and Service Restoration" phase here is definitely the weakest one in terms of explanation—5 hours for root cause discovery is a rough break, but somewhat understandable given the difficulty in isolating the hosts and debugging the obscure networking failure. But once they found the automated package update responsible, no detail was given at all for why it then took four hours to even consider a hotfix to disable the automatic package updates. Just a complete empty gap
Also, "token used for automatic updates"? So like, some kind of third-party vendor was automatically live-updating system packages on their running dynos??? Who the hell thought that would ever be a good idea? And once you've discovered an issue with it, why would you make that critical path remediation dependent on a third-party who probably doesn't even consider this a critical incident?
Here's a really simple question: once engineers confirmed the outage was major, why did it take eight hours for someone to think of logging into the herokustatus Twitter account? Was no one assigned to public comms during this incident?
Also, "token used for automatic updates"? So like, some kind of third-party vendor was automatically live-updating system packages on their running dynos??? Who the hell thought that would ever be a good idea? And once you've discovered an issue with it, why would you make that critical path remediation dependent on a third-party who probably doesn't even consider this a critical incident?
Here's a really simple question—once engineers confirmed the outage was major, why did it take 8 hours for someone to think of trying to log into the herokustatus Twitter account? Was there no one assigned to public comms during this incident?