How when AWS was down, we were not

https://authress.io/knowledge-base/articles/2025/11/01/how-we-prevent-aws-downtime-impacts
41•mooreds•3h ago

Comments

tptacek•2h ago
This is a rare case where the original bait-y title is probably better than the de-bait-ified title, because the actual article is much less of a brag and much more of an actual case study.
dang•2h ago
Re-how'd, plus I've resisted the temptation to insert a comma that feels missing to me.
tptacek•1h ago
"How?! When AWS was down: we were not!"
wparad•19m ago
I spent a long time, trying to figure out, what the title of the article, should be. I'm terrible at SEO and generating click-bait titles, it is unfortunately, what, it, is.
pinkmuffinere•2h ago
> During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there

> [Our service can only go down] five minutes and 15 seconds per year.

I don't have much experience in this area, so please correct me if I'm mistaken:

Don't these two quotes together imply that they have failed to deliver on their SLA for the subset of their customers that want their service in us-east-1? I understand the customers won't be mad at them in this case, since us-east-1 itself is down, but I feel like their title is incorrect. Some subset of their service is running on top of AWS. When AWS goes down, that subset of their service is down. When AWS was down, it seems like they were also down for some customers.
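
For reference, the quoted "five minutes and 15 seconds per year" matches a 99.999% availability target over a 365-day year. The 99.999% figure is inferred from the quoted number, not stated in the quote, and the function name below is only illustrative. A quick sketch of the arithmetic:

```python
def downtime_budget_seconds(availability: float, days: int = 365) -> float:
    """Allowed downtime per period, in seconds, for an availability target."""
    period_seconds = days * 24 * 60 * 60
    return period_seconds * (1.0 - availability)


# Assumed target: "five nines" (inferred from the quoted 5 min 15 s).
budget = downtime_budget_seconds(0.99999)
minutes, seconds = divmod(budget, 60)
print(f"{int(minutes)} min {seconds:.1f} s per year")
```

At that budget, a regional outage lasting hours uses up many years of allowance unless traffic is routed elsewhere, which is why the failover question matters.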

PaulRobinson•1h ago
The bulk of the article discusses their failover strategy, where they detect failures in a region and how they route requests to a backup region, and how to deal with data consistency and cost issues arising from that.
loloquwowndueo•1h ago
Depends on what the SLA phrasing is - us-east-1 affinity is a requirement put forth by some customers so I would totally expect the SLA to specifically state it’s subject to us-east-1 availability. Essentially these customers are opting out of Authress’s fault-tolerant infrastructure and the SLA should be clear about that.
dylan604•1h ago
As TFA states, we have to offer services in that region because that's where some users are as well. However, the core services are not in that region. I have also suggested that, when the time comes to offer SLAs, there be explicit wording exempting us-east-1.
wparad•26m ago
It's a good point.

We don't actually commit to running infrastructure in one specific AWS region. Customers can't request that the infra runs exactly in us-east-1, but they can request that it runs in "Eastern United States". The problem is that with scenarios that might require VPC peering or low latency connections, we can't just run the infrastructure in us-east-2 and commit to never having a problem. For the same reason, what happens if us-east-2 were to have an incident?

We have to assume that our customers need it in a relatively close region, and at the same time we need to plan for the contingency that the region can be down.

Then there are the customer's users to think of as well. In some cases, those users might be globally dispersed, even if the customer infrastructure is only in one major location. So while it would be nice to claim "well you were also down at that moment", in practice the customer's users will notice, and realistically, we want to make sure we aren't impeding remediation on their side.

That is, even if a customer says "use us-east-1", and then us-east-1 is down, it can't look that way to the customer. This gets a lot more complicated when the services that we are providing may be impacted differently. Consider us-east-1 DynamoDB being down while everything else is still working. Partial failure modes are much harder to deal with.

macintux•11m ago
> Partial failure modes are much harder to deal with.

Truer words were never spoken.

sharklasers123•1h ago
Is there not an inherent risk using an AWS service (Route 53) to do the health check? Wouldn’t it make more sense to use a different cloud provider for redundancy?
indigodaddy•1h ago
Had the same thought, eg if things are really down can it even do the check etc
wparad•32m ago
If the check can't be done, then everything stays stable, so I'm guessing the question is, "What happens if Route 53 does the check and incorrectly reports the result?"
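
To make the "stays stable" behavior concrete, here is a minimal sketch of a fail-stable routing decision. It models only what the comment describes, not Route 53's documented internals. `Region` and `next_region` are hypothetical names made up for illustration.

```python
from enum import Enum
from typing import Optional


class Region(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"


def next_region(current: Region, primary_healthy: Optional[bool]) -> Region:
    """Pick the region to route to after one health-check round.

    primary_healthy is True or False when the check completed, and None
    when the check itself could not be performed.
    """
    if primary_healthy is None:
        # No result: keep routing as-is instead of guessing.
        return current
    return Region.PRIMARY if primary_healthy else Region.BACKUP


# Replay a sequence of check results: healthy, unknown, failed, unknown, healthy.
region = Region.PRIMARY
history = []
for result in [True, None, False, None, True]:
    region = next_region(region, result)
    history.append(region.value)
print(history)
```

The remaining risk is the one named above: a check that completes but reports the wrong answer, which this logic cannot detect.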

In that case, no matter what we are using there is going to be a critical issue. I think the best I could suggest at that point would be to have records in your zone that round robin different cloud providers, but that comes with its own challenges.

I believe there are some articles sitting around regarding how AWS plans for failure and the fallback mechanism actually reduces load on the system rather than makes it worse. I think it would require in-depth investigation on the expected failover mode to have a good answer there.

For instance, just to make it more concrete, what sort of failure mode are you expecting to happen with the Route 53 health check? Depending on that there could be different recommendations.

indigodaddy•1h ago
Back in the day (10-12 years ago) at a telecom/cable we accomplished this with F5 Big IP GSLB DNS (and later migrated to A10's GSLB equivalent devices) as the auth DNS server for services/zones that required or were suitable for HA. (I can't totally remember but I'm guessing we must have had a pretty low TTL for this).

Had no idea that Route 53 had this sort of functionality

wparad•10m ago
Maybe I should have titled the article "AWS Route53 HealthChecks are amazing" :)
iso1631•1h ago
I'm interested in how they measure that downtime. If you're down for 200 milliseconds, does that accumulate? How do you even measure that you're down for 200ms?

(For what it's worth, for some of my services, 200ms is certainly an impact, not as bad as 2 seconds of outage but still noticeable and reportable)

wparad•21m ago
Good catch. The truth is, while we track downtime for incident reporting, it's much more correct to actually be tracking the number of requests that result in a failure. Our SLAs are based on request volume, and not specifically time. Most customers don't have perfect sustained usage. Being down when they aren't running is irrelevant to everyone.
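
As a rough illustration of the difference between time-based and request-based measurement, here is a sketch with made-up numbers. The `Window` type, its fields, and both function names are hypothetical and not taken from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    total_requests: int   # requests received in this time window
    failed_requests: int  # requests that ended in a failure
    service_down: bool    # whether the service was down during the window


def request_availability(windows: List[Window]) -> float:
    """Share of requests that succeeded, ignoring windows with no traffic."""
    total = sum(w.total_requests for w in windows)
    failed = sum(w.failed_requests for w in windows)
    return 1.0 if total == 0 else 1.0 - failed / total


def time_availability(windows: List[Window]) -> float:
    """Share of equal-length windows in which the service was up."""
    if not windows:
        return 1.0
    down = sum(1 for w in windows if w.service_down)
    return 1.0 - down / len(windows)


# Invented data: one outage during a quiet window, a few failures under load.
windows = [
    Window(10_000, 0, False),
    Window(0, 0, True),       # down, but nobody was calling
    Window(10_000, 50, False),
    Window(0, 0, False),
]
print(request_availability(windows), time_availability(windows))
```

In this example the quiet-window outage costs a quarter of the time-based figure but nothing on the request-based one, while the 50 failed requests count only against the request-based figure.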

This is where the grey failures can come into play. It's really hard, often impossible, to know what the impact of an incident is to a customer without them telling you, even if you know you are having an incident.

In order to know that you are "down", our edge of the HTTP request would need to be able to track requests. For us that is CloudFront, but if there is an issue before that, at DNS, at network level, etc... we just can't know what the actual impact is.

As for measuring how you are down: we can pretty accurately know the list of failures that are happening (when we can know at all) and what the results are.

That's because most components are behind CloudFront in any case. And if CloudFront isn't having a problem, we'll have telemetry that tells us what the HTTP request/response status codes and connection completions look like. Then it's a matter of measuring from our first detection to the actual remediation being deployed (assuming there is one).
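
A minimal sketch of that kind of measurement, assuming edge telemetry is available as (timestamp, HTTP status) pairs. The record shape, the `incident_summary` name, and the sample data are all invented for illustration. It only sees requests that actually reached the edge, which is the limitation described above.

```python
from datetime import datetime, timedelta


def incident_summary(records):
    """Summarize 5xx responses in an iterable of (timestamp, http_status).

    Returns None when no request failed.
    """
    failures = [ts for ts, status in records if status >= 500]
    if not failures:
        return None
    first, last = min(failures), max(failures)
    return {
        "first_failure": first,
        "last_failure": last,
        "failed_requests": len(failures),
        "span": last - first,
    }


# Arbitrary sample data, one request per second.
start = datetime(2025, 1, 1, 12, 0, 0)
records = [
    (start, 200),
    (start + timedelta(seconds=1), 503),
    (start + timedelta(seconds=2), 200),
    (start + timedelta(seconds=3), 502),
    (start + timedelta(seconds=4), 200),
]
summary = incident_summary(records)
print(summary["failed_requests"], summary["span"])
```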

Another thing that helps here is that we have multiple other products that also use Authress, and we can run technology in other regions that can report this information, for those accounts (obviously can't be for all customers), which can help us identify with additional accuracy, but is often unnecessary.

wparad•35m ago
Hey, I wrote that article!

I'll try to add comments and answer questions where I can.

- Warren

ckozlowski•20m ago
Hi Warren! I'm Chris, and I'm with AWS, where among other things, I work on the Well-Architected Framework. Would you be willing to talk with us? You can reach me at kozlowck@amazon.com. Thanks!

Edit: This is a fantastic write-up by the way!

wparad•17m ago
Thank you!
