
AWS says data center overheating in North Virginia disrupts services

https://www.reuters.com/business/retail-consumer/amazon-cloud-unit-says-data-center-overheating-north-virginia-disrupts-services-2026-05-08/
56•christhecaribou•19h ago
https://health.aws.amazon.com/health/status?t=2026-05-07

https://www.cnbc.com/2026/05/08/aws-outage-data-center-fandu...

https://www.theregister.com/off-prem/2026/05/08/aws-warns-of...

Comments

merek•19h ago
Related:

AWS EC2 outage in use1-az4 (us-east-1)

https://news.ycombinator.com/item?id=48057294

tcp_handshaker•1h ago
I bet the post-mortem will say vibe coding confused Fahrenheit and Celsius, we run too hot...
fabian2k•1h ago
I thought cooling was pretty much pre-planned in any data center, and you simply don't install more stuff than you can cool?

So did some cooling equipment fail here or was there an external reason for the overheating? Or does Amazon overbook the cooling in their data centers?

DevelopingElk•52m ago
One of the data center's cooling loops broke.
bdangubic•37m ago
No backups?
bradgessler•15m ago
What happens when the backup breaks?
AdamJacobMuller•35m ago
This is almost definitely an issue of equipment failure.

Cooling in datacenters is, like everything else, both over- and under-provisioned.

It's over-provisioned in the sense that the big heat-exchange units are N+1 (or, in very critical and smaller-load facilities, 2N/3N). This is done because you need to regularly take these down for maintenance work, and because they have a relatively high failure rate compared to traditional DC components and require mechanical repairs that call for specialized labor and long lead times. In a bigger facility it's not uncommon to have cooling be N+3 or more as N gets bigger, because you're effectively always servicing something, or have something down waiting for a blower assembly which needs to be literally made by a machinist with a lathe because that part doesn't exist anymore, but that's still cheaper than replacing the whole unit.

The systems are also under-provisioned in the sense that if all the compute capacity in the facility suddenly went from average power draw to 100% power draw, you would overload the cooling capacity; you would also commonly overload things in the electrical and other paths. Oversubscription is just the nature of the industry.

In general, neither of these things poses a real problem: compute loads don't spike to 100% of capacity, when they do spike they don't spike for terribly long, and nobody builds facilities on a knife-edge of cooling or power capacity.
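
To put rough numbers on that trade-off, here is a minimal Python sketch (the unit sizes, load figures, and redundancy counts are invented for illustration, not AWS's actual sizing):

    # Hypothetical numbers, purely to illustrate the sizing trade-off above;
    # real facilities (and AWS's real numbers) will differ.

    AVG_IT_LOAD_MW = 10.0    # average heat output of the IT floor
    PEAK_IT_LOAD_MW = 35.0   # theoretical max if every server drew full power

    UNIT_CAPACITY_MW = 5.0   # one cooling unit
    N = 4                    # units needed for the design load (200% of average)
    SPARES = 2               # the "+2" in N+2
    INSTALLED = N + SPARES

    def capacity(units_online: int) -> float:
        return units_online * UNIT_CAPACITY_MW

    # Over-provisioned against the design point: two units can be down for
    # maintenance and you still cover 200% of the average load.
    assert capacity(INSTALLED - SPARES) >= 2 * AVG_IT_LOAD_MW

    # Under-provisioned against the theoretical peak: even with every unit
    # online you could not cool the floor if everything ran flat out at once.
    assert capacity(INSTALLED) < PEAK_IT_LOAD_MW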

The problem comes when you have the intersection of multiple events.

You designed your cooling system to handle 200% of average load, which is great because you have lots of headroom for maintenance/outages.

Repair guy comes on Tuesday to do work on a unit and finds a bad bearing, has to get it from the next state over so he leaves the unit off overnight to not risk damaging the whole fan assembly (which would take weeks to fabricate).

The two adjacent cooling units are now working JUST A BIT harder to compensate, and one of them also had a motor that was just slightly imbalanced, or a fuse that was loose and warming up a bit, and now, with an increased duty cycle, that thing which worked fine for years goes pop.

Now you're minus two units in an N+2 facility. Not really terrible, remember you designed for 200% of average load.

That 3rd unit on the other side of the first failed unit, now under way more load, also has a fault. You're now minus 3 in an N+2 facility.

Still, not catastrophic because really you designed for 200% of average load.

The thing is, it's now 4AM, the onsite ops guy can't fix these faults and needs to call the vendor who doesn't wake up till 7AM and won't be onsite till 9.

Your load starts ramping up.

Everything up above happens daily in some datacenter in the USA. It happens in every datacenter probably once a year.

What happens next is the confluence of events which puts you in the news.

One of your bigger customers decides now is a great time to start a huge batch processing job. Some fintech wants to run a huge model before market open or some oil firm wants to do some quick analysis of a new field.

They spin up 10000 new VMs.

Normally, this is fine, you have the spare capacity.

But, remember, you planned cooling for 200% of AVERAGE load, and these are not nodes which are busy but not terribly busy; these are nodes doing intense, optimized number-crunching work, which means they draw max power and thus expel max waste heat.

Not only has your load in terms of aggregate number of machines spiked but their waste heat impact is also greater on average.

Boom, cascading failure, your cooling is now N-4.

Server fans start ramping up faster which consumes more power.

Your cooling is now N-5.

Alarms are blaring all over the place.

Safeties on the cooling units start to trip as they exceed their load and refrigerant pressures rise.

Your cooling is now N-6.

Your cooling is now N-7.

Your cooling is now 0.
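
A toy simulation of that cascade (capacities, loads, and the trip rule are all made up here, not a model of the actual incident): once the surviving units are asked to carry more than their rating, each trip dumps its share onto the rest, and the countdown to zero goes quickly.

    # Toy model of the cascade described above; all numbers are invented.
    # Rule of thumb: a unit trips its safety when asked to carry more than
    # its rated capacity, and its share lands on whatever is left.

    UNIT_CAPACITY_MW = 5.0
    units_online = 6 - 3        # the N+2 facility, already minus three units

    heat_load_mw = 14.0         # load ramping up through the morning
    heat_load_mw += 6.0         # huge batch job: thousands of max-power VMs
    heat_load_mw += 1.0         # server fans ramping up draw extra power

    while units_online > 0:
        per_unit = heat_load_mw / units_online
        if per_unit <= UNIT_CAPACITY_MW:
            print(f"{units_online} units holding at {per_unit:.1f} MW each")
            break
        units_online -= 1       # most stressed unit trips on overload
        print(f"unit tripped, {units_online} cooling units left")

    if units_online == 0:
        print("cooling is now 0")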

fabian2k•31m ago
I'd expect someone like AWS to just throttle machines before overloading their cooling, because they probably can do that, while e.g. a data center that just rents out the space can't really throttle its customers nicely.
cperciva•7m ago
Reducing clock speeds, even if they could do that -- and I'm not sure they can, given how Nitro is designed -- would be problematic since a lot of customer workloads assume homogeneous nodes.

But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is that they reduced the amount of heat being produced.

foota•21m ago
Shouldn't there be a feedback system here preventing the scheduling of loads when cooling is degraded?
AdamJacobMuller•7m ago
With hyperscalers for sure.

But this is the physical world, shit happens.

The algorithm didn't know that fuse was loose: fine at 50% duty cycle, but high resistance and going to blow at 100%.
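
The kind of feedback foota is describing might look roughly like the sketch below. It is entirely hypothetical (the Room fields, the margin, and the admission rule are invented, not EC2's actual placement logic), and, as noted above, it only helps when the degradation is actually visible to the scheduler; a loose fuse that measures fine at 50% duty cycle isn't.

    # Hypothetical placement guard: refuse new capacity when a room's
    # cooling headroom is degraded. Not a description of EC2 internals.

    from dataclasses import dataclass

    @dataclass
    class Room:
        cooling_capacity_mw: float   # what the currently healthy units can remove
        current_heat_mw: float       # estimated heat output right now

    def can_admit(room: Room, instance_heat_mw: float, margin: float = 0.8) -> bool:
        """Admit only if the projected heat stays under a safety margin."""
        projected = room.current_heat_mw + instance_heat_mw
        return projected <= margin * room.cooling_capacity_mw

    room = Room(cooling_capacity_mw=15.0, current_heat_mw=11.5)
    print(can_admit(room, instance_heat_mw=0.4))   # True: still headroom
    print(can_admit(room, instance_heat_mw=1.0))   # False: too close to the limit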

PunchyHamster•9m ago
The cooling units don't fail just because they get to 100% duty cycle. That's pretty much "normal operation"; you just get... higher efficiency because the cooling side is warmer.
Havoc•1h ago
Could someone explain to me why they don't build these things near oceans? Like nuclear plants, which need plenty of cooling capacity too.

A two-loop cycle with a heat exchanger to get rid of the heat.

sheept•1h ago
This is just a guess, but land near oceans is more expensive/populated, and water is comparatively cheap
kinow•1h ago
I had a class in my master's about data centers (HPC Infrastructures). The professor used some data centers somewhere in the middle of the USA, in an area with hot weather, as an example. He compared that with an ideal scenario (weather, power source, etc.).

One of the slides listed factors that influence the decision of where to build a data center, and several of the items involved finding a place with enough space and skilled people to work at the data center. He also commented that sometimes politics is involved in choosing the site for the next data center.

ikr678•53m ago
Off the top of my head: a water system with ocean levels of salt is much more expensive to maintain (even the secondary loop).

Coastal land is much more expensive. If you go to a remote coastal site, you probably won't have as good access to power.

Coastal sites are usually exposed to more severe weather events.

Other fun unpredictable things, e.g. the Diablo Canyon nuclear facility has had issues with debris and jellyfish migrations blocking its saltwater cooling intake.

https://www.nbcnews.com/news/world/diablo-canyon-nuclear-pla...

mandevil•53m ago
So Ashburn VA is a datacenter hub because the very first non-government Internet Exchange Point (IXP) anywhere in the world was there (https://en.wikipedia.org/wiki/MAE-East). Back in the 1990s something like half of all internet traffic in the world hit MAE-East. That in turn made AWS put their first region there (us-east-1 preceded eu-west-1 by 2 years and us-west-1 by 3 years). Then, because there were lots of people who knew how to build DCs - and lots of vendors who knew how to supply them - the Dulles Corridor became a major hub for lots of companies' datacenters. For AWS, because us-east-1 was the first, it's by far the most gnarly and weird - and a lot of control planes for other AWS services end up relying on it.

But NoVA is basically the same sort of economic cluster that Paul Krugman won his Nobel Prize in Economics for studying, just for datacenters.

jjmarr•43m ago
Oceans have salt. Saltwater is worse for electronics than normal water. You also need sufficient water depth, otherwise the intake water warms to surface temperature. It also needs to be price-competitive with traditional evaporative cooling.

Toronto is the textbook example of this working. It's on a freshwater lake that is deep relatively close to the shore, and the downtown has expensive real estate blocking traditional methods.

https://en.wikipedia.org/wiki/Deep_Lake_Water_Cooling_System

dpe82•32m ago
In a proper 2-loop cooling system, the primary loop (with direct electronics contact) and secondary loop (with seawater/external cooling source) are hydraulically isolated by a heat exchanger. The salt water or whatever never gets anywhere near the electronics.
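
As a rough illustration of the two-loop idea (generic heat-exchanger arithmetic with illustrative numbers, not any particular facility's design): both loops have to carry the same heat, Q = m_dot * c_p * delta_T, but they never share water, so the seawater side runs at its own flow rate and temperatures.

    # Generic two-loop arithmetic: each loop carries the same heat Q,
    # Q = m_dot * c_p * delta_T, without the water ever mixing.
    # Numbers are illustrative only.

    CP_WATER = 4.19  # kJ/(kg*K) for fresh water; seawater is a few percent lower

    def flow_for_heat(q_kw: float, delta_t_k: float) -> float:
        """Mass flow (kg/s) needed to carry q_kw with a temperature rise of delta_t_k."""
        return q_kw / (CP_WATER * delta_t_k)

    Q_KW = 5_000.0  # heat rejected by one data hall

    # Primary loop (clean water near the electronics), 12 K temperature rise
    print(flow_for_heat(Q_KW, 12.0))   # ~99 kg/s

    # Secondary loop (seawater side of the heat exchanger), 6 K temperature rise
    print(flow_for_heat(Q_KW, 6.0))    # ~199 kg/s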
mschuster91•22m ago
The problem is, it's still in contact with something, even if it's just the secondary loop. Saltwater is not just incredibly aggressive against metal; the major problem with using it for cooling is fouling. Fish, mussels, algae, debris: there are a lot of things that can clog up your entire setup.
arjie•18m ago
Amusingly I've been part of two critical downtime heating incidents at two different datacenters: one was when Hosting.com's SOMA datacenter got so hot that they were using hoses on the roof to cool it down; and the second one was when Alibaba's Chai Wan datacenter got so hot everything running there went down, including the control plane. So I imagine the proximity to the ocean does not yield any additional advantage in terms of emergency heat sinking. You have x capacity to pump heat out and it doesn't matter if you're next to the sea or in the middle of Nebraska because your entire system needs to be built to be rated for some performance.
PunchyHamster•7m ago
yeah but capacity is easier/cheaper to build/overbuild if you can access cold-ish water at all times
tailscaler2026•29m ago
us-east-1 is down? shocking! stop putting SPOF services there. this location has had frequent issues for the past 15 years.

Google broke reCAPTCHA for de-googled Android users

https://reclaimthenet.org/google-broke-recaptcha-for-de-googled-android-users
337•anonymousiam•4h ago•117 comments

AI is breaking two vulnerability cultures

https://www.jefftk.com/p/ai-is-breaking-two-vulnerability-cultures
173•speckx•5h ago•76 comments

You gave me a u32. I gave you root. (io_uring ZCRX freelist LPE)

https://ze3tar.github.io/post-zcrx.html
87•MrBruh•3h ago•50 comments

Cartoon Network Flash Games

https://www.webdesignmuseum.org/flash-game-exhibitions/cartoon-network-flash-games
247•willmeyers•6h ago•80 comments

AWS says data center overheating in North Virginia disrupts services

https://www.reuters.com/business/retail-consumer/amazon-cloud-unit-says-data-center-overheating-n...
60•christhecaribou•19h ago•27 comments

Non-determinism is an issue with patching CVEs

https://flox.dev/blog/achieving-rapid-cve-remediation-in-an-era-of-escalating-vulnerabilities/
22•mathewpregasen•1h ago•7 comments

Looking at the data behind prediction markets

https://asteriskmag.com/issues/14/are-prediction-markets-good-for-anything
28•kqr•1d ago•11 comments

David Attenborough's 100th Birthday

https://www.bbc.com/news/articles/cp3pww9g0p5o
362•defrost•10h ago•67 comments

Serving a website on a Raspberry Pi Zero running in RAM

https://btxx.org/posts/memory/
177•xngbuilds•7h ago•71 comments

Mux (YC W16) Is Hiring

https://www.mux.com/jobs
1•mmcclure•1h ago

All means are fair except solving the problem

https://yosefk.com/blog/all-means-are-fair-except-solving-the-problem.html
21•akkartik•2d ago•14 comments

An Introduction to Meshtastic

https://meshtastic.org/docs/introduction/
354•ColinWright•11h ago•135 comments

AWS data center outage hits trading on Fanduel, Coinbase

https://www.cnbc.com/2026/05/08/aws-outage-data-center-fanduel-coinbase.html
10•bigflern•1h ago•0 comments

Dirty Frag: Universal Linux LPE

https://github.com/V4bel/dirtyfrag
15•unbeli•2h ago•1 comments

Meta Shuts Down End-to-End Encryption for Instagram Messaging

https://www.pcmag.com/news/meta-shuts-down-end-to-end-encryption-for-instagram-dms-messaging
39•tcp_handshaker•1h ago•17 comments

Wi is Fi: Understanding Wi-Fi 4/5/6/6E/7/8 (802.11 n/AC/ax/be/bn)

https://www.wiisfi.com/
8•homebrewer•2d ago•1 comments

My first in-prod corrupted hard drive problem

https://blog.pavementlink.ch/2026/05/07/my-first-corrupted-hard-drive-problem/
33•r1chk1t•3h ago•23 comments

Rumors of my death are slightly exaggerated

1436•CliffStoll•2d ago•223 comments

Compound drivers of Antarctic sea ice loss and Southern Ocean destratification

https://www.science.org/doi/10.1126/sciadv.aeb0166
6•littlexsparkee•59m ago•0 comments

Teaching Claude Why

https://www.anthropic.com/research/teaching-claude-why
43•pretext•5h ago•5 comments

Mojo 1.0 Beta

https://mojolang.org/
255•sbt567•20h ago•167 comments

US Government releases first batch of UAP documents and videos

https://www.war.gov/UFO/
204•david-gpu•10h ago•316 comments

Poland is now among the 20 largest economies

https://apnews.com/article/poland-economy-growth-g20-gdp-26fe06e120398410f8d773ba5661e7aa
861•surprisetalk•10h ago•716 comments

PC Engine CPU

https://jsgroth.dev/blog/posts/pc-engine-cpu/
113•ibobev•8h ago•50 comments

Roadside Attraction

https://theoffingmag.com/essay/roadside-attraction/
13•aways•3h ago•3 comments

Man finds $1M worth of Yu-Gi-Oh cards in a dumpster

https://www.404media.co/man-finds-1-million-worth-of-yu-gi-oh-cards-in-a-dumpster/
87•danso•2d ago•25 comments

Maybe you shouldn't install new software for a bit

https://xeiaso.net/blog/2026/abstain-from-install/
806•psxuaw•23h ago•427 comments

Show HN: GETadb.com – every GET request creates a DB

https://www.getadb.com/
22•nezaj•6h ago•26 comments

Ask HN: We just had an actual UUID v4 collision...

266•mittermayr•15h ago•228 comments

Podman rootless containers and the Copy Fail exploit

https://garrido.io/notes/podman-rootless-containers-copy-fail/
109•ggpsv•9h ago•23 comments