frontpage.

BERT Is Just a Single Text Diffusion Step

https://nathan.rs/posts/roberta-diffusion/
113•nathan-barry•1h ago•9 comments

Commodore 64 Ultimate

https://www.commodore.net/product-page/commodore-64-ultimate-basic-beige-batch1
44•guerrilla•59m ago•14 comments

DeepSeek OCR

https://github.com/deepseek-ai/DeepSeek-OCR
644•pierre•9h ago•162 comments

Space Elevator

https://neal.fun/space-elevator/
1015•kaonwarb•11h ago•215 comments

Servo v0.0.1 Released

https://github.com/servo/servo
237•undeveloper•3h ago•55 comments

Matrix Conference 2025 Highlights

https://element.io/blog/the-matrix-conference-a-seminal-moment-for-matrix/
88•Arathorn•3h ago•46 comments

How to stop Linux threads cleanly

https://mazzo.li/posts/stopping-linux-threads.html
26•signa11•5d ago•3 comments

Docker Systems Status: Full Service Disruption

https://www.dockerstatus.com/pages/incident/533c6539221ae15e3f000031/68f5e1c741c825463df7486c
260•l2dy•8h ago•102 comments

Anthropic and Cursor Spend This Much on Amazon Web Services

https://www.wheresyoured.at/costs/
45•isoprophlex•50m ago•15 comments

Modeling Others' Minds as Code

https://arxiv.org/abs/2510.01272
27•PaulHoule•2h ago•8 comments

Entire Linux Network stack diagram (2024)

https://zenodo.org/records/14179366
461•hhutw•12h ago•39 comments

Show HN: Playwright Skill for Claude Code – Less context than playwright-MCP

https://github.com/lackeyjb/playwright-skill
58•syntax-sherlock•3h ago•22 comments

How to Enter a City Like a King

https://worldhistory.substack.com/p/how-to-enter-a-city-like-a-king
34•crescit_eundo•1w ago•12 comments

Pointer Pointer (2012)

https://pointerpointer.com
177•surprisetalk•1w ago•19 comments

AWS Multiple Services Down in us-east-1

https://health.aws.amazon.com/health/status?ts=20251020
669•kondro•8h ago•261 comments

The Peach meme: On CRTs, pixels and signal quality (again)

https://www.datagubbe.se/crt2/
39•zdw•1w ago•11 comments

Forth: The programming language that writes itself

https://ratfactor.com/forth/the_programming_language_that_writes_itself.html
260•suioir•15h ago•116 comments

State-based vs Signal-based rendering

https://jovidecroock.com/blog/state-vs-signals/
40•mfbx9da4•6h ago•34 comments

Qt Group Buys IAR Systems Group

https://www.qt.io/stock/qt-completes-the-recommended-public-cash-offer-to-the-shareholders-of-iar...
18•shrimp-chimp•3h ago•4 comments

AWS Outage: A Single Cloud Region Shouldn't Take Down the World. But It Did

https://faun.dev/c/news/devopslinks/aws-outage-a-single-cloud-region-shouldnt-take-down-the-world...
256•eon01•3h ago•138 comments

Optimizing writes to OLAP using buffers (ClickHouse, Redpanda, MooseStack)

https://www.fiveonefour.com/blog/optimizing-writes-to-olap-using-buffers
19•oatsandsugar•5d ago•7 comments

Fractal Imaginary Cubes

https://www.i.h.kyoto-u.ac.jp/users/tsuiki/icube/fractal/index-e.html
34•strstr•1w ago•3 comments

Novo Nordisk's Canadian Mistake

https://www.science.org/content/blog-post/novo-nordisk-s-canadian-mistake
396•jbm•19h ago•207 comments

Major AWS Outage Happening

https://old.reddit.com/r/aws/comments/1obd3lx/dynamodb_down_useast1/
1018•vvoyer•8h ago•528 comments

Introduction to reverse-engineering vintage synth firmware

https://ajxs.me/blog/Introduction_to_Reverse-Engineering_Vintage_Synth_Firmware.html
146•jmillikin•12h ago•22 comments

Duke Nukem: Zero Hour N64 ROM Reverse-Engineering Project Hits 100%

https://github.com/Gillou68310/DukeNukemZeroHour
209•birdculture•19h ago•89 comments

Give Your Metrics an Expiry Date

https://adrianhoward.com/posts/give-your-metrics-an-expiry-date/
57•adrianhoward•5d ago•18 comments

Gleam OTP – Fault Tolerant Multicore Programs with Actors

https://github.com/gleam-lang/otp
165•TheWiggles•17h ago•70 comments

Airliner hit by possible space debris

https://avbrief.com/united-max-hit-by-falling-object-at-36000-feet/
372•d_silin•22h ago•196 comments

Major AWS outage takes down Fortnite, Alexa, Snapchat, and more

https://www.theverge.com/news/802486/aws-outage-alexa-fortnite-snapchat-offline
200•codebolt•7h ago•79 comments

AWS Outage: A Single Cloud Region Shouldn't Take Down the World. But It Did

https://faun.dev/c/news/devopslinks/aws-outage-a-single-cloud-region-shouldnt-take-down-the-world-but-it-did/
256•eon01•3h ago

Comments

skywhopper•3h ago
There are plenty of ways to address this risk. But the companies impacted would have to be willing to invest in the extra operational cost and complexity. They aren’t.
randomtoast•3h ago
Thing is, us-east-1 is the primary region for many AWS services. DynamoDB is a very central offering used by many services. And the issue that happened is very common[^1].

I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[^2].

[1]: https://itsfoss.community/uploads/default/optimized/2X/a/ad3...

[2]: https://xkcd.com/2347/

pjmlp•3h ago
It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality, including well known industry names.
mrbungie•2h ago
> It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality,

Well, inter-region DR/HA is an expensive thing to ensure (whether in salaries, infra, or both), especially when you are on AWS.

esafak•2h ago
Does AWS follow its own Well-Architected Framework!?
spyspy•2h ago
Eh, the "best practices" that would've prevented this aren't trivial to implement and are definitely far beyond what most engineering teams are capable of, in my experience. It depends on your risk profile. When we had cloud outages at the freemium game company I worked at, we just shrugged and waited for the systems to come back online - nobody dying because they couldn't play a word puzzle. But I've also had management come down and ask what it would take to prevent issues like that from happening again, and then pretend they never asked once it was clear how much engineering effort it would take. I've yet to meet a product manager that would shred their entire roadmap for 6-18 months just to get at an extra 9 of reliability, but I also don't work in industries where that's super important.
pjmlp•13m ago
Indeed, yet one would expect AWS to lead by example, including all of those that are only using a single region.
Thaxll•1h ago
Best practice does not include planning for AWS going down. Netflix does not plan for it, and they have a very strong eng org.
pjmlp•14m ago
It was only one region.
mcphage•3h ago
It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single points of failure.

Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

Trasmatta•2h ago
The irony is that true resilience is very complex, and complexity can be a major source of outages in and of itself
lanstin•51m ago
I have enjoyed this paper on such dynamics: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...

It is kind of the child of what used to be called Catastrophe Theory, which in low dimensions is essentially a classification of foldings of manifolds. Now the systems are higher-dimensional and the advice more practical/heuristic.

mschuster91•2h ago
> Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

Doesn't help either. us-east-1 hosts the internal control plane of AWS and a bunch of stuff is only available in us-east-1 at all - most importantly, Cloudfront, AWS ACM for Cloudfront and parts of IAM.

And the last is the one true big problem. When IAM has a sniffle, everything else collapses because literally everything else depends on IAM. If I were to guess IAM probably handles millions if not billions of requests a second because every action on every AWS service causes at least one request to IAM.

TheNewsIsHere•2h ago
The last re:Invent presentation I saw from one of the principals working on IAM quoted 500 million requests per second. I expect that’s because IAM also underpins everything inside AWS, too.
mschuster91•2h ago
IAM, hands down, is one of the most amazing pieces of technology there is.

The sheer volume is one thing, but... IAM's policy engine, that's another thing. Up to 5000 different roles per account, dozens of policies that can have an effect on any given user entity, and on top of that you can also create IAM policies that blanket-affect all entities (or only a filtered subset) in an account, and each policy definition can be, what, 10 kB or so in size. Filters can include multiple wildcards everywhere, so you can't go for a fast path in an in-memory index, and they can use variables with on-demand evaluation as well.

And all of that is reachable not on an account-specific endpoint that could get sharded from a shared instance should the load of one account become too expensive, no, it's a global (and region-shared) endpoint. And if that weren't enough, all calls are shipped off to CloudTrail's event log, always, with full context cues to have an audit and debug trail.

To achieve all that at a service quality where a change in an IAM policy becomes effective in under 10 seconds, with milliseconds of call time, is nothing short of amazing.
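
To make that concrete, here is a minimal boto3 sketch (the policy name and bucket ARNs are made up) of the kind of policy that engine has to evaluate on every request: wildcards in both actions and resources, plus a policy variable that can only be resolved at call time.

    import json
    import boto3

    # Wildcards in several positions plus a request-time policy variable
    # (${aws:username}) -- exactly the shape that defeats a simple precomputed index.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:Get*", "s3:List*"],
            "Resource": [
                "arn:aws:s3:::example-*-logs/*",
                "arn:aws:s3:::example-home/${aws:username}/*",
            ],
        }],
    }

    iam = boto3.client("iam")
    iam.create_policy(
        PolicyName="example-wildcard-policy",  # hypothetical name
        PolicyDocument=json.dumps(policy_document),
    )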

quesera•1h ago
No harshness intended, but I don't see the magic.

IAM is solid, but is it any more special than any other distributed AuthN+AuthZ service?

lanstin•47m ago
Scale is a feature. 500M per second in practice is impressive.
UltraSane•39m ago
The scale, speed, and uptime of AWS IAM is pretty special.
foinker•2h ago
No shot that happens until an outage breaks at least an entire workday in the US timezones. The only complaint I personally heard was from someone who couldn't load reddit on the train to work.
geodel•1h ago
Well, by the time it really happens for a whole day, Amazon leadership will be brazen enough to say "OK, enough of this, my site is down, we will call back once systems are up so don't bother for a while". Also, maybe the responsible human engineers will have been fired by then, and AI can be infinitely patient while working through unsolvable issues.
2d8a875f-39a2-4•2h ago
> we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive.

You are right. But alas, a peek at the AMZN stock ticker suggests that the market doesn't really value resilience that much.

Cthulhu_•55m ago
Stocks stopped being indicative of anything decades ago though.
agos•2h ago
when did we have resilience?
BoredPositron•2h ago
Cold War was pretty good in terms of resilience.
JCM9•3h ago
US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.

AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.

Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”

helsinkiandrew•3h ago
>US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions

I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.

I haven’t had to do this for several years, but that was my experience a few years ago during an outage - obviously it depends on the services you’re using.

You can’t start cloning things to other zones after us-east-1 is down - you’ve left it too late

cmiles8•2h ago
Well that sounds like exactly the sort of thing that shouldn’t happen when there’s an issue given the usual response is to spin things up elsewhere, especially on lower priority services where instant failover isn’t needed.
sgarland•2h ago
It depends on the outage. There was one a year or two ago (I think? They run together) that impacted EC2 such that as long as you weren’t trying to scale, or issue any commands, your service would continue to operate. The EKS clusters at my job at the time kept chugging along, but had Karpenter tried to schedule more nodes, we’d have had a bad time.
bpicolo•2h ago
Static stability is a very valuable infra attribute. You should definitely consider how statically stable your services are when architecting them.
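
A rough sketch of the idea, assuming a hypothetical service that pulls its endpoint list from some control plane: fetch what you need up front, refresh opportunistically, and keep serving from the last known-good copy when the control plane is unreachable.

    import time

    class StaticallyStableEndpoints:
        """Cache control-plane data so the data path survives a control-plane outage."""

        def __init__(self, fetch_endpoints, refresh_seconds=60):
            self._fetch = fetch_endpoints            # control-plane call; may fail later
            self._refresh_seconds = refresh_seconds
            self._cached = fetch_endpoints()         # must succeed once, at startup
            self._last_refresh = time.monotonic()

        def endpoints(self):
            if time.monotonic() - self._last_refresh > self._refresh_seconds:
                try:
                    self._cached = self._fetch()
                    self._last_refresh = time.monotonic()
                except Exception:
                    # Control plane is down: keep serving the last known-good
                    # data instead of taking the data path down with it.
                    pass
            return self._cached

The limit is anything that genuinely needs the control plane at runtime - the "schedule more nodes" case above is exactly what this pattern can't save.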
yencabulator•9m ago
Meanwhile, AWS has always marketed itself as "elastic". Not being able to start new VMs in the morning to handle the daytime load will wreck many sites.
Yeul•2h ago
The Internet was supposed to be a communication network that would keep working even if the East Coast was nuked.

What it turned into was Daedalus from Deus Ex lol.

t_sawyer•2h ago
Yeah, because Amazon engineers are hypocrites. They want you to spend extra money for region failover and multi-AZ deploys, but they don't do it themselves.
ajkjk•2h ago
They absolutely do do it themselves..
falcor84•2h ago
What do you mean? Obviously, as TFA shows and as others here pointed out, AWS relies globally on services that are fully dependent on us-east-1, so they aren't fully multi-region.
ajkjk•2h ago
The claim was that they're total hypocrites who aren't multi-region at all. That's totally false; the amount of redundancy in AWS is staggering. But there are foundational parts which, I guess, have been too difficult to do that for (or perhaps they are redundant but the redundancy failed in this case? I dunno)
t_sawyer•2h ago
There are multiple single points of failure for their entire cloud in us-east-1.

I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure on a single region.

ajkjk•2h ago
That's absurd. It's hypocritical to describe best practices as best practices because you haven't perfectly implemented them? Either they're best practices or they aren't. The customers have the option of risking non-redundancy also, you know.
t_sawyer•1h ago
Yes, it's hypocritical to push customers to pay you more money for uptime best practices when you don't follow them yourself, and when your choice not to follow them means the best practices you sold don't fully work.

Hey! Pay us more money so when us-east-1 goes down you're not down (actually you'll still go down because us-east-1 is a single point of failure even for our other regions).

falcor84•2h ago
That's a good point, but I'd just s/Amazon engineers/AWS leadership/, as I'm pretty sure that there are a few layers of management between the engineers on the ground at AWS, those who deprioritise any longer-term resilience work needed (which is a very strategic decision), and those who are in charge of external comms/education about best practices for AWS customers.
ajsnigrutin•2h ago
Luckily, those people are the ones that will be getting all the phone calls from angry customers here. If you're selling resilience and selling twice the service (so your company can still run if one location fails), and it still failed, well... phones will be ringing.
ZeroCool2u•2h ago
They can't even bother to enable billing services in GovCloud regions.
bravetraveler•2h ago
Call me crazy (because this is), but perhaps it's their "Room 641a". The purpose of a system is what it does, no point arguing 'should' against reality, etc.

They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.

voxadam•2h ago
> perhaps it's their "Room 641a".

For the uninitiated: https://en.wikipedia.org/wiki/Room_641A

jf•1h ago
Interesting. Langley isn’t that far away
nevir•1h ago
It's really not that nefarious.

IAD datacenters have forever been the place where Amazon software developers implement services first (well before AWS was a thing).

Multi-AZ support often comes second (more than you think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.

And then other services depend on those services, and may also fall into the same trap.

...and so much of the tech/architectural debt gets concentrated into a single region.

bravetraveler•1h ago
Right, like I said: crazy. Anything production with certain other clouds must be multi-AZ. Both reinforced by culture and technical constraints. Sometimes BCDR/contract audits [zones chosen by a third party at random].
nevir•57m ago
It sure is a blast when they decide to cut off (or simulate the loss of) a whole DC just to see what breaks, I bet :)
bravetraveler•54m ago
The disconnect case was simple: breakage was as expected. The island was lost until we drew it on the map again. Things got really interesting when it was a full power-down and back on.

Were the docs/tooling up to date? Tough bet. Much easier to fix BGP or whatever.

Anon1096•1h ago
It's possible that you really could endure any zone failure. But I take these claims people make all the time with a grain of salt. Unless you're working at AWS scale (basically just 3 companies) and have actually run for years and seen every kind of failure mode, claiming to be higher availability isn't something that can be accurately evaluated.

(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)

bravetraveler•1h ago
Yes, equivalent. Did endure, repeatedly. Demonstrated to auditors to maintain compliance. They would pick the zone to cut off. We couldn't bias the test. Literal clockwork.

I'll let people guess for the sport of it, here's the hint: there were at least 30 of them comprised of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.

pinkmuffinere•49m ago
Just letting you know how this response looks to other people -- Anon1096 raises legitimate objections, and their post seems very measured in their concerns, not even directly criticizing you. But your response here is very defensive, and a bit snarky. Really I don't think you even respond directly to their concerns, they say they'd want to see scale equivalent to AWS because that's the best way to see the wide variety of failure modes, but you mostly emphasize the auditors, which is good but not a replacement for the massive real load and issues that come along with it. It feels miscalibrated to Anon's comment. As a result, I actually trust you less. If you can respond to Anon's comment without being quite as sassy, I think you'd convince more people.
bravetraveler•46m ago
I appreciate the feedback, truly. Defensive and snarky are both fair, though I'm not trying to convince. The business and practices exist, today.

At risk of more snark: Clouds aren't the Death Star, they don't have to have an exhaust port. It's fair the first one does... for a while.

whatever1•45m ago
There are shared resources in different regions. Electricity. Cables. Common systems for coordination.

Your experiment proves nothing. Anyone can pull it off.

bravetraveler•45m ago
The sites were chosen specifically to be more than 50 miles apart; it proved plenty.
whatever1•43m ago
I am the CEO of your company. I forgot to pay the electricity bill. How is the multi-region resilience going?
bravetraveler•42m ago
Fine, the tab increments. Get back to hyping or something, this is not your job.
whatever1•40m ago
I doubt it should be yours if this is how you think about resilience.
bravetraveler•39m ago
Your vote has been tallied
quickthrowman•31m ago
If your accounts payable can’t pay the electric bill on time, you’ve got bigger problems.
icedchai•15m ago
If you go far up enough the pyramid, there is always a single point of failure. Also, it's unlikely that 1) all regions have the same power company, 2) all of them are on the same payment schedule, 3) all of them would actually shut off a major customer at the same time without warning, so, in your specific example, things are probably fine.
jayd16•37m ago
You were in a position to actually cut off production zones with live traffic at Amazon scale and test the recovery?
bravetraveler•35m ago
Yes, it was something we would do to maintain certain contracts. Sounds crazy, isn't: they used a significant portion of the capacity, anyway. They brought the auditors.

Real People would notice/care, but financially, it didn't matter. Contract said the edge had to be lost for a moment/restored. I've played both Incident Manager and SRE in this routine.

edit: Less often we'd do a more thorough test: power loss/full recovery. We'd disconnect more regularly given the simplicity.

stronglikedan•50m ago
Was that competitor priced competitively with AWS? I think of the project management triangle here - good, fast, or cheap - pick two. AWS would be fast and cheap.
bravetraveler•48m ago
Yes, good point. Pricing is a bit higher. As another reply pointed out: there's ~three that work on the same scale. This was one, another hint I guess: it's mostly B2B. Normal people don't typically go there.
gchamonlive•2h ago
Been a while since I last suffered from AWS's arbitrary complexity, but afaik you can only associate certificates with CloudFront if they are generated in us-east-1, so it's undoubtedly a single point of failure for all CDNs if this is still the case.
kokanee•2h ago
I worked at AMZN for a bit and the complexity is not exactly arbitrary; it's political. Engineers and managers are highly incentivized to make technical decisions based on how they affect inter-team dependencies and the related corporate dynamics. It's all about review time.
sharpy•2h ago
I have seen one promo docket get rejected for doing work that is not complex enough... I thought the problem was challenging, and the simple solution brilliant, but the tech assessor disagreed. I mean once you see there is a simple solution to a problem, it looks like the problem is simple...
bdbdkdksk•1h ago
I had a job interview like this recently: "what's the most technically complex problem you've ever worked on?"

The stuff I'm proudest of solved a problem and made money but it wasn't complicated for the sake of being complicated. It's like asking a mechanical engineer "what's the thing you've designed with the most parts"

arethuza•1h ago
I was once very unpopular with a team of developers when I pointed out a complete solution to what they had decided was an "interesting" problem - my solution didn't involve any code being written.
SoftTalker•1h ago
I suppose it depends on what you are interviewing for but questions like that I assume are asked more to see how you answer than the specifics of what you say.

Most web jobs are not technically complex. They use standard software stacks in standard ways. If they didn't, average developers (or LLMs) would not be able to write code for them.

mboerwink•1h ago
I think this could still be a very useful question for an interviewer. If I were hiring for a position working on a complex system, I would want to know what level of complexity a prospect was comfortable dealing with.
gchamonlive•1h ago
That's what arbitrary means to me, but sure, I see no problem calling it political too
AtlasBarfed•1h ago
Forced attrition rears its head again
xbar•2h ago
This set of facts comes to light every 3-5 years when US-East-1 has another failure. Clearly they could have architected their way out of this blast radius problem by now, but they do not.

Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?

firesteelrain•2h ago
It’s probably because there is a lot of tech debt, plus look at where it is - Virginia. It shouldn’t take much imagination to figure out why that is strategic
dsr_•1h ago
They could put a failover site in Colorado or Seattle or Atlanta, handling just their infrastructure. It's not like the NSA wouldn't be able to backhaul from those places.
knotimpressed•1h ago
You mean the surveillance angle as the reason for it being in Virginia?
AtlasBarfed•1h ago
What is the motivation of an effective Monopoly to do anything?

I mean look at their console. Their console application is pretty subpar.

cyberax•44m ago
AWS _had_ architected away from single-region failure modes. There are only a few services that are us-east-1 only in AWS (IAM and Route53, mostly), and even they are designed with static stability so that their control plane failure doesn't take down systems.

It's the rest of the world that has not. For a long time companies just ran everything in us-east-1 (e.g. Heroku), without even having an option to switch to another region.

MichaelZuo•38m ago
The parent seems to be implying there is something in us-east-1 that could take down all the various regions?
api•2h ago
My contention for a long time has been that cloud is full of single points of failure (and nightmarish security hazards) that are just hidden from the customer.

"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"

The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.

The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.

raw_anon_1111•1h ago
You act as if that is a bug, not a feature. Hypothetically, as someone who is responsible for my site staying up, I would much rather blame AWS than myself. Besides, none of your customers are going to blame you if every other major site is down.
unethical_ban•32m ago
As someone who hypothetically runs a critical service, I would rather my service be up than down.
raw_anon_1111•14m ago
And you have never had downtime? If your data center went down - then what?
tredre3•13m ago
> Hypothetically, as someone who is responsible for my site staying up, I would much rather blame AWS than myself.

That's a very human sentiment, and I share it. That's why I don't swap my car wheels myself, I don't want to feel responsible if one comes loose on the highway and I cause an accident.

But at the same time it's also appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The outage being "their" fault really shouldn't excuse that fact. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided that one provider is enough, then your provider isn't solely at fault here, and as a profession I wish we'd take more accountability.

aurareturn•1h ago
Or it is a matter of efficiency. If 1 million companies design and maintain their servers, there would be 1 million (or more) incidents like these. Same issues. Same fixes. Not so efficient.
SoftTalker•1h ago
It might be worse in terms of total downtime, but it likely would be much less noticeable, as it would be scattered individual outages, not everyone down at the same time.
qaq•1h ago
Even if us-east-1 were a normal region, there is not enough spare capacity in other regions to take up all the workloads from us-east-1, so it's a moot point.
nevir•1h ago
It also doesn't help that most companies using AWS aren't remotely close to multi-region support, and that us-east-1 is likely the most populated region.
einrealist•1h ago
It sounds like they want to avoid split-brain scenarios as much as possible while sacrificing resilience. For things like DNS, this is probably unavoidable. So, not all the responsibility can be placed on AWS. If my application relies on receipts (such as an airline ticket), I should make sure I have an offline version stored on my phone so that I can still check in for my flight. But I can accept not being able to access Reddit or order at McDonald's with my phone. And always having cash at hand is a given, although I almost always pay with my phone nowadays.

I hope they release a good root cause analysis report.

masfuerte•58m ago
Amazon are planning to launch the EU Sovereign Cloud by the end of the year. They claim it will be completely independent. It may be possible then to have genuine resiliency on AWS. We'll see.
louthy•43m ago
Then it will be eu-east-1 taking down the EU
samcat116•27m ago
This is the difference between “partitions” and “regions”. Partitions have fully separate IAM, DNS names, etc. This is how there are things like US Gov Cloud, the Chinese AWS cloud, and now the EU sovereign cloud
belter•48m ago
> Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.

Well, it did for me today... Don't use us-east-1 explicitly, just other regions, and I had no outage today... (I get the point about the skeletons in the closet of us-east-1... maybe the power plug goes via Bezos' wood desk?)

thayne•38m ago
There are hints at it in their documentation. For example, ACM certs for CloudFront and KMS keys for Route53 DNSSEC have to be in the us-east-1 region.
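
For example, a certificate destined for CloudFront has to be requested through ACM in us-east-1, no matter where the rest of the stack runs - a small boto3 sketch (the domain is a placeholder):

    import boto3

    # CloudFront only accepts ACM certificates issued in us-east-1, so this client
    # is pinned to that region even if everything else is deployed elsewhere.
    acm = boto3.client("acm", region_name="us-east-1")

    response = acm.request_certificate(
        DomainName="www.example.com",   # placeholder domain
        ValidationMethod="DNS",
    )
    # The returned ARN is an arn:aws:acm:us-east-1:... resource, which is the
    # only kind CloudFront will attach to a distribution.
    print(response["CertificateArn"])
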
helsinkiandrew•3h ago
> The incident underscores the risks associated with the heavy reliance on a few major cloud service providers.

Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup

martypitt•3h ago
A bit meta - but, what is faun.dev? I visited their site - it looks like a very very slow (possibly because of the current outage?), ad-funded Reddit / HN clone?

But, in its sidebar of "Trending technologies", it lists "Ansible" and "Jenkins", which, while both great, I doubt are trending currently.

Curious what this is?

Maxion•2h ago
OP is the creator of faun.dev. Seems to just be yet another tech news site.
darkwater•1h ago
OP who has more submissions than comments. And all the submissions are for either this faun.dev or thechief.io
sofixa•2h ago
> "Ansible" and "Jenkins" .. which while are both great

I would strongly argue that there is nothing great about Jenkins. It's an unholy mess of mouldy spaghetti that can sometimes be used to achieve a goal, but is generally terrible at everything. Shit to use, shit to maintain, shit to secure. It was the best solution because of a lack of competition 20 years ago, but hasn't been relevant or anywhere near the top 50 since any competition appeared.

The fact that to this very day, nearing the end of 2025, they still don't support JWT identities for runs is embarrassing. Same goes for VMware vSphere.

lunias•1h ago
The design immediately weirded me out, felt strange. Where are they sourcing this information? Is this an AI summary of the BBC live news feed linked in "Further Reading"?
add-sub-mul-div•55m ago
I'd guess self-promoted slop, which is becoming the norm here.
GoatInGrey•41m ago
It's a "vibe-engineered" app. I find it both sad and hilarious just how quickly one can find slop with these.

https://i.ibb.co/Lzgf34mb/Screenshot-20251020-080828.png

Also, this is the exact CSS style that Claude uses whenever I have it program web elements (typically bookmarklet UIs).

fsto•3h ago
Ironically, the HTTP request to this article timed out twice before a successful response.
aeon_ai•2h ago
It's not DNS

There's no way it's DNS

It was DNS

ajross•1h ago
It's "DNS" because the problem is that at the very top of the abstraction hierarchy in any system is a bit of manual configuration.

As it happens, that naturally maps to the bootstrapping process on hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.

But it's the inevitability of the manual process that's the issue here, not the technology. We're at a spot now where the rest of the system reliability is so good that the only things that bring it down are the spots where human beings make mistakes on the tiny handful of places where human operation is (inevitably!) required.

allarm•40m ago
> hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.

Unless DNS configuration propagates over DHCP?

ajross•3m ago
DHCP can only tell you who the local DNS server is. That's not what's failed, nor what needs human configuration.

At the top of the stack someone needs to say "This is the cluster that controls boot storage", "This is the IP to ask for auth tokens", etc... You can automatically configure almost everything but there still has to be some way to get started.

jamesbelchamber•2h ago
This website just seems to be an auto-generated list of "things" with a catchy title:

> 5000 Reddit users reported a certain number of problems shortly after a specific time.

> 400000 A certain number of reports were made in the UK alone in two hours.

ktosobcy•2h ago
Uhm... E(U)ropean sovereignty (and in general spreading the hosting as much as possible) is needed ASAP…
BirAdam•2h ago
Well, except for a lot of business leaders saying that they don't care if it's Amazon that goes down, because "the rest of the internet will be down too."

Dumb argument imho, but that's how many of them think ime.

tjwebbnorfolk•2h ago
because... EU clouds don't break?

https://news.ycombinator.com/item?id=43749178

Cthulhu_•55m ago
Nah, because European services should not be affected by a failure in the US. Whatever systems they have running in us-east-1 should have failovers in all major regions. Today it's an outage in Virginia, tomorrow it could be an attack on undersea cables (which I'm confident are mined and ready to be severed at this point by multiple parties).
bilekas•2h ago
These things happen when profits are the measure of everything. Change your provider, but if their number doesn't go up, they won't be reliable.

So your complaints don't matter, because "number go up".

I remember the good old days of everyone starting a hosting company. We never should have left.

g-b-r•2h ago
Can we not promote this AI-generated "article" on that banner-ridden site?

Previous discussions:

https://news.ycombinator.com/item?id=45640754

https://news.ycombinator.com/item?id=45640772

https://news.ycombinator.com/item?id=45640827

https://news.ycombinator.com/item?id=45640838

https://news.ycombinator.com/item?id=45641143

mrbluecoat•2h ago
> due to an "operational issue" related to DNS

Always DNS..

rose-knuckle17•1h ago
AWS had an outage. Many companies were impacted. Headlines around the world blame AWS. The real news is how easy it is to identify companies that have put cost management ahead of service resiliency.

Lots of orgs operating wholly in AWS, and sometimes only within us-east-1, had no operational problems last night. Some of that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).

Overall, those companies that had operational problems likely wouldn't have invested in resiliency in any other deployment strategy either. It could have happened to them in Azure, GCP, or even a home-rolled datacenter.

aurareturn•1h ago
Redundancy is insanely expensive especially for SaaS companies where the biggest cost is cloud.

Are customers willing to pay companies for that redundancy? I think not. A three-hour outage once every few years is fine for non-critical services.

skopje•8m ago
>> Redundancy is insanely expensive especially for SaaS companies

That right there means the business model is fucked to begin with. If you can't have a resilient service, then you should not be offering that service. Period. Solution: we were fine before the cloud, just a little slower. No problem going back to that for some things. Not everything has to be just-in-time at the lowest possible cost.

dehrmann•35m ago
Three nines might be good enough when you're Fortnite. Probably not when you're Robinhood.
LightBug1•1h ago
Remember when the "internet will just route around a network problem"?

FFS ...

bgwalter•1h ago
Probably related:

https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...

"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."

Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.

renegade-otter•1h ago
If we see more of this, it would not be crazy to assume that all this compelling of engineers to "use AI" and the flood of Looks Good To Me code is coming home.
Cthulhu_•57m ago
Big if; major outages like this aren't unheard of, but so far they're fairly uncommon. Definitely hit harder than their SLAs promise, though. I hope they do an honest postmortem, but I doubt they would blame AI even if it was somehow involved. Not to mention you can't blame AI unless you go completely hands-off - but that's like blaming an outsourcing partner, which also never happens.
nivekney•1h ago
Wait a second, Snapchat impacted AGAIN? It was impacted during the last GCP outage.
menomatter•1h ago
What are the design best practices and industry standards for building on-premise fallback capabilities for critical infrastructure? Say for healthcare/banking, etc.
skopje•6m ago
A relative of mine lived and worked in the US for Oppenheimer Funds in the 1990s, and they had their own datacenters all over the US, with multiple redundancy for weather or war. But every millionaire feels entitled to be a billionaire now, so all of that cost was rolled into a single point of cloud failure.

Reminds me of a great Onion tagline:

"Plowshare hastily beaten back into sword."

thinkindie•52m ago
Today’s reminder: multi-region is so hard even AWS can’t get it right.
rkharsan64•50m ago
This website is just AI slop; the real reporting is in the BBC page linked at the end.

Photos and numbers seem to be stolen straight from it.

GoatInGrey•35m ago
For anyone willing to take the time and provide BBC with your personal contact and address information, you can file a "complaint" at: https://www.bbc.co.uk/contact/complaints
draxil•50m ago
I was just about to post that it didn't affect us (heavy AWS users, in eu-west-1). Buut, I stopped myself because that was just massively tempting fate :)
mannyv•49m ago
This is why we use us-east-2.
1970-01-01•35m ago
Someone, somewhere, had to report that doorbells went down because the very big cloud did not stay up.

I think we're doing the 21st century wrong.

Johnny555•10m ago
My Ring doorbell works just fine without an internet connection (or during a cloud outage). The video storage and app notifications are another matter, but the doorbell itself continues to ring when someone pushes the button.
bicepjai•19m ago
Is this the outage that took Medium down?