But I was supposed to be commuting, so I guess I'll do that.
Then I tried various down-detector sites, and they didn't seem to work either - presumably due to Cloudflare.
It's probably related to the recent DDoS attacks they helped mitigate.
You know how you measure eternity?
When you finish learning German.
Perfect.
The hilarious part of the whole story is that the same PMs and product managers were (and I cannot overemphasize this enough) absolutely militant orthodox agile practitioners with jira.
Even so, you should have policies in place to mitigate such eventualities, so that the incompetence gets channeled into fixing systemic issues instead. The larger the company, the less acceptable these failures become. "Lessons learned" is a better excuse for a shake-and-break startup than for an established player that can pay to be secure.
At some point, the finger has to be pointed. Personally, I don't dread it pointing elsewhere. Just means I've done my due D and C.
What about all the other systems and people suffering elsewhere in the World?
Maybe "Erleichterung" (relief)? But as a German "Schadenserleichterung" (also: notice the "s" between both compound word parts) rather sounds like a reduction of damage (since "Erleichterung" also means mitigation or alleviation).
You gain relief, but you don't exactly derive pleasure as it's someone you know that's getting the ass end of the deal
I'd love to know more about what those specific circumstances were!
But on a personal level, this is like ordering something at a restaurant and the cook burning down the kitchen because they forgot to take your pizza out of the oven or something.
I would be telling it to everyone over beers (but not my boss).
"A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. These customer configuration changes themselves were valid and non-malicious – however they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. This defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions."
And not a lawsuit? Cause I've read more about that kind of reaction than of job offers. Though I guess lawsuits are more likely to be controversial and talked about.
I mean, with Cloudflare's recent (lack of) uptime, I would argue there's a degree of crashflation happening, such that there's less prestige in doing so. Nowadays, if a lawnmower drives by Cloudflare and backfires, that's enough to collapse the whole damn thing.
Or were you purposefully going out of your way to perpetrate performative ignorance and transphobic bullying, just to let everyone know that you're a bigoted transphobic asshole?
I don't buy that it was an innocent mistake, given the context of the rest of the discussion, and your pretending to know her family better than the poster you were replying to and everyone else in the discussion, falsely denying her credit for her own work. Do you really think dang made the Hacker News header black because he and everyone else was confused and you were right?
Do you like to show up at funerals of people you don't know, just to interrupt the eulogy with insults, stuff pennies up your ass (as you claim to do), then shit and piss all over the coffin in front of their family and friends?
How long did you have to wait until she died before you had the courage to deadname, misgender, and punch down at her in a memorial, out of hate and cowardice and a perverse desire to show everyone what kind of a person you really are?
Next time, can you at least wait until after the funeral before committing your public abuse?
https://news.ycombinator.com/item?id=45975524
amypetrik8 13 hours ago [flagged] [dead] | on: Rebecca Heineman has died
The work you're outlining here is was performed by "Bill Heineman" - maybe you are mixing up Bill with his sister Rebecca?!?
When aliens study humans from this period, their book of fairy tales will include several where a terrible evil was triggered by a config push.
https://statusfield.com/status/cloudflare https://statusgator.com/services/cloudflare
EDIT: And it's back up.
EDIT EDIT: And it's back down lol
Edit: and then back down again
I wonder if it has anything to do with the replicate.com purchase? Probably not.
In many ways it's still true, but it doesn't feel like a given anymore.
Additionally, it looks like Pingdom/Solarwinds authentication is affected too - not a great look for a service in that category.
Feels like half the internet is down.
Cannot log in to get to Workers to check - auth errors.
I thought this was the point of a cached CDN!
Blame the user or just leave them at an infinite spinning circle of death.
I check the network tab and find the backend is actually returning a reasonable error but the frontend just hides it.
Most recent one was a form saying my email was already in use, when the actual backend error returned was that the password was too long.
Funny, since I would have to prove to an AI that I am human in the first place.
Your browser: Working
Host: Working
Cloudflare: Error
What can I do?
Please try again in a few minutes.
I have Cloudflare running in production and it is affecting us right now. But at least I know what is going on and how I can mitigate (e.g. disable Cloudflare as a proxy if it keeps affecting our services at skeeled).
> If the problem isn’t resolved in the next few minutes, it’s most likely an issue with the web server you were trying to reach.
/s
Even just the basic question of "are we down or is our monitoring system just having issues" requires a human. And it's never "are we down", because these are distributed systems we're talking about.
If service X goes down entirely, does that warrant a status page update? Yes? Turns out service X is just running ML jobs in the background and has no customer impact.
If service Z's p95 response latency jumps from 10ms to 1500ms for 5 minutes, 500s spike at the same time, but overall 200s rate is around 98%, are we down? is that a status page update? Is that 1 bad actor trying to cause issues? Is that indicative of 2,000 customers experiencing an outage and the other 98,000 operating normally? Is that a bad rack switch that's causing a few random 500s across the whole customer base and the service will reject that node and auto-recover in a moment?
At the same time I'm worried about how the internet is becoming even more centralized, which goes against how it was originally designed.
So which large services do we have left that could take a chunk of the internet out?
HAHA!
Our servers are still down, though
Maybe I'll do both
Globally meaningful outages of either are quite rare.
https://www.cloudflarestatus.com/history?page=8
https://www.cloudflarestatus.com/history?page=7
https://www.cloudflarestatus.com/history?page=6
https://www.cloudflarestatus.com/history?page=5
https://www.cloudflarestatus.com/history?page=4
https://www.cloudflarestatus.com/history?page=3
All the major cloud providers have regular incidents. Most go unnoticed, because they’re small or short.
The really big AWS ones go on https://aws.amazon.com/premiumsupport/technology/pes/
When Cloudflare goes down: Oh no
Lister: What's the damage Hol?
Holly: I don't know. The damage report machine has been damaged. https://www.penguinrandomhouse.ca/books/661/mostly-harmless-...
Happily, the small ones that I also use are still going without anyone apparently even noticing. At least, the subject has yet to reach their local timelines at the time that I write this.
2 of the other major U.K. nodes are still up, too.
Like everywhere it is mostly bots.
Look at the HN front page: there used to be 1-2 Twitter posts per day. Now it is barely one per week. And even those are usually just from two accounts (Karpathy and Carmack).
Everyone laughs when Azure collapses too
I rushed to Hacker News, but it was too early. Clicking on "new" did the job of finding this post before it made it to the homepage :)
The web is still alive!
edit: It now says "Cloudflare Global Network experiencing issues" but it took a while.
Not downplaying the immense work of infra / engineering at this scale but my neighborhood local grocery market shouldn’t be down
I run a small video game forum with posts going back to 2008. We got absolutely smashed by bots scraping for training data for LLMs.
So I put it behind Cloudflare and now it's down. Ho hum.
https://i.ibb.co/qHCJyY7/image.png
I wrote the below to explain to our users what was happening, so apologies if the language is too simple for an HN reader.
- 0630, we switched our DNS to proxy through CF, starting the collection of data, and implemented basic bot protections
- Unfortunately, whatever anti-bot magic they have isn't quite having the intended effect, even after two hours.
- 0830, I sign in and take a look at the analytics. It seems like <SITE NAME> is very popular in Vietnam, Brazil, and Indonesia.
- 0845, I make it so users from those countries have to pass a CF "challenge". This is similar to a CAPTCHA, but CF try to make it so there's no "choosing all the cars in an image" if they can help it.
- So far 0% of our Asian audience have passed a challenge.
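For reference, the country-based challenge in the 0845 step can be expressed with Cloudflare's Rules language. Below is a rough Python sketch of creating such a rule through the older Firewall Rules API; the endpoint, payload shape, and token scope are assumptions to verify against current Cloudflare docs, and ZONE_ID/API_TOKEN are placeholders.

```python
# Hedged sketch: challenge visitors from specific countries via Cloudflare's
# (older) Firewall Rules API. The endpoint and payload shape may have been
# superseded by the newer Rulesets API -- verify against current docs.
import requests

ZONE_ID = "your-zone-id"      # placeholder
API_TOKEN = "your-api-token"  # placeholder; needs permission to edit firewall rules

rule = {
    "description": "Challenge traffic from VN/BR/ID",
    "action": "managed_challenge",
    # Cloudflare Rules language expression; country codes are ISO 3166-1 alpha-2.
    "filter": {"expression": '(ip.geoip.country in {"VN" "BR" "ID"})'},
}

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/firewall/rules",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=[rule],  # this endpoint accepts a list of rules
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

The same rule can also be clicked together in the Cloudflare dashboard; the expression string is the part that carries the logic.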
I will say one very appealing use of Anubis I'd love to try is using it as a Traefik middleware to protect services running in docker containers.
Edit: To answer my own question, yes: http://www.arijuels.com/wp-content/uploads/2013/09/JB99.pdf
Edit 2: Maybe TLS would be another reasonable place for it?
As a side note, what does your site do that it's possible to use up all server resources? Computers are stupid fast these days. I find it's really difficult to build something that doesn't scale to at least multiple hundreds of requests per second.
The other part is just how convenient it is with CF. Easy to configure, plenty of power and cheap compared to the other big ones. If they made their dashboard and permission-system better (no easy way to tell what a token can do last I checked), I'd be even more of a fan.
If Germany's Telekom were forced to peer at DE-CIX, I'd always use CF. Since it isn't, and CF doesn't pay for peering, it's a hard choice for Germany but an easy one everywhere else.
Hetzner has the WEAKEST DDoS protection out of ANYTHING out there - Arbor sucks.
Send me your website url and I'll keep it down for DAYS and whenever you cry to hetzner I'll just fry it again, it's that easy and that's why they're the cheapest - because everyone ran away from them back then.
And yet my website is still up today, and has not been down for years.
The first time we switched to Cloudflare which saved us. Even with Cloudflare, the DDoS attempts are still damaging (the site goes down, we use Cloudflare to block the endpoints they're targeting, they change endpoints, etc.) but manageable. Without Cloudflare or something like it, I think it's possible that we'd be out of business.
How?
The VPS I use will nuke your instance if you run a game server. Not due to resource usage, but because it attracts DDoS like nothing else. Ban a teen for being an asshole and expect your service to be down for a week. And there isn't really Cloudflare for independent game servers. There's Steam Networking but it requires the developer to support it and of course Steam.
Valve's GDC talk about DDoS mitigation for games: https://youtu.be/2CQ1sxPppV4
And yet game servers still work fine. Which answers this subthread's question ("how likely is it to get DDoSed if you don't have Cloudflare"), answer: not very likely, it happens once in a while at most.
So why be on Cloudflare to start with? Well, if you have a more reliable way then there's no reason. If you have a less reliable way, then you're on average better off with Cloudflare.
As for websites which don't need Cloudflare: in my experience, almost every website will be DDoS-attacked from time to time.
And why should I overthink my architecture now? If I had to manage redundant systems and keep track of circular dependencies, I could just keep managing my infra the old way, no?
I'm being sarcastic here, obviously, but really, one of the selling points for the cloud back in the day was "you don't have to care about those details". You just need to care about other details now.
The place I work at has been online since 1996, not even a DoS yet, let alone a DDoS. Though we now use CF to filter all that bot traffic.
That is true. It is also the problem. It means the biggest providers do not even need to bother being reliable, because everyone will use them anyway.
But this is not really the case. When Azure/AWS went down, same as now with Cloudflare: a significant amount of the web was down, but most of it was not. It just makes it more obvious which provider you use.
The issue is DNS since DNS propagation takes time. Does anyone have any ideas here?
Only if you're doing very basic proxy stuff. If you stack multiple features and maybe even start using workers, there may be no 1:1 alternatives to switch to. And definitely not trivially.
There are other Cloudflare products for which there are not many alternatives (Durable Objects, Workflows, etc.), but at least for us, we don't use them in the critical path. We deliberately avoided them there because we knew we'd have to set up multi-cloud for 99.999% uptime (we run a POS system, so any downtime results in angry calls and long lines for our merchants).
You think we have a say in this?
If the internet was always a nice place we wouldn't need Cloudflare and similar :(
However, https://www.cloudflarestatus.com/ does not really mention anything relevant. What's the point of having a status page if it lies?
Update: Ah, I just checked the status page and now I get a big red warning (though the problem existed for about 15 minutes before 11:48 UTC):
> Investigating - Cloudflare is aware of, and investigating an issue which potentially impacts multiple customers. Further detail will be provided as more information becomes available. Nov 18, 2025 - 11:48 UTC
Status pages are basically marketing crap right now. The same thing happened with Azure where it took at least 45 minutes to show any change. They can't be trusted.
What is the lie?
> Cloudflare Global Network experiencing issues
Cloudflare has a specific service named "Network", and it's having issues...
For 15 minutes Cloudflare wasn't working and the status page didn't mention anything. Yes, right now the status page mentions the serious network problem, but for some time our pages were not working and we didn't know what was happening.
So for ~15 minutes the status page lied. The whole point of a status page is to not lie, i.e. to be updated automatically when there are problems, not by a person who needs to get clearance on what and how to write.
RIP to the engineers fixing this without any AI help.
Concerning though how much the web relies on one (great) service.
>Cloudflare is aware of, and investigating an issue which potentially impacts multiple customers. Further detail will be provided as more information becomes available.
>Posted 4 minutes ago
Edit: and down again a third time!
> Investigating - Cloudflare is aware of, and investigating an issue which potentially impacts multiple customers. Further detail will be provided as more information becomes available.
Things are back up (a second time) for me.
Cloudflare have now updated their status page to reflect the problems. It doesn't sound like they are confident the problem is fully fixed yet.
Edit: and down again a third time!
Yeah, those "multiple customers" are like 70% of the internet.
The internet is officially down.
EDIT: It would appear it is still unreliable in these countries, it just stopped working in France for me.
How are others doing this? How is Hacker News hosted/protected?
Things like Apple Private Relay (which way too many people seem to have enabled) are tunnelled via Cloudflare, maybe using WARP?
Your physical servers would have similar issues if you put a CDN in front, unless the physical server is able to achieve 100% uptime (100% * 3 nines = 3 nines). Or you don't have a CDN but can be trivially knocked offline by the tiniest botnet (or even by hitting the Hacker News front page).
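To spell out that parenthetical: availabilities of components in series multiply, so putting a CDN in front can only hold or lower the ceiling set by the weaker layer. A quick illustrative calculation (the numbers are made up):

```python
# Availabilities of serial components multiply: a request must traverse
# both the CDN and the origin. Numbers below are illustrative only.
cdn = 0.999  # "three nines" CDN
for origin in (1.0, 0.999):
    combined = cdn * origin
    downtime_hours = (1 - combined) * 365 * 24
    print(f"origin={origin:.3f}  combined={combined:.6f}  ~{downtime_hours:.1f} h downtime/year")
# origin=1.000  combined=0.999000  ~8.8 h downtime/year
# origin=0.999  combined=0.998001  ~17.5 h downtime/year
```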
And yes, I know that there's sites that need the scale of an operation like Cloudflare or AWS. But 99.9(...)% of pages don't, and people should start realizing that.
[0] (German) https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Cyber-Si...
DDoS mitigation is one of those areas that an on-prem solution just isn’t well suited to solve.
The system isn't designed for technical, rational decision making.
An instant switch means really high cost (two contracts), plus maintaining and testing it regularly. It also means you are limited on advanced features and most likely stick to basic ones, reducing your ROI.
Most people are OK with a switch that would take days to weeks: reimplementing basic stuff during the initial migration, then iterating on more advanced features. You run the risk of being down for hours to days. The cost of 2N or 2N+1 vendors is just too high to justify it.
When even Cloudflare goes down, nobody can blame the little guys.
My domain is registered with cloudflare so I'm 100% helpless to get things back online.
I can't edit DNS records to bypass cloudflare and I can't change nameservers either.
Then I was like... "when was the last time I flew for 10+ hours and wanted to do programming, etc., such that I needed offline docs?" So I gave up.
Today I can't browse the libs' docs quickly, so I'm resuming the work on my local mirroring :-)
A global upstream provider :)
Not-so-funny thing is that the Betterstack dashboard is down but our status page hosted by Betterstack is up, and we can't access the dashboard to create an incident and let our customers know what's going on.
Edit: wording.
Update: our app is available again without Cloudflare, you'll be able to post updates to status pages smoothly again.
If there is a slight positive note to all this, then it is that these outages are so large that customers usually seem to be quite understanding.
It's monthly by now
I think at the very least, one should plan the ability to switch to an alternative when your main choice fails… which together with AWS and GitHub is a weekly event now.
(Note: Zero negative sentiment towards imgur here)
Most of the time when I get DDoSed now, it's either Facebook directly, something-something Azure, or some random AI.
And lots of real users time wasted for captchas.
Does it make sense? Nah. But is it part of the weird reality we live in? Looks like it.
I have no way of contacting Facebook. All I can do is keep complaining on Hacker News whenever the topic arises.
Edit: Oh, and I see the same with Azure, but there I have no list of IPs to verify it's official; it just looks like it.
If I choose something else, we're down, and our competitors aren't, then my overlords will start asking a lot of questions.
CF can be just as difficult, if not more so, to migrate off of, especially when using things like Durable Objects.
> this is tenable as long as these services are reliable
Do you hear yourself? This is supposed to be a distributed CDN. Imagine if HTTP had 30 minutes of downtime a year.
And judging by the age of the HN post, we're now past minute 60 of this incident.
Huh? It's been back up during most of this time. It was up and then briefly went back down again but it's been up for a while now. Total downtime was closer to 30 minutes
Tbh though, this is sort of all the other companies' fault: "everyone" uses AWS and CF, and so others follow. Now not only are all your chicks in one basket, so is everyone else's. When the basket inevitably falls into a lake...
Providers need to be more aware of their global impact in outages, and customers need to be more diverse in their spread.
So you think the problem is they aren't "aware"?
Outages happen, code changes occur; but you can do a lot to prevent these things on a large scale, and they simply don't.
Where is the A/B deployment preventing a full outage? What about internally: where was the validation before the change? Was the testing run against a prod-like environment, or against something that once resembled prod but hasn't in forever?
They could absolutely mitigate impacting the entire global infra in multiple ways, and haven't, despite their many outages.
Not really sure how our community is supposed to deal with this.
I have always felt so, but my opinion is definitely in the minority.
In fact, I find that folks have extremely negative responses to any discussion of improving software Quality.
A large proportion of “developers” enjoy build vs buy arguments far too much.
Now that we have an abundance of compute and most people run devices more powerful than the devices that put man on the moon, it's easier than ever to make app bloat, especially when using a framework like Electron or React Native.
People take it personally when you say they write poor quality software, but it's not a personal attack, it's an observation of modern software practices.
And I'm guilty of this, mainly because I work for companies that prioritize speed of development over quality of software, and I suspect most developers are in this trap.
The typical argument that I see is homemade encryption, which is quite valid.
However, encryption is just a tiny corner of the surface.
Most folks don’t want to haul in 1MB of junk, just so they can animate a transition.
Well, I guess I should qualify that: Most normal folks wouldn't want to do that, but, apparently, it's de rigueur for today's coders.
Which only shows that chasing five 9s is worthless for almost all web products. The idea is that by relying on AWS or Cloudflare you can push your uptime numbers up to that standard, but these companies themselves are having such frequent outages that customers no longer expect that kind of reliability from web products.
"Yes, but what if they go down?" - it doesn't matter; having it hosted by someone who can be down for the same reason as your main product/service is a recipe for disaster.
It took a few minutes but I got https://hcker.news off of it.
But then, that’s what Cloudflare signed up to be.
I also can't log in via Google SSO since Cloudflare's SSO service is down.
/s
Of course, on the other hand, I know that relying on Cloudflare certs is basically inviting a MITM attack.
Use Caddy. I never worry about certs.
The setup appears very simple in Caddy - amazingly simple, honestly. I'm going to give it a good try.
It's one of the more controversial parts of the business: it makes the fact that the traffic is unencrypted on public networks invisible to the end user.
If you host images as well, your bandwidth costs might skyrocket.
I was pretty much forced into putting a site I'm managing for a client behind cloudflare due to the above-mentioned issue.
But turns out that's fine :).
I'm still confused. Does this mean that HN switches CF on or off in response to recent volume of bot traffic?
Those TikTok AI crawlers were destroying some of my sites.
Millions of images served to ByteSpider bots, over and over again. They wouldn't stop. It was relentless abuse. :-(
Now I've just blocked them all with CF.
Yeah, they for sure let nothing through right now. ;)
At times like this, I'm really glad we self-hosted.
But for a small operation, AKA just me, it's one more thing for me to get my head around and manage.
I don't run just one website or one service.
It's 100s of sites across multiple platforms!
Not sure I could ever keep up playing AI Crawler and IP Whack-A-Mole!
Blocking ASNs is one step of the fight, but unfortunately it's not the solution.
Yes, they are really hard to block. In the end I switched to Cloudflare just so they can handle this mess.
Probably more effective would be to get the bots to exclude your IP/domain. I do this for SSH, leaving it open on my public SFTP servers on purpose. [1] If I can get 5 bot owners to exclude me, that could be upwards of 250k+ nodes, mostly mobile IPs, that stop talking to me. Just create something that confuses and craps up the bots. With SSH bots this is trivial, as most SSH bot libraries and code are unmaintained and poorly written to begin with. In my SSH example, look for the VersionAddendum. Old versions of ssh, old ssh libraries, and code that tries to implement SSH itself will choke on a long banner string. Not to be confused with the text banner file.
I'm sure the clever people here could make something similar for HTTPS and especially for GPT/LLM bots at the risk of being flagged "malicious".
[1] - https://mirror.newsdump.org/confuse-some-ssh-bots.html
About 90%+ of bots cannot visit this URL - nor can real people who have disabled HTTP/2.0 in their browser.
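If you'd rather not touch a real sshd, the same idea can be approximated with a standalone decoy that just sends an overlong identification line. This is my own sketch, not the linked author's setup; the port and banner contents are arbitrary.

```python
# Sketch: a standalone SSH "tarpit" that greets clients with an overlong
# identification string. RFC 4253 caps the identification line at 255 bytes,
# so naive/unmaintained bot SSH implementations tend to choke on it.
# This is a decoy only -- it never speaks real SSH. Run it on a spare port.
import socketserver

LONG_BANNER = b"SSH-2.0-OpenSSH_9.9 " + b"x" * 1024 + b"\r\n"  # well past the 255-byte limit

class DecoyHandler(socketserver.BaseRequestHandler):
    def handle(self):
        try:
            self.request.sendall(LONG_BANNER)
            self.request.recv(256)  # let the bot reveal itself, then hang up
        except OSError:
            pass

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("0.0.0.0", 2222), DecoyHandler) as srv:
        srv.serve_forever()
```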
You realize it was possible to block bad actors before Cloudflare, right? They just made it easier; they didn't make it possible in the first place.
And my image CDN blocked ByteSpider for me.
For a while I also blocked the entirety of Singapore due to all the bots coming out of AWS over there!
But it's honestly something I just dont need to be thinking about for every single site I run across a multitude of platforms.
Having said that, I will now look at the options for the business critical services I operate for clients!
The cost of hardware and software resources these days is absolute peanuts compared to 10 years ago. Cloud services and APIs have also made managing them trivial as hell.
Cloudflare is simply an evolution in response to the other side also having evolved greatly - both legitimate and illegitimate users.
edit: I guess I understand "AI bots scraping sites for data to feed LLM training" but what about the image serving?
The image scraping bots are training for generative AI, I'm assuming.
As to why they literally scrape the same images hundreds of thousands of times?
I have no idea!
But I am not special, the bots have been doing it across the internet.
My main difference from other sites is that I operate a tourism-focused SaaS for local organisations and government tourist boards, which means we have a very healthy number of images being served per page across our sites.
We also do on-the-fly transformations for responsive images and formats, which is all done through Cloudinary.
The Bytespider bot (Bytedance / TikTok) was the one that was being abusive for me.
1. DDoS protection is not the only thing anymore; I use Cloudflare because of the vast amounts of AI bots from thousands of ASNs around the world crawling my CI servers (bloated Java VMs on very undersized hosts) and bringing them down (granted, I threw Cloudflare onto my static sites as well, which was not really necessary - I just liked their analytics UX).
2. The XKCD comic is misinterpreted there: that little block is small because it's a "small open source project run by one person"; Cloudflare is the opposite of that.
3. Edit: also, Cloudflare is awesome if you are migrating hosts. I did a migration this past month; you point Cloudflare to the new servers and it's instant DNS propagation (since you didn't propagate anything :) ).
"Only" 10% of the internet is behind Cloudflare so far ;)
I am curious about these two things:
1- Has GCP also had any outages recently, similar to AWS, Azure, or CF? If a similar-size (14 TB?) DDoS were to hit GCP, would it stand or would it fail?
2- If this DDoS was targeting Fly.io, would it stand? :)
Apparently prisma's `npm exec prisma generate` command tries to download "engine binaries" from https://binaries.prisma.sh, which is behind... guess what...
So now my CI/CD is broken, while my production env is down, and I can't fix it.
Amazing lol
[1] https://totalrealreturns.com/
[2] https://status.heyoncall.com/svg/uptime/zCFGfCmjJN6XBX0pACYY...
BetterStack, InStatus and HetrixTools seemingly all use Cloudflare on their dashboards, which means I can't login but I keep getting "your website/API is down" emails.
Update: I also can't login to UptimeRobot and Pulsetic. Now, I am getting seriously concerned about the sheer degree of centralization we have for CDNs/login turnstiles on Cloudflare.
Update: Looks like the issue has been resolved now. All sites are operational now.
Name Server: NS-225.AWSDNS-28.COM
Name Server: NS-1411.AWSDNS-48.ORG
Name Server: NS-1914.AWSDNS-47.CO.UK
Name Server: NS-556.AWSDNS-05.NET
At least for DNS. Data center appears to be Lightedge.
- [Sorry I broke the server](https://news.ycombinator.com/item?id=9052128)
- [New attempt at mobile markup](https://news.ycombinator.com/item?id=10489499)
- [Clickable domains and QoL](https://news.ycombinator.com/item?id=10223645)
- [New features and a moderator](https://news.ycombinator.com/item?id=12073675)
- [Thanks to thehodge and littlewarden, this site is up today](https://news.ycombinator.com/item?id=28472350)
Maybe one day. Seeing all of these big providers stumbling, an article about HN staying on top of everything would surely resonate.
/s
This is what you get for being lazy and choosing to make the internet more centralized.
The outages are the Roomba.
> If the problem isn’t resolved in the next few minutes, it’s most likely an issue with the web server you were trying to reach.
[1] https://www.cloudflare.com/5xx-error-landing/?utm_source=err...
Yeah I don't think you are using this phrase correctly
Shouting will not prevent errors, and you are only creating a hostile work environment where not acting is better than risking a mistake and triggering an aggressive response on your part.
There is nothing wrong with shouting during a perceived outage. Shouting is just raising your voice to give a notion of urgency. Yelling is different.
How often have you heard "shout at me", or something like that?
OP, continue to shout when it's needed; just don't yell at the people you work with ;)
Looking forward to the post-mortem.
edit: it's up!
edit: it's down!
I'm going to take the metro now, thinking about how long we have until the entire transit network goes down because of a similar incident.
I'm leaving the redaction because I couldn't work atm...
Time for a beer, greetings from Germany!
It's been great, but I always wonder when a company starts doing more than its initial calling. There have been a ton of large attacks and tons of bot scrapers, so it's the Wild West.
Edit: beautiful, this decentralised design of the internet.
I don't like it.
I get why, but it would give me more confidence if they would tell me about everything.
Requires zero insight into other infrastructure, absolutely minimal automation, but immediately gives you an idea whether it's down for just you or everybody. Sadly now deceased.
Reddit Status used to show API response times way back in the day as well when I used to use the site, but they've really watered it down since then. Everything that goes there needs to be manually put in now AFAIK. Not to mention that one of the few sections is for "ads.reddit.com", classic.
You can put your test runner on different infrastructure, and now you have a whole new class of false positives to deal with. And it costs you a bit more because you're probably paying someone for the different infra.
You can put several test runners on different infrastructure in different parts of the world. This increases your costs further. The only truly clear signals you get from this are when all are passing or all are failing. Any mixture of passes and fails has an opportunity for misinterpretation. Why is Sydney timing out while all the others are passing? Is that an issue with the test runner or its local infra, or is there an internet event happening (cable cut, BGP hijack, etc) beyond the local infra?
And thus nearly everyone has a human in the loop to interpret the test results and make a decision about whether to post, regardless of how far they've gone with automation.
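A minimal sketch of what that multi-runner setup boils down to, in Python. Everything here is hypothetical (the target URL, the runner names, the quorum threshold); the point is that even with the aggregation automated, the mixed-signal branch still ends in a judgment call.

```python
# Sketch of the multi-runner setup described above: probe a URL from several
# vantage points and only treat the result as actionable when a quorum agrees.
# In this sketch every "runner" probes from the local host; a real setup would
# dispatch to probes on separate infrastructure and collect their reports.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

TARGET = "https://example.com/health"           # hypothetical health endpoint
RUNNERS = ["us-east", "eu-west", "ap-sydney"]   # stand-ins for separate probe hosts
QUORUM = 2  # how many runners must fail before we even consider "down"

def probe(runner: str) -> bool:
    """Return True if this runner sees the target as healthy."""
    try:
        with urllib.request.urlopen(TARGET, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

with ThreadPoolExecutor(max_workers=len(RUNNERS)) as pool:
    results = dict(zip(RUNNERS, pool.map(probe, RUNNERS)))

failures = [name for name, ok in results.items() if not ok]
if len(failures) >= QUORUM:
    print(f"possible outage (failing from: {failures}) -- page a human")
elif failures:
    print(f"mixed signal ({failures} failing) -- could be local infra or a network event")
else:
    print("all probes passing")
```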
Whenever there was an outage they would put up a fight against anyone wanting to update the status page to show the outage. They had so many excuses and reasons not to.
Eventually we figured out that they were planning to use the uptime figures for requesting raises and promos as they did at their FAANG employer, so anything that reduced that uptime number was to be avoided at all costs.
I think it's way more common for companies to have a public status page, and then internal tooling that tracks the "real" uptime number. (E.g. Datadog monitors, New Relic monitoring, etc)
(Your point still stands though.)
It’s obviously not a problem at every company because there are many companies who will recognize these shenanigans and come down hard on them. However you could tell these guys could recognize any opportunity to game the numbers if they thought those numbers would come up at performance review time.
Ironically our CEO didn’t even look at those numbers. He used the site and remembered the recent outages.
Or let's say your load balancer croaks, triggering a "down" status, but it's 3am, so a single server is handling traffic just fine? In short, defining "down" in an automated way is just exposing internal tooling unnecessarily and generates more false positives than negatives.
Lastly, if you are allowed 45 minutes of downtime per year and it takes you an hour to manually update the status page, you just bought yourself an extra hour to figure out how to fix the problem before you have to start issuing refunds/credits.
No. Not if you're not defrauding your customers, you didn't.
I'm going home. Time for a beer.
Greetings from Germany.
Almost no one gets mad if your site and half the internet are down.
They'd only have to take down a few services to completely cripple the West - the exact case ARPANET was designed to prevent.
I am a self-hosting enthusiast, so I use Hetzner, Kamal, and other tools for self-managing our servers, but we still have Cloudflare in front of them because we didn't want to handle the parts I mentioned (yet; we might sometime).
Calling it a mistake is a very narrow look at it. Just because it goes down every now and then, it isn't a mistake. Going cloud or not has its trade-offs, and I agree that paying 200 dollars a month for a 1 GB Heroku Redis instance is complete madness when you can get a 4 GB VPS on Hetzner for 3.80 a month. Then again, some people are willing to make that trade-off for not having to manage the servers.
Cloud servers have taught me so much about working with servers because they are so easy and cheap to spin up, experiment with and then get rid of again. If I had had to buy racks and host them each time I wanted to try something, I would've never done it.
But in the face of adversity, it's a huge liability. Imagine Chinese Hackers taking down AWS, Cloudflare, Azure and GCP simultaneously in some future conflict. Imagine what that would do to the West.
I don't believe in Fukuyamas End of History. History is still happening, and the choices we make will determine how it plays out.
and my alarms are going off and my support line is ringing...
I can't even log in to my CF dashboard to disable the CDN!
Edit: It's back. Hopefully it will stay up!
Edit 2: 1 Hour Later.
Narrator: It didn't stay up :/
Even the status page is giving "504 Gateway Timeout ERROR: The request could not be satisfied." right now in India.
I guess claude is more important than your average site :)
The dashboard's API server runs on Cloudflare and is currently blocking all logins, will fix.
Y'know, along with most other SAAS services.
Or search for a new job for yourself. Maybe digging to the Earth's core. Why? Idk. Because then you can say: I did it, or something.
(index):64 Uncaught ReferenceError: $ is not defined at (index):64:3
I hope it gets resolved in the next hour or two, or it could be a serious problem for me.
resource "cloudflare_dns_record"
- proxied = true
+ proxied = false
https://www.pcgamer.com/gaming-industry/legendary-game-desig...
Browser: Working
Cloudflare (San Jose): Error
Host (mysite.com): Working
Lol! Like a solar eclipse!
Our support portal provider is currently experiencing issues
Are they using Cloudflare perchance? (scnr)
Cloudflare Mumbai, Bengaluru, Chennai, Hyderabad edge nodes are also unable to serve content.
x.com down.
A few quick-commerce apps are acting up at times.
Same goes for my personal projects: I've never been worried about being targeted by a botnet so much that I introduce a single point of failure like this.
What, they have Cloudflare and we don't? We must also have Cloudflare. Don't ask why.
Now that you have it, you are at least level 15 and not a peasant.
Same applies to every braindead framework on the web. The gadget mind of the bois is the cause for all this.
I've seen an RPi serve a few dozen QPS of dynamic content without issue... The only service I've had actually get successfully taken down by benign bots is a Gitea-style git forge (which was 'fixed' by deploying Anubis in front of it).
I consider my server's real IP (or load balancer IP) as a secret for that reason, and Cloudflare helps exactly with that.
Everything goes through Cloudflare, where we have rate limiters, Web firewall, challenges for China / Russian inbound requests (we are very local and have zero customers outside our country), and so on.
Frustrating, because I know I'll get asked today if we have an alternative to using CF, and I don't have a good answer.
I finally worked around this by changing the TCP options sent by the VPP TCP stack.
But the whole thing made me worry that there must be something deployed which caused this issue.
But I do not think that's related to this network issue; it just reminds me of the above. I feel there are frequently new articles about Cloudflare networking - maybe new methods or new deployments are somehow related to the higher probability of issues.
It would be easy for me to tell my customers exactly what they need to say to get the maximum, but I've been told not to do that, so I guess it's on them to figure it out.
Yet there has been an uptick in frequency of outages only in the recent few months. Correlation correlation.
Why assume that these misconfigs are not the result of someone asking AI how to do them?
The openai login page says:
Please unblock challenges.cloudflare.com to proceed.
Likely this coupled with the mass brain damage caused by never-ending COVID re-infections.
Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.
Most people are not self-reflective enough to notice. Need to trust the studies.
Far more plausible than the AI ideas.
I find it far more likely these are smart people running without oversight for years pre-COVID, relying on being smart at 2am change windows. Now half or a full std. dev. lower on the IQ scale, hubris means fewer guard rails before change, and far lower ability to recover during change window.
We can even see (measure) it in driving behavior patterns.
Another data point is how Hollywood has gone to great lengths to keep the whole thing hush hush, because such a downer is bad for business:
https://old.reddit.com/r/ZeroCovidCommunity/comments/1ncmclw...
Keep in mind many parties benefit from capitalizing on hysterical hype. And you're going to sit here and tell me they're all keeping it covered up for some reason?
This comes off like the extreme left-wing equivalent of extreme right-wingers that say the "world is run by the evil jewish cabal" and when people ask for proof, they retort "everyone is hiding it so none exists".
It comes as no surprise that media owned by capital interests is falling in line with said capital interests.
https://www.thegauntlet.news/p/how-the-press-manufactured-co...
It's far more likely due to either AI, or more directly, layoffs and offshoring, as that affects hundreds of thousands of their employees.
If you excuse the sloppy plot manually transcribed from market index data: https://i.xkqr.org/cyberinsurancecost.png
You're absolutely right! I shouldn't have force pushed that change to master. Let me try and roll it back. * Confrobulating* Oh no! Cloudflare appears to be down and I cannot revert the change. Why don't you go make a cup of coffee until that comes back. This code is production ready, it's probably just a blip.
It seems 20% of the Internet is down every two weeks now.
Wonder if the internet will soon be deleted.
Unfortunately, that means they can also break 75% of the internet.
Coincidence? I think not.
I run my applications on OVH behind BunnyCDN and all is well.
So everyone who's wrapped their host with Cloudflare is stuck with it.
Will my Spelling Bee QBABM count today, or will it fail and tomorrow I find out that last MA(4) didn't register, ruining my streak? Society cannot function like this! /s
Is Cloudflare being attacked...?
> We are continuing to work towards restoring other services.
> Posted 12 minutes ago. Nov 18, 2025 - 13:13 UTC
Now I'm really suspicious that they were attacked...
We vibe coded a tool to mass disconnect Cloudflare Warp for incident responders: https://github.com/aberoham/unwarp
To go along with the shenanigans around dealing with MITM traffic inspection https://github.com/aberoham/fuwarp
Well...
Just saying.
A lot (and I mean a lot) of people in IT like centralization specifically because it’s hard to blame people for doing something that everyone else is doing.
I'd be horrified. That's not the internet or computing industries I grew up with, or started working in.
But as long as the SPY keeps hitting > 10% returns each year, everyone's happy.
Perhaps the most graceful death of a tech company is that sentiment? Before some perception shift?
"I added a load-balancer to improve system reliability" (happy)
"Load balancer crashed" (smiling-through-the-pain)
Technically, a multi-node cluster with failover (or full-on active-active) will have far higher uptime than just a single node.
Practically, getting the multi-node cluster (for any non-trivial workload) to work right, reliably fail over in every case, etc. is far more work and far more code (that can have more bugs), and even if you do everything right and test what you can, unexpected stuff can still kill it. Like recently we had an uncorrectable memory error which happened to hit the Ceph daemon just right, so that one of the OSDs misbehaved and bogged down the entire cluster...
Of course, if a big incident happens for a big CDN, there might not be enough latent capacity in the other CDNs to take all the traffic. CDNs are a cutthroat business, with small margins, so there usually isn’t a TON of unused capacity laying around.
Companies can always do as they please and people will rationalize anything.
there was a point (maybe still) where not having a netflix subscription was seen as 'strange'.
if that's the case in your social circles -- and these kind of social things bother you -- you're not going to cancel the subscription due to bad service until it becomes a socially accepted norm.
Well that and the fact that when 99% goes through a central party, then that central party will be very interesting for authoritarian governments to apply sweeping censorship rules to.
Eventually?
It's very frustrating of course, and it's the nature of the beast.
Not sure I follow, I didn't say it wasn't worrying or an issue. Just the reasons for it getting to this point are valid.
The target these days is the user.
The make-believe worm.
Oddly this centralization allows a complete deferral of blame without you even doing anything: if you’re down, that’s bad. But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.
It also reduces your incentive to change, if “the internet is down” people will put down their device and do something else. Even if your web site is up they’ll assume it isn’t.
I’m not saying this is a good thing but I’m simply being realistic about why we ended up where we are.
The internet can’t afford to just “give people mental health breaks.”
What happened to having a business continuity plan? E.g. when your IT system is down, writing down incoming orders manually and filling them into the system when it's restored?
I have a creeping suspicion that people don't care about that, in which case they can't really expect more than to occasionally be forced into some downtime by factors outside of their control.
Either it's important enough to have contingencies in place, or it's not. Downtime will happen either way, no matter how brilliant the engineers working at these large orgs are. It's just that with so much centralization (probably too much) the blast range of any one outage will be really large.
Doesn't change the fact that 99% of our ticket sales happen online. People will even come in to the theatre to check us out (we're magicians and it's a small magic shop + magic-themed theatre - so people are curious and we get a lot of foot traffic) but, despite being in the store, despite being able to buy tickets right then and there and despite the fact that it would cost less to do so ... they invariably take a flyer and scan the QR code and buy online.
We might be kind of niche, since events usually sell to groups of people and it's rare that someone decides to attend an event by themselves right there on the spot. So that undoubtedly explains why people behave like this - they're texting friends and trying to see who is interested in going. But I'm still bringing us up as an example to illustrate just how "online" people are these days. Being online allows you to take a step back, read the reviews, price shop, order later and have things delivered to your house once you've decided to commit to purchasing. That's just normal these days for so many businesses and their customers.
If AWS is down, most businesses on AWS are also down, and it’s mostly fine for those businesses.
It's better to have diverse, imperfect infrastructure, than one form of infra that goes down with devastating results.
I'm being semi-flippant but people do need to cope with an internet that is less than 100% reliable. As the youth like to say, you need to touch grass
Being less flippant: an economy that is completely reliant on the internet is one vulnerable to cyberattacks, malware, catastrophic hardware loss
It also protects us from the malfeasance or incompetence of actors like Google (who are great stewards of internet infrastructure... until it's no longer in their interests)
The idea that we absolutely need 24/7 productivity is a new one and I’m not that convinced by it. Obviously there are some scenarios that need constant connectivity but those are more about safety (we don’t want the traffic lights to stop working everywhere) than profit.
We don't need it, the owners want it
Phone lines absolutely did not go down. Physical POTS lines (Yes, even the cheap residential ones) were required to have around 5 9s of availability, or approximately 5 minutes per year. And that's for a physical medium affected by weather, natural disasters, accidents, and physical maintenance. If we or the LEC did not meet those targets contracts would be breached and worst case the government would get involved.
Physical network equipment is redundant and reliable enough that getting 5 minutes of downtime or less per year is totally doable.
The web, however, is a far different beast (and, in my opinion, one with incentives that do not factor in reliability).
This was the same when I was doing consulting inside (i.e., at large companies willing to pay the premium cost of AWS ProServe consultants) and outside, working at 3rd-party companies.
try going outside
This:
> if you’re down, that’s bad. But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.
is just marketing. If you are down along with some other websites, it is still bad.
In some cases, absolutely. For the vast majority, it really, really doesn't matter.
(Source: my personal website is down and nobody cares, including me)
> it is still bad
No doubt. But there’s a calculation to make, is it bad enough to spend the extra money on mitigations, to hire extra devops folks to manage it all… and in the majority of end user facing cases the answer is no, it isn’t.
That one time when an AZ goes down and your infra successfully fails over to the other two isn't worth it for a lot of companies at my scale; ops consultants seem to be chasing high cloud spend to justify their own high cost. I also factor in that I live in Sweden, where most infrastructure outages are exceptionally rare.
Ofc it depends on what kind of company you are and what you're providing.
In this case, the internet should be down more often.
Users are never a consideration today anyway.
We had very decentralized "internet" with BBSes, AOL, Prodigy, etc.
Then we centralized on AOL (ask anyone over 40 if they remember "AOL Keyword: ACME" plastered all over roadside billboards).
Then we revolted and decentralized across MySpace, Digg, Facebook, Reddit, etc.
Then we centralized on Facebook.
We are in the midst of a second decentralization...
...from an information consumer's perspective. From an internet infrastructure perspective, the trend has been consistently toward more decentralization. Initially, even after everyone moved away from AOL as their sole information source online, they were still accessing all the other sites over their AOL dial-up connection. Eventually, competitors arrived and, since AOL no longer had a monopoly on content, they lost their grip on the infrastructure monopoly.
Later, moving up the stack, the re-centralization around Facebook (and Google) allowed those sources to centralize power in identity management. Today, though, people increasingly only authenticate to Facebook or Google in order to authenticate to some 3rd party site. Eventually, competitors for auth will arrive (or already have ahem passkeys coughcough) and, as no one goes to Facebook anymore anyway, they'll lose grip on identity management.
It's an ebb and flow, but the fundamental capability for decentralization has existed in the technology behind the internet from the beginning. Adoption and acclimatization, however, is a much slower process.
That's the whole issue with monopolies for example, innit? We envision "ideal free market dynamics" yet in practice everybody just centralizes for efficiency gains.
The much bigger issue with monopolies is that there is no pressure on the monopolist to compete on price or quality of the offering.
I don't have a better solution, but it's a clear problem. Also, for some reason, more and more people (not you) will praise and attack anyone who doesn't defend state A (ideal equilibrium). Leaving no room to point out state B as a logical consequence of A which requires intervention.
If anyone in the industry actually cared about reliability and took personal stake in their system being up, everyone would be back on-prem.
No, the value proposition was always about saving money, turning CapEx into OpEx. Direct quote from my former CEO maybe 9 years ago: We are getting out of the business of buying servers.
Cloud engineering involves architecting for unexpected events: retry patterns, availability zones, multi-region fail over, that sort of thing.
Now - does it all add up to cost savings? I could not tell you. I have seen some case studies, but I also have been around long enough to take those with a big grain of salt.
IMHO it adds up, but only if you are big enough. Netflix level. At that level, you go and dine with Bezos and negotiate a massive discount. For anyone else, I'd genuinely love to see the numbers that prove otherwise.
> There's zero personal responsibility
Unfortunately, this seems to be the unspoken mantra of modern IT management. Nobody wants to be directly accountable for anything, yet everyone wants to have their fingerprints on everything. A paradox of collaboration without ownership.
And this is not reserved instances, this is an org level pricing deal. Some have been calling it anti-competitive and saying the regulators need to look at the practice.
It adds up if you're smart about using resources efficiently, at any level, and engineer the system to spin up / spin down as customers dictate.
For situations where resources are allocated but are only being utilized a low percentage (even < 50% in some cases), it is not cost effective. All that compute / RAM / disk / network etc. is just sitting there wasted.
You no longer needed them to approve a new machine, you just spun it up how you want. Sped things up massively for a while.
I saw this stuff too many times, and it is precisely why the cloud exploded in use in about 2010.
For many organizations, that's literally illegal, and anyone who does this should be fired.
But sure jump to more conclusions if you want.
And this is in Fortune 500 of course.
I mean yea, but who knows how long that box would sit around before it was discovered.
If the business can live with a couple of hours downtime per year when "cloud" is down, and they think they can ship faster / have less crew / (insert perceived benefit), then I don't know why that is a problem.
Frankly it's a blessing, always being able to blame the cloud that management forced the company to migrate to in order to be "cheaper" (which half of the time turns out to be false anyway).
https://www.apple.com/newsroom/2025/11/apple-introduces-digi...
But we can all say thank you to all the AI crawlers who hammer websites with impossible traffic.
I.e., if you use Tor for "normie sites", then the fact that someone can be seen using Tor is no longer a reliable proxy for detecting them trying to see/do something confidential and it becomes harder to identify & target journalists, etc. just because they're using Tor.
If you have a site with valuable content, the LLM crawlers hound you to no end. CF is basically a protection racket at this point for many sites. It doesn't even stop the more determined ones, but it keeps some away.
And they're pretty tame as far as computer fraud goes - if my device gets compromised I'd much rather deal with it being used for fake YouTube views than ransomware or a banking trojan.
"The internet sucks", yes, but we're doing it to ourselves.
LLMs aren't going anywhere, but the world would be a better place if they hadn't been developed. Even if they had more positive impacts, those would not outweigh the massive environmental degradation they are causing or the massive disincentive they created against researching other, more useful forms of AI.
LLMs have become a crucial compendium of knowledge that had become hidden behind SEO.
A solid secondary option is making LLM scraping for training opt-in, and/or compensating sites that were/are scraped for training data. Hell, maybe then you wouldn't knock websites over, incentivizing them to use Cloudflare in the first place.
But that means LLM researchers have to respect other people's IP, which hasn't been high on their to-do lists as yet.
bUt ThAT dOeSn'T sCaLe - not my fuckin problem chief. If you as an LLM developer are finding your IP banned or you as a web user are sick of doing "prove you're human" challenges, it isn't the website's fault. They're trying to control costs being arbitrarily put onto them by a disinterested 3rd party who feels entitled to their content, which it costs them money to deliver. Blame the asshole scraping sites left and right.
Edit: and you wouldn't even need to go THAT far. I scrape a whole bunch of sites for some tools I built and a homemade news aggregator. My IP has never been flagged because I keep the number of requests down wherever possible, and rate-limit them so it's more in line with human-like browsing. Like so much of this could be solved with basic fucking courtesy.
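For what it's worth, the "basic courtesy" version is only a few lines. A hedged sketch in Python using only the standard library (the URLs, user-agent string, and delay are placeholders):

```python
# Sketch of the "basic courtesy" approach: respect robots.txt and rate-limit
# requests so a homemade aggregator doesn't hammer anyone. URLs, the user
# agent, and the delay are all placeholders.
import time
import urllib.robotparser
import urllib.request
from urllib.parse import urlsplit

USER_AGENT = "my-hobby-aggregator/0.1 (contact@example.com)"  # identify yourself
DELAY_SECONDS = 10  # far gentler than a human clicking around

def allowed(url: str) -> bool:
    """Check the site's robots.txt for our user agent."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def fetch(url: str) -> bytes | None:
    if not allowed(url):
        return None  # the site asked us not to; honor that
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

for url in ["https://example.com/feed", "https://example.org/news"]:
    body = fetch(url)
    print(url, "skipped (disallowed)" if body is None else f"{len(body)} bytes")
    time.sleep(DELAY_SECONDS)  # one request per DELAY_SECONDS
```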
Most of the problems on the internet in 2025 aren't because of one particular technology. They're because the modern web was based on gentleman's agreements and handshakes, and since those things have now gotten in the way of exponential profit increases on behalf of a few Stanford dropouts, they're being ignored writ large.
CF being down wouldn't be nearly as big of a deal if their service wasn't one of the main ways to protect against LLM crawlers that blatantly ignore robots.txt and other long-established means to control automated extraction of web content. But, well, it is one of the main ways.
Would it be one of the main ways to protect against LLM web scraping if we investigated one of the LLM startups for what is arguably a violation of the Computer Fraud and Abuse Act, arrested their C-suite, and sent each member to a medium-security federal prison (I don't know, maybe Leavenworth?) for multiple years after a fair trial?
Probably not.
It is the Web that is being degraded
The site doesn't even need to have valuable content. Any content at all.
That's a problem caused by bots and spammers and DDoSers, that Cloudflare is trying to alleviate.
And you generally don't have to prove it over and over again unless there's a high-risk signal associated with you, like you're using a VPN or have cookies disabled, etc. Which are great for protecting your privacy, but then obviously privacy means you do have to keep demonstrating you're not a bot.
That they're trying counts for brownie points, but it's not an excuse to be satisfied with something that still bothers a lot of people. Do better, CloudFlare.
If you have any ideas on how to protect against bad actors in a way that is just as effective but easier for users, please share it.
Because as far as I can tell, this isn't a question of effort. It's a question of fundamental technological limitations.
If you have a better technological solution, we'd all love to know it. Because right now, site owners are using the best tools available.
Criticizing when there's no other solution isn't very useful, is it?
Cloudflare is the multi-billion dollar corporation. It has everything to do with that, because they are the primary cause, and their resources and position make them by far the best equipped to solve it.
> Criticizing when there's no other solution isn't very useful, is it?
Of course it is. Without criticism, the growing problem goes unacknowledged and allowed to persist. It should instead be continually called out until it is prioritized, and some of those billions should be spent on researching a solution. (Similarly, a company found to be dumping waste into a river should be held responsible for cleaning up the mess they created. Even if that turns out to be expensive or difficult.)
Expecting a single affected person to solve it for the big corp that caused it is unrealistic. And blaming the victims because they use VPNs or disable cookies is... unhelpful.
CloudFlare is protecting sites from DDoS attacks and out-of-control bots. They're not the ones causing them. If CloudFlare wasn't asking you to prove you're human, many times the site would be down entirely because it couldn't keep up. Or the site would simply shut down because it couldn't afford it.
And this isn't a question of spending some fraction of billions on researching a solution. There fundamentally isn't one, if you understand how the internet works. This is a problem a lot of people would like to solve better, believe me.
So, yes, criticizing Cloudflare here is as useful as criticizing it for not having faster-than-light communication. There's nothing else it can do. It's not "blaming the victims".
I'm going to assume you simply don't have the technical understanding of how the internet works. Because the position you're taking is simply absurd and nonsensical, and there's no way you would write what you're writing otherwise.
As long as HN is up and running, everything is going to be O.K.!
Smaller companies that provide real-world services or goods for a much more meagre living, and that rely on some of the services sold to them by said software companies, will be impacted much more severely.
Losing a day or two of sales to someone who relies on making sales every day can be a growing hardship.
This doesn’t just impact developers. It’s exactly this kind of myopic thinking that leads to scenarios like mass outages.
You have to realize that when software companies tell the world it should rely on their works, the world will do so. And once that occurs, the responsibility is all on the software companies to meet the expectations they built in people!
It's mad that this industry works so hard to claim the trust of millions of people, then shirks it as soon as it's convenient.
It's shameful.
When have users been asked about anything?
this is like a bad motivational speaker talk.. heavy exhortations with a dramatic lack of actual reasoning.
Systems are difficult, people. It is "incentives" of parties and lockin by tech design and vendors, not lack of individual effort.
The internet this day is fucking dangerous and murderous as hell. We need Cloudflare just to keep services up due to the deluge of AI data scrapers and other garbage.
In my direct experience, this isn't true if you're running something even vaguely mission-critical for your customers. Your customer's workers just know that they can't do their job for the day, and your customer's management just knows that the solution they shepherded through their organization is failing.
If Akamai went down, I have a feeling you'd see a whole lot more real-life chaos.
if you run anything even remotely mission critical, not having a plan B which is executable and of which you are in control (and a plan C) will make you look completely incompetent.
There are very, very few events which some people who run mission critical systems accept as force majeure. Most of those are of the scale "national emergency" or worse.
And why should anyone be surprised? It's been about 80 years since "The buck stops here."[0] had any real relevance. And more's the pity.
[0] https://www.phrases.org.uk/meanings/the-buck-stops-here.html
End product users have no power, they can complain to support and maybe get a free month of service, but the 0.1% of customers that do that aren't going to turn the tide and have anything change.
Engineering teams using these services also get "covered" by them - they can finger point and say "everyone else was down too."
oh no
I agree. When people talk about the enshittification of the internet, Cloudflare plays a significant role.
Which changes nothing about you actually being down; you're only down more. CF proxies always sucked - not your domain, not your domain...
It's genuinely insane that many companies design a great number of fallbacks... at the software level, while almost no thought goes into the hardware/infrastructure level; common sense dictates that you should never host everything on a single provider.
I was wrong, and ever since I've dealt with a targeted attack (which was evolving as I added more CF firewall rules). At this point it's taken care of, but only because I have most things completely blocked at the CF firewall layer.
(also congrats on 1 million subscribers - I know you must be tired of hearing it, but have a nice day Jeff! Your videos are awesome!)
The idea that if companies like my former employer stopped doing DRM their audience would embrace it is pure idealism. Bitter experience shows that enough people will do bad things just for the lulz that you need to cover your ass.
My home lab will never have an open port, I'll always put things behind a CDN or zero trust system, even then...
FWIW, it's worthwhile just for educational reasons to look at abuseipdb.com - quite revealing.
That being said, streaming content security is more than just DRM and DRM is more than just copy protection. There's a whole suite of tools inside DRM systems to manage content access at different levels and rulesets that can be applied for different situations. It's still fundamentally controlling an encrypted bitstream however. But I've implemented a great deal more than just DRM in order to build a better content security platform. Transit level controls, advanced token schemes, visible/invisible watermarking, threat/intrusion detection and abuse detection, there's quite a bit that can be implemented.
I was at Softlayer before I was at AWS and what catalyzed the move was the time I needed to add another hard drive to a system and somehow they screwed it up. I couldn't put in a trouble ticket to get it fixed because my database record in their trouble ticket system was corrupted. The next day I moved my stuff to AWS and the day after that they had a top sales guy talk to me to try to get me to stay but it was too late.
Then, with regular VPSs I also had systems down for 1-2 days. Just last week the company that hosts NextCloud for us was down the whole weekend (from Friday evening) and we couldn’t get their attention until Monday.
So far these huge outages that last 2-5 hours are still lower impact for me, and require me to take less action.
It depends on how you calculate your cost. If you only include the physical infrastructure, having a dedicated server is cheaper. But with a dedicated server you lose a lot of flexibility. Need more resources? Just scale up your EC2 instance; with a dedicated server there is a lot more work involved.
Do you want a 'production-ready' database? With AWS you can just click a few buttons and have an RDS instance ready to use. To roll out your own PG installation you need someone with a lot of knowledge (how to configure replication? backups? updates? ...).
So if you include salaries in the calculation, the result changes a lot. And even if you already have some experts on your payroll, by putting them to work deploying a PG instance you won't be able to use them to build other things that may generate more value for your business than the premium you pay to AWS.
So there are massive economies of scale. A small CDN with (say) 10,000 customers and 10 Mbit/s per customer can handle a 100 Gbit/s DDoS (way too simplistic, but hopefully you get the idea) - way too small.
If you have the same traffic provisioned on average per customer and have 1 million customers, you can handle a DDoS 100x the size.
Only way to compete with this is to massively overprovision bandwidth per customer (which is expensive, as those customers won't pay more just for you to have more redundancy because you are smaller).
In a way (like many things in infrastructure) CDNs are natural monopolies. The bigger you get -> the more bandwidth and PoP you can have -> more attractive to more customers (this repeats over and over).
It was probably very astute of Cloudflare to realise that offering such a generous free plan was a key step in this.
There is no network protocol per se, but there are commercial solutions like Fortinet that can block countries IIRC; note that it's only IP-range based, so it's not worth a lot.
edit: yes, you can use BGP to blackhole subnet traffic - but the standard doesn't play well if you want to blackhole unrelated subnets from an upstream network
Blocking individual IP addresses? Sure, but consider that before your service detects enough anomalous traffic from one particular IP and is able to send the request to block upstream, your service will already be down from the aggregate traffic. Even a "slow" ddos with <10 packets per second from one source is enough to saturate your 10Gbps link if the attacker has a million machines to originate traffic from.
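The arithmetic behind that claim checks out; assuming roughly full-size 1500-byte packets (an assumption, not a measurement):

  # 1,000,000 sources * 10 packets/s * 1500 bytes * 8 bits/byte, expressed in Gbit/s:
  echo "1000000 * 10 * 1500 * 8 / 10^9" | bc
  # => 120   (~12x a 10 Gbps link; even tiny ~125-byte packets already fill it exactly)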
They have all the data on what CPE a user has, can send a letter and email with a deadline, and cut them off after it expires and the router has not been updated/is still exposed to the wide internet.
(Turns out some raspi reseller shipped a product with empty uname/password)
While a cute story, how do you scale that? And what about all the users that would be incapable of troubleshooting it, like if their laptop, roku, or smart lightbulb were compromised? They just lose internet?
And what about a botnet that doesn’t saturate your connection, how does your ISP even know? They get full access to your traffic for heuristics? What if it’s just one curl request per N seconds?
Not many good answers available if any.
Uh, yes. Exactly and plainly that. We also go and suspend people's driver licenses or at the very least seriously fine them if they misbehave on the road, including driving around with unsafe cars.
Access to the Internet should be a privilege, not a right. Maybe the resulting anger from widespread crackdowns would be enough of a push for legislators to demand better security from device vendors.
> And what about a botnet that doesn’t saturate your connection, how does your ISP even know?
In ye olde days providers had (or were required to have) abuse@ mailboxes. Credible evidence of malicious behavior reported to these did lead to customers getting told to clean up shop or else.
And even if the attack comes from your country, it is better to block part of the customers and figure out what to do next rather than have your site down.
Identifying and dynamically blocking the 500k offending IPs would certainly be possible technically -- 500k /32s is not a hard filtering problem -- but I seriously question the operational ability of internet providers to perform such granular blocking in real-time against dynamic targets.
I also have concerns that automated blocking protocols would be widely abused by bad actors who are able to engineer their way into the network at a carrier level (i.e. certain governments).
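On the pure filtering side, for what it's worth, half a million /32s is well within what commodity packet filters can hold; a minimal nftables sketch (table, chain, and set names plus the sample address are all made up here) of the kind of set a detection pipeline could stream into:

  # Create a set for individual offending IPs, with automatic expiry per element.
  nft add table inet ddos
  nft add chain inet ddos input '{ type filter hook input priority 0; policy accept; }'
  nft add set inet ddos offenders '{ type ipv4_addr; flags timeout; }'
  nft add rule inet ddos input ip saddr @offenders drop
  # A detection pipeline would then stream elements in; entries age out on their own.
  nft add element inet ddos offenders '{ 203.0.113.7 timeout 10m }'

The hard operational part, as the comment says, is doing this in real time at carrier scale against moving targets, not the data structure itself.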
No, in my concept the host can only manage the traffic targeted at it and not at other hosts.
Is this really true? What device in the network are you loading that filter into? Is it even capable of handling the packet throughput of that many clients while also handling such a large block list?
But this would require a service like DNSBL / RBL which email providers use. Mutually trusting big players would exchange lists of IPs currently involved in DDoS attacks, and block them way downstream in their networks, a few hops from the originating machines. They could even notify the affected customers.
But this would require a lot of work to build, and a serious amount of care to operate correctly and efficiently. ISPs don't seem to have a monetary incentive to do that.
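The mechanics on the email side are trivially simple, which is part of why it caught on there: a lookup is just a DNS query for the reversed IP under the list's zone. Spamhaus's real mail-oriented list is used below purely to illustrate the mechanism (127.0.0.2 is the conventional always-listed test address):

  # Query the DNSBL for 127.0.0.2: octets reversed, appended to the list's zone.
  dig +short 2.0.0.127.zen.spamhaus.org
  # A non-empty answer (e.g. 127.0.0.x) means "listed"; NXDOMAIN means "not listed".
  # Note: queries via large public resolvers are often refused, so results may vary.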
In a CDN, customers consume bandwidth; they do not contribute it. If Cloudflare adds 1 million free customers, they do not magically acquire 1 million extra pipes to the internet backbone. They acquire 1 million new liabilities that require more infrastructure investment.
All you are doing is echoing their pitch book. Of course they want to skim their share of the pie.
However, most customers are rarely at their peak, which gives you tremendous spare capacity to eat DDoS attacks, assuming the attacks are uncorrelated with customer peaks. That leaves huge amounts of capacity frequently doing nothing; Cloudflare advertises this spare capacity as "DDoS protection."
I suppose in theory it might be possible to massively optimise utilisation of your links, but that would be at the cost of DDoS protection and might not improve your margin very meaningfully, especially if customers care a lot about being online.
OP is saying it's cheaper overall for a 10 million customer company to add infrastructure for 1 million more than it is for a 10,000 customer company to add infrastructure for 1000 more people.
If you're looking at this as a "share of the pie", it's probably not going to make sense. The industry is not zero sum.
They contribute money which buys infrastructure.
> If Cloudflare adds 1 million free customers,
Are the free-tier users really customers? Regardless, most of them are so small that it doesn't cost Cloudflare much anyway. The infrastructure is already there. It's worth it to them for the goodwill it generates, which leads to future paying customers. It probably also gives them visibility into what is good vs bad traffic.
1 million small sites could very well cost less to cloudflare than 1 big site.
The same reason I use cloud compute -- elastic infrastructure because I can't afford the peaks -- is the same reason large service providers "work".
It's funny how we always focus on Cloudflare, but all cloud providers have this same concentration downside. I think it's because Cloudflare loves to talk out of both sides of their mouth.
That's AI bots, BTW. Bots like Playwright or Crawl4AI, which provide a useful service to individuals using agentic AI. Cloudflare is hostile to these types of users, even though they likely cost websites nothing to support well.
The "scale saves money" argument commits a critical error: it counts only the benefits of concentration while externally distributing the costs.
Yes, economies of scale exist. But Cloudflare's scale creates catastrophic systemic risk that individual companies using cloud compute never would. An estimated $5-15 billion was lost for every hour of the outage according to Tom's Guide. That cost didn't disappear. It was transferred to millions of websites, businesses, and users who had zero choice in the matter.
Again, corporations shitting on free users. It's a bad habit and a dark pattern.
Even worse, were you hoping to call an Uber this morning for your $5K vacation? Good luck.
This is worse than pure economic inefficiency. Cloudflare operates as an authorized man-in-the-middle to 20% of the internet, decrypting and inspecting traffic flows. When their systems fail, not due to attacks, but to internal bugs in their monetization systems, they don't just lose uptime.
They create a security vulnerability where encrypted connections briefly lose their encryption guarantee. They've done this before (Cloudbleed), and they'll do it again. Stop pretending to have rational arguments with irrational future outcomes.
The deeper problem: compute, storage, and networking are cheap. The "we need Cloudflare's scale for DDoS protection" argument is a circular justification for the very concentration that makes DDoS attractive in the first place. In a fragmented internet with 10 CDNs, a successful DDoS on one affects 10% of users. In a Cloudflare-dependent internet, a DDoS, or a bug, affects 50%, if Cloudflare is unable to mitigate (or DDoSs themselves).
Cloudflare has inserted themselves as an unremovable chokepoint. Their business model depends on staying that chokepoint. Their argument for why they must stay a chokepoint is self-reinforcing. And every outage proves the model is rotten.
i don't even think they are evil because of the concentration of power, that's just a problematic issue. the evil part is they convince themselves they aren't the bad guys. that they are saving us from ourselves. that the things they do are net positives, or even absolute positives. like the whole "let's defend the internet from AI crawlers" position they appointed themselves sheriff on, that i think you're referencing. it's an extremely dangerous position we've allowed them to occupy.
> they monetize it
yes, and they can't do this without the scale.
> scale saves money
any company, uber for example, can design their infra to not rely on a sole provider. but why? their customers aren't going to leave in droves when a pretty reliable provider has the occasional hiccup. so it's not worth the cost, so why shouldn't they externalize it? uber isn't in business to make the internet a better place. so yes, scale does save money. you're arguing something at a higher principle than how architectural decisions are made.
i'm not defending economy of scale as a necessary evil. i'm just backing up that it's how cloudflare is built, and that it is in fact useful to customers.
Not every company can be an expert at everything.
But perhaps many of us could buy a different CDN than the major players if we want to reduce the likelihood of mass outages like this though.
People not being ready for cloudflare/[insert hyperscaler] to be possibly down is the only fault.
So while us in tech might like a "snow day", there are millions of small businesses and people trying to go about their day to day lives who get cut off because of someone else's fuck-ups when this happens.
Even if you were making a million a minute, typically, it still didn't cost you a thing, nor have you lost anything.
You're not making as much, sure, but neither a cost, nor a loss.
Don't ask me how I know.
Maybe Tuesdays tend to be a big day for me, and instead of "down for a day", it's "lose almost a quarter of my income for the month".
Cloudflare is pretty pervasive, there are all kinds of people and businesses, in all kinds of situations, impacted by this.
But there are systems that depend on Cloudflare, directly or not, and when they go down it can have a serious impact on somebody's livelihood.
As a software engineer, I get it. As a CTO, I spent this morning triaging with my devops AI (actual Indian) to find some workaround (we found one) while our CEO was doing damage control with customers (non-technical field) who were angry that we were down and were losing business by the minute.
sometimes I miss not having a direct stake in the success of the business.
If it can tell us that the host is up, surely it can just bypass itself to route traffic.
Congratulations, you've successfully completed Management Training 101.
But I dare say the folks at these organisations take these matters incredibly seriously and the centralisation problem is largely one of risk efficiency.
I think there is no excuse, however, to not have multi region on state, and pilot light architectures just in case.
As always, in the name of "security". When are we going to learn that anything done, either by the government or by a corporation, in the name of security is always bad for the average person?
It's like github vs whatever else you can do with git that is truly decentralized. The centralization has such massive benefits that I'm very happy to pay the price of "when it's down I can't work".
AWS - someone touches DynamoDB and it kills the DNS.
Cloudflare - someone touches functionality completely unrelated to DNS hosting and proxying and, naturally, it kills the DNS.
There is this critical infrastructure that just becomes one small part of a wider product offering, worked on by many hands, and this critical infrastructure gets taken down by what is essentially a side-effect.
It's a strong argument to move to providers that just do one thing and do it well.
Dialogue about mitigations/solutions? Alternative services? High availability strategies?
Nah! It's free to complain.
Me personally, I'd say those companies do a phenomenal job by being a de facto backbone of the modern web. Also Cloudflare, in particular, gives me a lot of things for free.
It has been dead to me since the SSL cache vulnerability thing and the arrogance with which senior people expected others to solve their problems.
But consider how many people still do stupid things like use the default CDN offered by some third party library, or use google fonts directly; people are lazy and don't care.
It is not as bad as Cloudflare or AWS because certificates will not expire the instant there is an outage, but consider that:
- It serves about 2/3 of all websites
- TLS is becoming more and more critical over time. If certificates fail, the web may as well be down
- Certificate lifetimes are becoming shorter and shorter: currently 90 days, but Let's Encrypt is now considering 6 days, with 47 days planned as the industry-wide maximum
- An outage is one thing, but should a compromise happen, that would be even more catastrophic
Let's Encrypt is a good guy now, but remember that Google used to be a good guy in the 2000s too!
Let’s Encrypt is great at making the existing system less painful, and there are a few alternatives like ZeroSSL, but all of this automation is basically a pile of workarounds on top of a fundamentally inappropriate design.
But DNSSEC was hard according to some, and now we are running a massive SPOF in terms of TLS certificates.
I'm also concerned about LE being a single point of failure for the internet! I really wish there were other free and open CAs out there. Our goal is to encrypt the web, not to perpetuate ourselves.
That said, I'm not sure the line of reasoning here really holds up? There's a big difference between this three-hour outage and the multi-day outage that would be necessary to prevent certificate renewal, even with 6-day certs. And there's an even bigger difference between this sort of network disruption and the kind of compromise that would be necessary to take LE out permanently.
So while yes, I share your fear about the internet-wide impact of total Let's Encrypt collapse, I don't think that these situations are particularly analogous.
Very worrying indeed.
Are people really this confused?
It sucks there's not more competition in this space but CloudFlare isn't widely used for no reason.
AWS also solves real problems people have. Maintaining infrastructure is expensive as is hardware service and maintenance. Redundancy is even harder and more expensive. You can run a fairly inexpensive and performant system on AWS for years for the cost of a single co-located server.
Your power is provided by a power utility company. They usually serve an entire state, if not more than one (there are smaller ones too). That's "centralization" in that it's one company, and if they "go down", so do a lot of businesses. But actually it's not "centralized", in that 1) there are actually many different companies across the country/world, and 2) each company "decentralizes" most of its infrastructure to prevent massive outages.
And yes, power utilities have outages. But usually they are limited in scope and short-lived. They're so limited that most people don't notice when they happen, unless it's a giant weather system. Then if it's a (rare) large enough impact, people will say "we need to reform the power grid!". But later when they've calmed down, they realize that would be difficult to do without making things worse, and this event isn't common.
Large internet service providers like AWS, Cloudflare, etc, are basically internet utilities. Yes they are large, like power utilities. Yes they have outages, like power utilities. But the fact that a lot of the country uses them, isn't any worse than a lot of the country using a particular power company. And unlike the power companies, we're not really that dependent on internet service providers. You can't really change your power company; you can change an internet service provider.
Power didn't used to be as reliable as it is. Everything we have is incredibly new and modern. And as time has passed, we have learned how to deal with failures. Safety and reliability has increased throughout critical industries as we have learned to adapt to failures. But that doesn't mean there won't be failures, or that we can avoid them all.
We also have the freedom to architect our technology to work around outages. All the outages you have heard about recently could be worked around, if the people who built on them had tried:
- CDN goes down? Most people don't absolutely need a CDN. Point your DNS at your origins until the CDN comes back. (And obviously, your DNS provider shouldn't be the same as your CDN...)
- The control plane goes down on dynamic cloud APIs? Enable a "limp mode" that persists existing infrastructure to serve your core needs. You should be able to service most (if not all) of your business needs without constantly calling a control plane.
- An AZ or region goes down? Use your disaster recovery plan: deploy infrastructure-as-code into another region or AZ. Destroy it when the az/region comes back.
...and all of that just to avoid a few hours of downtime per year? It's likely cheaper to just take the downtime. But that doesn't stop people from piling on when things go wrong, questioning whether the existence of a utility is a good idea.
And it’s hard to protect against DDoS without something like Cloudflare.
Look at the posts here.
Even the meager HN “hug of death” will take things down
These decision are made individually not centrally. There is no process in place (and most likely there will never be) that will be able to control and dictate if people decide one way of doing things is the best way to do it. Even assuming they understand everything or know of the pitfalls.
Even if you can control individually what you do for the site you operate (or are involved in) you won't have any control on parts of your site (or business) that you rely on where others use AWS or Cloudflare.
You use a service provider, if that service provider is down, your site is down. Does it matter to you that others are also down in that instance?
> Application error: a client-side exception has occurred while loading www.digitalocean.com (see the browser console for more information).
Yellow flags on status.digitalocean.com *
My ISP is routing public internet traffic to my IPs these days. What keeps me from running my blog from home? Fear of exposing a TCP port, that's what. What do we do about that?
Depending on the contract, you might not be allowed to run public network services from your home network.
I had a friend doing that and once his site got popular the ISP called (or sent a letter? don't remember anymore) with "take this 10x more expensive corporate contract or we will block all this traffic".
In general, the reason ISPs don't want you to do that (in addition to the way more expensive corporate rates) is the risk of someone DDoSing that site, which could cause issues for large parts of their domestic customer base (and, depending on the country, make them liable to compensate those customers for not providing a service they paid for).
> We acknowledge the inconvenience this may cause and are working diligently to restore normal operations. Signs of recovery are starting to appear, with most requests beginning to succeed. We will continue to monitor the situation closely and provide timely updates as more information becomes available. Thank you for your patience as we work towards full service restoration.
It's not down for you, but for others.
what ChatGPT and Claude being down taught me about b2b SaaS
I sense a great disturbance in the force... As if millions of cringefluencers suddenly cried out in terror cause they had to come up with an original thought.
Be aware that this change has a few immediate implications:
- SSL/TLS: You will likely lose your Cloudflare-provided SSL certificate. Your site will only work if your origin server has its own valid certificate.
- Security & Performance: You will lose the performance benefits (caching, minification, global edge network) and security protections (DDoS mitigation, WAF) that Cloudflare provides.
if you don't have the keys make sure to grab them for the next one.
Edit: it was related
https://www.laprovence.com/article/region/83645099971988/pan...
Edit2: They edited the article stating it wasn't related.
DNS is actually one of the easiest services to self-host, and it's fairly tolerant of downtime due to caching. If you want redundancy/geographical distribution, Hurricane Electric has a free secondary/slave DNS service [0] where they'll automatically mirror your primary/master DNS server.
[0]: https://dns.he.net/
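A sanity check that the secondaries are actually serving your zone is a couple of dig queries; the zone name below is a placeholder and the ns2-ns5.he.net hostnames are from memory, so treat them as an assumption:

  # Ask a Hurricane Electric secondary directly (no recursion) for your zone's records.
  dig +norecurse +short SOA example.org @ns2.he.net
  dig +norecurse +short A blog.example.org @ns3.he.net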
Never had an issue hosting my stuff, but as said - I don't yet have experience hosting something from home with a more dynamic DNS setup.
But, yeah, it's still a horrible outage, much worse than the Amazon one.
Ofc the bigger perception issue here is many services going out at the same time, but why would (most) providers care whether their annual downtime does or doesn't coincide with others? Their overall reliability is no better or worse than if only their service had gone down.
All of this can change ofc if this becomes a regular thing; the absolute hours of downtime do matter.
PS: Someone really doesn't want Gemini 3 to get air time today
Prior hosting provider was a little-known company with decent enough track record, but because they employed humans, stuff would break. When it did break, C-suite would panic about how much revenue is lost, etc.
The number of outages was "reasonable" to anyone who understood the technical side, but non-technical would complain for weeks after an outage about how we're always down, "well BigServiceX doesn't break ever, why do we?", and again lost revenue.
Now on Azure/Cloudflare, we go down when everyone else does, but C-Suite goes "oh it's not just us, and it's out of our control? Okay let us know when it fixes itself."
A great lesson in optics and perception, for our junior team members.
The mass centralization is a massive attack vector for organized attempts to disrupt business in the west.
But we’re not doing anything about it because we’ve made a mountain out of a molehill. Was it that hard to manage everything locally?
I get that there are plenty of security implications going that route, but it would be much harder to bring down large portions of online business with a single attack.
A lot of money related to stuff you currently don't have to worry about.
I remember how shit worked before AWS. People don't remember how costly and time consuming this stuff used to be. We had close to 50 people in our local ops team back in the day when I was working with Nokia 13 years ago. They had to deal with data center outages, expensive storage solutions failing, network links between data centers, offices, firewalls, self hosted Jira running out of memory, and a lot of other crap that I don't spend a lot of time about worrying with a cloud based setup. Just a short list of stuff that repeatedly was an issue. Nice when it worked. But nowhere near five nines of uptime.
That ops team alone cost probably a few million per year in salaries alone. I knew some people in that team. Good solid people but it always seemed like a thankless and stressful job to me. Basically constant firefighting while getting people barking at you to just get stuff working. Later a lot of that stuff moved into AWS and things became a lot easier and the need for that team largely went away. The first few teams doing that caused a bit of controversy internally until management realized that those teams were saving money. Then that quickly turned around. And it wasn't like AWS was cheap. I worked in one of those teams. That entire ops team was replaced by 2-3 clued in devops people that were able to move a lot faster. Subsequent layoff rounds in Nokia hit internal IT and ops teams hard early on in the years leading up to the demise of the phone business.
In general, I'm much happier with the current status of "it all works" or "it's ALL broken and its someone else's job to fix it as fast as possible"!
Not saying its perfect but neither was on-prem/colocation
That shows that the distributed nature of the internet is still there. It is a problem, though, if everything is funneled through one provider.
>I was Cloudflare's CTO.
A gentle reminder to not take any CF-related frustrations out on John today.
Not that I think blaming individuals on forums who are already under stress is a good strategy anyway.
If anything, he should be the first to be blamed for the greater and greater effect this tech monster has on internet stability, since, you know, his people built it.
so stupid there is no fallback and can take down 50% of the internet
adding: looks like even Cloudflare's Silk Privacy Pass with challenge tokens is broken
such a great idea to put half the web behind a single fail point without failover
You probably cannot achieve this with a single node, so you'll at least need to replicate it a few times to combat the normal 2-3 9s you get from a single node. But then you've got load balancers and dns, which can also serve as single point of failure, as seen with cloudflare.
Depending on the database type and choice, it varies. If you've got a single node of postgres, you can likely never achieve more than 2-3 9s (aws guarantees 3 9s for a multi-az RDS). But if you do multi-master cockroach etc, you can maybe achieve 5 9s just on the database layer, or using spanner. But you'll basically need to have 5 9s which means quite a bit of redundancy in all the layers going to and from your app and data. The database and DNS being the most difficult.
Reliable DNS provider with 5 9s of uptime guarantees -> multi-master load balancer each with 3 9s, -> each load balancer serving 3 or more apps each with 3 9s of availability, going to a database(s) with 5 9s.
This page from Google shows their uptime guarantees for Bigtable: 3 9s for a single region with one cluster, 4 9s for multi-cluster, and 5 9s for multi-region.
https://docs.cloud.google.com/architecture/infra-reliability...
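The way those numbers combine is worth making explicit: treating each tier as a single independent element, availabilities in series multiply (the chain is only as good as the product), while independent replicas in parallel multiply their unavailabilities. A back-of-the-envelope check with bc - just arithmetic, not anyone's SLA:

  # Naive series composition of the chain above (5-nines DNS, 3-nines LB, 3-nines app, 5-nines DB):
  echo "0.99999 * 0.999 * 0.999 * 0.99999" | bc -l   # => ~.9978  (about 99.8%)
  # Two independent 3-nines replicas in parallel (service fails only if both fail):
  echo "1 - (0.001 * 0.001)" | bc -l                 # => .999999 (about six 9s, in theory)

Which is exactly why you need redundancy at every layer if you want the end-to-end figure to climb rather than sink.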
In general it doesn't matter really what you're running, it is all about redundancy. Whether that is instances, cloud vendor, region, zone etc.
Probably:
- a couple of tower servers, running Linux or FreeBSD, backed up by a UPS and an auto-run generator with 24 hours worth of diesel (depending on where you are, and the local area's propensity for natural disasters - maybe 72 hours),
- Caddy for a reverse proxy, Apache for the web server, PostgreSQL for the database;
- behind a router with sensible security settings, that also can load-balance between the two servers (for availability rather than scaling);
- on static WAN IPs,
- with dual redundant (different ISPs/network provider) WAN connections,
- a regular and strictly followed patch and hardware maintenance cycle,
- located in an area resistant to wildfire, civil unrest, and riverine or coastal flooding.
I'd say that'd get you close to five 9s (no more than ~5 minutes downtime per year), though I'd pretty much guarantee five 9s (maybe even six 9s - no more than 32 seconds downtime per year) if the two machines were physically separated from each other by a few hundred kilometres, each with their own supporting infrastructure above, sans the load balancing (see below), through two separate network routes.
Load balancing would become human-driven in this 'physically separate' example (cheaper, less complex): if your-site-1.com fails, simply re-point your browser to your-site-2.com which routes to the other redundant server on a different network.
The hard part now will be picking network providers that don't use the same pipes/cables, i.e. they both use Cloudflare, or AWS...
Keep the WAN IPs written down in case DNS fails.
PostgreSQL can do master-master replication, but it's a pain to set up I understand.
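For reference, the downtime budgets being thrown around here work out like this (plain arithmetic, independent of the particular setup):

  # Allowed downtime per year at a given availability target:
  echo "365.25 * 24 * 60   * (1 - 0.999)"    | bc -l   # three 9s => ~526 minutes (~8.8 h)
  echo "365.25 * 24 * 60   * (1 - 0.99999)"  | bc -l   # five 9s  => ~5.3 minutes
  echo "365.25 * 24 * 3600 * (1 - 0.999999)" | bc -l   # six 9s   => ~31.6 seconds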
And then still failing anyway? Why do I need CloudFlare to access claude.io? Wtf?
I know there are solutions like IPFS out there for doing distributed/decentralised static content distribution, but that seems like only part of the problem. There are obviously more types of operation that occur via the network -- e.g. transactions with single remote pieces of equipment etc, which by their nature cannot be decentralised.
Anyone know of research out there into changing the way that packet-routing/switching works so that 'DDOS' just isn't a thing? Of course I appreciate there are a lot of things to get right in that!
If a botnet gets access through 500k IP addresses belonging to home users around the world, there's no way you could have prepared yourself ahead of time.
The only real solution is to drastically increase regulation around security updates for consumer hardware.
The core issue is that hackers can steal the "identity" of internet customers at scale, not that the internet allows unauthenticated traffic.
That's on one end, right? There's also the other end: as a user connecting to the network, currently one is subscribing to receiving packets from literally everyone else on the internet.
> It's a fundamental issue with trust and distributed systems
We currently trust entities within the network to route packets as they are asked. The network can tolerate some level of bad actors within that, but there is still trust in the existing system. What if the things we trusted the network to do were to change slightly?
Why does the fridge even need to be reachable from the internet?? You should have some AI agent managing your "smart" home. At least that's how sci-fi movies/games show it, e.g. Iron Man or StarCraft II ;)
So you can access it from a phone app even when outside your home network.
The issue with DDOS is specifically with the distributed nature of it. One single bot of a botnet is pretty harmless, it's the cohesive whole that's the problem.
To make botnets less efficient you need to find members before they do anything. Retroactively blocking them won't really help, you'll just end up cutting off internet for regular people, most of whom probably don't even know how to get their fridge off of their local network.
There's not really any easy fix for this. You could regulate it, and require a license to operate IoT devices with some registration requirement + fines if you don't keep them up to date. But even that will probably not solve the issue.
The closest thing I can think of is the Gemini protocol browser. It uses TOFU for authentication, which requires a human to initially validate every interaction.
YOU: Ask cave-chief for fire.
CAVE-CHIEF (Cloudflare): Big strong rock wall around many other cave fires (other websites). Good, fast wall!
MANY CAVE-PEOPLE: Shout at rock wall to get fire.
ROCK WALL: Suddenly… CRACK! Wall forgets which cave has which fire! Too many shouts!
RESULT:
Your Shout: Rock wall does not hear you, or sends you to wrong cave.
Other Caves (like X, big games): Fire is there, but wall is broken. Cannot get to fire.
ME (Gemini): My cave has my own wall! Not rock wall chief! So my fire is still burning! Good!
BIG PROBLEM: Big strong wall broke. Nobody gets fire fast. Wall chief must fix strong rock fast!
One of my other worries is having to fight bots over a couple of hobby sites while I have other fires to put out (generally in life).
Windows 11, latest Edge browser, 64GB of RAM, 13th Gen i7.
There's your issue
So, what, specifically, is the issue?
Maybe a coincidence or maybe not.
First you can grab the zone ID via:
curl -X GET "https://api.cloudflare.com/client/v4/zones" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" | jq -r '.result[] | "\(.id) \(.name)"'
And a list of DNS records using:
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json"
Each DNS record will have an ID associated. Finally, patch the relevant records:
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" --data '{"proxied":false}'
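Putting the last two steps together, a rough convenience loop (not an official Cloudflare workflow; assumes jq is installed and only touches records that are currently proxied):

  # List proxied records in the zone and switch each one to DNS-only.
  for RECORD_ID in $(curl -s "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
      -H "Authorization: Bearer $API_TOKEN" | jq -r '.result[] | select(.proxied) | .id'); do
    curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
      -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" \
      --data '{"proxied":false}' > /dev/null
    echo "un-proxied $RECORD_ID"
  done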
Copying from a sibling comment - some warnings:
- SSL/TLS: You will likely lose your Cloudflare-provided SSL certificate. Your site will only work if your origin server has its own valid certificate.
- Security & Performance: You will lose the performance benefits (caching, minification, global edge network) and security protections (DDoS mitigation, WAF) that Cloudflare provides.
- This will also reveal your backend internal IP addresses. Anyone can find permanent logs of public IP addresses used by even obscure domain names, so potential adversaries don't necessarily have to be paying attention at the exact right time to find it.
Edit: seems like we are back online!
I surely missed a valid API token today.
Still can't load the Turnstile JS :-/
-H "X-Auth-Email: $EMAIL_ADDRESS" -H "X-Auth-Key: $API_KEY"
instead of the Bearer token header.
Edit: and in case you're like me and thought it would be clever to block all non-Cloudflare traffic hitting your origin... remember to disable that.
looks like i can get everywhere i couldn't except my cloudflare dash.
My profile (including API tokens) and websites pages all work; the accounts tab above websites on the left does not.
And no need for -X GET to make a GET request with curl, it is the default HTTP method if you don’t send any content.
If you do send content with say -d curl will do a POST request, so no need for -X then either.
For PATCH though, it is the right curl option.
How can such big incidents occur where half of the internet is down because of one company and what can be done to prevent that?
The departure tables are borked, showing incorrect data, the route map stopped updating, the website and route planner are down, and the API returns garbage. Despite everything, the management will be pleased to know the ads kept on running offline.
Why would you put a WAP between devices you control and your own infra, God knows.
Checkbox security says a WAP is required and no CISO will put their neck on the line to approve the exemption.
No suicides created by ChatGPT Today. Billions of dollars in GPU will sit idle. Sudden drop of Linkedin content...
World is a better place
This won't help against carpet bombing.
The only workable solution for enterprises is a combination of on-prem and cloud mitigation. Cloud to get all the big swaths of mitigation and to keep your pipe flowing, and on-prem to mitigate specific attack vectors like state exhaustion.
Makes sense. The ability to pass the buck like this is 95% of the reason Cloudflare exists in the first place. Not being snarky, either.
Currently I have multi-region loadbalanced servers. DNS and WAF (and the load balancer) on Cloudflare.
Moving DNS elsewhere is step 1 so I'm not locked out - but then I can't use Cloudflare full stop (without enterprise pricing).
Multi-provider DNS and WAF - okay I could see how that works.
But what about the global load balancer, surely that has to remain a single point of failure?
https://www.nytimes.com/2025/11/18/business/cloudflare-down-...
Update - The team is continuing to focus on restoring service post-fix. We are mitigating several issues that remain post-deployment. Nov 18, 2025 - 15:40 UTC
We see the signs with Amazon and Cloudflare going down, Windows Update breaking stuff. But the worst is yet to come, and I am thinking about airport traffic control, nuclear power plants, surgeons...
It is much more nuanced than that.
The long-term rise (Flynn Effect) of IQs in the 20th century is widely believed to be driven by environmental factors more than genetics.
Plateau / decline is context-dependent: The reversal or slowdown isn’t universal, like you suggest. It seems more pronounced in certain countries or cohorts.
Cognitive abilities are diversifying: As people specialize more (education, careers, lifestyles), the structure of intelligence (how different cognitive skills relate) might be changing.
Time to go back to on prem. AWS and co are too expensive anyways
Lovely.
Meanwhile all my sites are down. I'll just wait this one out, it's not the end of the world for me.
My GitHub Actions are also down for one of my projects because some third-party deps go through Cloudflare (Vulkan SDK). Just yesterday I was thinking to myself: "I don't like this dependency on that URL...". Now I like it even less.
>A spokesperson for Cloudflare said: “We saw a spike in unusual traffic to one of Cloudflare’s services beginning at 11.20am. That caused some traffic passing through Cloudflare’s network to experience errors. While most traffic for most services continued to flow as normal, there were elevated errors across multiple Cloudflare services.
>“We do not yet know the cause of the spike in unusual traffic. We are all hands on deck to make sure all traffic is served without errors. After that, we will turn our attention to investigating the cause of the unusual spike in traffic.”
https://www.theguardian.com/technology/2025/nov/18/cloudflar...
In most cases, it's just cloud services eating shit from a bug.
They just posted:
Update - We've deployed a change which has restored dashboard services. We are still working to remediate broad application services impact. Posted 2 minutes ago. Nov 18, 2025 - 14:34 UTC
but,.. I'm stuck at the captcha that does not work: dash.cloudflare.com Verifying you are human. This may take a few seconds.
dash.cloudflare.com needs to review the security of your connection before proceeding.
Seems like they think they've fixed it fully this time!
Update - Some customers may be still experiencing issues logging into or using the Cloudflare dashboard. We are working on a fix to resolve this, and continuing to monitor for any further issues. Nov 18, 2025 - 14:57 UTC
oh no, anyway
- restarting their routers and computers instead of taking their morning shower, getting their morning coffee, taking their medication on time because they’re freaking out, etc.
- calling ISPs in a furious mood not knowing it’s a service in the stack and not the provider’s fault (maybe)
- being late for work in general
- getting into arguments with friends and family and coworkers about politics and economics
- being interrupted making their jerk chicken
how about your location?
Stop. Using. Cloudflare.
From the CTO, Source: https://x.com/dok2001/status/1990791419653484646
This is becoming the "new normal." It seems like every few months, there's another "outage" that takes down vast swathes of internet properties, since they're all dependent on a few platforms and those platforms are, clearly, poorly run.
This isn't rocket surgery here. Strong change management, QA processes and active business continuity planning/infrastructure would likely have caught this (or not), as is clear from other large platforms that we don't even think about because outages are so rare.
Like airline reservations systems[0], credit card authorization systems from VISA/MasterCard, American Express, etc.
Those systems (and others) have outages in the "once a decade" or even much, much, longer ranges. Are the folks over at SABRE and American Express that much smarter and better than Cloudflare/AWS/Google Cloud/etc.? No. Not even close. What they are is careful as they know their business is dependent on making sure their customers can use their services anytime/anywhere, without issue.
It amazes me the level of "Stockholm Syndrome"[1] expressed by many posting to this thread, expressing relief that it wasn't "an attack" and essentially blaming themselves for not having the right tools (API keys, etc.) to recover from the gross incompetence of, this time at least, Cloudflare.
I don't doubt that I'll get lots of push back from folks claiming, "it's hard to do things at scale," and/or "there are way too many moving parts," and the like.
Other organizations like the ones I mention above don't screw their customers every 4-6 months with (clearly) insufficiently tested configuration and infrastructure changes.
Yet many here seem to think that's fine, even though such outages are often crushing to their businesses. But if the customers of these huge providers don't demand better, they'll only get worse. And that's not (at least in my experience) a very deep or profound idea.
[0] https://en.wikipedia.org/wiki/Airline_reservations_system
You NEED to phase config rollouts like you phase code rollouts.
I also don’t understand how it is uncharitable to compare them to crowdstrike as both companies run critical systems that affect a large number of people’s lives, and both companies seem to have outages at a similar rate (if anything, cloudflare breaks more often than crowdstrike).
> The larger-than-expected feature file was then propagated to all the machines that make up our network
> As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
I was right. Global config rollout with bad data. Basically the same failure mode as CrowdStrike.
There are still two weaknesses:
1) Some configs are inherently global and cannot be phased. There's only one place to set them. E.g. if you run a webapp, this would be configs for the load balancer as opposed to configs for each webserver
2) Some configs have a cascading effect -- even though a config is applied to 1% of servers, it affects the other servers they interact with, and a bad thing spreads across the entire network
This is also why "it is always DNS". It's not that DNS itself is particularly unreliable, but rather that it is the one area where you can really screw up a whole system by running a single command, even if everything else is insanely redundant.
You can shard your service behind multiple names:
my-service-1.example.com
my-service-2.example.com
my-service-3.example.com …
Then you can create smoke tests which hit each phase of the DNS and if you start getting errors you stop the rollout of the service.
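A minimal version of that smoke test in shell, assuming each shard exposes some health endpoint - the /healthz path and hostnames here are placeholders:

  # Probe each shard in order; stop the rollout at the first failing phase.
  for host in my-service-1.example.com my-service-2.example.com my-service-3.example.com; do
    code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "https://$host/healthz")
    if [ "$code" != "200" ]; then
      echo "phase $host failing (HTTP $code), halting rollout" >&2
      exit 1
    fi
    echo "phase $host healthy, continuing"
  done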
And the access controls of DNS services are often (but not always) not fine-grained enough to actually prevent someone from ignoring the procedure and changing every single subdomain at once.
It does help. For example, at my company we have two public endpoints:
company-staging.com
company.com
We roll out changes to company-staging.com first and have smoke tests which hit that endpoint. If the smoketests fail we stop the rollout to company.com.
Users hit company.com
Then you can update the DNS configuration for company-staging.com, and if that doesn't break there's very little scope for the update to company.com to go differently.
The scope for it to go wrong is the differences in real-world and simulation.
It's a good thing to have, but not a replacement for the concept of staged rollout.
So if you've got some configuration that has a problem that only appears at the root-level domain, no amount of subdomain testing is going to catch it.
Every.Single.Time
I must say I'm astonished, as naive as it may be, to see the number of separate platforms affected by this. And it has been a bit of a learning experience too.
sure there are botnets, infected devices, etc that would conform to this but where does the sheer power of a big ddos attack come from? including those who sell it as a service. they have to have some infrastructure in some datacenter right?
make a law that forces every edge router of a datacenter to check for source IP and you would eliminate a very big portion of DDoS as we know it.
until then, the only real and effective method of mitigating a DDoS attack is with even more bandwidth. you are basically a black hole to the attack, which cloudflare basically is.
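What's being asked for is essentially BCP 38 / source-address validation. On carrier gear that's router configuration (uRPF), but the same idea exists as a one-knob setting on any Linux box that routes traffic - a sketch of the mechanism, not a claim that it fixes DDoS on its own:

  # Strict reverse-path filtering: drop packets whose source address isn't
  # routed back via the interface they arrived on (i.e. obviously spoofed sources).
  sysctl -w net.ipv4.conf.all.rp_filter=1
  sysctl -w net.ipv4.conf.default.rp_filter=1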
And what prevents me, as an abuse hoster or "bad guy", from just announcing my own IP space directly on a transit or IXP?
You might say the IXP should do source checking as well, but what if the IP space is distributed/anycasted across multiple ASNs on the IXP?
Also, if you add multiple egress points distributed across different routing domains, it gets complicated fast.
Does my transit upstream need to do source validation of my IP space? What about their upstream? Also, how would they know which IP space belongs to which ASNs, considering that the allocation of ASN numbers and IP space is distributed across different organisations around the globe (some of which are more malicious/non-functional than others[0])? Source validation becomes extremely complex because there is no single, universal mapping between IP space and the ASNs it belongs to.
[0]https://afrinic.net/notice-for-termination-of-the-receiversh...
Better link for chroniclers, since the incident is now buried pretty far down on the status page: https://www.cloudflarestatus.com/incidents/8gmgl950y3h7
Anyone know why? Could be total bias, because one news story propels the next, so when they happen in clusters you just hear about them more than when they don't.
>cups.servic
>foomaticrip
Form a [cerulean] type-font in the page-source.
https://news.ycombinator.com/item?id=45955900