I guess now we should start using a completely different provider as a DNS backup. Maybe 8.8.8.8 or 9.9.9.9.
1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere. Indeed right now 1.1.1.1 from my laptop goes via 141.101.71.63 and 1.0.0.1 via 141.101.71.121, which are both hosts on the same LINX/LON1 peer but presumably from different routers, so there is some resilience there.
Given DNS is about the easiest thing to avoid a single point of failure on I'm not sure why you would put all your eggs in a single company, but that seems to be the modern internet - centralisation over resilience because resilience is somehow deemed to be hard.
I guess. I wouldn't have thought it worthwhile for 4 chars, but yes.
> 1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere.
I thought anycast gave them that on a single IP, though perhaps this is even more resilient?
[0] https://man7.org/linux/man-pages/man3/inet_aton.3.html#DESCR...
That said, it's a good idea to specifically pick multiple resolvers in different regions, on different backbones, using different providers, and not use an Anycast address, because Anycast can get a little weird. However, this can lead to hard-to-troubleshoot issues, because DNS doesn't always behave the way you expect.
And the closest resolving proxy DNS server for most of my machines is listening on their loopback interface. The closest such machine happens to be about 1m away, so is beaten out of first place by centimetres. (-:
It's a shame that Microsoft arbitrarily ties such functionality to the Server flavour of Windows, and does not supply it on the Workstation flavour, but other operating systems are not so artificially limited or helpless; and even novice users on such systems can get a working proxy DNS server out of the box that their sysops don't actually have to touch.
The idea that one has to rely upon an ISP, or even upon CloudFlare and Google and Quad9, for this stuff is a bit of a marketing tale that is put about by these self-same ISPs and CloudFlare and Google and Quad9. Not relying upon them is not actually limited to people who are skilled in system operation, i.e. who they are; but rather merely limited by what people run: black box "smart" tellies and whatnot, and the Workstation flavour of Microsoft Windows. Even for such machines, there's the option of a decent quality router/gateway or simply a small box providing proxy DNS on the LAN.
In my case, said small box is roughly the size of my hand and is smaller than my mass-market SOHO router/gateway. (-:
My Pi-holes both use OpenDNS, Quad9, and CloudFlare for upstream.
Most of my devices use both of my Pi-holes.
I did this for a while, but ~300ms hangs on every DNS resolution sure do get old fast.
As mentioned in other comments, do it yourself if you are not happy with the stability. Or just pay someone to provide it - like your ISP.
And TBH I trust my local ISP more than Google or CF. Not in terms of availability, but because it's covered by my local legislation. That's a huge difference - in a positive way.
I don't think this is fair when discussing infrastructure. It's reasonable to complain about potholes, undrinkable tap water, long lines at the DMV, cracked (or nonexistent) sidewalks, etc. The internet is infrastructure and DNS resolution is a critical part of it. That it hasn't been nationalized doesn't change the fact that it's infrastructure (and access absolutely should be free) and therefore everyone should feel free to complain about it not working correctly.
"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
DNS is infrastructure. But "Cloudflare Public Free DNS Resolver" is not, it's just a convenience and a product to collect data.
(This isn't a major concern, of course; and I mention it just to extend your argument yet further. The major gain of a private root content DNS server is that the fraction of really stupid nonsense DNS traffic that comes about because of various things gets filtered out either on-machine or at least without crossing a border router. The gains are in security and privacy more than uptime.)
But unlike tap water, there are a lot of different free DNS resolvers that can be used.
And I don't see how my taxes funded CF's DNS service. But my ISP fee covers their DNS resolving setup. That's the reason why I wrote
> a service that's free of charge
Which CF is.
which might not be a good thing in some jurisdictions - see the porn block in the UK (it's done via DNS IIRC, and trivially bypassed with a third-party DNS like Cloudflare's).
When a DNS resolver is down, it affects everything, so 100% uptime is a fair expectation, hence redundancy. Looks like both 1.0.0.1 and 1.1.1.1 were down for more than an hour, which is pretty bad TBH, especially when you advise global usage.
The RCA is not detailed and feels like the kind of marketing stunt we are now getting every other week.
> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.
Interesting, I was affected by this yesterday. My router (supposedly) had Cloudflare DoH enabled but nothing would resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
Let's Encrypt is trialling IP address HTTPS/TLS certificates right now:
https://letsencrypt.org/2025/07/01/issuing-our-first-ip-addr...
They say:
"In principle, there’s no reason that a certificate couldn’t be issued for an IP address rather than a domain name, and in fact the technical and policy standards for certificates have always allowed this, with a handful of certificate authorities offering this service on a small scale."
So certs were often tied to identity, which an IP really isn't, so few providers offered them.
DigiCert does. That is where 1.1.1.1 and 9.9.9.9 get their valid certificates from.
Your operating system can validate the DoH server's IP address by using the Subject Alternative Name (SAN) field within the certificate presented by the DoH server: https://g.co/gemini/share/40af4514cb6e
Note that this introduces one query overhead per DNS request if the previous cache has expired. For this reason, I've been using https://1.1.1.1/dns-query instead.
In theory, this should eliminate that overhead; the operating system can still validate the IP via the SAN field of the certificate presented by the DoH server.
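If you're curious, you can inspect those SAN entries yourself with openssl (a sketch; the -ext flag needs OpenSSL 1.1.1 or newer):

  openssl s_client -connect 1.1.1.1:443 </dev/null 2>/dev/null \
    | openssl x509 -noout -ext subjectAltName

The output should list IP entries (1.1.1.1, 1.0.0.1 and the IPv6 addresses) alongside the DNS names, which is what lets a client connecting by bare IP still validate the certificate.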
TL;DR: DoH was working.
It’s corporate newspeak. “Legacy” isn’t a clear term; it’s used to abstract and obfuscate.
> Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.
I know what this means, but there’s absolutely no reason for it to be written in this inscrutable corporatese.
It's carefully written so my boss's boss thinks he understands it, and that we cannot possibly have that problem because we obviously don't have any "legacy components" because we are "modern and progressive".
It is, in my opinion, closer to "intentionally misleading corporatese".
Or they have a different definition of impact than I do
I will not say whether or not it’s acceptable for a company of their size and maturity, but it’s definitely not hidden in corporate lingo.
I do believe they could have elaborated more on the follow-up steps they will take to prevent this from happening again. I don’t think staggered rollouts are the only answer to this; they’re just a safety net.
Maybe there is a noticeable difference?
I have seen more outage incident reports from Cloudflare than from Google, but this is just a personal anecdote.
For me cloudflare 1.1.1.1 and 1.0.0.1 have a mean response time of 15.5ms over the last 3 months, 8.8.8.8 and 8.8.4.4 are 15.0ms, and 9.9.9.9 is 13.8ms.
All of those servers return over 3-nines of uptime when quantised in the "worst result in a given 1 minute bucket" from my monitoring points, which seem fine to have in your mix of upstream providers. Personally I'd never rely on a single provider. Google gets 4 nines, but that's only over 90 days so I wouldn't draw any long term conclusions.
Last 30 days, 8.8.8.8 has 99.99% uptime vs 1.1.1.1 has 99.09%
EDIT: Appears I was wrong, it is failover not round-robin between the primary and secondary DNS servers. Thus, using 1.1.1.1 and 8.8.8.8 makes sense.
If you have a more advanced local resolver of some sort (systemd for example) you can configure whatever behaviour you want.
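As a rough illustration of that kind of setup (my own sketch, using unbound rather than the systemd example above): unbound keeps round-trip-time estimates for each upstream and steers around servers that stop answering.

  # sketch: forward everything to several upstreams; unbound prefers
  # whichever answers fastest and skips servers that time out
  forward-zone:
      name: "."
      forward-addr: 1.1.1.1
      forward-addr: 8.8.8.8
      forward-addr: 9.9.9.9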
Let's say you've got a metric aggregation service, and that service crashes.
What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.
Most orchestration systems take a little while to redeploy in this case, on the assumption that it could be a temporary outage of the node (like a network blip of some sort).
Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.
What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.
I know you often can differentiate between no data and real drops, but the overall point, that "if you page people constantly, people will quit", is I think the important one. If people keep getting paged by too-tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
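To make that concrete, the compromise often ends up looking something like this Prometheus-style alert rule (all names here are made up for illustration):

  groups:
    - name: metrics-pipeline                       # hypothetical group name
      rules:
        - alert: MetricsAggregatorDown
          expr: up{job="metrics-aggregator"} == 0  # hypothetical job label
          for: 5m  # ride out a reschedule/network blip before paging anyone at 2am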
Step 1: You start out with the founders being on call 24x7x365 or people in the first 10 or 20 hires "carry the pager" on weekends and evenings and your entire company is doing unpaid rostered on call.
Step 2: You steal all the underwear.
Step 3: You have follow-the-sun office-hours support staff teams distributed around the globe with sufficient coverage for vacations and unexpected illness or resignations.
<google google google>
"Original air date: December 16, 1998"
Oh, right. Half of you weren't even born... Now I feel ooooooold.
Now without crying: I saw multiple big companies getting rid of the NOC and replacing it with on-call duty in multiple focused teams. Instead of 12 people sitting 24/7 in groups of 4 and doing some basic analysis and steps before calling others - you page the correct people in 3-5 minutes, with an exact and specific alert.
Incident resolution times went down greatly (2-10x, depending on the company), people don’t have to sit overnight sleeping most of the time, and no stupid actions like a service restart are taken that slow down incident resolution.
And I don’t like that some platforms hire 1500 people for a job that could be done with 50-100, but in terms of incident response - if you already have teams with separated responsibilities, then a NOC is "legacy".
Before you fire a quick alarm, check that the node is up, check that the service is up etc.
Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction will be "let's wait and see if it fixes itself".
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
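For reference, the Nagios/Icinga-style settings mentioned above look roughly like this in a service definition (host and command names are placeholders):

  define service {
      host_name             resolver-probe-01  ; placeholder host
      service_description   DNS resolution
      check_command         check_dns          ; assumes a check_dns command is defined
      check_interval        1                  ; minutes between normal checks
      retry_interval        1                  ; minutes between re-checks after a failure
      max_check_attempts    3                  ; only alert after 3 consecutive failures
  }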
Were it a much smaller service than 1.1.1.1 itself, taking longer than a minute to alarm probably wouldn’t surprise me, but this is 1.1.1.1; they’re dealing with vast amounts of probably fairly consistent traffic.
Thing is, it's probably still some engineering effort, and most orgs only really improve their monitoring after it turned out to be sub-optimal.
They have a rather significant vested interest in it being reliable.
Say what now? A test triggered a global production change?
> Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.
You have a process that allows some other service to just hoover up address routes already in use in production by a different service?
It is designed to be used in conjunction with 1.0.0.1. DNS has fault tolerance built in.
Did 1.0.0.1 go down too? If so, why were they on the same infrastructure?
This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole point is that it can go down at any time and everything keeps working.
Shouldn’t the fix be to ensure that these are served out of completely independent silos and update all docs to make sure anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
If I ran a service like this I would regularly do blackouts or brownouts on the primary to make sure that people’s resolvers are configured correctly. Nobody should be using a single IP as a point of failure for their internet access/browsing.
Not sure what the "advantage" of stub resolvers is in 2025 for anything.
Don't you normally have 2 DNS servers listed on any device? So was the second also down, and if not, why didn't it fail over to that?
I would count not configuring at least two as 'user error'. Many systems require you to enter a primary and alternate server in order to save a configuration.
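On a plain Linux box, that just means something like this in /etc/resolv.conf (the timeout/attempts options are illustrative, not required):

  nameserver 1.1.1.1
  nameserver 8.8.8.8
  options timeout:2 attempts:2

With that, the glibc stub resolver moves on to the second server after roughly two seconds if the first one doesn't answer.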
Btw, I really don't understand why it does not accept an IP (1.1.1.1), so you have to give a hostname (one.one.one.one). It would be more sensible to configure a DNS server from an IP rather than from a hostname that must itself be resolved by a DNS server :/
Plain DNS can normally be changed in your connection settings for a given connection on most flavours of Android.
Yes, sorry, I did not mention it.
So if you want to use DNS over HTTPS on Android, it is not possible to provide a fallback.
Not true. If the (DoH) host has multiple A/AAAA records (multiple IPs), any decent DoH client would retry its requests over multiple or all of those IPs.
If your device doesn't support proper failover, use a local DNS forwarder on your router or an external one.
In Switzerland I would use Init7 (an ISP that doesn't filter) -> Quad9 (unfiltered version) -> dns0.eu (unfiltered version).
But I understand why Cloudflare can’t just say “use 8.8.8.8 as your backup”.
Which means that you’d be on Cloudflare half the time and on Google half the time, which may not be what you wanted.
It would be interesting to see the service level objective (SLO) that Cloudflare internally has for this service.
I've found https://www.cloudflare.com/r2-service-level-agreement/ but this seems to be for paid services, so this outage would put July in the "< 99.9% but >= 99.0%" bucket, so you'd get a 10% refund for the month if you paid for it.
Not sure how Cloudflare keeps struggling with issues like these; this isn't the first (and probably won't be the last) time they have these 'simple', 'deprecated', 'legacy' issues occurring.
8.8.8.8+8.8.4.4 hasn't had a global(1) second of downtime for almost a decade.
1: localized issues did exist, but that's really the fault of the internet, and they did remain running when Google itself suffered severe downtime in various different services.
European users might prefer one of the alternatives listed at https://european-alternatives.eu/category/public-dns over US corporations subject to the CLOUD act.
If there were some way to view torrenting traffic, no doubt there'd be a 20 minute slump.
I know.
Secondary DNS is supposed to be in an independent network to avoid precisely this.
The theory is CF had the capacity to soak up the junk traffic without negatively impacting their network.
all-servers
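# all-servers: send each query to every server= upstream in parallel and use the first reply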
server=8.8.8.8
server=9.9.9.9
server=1.1.1.1
If you were using systemd-resolved however, it retries all servers in the order they were specified, so it's important to interleave upstreams.
Using the servers in the above example, and assuming IPv4 + IPv6:
1.1.1.1
2001:4860:4860::8888
9.9.9.9
2606:4700:4700::1111
8.8.8.8
2620:fe::fe
1.0.0.1
2001:4860:4860::8844
149.112.112.112
2606:4700:4700::1001
8.8.4.4
2620:fe::9
will fail over faster and more successfully on systemd-resolved than if you specify all the Cloudflare IPs together, then all the Google IPs, etc. Also note that Quad9 is filtering by default on this IP while the other two are not, so you could get intermittent differences in resolution behavior. If this is a problem, don't mix filtered and unfiltered resolvers. You definitely shouldn't mix DNSSEC-validating and non-DNSSEC-validating resolvers if you care about that.
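For reference, the interleaved ordering above could be written into /etc/systemd/resolved.conf roughly like this (a sketch; restart systemd-resolved after editing):

  [Resolve]
  DNS=1.1.1.1 2001:4860:4860::8888 9.9.9.9 2606:4700:4700::1111 8.8.8.8 2620:fe::fe 1.0.0.1 2001:4860:4860::8844 149.112.112.112 2606:4700:4700::1001 8.8.4.4 2620:fe::9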
I recently started using the "luci-app-https-dns-proxy" package on OpenWrt, which is preconfigured to use both Cloudflare and Google DNS, and since DoH was mostly unaffected, I didn't notice an outage. (Though if DoH had been affected, it presumably would have failed over to Google DNS anyway.)
Anecdotally, I figured out their DNS was broken before it hit their status page and switched my upstream DNS over to Google. Haven't gotten around to switching back yet.
And it’s not a conspiracy theory - it was very suspicious when we did some testing on a small, aware group. The traffic didn’t look like it was being handled anonymously on Google’s side.
Although, perhaps, having an external VPS with a DNS proxy could be a good middle ground?
https://developers.cloudflare.com/1.1.1.1/faq/#does-1111-sen...
I've also changed to 9.9.9.9 and 8.8.8.8 after using 1.1.1.1 for several years because connectivity here is not very good, and being connected to the wrong data center means RTT in excess of 300 ms. Makes the web very sluggish.
Clients cache DNS resolutions to avoid having to do that request each time they send a request. It's plausible that some clients held on to their cache for a significant period.
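You can see how long a resolver is telling clients to cache a given answer with dig; the TTL is the second field on each line of the answer section:

  dig @1.1.1.1 example.com +noall +answer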