Interesting.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.
> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules
Warning signs like this are how you know that something might be wrong!
Yes, as they explain it's the rollback that was triggered due to seeing these errors that broke stuff.
what if we broke the internet, just a little bit, to even the score with unwrap (kidding!!)
But a more important takeaway:
> This type of code error is prevented by languages with strong type systems
It required a significant organizational failure to happen. These happen, but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is).
They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?
Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me that sounds like there's more to the story. This sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.
From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)
Also, there seems to have been insufficient testing before deployment, with very junior-level mistakes.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".
I guess those at Cloudflare are not learning anything from the previous disaster.
In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.
That isn't to say it didn't work out badly this time, just that the calculation is a bit different.
I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.
“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”
Come on.
This PM raises more questions than it answers, such as why exactly China would have been immune.
I have seen similar bugs in the Cloudflare API recently as well.
There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
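The ordering problem described above, checking entitlement as the last step instead of the first, can be sketched as follows. This is a hypothetical handler, not Cloudflare's actual API code; `Plan`, `User`, and `enterprise_endpoint` are made-up names for illustration:

```rust
// Hypothetical sketch: gate on the plan before doing any work,
// so non-enterprise callers are rejected up front rather than at the last step.
#[derive(PartialEq)]
enum Plan { Free, Pro, Enterprise }

struct User { plan: Plan }

fn enterprise_endpoint(user: &User) -> Result<&'static str, &'static str> {
    // Entitlement check first: nothing expensive or state-changing
    // runs for callers who would be rejected anyway.
    if user.plan != Plan::Enterprise {
        return Err("enterprise plan required");
    }
    // ... feature logic would run here ...
    Ok("feature response")
}
```

Doing the check last means the handler may already have performed side effects (or leaked behavior differences) before the rejection, which is exactly the class of bug the comment describes.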
> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.
Why is the Next.js limit 1 MB? It's not enough for uploading user generated content (photographs, scanned invoices), but a 1 MB request body for even multiple JSON API calls is ridiculous. These frameworks need to at least provide some pushback to unoptimized development, even if it's just a lower default request body limit. Otherwise all web applications will become as slow as the MS office suite or reddit.
So, if anything, their efforts towards a typed language were justified. They just didn't manage to migrate everything in time before this incident - which is ironically a good thing, since this incident was caused mostly by a rushed change in response to an actively exploited vulnerability.
For DDoS protection you can't really rely on multiple-hour rollouts.
I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.
HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.
At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kinds of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.
Say what you want, but I'd prefer to trust CloudFlare, who admits and acts upon its fuckups, rather than a provider that covers them up or downplays them like some other major clouds do.
@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems
You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.
I don't think they do these things on purpose. Of course, given their market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDoS protection system (or WAF) you don't want, or have the luxury, to wait for days until your rule is applied.
This childish nonsense needs to end.
Ops are heavily rewarded because they're supposed to be responsible. If they're not then the associated rewards for it need to stop as well.
I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.
That's why their salaries are so high.
In this particular case, they seem to be doing two things:

- Phasing out the old proxy (Lua based), which is replaced by FL2 (Rust based, the one that caused the previous incident)
- Reacting to an actively exploited vulnerability in React by deploying WAF rules

and they're doing them in a relatively careful way (testing rules first) to avoid fuckups, which caused this unknown state, which triggered the issue.
That's not deserving of sympathy.
https://blog.cloudflare.com/5-december-2025-outage/#what-abo...
Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise the straightforward error would have been detected immediately. And their implied solution seems to be not testing their code when written, or even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.
I've worked at one of the top fintech firms; whenever we did a config change or deployment, we were supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.
I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).
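The "monitor, then roll back" gate described above can be sketched as a simple post-deploy watch loop. This is a hypothetical illustration: the metric source, the number of checks, and the threshold are all made up, not Cloudflare's or any firm's actual process:

```rust
// Hypothetical sketch of a post-deploy gate: poll a dashboard metric
// for a watch window; any spike means the caller should roll back.
fn post_deploy_gate<F: Fn() -> f64>(error_rate: F, checks: u32, threshold: f64) -> bool {
    for _ in 0..checks {
        if error_rate() > threshold {
            return false; // spike detected: roll back immediately
        }
        // in a real system, sleep between polls (e.g. std::thread::sleep)
    }
    true // metric stayed healthy for the whole window: keep the deploy
}
```

The point of the comment is that the spike shows up within the first minute, so a gate like this bounds the blast radius of a bad change to the watch interval, not to however long it takes a human to notice.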
Since attackers might rotate IPs more frequently than once per minute, this effectively means that the whole fleet of servers should be able to quickly react depending on the decisions done centrally.
"Why?"
"I've just been transferred to the Cloudflare outage explanation department."
Doesn't Cloudflare rigorously test their changes before deployment to make sure that this does not happen again? This better not have been used to cover for the fact that they are using AI to fix issues like this one.
There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all, and I expected Cloudflare to learn from the previous outage very quickly.
But this is becoming quite a pattern; we might need to consider putting their unreliability next to GitHub's (which goes down every week).
Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?
> This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.
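The contrast the postmortem draws can be sketched in a few lines. In Lua, indexing a field on a `nil` value throws a runtime exception exactly when the bad config arrives in production; in Rust, a lookup that may fail returns an `Option`, and the compiler rejects any code path that uses the value without handling the missing case. The field name and fallback below are hypothetical, not Cloudflare's actual code:

```rust
use std::collections::HashMap;

// Hypothetical rules-config lookup; "buffer_size" and the fallback
// value are illustrative only.
fn buffer_limit(config: &HashMap<String, u32>) -> u32 {
    // `get` returns Option<&u32>; code that ignores the None case
    // simply does not compile.
    match config.get("buffer_size") {
        Some(&limit) => limit,
        None => 128 * 1024, // explicit fallback instead of a runtime exception
    }
}

fn main() {
    let config: HashMap<String, u32> = HashMap::new(); // empty: key is missing
    // The Lua equivalent of reading a missing field would throw here;
    // in Rust the absent key is handled as an ordinary value.
    println!("{}", buffer_limit(&config)); // prints 131072
}
```

This doesn't make the logic correct, but it moves "field unexpectedly absent" from a production exception to a compile-time obligation, which is the specific claim in the quoted sentence.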