Cloudflare outage on December 5, 2025

https://blog.cloudflare.com/5-december-2025-outage/
135•meetpateltech•1h ago

Comments

jpeter•1h ago
Unwrap() strikes again
throwawaymaths•1h ago
this time in lua. cloudflare can't catch a break
RoyTyrell•59m ago
Or they're not thoroughly testing changes before pushing them out. As I've seen some others say, CloudFlare at this point should be considered critical infrastructure. Maybe not like power but dang close.
gcau•59m ago
The 'rewrite it in lua' crowd are oddly silent now.
barbazoo•49m ago
How do you know?
rvz•35m ago
Time to use boring languages such as Java and Go.
dap•55m ago
I guess you’re being facetious but for those who didn’t click through:

> This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

skywhopper•45m ago
That bit may be true, but the underlying error of a null reference that caused a panic was exactly the same in both incidents.
barbazoo•1h ago
> Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.

Interesting.

flaminHotSpeedo•51m ago
They kinda buried the lede there: a 28% failure rate for 100% of customers isn't the same as a 100% failure rate for 28% of customers.
Scaevolus•1h ago
> Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules

Warning signs like this are how you know that something might be wrong!
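
A hedged sketch of what such a correlation guard could look like, in Rust, with entirely hypothetical names and thresholds (nothing here is Cloudflare's actual tooling): compare the error rate just before and just after a config version becomes active, and hold the version back if the rate spikes.

    // Hypothetical sketch: stop propagating a config version if the error
    // rate jumps right after it becomes active.
    struct Window {
        requests: u64,
        errors: u64,
    }

    impl Window {
        fn error_rate(&self) -> f64 {
            if self.requests == 0 {
                0.0
            } else {
                self.errors as f64 / self.requests as f64
            }
        }
    }

    fn should_halt_rollout(before: &Window, after: &Window, max_increase: f64) -> bool {
        after.error_rate() > before.error_rate() + max_increase
    }

    fn main() {
        let before = Window { requests: 100_000, errors: 50 };    // ~0.05%
        let after = Window { requests: 100_000, errors: 28_000 }; // ~28%, as in this incident
        // With any sane threshold, the new config version is held back.
        assert!(should_halt_rollout(&before, &after, 0.01));
    }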

philipwhiuk•37m ago
> Warning signs like this are how you know that something might be wrong!

Yes, as they explain it's the rollback that was triggered due to seeing these errors that broke stuff.

kachapopopow•59m ago
why does this seem oddly familiar (fail-closed logic)
dematz•59m ago
>This code expects that, if the ruleset has action=”execute”, the “rule_result.execute” object will exist. However, because the rule had been skipped, the rule_result.execute object did not exist, and Lua returned an error due to attempting to look up a value in a nil value.

what if we broke the internet, just a little bit, to even the score with unwrap (kidding!!)
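
To make the contrast concrete, here is a minimal Rust sketch with hypothetical type and field names (only the action/"execute" relationship comes from the post): the "rule was skipped" case is an Option the compiler forces you to handle, whereas the Lua code indexed the missing field unconditionally and hit "attempt to index a nil value" at runtime.

    // Hypothetical sketch, not Cloudflare's code: the execute result only
    // exists when the rule actually ran, so it is modeled as an Option.
    struct ExecuteResult {
        ruleset_id: String,
    }

    struct RuleResult {
        // None when the rule was skipped by an earlier action.
        execute: Option<ExecuteResult>,
    }

    fn handle(action: &str, rule_result: &RuleResult) {
        if action == "execute" {
            // The type system forces a decision about the skipped case.
            match &rule_result.execute {
                Some(exec) => println!("executing ruleset {}", exec.ruleset_id),
                None => println!("rule was skipped, nothing to execute"),
            }
        }
    }

    fn main() {
        handle("execute", &RuleResult { execute: None });
    }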

xnorswap•57m ago
My understanding, paraphrased: "In order to gradually roll out one change, we had to globally push a different configuration change, which broke everything at once".

But a more important takeaway:

> This type of code error is prevented by languages with strong type systems

debugnik•53m ago
Prevented unless they assert the wrong invariant at runtime like they did last time.
jsnell•51m ago
That's a bizarre takeaway for them to suggest, when they had exactly the same kind of bug with Rust like three weeks ago. (In both cases they had code implicitly expecting results to be available. When the results weren't available, they terminated processing of the request with an exception-like mechanism. And then they had the upstream services fail closed, despite the failing requests being to optional sidecars rather than on the critical query path.)
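
A minimal sketch of the fail-open vs fail-closed distinction being described, in Rust with hypothetical function names (not Cloudflare's code): the question is whether an error from an optional sidecar aborts the whole request or is logged and ignored.

    #[derive(Debug)]
    struct SidecarError;

    // Hypothetical optional sidecar call, e.g. an internal rule-testing tool.
    fn optional_sidecar_check(_body: &[u8]) -> Result<(), SidecarError> {
        Err(SidecarError) // pretend the sidecar is currently failing
    }

    // Fail-closed: the sidecar error propagates and the request fails.
    fn handle_fail_closed(body: &[u8]) -> Result<(), SidecarError> {
        optional_sidecar_check(body)?;
        Ok(()) // ...serve the request
    }

    // Fail-open: the sidecar error is logged and the request is still served.
    fn handle_fail_open(body: &[u8]) -> Result<(), SidecarError> {
        if let Err(e) = optional_sidecar_check(body) {
            eprintln!("optional sidecar failed, continuing: {:?}", e);
        }
        Ok(()) // ...serve the request
    }

    fn main() {
        assert!(handle_fail_closed(b"payload").is_err());
        assert!(handle_fail_open(b"payload").is_ok());
    }
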
littlestymaar•40m ago
In fairness, the previous bug (with the Rust unwrap) should never have happened: someone explicitly called the panicking function, the review didn't catch it and the CI didn't catch it.

It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)

greatgib•31m ago
The issue also would not have happened if someone had written the right code and tests, and the review or CI had caught it...
skywhopper•48m ago
This is the exact same type of error that happened in their Rust code last time. Strong type systems don’t protect you from lazy programming.
flaminHotSpeedo•55m ago
What's the culture like at Cloudflare re: ops/deployment safety?

They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?

Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Pure speculation, but to me it sounds like there's more to the story; this reads like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.

deadbabe•53m ago
As usual, Cloudflare is the man in the arena.
samrus•45m ago
There are other men in the arena who aren't tripping on their own feet
usrnm•39m ago
Like who? Which large tech company doesn't have outages?
k8sToGo•34m ago
It's not about outages. It's about the why. Hardware can fail. Bugs can happen. But to continue a rollout despite warning signs and without understanding the cause and impact is on another level. Especially if it is related to the same problem as last time.
k__•33m ago
"tripping on their own feet" == "not rolling back"
this_user•40m ago
The question is perhaps what the shape and status of their tech stack is. Obviously, they are running at massive scale, and they have grown extremely aggressively over the years. What's more, especially over the last few years, they have been adding new product after new product. How much tech debt have they accumulated with that "move fast" approach that is now starting to rear its head?
nine_k•38m ago
> more to the story

From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)

lukeasrodgers•36m ago
Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.
rvz•30m ago
> Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Also, there seems to have been insufficient testing before deployment, with very junior-level mistakes.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".

I guess those at Cloudflare are not learning anything from the previous disaster.

dkyc•28m ago
One thing to keep in mind when judging what's 'appropriate' is that Cloudflare was effectively responding to an ongoing security incident outside of their control (the React Server RCE vulnerability). Part of Cloudflare's value proposition is being quick to react to such threats. That changes the equation a bit: any hour you wait longer to deploy, your customers are actively getting hacked through a known high-severity vulnerability.

In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.

That isn't to say it didn't work out badly this time, just that the calculation is a bit different.

liampulles•27m ago
Rollback is a reliable strategy when the rollback process is well understood. If a rollback process is not well known and well experienced, then it is a risk in itself.

I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.

otterley•21m ago
From the post:

“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”

NoSalt•10m ago
Ooh ... I want to be on a cowboy decision making team!!!
fidotron•54m ago
> This change was being rolled out using our gradual deployment system, and, as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules. As this was an internal tool, and the fix being rolled out was a security improvement, we decided to disable the tool for the time being as it was not required to serve or protect customer traffic.

Come on.

This PM raises more questions than it answers, such as why exactly China would have been immune.

skywhopper•46m ago
China is probably a completely separate partition of their network.
fidotron•45m ago
One that doesn't get proactive security rollouts, it would seem.
skywhopper•36m ago
I assume it was next on the checklist, or assigned to a different ops team.
miyuru•54m ago
What's going on with Cloudflare's software team?

I have seen similar bugs in cloudflare API recently as well.

There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.

antiloper•52m ago
Make faster websites:

> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why is the Next.js limit 1 MB? It's not enough for uploading user-generated content (photographs, scanned invoices), but a 1 MB request body for even multiple JSON API calls is ridiculous. These frameworks need to at least provide some pushback against unoptimized development, even if it's just a lower default request body limit. Otherwise all web applications will become as slow as the MS Office suite or Reddit.

ramon156•36m ago
The update was to raise it to 3MB (10MB for paid plans)
AmazingTurtle•34m ago
a) They serialize tons of data into requests.
b) Headers. Mostly cookies. They are a thing. They are being abused all over the world by newbies.
websiteapi•51m ago
I wonder why they can't do a partial rollout. Like the other outage, they had to do a global rollout.
usrnm•48m ago
I really don't see how it would've helped. In Go or Rust you'd just get a panic, which is in no way different.
denysvitali•46m ago
The article mentions that this Lua-based proxy is the old-generation one, which is going to be replaced by the Rust-based one (FL2), and that one didn't fail in this scenario.

So, if anything, their efforts towards a typed language were justified. They just didn't manage to migrate everything in time before this incident - which is ironically a good thing, since this incident was caused mostly by a rushed change in response to an actively exploited vulnerability.

websiteapi•35m ago
Yes, but as the article states, why are they doing fast global rollouts?
denysvitali•25m ago
I think (and would love to be corrected) that this is the nature of their service. They probably push multiple config changes per minute to mitigate DDoS attacks. For sure the proxies have a local list of IPs that are blacklisted for a period of time.

For DDoS protection you can't really rely on multi-hour rollouts.
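
A hedged sketch of the split being suggested, with entirely hypothetical types (not Cloudflare's design): time-critical data such as an IP blocklist gets pushed globally within seconds, while behavior-changing configuration goes through a staged rollout.

    use std::collections::HashSet;

    // Hypothetical sketch: two classes of config change with different
    // propagation strategies.
    enum ConfigChange {
        // Pure data, safe to push to the whole fleet within seconds.
        BlocklistAdd(HashSet<String>),
        // Changes proxy behavior; staged canary rollout instead.
        FeatureToggle { name: String, enabled: bool },
    }

    fn propagate(change: ConfigChange) {
        match change {
            ConfigChange::BlocklistAdd(ips) => {
                println!("pushing {} IPs globally, immediately", ips.len());
            }
            ConfigChange::FeatureToggle { name, enabled } => {
                println!("canarying toggle {name}={enabled} to a small slice of the fleet first");
            }
        }
    }

    fn main() {
        propagate(ConfigChange::BlocklistAdd(HashSet::from(["203.0.113.7".to_string()])));
        propagate(ConfigChange::FeatureToggle { name: "larger_waf_buffer".into(), enabled: true });
    }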

snafeau•51m ago
A lot of these kinds of bugs feel like they could be caught by a simple review bot like Greptile... I wonder if Cloudflare uses an equivalent tool internally?
nkmnz•29m ago
What makes greptile a better choice compared to claude code or codex, in your opinion?
denysvitali•51m ago
Ironically, this time around the issue was in the proxy they're going to phase out (and replace with the Rust one).

I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.

HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.

At the same time, I can't think of a company more transparent than Cloudflare when it comes to these kinds of things. I also understand the urgency behind this change: Cloudflare acted (too) fast to mitigate the React vulnerability and this is the result.

Say what you want, but I'd prefer to trust a Cloudflare that admits and acts upon their fuckups, rather than one that tries to cover them up or downplay them like some other major cloud providers.

@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems

trashburger•44m ago
I would very much like for him not to ignore the negativity, given that, you know, they are breaking the entire fucking Internet every time something like this happens.
denysvitali•39m ago
This is the kind of comment I wish he would ignore.

You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.

I don't think they do these things on purpose. Of course, given their good market penetration, they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that with a DDoS protection system (or WAF) you don't want, or have the luxury, to wait for days until your rule is applied.

beanjuiceII•18m ago
I hope he doesn't ignore it; the internet has been forgiving enough toward Cloudflare's string of failures. It's getting pretty old, and it creates a ton of chaos. I work with life-saving devices, and being impacted in any way in data monitoring has a huge impact in many ways. "Sorry ma'am, we can't give you your child's T1D readings on your Follow app because our provider decided to break everything in the pursuit of some React bug" has a great ring to it.
fidotron•38m ago
> HugOps

This childish nonsense needs to end.

Ops are heavily rewarded because they're supposed to be responsible. If they're not then the associated rewards for it need to stop as well.

denysvitali•32m ago
I have never seen an Ops team being rewarded for avoiding incidents (by focusing on tech debt reduction); instead they get the opposite - blamed when things go wrong.

I think it's human nature (it's hard to notice something is going well until it breaks), but it still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.

fidotron•29m ago
> I have never seen an Ops team being rewarded for avoiding incidents

That's why their salaries are so high.

denysvitali•21m ago
Depending on the tech debt, the ops team might just be in "survival mode" and not have the time to fix every single issue.

In this particular case, they seem to be doing two things:

- Phasing out the old proxy (Lua-based), which is replaced by FL2 (Rust-based, the one that caused the previous incident)
- Reacting to an actively exploited vulnerability in React by deploying WAF rules

and they're doing them in a relatively careful way (test rules) to avoid fuckups, which caused this unknown state, which triggered the issue.

fidotron•17m ago
They deliberately ignored an internal tool that started erroring out at the given deployment and rolled it out anyway without further investigation.

That's not deserving of sympathy.

da_grift_shift•37m ago
[ Removed by Reddit ]
denysvitali•31m ago
Wow. The three comments below parent really show how toxic HN has become.
beanjuiceII•17m ago
Being angry about something doesn't make it toxic; people have a right to be upset.
denysvitali•9m ago
The comment, before the edit, was what I would consider toxic. No wonder it has been edited.

It's fine to be upset, and especially rightfully so after the second outage in less than 30 days, but this doesn't justify toxicity.

gkoz•50m ago
I sometimes feel we'd be better off without all the paternalistic kitchen-sink features. The solid, properly engineered features used intentionally aren't causing these outages.
ilkkao•41m ago
Agreed, I don't really like Cloudflare trying to magically fix every web exploit there is in frameworks my site has never used.
da_grift_shift•50m ago
It's not an outage, it's an Availability Incident™.

https://blog.cloudflare.com/5-december-2025-outage/#what-abo...

lapcat•49m ago
> This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise this straightforward error would have been detected immediately. And their implied solution seems to be not testing their code when it's written, or adding 100% code coverage after the fact, but relying on a programming language to bail them out and cover up their failure to test.

paradite•48m ago
The deployment pattern from Cloudflare looks insane to me.

I've worked at one of the top fintech firms; whenever we do a config change or deployment, we are supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.

The dashboards, covering the systems and key business metrics that would be affected by the deployment, need to be prepared beforehand and reviewed by teammates.

I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
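
For illustration, a minimal Rust sketch of that kind of guardrail, with hypothetical names, metrics, and thresholds (not any particular company's tooling): deploy, watch a key metric for a fixed window, and roll back automatically if it degrades.

    use std::time::Duration;

    // Hypothetical sketch of "deploy, watch the dashboard, roll back on a spike".
    fn current_error_rate() -> f64 {
        0.31 // stand-in for a real metrics query
    }

    fn deploy(version: &str) {
        println!("deploying {version}");
    }

    fn rollback(version: &str) {
        println!("rolling back {version}");
    }

    fn deploy_with_guardrail(version: &str, baseline: f64, watch: Duration, interval: Duration) {
        deploy(version);
        let checks = (watch.as_secs() / interval.as_secs()).max(1);
        for _ in 0..checks {
            // A real system would sleep for `interval` between checks.
            if current_error_rate() > baseline * 2.0 {
                rollback(version);
                return;
            }
        }
        println!("{version} looks healthy after {:?}", watch);
    }

    fn main() {
        // Hypothetical version name; 15-30 minute watch window as described above.
        deploy_with_guardrail("config-v42", 0.001, Duration::from_secs(15 * 60), Duration::from_secs(30));
    }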

theideaofcoffee•21m ago
Same; my time at an F100 ecommerce retailer showed me the same. Every change-control-board justification needed an explicit back-out/restoration plan with exact steps to be taken, what was being monitored to ensure the plan was being held to, contacts for the groups anticipated to be affected, and emergency numbers/rooms for quick conferences if something did in fact happen.

The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).

prdonahue•9m ago
And you moved at a glacial pace compared to Cloudflare. There are tradeoffs.
markus_zhang•7m ago
My guess is that CF has so many external customers that they need to move fast and try not to break things. My hunch is that their culture always favors moving fast. As long as they are not breaking too many things, customers won't leave them.
paradite•5m ago
[delayed]
hrimfaxi•48m ago
Having their changes fully propagate within 1 minute is pretty fantastic.
chatmasta•37m ago
The coolest part of Cloudflare’s architecture is that every server is the same… which presumably makes deployment a straightforward task.
denysvitali•34m ago
This is most likely a strong requisite for such a big-scale deployment of DDoS protection and detection - which explains their architectural choices (ClickHouse & co) and the need for super-low-latency config changes.

Since attackers might rotate IPs more frequently than once per minute, this effectively means that the whole fleet of servers has to be able to react quickly to decisions made centrally.

alwaysroot•42m ago
The results of vibe coded deployments are starting to show.
dreamcompiler•41m ago
"Honey we can't go on that vacation after all. In fact we can't ever take a vacation period."

"Why?"

"I've just been transferred to the Cloudflare outage explanation department."

rvz•40m ago
> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

Doesn't Cloudflare rigorously test their changes before deployment to make sure that this does not happen again? This had better not be used to cover for the fact that they are using AI to fix issues like this one.

There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all; I expected Cloudflare to learn from the previous outage very quickly.

But this is quite a pattern; we might need to consider putting their unreliability next to GitHub's (which goes down every week).

rany_•37m ago
> As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?

boxed•33m ago
I think the buffer size is the limit on what they check for malicious data, so the old 128k limit would mean it would be trivial to circumvent by just putting 128k of OK data first and the exploit after.
redslazer•32m ago
If the request data is larger than the limit it doesn’t get processed by the Cloudflare system. By increasing buffer size they process (and therefore protect) more requests.
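
A minimal sketch of that mechanism, assuming the WAF only scans the buffered prefix of the body (hypothetical code, not Cloudflare's): anything past the buffer limit is never inspected, so raising the limit widens what gets checked.

    // Hypothetical sketch: the proxy buffers at most `buffer_limit` bytes of
    // the request body for inspection; bytes beyond that are not scanned.
    fn contains_suspicious_pattern(prefix: &[u8]) -> bool {
        // Stand-in for real WAF rules.
        prefix.windows(7).any(|w| w == b"exploit")
    }

    fn waf_flags_body(body: &[u8], buffer_limit: usize) -> bool {
        let inspected = &body[..body.len().min(buffer_limit)];
        contains_suspicious_pattern(inspected)
    }

    fn main() {
        // 128 KiB of harmless padding followed by the payload.
        let mut body = vec![b'a'; 128 * 1024];
        body.extend_from_slice(b"exploit");

        assert!(!waf_flags_body(&body, 128 * 1024)); // old limit: payload slips past
        assert!(waf_flags_body(&body, 1024 * 1024)); // larger limit: payload is inspected
    }
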
_pdp_•37m ago
So no static compiler checks and apparently no fuzzers used to ensure these rules work as intended?
rachr•20m ago
Time for Cloudflare to start using the BOFH excuse generator. https://bofh.d00t.org/
iLoveOncall•16m ago
The most surprising thing from this article is that Cloudflare handles only around 85M TPS.
liampulles•9m ago
The lesson presented by the last few big outages is that entropy is, in fact, inescapable. The comprehensibility of a system cannot keep up with its growing and aging complexity forever. The rate of unknown unknowns will increase.

The good news is that a more decentralized internet with human brain scoped components is better for innovation, progress, and freedom anyway.

egorfine•8m ago
> provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis.

I have mixed feelings about this.

On one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me or not. Today it's protection, tomorrow it's censorship.

At the same time, this is exactly what Cloudflare is good for - protecting sites from malicious requests.

jgalt212•6m ago
I do kind of like how they are blaming React for this.
