
Shingles vaccine could slow dementia

https://medicalxpress.com/news/2025-12-dementia-shingles-vaccine-disease.html
1•gmays•1m ago•0 comments

Patterns for Defensive Programming in Rust

https://corrode.dev/blog/defensive-programming/
2•PaulHoule•1m ago•0 comments

Show HN: HTTP-network-relay: Tunnel TCP over HTTP using WebSockets

https://github.com/Thymis-io/http-network-relay
1•elikoga•1m ago•0 comments

Show HN: 7× faster Iceberg ingestion, how we redesigned OLake's writer

https://olake.io/blog/how-olake-becomes-7x-faster/
1•rohankhameshra•2m ago•0 comments

Qwen3-TTS Update: 49 Timbres and 10 Languages and 9 Dialects

https://qwen.ai/blog?id=qwen3-tts-1128
2•pretext•2m ago•0 comments

hls4ml: A Flexible, OSS Platform for ML Acceleration on Reconfigurable Hardware

https://arxiv.org/abs/2512.01463
2•matt_d•2m ago•0 comments

Z Image

https://z-image.net
1•122506•4m ago•1 comments

YouTube deletes musician's catalog over anti-zionist content?

https://davidrovics.substack.com/p/youtube-aka-the-biggest-platform
2•feraldidactic•4m ago•1 comments

Mar-a-Lago face

https://en.wikipedia.org/wiki/Mar-a-Lago_face
1•hashim•4m ago•0 comments

Getting the graphical desktop going – step by step instructions

https://old.reddit.com/r/androidterminal/comments/1peb3an/getting_the_graphical_desktop_going_ste...
1•sipofwater•5m ago•0 comments

DSPy Parallel Chunk Streaming

https://www.elicited.blog/posts/dspy-parallel-chunk-streaming/
1•justanotheratom•6m ago•0 comments

Hacking Discoveries? Context Engineered Atomics Theory [pdf]

https://github.com/Open-Hermios/Context-Engineered-Atomics-Theory/blob/main/CDE.pdf
1•dowingard•9m ago•1 comments

Revumatic – AI Growth Loop for SMBs Tired of Yelp, Google Ads, and Groupon

https://revumatic.com
1•adrianpaunc•11m ago•1 comments

A Burp-Like HTTP Repeater Inside Chrome DevTools, Supercharged with AI

https://twitter.com/BourAbdelhadi/status/1992622964077179229
2•qwertyX•12m ago•0 comments

Chicago Tribune sues Perplexity AI for copyright infringement

https://techxplore.com/news/2025-12-chicago-tribune-sues-perplexity-ai.html
2•bikenaga•13m ago•0 comments

Show HN: SideSpark – A Local, Private AI Note Taker for macOS

https://sidespark.app/
1•raj_khare•15m ago•0 comments

OpenQuestCapture – An Open Source Meta Quest 3D Gaussian Splat Capture Pipeline

https://github.com/samuelm2/OpenQuestCapture
2•samuelm2•16m ago•1 comments

Agentic Property Extraction: Simple yet Powerful

https://www.aryn.ai/post/announcing-agentic-property-extraction-extracting-structured-data-fields...
1•mehulashah•17m ago•0 comments

Zellij: A terminal workspace with batteries included

https://zellij.dev
1•ndr•17m ago•0 comments

Learnings from managing the infrastructure of a company alone

https://www.carneiro.pt/blog/how-i-manage-skeeled-infrastructure-alone/
1•mig4ng•17m ago•0 comments

The Resonant Computing Manifesto

https://resonantcomputing.org/
2•simonw•19m ago•0 comments

Valve Has Been Funding the FEX Project [video]

https://www.youtube.com/watch?v=DfyfU2Sfhgo
1•stevefan1999•20m ago•2 comments

Gemini 3 Pro: the frontier of vision AI

https://blog.google/technology/developers/gemini-3-pro-vision/
2•xnx•21m ago•0 comments

Ask HN: Anyone considered forming a General Computing political organization?

2•trinsic2•21m ago•0 comments

Daily steps are a predictor of, but perhaps not a risk factor for Parkinson's

https://www.nature.com/articles/s41531-025-01214-6
2•bikenaga•22m ago•1 comments

AI coding crossed the speed threshold

https://betweentheprompts.com/speed-threshold/
2•scastiel•22m ago•0 comments

Show HN: Togewire – Share your Spotify listening session via your own website

https://github.com/kurodaze/togewire
1•wvrlow•22m ago•0 comments

Klarity AI turns speech into smart searchable notes

https://play.google.com/store/apps/details?id=tech.vkode.klarity&hl=en_US
1•shrida_kl•22m ago•1 comments

POC for CVE-2025-55182 (react4shell)

https://gist.github.com/maple3142/48bc9393f45e068cf8c90ab865c0f5f3
1•jimmyl02•22m ago•0 comments

The Paralyzed Programmer: What Spinal Cord Injury Taught Me About Debugging

https://www.vectorjoy.dev/blog/paralyzed-programmer-debugging-spinal-injury
1•ChrisHardie•22m ago•0 comments

Cloudflare outage on December 5, 2025

https://blog.cloudflare.com/5-december-2025-outage/
125•meetpateltech•1h ago

Comments

jpeter•56m ago
Unwrap() strikes again
throwawaymaths•55m ago
this time in lua. cloudflare can't catch a break
RoyTyrell•50m ago
Or they're not thoroughly testing changes before pushing them out. As I've seen some others say, CloudFlare at this point should be considered critical infrastructure. Maybe not like power but dang close.
gcau•50m ago
The 'rewrite it in lua' crowd are oddly silent now.
barbazoo•40m ago
How do you know?
rvz•26m ago
Time to use boring languages such as Java and Go.
dap•46m ago
I guess you’re being facetious but for those who didn’t click through:

> This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

skywhopper•36m ago
That bit may be true, but the underlying error of a null reference that caused a panic was exactly the same in both incidents.
barbazoo•55m ago
> Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.

Interesting.

flaminHotSpeedo•42m ago
They kinda buried the lede there, 28% failure rate for 100% of customers isn't the same as 100% failure rate for 28% of customers
Scaevolus•51m ago
> Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules

Warning signs like this are how you know that something might be wrong!
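
As an illustration of what "correlating global configuration changes to the errors they trigger" could look like, here is a minimal Rust sketch; the counters, thresholds, and version numbers are invented for the example and say nothing about Cloudflare's actual telemetry:

```rust
// Hypothetical sketch: tag errors with the active config version and
// flag a version whose error rate spikes relative to its predecessor.
use std::collections::HashMap;

#[derive(Default)]
struct ErrorCounters {
    errors_by_version: HashMap<u64, u64>,
    requests_by_version: HashMap<u64, u64>,
}

impl ErrorCounters {
    fn record(&mut self, config_version: u64, is_error: bool) {
        *self.requests_by_version.entry(config_version).or_insert(0) += 1;
        if is_error {
            *self.errors_by_version.entry(config_version).or_insert(0) += 1;
        }
    }

    // Flag a version whose error rate is far worse than the previous one's.
    fn version_looks_bad(&self, new: u64, old: u64) -> bool {
        let rate = |v: u64| {
            let errs = *self.errors_by_version.get(&v).unwrap_or(&0) as f64;
            let reqs = *self.requests_by_version.get(&v).unwrap_or(&1) as f64;
            errs / reqs.max(1.0)
        };
        rate(new) > rate(old) * 10.0 && rate(new) > 0.01
    }
}

fn main() {
    let mut c = ErrorCounters::default();
    for _ in 0..1000 { c.record(41, false); }      // old version: healthy
    for i in 0..1000 { c.record(42, i % 4 == 0); } // new version: ~25% errors
    if c.version_looks_bad(42, 41) {
        eprintln!("config v42 correlates with an error spike; halt propagation / roll back");
    }
}
```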

philipwhiuk•28m ago
> Warning signs like this are how you know that something might be wrong!

Yes; as they explain, it was the rollback triggered by seeing these errors that broke stuff.

kachapopopow•50m ago
why does this seem oddly familiar (fail-closed logic)
dematz•50m ago
>This code expects that, if the ruleset has action=”execute”, the “rule_result.execute” object will exist. However, because the rule had been skipped, the rule_result.execute object did not exist, and Lua returned an error due to attempting to look up a value in a nil value.

what if we broke the internet, just a little bit, to even the score with unwrap (kidding!!)
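
The failure shape quoted above boils down to assuming an optional field is present. A minimal Rust sketch of the same pattern, with hypothetical types that are not Cloudflare's rules module:

```rust
// Minimal sketch of the failure shape, with hypothetical types --
// not Cloudflare's actual rules code.
struct RuleResult {
    action: String,
    // Only populated when the rule actually ran with action == "execute".
    execute: Option<ExecuteResult>,
}

struct ExecuteResult {
    ruleset_id: String,
}

fn handle_rule(result: &RuleResult) {
    // The buggy pattern: assume `execute` exists because the action says so.
    // If the rule was skipped, this panics -- the moral equivalent of
    // indexing into nil in Lua, or .unwrap() on None in Rust.
    if result.action == "execute" {
        let exec = result.execute.as_ref().expect("execute result missing");
        println!("ran ruleset {}", exec.ruleset_id);
    }
}

fn handle_rule_safely(result: &RuleResult) {
    // The defensive version: handle the absent case explicitly.
    match (result.action.as_str(), result.execute.as_ref()) {
        ("execute", Some(exec)) => println!("ran ruleset {}", exec.ruleset_id),
        ("execute", None) => eprintln!("rule skipped, no execute result; continuing"),
        _ => {}
    }
}

fn main() {
    let skipped = RuleResult { action: "execute".into(), execute: None };
    handle_rule_safely(&skipped); // logs and continues
    handle_rule(&skipped);        // panics, mirroring the outage
}
```
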

xnorswap•48m ago
My understanding, paraphrased: "In order to gradually roll out one change, we had to globally push a different configuration change, which broke everything at once".

But a more important takeaway:

> This type of code error is prevented by languages with strong type systems

debugnik•45m ago
Prevented unless they assert the wrong invariant at runtime like they did last time.
jsnell•43m ago
That's a bizarre takeaway for them to suggest, when they had exactly the same kind of bug with Rust like three weeks ago. (In both cases they had code implicitly expecting results to be available. When the results weren't available, they terminated processing of the request with an exception-like mechanism. And then they had the upstream services fail closed, despite the failing requests being to optional sidecars rather than on the critical query path.)
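
A minimal Rust sketch of the fail-open handling being contrasted here, with invented names; the point is that a failing call to an optional sidecar degrades the request rather than terminating it:

```rust
// Hypothetical sketch of fail-open handling for an optional sidecar call.
// Names are illustrative, not Cloudflare's.
#[derive(Debug)]
struct Verdict {
    score: u32,
}

#[derive(Debug)]
enum SidecarError {
    Unavailable,
}

// Stand-in for a call to an optional analysis sidecar.
fn query_optional_sidecar(_body: &[u8]) -> Result<Verdict, SidecarError> {
    Err(SidecarError::Unavailable)
}

fn handle_request(body: &[u8]) -> Result<(), String> {
    // Fail-open: if the optional sidecar errors, log and keep serving.
    match query_optional_sidecar(body) {
        Ok(v) => println!("sidecar verdict: {:?}", v),
        Err(e) => eprintln!("optional sidecar failed ({:?}); continuing without it", e),
    }

    // ... critical-path work continues regardless ...
    Ok(())
}

fn main() {
    // The request still succeeds even though the sidecar is down.
    assert!(handle_request(b"GET / HTTP/1.1").is_ok());
}
```
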
littlestymaar•31m ago
In fairness, the previous bug (with the Rust unwrap) should never have happened: someone explicitly called the panicking function, the review didn't catch it and the CI didn't catch it.

It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)
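
For reference, one way a CI pipeline can mechanically reject explicit panicking calls is Clippy's restriction lints; the snippet below is only illustrative and says nothing about Cloudflare's actual CI setup:

```rust
// Crate-level lints that turn explicit panicking calls into failures when
// run under `cargo clippy -- -D warnings` (illustrative configuration).
#![deny(clippy::unwrap_used)]
#![deny(clippy::expect_used)]

fn lookup(map: &std::collections::HashMap<String, String>, key: &str) -> Option<String> {
    // Clippy now rejects the commented-out variant under the lints above:
    // let value = map.get(key).unwrap().clone();
    map.get(key).cloned()
}

fn main() {
    let map = std::collections::HashMap::new();
    assert!(lookup(&map, "missing").is_none());
}
```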

greatgib•22m ago
The issue also would not have happened if someone had written the right code and tests, and the review or CI had caught it...
skywhopper•39m ago
This is the exact same type of error that happened in their Rust code last time. Strong type systems don’t protect you from lazy programming.
flaminHotSpeedo•46m ago
What's the culture like at Cloudflare re: ops/deployment safety?

They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?

Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Pure speculation, but to me it sounds like there's more to the story. This sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.

deadbabe•44m ago
As usual, Cloudflare is the man in the arena.
samrus•36m ago
There are other men in the arena who aren't tripping over their own feet
usrnm•30m ago
Like who? Which large tech company doesn't have outages?
k8sToGo•25m ago
It's not about outages. It's about the why. Hardware can fail. Bugs can happen. But to continue a rollout despite warning signs and without understanding the cause and impact is on another level. Especially if it is related to the same problem as last time.
k__•24m ago
"tripping on their own feet" == "not rolling back"
this_user•31m ago
The question is perhaps what the shape and status of their tech stack is. Obviously, they are running at massive scale, and they have grown extremely aggressively over the years. What's more, especially over the last few years, they have been adding new product after new product. How much tech debt have they accumulated with that "move fast" approach that is now starting to rear its head?
nine_k•29m ago
> more to the story

From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)

lukeasrodgers•27m ago
Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.
rvz•21m ago
> Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Also, there seems to have been insufficient testing before deployment, with very junior-level mistakes.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".

I guess those at Cloudflare are not learning anything from the previous disaster.

dkyc•19m ago
One thing to keep in mind when judging what's 'appropriate' is that Cloudflare was effectively responding to an ongoing security incident outside of their control (the React Server RCE vulnerability). Part of Cloudflare's value proposition is being quick to react to such threats. That changes the equation a bit: any hour you wait longer to deploy, your customers are actively getting hacked through a known high-severity vulnerability.

In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.

That isn't to say it didn't work out badly this time, just that the calculation is a bit different.

liampulles•18m ago
Rollback is a reliable strategy when the rollback process is well understood. If a rollback process is not well known and well practiced, then it is a risk in itself.

I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.

otterley•12m ago
From the post:

“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”

NoSalt•1m ago
Ooh ... I want to be on a cowboy decision making team!!!
fidotron•45m ago
> This change was being rolled out using our gradual deployment system, and, as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules. As this was an internal tool, and the fix being rolled out was a security improvement, we decided to disable the tool for the time being as it was not required to serve or protect customer traffic.

Come on.

This PM raises more questions than it answers, such as why exactly China would have been immune.

skywhopper•38m ago
China is probably a completely separate partition of their network.
fidotron•37m ago
One that doesn't get proactive security rollouts, it would seem.
skywhopper•27m ago
I assume it was next on the checklist, or assigned to a different ops team.
miyuru•45m ago
Whats going on with cloudflare's software team?

I have seen similar bugs in cloudflare API recently as well.

There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.

antiloper•43m ago
Make faster websites:

> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why is the Next.js limit 1 MB? It's not enough for uploading user generated content (photographs, scanned invoices), but a 1 MB request body for even multiple JSON API calls is ridiculous. These frameworks need to at least provide some pushback against unoptimized development, even if it's just a lower default request body limit. Otherwise all web applications will become as slow as the MS Office suite or Reddit.

ramon156•27m ago
The update was to raise it to 3MB (10MB on paid plans)
AmazingTurtle•25m ago
a) They serialize tons of data into requests. b) Headers, mostly cookies. They are a thing, and they are abused all over the world by newbies.
websiteapi•43m ago
I wonder why they cannot partially roll out. Like the other outage, they had to do a global rollout.
usrnm•40m ago
I really don't see how it would've helped. In Go or Rust you'd just get a panic, which is in no way different.
denysvitali•37m ago
The article mentions that this Lua-based proxy is the old generation one, which is going to be replaced by the Rust based one (FL2) and that didn't fail on this scenario.

So, if anything, their efforts towards a typed language were justified. They just didn't manage to migrate everything in time before this incident - which is ironically a good thing, since this incident was caused mostly by a rushed change in response to an actively exploited vulnerability.

websiteapi•27m ago
yes, but as the article states why are they doing global fast rollouts?
denysvitali•17m ago
I think (would love to be corrected) that this is the nature of their service. They probably push multiple config changes per minute to mitigate DDOS attacks. For sure the proxies have a local list of IPs that, for a period of time, are blacklisted.

For DDOS protection you can't really rely on multiple-hours rollouts.

snafeau•42m ago
A lot of these kinds of bugs feel like they could be caught by a simple review bot like Greptile... I wonder if Cloudflare uses an equivalent tool internally?
nkmnz•20m ago
What makes greptile a better choice compared to claude code or codex, in your opinion?
denysvitali•42m ago
Ironically, this time around the issue was in the proxy they're going to phase out (and replace with the Rust one).

I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.

HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.

At the same time, I can't think of a company more transparent than Cloudflare when it comes to these kinds of things. I also understand the urgency behind this change: Cloudflare acted (too) fast to mitigate the React vulnerability and this is the result.

Say what you want, but I'd prefer to trust Cloudflare, who admits and acts upon their fuckups, rather than a provider that covers them up or downplays them like some other major cloud providers do.

@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems

trashburger•35m ago
I would very much like for him not to ignore the negativity, given that, you know, they are breaking the entire fucking Internet every time something like this happens.
denysvitali•30m ago
This is the kind of comment I wish he would ignore.

You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.

I don't think they do these things on purpose. Of course given their good market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDOS protection system (or WAF) you don't want or have the luxury to wait for days until your rule is applied.

beanjuiceII•9m ago
I hope he doesn't ignore it; the internet has been forgiving enough toward Cloudflare's string of failures. It's getting pretty old and creates a ton of chaos. I work with life-saving devices; being impacted in any way in data monitoring has a huge impact in many ways. "Sorry ma'am, we can't give you your child's T1D readings on your follow app because our provider decided to break everything in pursuit of some React bug" has a great ring to it.
fidotron•30m ago
> HugOps

This childish nonsense needs to end.

Ops are heavily rewarded because they're supposed to be responsible. If they're not then the associated rewards for it need to stop as well.

denysvitali•23m ago
I have never seen an Ops team being rewarded for avoiding incidents (by focusing on tech debt reduction); instead they get the opposite - blamed when things go wrong.

I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.

fidotron•20m ago
> I have never seen an Ops team being rewarded for avoiding incidents

That's why their salaries are so high.

denysvitali•13m ago
Depending on the tech debt, the ops team might just be in "survival mode" and not have the time to fix every single issue.

In this particular case, they seem to be doing two things:

- Phasing out the old proxy (Lua based), which is replaced by FL2 (Rust based, the one that caused the previous incident)
- Reacting to an actively exploited vulnerability in React by deploying WAF rules

and they're doing them in a relatively careful way (test rules) to avoid fuckups, which caused this unknown state, which triggered the issue.

fidotron•8m ago
They deliberately ignored an internal tool that started erroring out at the given deployment and rolled it out anyway without further investigation.

That's not deserving of sympathy.

da_grift_shift•28m ago
[ Removed by Reddit ]
denysvitali•22m ago
Wow. The three comments below parent really show how toxic HN has become.
beanjuiceII•8m ago
being angry about something doesn't make it toxic, people have a right to be upset
gkoz•41m ago
I sometimes feel we'd be better off without all the paternalistic kitchen-sink features. The solid, properly engineered features used intentionally aren't causing these outages.
ilkkao•32m ago
Agreed, I don't really like Cloudflare trying to magically fix every web exploit there is in frameworks my site has never used.
da_grift_shift•41m ago
It's not an outage, it's an Availability Incident™.

https://blog.cloudflare.com/5-december-2025-outage/#what-abo...

lapcat•40m ago
> This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise the straightforward error would have been detected immediately. And their implied solution seems to be not testing their code when written, or even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.

paradite•39m ago
The deployment pattern from Cloudflare looks insane to me.

I've worked at one of the top fintech firms; whenever we do a config change or deployment, we are supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.

The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.

I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.

theideaofcoffee•12m ago
Same; my time at an F100 e-commerce retailer showed me the same thing. Every change control board justification needed an explicit back-out/restoration plan with exact steps to be taken, what was being monitored to ensure the plan was being held to, contacts for the groups anticipated to be affected, and emergency numbers/rooms for quick conferences if something did happen.

The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).

hrimfaxi•39m ago
Having their changes fully propagate within 1 minute is pretty fantastic.
chatmasta•28m ago
The coolest part of Cloudflare’s architecture is that every server is the same… which presumably makes deployment a straightforward task.
denysvitali•25m ago
This is most likely a strong requirement for such a big-scale deployment of DDoS protection and detection - which explains their architectural choices (ClickHouse & co) and the need for super-low-latency config changes.

Since attackers might rotate IPs more frequently than once per minute, this effectively means the whole fleet of servers must be able to react quickly to decisions made centrally.
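
A toy Rust sketch of the kind of TTL'd local blocklist being described; names, numbers, and cleanup strategy are invented for illustration and do not reflect Cloudflare's data path:

```rust
// Toy TTL-based IP blocklist of the kind a proxy might keep locally.
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

struct Blocklist {
    entries: HashMap<IpAddr, Instant>, // IP -> time the block expires
    ttl: Duration,
}

impl Blocklist {
    fn new(ttl: Duration) -> Self {
        Blocklist { entries: HashMap::new(), ttl }
    }

    // Called when central DDoS detection says "block this IP now".
    fn block(&mut self, ip: IpAddr) {
        self.entries.insert(ip, Instant::now() + self.ttl);
    }

    // Checked on the hot path for every incoming connection.
    fn is_blocked(&mut self, ip: IpAddr) -> bool {
        let now = Instant::now();
        match self.entries.get(&ip).copied() {
            Some(expiry) if expiry > now => true,
            Some(_) => {
                self.entries.remove(&ip); // expired: lazily clean up
                false
            }
            None => false,
        }
    }
}

fn main() {
    let mut bl = Blocklist::new(Duration::from_secs(60));
    let attacker: IpAddr = "203.0.113.7".parse().unwrap();
    bl.block(attacker);
    assert!(bl.is_blocked(attacker));
}
```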

alwaysroot•33m ago
The results of vibe coded deployments are starting to show.
dreamcompiler•32m ago
"Honey we can't go on that vacation after all. In fact we can't ever take a vacation period."

"Why?"

"I've just been transferred to the Cloudflare outage explanation department."

rvz•31m ago
> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

Doesn't Cloudflare rigorously test their changes before deployment to make sure that this does not happen again? This better not have been used to cover for the fact that they are using AI to fix issues like this one.

There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all, and I expected Cloudflare to learn from the previous outage very quickly.

But this is becoming quite a pattern; we might need to consider putting their unreliability next to GitHub's (which goes down every week).

rany_•28m ago
> As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?

boxed•24m ago
I think the buffer size is the limit on what they check for malicious data, so the old 128k limit would mean it would be trivial to circumvent by just sending 128k of OK data and then putting the exploit after.
redslazer•23m ago
If the request data is larger than the limit it doesn’t get processed by the Cloudflare system. By increasing buffer size they process (and therefore protect) more requests.
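
A toy Rust sketch of the inspection-limit reading given above; the pattern, sizes, and scanning logic are invented for illustration and are not Cloudflare's WAF:

```rust
// Illustrative sketch of why an inspection buffer limit matters.
fn body_looks_malicious(body: &[u8], inspect_limit: usize) -> bool {
    // Only the first `inspect_limit` bytes are buffered and scanned.
    let scanned = &body[..body.len().min(inspect_limit)];
    scanned
        .windows(b"EVIL_PAYLOAD".len())
        .any(|w| w == b"EVIL_PAYLOAD")
}

fn main() {
    let old_limit = 128 * 1024;  // 128 KB
    let new_limit = 1024 * 1024; // 1 MB

    // Attacker pads the request with 128 KB of harmless bytes, then the payload.
    let mut body = vec![b'A'; 128 * 1024];
    body.extend_from_slice(b"EVIL_PAYLOAD");

    assert!(!body_looks_malicious(&body, old_limit)); // slips past the old limit
    assert!(body_looks_malicious(&body, new_limit));  // caught once the limit is raised
}
```
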
_pdp_•28m ago
So no static compiler checks and apparently no fuzzers used to ensure these rules work as intended?
rachr•11m ago
Time for Cloudflare to start using the BOFH excuse generator. https://bofh.d00t.org/
iLoveOncall•7m ago
The most surprising thing from this article is that Cloudflare handles only around 85M TPS.