No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.
> The standard procedures the managers tried first failed to bring the network back up to speed and for nine hours, while engineers raced to stabilize the network, almost 50% of the calls placed through AT&T failed to go through.
> Until 11:30pm, when network loads were low enough to allow the system to stabilize, AT&T alone lost more than $60 million in unconnected calls.
> Still unknown is the amount of business lost by airline reservations systems, hotels, rental car agencies and other businesses that relied on the telephone network.
https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...
AWS very likely has Cloudflare beat in commerce responsibility. Amazon is equal to ~2.3% of US GDP by itself.
In the pre-digital era, the East India Company dwarfed every other company by considerable margins on almost any metric: commerce controlled, global shipping, communication traffic, private army size, share of GDP, share of workforce employed.
Large consolidated organizations have been the default throughout history, like, say, Bell Labs, or Standard Oil before that, and so on; only for brief periods have we enjoyed the benefits of true capitalism.
[1] Although I suspect either AWS or MS/Azure recent down-times in the last couple of years are likely higher
Lots of things have the sky in common. Maybe comet-induced ice ages...
- Their database permissions changed unexpectedly (??)
- This caused a 'feature file' to be changed in an unusual way (?!)
- Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
- Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
- They hit an internal application memory limit and that just... crashed the app
- The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
- After fixing it, they were vulnerable to a thundering herd problem
- Customers who were not using bot rules were not affected; CloudFlare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots
In terms of preventing this from a software engineering perspective, they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits and didn't program in either a test for whether an input would hit a limit, or some kind of alarm to notify the engineers of the source of the problem.
From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.
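To make the "validate the results" point concrete, here is a minimal sketch of the kind of check I mean, run on the generated feature list before anything ships. Everything in it is hypothetical: `MAX_FEATURES`, `FeatureFileError` and `validate_feature_names` are names I made up, and the limit of 200 is only an illustrative value. I have no idea what the real pipeline looks like.

```rust
use std::collections::HashSet;

/// Hypothetical hard limit the consuming application can handle.
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
enum FeatureFileError {
    TooManyFeatures { got: usize, limit: usize },
    DuplicateFeature(String),
}

/// Reject a generated feature list that is oversized or contains duplicates,
/// instead of letting the consumer discover the problem by crashing.
fn validate_feature_names(names: &[&str]) -> Result<(), FeatureFileError> {
    if names.len() > MAX_FEATURES {
        return Err(FeatureFileError::TooManyFeatures {
            got: names.len(),
            limit: MAX_FEATURES,
        });
    }
    let mut seen = HashSet::new();
    for name in names {
        // `insert` returns false when the value was already present.
        if !seen.insert(*name) {
            return Err(FeatureFileError::DuplicateFeature((*name).to_string()));
        }
    }
    Ok(())
}
```

A failed validation is also a natural place to raise the alarm that points engineers at the actual source of the problem.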
This is literally the CrowdStrike bug, in a CDN. This is the most basic, elementary, day 0 test you could possibly invent. Forget the other things they fucked up. Their app just crashes with a config file, and nobody evaluates it?! Not every bug is preventable, but an egregious lack of testing is preventable.
This is what a software building code (like the electrical code's UL listings that prevent your house from burning down from untested electrical components) is intended to prevent. No critical infrastructure should be legal without testing, period.
> The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:
SELECT
  name,
  type
FROM system.columns
WHERE
  table = 'http_requests_features'
ORDER BY name;
Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.
And here is the query they used ** (OK, so it's not exactly):
SELECT * from feature JOIN permissions on feature.feature_type_id = permissions.feature_type_id
Someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.
** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using, much less the queries (but I do have an idea).
more edits: OK apparently it's described later in the post as a query against clickhouse's table metadata table, and because users were granted access to an additional database that was actually the backing store to the one they normally worked with, some row level security type of thing doubled up the rows. Not sure why querying system.columns is part of a production level query though, seems overly dynamic.
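The doubling is easy to reproduce in miniature. This is a toy model, not ClickHouse: `ColumnRow` loosely mimics a row of `system.columns`, the `default` and `r0` database names come from the quoted post, and the column names `feature_a` / `feature_b` are invented for illustration.

```rust
/// One row of column metadata, loosely modelled on ClickHouse's system.columns.
#[derive(Debug, Clone, PartialEq)]
struct ColumnRow {
    database: &'static str,
    table: &'static str,
    name: &'static str,
}

/// Hypothetical metadata visible to the user once the extra grant lands:
/// the same table and columns now show up under both databases.
fn visible_rows_after_grant() -> Vec<ColumnRow> {
    let mut rows = Vec::new();
    for database in ["default", "r0"] {
        for name in ["feature_a", "feature_b"] {
            rows.push(ColumnRow { database, table: "http_requests_features", name });
        }
    }
    rows
}

/// Mirrors the quoted query: filters on the table name only.
fn column_names(rows: &[ColumnRow], table: &str) -> Vec<&'static str> {
    let mut names: Vec<&'static str> =
        rows.iter().filter(|r| r.table == table).map(|r| r.name).collect();
    names.sort();
    names
}

/// Same lookup with an explicit database filter added.
fn column_names_in_db(rows: &[ColumnRow], database: &str, table: &str) -> Vec<&'static str> {
    let mut names: Vec<&'static str> = rows
        .iter()
        .filter(|r| r.database == database && r.table == table)
        .map(|r| r.name)
        .collect();
    names.sort();
    names
}
```

With the rows above, the table-only lookup returns every column twice, while the database-qualified one returns each column once.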
At any other large-ish company, there would be layers of "stakeholders" that would slow this process down. They would almost never allow code to be published.
0/10, get it right the first time, folks. (/s)
Damn corporate karma farming is ruthless, only a couple minute SLA before taking ownership of the karma. I guess I'm not built for this big business SLA.
Thanks for the insight.
Fantastic for recruiting, too.
I'd consider applying based on this alone
I'm so jealous. I've written postmortems for major incidents at a previous job: a few hours to write, a week of bikeshedding by marketing and communication and tech writers and ... over any single detail in my writing. Sanitizing (hide a part), simplifying (our customers are too dumb to understand), etc, so that the final writing was "true" in the sense that it "was not false", but definitely not what I would call "true and accurate" as an engineer.
> Spent some time after we got things under control talking to customers. Then went home.
What did sama / Fidji say? ;) Turnstile couldn't have been worth that.
As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.
> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions
Also appreciate the honesty here.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]
> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.
Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)
Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.
https://how.complexsystems.fail/#18
It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like are you tabletopping this scenario, are teams building out runbooks for how to quickly resolve this, what's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly and maybe even speculatively restoring this part of the system to a known good state in an outage".
However, you forgot that the lighting conditions are where only red lights from the klaxons are showing so you really can't differentiate the colors of the wires
- A product depends on frequent configuration updates to defend against attackers.
- A bad data file is pushed into production.
- The system is unable to easily/automatically recover from bad data files.
(The CrowdStrike outages were quite a bit worse though, since it took down the entire computer and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours)
Outages are in a large majority of cases caused by change, either deployments of new versions or configuration changes.
A configuration file should not grow! This looks like a design failure, and I want to understand it.
Or you do have something like this but the specific db permission change in this context only failed in production
"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."
Side thought as we're working on 100% onchain systems (for digital assets security, different goals):
Public chains (e.g. EVMs) can be a tamper‑evident gate that only promotes a new config artifact if (a) a delay or multi‑sig review has elapsed, and (b) a succinct proof shows the artifact satisfies safety invariants like ≤200 features, deduped, schema X, etc.
That could have blocked propagation of the oversized file long before it reached the edge :)
For London customers this made the impact more severe temporarily.
The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):
> example.com needs to review the security of your connection before proceeding.
It bothers me how this bald-faced lie of a wording has persisted.
(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
Edit: Similar to Crowdstrike, the bot detector should have fallen-back to its last-known-good signature database after panicking, instead of just continuing to panic.
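Roughly what I have in mind by falling back to last-known-good, as a sketch only. `BotDetector`, `FeatureConfig`, `parse_feature_file`, the newline-separated file format and the limit of 200 are all assumptions of mine, not anything from the write-up.

```rust
/// Hypothetical parsed feature configuration.
#[derive(Debug, Clone, PartialEq)]
struct FeatureConfig {
    features: Vec<String>,
}

/// Hypothetical limit on the number of features the detector accepts.
const MAX_FEATURES: usize = 200;

/// Parse a newline-separated feature file (assumed format), rejecting oversized input.
fn parse_feature_file(raw: &str) -> Result<FeatureConfig, String> {
    let features: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .map(String::from)
        .collect();
    if features.len() > MAX_FEATURES {
        return Err(format!("{} features exceeds limit of {}", features.len(), MAX_FEATURES));
    }
    Ok(FeatureConfig { features })
}

struct BotDetector {
    /// Last configuration that loaded successfully.
    active: FeatureConfig,
    /// Count of rejected updates, something an alert could key on.
    rejected_updates: u64,
}

impl BotDetector {
    fn new(initial: FeatureConfig) -> Self {
        BotDetector { active: initial, rejected_updates: 0 }
    }

    /// Try to load a new file. On failure, keep serving with the
    /// last-known-good configuration and record the rejection.
    fn reload(&mut self, raw: &str) -> bool {
        match parse_feature_file(raw) {
            Ok(cfg) => {
                self.active = cfg;
                true
            }
            Err(_) => {
                self.rejected_updates += 1;
                false
            }
        }
    }
}
```

The trade-off is that a stale rule set keeps running until someone notices, so the rejection counter needs to be wired to an alarm rather than silently incremented.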
If you want to say that systems that light up hundreds of customers, or propagate new reactive bot rules, or notify a routing system that a service has gone down are intrinsically too complicated, that's one thing. By all means: "don't build modern systems! computers are garbage!". I have that sticker on my laptop already.
But like: handling these problems is basically the premise of large-scale cloud services. You can't just define it away.
I read the parent poster as broadly suggesting configuration updates should have fitness tests applied and be deployed to minimize the blast radius when an update causes a malfunction. That makes intuitive sense to me. It seems like software should be subject to health checks after configuration updates, even if it's just to stop a deployment before it's widely distributed (let alone rolling-back to last-working configurations, etc).
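For concreteness, the sort of thing I'm picturing is sketched below: push to a small stage, health-check, and halt before touching the next stage. The names (`staged_rollout`, `RolloutOutcome`) and the stage layout are hypothetical, and whether this shape fits state that has to propagate within seconds is a separate question.

```rust
/// Outcome of a staged rollout.
#[derive(Debug, PartialEq)]
enum RolloutOutcome {
    Completed,
    /// Halted at the given stage index; later stages were never touched.
    HaltedAt(usize),
}

/// Push a change to stages in order (e.g. canary, then a slice, then everything).
/// `deploy_and_check` deploys to one host and reports whether it is still healthy.
fn staged_rollout<F>(stages: &[Vec<&str>], mut deploy_and_check: F) -> RolloutOutcome
where
    F: FnMut(&str) -> bool,
{
    for (i, stage) in stages.iter().enumerate() {
        for host in stage {
            if !deploy_and_check(*host) {
                // Stop immediately: the blast radius is whatever was deployed so far.
                return RolloutOutcome::HaltedAt(i);
            }
        }
    }
    RolloutOutcome::Completed
}
```

If the canary crashes on the new file, only that one host has seen it, and rolling back is a one-machine problem.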
Am I being thick-headed in thinking defensive strategies like those are a good idea? I'm reading your reply as arguing against those types of strategies. I'm also not understanding what you're suggesting as an alternative.
Again, I'm sorry to belabor this. I've replied once, deleted it, tried writing this a couple more times and given up, and now I'm finally pulling the trigger. It's really eating at me. I feel as though I must be deep down the Dunning-Kruger rabbit hole and really thinking "outside my lane".
What's irritating to me are the claims that there's nothing distinguishing real time control plane state changes and config files. Most of us have an intuition for how they'd do a careful rollout of a config file change. That intuition doesn't hold for control plane state; it's like saying, for instance, that OSPF should have canaries and staged rollouts every time a link state changes.
I'm not saying there aren't things you can do to make real-time control plane state propagation safer, or that Cloudflare did all those things (I have no idea, I'm not familiar with their system at all, which is another thing irritating me about this thread --- the confident diagnostics and recommendations). I'm saying that people trying to do the "this is just like CrowdStrike" thing are telling on themselves.
I took the "this sounds like Crowdstrike" tack for two reasons. The write-up characterized this update as an every five minutes process. The update, being a file of rules, felt analogous in format to the Crowdstrike signature database.
I appreciate the OSPF analogy. I recognize there are portions of these large systems that operate more like a routing protocol (with updates being unpredictable in velocity or time of occurrence). The write-up didn't make this seem like one of those. This seemed a lot more like a traditional daemon process receiving regular configuration updates and crashing on a bad configuration file.
Most of what I'm saying is:
(1) Looking at individual point failures and saying "if you'd just fixed that you wouldn't have had an incident" is counterproductive; like Mr. Oogie-Boogie, every big distributed system is made of bugs. In fact, that's true of literally every complex system, which is part of the subtext behind Cook[1].
(2) I think people are much too quick to key in on the word "config" and just assume that it's morally indifferentiable from source code, which is rarely true in large systems like this (might it have been here? I don't know.) So my eyes twitch like Louise Belcher's when people say "config? you should have had a staged rollout process!" Depends on what you're calling "config"!
It has somewhat regularly saved us from disaster in the past.
For something so critical, why aren't you using lints to identify and ideally deny panic inducing code. This is one of the biggest strengths of using Rust in the first place for this problem domain.
I will give you one example though: various .NET repositories (runtime, aspnetcore, orleans).
Wonder why these old grey beards chose to go with that.
Afaik, Go and Java are the only languages that make you pause and explicitly deal with these exceptions.
unwrap() implicitly panicked, right?
I suppose another way to think about it is that Result<T, E> is somewhat analogous to Java's checked exceptions - you can't get the T out unless you say what to do in the case of the E/checked exception. unwrap() in this context is equivalent to wrapping the checked exception in a RuntimeException and throwing that.
Surely an unwrap_or_default() would have been a much better fit: if fetching features fails, continue processing with an empty set of rules instead of stopping the world.
This reads to me more like the error type returned by append with names is not (ErrorFlags, i32) and wasn't trivially convertible into that type so someone left an unwrap in place on an "I'll fix it later" basis, but who knows.
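To spell out the options being compared, here is a small sketch with a made-up loader. `Features`, `load_features` and the three wrappers are hypothetical and have nothing to do with Cloudflare's actual code; the first wrapper approximates what `unwrap()` does, the second is `unwrap_or_default()`, and the third passes the error to the caller with `?`.

```rust
#[derive(Debug, Default, PartialEq)]
struct Features(Vec<String>);

/// Hypothetical loader: fails when there are more entries than `limit`.
fn load_features(entries: &[&str], limit: usize) -> Result<Features, String> {
    if entries.len() > limit {
        return Err(format!("{} entries exceeds limit {}", entries.len(), limit));
    }
    Ok(Features(entries.iter().map(|e| e.to_string()).collect()))
}

/// Approximately what `.unwrap()` expands to: take the value or panic.
fn features_or_panic(r: Result<Features, String>) -> Features {
    match r {
        Ok(f) => f,
        Err(e) => panic!("called `Result::unwrap()` on an `Err` value: {:?}", e),
    }
}

/// `.unwrap_or_default()`: swallow the error and continue with an empty set.
fn features_or_empty(r: Result<Features, String>) -> Features {
    r.unwrap_or_default()
}

/// `?`: hand the error to the caller, who has to decide what to do with it.
fn feature_count(entries: &[&str], limit: usize) -> Result<usize, String> {
    let f = load_features(entries, limit)?;
    Ok(f.0.len())
}
```

Whether an empty rule set is actually a safe default for a bot scorer is a product decision that the type system can't make for you.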
I'm pretty surprised that Cloudflare let an unwrap into prod that caused their worst outage in 6 years.
1. https://doc.rust-lang.org/std/option/enum.Option.html#recomm...
I don't know enough about Cloudflare's situation to confidently recommend anything (and I certainly don't know enough to dunk on them, unlike the many Rust experts of this thread) but if I was in their shoes, I'd be a lot less interested in eradicating `unwrap` everywhere and more in making sure that an errant `unwrap` wouldn't produce stable failure modes.
But like, the `unwrap` thing is all programmers here have to latch on to, and there's a psychological self-soothing instinct we all have to seize onto some root cause with a clear fix (or, better yet for dopaminergia, an opportunity to dunk).
A thing I really feel in threads like this is that I'd instinctively have avoided including the detail about an `unwrap` call --- I'd have worded that part more ambiguously --- knowing (because I have a pathological affinity for this community) that this is exactly how HN would react. Maybe ironically, Prince's writing is a little better for not having dodged that bullet.
But I do feel strongly that the expect pattern is a highly useful control and that naked unwraps almost always indicate a failure to reason about the reliability of a change. An unwrap in their core proxy system indicates a problem in their change management process (review, linting, whatever).
It's one thing to not want to be the one to armchair it, but that doesn't mean that one has to suppress their normal and obvious reactions. You're allowed to think things even if they're kitsch, you too are human, and what's kitsch depends and changes. Applies to everyone else here by extension too.
> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
> Enabling more global kill switches for features
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
> Reviewing failure modes for error conditions across all core proxy modules
Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?
This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.
Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.
I wonder why ClickHouse is used to store the feature flags here, as it has its own duplication footguns[0] which could also easily have led to a query blowing up 2-3x in size. OLTP/SQLite seems better suited, but I'm sure they have their reasons.
[0] https://clickhouse.com/docs/guides/developer/deduplication
Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.
It’s not a terrible idea, in that you can test the exact database engine binary in CI, and it’s (by definition) not a single point of failure.
I love sqlite for some things, but it's not The One True Database Solution.
People think of configuration updates (or state updates, call them what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will allow changes like these into production unattended -- even ones who wouldn't dare let a single line of code go live without being subject to the full CI/CD process.
The point here remains: consider every change to involve risk, and architect defensively.
And I hope fly.io has these mechanisms as well :-)
If the exact same thing happens again at Cloudflare, they'll be fair game. But right now I feel people on this thread are doing exactly, precisely, surgically and specifically the thing Richard Cook and the Cook-ites try to get people not to do, which is to see complex system failures as predictable faults with root causes, rather than as part of the process of creating resilient systems.
Fires happen every day. Smoke alarms go off, firefighters get called in, incident response is exercised, and lessons from the situation are learned (with resulting updates to the fire and building codes).
Yet even though this happens, entire cities almost never burn down anymore. And we want to keep it that way.
As Cook points out, "Safety is a characteristic of systems and not of their components."
No matter what architecture, processes, software, frameworks, and systems you use, or how exhaustively you plan and test for every failure mode, you cannot 100% predict every scenario and claim "cellular architecture fixes this". This includes making 100% of all failures "contained". Not realistic.
Cellular architecture within a region is the next level and is more difficult, but is achievable if you adhere to the same principles that prohibit inter-regional coupling:
https://docs.aws.amazon.com/wellarchitected/latest/reducing-...
https://docs.aws.amazon.com/wellarchitected/latest/reducing-...
Amazon has had multi-region outages due to pushing bad configs, so it’s extremely difficult to believe whatever you are proposing solves that exact problem by relying on multi-regions.
Come to think of it, Cloudflare’s outage today is another good counterexample.
Customers survive incidents on a daily basis by failing over across regions (even in the absence of an AWS regional failure, they can fail due to a bad deployment or other cause). The reason you don’t hear about it is because it works.
Complex system failures are not monocausal! Complex systems are in a continuous state of partial failure!
In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.
Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.
In reference to fault isolation boundaries: I am not familiar with their CI/CD. In theory the error could have been caught or prevented there, but that depends on a lot of things and is tricky. It looks like they didn't go the extra mile on safety-sensitive areas. Euphemistically speaking, they are now recalibrating their balance of safety measures.
Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)
And since it seems this is hosted by Atlassian, this would be up to Atlassian.
[0] https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...
IDK Atlassian Statuspage clientele, but it's possible Cloudflare is much larger than usual.
At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").
The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.
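Something along these lines, as a sketch. The function names, the column-budget framing and the message text are all invented for illustration; the point is only that the panic message names the broken assumption.

```rust
/// Hypothetical check: how much of the preallocated column budget is left.
fn column_budget_left(column_count: usize, limit: usize) -> Result<usize, String> {
    limit
        .checked_sub(column_count)
        .ok_or_else(|| format!("{} columns, limit {}", column_count, limit))
}

/// Still panics on failure, but the message says which invariant broke
/// and where to start looking.
fn budget_or_die(column_count: usize, limit: usize) -> usize {
    column_budget_left(column_count, limit)
        .expect("feature column count exceeded limit; the database schema or query is likely wrong")
}
```

On failure the panic output includes both the `expect` text and the `Err` payload, which is a far better starting point than a bare unwrap message.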
If anyone at cloudflare is here please let me in that codebase :)
Permitting it in development is why one ends up in the position of having to use an `expect()` in production code, because your API surfaces are wrong and can’t model your actual invariants.
There needs to be something at the top level that can handle a crashing process.
Or can an unwrap be stopped?
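As far as I know, the closest thing in std is `std::panic::catch_unwind`, which only works when the binary is built with the default `panic = "unwind"` strategy and which the std docs explicitly do not recommend as a general try/catch. A minimal sketch, with `run_isolated` being a name I made up:

```rust
use std::panic;

/// Run one unit of work and turn a panic into an `Err` carrying the message.
/// Has no effect if the program is compiled with `panic = "abort"`.
fn run_isolated<F>(work: F) -> Result<u32, String>
where
    F: FnOnce() -> u32 + panic::UnwindSafe,
{
    panic::catch_unwind(work).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

So an unwrap can be caught at a boundary, but whatever state the panicking code was mutating may be left half-updated, which is the same hazard as unchecked exceptions.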
This is just a normal Tuesday for languages with Exception and try/catch.
Yes, unfortunately, random stack unrolls and weird state bugs as a result are a normal Tuesday for languages with (unchecked) Exception and try/catch
Unwrap gives you a stack trace, while a returned Err doesn't, so simply using a Result for that line of code could have been even harder to diagnose.
`unwrap_or_default()` or other ways of silently eating the error would be less catastrophic immediately, but could still end up breaking the system down the line, and likely make it harder to trace the problem to the root cause.
The problem is deeper than an unwrap(), related to handling rollouts of invalid configurations, but that's not a 1-line change.
The problem is that they didn't surface a failure case, which means they couldn't handle rollouts of invalid configurations correctly.
The use of `.unwrap()` isn't superficial at all -- it hid an invariant that should have been handled above this code. The failure to correctly account for and handle those true invariants is exactly what caused this failure mode.
Best post mortem I've read in a while, this thing will be studied for years.
A bit ironic that their internal FL2 tool is supposed to make Cloudflare "faster and more secure" but brought a lot of things down. And yeah, as others have already pointed out, that's a very unsafe use of Rust that should never have made it to production.
I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
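A sketch of how that could look. To be precise about who enforces what: the lint only fires under `cargo clippy` (plain `rustc` accepts the attribute and ignores it), and the INFALLIBILITY comment is a team convention that clippy itself does not check. The `proxy` module and its functions are made up for illustration.

```rust
// Deny unwraps everywhere in this module unless explicitly allowed.
#[deny(clippy::unwrap_used)]
mod proxy {
    /// Fallible parse: the caller must handle the error.
    pub fn parse_port(s: &str) -> Result<u16, std::num::ParseIntError> {
        s.trim().parse::<u16>()
    }

    // INFALLIBILITY: the literal "8080" always parses as a u16.
    #[allow(clippy::unwrap_used)]
    pub fn default_port() -> u16 {
        "8080".parse::<u16>().unwrap()
    }
}
```

Every exception then shows up in review as an explicit `allow` with a justification next to it, which is the same discipline as a SAFETY comment on an unsafe block.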
How about indexing into a slice/map/vec? Should every `foo[i]` have an infallibility comment? Because they're essentially `get(i).unwrap()`.
For the 5% of cases that are too complex for standard iterators? I never bother justifying why my indexes are correct, but I don't see why not.
You very rarely need SAFETY comments in Rust because almost all the code you write is safe in the first place. The language also gives you the tool to avoid manual iteration (not just for safety, but because it lets the compiler eliminate bounds checks), so it would actually be quite viable to write these comments, since you only need them when you're doing something unusual.
So: first, identify code that cannot be allowed to panic. Within that code, yes, in the rare case that you use [i], you need to at least try to justify why you think it'll be in bounds. But it would be better not to.
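A tiny sketch of the three styles being discussed, with made-up function names: raw indexing (panics out of bounds), `get` (forces the caller to handle the miss), and iterator-based code that never indexes at all.

```rust
/// Panics if `i` is out of bounds, much like `xs.get(i).unwrap()` would.
fn nth_indexed(xs: &[u32], i: usize) -> u32 {
    xs[i]
}

/// Surfaces the out-of-bounds case to the caller instead of panicking.
fn nth_checked(xs: &[u32], i: usize) -> Option<u32> {
    xs.get(i).copied()
}

/// Iterator-based: there is no index to get wrong.
/// Note: `a + b` can still overflow for large values; that is a separate concern.
fn adjacent_sums(xs: &[u32]) -> Vec<u32> {
    xs.iter().zip(xs.iter().skip(1)).map(|(a, b)| a + b).collect()
}
```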
There are a couple of attempts at getting the compiler to prove that code can't panic (e.g., the no-panic crate).
I want to ban crates that panic from my dependency chain.
The language could really use an extra set of static guarantees around this. I would opt in.
Something that allows me to annotate a function (or my whole crate) as "no panic", and get a compile error if the function or anything it calls has a reachable panic.
This will allow it to work with many unmodified crates, as long as constant propagation can prove that any panics are unreachable. This approach will also allow crates to provide panicking and non panicking versions of their API (which many already do).
On the subject of this, I want more ability to filter out crates in our Cargo.toml. Such as a max dependency depth. Or a frozen set of dependencies that is guaranteed not to change so audits are easier. (Obviously we could vendor the code in and be in charge of our own destiny, but this feels like something we can let crate authors police.)
Which means banning anything that allocates memory and thousands of stdlib functions/methods.
I'm fine with allocation failures. I don't want stupid unwrap()s, improper slice access, or other stupid and totally preventable behavior.
There are things inside the engineer's control. I want that to not panic.
I don't want dependencies deciding to unwrap() or expect() some bullshit and that causing my entire program to crash because I didn't anticipate or handle the panic.
Code should be written, to the largest extent possible, to mitigate errors using Result<>. This is just laziness.
I want checks in the language to safeguard against lazy Rust developers. I don't want their code in my dependency tree, and I want static guarantees against this.
edit: I just searched unwrap() usage on Github, and I'm now kind of worried/angry:
https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
A lot of this is just pure laziness.
Look at how many lazy cases of this there are in Rust code [1].
Some of these are no doubt tested (albeit impossible to statically guarantee), but a lot of it looks like sloppiness or not leaning on the language's strong error handling features.
It's disappointing to see. We've had so much of this creep into the language that eventually it caused a major stop-the-world outage. This is unlikely to be the last time we see it.
[1] https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
A language DX feature I quite like is when dangerous things are labelled as such. IIRC, some examples of this are `accursedUnutterablePerformIO` in Haskell, and `DO_NOT_USE_OR_YOU_WILL_BE_FIRED_EXPERIMENTAL_CREATE_ROOT_CONTAINERS` in React.js.
I still think we should remove them outright or make production code fail to compile without a flag allowing them. And we also need tools to start cleaning up our dependency tree of this mess.
Unless the language addresses no-panic in its governing design or allows try-catch, not sure how you go about this.
https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
This is ridiculous. We're probably going to start seeing more of these. This was just the first, big highly visible instance.
We should have a name for this similar to "my code just NPE'd". I suggest "unwrapped", as in, "My Rust app just unwrapped a present."
I think we should start advocating for the deprecation and eventual removal of the unwrap/expect family of methods. There's no reason engineers shouldn't be handling Options and Results gracefully, either passing the state to the caller or turning to a success or fail path. Not doing this is just laziness.
Could you share some more details, maybe one fully concrete scenario? There are lots of techniques, but there's no one-size-fits-all solution.
The developer was lazy.
A lot of Rust developers are: https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
The details depend a lot on what you're doing and how you're doing it. Does the graph grow? Shrink? Do you have more than one? Do you care about programmer error types other than panic/UB?
Suppose, e.g., that your graph doesn't change sizes, you only have one, and you only care about panics/UB. Then you can get away with:
1. A dedicated index type, unique to that graph (shadow / strong-typedef / wrap / whatever), corresponding to whichever index type you're natively using to index nodes.
2. Some mechanism for generating such indices. E.g., during graph population phase you have a method which returns the next custom index or None if none exist. You generated the IR with those custom indexes, so you know (assuming that one critical function is correct) that they're able to appropriately index anywhere in your graph.
3. You have some unsafe code somewhere which blindly trusts those indices when you start actually indexing into your array(s) of node information. However, since the very existence of such an index is proof that you're allowed to access the data, that access is safe.
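A minimal sketch of steps 1 to 3, under the stated assumptions (a single graph, nodes never removed). `NodeId` and `Graph` are illustrative names. I've used ordinary safe indexing where step 3 describes unsafe code, so a bug here would panic rather than be undefined behaviour; a tuned version might swap in unchecked access once the invariant is trusted.

```rust
mod graph {
    /// Index into the graph. The field is private, so code outside this
    /// module cannot forge one: every `NodeId` was handed out by `add_node`.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct NodeId(usize);

    /// Grow-only graph: nodes are never removed, so a `NodeId` stays valid.
    pub struct Graph {
        labels: Vec<String>,
        edges: Vec<Vec<NodeId>>,
    }

    impl Graph {
        pub fn new() -> Self {
            Graph { labels: Vec::new(), edges: Vec::new() }
        }

        /// The only place a `NodeId` is minted.
        pub fn add_node(&mut self, label: &str) -> NodeId {
            self.labels.push(label.to_string());
            self.edges.push(Vec::new());
            NodeId(self.labels.len() - 1)
        }

        pub fn add_edge(&mut self, from: NodeId, to: NodeId) {
            self.edges[from.0].push(to);
        }

        /// In bounds by construction, given the single-graph assumption.
        pub fn label(&self, id: NodeId) -> &str {
            &self.labels[id.0]
        }

        pub fn neighbors(&self, id: NodeId) -> &[NodeId] {
            &self.edges[id.0]
        }
    }
}
```

The weak spot is exactly the "only one graph" assumption: with two `Graph` values, an id from one could be used on the other.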
Techniques vary from language to language and depending on your exact goals. GhostCell [0] in Rust is one way of relegating literally all of the unsafe code to a well-vetted library, and it uses tagged types (via lifetimes), so you can also do away with the "only one graph" limitation. It's been awhile since I've looked at it, but resizes might also be safe pretty trivially (or might not be).
The general principle though is to structure your problem in such a way that a very small amount of code (so that you can more easily prove it correct) can provide promises that are enforceable purely via the type system (so that if the critical code is correct then so is everything else).
That's trivial by itself (e.g., just rely on option-returning .get operators), so the rest of the trick is to find a cheap place in your code which can provide stronger guarantees. For many problems, initialization is the perfect place: you can bounds-check on init and then not worry about it again. If even bounds-checking on initialization is too slow, you can still use initialization to write out a proof of why some invariant holds and then blindly/unsafely assert it to be true; you then immediately pack that hard-won information into a dedicated type, so that the only place you ever have to think about it is initialization.
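A rough sketch of the "check once at initialization, then carry the proof in a type" idea. `Table` and its fields are hypothetical; the point is only that the bounds check runs once in `new` and later lookups rely on it.

```rust
/// Hypothetical lookup table: `offsets` point into `data`.
/// Invariant (established in `new`, never broken afterwards because the
/// fields are private and nothing mutates them): every offset < data.len().
pub struct Table {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl Table {
    /// Validates every offset once. Returns `None` if any is out of range.
    pub fn new(data: Vec<u8>, offsets: Vec<usize>) -> Option<Table> {
        if offsets.iter().all(|&off| off < data.len()) {
            Some(Table { data, offsets })
        } else {
            None
        }
    }

    /// `slot` comes from the caller and is untrusted, so it goes through the
    /// Option-returning `get`. The stored offset is trusted.
    pub fn byte_at(&self, slot: usize) -> Option<u8> {
        let off = *self.offsets.get(slot)?;
        // SAFETY: `new` checked `off < self.data.len()` and neither field
        // has been modified since.
        Some(unsafe { *self.data.get_unchecked(off) })
    }
}
```

If the check in `new` is itself too expensive, the same shape still works with an `unsafe fn new_unchecked` whose documentation states the invariant the caller has to prove.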
* Graph/tree traversal functions that take a visitor function as a parameter
* Binary search on sorted arrays
* Binary heap operations
* Probing buckets in open-addressed hash tables
The smoltcp crate typically uses runtime checks to ensure slice accesses made by the library do not cause a panic. It's not exactly equivalent to GP's assertion, since it doesn't cover "every single slice access", but it at least covers slice accesses triggered by the library's public API. (i.e. none of the public API functions should cause a panic, assuming that the runtime validation after the most recent mutation succeeds).
Example: https://docs.rs/smoltcp/latest/src/smoltcp/wire/ipv4.rs.html...
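For readers who don't want to click through, the validate-then-access pattern looks roughly like this. This is a sketch in the same spirit, not smoltcp's actual code: `Header`, `Truncated`, and the 4-byte layout are invented for illustration.

```rust
/// Returned when the buffer is too short to hold a header.
#[derive(Debug, PartialEq, Eq)]
pub struct Truncated;

/// Hypothetical 4-byte header: version nibble in byte 0, big-endian total
/// length in bytes 2..4.
pub struct Header<'a> {
    buf: &'a [u8],
}

impl<'a> Header<'a> {
    pub const LEN: usize = 4;

    /// The only runtime check. Every accessor below relies on it.
    pub fn new_checked(buf: &'a [u8]) -> Result<Self, Truncated> {
        if buf.len() < Self::LEN {
            Err(Truncated)
        } else {
            Ok(Header { buf })
        }
    }

    /// Cannot panic: `new_checked` guaranteed at least `LEN` bytes.
    pub fn version(&self) -> u8 {
        self.buf[0] >> 4
    }

    /// Cannot panic for the same reason.
    pub fn total_len(&self) -> u16 {
        u16::from_be_bytes([self.buf[2], self.buf[3]])
    }
}
```

The indexing expressions still compile to a panicking branch, but that branch is unreachable as long as every `Header` is built through `new_checked`.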
I think adoption would have played out very differently if there had only been some more syntactic sugar. For example, an easy syntax for saying: "In this method, any (checked) DeepException e that bubbles up should immediately be replaced by a new (checked) MylayerException(e) that contains the original one as a cause."
We might still get lazy programmers making systems where every damn thing goes into a generic MylayerException, but that mess would still be way easier to fix later than a hundred scattered RuntimeExceptions.
I think it comes down to a psychological or use-case issue: People hate thinking about errors and handling them, because it's that hard stuff that always consumes more time than we'd like to think. Not just digitally, but in physical machines too. It's also easier to put off "for later."
1. in most cases they don't want to handle `InterruptedException` or `IOException` and yet need to bubble them up. In that case the code is very verbose.
2. it makes lambdas and functions incompatible. So eg: if you're passing a function to forEach, you're forced to wrap it in runtime exception.
3. Due to (1) and (2), most people become lazy and do `throws Exception` which negates most advantages of having exceptions in the first place.
In line-of-business apps (where Java is used the most), an uncaught exception is not a big deal. It will bubble up and gets handled somewhere far up the stack (eg: the server logger) without disrupting other parts of the application. This reduces the utility of having every function throw InterruptedException / IOException when those hardly ever happen.
This is true, but the hate predated lambdas in Java.
In my experience, it actually is a big deal, leaving a wake of indeterminate state behind after stack unrolling. The app then fails with heisenbugs later, raising more exceptions that get ignored, compounding the problem.
People just shrug off that unreliability as an unavoidable cost of doing business.
The problem is that any non-trivial software is composition, and encapsulation means most errors aren't recoverable.
We just need easy ways to propagate exceptions out to the appropriate reliability boundary, ie. the transaction/ request/ config loading, and fail it sensibly, with an easily diagnosable message and without crashing the whole process.
C# or unchecked Java exceptions are actually fairly close to ideal for this.
The correct paradigm is "prefer throw to catch" -- requiring devs to check every ret-val just created thousands of opportunities for mistakes to be made.
By contrast, a reliable C# or Java version might have just 3 catch clauses and handle errors arising below sensibly without any developer effort.
https://literatejava.com/exceptions/ten-practices-for-perfec...
Exceptions force a panic on all errors, which is why they're supposed to be used in "exceptional" situations. To avoid exceptions when an error is expected, (eof, broken socket, file not found,) you either have to use an unnatural return type or accept the performance penalty of the panic that happens when you "throw."
In Rust, the stack trace happens at panic (unwrap), which is when the error isn't handled. IE, it's not when the file isn't found, it's when the error isn't handled.
Can't Hotspot not generate the stack trace when it knows the exception will be caught and the stack trace ignored?
Actually it can also just turn off the collection of stack traces entirely for throw sites that are being hit all the time. But most Java code doesn't need this because code only throws exceptions for exceptional situations.
What do you mean?
Exceptions do not force panic at all. In most practical situations, an exception unhandled close to where it was thrown will eventually get logged. It's kind of a "local" panic, if you will, that will terminate the specific function, but the rest of the program will remain unaffected. For example, a web server might throw an exception while processing a specific HTTP request, but other HTTP requests are unaffected.
Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state, and therefore does not require terminating the entire program.
When everyone uses runtime exceptions and doesn’t account for exception handling in every possible code path, that’s exactly what it means.
When you work with exceptions, the key is to assume that every line can throw unless proven otherwise, which in practice means almost all lines of code can throw. Once you adopt that mental model, things get easier.
It also makes errors part of the API contract, which is where they belong, because they are.
The point about being explicitly part of the API stands, though.
That's not what a panic means. Take a read through Go's panic / resume mechanism; it's similar to exceptions, but the semantics (with multiple return values) make it clear that panic is for exceptional situations. (IE, panic isn't for "file not found," but instead it's for when code isn't written to handle "file not found.")
Even Rust has mechanisms to panic without aborting the process, although I will readily admit that I haven't used them and don't understand them: https://doc.rust-lang.org/std/panic/fn.resume_unwind.html
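For what it's worth, the function most code uses for this is `std::panic::catch_unwind` (`resume_unwind` is its counterpart for re-raising a caught payload). A minimal sketch of a per-request boundary follows; `parse_port` and `handle_request` are hypothetical names, and it only works when the binary is built with the default `panic = "unwind"` strategy.

```rust
use std::panic;

/// Deliberately sloppy: panics on bad input instead of returning an error.
pub fn parse_port(raw: &str) -> u16 {
    raw.parse().unwrap()
}

/// Runs one unit of work and turns a panic inside it into an `Err`,
/// so the caller can log it and keep serving other requests.
pub fn handle_request(raw: &str) -> Result<u16, String> {
    panic::catch_unwind(|| parse_port(raw)).map_err(|payload| {
        // Panic payloads are usually a `String` or a `&'static str`.
        let msg = payload
            .downcast_ref::<String>()
            .map(String::as_str)
            .or_else(|| payload.downcast_ref::<&str>().copied())
            .unwrap_or("unknown panic");
        format!("request failed: {}", msg)
    })
}
```

The default panic hook still prints the panic message to stderr before unwinding, so the failure is logged even though the process keeps running.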
In theory, theory and practice are the same. In practice...
You can't throw a checked exception in a stream, this fact actually underlines the key difference between an exception and a Result: Result is in return position and exceptions are a sort of side effect that has its own control flow. Because of that, once your method throws an Exception or you are writing code in a try block that catches an exception, you become blind to further exceptions of that type, even if you might be able to or required to fix those errors. Results are required to be handled individually and you get syntactic sugar to easily back propagate.
It is trivial to include a stack trace, but stack traces are really only useful for identifying where something occurred, and generally what is superior is attaching context as you back propagate which trivially occurs with judicious use of custom error types with From impls. Doing this means that the error message uniquely defines the origin and paths it passed through without intermediate unimportant stack noise. With exceptions you would always need to catch each exception and rethrow a new exception containing the old to add contextual information, then to avoid catching too much you need variables that will be initialized inside the try block defined outside of the try block. So stack traces are basically only useful when you are doing Pokemon exception handling.
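A small sketch of what "custom error types with From impls" looks like in practice. `PortError` and `StartupError` are made-up layer names; the `?` operator calls the `From` impl at each layer boundary, so the final message records the path the error took.

```rust
use std::error::Error;
use std::fmt;
use std::num::ParseIntError;

/// Lower layer: wraps the std parse error.
#[derive(Debug)]
pub struct PortError(ParseIntError);

impl From<ParseIntError> for PortError {
    fn from(e: ParseIntError) -> Self {
        PortError(e)
    }
}

impl fmt::Display for PortError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid port: {}", self.0)
    }
}

impl Error for PortError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.0)
    }
}

/// Upper layer: wraps the lower layer's error.
#[derive(Debug)]
pub struct StartupError(PortError);

impl From<PortError> for StartupError {
    fn from(e: PortError) -> Self {
        StartupError(e)
    }
}

impl fmt::Display for StartupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "startup failed: {}", self.0)
    }
}

impl Error for StartupError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.0)
    }
}

pub fn parse_port(raw: &str) -> Result<u16, PortError> {
    // `?` converts ParseIntError into PortError through the From impl.
    Ok(raw.trim().parse::<u16>()?)
}

pub fn start(raw: &str) -> Result<u16, StartupError> {
    // `?` converts PortError into StartupError through the From impl.
    let port = parse_port(raw)?;
    Ok(port)
}
```

A failing `start` call yields a message that begins with "startup failed: invalid port:", followed by the underlying parse error, with no stack frames involved.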
The same is required for any principled error handling.
It's not a checked exception without a stack trace.
Rust doesn't have Java's checked or unchecked exception semantics at the moment. Panics are more like Java's Errors (e.g. OOM error). Results are just error codes on steroids.
A few ideas:
- It should not compile in production Rust code
- It should only be usable within unsafe blocks
- It should require explicit "safe" annotation from the engineer. Though this is subject to drift and become erroneous.
- It should be possible to ban the use of unsafe in dependencies and transitive dependencies within Cargo.
unwrap() should effectively work as a Result<> where the user must manually invoke a panic in the failure branch. Make special syntax if a match and panic is too much boilerplate.
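Nothing stops you from writing code in that style today. A tiny sketch (the function `first_word` is hypothetical) where the panic is spelled out in the failure branch instead of hidden behind `unwrap()`:

```rust
/// Returns the first whitespace-separated word of `line`.
/// Panics, visibly and with a reason, if the line is blank.
pub fn first_word(line: &str) -> &str {
    match line.split_whitespace().next() {
        Some(word) => word,
        None => panic!("first_word called with a blank line: {:?}", line),
    }
}
```

The behavior is the same as `unwrap()`, but the failure branch is visible at the call site and easy to grep for.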
This is like an implicit null pointer exception that cannot be statically guarded against.
I want a way to statically block any crates doing this from my dependency chain.
How was I informed as a user? It's not in the type signature.
Sounds like I get to indeterminately crash at runtime and have a fun time debugging.
Absent that there are hacks like no_panic[2]
[0] https://blog.yoshuawuyts.com/extending-rusts-effect-system/ [1] https://koka-lang.github.io/koka/doc/book.html#why-effects [2] https://crates.io/crates/no-panic
I don’t think you can ever completely eliminate panics, because there are always going to be some assumptions in code that will be surprisingly violated, because bugs exist. What if the heap allocator discovers the heap is corrupted? What if you reference memory that’s paged out and the disk is offline? (That one’s probably not turned into a panic, but it’s the same principle.)
Software engineers tend to get stuck in software problems and thinking that everything should be fixed in code. In reality there are many things outside of the code that you can do to operate unreliable components safely.
There's also an assumption here that if the unwrap wasn't there, the caller would have handled the error properly. But if this isn't part of some common library at CF, then chances are the caller is the same person who wrote the panicking function in the first place. So if a new error variant they introduced was returned they'd probably still abort the thread either by panicking at that point or breaking out of the thread's processing loop.
  if (userOpt.isPresent()) {
    var user = userOpt.get();
    var accountOpt = accountRepository.selectAccountOpt(user.getId());
    var account = accountOpt.orElseThrow();
  }
Idea checks it by default and highlights if I've used `get()` without a previous check. It's not forced at compiler level, but it's good enough for me.
Not unlike people having a blind spot for Rust in general, no?
An unwrap should never make it to production IMHO. It's fine while prototyping, but once the project gets closer to production it's necessary to just grep `unwrap` in your code and replace those that can happen with a proper error management and replace those that cannot happen with `expect`, with a clear justification of why they cannot happen unless there's a bug somewhere else.
unwrap isn't like that.
The message isn't really here to be displayed during a crash (since the crash should never happen in the first place), it's here to communicate the invariant in the code, to the developer reading and modifying it later on.
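Two short illustrations of `expect` used that way, with the message stating why the failure branch is unreachable. Both functions are hypothetical examples, not taken from any codebase discussed here.

```rust
use std::net::Ipv4Addr;

/// The input is a literal in the source, so parsing can only fail if
/// someone edits the literal incorrectly.
pub fn loopback() -> Ipv4Addr {
    "127.0.0.1"
        .parse()
        .expect("hard-coded literal is a valid IPv4 address")
}

/// `n % 10` is always in 0..=9, which `char::from_digit` accepts for radix 10.
pub fn last_digit(n: u32) -> char {
    char::from_digit(n % 10, 10)
        .expect("n % 10 is always below 10, so it is a valid decimal digit")
}
```

If either message ever shows up in a log, it points at a bug in the reasoning written in the string, which is the whole point.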
We are now discussing what can be done to improve code correctness beyond memory and thread safety. I am excited for what is to come.
I am not sure that watching the trendy forefront successfully reach the 1990s and discuss how unwrapping Option is potentially dangerous really warms my heart. I can’t wait for the complete meltdown when they discover effect systems in 2040.
To be more serious, this kind of incident is yet another reminder that software development remains miles away from proper engineering and even key providers like Cloudflare utterly fail at proper risk management.
Celebrating because there is now one popular language using static analysis for memory safety feels to me like being happy we now teach people to swim before a transatlantic boat crossing while we refuse to actually install life boats.
To me the situation has barely changed. The industry has been refusing to put in place strong reliability practices for decades, keeps significantly under investing in tools mitigating errors outside of a few fields where safety was already taken seriously before software was a thing and keeps hiding behind the excuse that we need to move fast and safety is too complex and costly while regulation remains extremely lenient.
I mean this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
> I mean this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
Probably nothing, because most paying customers of cloudflare are probably signing away their rights to sue Cloudflare for damages by being down for a while when they purchase Cloudflare's services (maybe some customers have SLAs with monetary values attached, I dunno). I honestly have a hard time suggesting that those customers are individually wrong to do so - Cloudflare isn't down that often, and whatever amount it cost any individual customer by being down today might be more than offset by the DDOS protection they're buying.
Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see. Should it be illegal under US law to use `unwrap` in Rust code? Should it be illegal for any single internet services company to have more than X number of customers? A lot of the internet also breaks when AWS goes down because many people like to use AWS, so maybe they should be included in this regulatory framework too.
We have collectively agreed to a world where software service providers have no incentive to be reliable as they are shielded from the consequences of their mistakes and somehow we see it as acceptable that software has a ton of issues and defects. The side effect is that research on actually lowering the cost of safety has little return on investment. It doesn't have to be so.
> Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see.
I want software providers to be liable for the damage they cause and minimum quality regulation on par with an actual engineering discipline. I have always been astounded that nearly all software licences start with extremely broad limitation of liability provisions and people somehow feel fine with it. Try to extend that to any other product you regularly use in your life and see how that makes you feel.
How to do proper testing, formal methods and resilient design has been known for decades. I would personally be more than okay with let's move less fast and stop breaking things.
So do you want to make it illegal to publish GNU GPL licensed software because that license has a warranty disclaimer? Do you want to make it illegal for a company like Cloudflare to use open source licensed software with similar warranty disclaimers, or for the SLA agreements and penalties for violating them that they make with their own paying customers to be legally unenforceable? What if I just have a personal website and I break the javascript on it because I was careless, how should that be legally treated?
I'm not against research into more reliable software or using better engineering techniques that result in more reliable software. What I'm concerned about is the regulatory regime - in other words, what software it is or is not legal to write or sell for money - and how to properly incentivize software service providers to use techniques that result in more reliable software without causing a bunch of bad second order effects.
You can't go out in the middle of your city, build a shoddy bridge, say you waive all responsibilities and then wash your hands of the consequences when it predictably breaks. Why can you do that with pieces of software?
Limiting the scope of liability waivers is not the same thing as censoring what software can be produced. It's just ensuring that everyone actually takes responsibility for the things they distribute.
As I said previously, the current situation doesn't make sense to me. People have been brainwashed in believing that the way software is released currently, half finished and crippled with bugs, is somehow normal and acceptable. It absolutely doesn't have to be this way.
It's beyond shameful that the average developer today is blissfully unaware of anything related to producing actually secure pieces of software. I am pretty sure I can walk into more than 90% of development shops today and no one there will know what formal methods are. With some luck, they might have some static analysers running, probably from a random provider and be happy with the crappy percentages that it outputs.
It's not about research. It's about a field which entirely refuses to become mature despite being pivotal to the modern economy. And why would it? Software products somehow get a free pass for the shit they push on everyone.
We are in the classical "market for lemons" trap where negative externalities are not priced in and investing in security will just get you to lose against companies that don't care. Every major incident reminds us we need out. The market has already shown it won't self-correct. It's a classical case where regulatory intervention is necessary and legitimate.
The shift is already happening by the way. The EU product liability directive was adopted in 2024 and the transition period ends in December 2026. The US "National Cybersecurity Strategy" signals intent to review the status quo. It's coming faster than people realise.
That we’re even having this discussion is a major step forward. That we’re still having this discussion is a depressing testament to how slowly the mainstream has adopted better ideas.
Zig is undergoing this meltdown. Shame it's not memory safe. You can only get so far in developing programming wisdom before Eternal September kicks in and we're back to re-learning all the lessons of history as punishment for the youthful hubris that plagues this profession.
But yes, I wish I had learned more, and somehow stumbled upon all the good stuff, or be taught at university about at least what Rust achieves today.
I think it has to be noted Rust still allows performance with the safety it provides. So that's something maybe.
The most useful thing exceptions give you is not static compile time checking, it's the stack trace, error message, causal chain and ability to catch errors at the right level of abstraction. Rust's panics give you none of that.
Look at the error message Cloudflare's engineers were faced with:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
That's useless, barely better than "segmentation fault". No wonder it took so long to track down what was happening.
A proxy stack written in a managed language with exceptions would have given an error message like this:
com.cloudflare.proxy.botfeatures.TooManyFeaturesException: 200 > 60
at com.cloudflare.proxy.botfeatures.FeatureLoader(FeatureLoader.java:123)
at ...
and so on. It'd have been immediately apparent what went wrong. The bad configs could have been rolled back in minutes instead of hours.
In the past I've been able to diagnose production problems based on stack traces so many times that I have been expecting an outage like this ever since the trend away from providing exceptions in new languages in the 2010s. A decade ago I wrote a defense of the feature and I hope we can now have a proper discussion about adding exceptions back to languages that need them (primarily Go and Rust):
https://blog.plan99.net/what-s-wrong-with-exceptions-nothing...
tldr: Capturing a backtrace can be a quite expensive runtime operation, so the environment variables allow either forcibly disabling this runtime performance hit or allow selectively enabling it in some programs.
By default it is disabled in release mode.
Similarly, capturing a stack trace in an error type (within a Result for example) is perfectly possible. But this is a choice left to the programmer, because capturing a trace is not cheap.
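A minimal sketch of an error type that carries a trace, using `std::backtrace::Backtrace` from the standard library. `LoadError` is a made-up type. `Backtrace::capture()` only records frames when `RUST_BACKTRACE` or `RUST_LIB_BACKTRACE` enables it, which is the opt-in cost trade-off described above.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};
use std::fmt;

/// Hypothetical error that remembers where it was created.
#[derive(Debug)]
pub struct LoadError {
    pub message: String,
    pub backtrace: Backtrace,
}

impl LoadError {
    pub fn new(message: impl Into<String>) -> Self {
        LoadError {
            message: message.into(),
            // Cheap no-op unless backtraces are enabled via the environment.
            backtrace: Backtrace::capture(),
        }
    }

    /// True if a trace was actually recorded for this error.
    pub fn has_trace(&self) -> bool {
        self.backtrace.status() == BacktraceStatus::Captured
    }
}

impl fmt::Display for LoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)?;
        if self.has_trace() {
            write!(f, "\n{}", self.backtrace)?;
        }
        Ok(())
    }
}

impl std::error::Error for LoadError {}
```

`Backtrace::force_capture()` is the variant that ignores the environment variables, for services that always want the trace and accept the cost.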
I used to be an SRE at Google. Back then we also had big outages caused by bad data files pushed to prod. It's a common enough issue so I really sympathize with Cloudflare, it's not nice to be on call for issues like that. But Google's prod environments always generated stack traces for every kind of failure, including CHECK failures (panics) in C++. You could also reflect the stack traces of every thread via HTTP. I used to diagnose bugs in production under time pressure quite regularly using just these tools. You always need detailed diagnostics.
Languages shouldn't have panics, tbh, it's a primitive concept. It so rarely makes sense to handle errors that way. I know there's a whole body of Rust/Go lore claiming panics are fine, but it's not a good move and is one of the reasons I've stayed away from Go over the years and wouldn't use Rust for anything higher than low level embedded components or operating system code that has to export a C ABI. You always want diagnostics and recoverable errors; this kind of micro-optimization doesn't make sense outside of extremely constrained embedded environments that very few of us work in.
https://doc.rust-lang.org/std/panic/index.html
An uncaught exception in C++ or an uncaught panic in Rust terminates the program. The unwinding is the same mechanism. I think the implementation is what comes with LLVM, but I haven't checked.
I was also a Google SRE, and I liked the stacktrace facilities so much that I got permission to open source a library inspired by it: https://github.com/bombela/backward-cpp (I know I am not doing a great job maintaining it)
At Uber I implemented a similar stack trace introspection for RPC tasks via HTTP for Go services.
You can also catch a Go panic. Which we did in our RPC library at Uber.
It would be great for all of that to somehow come ready made though. A sort of flag "this program is a service, turn on all the good diagnostics, here is my main loop".
IMO making unwrap a clippy lint (or perhaps a warning) would be a decent start. Or maybe renaming unwrap.
A tenet of systems code is that every possible error must be handled explicitly and exhaustively close to the point of occurrence. It doesn’t matter if it is Rust, C, etc. Knowing how to write systems code is unrelated to knowing a systems language. Rust is a systems language but most people coming into Rust have no systems code experience and are “holding it wrong”. It has been a recurring theme I’ve seen with Rust development in a systems context.
C is pretty broken as a language but one of the things going for it is that it has a strong systems code culture surrounding it that remembers e.g. why we do all of this extra error handling work. Rust really needs systems code practice to be more strongly visible in the culture around the language.
> A tenet of systems code is that every possible error must be handled explicitly and exhaustively close to the point of occurrence.
All the more reason it doesn't really belong in examples for third party libraries.
Change your API boundary, surface the discrepancy between your requirements and the potential failing case at the edges where it can be handled.
If you need the value, you need to handle the case that it’s not available explicitly. You need to define your error path(s)
Anything else leads to, well, this.
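As a rough illustration of "define your error path", here is a sketch of a loader that reports an oversized file through its return type and leaves the previous data in place. Everything in it is hypothetical (names, file format, and the limit of 4); it is not Cloudflare's code, only the shape of the idea.

```rust
use std::fmt;

/// Arbitrary limit chosen for the example.
pub const MAX_FEATURES: usize = 4;

#[derive(Debug, PartialEq, Eq)]
pub enum FeatureError {
    TooMany { got: usize, limit: usize },
}

impl fmt::Display for FeatureError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FeatureError::TooMany { got, limit } => {
                write!(f, "feature file has {} entries, limit is {}", got, limit)
            }
        }
    }
}

/// Parses one feature name per non-empty line.
pub fn parse_features(file: &str) -> Result<Vec<String>, FeatureError> {
    let features: Vec<String> = file
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    if features.len() > MAX_FEATURES {
        return Err(FeatureError::TooMany { got: features.len(), limit: MAX_FEATURES });
    }
    Ok(features)
}

/// Replaces `current` only if the new file is valid; on error the old
/// features stay active and the caller decides what to do with the error.
pub fn reload(current: &mut Vec<String>, file: &str) -> Result<(), FeatureError> {
    *current = parse_features(file)?;
    Ok(())
}
```

The caller of `reload` can log the error and keep serving with the last good set, which is a decision an `unwrap()` inside the loader would have taken away from it.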
This is a failure caused by lazy Rust programming and not relying on the language's design features.
It's a shame this code can even be written. It is surprising and escapes the expected safety of the language.
I'm terrified of some dependency using unwrap() or expect() and crashing for something entirely outside of my control.
We should have an opt-in strict Cargo.toml declaration that forbids compilation of any crate that uses entirely preventable panics. The only panics I'll accept are those relating to memory allocation.
This is one of the sharpest edges in the language, and it needs to be smoothed away.
The problem starts with Rust stdlib. It panics on allocation failure. You expect Rust programmers to look at stdlib and not imitate it?
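Worth noting that the standard library does offer fallible allocation on some collections, for example `Vec::try_reserve` and `Vec::try_reserve_exact`. A small sketch (the helper `copy_checked` is hypothetical):

```rust
use std::collections::TryReserveError;

/// Copies `src` into a new Vec, reporting allocation failure as an error
/// instead of aborting.
pub fn copy_checked(src: &[u8]) -> Result<Vec<u8>, TryReserveError> {
    let mut out = Vec::new();
    out.try_reserve_exact(src.len())?;
    // Does not reallocate: the capacity was reserved on the line above.
    out.extend_from_slice(src);
    Ok(out)
}
```

The default methods such as `push` and `extend` still abort on allocation failure, so this only helps where the code opts in explicitly.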
Sure, you can try to taboo unwrap(), but 1) it won't work, and 2) it'll contort program design in places where failure really is a logic bug, not a runtime failure, and for which unwrap() is actually appropriate.
The real solution is to go back in time, bonk the Rust designers over the head with a cluebat, and have them ship a language that makes error propagation the default and syntactically marks infallible cleanup paths --- like C++ with noexcept.
Of course it will. I've built enormous systems, including an entire compiler, without once relying on the local language equivalent of `.unwrap()`.
> 2) it'll contort program design in places where failure really is a logic bug, not a runtime failure, and for which unwrap() is actually appropriate.
That's a failure to model invariants in your API correctly.
> ... have them ship a language that makes error propagation the default and syntactically marks infallible cleanup paths --- like C++ with noexcept.
Unchecked exceptions aren't a solution. They're a way to avoid taking the thought, time, and effort to model failure paths, and instead leave that inherent unaddressed complexity until a runtime failure surprises users. Like just happened to Cloudflare.
Your argument also implies that things like `slice[i]` are never okay.
The blog post doesn’t address the issue, it simply pretends it’s not a real problem.
Also from the post: “If we were to steelman advocates in favor of this style of coding, then I think the argument is probably best limited to certain high reliability domains. I personally don’t have a ton of experience in said domains …”
Enough said.
> The blog post doesn’t address the issue, it simply pretends it’s not a real problem.
It very explicitly addresses it! It even gives real examples.
> Also from the post: “If we were to steelman advocates in favor of this style of coding, then I think the argument is probably best limited to certain high reliability domains. I personally don’t have a ton of experience in said domains …” > > Enough said.
Ad hominem... I don't have experience working on, e.g., medical devices upon which someone's life depends. So the point of that sentence is to say, "yes, I acknowledge this advice may not apply there." You also cherry picked that quote and left off the context, which is relevant here.
And note that you said:
> I have to disagree that unwrap is ever OK.
That's an extreme position. It isn't caveated to only apply to certain contexts.
It's not orthogonal. `Result` isn't a local invariant, and yes, `.unwrap()` does require lying. If your code depends on an API that can fail, and you cannot handle that failure locally (`.unwrap()` is not handling it), then your type signature needs to express that you can fail -- and you need to raise an error on that failure.
> That's an extreme position. It isn't caveated to only apply to certain contexts.
No, it's a principled position. Correct code doesn't `.unwrap()`, but code that hides failure cases -- or foists invariant enforcement onto programmers remembering not to screw up -- does.
I've built and worked on ridiculously complex code bases without a single instance of `.unwrap()` or the local language equivalent; it's just not necessary. This is just like the unchecked exception debate in Java -- complex explanations for a very simple goal of avoiding the thought, time, and effort to accurately model a system's invariants.
I don't think you understand what an internal runtime invariant is. Either way, I don't know of any widespread libraries (in any language) that follow this "principled" position. That makes it de facto extreme.
> I've built and worked on ridiculously complex code bases without a single instance of `.unwrap()` or the local language equivalent; it's just not necessary.
Show me. If you're using `slice[i]`, then you're using `unwrap()`. It introduces a panicking branch.
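To spell out the comparison, here are the two forms side by side (both functions are made up for the example):

```rust
/// Indexing: compiles to a bounds check plus a panicking branch.
pub fn third_indexing(xs: &[u32]) -> u32 {
    xs[2]
}

/// `get`: the same bounds check, but the failure comes back as `None`.
pub fn third_checked(xs: &[u32]) -> Option<u32> {
    xs.get(2).copied()
}
```

Clippy has a restriction lint for the first form as well (`clippy::indexing_slicing`), for codebases that want to rule it out.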
> If your code depends on an API that can fail, and you cannot handle that failure locally (`.unwrap()` is not handling it), then your type signature needs to express that you can fail -- and you need to raise an error on that failure.
You use `unwrap()` when you know the failure cannot happen.
I note you haven't engaged with any of the examples I provided in the blog.
That’s an invariant meant to be expressed by your type system — and it is.
You’ve failed to model your invariants in your API — and thus the type system — if you ever reach a point where an engineer has to manually assess and assert whether “cannot” applies.
You will get it wrong. That is bad code.
That gives you the same behavior as unwrap with a less useful error message though. In theory you can write useful messages, but in practice (and in your example) expect is rarely better than unwrap in modern Rust
This is Rust's Null Pointer Exception.
unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.
The language should grow the ability to mark this code as dangerous, and we should have static tools to exclude this code from our dependency tree.
I don't want some library I use to `unwrap()` and cause my application to crash because I didn't anticipate their stupid panic.
Rust developers have clearly leaned on this crutch far too often:
https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
The Rust team needs to plug this leak.
My blog on this topic was linked above, you should read it: https://burntsushi.net/unwrap/
> The language should grow the ability to mark this code as dangerous, and we should have static tools to exclude this code from our dependency tree.
Might be useful to point out that this static tool exists (clippy::unwrap_used).
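For anyone who hasn't used it, enabling the lint looks roughly like this. The module and function are hypothetical; `clippy::unwrap_used` and `clippy::expect_used` are real Clippy restriction lints, and plain `rustc` accepts the attribute and ignores it, so it only has an effect under `cargo clippy`.

```rust
// Any `.unwrap()` or `.expect()` inside this module becomes a Clippy error.
#[deny(clippy::unwrap_used, clippy::expect_used)]
pub mod strict {
    use std::num::ParseIntError;

    /// The failure is part of the signature; the caller has to deal with it.
    pub fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
        raw.trim().parse()
    }
}
```

As far as I know this applies to the crate being linted, not to its dependencies, so it doesn't by itself answer the "exclude it from my dependency tree" request.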
> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.
That's factually incorrect. (And insulting.)
Note that they're not criticizing the language. I read "Rust developers" in this context as developers using Rust, not those who develop the language and ecosystem. (In particular they were not criticizing you.)
I think it's reasonable to question the use of unwrap() in this context. Taking a cue from your blog post^ under runtime invariant violations, I don't think this use matches any of your cases. They assumed the size of a config file is small, it wasn't, so the internet crashed.
> We shouldn't be using unwrap() or expect() at all.
So the context of their comment is not some specific nuanced example. They made a blanket statement.
> Note that they're not criticizing the language. I read "Rust developers" in this context as developers using Rust, not those who develop the language and ecosystem.
I have the same interpretation.
> I think it's reasonable to question the use of unwrap() in this context. Taking a cue from your blog post^ under runtime invariant violations, I don't think this use matches any of your cases. They assumed the size of a config file is small, it wasn't, so the internet crashed.
Yes? I didn't say it wasn't reasonable to question the use of unwrap() here. I don't think we really have enough information to know whether it was inappropriate or not.
unwrap() is all about nuance. I hope my blog post conveyed that. Because unwrap() is a manifestation of an assertion on a runtime invariant. A runtime invariant can be arbitrarily complicated. So saying things like, "we shouldn't be using unwrap() or expect() at all" is an extreme position to carve out that is also way too generalized.
I stand by what I said. They are factually mistaken in their characterization of the use of unwrap()/expect() in general.
That is their opinion, I disagree with it, but I don't think it's an insulting or invalid opinion to have. There are codebases that ban nulls in other languages too.
> They are factually mistaken in their characterization of the use of unwrap()/expect() in general.
It's an opinion about a stylistic choice. I don't see what fact there is here that could be mistaken.
> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.
The factually incorrect part of this is the statement that use of `unwrap()`, `expect()` and so on is caused by X or Y, where X is "lazy Rust developers" and Y is "Rust developers not utilizing the language's design features." But there are, factually, other causes than X or Y for use of `unwrap()`, `expect()` and so on. So stating that it is all caused by X or Y is factually incorrect. Moreover, X is 100% insulting when applied to any one specific individual. Y can be insulting when applied to any one specific individual.
Now this:
> We shouldn't be using unwrap() or expect() at all.
That's an opinion. It isn't factually incorrect. And it isn't insulting.
> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features
I just read that line as shorthand for large outages caused by misuse of unwrap(), expect(), bad math etc. - all caused by...
That's also an opinion, by my reading.
I assumed we were talking specifically about misuses, not all uses of unwrap(), or all bad bugs. Anyway, I think we're ultimately saying the same thing. It's ironic in its own way.
I fully agree with burntsushi that echelon is taking an extreme and arguably wrong stance. His sentiment becomes more and more correct as Rust continues to evolve ways to avoid unwrap as an ergonomic shortcut, but I don't think we are quite there yet for general use. There absolutely is code that should never panic, but that involves tradeoffs and design choices that aren't true for every project (or even the majority of them)
And because it gets picked up by LLMs. It would be interesting to know if this particular .unwrap() was written by a human.
In theory, experienced human code reviewers can course correct newer LLM-guided devs' work before it blows up. In practice, reviewers are already stretched thin, and submitters' ability to now rapidly generate more and more code to review makes that exhaustion effect way worse. It becomes less likely they spot something small but obvious amongst the haystack of LLM generated code bailing their way.
Yes, and: I've found this to be mostly true, if you make sure you take the time to deeply understand what the code is doing. When I asked an LLM to do something for me in Javascript, then I said, "What if X happens, wouldn't that cause Y? Would it be better to restructure it like so and so to make it more robust?" The LLM immediately improves it.
Any experienced programmer who was taking the time to review this code, on learning that unwrap() has a "panic" inside, would certainly change it. But as you say, reviewers are already stretched thin.
It's not about whether you should ban unwrap() in production. You shouldn't. Some errors are logic bugs beyond which a program can't reasonably continue. The problem is that the language makes it too easy for junior developers (and AI!) to ignore non-logic-bug problems with unwrap().
Programmers early in their careers will do practically anything to avoid having to think about errors and they get angry when you tell them about it.
Except maybe Haskell.
How many times can you truly prove that an `unwrap()` is correct and that you also need that performance edge?
Ignoring the performance aspect that often comes from a hat-trick, to prove such a thing you need to be wary of the inner workings of a call giving you a `Result`. That knowledge is only valid at the time of writing your `unwrap()`, but won't necessarily hold later.
Also, aren't you implicitly forcing whoever changes the function to check for every smartass dev that decided to `unwrap` at their callsite? That's bonkers.
If I were Cloudflare I would immediately audit the codebase for all uses of unwrap (or similar rust panic idioms like expect), ensure that they are either removed or clearly documented as to why it's worth crashing the program there, and then add a linter to their CI system that will fire if anyone tries to check in a new commit with unwrap in it.
So the point of unwrap() is not to prove anything. Like an assertion it indicates a precondition of the function that the implementer cannot uphold. That's not to say unwrap() can't be used incorrectly. Just that it's a valid thing to do in your code.
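A small sketch of that distinction, with hypothetical function names: in the first function an empty slice is a caller bug, so the `expect` acts as an assertion on a documented precondition; in the second, emptiness is an expected condition and is returned as a value instead.

```rust
/// Returns the largest value in a slice the caller guarantees is non-empty.
/// Precondition: `values` must not be empty; violating it is a caller bug.
pub fn largest(values: &[u32]) -> u32 {
    // `iter().max()` returns None only for an empty slice, so with the
    // precondition upheld this `expect` never fires.
    *values
        .iter()
        .max()
        .expect("largest() called with an empty slice")
}

/// Same operation, but emptiness is an expected condition rather than a bug,
/// so it is reported to the caller instead of panicking.
pub fn try_largest(values: &[u32]) -> Option<u32> {
    values.iter().max().copied()
}
```

Which of the two is appropriate depends on who owns the invariant, which is the nuance being argued about here.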
Note that none of this is about performance.
A function or a keyword would interrupt that and make it less tempting
Returning a Result by definition means the method can fail.
No more than returning an int by definition means the method can return -2.
Some call points to a function that returns an int will never return -2.
Sometimes you know things the type system does not know.
What? Returning an int does in fact mean that the method can return -2. I have no idea what your argument is with this, because you seem to be disagreeing with the person while actually agreeing with them.
What? No it doesn't.
    fn square(n: i32) -> i32 {
        n * n
    }
This method cannot return -2.
Though in this case it's more like knowing that the specific way you call the function in foo.rs will never get back a -2.
    fn bar(n: i32, allow_negative: bool) -> i32 {
        let new = n * 2;
        if allow_negative || new >= 0 { new } else { 0 }
    }
    bar(x, false)
If unwrap() were named UNWRAP_OR_PANIC(), it would be used much less glibly. Even more, I wish there existed a super strict mode when all places that can panic are treated as compile-time errors, except those specifically wrapped in some may_panic_intentionally!() or similar.
It's way less book-keeping with exceptions, since you, intentionally, don't have to write code for that exceptional behavior, except where it makes sense to. The return by value method, necessarily, implements the same behavior, where handling is bubbled up to the conceptually appropriate place, through returns, but with much more typing involved. Care is required for either, since not properly bubbling up an exception can happen in either case (no re-raise for exceptions, no return after handling for return).
That is the main reason why zig doesn’t have exceptions.
With return values, you can trivially ignore an exception.
    let _ = fs::remove_file("file_doesn't_exist");
or
    value, error = some_function()
    // carry on without doing anything with error
In the wild, I've seen far more ignoring return errors, because of the mechanical burden of having type handling at every function call.
This is backed by decades of writing libraries. I've tried to implement libraries without exceptions, which was my admittedly cargo-cult preference long ago, but ignoring errors was so prevalent among the users of all the libraries that I now always include a "raise" type boolean that defaults to True for any function that returns an error value, to force exceptions, and their handling, as default behavior.
> In big projects you can basically never know when or how something can fail.
How is this fundamentally different than return value? Looking at a high level function, you can't know how it will fail, you just know it did fail, from the error being bubbled up through the returns. The only difference is the mechanism for bubbling up the error.
Maybe some water is required for this flame war. ;)
The great Raymond Chen wrote an excellent blog post on how this isn't really true, and how exceptions can lure programmers into mistakenly thinking they can just forget about failure cases.
Cleaner, more elegant, and harder to recognize https://devblogs.microsoft.com/oldnewthing/20050114-00/?p=36...
(ctrl-f for taskbar to skip to heart of his point.)
What he seems to be saying is that "obviously in C I would be checking the icon handle for being non-null so clearly error value handling is superior" but this is only obvious to someone knowing the API and checking values for validity has to be done in exception based code too. It's just that exception based code doesn't pretend that it cannot panic somewhere where you don't know. The default, better assumption for programming is that you don't know what this code is doing but it should just work. Unchecked exception handling is the best way to fit that paradigm, you should not have to care about every single line and what it does and constantly sort of almost obsessively check error values of all the APIs you ever use to have this false hope that it cannot panic because you did your duty. No, it can still panic and all this error checking is not helping you program better or more clearly or faster. It swamps the code with so many extra lines that it's practically double the size. All this makes it less clear and that is also what his post shows.
In practice, programmers don't find it easy to keep in mind that certain functions might throw. This is a real problem with unchecked exceptions and with C-style error codes that sloppy programmers might ignore entirely.
> [...] on which properties to set first can happen in any language
A carefully designed library using a statically typed functional language, especially a pure functional language, might sometimes be able to eliminate such hidden ordering bugs.
Rust used to have a feature to help the compiler detect invalid ordering of imperative operations, called typestates. This feature has since been mostly removed, though, as it saw little use. [0]
> isn't related at all to whether you use try-catch handling or error values as return codes
I guess Chen is assuming a reasonably diligent programmer who makes a habit of never discarding status/error values returned by functions. C++'s [[nodiscard]] can help ensure this.
(Of course, outside of C++, those aren't the only options. Idiomatic Haskell and Zig code forces the programmer to explicitly handle the possibility of an error. Same goes for Java's checked exceptions.)
> What he seems to be saying is that "obviously in C I would be checking the icon handle for being non-null so clearly error value handling is superior"
I don't think he's exactly arguing for the C-style approach, he's more just criticizing exceptions, especially unchecked exceptions. I agree the C-style approach has considerable problems.
> It's just that exception based code doesn't pretend that it cannot panic somewhere where you don't know.
With checked exceptions, you know precisely which operations can throw.
> Unchecked exception handling is the best way to fit that paradigm, you should not have to care about every single line and what it does and constantly sort of almost obsessively check error values of all the APIs you ever use to have this false hope that it cannot panic because you did your duty
You do need to care about every line, or your plausible-looking code is likely to misbehave when an exception occurs, as Chen's post demonstrates. Unchecked exceptions deprive the compiler of the ability to ensure good exception-handling coverage. There is no error-handling model that allows the programmer to write good code by pretending errors won't arise.
(I presume that by panic you mean throw an unchecked exception.)
    try {
        data = some_sketchy_function();
    } catch (e) {
        handle the error;
    }
vs
    result = some_sketchy_function();
    if let Err(e) = result {
        handle the error;
    }
Or better yet, compare the problematic cases where the error isn't handled:
    data = some_sketchy_function();
vs
    data = some_sketchy_function().UNWRAP_OR_PANIC();
In the former (the try-catch version that doesn't try or catch), the lack of handling is silent. It might be fine! You might just depend on your caller using `try`. In the latter, the compiler forces you to use UNWRAP_OR_PANIC (or, in reality, just unwrap) or `data` won't be the expected type and you will quickly get a compile failure.
What I suspect you mean, because it's a better argument, is:
    try {
        sketchy_function1();
        sketchy_function2();
        sketchy_function3();
        sketchy_function4();
    } catch (e) {
        ...
    }
which is fair, although how often is it really the right thing to let all the errors from 4 independent sources flow together and then get picked apart after the fact by inspecting `e`? It's an easier life, but it's also one where subtle problems constantly creep in without the compiler having any visibility into them at all.
There is already a try/catch around that code, which produces the Result type, which you can presumptuously .unwrap() without checking if it contains an error.
Instead, one should use the question mark operator, that immediately returns the error from the current function if a Result is an error. This is exactly similar to rethrowing an exception, but only requires typing one character, the "?".
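To make the comparison concrete, a minimal sketch using only the standard library (the function names are illustrative): the first version hides the error path behind `unwrap()`, the second hands the error to the caller with `?`.

```rust
use std::num::ParseIntError;

/// Panics on malformed input: the error path is hidden behind `unwrap()`.
pub fn sum_unwrap(a: &str, b: &str) -> i64 {
    a.parse::<i64>().unwrap() + b.parse::<i64>().unwrap()
}

/// Propagates the first parse error to the caller with `?`.
pub fn sum_checked(a: &str, b: &str) -> Result<i64, ParseIntError> {
    let x = a.parse::<i64>()?; // returns early with Err on failure
    let y = b.parse::<i64>()?;
    Ok(x + y)
}
```

The catch is that `?` only compiles when the enclosing function itself returns a compatible `Result` (or `Option`), so the signature has to admit that failure is possible.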
React.__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED comes to mind. I did have to reach for this before, but it certainly works for keeping it out of example code, and for making the danger very apparent when reading other implementations.
At some point it was renamed to __CLIENT_INTERNALS_DO_NOT_USE_OR_WARN_USERS_THEY_CANNOT_UPGRADE which is much less fun.
Not for this guy:
As usual: people problem, not a tech problem. In the last years a lot of strides have been made. But people will be people.
At some point machines will be better at coding, because machine code is, well, a machine instruction task.
Same as chess: engines are better than human grandmasters because it's a solvable math field.
Coding is no different.
Might be worth noting that your description of chess is slightly incorrect. Chess technically isn't solved in the sense that the optimal move is known for any arbitrary position; it's just that chess engines are using what amounts to a fancy brute force for most of the game and the combination of hardware and search algorithm produces a better result than the human brain does. As such, chess engines are still capable of making mistakes, even if actually exploiting them is a challenge.
"chess engines are still capable of making mistakes", I'm sorry no
inaccurate yes but not mistake
The thing is that there is no known general objective criteria for "best" and "bad" moves. The best we have so far is based on engine evaluations, but as I said before that is because chess engines are better at searching the board's state space than humans, not because chess engines have solved chess in the mathematical sense. Engines are quite capable of misevaluating positions, as demonstrated quite well by the Top Chess Engine Championship [0] where one engine thinks it made a good move while the other thinks that move is bad, and this is especially the case when resources are limited.
The closest we are to solving chess are via tablebases, which are far from covering the entire state space and are basically as much of an exemplar of pure brute force as you can get.
> "chess engines are still capable of making mistakes", I'm sorry no
If you think chess engines are infallible, then why does the Top Chess Engine Championship exist? Surely if chess engines could not make mistakes they would always agree on a position's evaluation and what move should be made, and therefore such an exercise would be pointless?
> inaccurate yes but not mistake
From the perspective of attaining perfect play, an inaccuracy is a mistake.
[0]: https://en.wikipedia.org/wiki/Top_Chess_Engine_Championship
Are you playing chess or not?????? If you play chess then it's obvious how to differentiate a bad move from the best move.
Yes it is objective, these thing called best move not without reason
"If you think chess engines are infalliable, then why does the Top Chess Engine Championship exist?"
to create better chess engine like what do even talking about here????, are you saying just because there are older bad engine that mean this thing is pointless ????
if you playing chess up to a decent level 1700+ (like me), you know that these argument its wrong and I assure you to learn chess to a decent level
up until that point that you know high level chess is brute force games and therefore solvable math
The key words in what I said are "general" and "objective". Yes, it's possible to determine "good" or "bad" moves in specific positions. There's no known method to determine "good" or "bad" moves in arbitrary positions, as would be required for chess to be considered strongly solved.
Furthermore, if it's "obvious" how to differentiate good and bad moves then we should never see engines blundering, right?
So (for example) how do you explain this game between Stockfish and Leela where Stockfish blunders a seemingly winning position [0]? After 37... Rdd8 both Stockfish and Leela think white is clearly winning (Stockfish's evaluation is +4.00, while Leela's evaluation is +3.81), but after 38. Nxb5 Leela's evaluation plummets to +0.34 while Stockfish's evaluation remains at +4.00. In the end, it turns out Leela was correct: after 40... Rxc6, Stockfish's evaluation also drops from +4.28 to 0.00 as it realizes that Leela has a forced stalemate.
Or this game also between Stockfish and Leela where Leela blunders into a forced mating sequence and doesn't even realize it for a few moves [1]?
Engines will presumably always play what they think is the "best" move, but clearly sometimes this "best" move is wrong. Evidently, this means differentiating "good" and "bad" moves is not always obvious.
> Yes it is objective, these thing called best move not without reason
If it's objective, then why is it possible for engines to disagree on whether a move is good or bad, as they do in the above example and others?
> to create better chess engine like what do even talking about here????
The ability to create better chess engines necessarily implies that chess engines can and do make mistakes, contrary to what you asserted.
> are you saying just because there are older bad engine that mean this thing is pointless ????
No. What I'm saying is that your explanation for why chess engines are better than humans is wrong. Chess engines are not better than humans because they have solved chess in the mathematical sense; chess engines are better than humans because they search the state space faster and more efficiently than humans (at least until you reach 7 pieces on the board).
> up until that point that you know high level chess is brute force games and therefore solvable math
"Solvable" and "solved" are two very different things. Chess is solvable, in theory. Chess is very far from being solved.
[0]: https://www.chess.com/computer-chess-championship#event=309&...
[1]: https://www.chess.com/computer-chess-championship#event=309&...
In a fascinating coincidence, there is a tonyhart7 on both chess.com and lichess, and they have been banned for cheating on both websites.
But now after we are past that and it has a lot of mind share, I'd say it's time to start tightening the bolts.
Rust's unwrap isn't the same as std::expected::value. The former panics - i.e. either aborts the program or unwinds depending on context and is generally not meant to be handled. The latter just throws an exception that is generally expected to be handled. Panics and exceptions use similar machinery (at least they can depending on compiler options) but they are not equivalent - for example nested panics in destructors always abort the program.
In code that isn't meant to crash, `unwrap` should be treated as a sign saying "I'm promising that this will never happen", but just like in C++, where you promise that pointers you dereference are valid and signed integers you add don't overflow, making promises like that is a necessary part of productive programming.
Panics aren't exceptions, any "panic" in Rust can be thought of as an abort of the process (Rust binaries have the explicit option to implement panics as aborts). Companies like Dropbox do exactly this in their similar Rust-based systems, so it wouldn't surprise me if Cloudflare does the same.
"Banning exceptions" wouldn't have done anything here, what you're looking for is "banning partial functions" (in the Haskell sense).
But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
But the bigger change is to make sure that config changes roll out gradually instead of all at once. That’s the source of 99% of all widespread outages
Another option is to make sure that config changes that fail to parse continue using the old config instead of resulting in an unusable service.
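A rough sketch of that "keep the last good config" idea, under stated assumptions: the types, the `MAX_FEATURES` value and the duplicate check are all hypothetical, since the real file format and limits aren't described in this thread.

```rust
use std::collections::HashSet;

/// Hypothetical limit; a placeholder, not the real system's value.
pub const MAX_FEATURES: usize = 200;

#[derive(Debug, Clone, PartialEq)]
pub struct FeatureConfig {
    pub features: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    TooManyFeatures { got: usize, max: usize },
    DuplicateFeature(String),
}

/// Validates a candidate config without touching the active one.
pub fn validate(candidate: &FeatureConfig) -> Result<(), ConfigError> {
    let got = candidate.features.len();
    if got > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures { got, max: MAX_FEATURES });
    }
    let mut seen = HashSet::new();
    for name in &candidate.features {
        if !seen.insert(name.as_str()) {
            return Err(ConfigError::DuplicateFeature(name.clone()));
        }
    }
    Ok(())
}

pub struct ConfigHolder {
    active: FeatureConfig,
}

impl ConfigHolder {
    pub fn new(initial: FeatureConfig) -> Self {
        Self { active: initial }
    }

    pub fn active(&self) -> &FeatureConfig {
        &self.active
    }

    /// Swaps in the candidate only if it validates. On failure the old
    /// config stays active and the error is returned so it can be alerted on.
    pub fn reload(&mut self, candidate: FeatureConfig) -> Result<(), ConfigError> {
        validate(&candidate)?;
        self.active = candidate;
        Ok(())
    }
}
```

Whether serving on a stale config is acceptable is a separate policy question; the sketch only shows that rejecting a bad file and crashing on it are separable decisions.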
But ultimately it’s not the panic that’s the problem but a failure to specify how panics within FL2 layers should be handled; each layer is at least one team and FL2’s job is providing a safe playground for everyone to safely coexist regardless of the misbehavior of any single component
But as always such failures are emblematic of multiple things going wrong at once. You probably want to end up using both catch_unwind for the typical case and the supervisor for the case where there’s a segfault in some unsafe code you call or native library you invoke.
I also mention the fundamental tension of do you want to fail open or closed. Most layers should probably fail open. Some layers (eg auth) it’s safer to fail closed.
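As a sketch of what "catch the panic at the layer boundary and make the decision intentional" could look like: `std::panic::catch_unwind` and `AssertUnwindSafe` are real standard-library items, while `Verdict`, `OnFailure` and `run_layer` are hypothetical names, not anything from FL2. It only works when panics unwind, i.e. not with `panic = "abort"`, and it does nothing for segfaults in unsafe or native code.

```rust
use std::panic::{self, AssertUnwindSafe};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Verdict {
    Allow,
    Deny,
}

/// Per-layer policy for what a crash should mean for the request.
#[derive(Debug, Clone, Copy)]
pub enum OnFailure {
    FailOpen,
    FailClosed,
}

/// Runs one layer's handler and turns a panic into an explicit policy
/// decision instead of taking the whole process down.
pub fn run_layer<F>(handler: F, policy: OnFailure) -> Verdict
where
    F: FnOnce() -> Verdict,
{
    match panic::catch_unwind(AssertUnwindSafe(handler)) {
        Ok(verdict) => verdict,
        // A real system would also log and alert here.
        Err(_) => match policy {
            OnFailure::FailOpen => Verdict::Allow,
            OnFailure::FailClosed => Verdict::Deny,
        },
    }
}
```

A bot-scoring layer would presumably be called with `OnFailure::FailOpen` and an auth layer with `OnFailure::FailClosed`, matching the split described above.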
OR even, the bot code crashing should itself be generating alerts.
Canary deployment would be automatically rolled back until P0 incident resolved.
All of this could probably have happened and contained at their scale in less than a minute as they would likely generate enough "omg the proxy cannot handle its config" alerts off of a deployment of 0.001% near immediately.
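A toy sketch of that staged rollout, with the usual caveats: the stage percentages and the `is_healthy` callback are placeholders for whatever deployment tooling and alerting actually exist, and nothing here reflects how Cloudflare really deploys.

```rust
/// Rolls a change out in growing stages and stops at the first unhealthy one.
/// `stages` are cumulative fleet percentages (hypothetical values);
/// `is_healthy` stands in for whatever alerting signal the operator trusts.
pub fn staged_rollout<F>(stages: &[f64], mut is_healthy: F) -> Result<(), f64>
where
    F: FnMut(f64) -> bool,
{
    for &percent in stages {
        // Deploy to `percent` of the fleet, then check health before widening.
        if !is_healthy(percent) {
            // The caller rolls back; Err carries the stage that failed.
            return Err(percent);
        }
    }
    Ok(())
}
```

With stages like `[0.001, 1.0, 10.0, 100.0]`, a config that crashes the proxy would stop at the first stage and never reach the rest of the fleet.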
The root cause here was that a file was mildly corrupt (with duplicate entries, I guess). And there was a validation check elsewhere that said "THIS FILE IS TOO BIG".
But if that's a validation failure, well, failing is correct? What wasn't correct was that the failure reached production. What should have happened is that the validation should have been a unified thing and whatever generated the file should have flagged it before it entered production.
And that's not an issue with function return value API management. The software that should have bailed was somewhere else entirely, and even there an unwrap explosion (in a smoke test or pre-release pass or whatever) would have been fine.
Ideally every validation should have a well-defined failure path. In the case of a config file rotation, validation failure of the new config could mean keeping the old config and logging a high-priority error message. In the case of malformed user-provided data, it might mean dropping the request and maybe logging it for security analysis reasons. In the case of "pi suddenly equals 4" checks the most logical approach might be to intentionally crash, as there's obviously something seriously wrong and application state has corrupted in such a way that any attempt to continue is only going to make things worse.
But in all cases there's a reason behind the post-validation-failure behavior. At a certain point leaving it up to "whatever happens on .unwrap() failure" isn't good enough anymore.
I don't like to use implicit unwrap. Even things that are guaranteed to be there, I treat as explicit (For example, (self.view?.isEnabled ?? false), in a view controller, instead of self.view.isEnabled).
I always redefine @IBOutlets from:
    @IBOutlet weak var someView!
to:
    @IBOutlet weak var someView?
I'm kind of a "belt & suspenders" type of guy.
In this particular case, I would rather crash. It’s easier to spot in a crash report and you get a nice stack trace.
Silent failure is ultimately terrible for users.
Note: for the things I control I try to very explicitly model state in such a way as I never need to force unwrap at all. But for things beyond my control like this situation, I would rather end the program than continue with a state of the world I don’t understand.
See my above/below comment.
A good tool for catching stuff during development, is the humble assert()[0]. We can use precondition()[1], to do the same thing, in ship code.
The main thing is, is to remain in control, as much as possible. Rather than let the PC leave the stack frame, throw the error immediately when it happens.
[0] https://docs.swift.org/swift-book/documentation/the-swift-pr...
[1] https://docs.swift.org/swift-book/documentation/the-swift-pr...
Agreed.
Unfortunately, crashes in iOS are “silent failures,” and are a loss of control.
What this practice does, is give me the option to handle the failure “noisily,” and in a controlled manner; even if just emitting a log entry, before calling a system failure. That can be quite helpful, in threading. Also, it gives me the option to have a valid value applied, if there’s a structural failure.
But the main reason that I do that with @IBOutlets, is that it forces me to acknowledge, throughout the rest of the code, that it’s an optional. I could always treat implicit optionals as if they were explicit, anyway. This just forces me to.
I have a bunch of practices that folks can laugh at, but my stuff works pretty effectively, and I sleep well.
Also, I have found App Store crash reports to be next to useless. TestFlight ones are a bit better.
But if I spend a lot of time, doing it right, the first time, we can avoid all kinds of heartbreak.
What have you found useless about the crash reports from the App Store? It would be really nice for it to have something like a breadcrumb capability, but typically the stack trace of the crash is sufficient to see what went wrong.
Crash early, crash often. Find the bugs and bad assumptions.
No it's not. Read my other comments.
ToMAYto, ToMAHto.
I have learned that it's a bad idea to trash other folks' methodologies without taking the time to understand why they do things, the way they do.
I have found dogma to be an impediment, in my own work. As I've gotten older, the sharp edges have been sanded off.
Have a great day!
Oh I am aware. They do it because
A) they don’t have a mental model of correct execution. Events just happen to them with a feeling of powerlessness. So rather than trying to form one they just litter the code with cases for things that might happen
> As I've gotten older, the sharp edges have been sanded off.
B) they have grown in bad organizations with bad incentives that penalize the appearance of making mistakes. So they learn to hide them.
For example there might be an initiative that rewards removing crashes in favor of silent error.
> Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
The file should be versioned and rollout of new versions should be staged.
(There is definitely a trade-off; often times in the security critical path, you want to go as fast as possible because changes may be blocking a malicious actor. But if you move too fast, you break things. Here, they had a potential poison input in the pathway for synchronizing this state and Murphy's Law suggests it was going to break eventually, so the question becomes "How much damage can we tolerate when it does?")
That feature file is generated every 5 minutes at all times; the change to permissions was rolled out gradually over the clickhouse cluster, and whether a bad version of that file was generated depended on whether the part of the cluster that had the bad permissions generated the file.
You might start with a basic timeline of what happened, then you'd start exploring: why did this change affect so many customers (this would be a line of questioning to find a potential root cause), why did it take so long to discover or recover (this might be multiple lines of questioning), etc.
The real issue is further up the chain where the malformed feature file got created and deployed without better checks.
I do not think that if the bot detection model inside your big web proxy has a configuration error it should panic and kill the entire proxy and take 20% of the internet with it. This is a system that should fail gracefully and it didn't.
> The real issue
Are there single "real issues" with systems this large? There are issues being created constantly (say, unwraps where there shouldn't be, assumptions about the consumers of the database schema) that only become apparent when they line up.
The thing I dislike most about Nginx is that if you are using it as a reverse proxy for like 20 containers and one of them is down, the whole web server will refuse to start up:
nginx: [emerg] host not found in upstream "my-app"
Obviously making 19 sites also unavailable just because one of them is caught in a crash loop isn't ideal. There is a workaround involving specifying variables, like so (non-Kubernetes example, regular Nginx web server running in a container, talking to other containers over an internal network, like Docker Compose or Docker Swarm):
    location / {
        resolver 127.0.0.11 valid=30s; # Docker DNS
        set $proxy_server my-app;
        proxy_pass http://$proxy_server:8080/;
        proxy_redirect default;
    }
Sadly, if you try to use that approach, then you just get:
    nginx: [emerg] "proxy_redirect default" cannot be used with "proxy_pass" directive with variables
Sadly, switching the redirect configuration away from the default makes some apps go into a redirect loop and fail to load: mostly legacy ones, where Firefox shows something along the lines of "The page isn't redirecting properly". It sucks especially badly if you can't change the software that you just need to run and suddenly your whole Nginx setup is brittle. Apache2 and Caddy don't have such an issue.
That's to say that all software out there has some really annoying failure modes, even if Nginx is pretty cool otherwise.
I'm really confused how so many people are finding it acceptable to bring down your entire reverse-proxy because the length of feature sets for the ML model in one of your components was longer than expected.
Also wonder with a sharded system why are they not slow rolling out changes and monitoring?
Maybe the validation code should've handled the larger size, but also the db query produced something invalid. That shouldn't have ever happened in the first place.
Agreed, that's also my takeaway.
I don't see the problem being "lazy programmers shouldn't have called .unwrap()". That's reductive. This is a complex system and complex system failures aren't monocausal.
The function in question could have returned a smarter error rather than panicking, but what then? An invariant was violated, and maybe this system, at this layer, isn't equipped to take any reasonable action in response to that invariant violation and dying _is_ the correct thing to do.
But maybe it could take smarter action. Maybe it could be restarted into a known good state. Maybe this service could be supervised by another system that would have propagated its failure back to the source of the problem, alerting operators that a file was being generated in such a way that violated consumer invariants. Basically, I'm describing a more Erlang model of failure.
Regardless, a system like this should be able to tolerate (or at least correctly propagate) a panic in response to an invariant violation.
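For what it's worth, the supervision idea can be sketched with nothing but `std::thread`: `JoinHandle::join` returns `Err` when the thread panicked, so a parent can notice the crash and restart the worker. The `supervise` function and its restart policy are hypothetical and far cruder than an Erlang supervisor.

```rust
use std::thread;

/// Runs `worker` in its own thread and restarts it if it panics, giving up
/// after `max_restarts` restarts. Returns the worker's value, or the number
/// of attempts made before giving up. `worker` receives the attempt index.
pub fn supervise<T, F>(worker: F, max_restarts: u32) -> Result<T, u32>
where
    T: Send + 'static,
    F: Fn(u32) -> T + Send + Clone + 'static,
{
    for attempt in 0..=max_restarts {
        let w = worker.clone();
        // `join` returns Err if the thread panicked.
        match thread::spawn(move || w(attempt)).join() {
            Ok(value) => return Ok(value),
            Err(_) => {
                // A real supervisor would alert operators and reset to a
                // known good state here before retrying.
                continue;
            }
        }
    }
    Err(max_restarts + 1)
}
```

The interesting part is the `Err` arm: that is where a failure could be propagated back toward whatever generated the bad input, instead of just crash-looping.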
The point of option is the crash path is more verbose and explicit than the crash-free path. It takes more code to check for NULL in C or nil in Go; it takes more code in Rust to not check for Err.
2. You’re getting confused by technology again. This isn’t about technology.
That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
> That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
Problem is, the enclosing function (`fetch_features`) returns a `Result`, so the `unwrap` on line #82 only serves as a shortcut a developer took due to assuming `features.append_with_names` would never fail. Instead, the routine likely should have worked within `Result`.
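To make the difference concrete, here is a minimal sketch. The names `fetch_features` and `append_with_names` are borrowed from the discussion, but the types, the limit, and the bodies are invented for illustration and are not the code from the post-mortem. Both functions return a `Result`; only one of them actually uses it for this failure:

```rust
/// Hypothetical limit on the number of features.
pub const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
pub enum FeatureError {
    TooManyFeatures { got: usize, max: usize },
}

#[derive(Debug, Default)]
pub struct Features {
    names: Vec<String>,
}

impl Features {
    /// Stand-in for the call discussed above: refuses to grow past the limit.
    pub fn append_with_names(&mut self, names: &[&str]) -> Result<(), FeatureError> {
        let total = self.names.len() + names.len();
        if total > MAX_FEATURES {
            return Err(FeatureError::TooManyFeatures { got: total, max: MAX_FEATURES });
        }
        self.names.extend(names.iter().map(|n| n.to_string()));
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.names.len()
    }
}

/// Shortcut version: an oversized input panics even though the
/// signature could have carried the error.
pub fn fetch_features_unwrap(names: &[&str]) -> Result<Features, FeatureError> {
    let mut features = Features::default();
    features.append_with_names(names).unwrap();
    Ok(features)
}

/// Propagating version: `?` hands the error to the caller, who decides what to do.
pub fn fetch_features(names: &[&str]) -> Result<Features, FeatureError> {
    let mut features = Features::default();
    features.append_with_names(names)?;
    Ok(features)
}
```

Whether the caller can then do anything smarter than dying is the part people are arguing about; the sketch only shows that the choice moves up a level.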
But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.
Maybe you're saying "it's better to be explicit", as a broad generalization I don't disagree with that.
But that has nothing to do with the actual bug here, which was that the invariant failed. How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.
Maybe the new config has a new update. Who knows? Do we want to keep operating on the old config? Maybe maybe not.
But operating on old config when you don't want to is definitely worse.
Crashing on a config update is usually only done if it could cause data corruption if the configs aren't in sync. That's obviously not the case here since the updates (although distributed in real time) are not coupled between hosts. Such systems usually are replicated state machines where config is totally ordered relative to other commands. Example: database schema and write operations (even here the way many databases are operated they don't strongly couple the two).
Crashing is generally better than behaving incorrectly due to stale configs. Because the problem would get fixed faster.
> But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.
I agree it is an error, but disagree that it should be a fatal error at that location. The reason being is the method defining the offending `unwrap` construct produces a `Result`, which is fully capable of representing any error `features.append_with_names` could produce.
> But that has nothing to do with the actual bug here, which was that the invariant failed.
The bug is that invoking `unwrap` crashed the process, to the degree that Cloudflare had a massive outage.
Had the logic been such that a `Result` representing this error condition activated an alternate workflow to handle the error (perhaps by logging it, emitting a notification event alerting SREs, transitioning into a failure mode, or all of these options), then a global outage might have been averted.
Which makes:
> How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.
Very relevant indeed.
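A rough sketch of what such an alternate workflow might look like, assuming a made-up `score_request` that can fail and a caller that logs, raises an alert string, and keeps serving traffic without a bot score. All names and the placeholder score are hypothetical:

```rust
/// Hypothetical limit on feature rows.
pub const MAX_ROWS: usize = 200;

#[derive(Debug, PartialEq)]
pub enum ScoreError {
    FeatureFileTooLarge { rows: usize, max: usize },
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    /// Normal path: the request got a bot score.
    Scored(u8),
    /// Failure mode: scoring is off, the request is still served,
    /// and an alert is emitted for operators.
    Degraded { alert: String },
}

/// Stand-in for the scoring step; fails when the feature file is too large.
pub fn score_request(feature_rows: usize) -> Result<u8, ScoreError> {
    if feature_rows > MAX_ROWS {
        return Err(ScoreError::FeatureFileTooLarge { rows: feature_rows, max: MAX_ROWS });
    }
    Ok(50) // placeholder score, not a real model output
}

/// Handles the `Result` explicitly instead of unwrapping it.
pub fn handle_request(feature_rows: usize) -> Decision {
    match score_request(feature_rows) {
        Ok(score) => Decision::Scored(score),
        Err(err) => {
            let alert = format!("bot scoring disabled: {:?}", err);
            eprintln!("{}", alert); // stand-in for real logging/paging
            Decision::Degraded { alert }
        }
    }
}
```

Whether serving traffic without a score is acceptable is a policy decision, and for some customers failing closed may be the right call. The point is only that the `Err` branch is where that decision gets made.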
If the `.unwrap()` was replaced with `.expect("Feature config is too large!")` it would certainly make the outage shorter.
It wouldn't, not meaningfully. The outage was caused by a change in how they processed the queries. They had no way to observe the changes, nor canaries to see that the change was killing them. Plus, they would still need to manually feed and restart services that ingested bad configs.
`expect` would shave a few minutes; you would still spend hours figuring out and fixing it.
Granted, using expect is better, but it's not a silver bullet.
Or the good old:
let x = match res {
    Ok(x) => x,
    Err(_) => unreachable!(),
}
1: ?
2: map_err, or, or_else, etc.
3: match ... {
       Ok(..) => {},
       Err(..) => {},
   }
4: if let ... {
   }
Then it would have been idiomatic Rust code and wouldn't have failed at all.
The function signature returned a `Result<(), (ErrorFlags, i32)>`
Seems like it should have returned an Err((ErrorFlags, i32)) here. Case 2 or 3 above would have done nicely.
Removing unwrap() from Rust would have forced the proper handling of the function call and would have prevented this.
Unwrap() is Rust's original sin.
How so? “Parse, don’t validate” implies converting input into typed values that prevent representation of invalid state. But the parsing still needs to be done correctly. An unchecked unwrap really has nothing to do with this.
The config bug reaching prod without this being caught and pinpointed immediately is the strange part.
And it took over an hour from when the problem started until my sites went down. That is just crazy.
> This wasn't an attack, but a classic chain reaction triggered by “hidden assumptions + configuration chains” — permission changes exposed underlying tables, doubling the number of lines in the generated feature file. This exceeded FL2's memory preset, ultimately pushing the core proxy into panic.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
> Technical details: Even handling the unwrap correctly, an OOM would still occur. The primary issue was the lack of contract validation in feature ingest. The configuration system requires “bad → reject, keep last-known-good” logic.
> Why did it persist so long? The global kill switch was inadequate, preventing rapid circuit-breaking. Early suspicion of an attack also caused delays.
> Why not roll back software versions or restart?
> Rollback isn't feasible because this isn't a code issue—it's a continuously propagating bad configuration. Without version control or a kill switch, restarting would only cause all nodes to load the bad config faster and accelerate crashes.
> Why not roll back the configuration?
> Configuration lacks versioning and functions more like a continuously updated feed. As long as the ClickHouse pipeline remains active, manually rolling back would result in new corrupted files being regenerated within minutes, overwriting any fixes.
* For clarity, I am aware that the original tweets are written in Chinese, and they still have the stench of LLM writing all over them; it's not just the translation provided in the above comment.
> classic chain reaction triggered by “hidden assumptions + configuration chains”
"Classic/typical "x + y"", particularly when diagnosing an issue. This one is a really easy tell because humans, on aggregate, do not use quotation marks like this. There is absolutely no reason to quote these words here, and yet LLMs will do a combined quoted "x + y" where a human would simply write something natural like "hidden assumptions and configuration chains" without extraneous quotes.
> The configuration system requires “bad → reject, keep last-known-good” logic.
Another pattern with overeager usage of quotes is this ""x → y, z"" construct with very terse wording.
> This wasn't an attack, but a classic chain reaction
LLMs aggressively use "Not X, but Y". This is also a construct commonly used by humans, of course, but aside from often being paired with an em-dash, another tell is whether it actually contributes anything to the sentence. "Not X, but Y" is strongly contrasting and can add a dramatic flair to the thing being contrasted, but LLMs overuse it on things that really really don't need to be dramatised or contrasted.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
Two lists of three concepts back-to-back. LLMs enjoy, love, and adore this construct.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
This kind of completely vapid, feel-good word soup utilising a heroic analogy for something relatively mundane is another tell.
And more broadly speaking, there's a sort of verbosity and emptiness of actual meaning that permeates through most LLM writing. This reads absolutely nothing like what an engineer breaking down an outage looks like. Like, the aforementioned line of... "Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.". What is that actually communicating to you? It piles on technical lingo and high-level concepts in a way that is grammatically correct but contains no useful information for the reader.
Bad writing exists, of course. There's plenty of bad writing out there on the internet, and some of it will suffer from flaws like these even when written by a human, and some humans do like their em-dashes. But it's generally pretty obvious when the writing is taken on aggregate and you see recognisable pattern after pattern combined with em-dashes combined with shallowness of meaning combined with unnecessary overdramatisations.
Like GP I'm not very good at spotting these patterns yet, so explicit real-world examples go a long way.
1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.
2. Despite the elevated risk of system wide rapid config propagation, it took them 2 hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. Here was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but didn’t quite have the visibility and rapid rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
Once every 5m is not "rapidly". It isn't uncommon for configuration systems to do it every few seconds [0].
> While it’s certainly useful to examine the root cause in the code.
I believe the issue is just as much that the output of a periodic run (a ClickHouse query) was altered by a change that was, on the surface, unrelated, and that this caused the failure. That is, the system that validated the configuration (FL2) was different from the one that generated it (ML Bot Management DB).
Ideally, it is the system that vends a complex configuration that also vends & tests the library to consume it, or the system that consumes it, does so as if it was "tasting" the configuration first before devouring it unconditionally [1].
Of course, as with all distributed system failures, this is all easier said and done in hindsight.
[0] Avoiding overload in distributed systems by putting the smaller service in control (pg 4), https://d1.awsstatic.com/builderslibrary/pdfs/Avoiding%20ove...
[1] Lessons from CloudFront (2016), https://youtube.com/watch?v=n8qQGLJeUYA&t=1050
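A minimal sketch of the "taste before devouring" idea, under the assumption that a config is just a list of feature names and that the consumer owns the validation rules. `ConfigSlot`, `validate`, `RejectReason`, and the limit are all invented for illustration. A candidate is only swapped in if it passes; otherwise the last known-good config stays active and the rejection is counted:

```rust
use std::collections::HashSet;

/// Hypothetical limit enforced by the consumer.
pub const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
pub enum RejectReason {
    Empty,
    TooManyFeatures(usize),
    DuplicateFeature(String),
}

/// "Tastes" a candidate config without applying it.
pub fn validate(candidate: &[String]) -> Result<(), RejectReason> {
    if candidate.is_empty() {
        return Err(RejectReason::Empty);
    }
    if candidate.len() > MAX_FEATURES {
        return Err(RejectReason::TooManyFeatures(candidate.len()));
    }
    let mut seen = HashSet::new();
    for name in candidate {
        // `insert` returns false when the name was already present.
        if !seen.insert(name.as_str()) {
            return Err(RejectReason::DuplicateFeature(name.clone()));
        }
    }
    Ok(())
}

/// Holds the active config; the initial value is assumed to be known-good.
pub struct ConfigSlot {
    active: Vec<String>,
    rejected: u64,
}

impl ConfigSlot {
    pub fn new(initial: Vec<String>) -> Self {
        ConfigSlot { active: initial, rejected: 0 }
    }

    /// Applies the candidate only if it validates; otherwise keeps the
    /// current config and records the rejection for monitoring.
    pub fn offer(&mut self, candidate: Vec<String>) -> Result<(), RejectReason> {
        match validate(&candidate) {
            Ok(()) => {
                self.active = candidate;
                Ok(())
            }
            Err(reason) => {
                self.rejected += 1;
                Err(reason)
            }
        }
    }

    pub fn active(&self) -> &[String] {
        &self.active
    }

    pub fn rejected(&self) -> u64 {
        self.rejected
    }
}
```

The `rejected` counter matters as much as the fallback: running on a stale config silently is its own failure mode, so rejections need to be loud in monitoring.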
Isn't rapidly more of how long it takes to get from A to Z rather than how often it is performed? You can push out a configuration update every fortnight but if it goes through all of your global servers in three seconds, I'd call it quite rapid.
In a productive way, this view also shifts the focus to improving the system (visibility etc), empowering the team, rather than focusing on the code which broke (probably strikes fear in the individuals, to do anything!)
I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.
As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
Of course, this is all so easy to say after the fact..
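A back-of-the-envelope sketch of a configurable ramp, assuming each host decides locally from a stable hash of its own name. The function names and the bucket count are made up. A one-hour ramp is the routine default in this picture, and a ramp of zero means "everywhere now" for the attack-response case:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

/// Fraction of the fleet (0.0 to 1.0) that should have the change after `elapsed`.
/// A zero-length ramp means the change applies everywhere immediately.
pub fn rollout_fraction(elapsed: Duration, ramp: Duration) -> f64 {
    if ramp.as_secs_f64() == 0.0 {
        return 1.0;
    }
    (elapsed.as_secs_f64() / ramp.as_secs_f64()).min(1.0)
}

/// Stable bucket in 0..10_000 for a host name.
/// Assumption: `DefaultHasher` is good enough for a sketch; a real fleet would
/// want a hash whose output is guaranteed stable across builds and versions.
pub fn host_bucket(host: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    host.hash(&mut hasher);
    hasher.finish() % 10_000
}

/// A host applies the change once the ramp has passed its bucket,
/// so the set of updated hosts only ever grows.
pub fn should_apply(host: &str, elapsed: Duration, ramp: Duration) -> bool {
    let threshold = (rollout_fraction(elapsed, ramp) * 10_000.0) as u64;
    host_bucket(host) < threshold
}
```

With `ramp = Duration::from_secs(3600)`, half the fleet is covered after 30 minutes, which leaves time for an error-rate signal from the early hosts to halt the rollout.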
> In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:
https://developers.cloudflare.com/bots/get-started/bot-manag...
> Unrelated to this incident, we were and are currently migrating our customer traffic to a new version of our proxy service, internally known as FL2. Both versions were affected by the issue, although the impact observed was different.
> Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero. Customers that had rules deployed to block bots would have seen large numbers of false positives. Customers who were not using our bot score in their rules did not see any impact.
If every time there's a new bot someone needs to write code that can blow up their whole service, maybe they need to iterate a bit on this design?
In fact, the root bug (faulty assumption?) was in one or more SQL catalog queries that were presumably written some time ago.
(Interestingly the analysis doesn’t go into how these erroneous queries made it into production OR whether the assumption was “to spec” and it’s the security principal change work that was faulty. Seems more likely to be the former.)
Generally I would say we as an industry are more nonchalant about config changes vs binary changes. Where an org might have great processes and systems in place for binary rollouts, the whole fleet could be reading config from a database in a much more lax fashion. Those systems are quite risky actually.
Even only in CF’s “critical path” there must be dozens of interconnected services and systems. How do you close the loop between an observed panic at the edge and a database configuration change N systems upstream?
And yes, there is a lint you can use against slicing ('indexing_slicing') and it's absolutely wild that it's not on by default in clippy.
[lints.clippy]
dbg_macro = "deny"
unwrap_used = "deny"
expect_used = "deny"
This is sobering.
My new fear is some dependency unwrap()ing or expect()ing something where they didn't prove the correctness.
Unwrap() and expect() are an anti-pattern and have no place in idiomatic Rust code. The language should move to deprecate them.
Perhaps it needs a scarier name, like "assume_ok".
This lets me do logging at minimum. Sometimes I can gracefully degrade. I try to be elegant in failure as possible, but not to the point where I wouldn't be able to detect errors or would enter a bad state.
That said, I am totally fine with your use case in your application. You're probably making sane choices for your problem. It should be on each organization to decide what the appropriate level of granularity is for each solution.
My worry is that this runtime panic behavior has unwittingly seeped into library code that is beyond our ability and scope to observe. Or that an organization sets a policy, but that the tools don't allow for rigid enforcement.
The handler could log the error and then panic. Much better than chasing bad hunches about a DDoS.
If you're using Result<T,E>, there's no automatic language feature for statically typing a nested E that mirrors how it was called.
So out of brevity, they unwrap.
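For reference, the usual manual workaround is an error enum with `From` impls, which lets `?` do the conversion at each call site. This doesn't contradict the point above, since the nesting still has to be written by hand; it's just a sketch of the non-unwrap route. `LoadError`, `LimitError`, and `load_feature_count` are hypothetical names:

```rust
use std::num::ParseIntError;

#[derive(Debug, PartialEq)]
pub enum LimitError {
    TooLarge { got: usize, max: usize },
}

/// Outer error that records which inner step failed.
#[derive(Debug, PartialEq)]
pub enum LoadError {
    Parse(ParseIntError),
    Limit(LimitError),
}

impl From<ParseIntError> for LoadError {
    fn from(err: ParseIntError) -> Self {
        LoadError::Parse(err)
    }
}

impl From<LimitError> for LoadError {
    fn from(err: LimitError) -> Self {
        LoadError::Limit(err)
    }
}

fn check_limit(count: usize, max: usize) -> Result<usize, LimitError> {
    if count > max {
        Err(LimitError::TooLarge { got: count, max })
    } else {
        Ok(count)
    }
}

/// Each `?` converts the inner error into `LoadError` via the `From` impls,
/// so no `unwrap` is needed to keep the function short.
pub fn load_feature_count(raw: &str, max: usize) -> Result<usize, LoadError> {
    let count: usize = raw.trim().parse()?;
    let count = check_limit(count, max)?;
    Ok(count)
}
```

Crates exist that generate this boilerplate, but the hand-written version above needs nothing beyond the standard library.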
Expect to see this sort of error crop up a lot as people use LLMs to vibe with the borrow checker.
As the user, I can't tell the difference, but it might have sped up their recovery a bit.
I imagine it would also require less time debugging a panic. That kind of breadcrumb trail in your logs is a gift to the future engineer and also customers who see a shorter period of downtime.
I'm sure that there are misapplied guidelines to do that instead of being nice to incoming bot management configuration files, and someone might have been scolded (or worse) for proposing or attempting to handle them more safely.
Both are important, and I am pretty sure, that someone is gonna fix that line of code pretty soon.
How can you write the proxy without handling the config containing more than the maximum features limit you set yourself?
How can the database export query not have a limit set if there is a hard limit on number of features?
Why do they do non-critical changes in production before testing in a stage environment?
Why did they think this was a cyberattack and only after two hours realize it was the config file?
Why are they that afraid of a botnet? Does not leave me confident that they will handle the next Aisuru attack.
I'm migrating my customers off Cloudflare. I don't think they can swallow the next botnet attacks and everyone on Cloudflare go down with the ship, so it will be safer to not be behind Cloudflare when it hits.
But the case for Cloudflare here is complicated. Every engineer is very free to make a better system though.
Cloudflare builds a global scale system, not an iphone app. Please act like it.
There will always be bugs in code, even simple code, and sometimes those things don't get caught before they cause significant trouble.
The failing here was not having a quick rollback option, or having it and not hitting the button soon enough (even if they thought the problem was probably something else, I think my paranoia about my own code quality is such that I would have been rolling back much sooner just in case I was wrong about the “something else”).
It goes over my head why Cloudflare is HN's darling while others like Google, Microsoft and AWS don't usually enjoy the same treatment.
Do the others you mentioned provide such detailed outage reports, within 24 hours of an incident? I’ve never seen others share the actual code that related to the incident.
Or the CEO or CTO replying to comments here?
>Press Release
This is not press release, they always did these outage posts from the start of the company.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
Azure (albeit pretty old): https://devblogs.microsoft.com/devopsservice/?p=17665
AWS: https://aws.amazon.com/message/101925/
GCP: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...
The code sample might as well be COBOL for people not familiar with Rust and its error handling semantics.
> Or the CEO or CTO replying to comments here?
I've looked around the thread and I haven't seen the CTO here nor the CEO, probably I'm not familiar with their usernames and that's on me.
> This is not press release, they always did these outage posts from the start of the company.
My mistake calling them press releases. Newspapers and online publications also skim this outage report to inform their news stories.
I wasn't clear enough on my previous comment. I'd like all major players in the internet and web infrastructure to be held to higher standards. As it stands, when it comes to them or the tech department of a retail store, the retail store must answer to more laws when the surface area of combined activities is taken into account.
Yes, Cloudflare excels where others don't or barely bother and I too enjoyed the pretty graphs, diagrams and I've learned some nifty Rust tricks.
EDIT: I've removed some unwarranted snark from my comment which I apologize for.
Every system has a non-reducible risk and no data rollback is trivial, especially for a CDN.
I guess the noncritical change here was the change to the database? My experience has been a lot of teams do a poor job having a faithful replica of databases in stage environments to expose this type of issue.
Permissions stuff might be caught without a completely faithful replica, but there are always going to be attributes of the system that only exist in prod.
Is that an overreaction?
Name me global, redundant systems that have not (yet) failed.
And if you used cloudflare to protect against botnet and now go off cloudflare... you are vulnerable and may experience more downtime if you cannot swallow the traffic.
I mean no service have 100% uptime - just that some have more nines than others.
Whatever you do, unless you have their bandwidth capacity, at some point those "self-hosted" will get flooded with traffic.
The fact that Cloudflare can literally read every bit of communication (as it sits between the client and your server) is already plenty bad. And yet, we accept this more easily than a bit of downtime. We shall not ask about the prices for that service ;)
To me it's nothing more than the whole "everybody on the cloud" issue, when most do not need the resources that cloud companies like AWS provide (and the bill), and yet get totally tied down to this one service.
I am getting old lol ...
What is the cost of many-9s uptime from Cloudflare? For DDoS protection it is $0/month on their free tier:
It's free as long as you really are small, not worth milking. The moment you can afford to run your own mini DC at your office, you start to enter "well, hello there" territory for CF.
As someone who has run (and is running) a DC with all the electrical/UPS, cooling, piping, HVAC+D stuff to deal with: it can be a lot of just time/overhead.
Especially if you don't have a number of folks in-house to deal with all that 'non-IT' equipment (I'm a bit strange in that I have an interest in both IT and HVAC-y stuff).
The bandwidth costs of a ddos alone would close down a small shop.
Cloudflare provide an incredibly good service with a great track record, and sometimes shit happens.
What would some good examples of those be? I think something like Anubis is mostly against bot scraping, not sure how you'd mitigate a DDoS attack well with self-hosted infra if you don't have a lot of resources?
On that note, what would be a good self-hosted WAF? I recall using mod_security with Apache and the OWASP ruleset, apparently the Nginx version worked a bit slower (e.g. https://www.litespeedtech.com/benchmarks/modsecurity-apache-... ), there was also the Coraza project but I haven't heard much about it https://coraza.io/ or maybe the people who say that running a WAF isn't strictly necessary also have a point (depending on the particular attack surface).
Genuine questions.
There is haproxy-protection, which I believe is the basis of Kiwiflare. Clients making new connections have to solve a proof-of-work challenge that takes about 3 seconds of compute time.
Enterprise: https://www.haproxy.com/solutions/ddos-protection-and-rate-l...
How do they magically manage a DDOS larger than their bandwidth?
If the plan is to have larger bandwidth than any DDOS it is going to be expensive, quickly.
If you're just renting servers instead, you have a few options that are effectively closer to a 1% commit, but better have a plan B for when your upstreams drop you if the incoming attack traffic starts disrupting other customers - see Neoprotect having to shut down their service last month.
But at the same time, what value do they add if they:
* Took down the customers' sites due to their bug.
* Never protected against an attack that our infra could not have handled by itself.
* Don't think that they will be able to handle the "next big ddos" attack.
It's just an extra layer of complexity for us. I'm sure there are attacks that they could help our customers with, that's why we're using them in the first place. But until the customers are hit with multiple ddos attacks that we can not handle ourselves, it's just not worth it.
That is always a risk with using a 3rd party service, or even adding extra locally managed moving parts. We use them in DayJob, and despite this huge issue and the number of much smaller ones we've experienced over the last few years their reliability has been pretty darn good (at least as good as the Azure infrastructure we have their services sat in front of).
> • Never protected against an attack that our infra could not have handled by itself.
But what about the next one… Obviously this is a question sensitive to many factors in our risk profiles and attitudes to that risk, there is no one right answer to the “but is it worth it?” question here.
On a slightly facetious point: if something malicious does happen to your infrastructure, that it does not cope well with, you won't have the “everyone else is down too” shield :) [only slightly facetious because while some of our clients are asking for a full report including justification for continued use of CF and any other 3rd parties, which is their right both morally and as written in our contracts, most, especially those who had locally managed services affected, have taken the “yeah, half our other stuff was affected too, what can you do?” viewpoint].
> • Don't think that they will be able to handle the "next big ddos" attack.
It is a war of attrition. At some point a new technique, or just a new botnet significantly larger than those seen before, will come along that they might not be able to deflect quickly. I'd be concerned if they were conceited enough not to be concerned about that possibility. Any new player is likely to practise on smaller targets first before directly attacking CF (in fact I assume that it is rather rare that CF is attacked directly) or a large enough segment of their clients to cause them specific issues. Could your infrastructure do any better if you happen to be chosen as one of those earlier targets?
Again, I don't know your risk profile so can't say which is the right answer, if there even is an easy one other than “not thinking about it at all” being a truly wrong answer. Also DDoS protection is not the only service many use CF for, so those need to be considered too if you aren't using them for that one thing.
I do like the flat cost of Cloudflare and feature set better but they have quite a few outages compared to other large vendors--especially with Access (their zero trust product)
I'd lump them into GitHub levels of reliability
We had a comparable but slightly higher quote from an Akamai VAR.
They explain that at some length in TFA.
That's often the case with human error as especially aviation safety experts know: https://en.wikipedia.org/wiki/Swiss_cheese_model
Any big and noticeable incident is one of the "we failed on so many levels here" kind, by definition.
Isn’t getting cyberattacked their core business?
Yet you omit to acknowledge that the remaining 99.99999% logic written that powers Cloudflare works flawlessly.
Also, hindsight is 20/20
A system that is 99.99999% flawless, can still be unusable.
optimism bias: 100/100
Having an unprivileged application querying system.columns to infer the table layout is just bad; Not having a proper, well-defined table structure indicates sloppiness in the overall schema design, especially if it changes quickly. Considering specifically clickhouse, and even if this approach would be a good idea, the unprivileged way of doing it would be "DESCRIBE TABLE <name>", NOT iterating system.columns. The gist of it - sloppy design not even well implemented.
Having a critical application issuing ad-hoc commands to system.* tablespace instead of using a well-tested library is just amateurism, and again - bad engineering; IMO it is good practice to consider all system.* privileged applications and ensure their querying is completely separate from your application logic; Sometimes some system tables change, and fields are added and/or removed - not planning for this will basically make future compatibility a nightmare.
Not only the problematic query itself, but the whole context of this screams "lack of proper application design" and devs not knowing how to use the product and/or read the documentation. Granted, this is a bit "close to home" for me, because I use ClickHouse extensively (at a scale - I'm assuming - several orders of magnitude smaller than CloudFlare) and I have spent a lot of time designing specifically to avoid at least some of these kind of mistakes. But, if I can do it at my scale, why aren't they doing it?
The database issue screamed at me: lack of expertise. I don't use CH, but seeing someone mess with a production system and then be surprised ("Oh, it does that?") is really bad. And this is obviously not knowledge that is hard to acquire, buried deep in a manual, or an edge case only discoverable by reading source code; it's bread-and-butter knowledge you should have.
What is confusing is that they didn't add this to their follow-up steps. Giving them the benefit of the doubt, I'd assume they didn't want to put something very basic out there as a reason, just to protect the people behind it from widespread blame. But if that's not the case, then it's a general problem. Sadly it's not uncommon that components like databases are dealt with on a low-effort basis: just a thing we plug in and it works. But it's obviously not.
It's not that many things had to fail, it's that many things that are obvious haven't been done. It would be a valid excuse if many "exotic" scenarios would have to align, not when it's obvious error cases that weren't handled and changes have not been tested.
While having wrong first assumptions is just how things work when you try to analyze the issue[1], not testing changes before production is just stupidity and nothing else.
The story would be different if eg. multiple unlikely, hard to track things happened at once without anyone making a clearly linkable event, something that would also happen in staging. Most of the things mentioned could essentially be statically checked. This is the prime example of what you want as any tech person, because it's not hard to prevent compared to a lot of scenarios where you deal with balancing likelihoods of scenarios, timings, etc.
You don't think someone is a great plumber, because they forgot their tools and missed that big hole in the pipe and also rang at the wrong door, because all these things failed. You think someone is a good plumber if they said they would have to go back to fetch a bulky specialized tool, because this is the rare case in which they need it, but they could also do this other thing in this specific case. They are great plumbers if they tell you how this happened in first place and how to fix it. They are great plumbers if they manage to fix something outside of their usual scope.
Here pretty much all of the things that you pay them for failed. At a large scale.
I am sure there are reasons for this which we don't know about, and I hope that CloudFlare can fix them. Be it management focusing on the wrong things, be it developers not being in the right position or annoyed enough to care, or something else entirely. However, not doing these things is (likely) a sign that currently they are not in the state of creating reliable systems - at least none reliable enough for what they are doing. It would be perfectly fine if they ran a web shop or something, but if, as experienced, many other companies rely on you being up or their stuff fails, then maybe you should not run a company with products like "Always Online".
[1] And should make you adapt the process of analyzing issues. Eg. making sure config changes are "very loud" in monitoring. It's one of the most easily tracked thing that can go wrong, and can relatively easily be mapped to a point in time compared to many other things.
Let those who have never written a bug before cast the first stone.
Reminds me of House of Dynamite, the movie about nuclear apocalypse that really revolves around these very human factors. This outage is a perfect example of why relying on anything humans have built is risky, which includes the entire nuclear apparatus. “I don’t understand why X wasn’t built in such a way that wouldn’t mean we live in an underground bunker now” is the sentence that comes to mind.
I guess you are right, likely a social issue, but certainly not a single exhausted parent.
The new config file was not (AIUI) invalid (syntax-wise) but rather too big:
> […] That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
"To make error is human. To propagate error to all server in automatic way is #devops"
This saying dates back to 1969: To err is human but to really foul things up requires a computer.
* https://quoteinvestigator.com/2010/12/07/foul-computer/
Also: I know there’s a proverb which says ‘To err is human,’ but a human error is nothing to what a computer can do if it tries.
If this was a routine config change, I could see how it could take 2 hours to start the remediation plan. However they should have dashboards that correlate config setting changes with 500 errors (or equivalent). It gets difficult when you have many of these going out at the same time and they are slowly rolled out.
The root cause document is mostly high level and for the public. The details on this specific outage will be in an internal document with many action items, some of them maybe quarter-long projects, including fixing this specific bug and maybe some linter/monitor to prevent it from happening again.
That and why the hell wasn't their alerting showing up colossal amount of panics in their bot manager thing?
This is also a pretty good example why having stack traces by default is great. That error could have been immediately understood just from a stack trace and a basic exception message.
This is the danger of automated control systems. If they get hacked or somehow push out bad things (CrowdStrike), they will have complete control and be very efficient.
Rust compiler is a god of sorts, or at least a law of nature haha
Way to comment and go instantly off topic
Them calling unwrap on a limit check is the real issue imo. Everything that takes in external input should assume it is bad input and should be fuzz tested imo.
In the end, what is the point of having a limit check if you are just unwrapping on it
Using the question mark operator [1] and even adding in some anyhow::Context goes a long way to being able to fail fast and return an Err rather than panicking.
Sure you need to handle Results all the way up the stack but it forces you to think about how those nested parts of your app will fail as you travel back up the stack.
[1]: https://doc.rust-lang.org/rust-by-example/std/result/questio...
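To make that concrete, here is a minimal sketch using only std. Everything in it is an assumption for illustration: `ConfigError`, `check_limit` and `load_features` are made-up names, 200 is just the number quoted elsewhere in this thread, and the `Context` variant is a hand-rolled stand-in for what anyhow's `.context(...)` would give you. It is not Cloudflare's code.

```rust
use std::fmt;

// Hypothetical limit: 200 is the number quoted in this thread, nothing more.
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    TooManyFeatures { got: usize, max: usize },
    // Hand-rolled stand-in for what anyhow's `.context(...)` would attach.
    Context { what: &'static str, source: Box<ConfigError> },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::TooManyFeatures { got, max } => {
                write!(f, "too many features: got {}, max {}", got, max)
            }
            ConfigError::Context { what, source } => write!(f, "{}: {}", what, source),
        }
    }
}

impl std::error::Error for ConfigError {}

// The limit check itself: reports the violation as a value, never panics.
fn check_limit(names: &[String]) -> Result<(), ConfigError> {
    if names.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures { got: names.len(), max: MAX_FEATURES });
    }
    Ok(())
}

// Returns Err to the caller instead of panicking when the limit is exceeded.
pub fn load_features(names: Vec<String>) -> Result<Vec<String>, ConfigError> {
    check_limit(&names).map_err(|e| ConfigError::Context {
        what: "loading bot feature file",
        source: Box::new(e),
    })?; // `?` propagates the wrapped error up the stack
    Ok(names)
}
```

The caller still has to decide what to do with the Err, but at least that decision is now visible in the function signature.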
Average Go code has far fewer panics than Rust has unwraps, which are functionally equivalent.
I'd prefer a loud crash over that.
In the original PHP code, all worked, only it didn't properly check for bots.
The new Rust code did a loud crash and took off half of the internet.
It's not in the type system, but it's idiomatic
This .unwrap() sounds too easy for what it does, certainly much easier than having an entire try..catch block with an explicit panic. Full disclosure: I don't actually know Rust.
Any project has to reason about what sort of errors can be tolerated gracefully and which cannot. Unwrap is reasonable in scenarios you expect to never be reached, because otherwise your code will be full of all sorts of possible permutations and paths that are harder to reason about and may cascade into extremely nuanced or subtle errors.
Rust also has a version of unwrap called "expect" where you provide a string that logs why the unwrap occurred. It's similar, but for pieces of code that are crucial it could be a good idea to require all 'unwraps' to instead be 'expects' so that people at least are forced to write down a reason why they believe the unwrap can never be reached.
While there are certainly many things to admire about Rust, this is why I prefer Golang's "noisy" error handling. In golang that would be either:
feature_values, err := features.append_with_names(...)
And the compiler would have complained that this value of `err` was unused; or you'd write: feature_values, _ := features.append_with_names(...)
And it would be far more obvious that an error message is being ignored. (Renaming `unwrap` to `unwrapOrPanic` would probably help too.)
fvs, err := features.AppendWithNames(..)
if err != nil {
    // this will NEVER break
    panic(err)
}
Ultimately I don't think language design can be the sole line of defence against system failures; it can only guide developers to think about error cases.
People's biggest complaints about golang's errors:
1. You have to _TYPE_OUT_ what to do on EVERY.SINGLE.ERROR. SOO BOORING!
2. They clutter up the code and make it look ugly.
Rust is so much cleaner and more convenient (they say)! Just add ?, or .unwrap()!
Well, with ".unwrap()", you can type it fast enough that you're on to the next problem before it occurs to your brain to think about what to do if there is an error. Whereas, in golang, by the time you type in, "if err != nil {", you've broken the flow enough that now you're much more likely to be thinking, "Hmm, could this ever fail? What should we do if it does?" That break in flow is annoying, but necessary.
And ".unwrap()" looks so unassuming, it's easy to overlook on review; that "panic()" looks a lot more dangerous, and again, would be more likely to trigger a reviewer into thinking, "Wait, is it OK if this thing panics? Is this really so unlikely to happen?"
Renaming it `.unwrap_or_panic()` would probably help with both.
1. Culturally, using `unwrap` is an omerta to Rust developers in the same way `panic` is an omerta to Go devs;
2. In the Rust projects I've seen there is usually a linter rule forbidding `unwrap` so you can't use it in production
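For anyone wondering what such a linter rule looks like: `clippy::unwrap_used` is a real Clippy lint that is allowed by default (there is a sibling `clippy::expect_used`). A small sketch, where `parse_limit` is a made-up function and the attribute is placed on one item only to keep the example short; projects usually enable it crate-wide.

```rust
// Plain rustc accepts this attribute but ignores it; `cargo clippy` enforces it.
#[deny(clippy::unwrap_used)]
pub fn parse_limit(raw: &str) -> Result<usize, std::num::ParseIntError> {
    // raw.trim().parse::<usize>().unwrap()   // <- Clippy would reject this line
    raw.trim().parse::<usize>() // hand the error to the caller instead
}
```

With the lint on, getting an `unwrap` into this function requires an explicit `#[allow(...)]`, which is much harder to miss in review.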
Unfortunately none of the meanings Wikipedia knows [1] seems to fit this usage. Did you perhaps mean "taboo"?
I disagree that "unwrap()" seems as scary as "panic()", but I will certainly agree that sibling commenters have a point when they say that "bar, _ := foo()" is a lot less scary than "unwrap()".
- it's literally written out that you're assuming it to be Ok
- there are no indications that the `_` is an error: it could very well be some other return value from the function. in your example, it could be the number of appended features, etc
That's why Go's error handling is indeed noisy: it's noise and you reduce noise by not handling errors. Rust's is terse yet verbose: if you add stuff it's because you're doing something wrong. You explicitly spelled out the error is being ignored.
Haven't used Go so maybe I'm missing some consideration, but I don't see how ", _" is more obvious than ".unwrap()". If anything it seems less clear, since you need to check/know the function's signature to see that it's an error being ignored (wouldn't be the case for a function like https://pkg.go.dev/math#Modf).
It may be that forcing handling at every call tends to make code verbose, and devs desensitized to bad practice. And the diagnostic Rust provided seems pretty garbage.
There is bad practice here too -- config failure manifesting as request failure, lack of failing to safe, unsafe rollout, lack of observability.
Back to language design & error handling. My informed view is that robustness is best when only major reliability boundaries need to be coded.
This is the "throw, don't catch" principle with the addition of catches on key reliability boundaries -- typically high-level interactions where you can meaningfully answer a failure.
For example, this system could have a total of three catch clauses: "Error Loading Config" which fails to safe, "Error Handling Request" which answers 5xx, and "Socket Error" which closes the HTTP connection.
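A rough sketch of what those three boundaries could look like when translated to Rust's Result style. All names (`Failure`, `Action`, `handle_request`, `boundary`) and the score threshold of 30 are invented for illustration; the point is only that inner code propagates with `?` and a single place decides what each kind of failure means.

```rust
#[derive(Debug, PartialEq)]
pub enum Failure {
    Config(String),
    Request(String),
    Socket(String),
}

#[derive(Debug, PartialEq)]
pub enum Action {
    KeepPreviousConfig, // "Error Loading Config": fail to safe
    Respond(u16),       // normal answer, or 5xx for "Error Handling Request"
    CloseConnection,    // "Socket Error"
}

// Inner code only propagates; it never decides what a failure means.
fn parse_score(raw: &str) -> Result<u8, Failure> {
    raw.parse::<u8>()
        .map_err(|e| Failure::Request(format!("bad score {:?}: {}", raw, e)))
}

pub fn handle_request(raw_score: &str) -> Result<u16, Failure> {
    let score = parse_score(raw_score)?;
    Ok(if score < 30 { 403 } else { 200 })
}

// The only place where failures are "caught".
pub fn boundary(result: Result<u16, Failure>) -> Action {
    match result {
        Ok(status) => Action::Respond(status),
        Err(Failure::Config(_)) => Action::KeepPreviousConfig,
        Err(Failure::Request(_)) => Action::Respond(500),
        Err(Failure::Socket(_)) => Action::CloseConnection,
    }
}
```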
Rust has a lot of helpers to make it less verbose, even that error they demonstrate could've been written in some form `...code()?` with `?` helper that would have propagated the error forwards.
However I do acknowledge that writing Error types is boring sometimes so people don't bother to change their error types and just unwrap. But even in my dinghy little apps for my personal use I do a simple search for `unwrap` and make sure I have as few as possible.
The end result would've been the exact same if they "handled" the error: a bunch of 500s. The language being used doesn't matter if an invariant in your system is broken.
If anything, the "crash early" mentality may even be nefarious: instead of handling the error and keeping the old config, you would spin on trying to load a broken config on startup.
_In theory_ they could have used the old config, but maybe there are reasons that’s not possible in Cloudflare’s setup. Whether or not that’s an invariant violation or just an error that can be handled and recovered from is a matter of opinion in system design.
And crashing on an invariant violation is exactly the right thing to do rather than proceed in an undefined state.
At a previous job (cloud provider), we've had exactly this kind of issue, with exactly the same root cause. The entrypoint for the whole network had a set of rules (think a NAT gateway) that were reloaded periodically from the database. Someone rewrote that bit of plumbing from Python to Go. Someone else performed a database migration. Suddenly, the plumbing could not find the data, and pushed an empty file to prod. The rewrite lacked "if empty, do nothing and raise an alert", that the previous one had. I'll let you imagine what happened next :)
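The missing guard is tiny, which is what makes it so easy to lose in a rewrite. A sketch with hypothetical names (`publish_rules`, `Publish`), using a plain Vec of strings as a stand-in for whatever alerting system is actually in place:

```rust
#[derive(Debug, PartialEq)]
pub enum Publish {
    Pushed(usize),
    SkippedEmpty,
}

// "If empty, do nothing and raise an alert": never replace live rules with nothing.
pub fn publish_rules(
    rules: &[String],
    live: &mut Vec<String>,
    alerts: &mut Vec<String>, // stand-in for a real alerting hook
) -> Publish {
    if rules.is_empty() {
        alerts.push("rule reload produced an empty set; keeping current rules".to_string());
        return Publish::SkippedEmpty;
    }
    live.clear();
    live.extend_from_slice(rules);
    Publish::Pushed(live.len())
}
```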
it'd be kinda hard to amend the clippy lints to ignore coroutine unwraps but still pipe up on system ones. i guess.
edit: i think they'd have to be "solely-task-color-flavored" so definitely probably not trivial to infer
I have been saying for years that Rust botched error handling in unfixable ways. I will go to the grave believing Rust fumbled.
The design of the Rust language encourages people to use unwrap() to turn foreseeable runtime problems into fatal errors. It's the path of least resistance, so people will take it.
Rust encourages developers to consider only the happy path. No wonder it's popular among people who've never had to deal with failure.
All of the concomitant complexity--- Result, ?, the test thing, anyhow, the inability for stdlib to report allocation failure --- is downstream of a fashion statement against exceptions Rust cargo-culted from Go.
The funniest part is that Rust does have exceptions. It just calls them panics. So Rust code has to deal with the ergonomic footgun of Result but pays anyway for the possibility of exceptions. (Sure, you can compile with panic=abort. You can't count on it.)
I could not be more certain that Rust should have been a language with exceptions, not Result, and that error objects are a gross antipattern we'll regret for decades.
(You usually want to make a function infallible if you're using your noexcept function as part of a cleanup path or part of a container interface that allows for more optimizations if it knows certain container operations are infallible.)
Rust makes infallibility the syntactic default and makes you write Result to indicate fallibility. People often don't want to color their functions this way. Guess what happens when a programmer is six levels deep in infallible-colored function calls and does something that can fail.
.unwrap()
Guess what, in Rust, is fallible?
Mutex acquire.
Guess what you need to do often on infallible cleanup paths?
Mutex acquire.
Also, exception handling is hard and lame. We don't need exceptions, just add a "match" block after every line in your program.
I'm also not sure what you're getting at with the comment about exception handling being lame. I think the ML/Haskell inspired model that Rust uses of having a parameterized Result type for fallible operations is generally better than exceptions for a variety of reasons (although maybe better Exception semantics could help with some of this), but what does this have to do with match blocks?
Undoubtedly yes.
> ...but what does this have to do with match blocks?
You tell me. You're the one advocating for placing one after every single function call.
First multi-million dollar .unwrap() story.
The way they wrote the code means that having more than 200 features is a hard non-transient error - even if they recovered from it, it meant they'd have had the same error when the code got to the same place.
I'm sure when the process crashed, k8s restarted the pod or something - then it reran the same piece of code and crashed in the same place.
While I don't necessarily agree with crashing as a business strategy, I don't think they could have done anything other than either drop the extra rules or allocate more memory, neither of which the original code was built to do (probably by design).
The code made the local hard assumption that there won't ever be more than 200 rules and that it's okay to crash if that count is exceeded.
If you design your code around an invariant never being violated (which is fine), you have to make it clear at a higher level when it is violated.
This isn't a Rust problem (though Rust does make it easy to do the wrong thing here imo)
That's not always foolproof, e.g. a freshly (re)started process doesn't have any prior state it can fall back to, so it just hard crashes. But restarts are going to be rate limited anyways, so even then there is time to mitigate the issue before it becomes a large scale outage
Having the feature table pivoted (with 200 feature1, feature2, etc columns) meant they had to do meta queries to system.columns to get all the feature columns which made the query sensitive to permissioning changes (especially duplicate databases).
A Crowdstrike style config update that affects all nodes but obviously isn't tested in any QA or staged rollout strategy beforehand (the application panicking straight away with this new file basically proves this).
Finally an error with bot management config files should probably disable bot management vs crash the core proxy.
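Roughly what "disable bot management instead of crashing the proxy" could look like, as a sketch. `BotModule`, `proxy_status`, the toy scoring rule and the threshold of 30 are all made up here; whether failing open is acceptable depends on what other mitigations sit behind it.

```rust
// Hypothetical optional module: `None` means "bot management is switched off".
pub struct BotModule {
    features: Option<Vec<String>>,
}

impl BotModule {
    // An oversized feature list disables the module instead of panicking.
    pub fn load(names: Vec<String>, max_features: usize) -> BotModule {
        if names.len() > max_features {
            BotModule { features: None }
        } else {
            BotModule { features: Some(names) }
        }
    }

    pub fn enabled(&self) -> bool {
        self.features.is_some()
    }

    // Toy scoring rule, purely for illustration: 1 = likely bot, 99 = likely human.
    pub fn score(&self, user_agent: &str) -> Option<u8> {
        self.features
            .as_ref()
            .map(|_| if user_agent.contains("bot") { 1 } else { 99 })
    }
}

// Core proxy path: a missing score never fails the request (fail open).
pub fn proxy_status(bots: &BotModule, user_agent: &str) -> u16 {
    match bots.score(user_agent) {
        Some(score) if score < 30 => 403,
        _ => 200,
    }
}
```

A real version would also want a loud metric or alert whenever `enabled()` flips to false, otherwise you have just traded an outage for silently serving bots.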
I'm interested here why they even decided to name Clickhouse as this error could have been caused by any other database. I can see though the replicas updating causing flip / flopping of results would have been really frustrating for incident responders.
The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.
(Cloudflare's responses will be different than ours, really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)
I don't use Rust, but a lot of Rust people say if it compiles it runs.
Well Rust won't save you from the usual programming mistake. Not blaming anyone at cloudflare here. I love Cloudflare and the awesome tools they put out.
end of day - let's pick languages | tech because of what we love to do. if you love Rust - pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.
there's no bad language - just occasional hiccups from us users who use those tools.
could have been tight deadline, managerial pressure or just the occasional slip up.
1. At startup, load the last known good config.
2. When signaled, load the new config.
3. When that passes validation, update the last-known-good pointer to the new version.
That way something like this makes the crash recoverable on the theory that stale config is better than the service staying down. One variant also recorded the last tried config version so it wouldn’t even attempt to parse the latest one until it was changed again.
For Cloudflare, it’d be tempting to have step #3 be after 5 minutes or so to catch stuff which crashes soon but not instantly.
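A minimal in-memory sketch of the three steps plus the "last tried version" variant. `ConfigStore` and its methods are hypothetical names, the validation rule (non-empty, at most `max_entries`) is a placeholder, and a real service would persist both the pointer and the last tried version on disk so they survive a crash. The delayed promotion idea is not modelled here.

```rust
pub struct ConfigStore {
    good_version: u64,
    good: Vec<String>,
    last_tried: Option<u64>,
}

impl ConfigStore {
    // Step 1: start from the last known good config.
    pub fn new(good_version: u64, good: Vec<String>) -> Self {
        ConfigStore { good_version, good, last_tried: None }
    }

    // Steps 2 and 3: try a new version, promote it only if it validates.
    // Returns true when the candidate became the new last known good.
    pub fn offer(&mut self, version: u64, candidate: Vec<String>, max_entries: usize) -> bool {
        if self.last_tried == Some(version) {
            return false; // already attempted; wait until the version changes
        }
        self.last_tried = Some(version); // recorded before validating
        if candidate.is_empty() || candidate.len() > max_entries {
            return false; // rejected: last known good stays active
        }
        self.good_version = version;
        self.good = candidate;
        true
    }

    pub fn active_version(&self) -> u64 {
        self.good_version
    }

    pub fn active(&self) -> &[String] {
        &self.good
    }
}
```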
as they say in the post, these files get generated every 5 minutes and rolled out across their fleet.
so in this case, the thing farther up the callstack is a "watch for updated files and ingest them" component.
that component, when it receives the error, can simply continue using the existing file it loaded 5 minutes earlier.
and then it can increment a Prometheus metric (or similar) representing "count of errors from attempting to load the definition file". that metric should be zero in normal conditions, so it's easy to write an alert rule to notify the appropriate team that the definitions are broken in some way.
that's not a complete solution - in particular it doesn't necessarily solve the problem of needing to scale up the fleet, because freshly-started instances won't have a "previous good" definition file loaded. but it does allow for the existing instances to fail gracefully into a degraded state.
in my experience, on a large enough system, "this could never happen, so if it does it's fine to just crash" is almost always better served by a metric for "count of how many times a thing that could never happen has happened" and a corresponding "that should happen zero times" alert rule.
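a sketch of that watcher, with the caveat that every name in it is made up: `Watcher`, `parse_definitions`, and a plain `u64` field standing in for the Prometheus counter. the file format (one definition per line) is also just an assumption to keep it short.

```rust
pub struct Watcher {
    current: Vec<String>,
    max_entries: usize,
    // Stand-in for a Prometheus counter; should stay at zero in normal operation.
    load_errors: u64,
}

// Assumed format: one definition per line; too many lines is an error, not a panic.
fn parse_definitions(raw: &str, max_entries: usize) -> Result<Vec<String>, String> {
    let defs: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    if defs.len() > max_entries {
        return Err(format!("{} entries exceeds limit of {}", defs.len(), max_entries));
    }
    Ok(defs)
}

impl Watcher {
    pub fn new(current: Vec<String>, max_entries: usize) -> Self {
        Watcher { current, max_entries, load_errors: 0 }
    }

    // Called every time a freshly generated file arrives.
    pub fn ingest(&mut self, raw: &str) {
        match parse_definitions(raw, self.max_entries) {
            Ok(defs) => self.current = defs,
            // Keep serving the previous definitions and count the failure.
            Err(_) => self.load_errors += 1,
        }
    }

    pub fn current(&self) -> &[String] {
        &self.current
    }

    pub fn load_errors(&self) -> u64 {
        self.load_errors
    }
}
```

the alert rule is then simply "load_errors > 0", whatever that looks like in your monitoring stack.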
Panics should be logged, and probably grouped by stack trace for things like prometheus (outside of process). That handles all sorts of panic scenarios, including kernel bugs and hardware errors, which are common at cloudflare scale.
Similarly, mitigating by having rapid restart with backoff outside the process covers far more failure scenarios with far less complexity.
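The backoff part of that is only a few lines wherever the supervisor lives. A sketch of the delay schedule, assuming plain exponential backoff with a cap; `restart_delay` is a hypothetical helper, and a real supervisor would likely add jitter and reset the attempt count after a period of healthy uptime.

```rust
use std::time::Duration;

// Delay before restart number `attempt` (0-based): base * 2^attempt, capped at `cap`.
pub fn restart_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // Saturate instead of overflowing when the attempt count gets large.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(cap, |delay| delay.min(cap))
}
```

With a 100 ms base and a 30 s cap, this gives 100 ms, 200 ms, 400 ms and so on, then stays at 30 s.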
One important scenario your approach misses is “the watch config file endpoint fell over”, which probably would have happened in this outage if 100% of servers went back to watching all of a sudden.
Sure, you could add an error handler for that too, and for Prometheus being slow, and an infinite number of other things. Or, you could just move process management and reporting out of process.
The solution to this problem wasn’t restarting the failing process. It was correctly modeling the failure case, so that then the type system forced you to correctly handle it.
Unwrapping is a very powerful and important assertion to make in Rust whereby the programmer explicitly states that the value within will not be an error, otherwise panic. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.
Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)
This is the classic case where, when something fails, people over-index on the cause of the failure while under-indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.
I postulate that whatever this Cloudflare outage cost, in millions or hundreds of millions of dollars, it has been more than paid for by the savings from safe memory access.
I mean that's an unfalsifiable statement, not really fair. C is used to successfully launch spaceships.
Whereas we have a real Rust bug that crashed a good portion of the internet for a significant amount of time. If this was a C++ service everyone would be blaming the language, but somehow Rust evangelicals are quick to blame it on "unidiomatic Rust code".
A language that lets this easily happen is a poorly designed language. Saying you need to ban a commonly used method in all production code is broken.
Consider that the set of possible failures enabled by language design should be as small as possible.
Rust's set is small enough while also being productive. Until another breakthrough in language design as impactful as the borrow checker is invented, I don't imagine more programmers will be able to write such a large amount of safe code.
Well, no, most Rust programmers misunderstand what the guarantees are because they keep parroting this quote. Obviously the language does not protect you from logic errors, so saying "if it compiles, it works" is disingenuous, when really what they mean is "if it compiles, it's probably free of memory errors".
It's a common thing I've experienced and seen a lot of others say that the stricter the language is in what it accepts the more likely it is to be correct by the time you get it to run. It's not just a Rust thing (although I think Rust is _stricter_ and therefore this does hold true more of the time), it's something I've also experienced with C++ and Haskell.
So no, it's not a guarantee, but that quote was never about Rust's guarantees.
Even more now after this outage.
But it's a fact that "if it compiles it runs" is often associated with Rust, in HN at least. A quick Algolia search tells me that.
This is not a Rust problem. Someone consciously chose to NOT handle an error, possibly thinking "this will never happen". Then someone else consciously reviewed (I hope so) a PR with an unwrap() and let it slide.
Now it might be that it was tested, but then ignored or deprioritised by management...
Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.
No language will prevent you from writing broken code if you're determined to do so, and no language is impossible to write correct code in if you make a superhuman effort. But most of life happens in the middle, and tools like Rust make a huge difference to how often a small mistake snowballs into a big one.
No one treats it like that and nearly every Rust project is filled with unwraps all over the place, even in production systems like Cloudflare's.
If you haven't read the Rust Book at least, which is effectively Rust 101, you should not be writing Rust professionally. It has a chapter explaining all of this.
I didn't read anything in that section about unwrap/expect that it shouldn't be used in production code. If anything I read it as perfectly acceptable.
It would be better if it were the other way round: "linter forbids it unless you ask it not to". Never wrong to allow users to shoot themselves in the foot, but it should be explicit.
Do you grok what the issue was with the unwrap, though...?
Idiomatic Rust code does not use that. The fact that it's allowed in a codebase says more about the engineering practices of that particular project/module/whatever. Whoever put the `unwrap` call there had to contend with the notion that it could panic and they still chose to do it.
It's a programmer error, but Rust at least forces you to recognize "okay, I'm going to be an idiot here". There is real value in that.
The "no unwrap" rule is common in most production codebases. Chill.
Anecdotally I can write code for several hours, deploy it to a test sandbox without review or running tests and it will run well enough to use it, without silly errors like null pointer exceptions, type mismatches, OOBs etc. That doesn't mean it's bug-free. But it doesn't immediately crash and burn either. Recently I even introduced a bug that I didn't immediately notice because careful error handling in another place recovered from it.
This is the first significant outage that has involved Rust code, and as you can see the .unwrap is known to carry the risk of a panic and should never be used on production code.
Cloudflare is very cheap at these prices.
Assuming something similar to Sentry would be in use, it should clearly pick up the many process crashes that start occurring right as the downtime starts. And the well defined clean crashes should in theory also stand out against all the random errors that start occurring all over the system as it begins to go down, precisely because it's always failing at the exact same point.
The issue here is about the system as a whole not any line of code.
Unsoundness in the type system that leads to a systemic failure is about the system as a whole.
Not everything can be recovered from restarting a process, and process correctness and recovery is something that also derives from your type system.
People here chatting about unwrap remind me of them :)
If you depend on engineers not fucking up, you will fail. Using unwrap is assuming humans won’t get human-enforced invariants wrong. They will. They did here.
As someone that works in formal verification of crypto systems, watching people like yourself advocate for hope-and-prayer development methodology is astonishing.
However, I understand why we’re still having this debate. It’s the same debate that’s been occurring for the same reasons for decades.
Doing things correctly is mentally more difficult, and so people jump through ridiculous rhetorical hoops to justify why they will not — or quite often, mentally cannot — perform that intellectual labor.
It’s a disheartening lack of craftsmanship and industry accountability, but it’s nothing new.
The opposing views here are not "hope and prayers" vs "good engineering"; it's assuming things will fail at every stage vs assuming one can build a layer of abstraction that is flawless, on top of which we can build.
Resilient systems trump "correct" systems, and I would pick a system designed under the assumption that fake errors will be injected regularly, that processes will be killed at random, that entire racks of machines will be unplugged at any time, that whole datacenters will be taken off grid for fun, over a system that's been "proven correct", any day. I thought it was common knowledge.
Of course I'm not arguing against proving that software is correct. I would actually argue that some formal methods would come in handy to model these kinds of systemic failures and reveal the worst cases with the largest blast radius.
But considering the case at hand, the code for that FL2 bot had an assertion regarding the size of received data and that was a valid assertion, and the process decided to panic, and that was the right decision. What was not right was the lack of instrumentation that should have made these failures obvious, and the fact that the user queries failed when that non-essential bot failed, instead of bypassing that bot.
I get it, don’t pick languages just because they are trendy, but if any company’s use case is a perfect fit for Rust it’s cloudflare.
but Rust's type system did catch this error - and then author decided it's fine to panic if this error happens
> You won't see Go or Java developers making such strong claims about their preferred languages.
yess no Java developer ever said that OOP will solve world hunger
The issue is that it wasn't fine to panic, thus Rust did not catch this error.
So they basically hardcoded something, didn't bother to cover the overflow case with unit tests, didn't have basic error catching that would fallback and send logs/alerts to their internal monitoring system and this is why half of the internet went down?
This simply means, the exception handling quality of your new FL2 is non-existent and is not at par / code logic wise similar to FL.
I hope it was not because of AI driven efficiency gains.
On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures
As of 17:06 all systems at Cloudflare were functioning as normal
6 hours / 5 years gives ~99.98% uptime.
There never was an unbounded "select all rows from some table" without a "fetch first N rows only" or "limit N".
If you knew that this design is rigid, why not leverage the query to actually do it?
What am I missing?
Anyway regardless of which language you use to construct a SQL query, you're not obligated to put in a max rows
Why have we built / permitted the building of / Subscribed to such a Failure-intolerant "Network"?
Even worse - the small botnet that controls everything.
Of course, some users were still blocked, because the Turnstile JS failed to load in their browser but the subsequent siteverify check succeeded on the backend. But overall the fail-open implementation lessened impact to our customers nonetheless.
Fail-open with Turnstile works for us because we have other bot mitigations that are sufficient to fall back on in the event of a Cloudflare outage.
They are escape hatches. Without those your language would never take off.
But here's the thing. Escape hatches are like emergency exits. They are not to be used by your team to go to lunch in a nearby restaurant.
---
Cloudflare should likely invest in better linting and CI/CD alerts. Not to mention isolated testing i.e. deploy this change only to a small subset and monitor, and only then do a wider deployment.
Hindsight is 20/20 and we can all be smartasses after the fact of course. But I am really surprised because lately I am only using Rust for hobby projects and even I know I should not use `unwrap` and `expect` beyond the first iteration phases.
---
I have advocated for this before but IMO Rust at this point will benefit greatly from disallowing those unsafe APIs by default in release mode. Though I understand why they don't want to do it -- likely millions of CI/CD pipelines will break overnight. But in the interim, maybe a rustc flag we can put in our `Cargo.toml` that enables such a stricter mode? Or have that flag just remove all the panicky API _at compile time_ though I believe this might be a Gargantuan effort and is likely never happening (sadly).
In any case, I would expect many other failures from Cloudflare but not _this_ one in particular.
Bubbling up the error or None does not make the program correct. Panicking may be the only reasonable thing to do.
If panicking is guaranteed because of some input mistake to the system your failure is in testing.
I am not trashing on them, I've made such mistakes in the past, but I do expect more from them is all.
And you will not believe how many alerts I got for the "impossible" errors.
I do agree there was not too much that could have been done, yes. But they should have invested in more visibility and be more thorough. I mean, hobbyist Rust devs seem to do that better.
It was just a bit disappointing for me. As mentioned above, I'd understand and sympathise with many other mistakes but this one stung a bit.
I'm just pushing back a bit on the idea that unwrap() is unsafe - it's not, and I wouldn't even call it a foot gun. The code did what it was written to do, when it saw the input was garbage it crashed because it couldn't make sense of what to do next. That's a desirable property in reliable systems (of course monitoring that and testing it is what makes it reliable/fixable in the first place).
Using those should be done in an extremely disciplined manner. I agree that there are many legitimate uses but in the production Rust code I've seen this has rarely been the case. People just want to move on and then forget to circle back and add proper error handling. But yes, in this case that's not quite true. Still, my point that an APM alert should have been raised on the "impossible" code path before panicking, stands.
If you think about it, it’s not really different from handling the bubbled up error inside of Rust. You don’t (?) your results and your errors go away, they just move up the chain.
This is all gone. The internet is a centralised system in the hand of just a few companies. If AWS goes down half the internet does. If Azure, Google Cloud, Oracle Cloud, Tencent Cloud or Alibaba Cloud goes down a large part of the internet does.
Yesterday with Cloudflare down half the sites I tried gave me nothing but errors.
The internet is dead.
The last thing we need here is for more of the internet to sign up for Cloudflare.
It's fair to be upset at their decision making - use that to renegotiate your contract.
My dude, everything is a footgun if you hold it wrong enough
Sometimes you have smart people in the room who dig deeper and fish it out, but you cannot always rely on that.
I'm also suspicious that
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
from the blog had a lot more to do with the issue than perhaps the narrative is letting on.
My best guess is too many alerts firing without a clear hierarchy or any way to separate cause from effect. It's a typical challenge but I wish they would shed some light on that. And it's a bit concerning that improving observability is not part of their follow-up steps.
Sounds like the ops team had one hell of a day.
i’m a little confused on how this was initially confused for an attack though?
is there no internal visibility into where 5xx’s are being thrown? i’m surprised there isn’t some kind of "this request terminated at the <bot checking logic>" error mapping that could have initially pointed you guys towards that over an attack.
also a bit taken aback that .unwrap()’s are ever allowed within such an important context.
would appreciate some insight!
2. Attacks that make it through the usual defences make servers run at rates beyond their breaking point, causing all kinds of novel and unexpected errors.
Additionally, attackers try to hit endpoints/features that amplify severity of their attack by being computationally expensive, holding a lock, or trigger an error path that restarts a service — like this one.
this was in the middle of a scheduled maintenance, with all requests failing at a singular point - that being a .unwrap().
there should be internal visibility into the fact a large number of requests are failing all at the same LOC - and attention should be focused there instantly imo.
or at the very least, it shouldn't take 4 hours for anyone to even consider it wasn't an attack.
in situations such as this, where your entire infra is fucked, you should have multiple crisis teams working in parallel, under different assumptions.
if even one additional team was created that worked under the assumption it was an infra issue rather than an attack, this situation could have been resolved many hours earlier.
for a product as vital to the internet as cloudflare, it is unacceptable to not have this kind of crisis management.
Companies seem to place a lot of trust in configs being pushed automatically without human review into running systems. Considering how important these configs are, shouldn't they perhaps first be deployed to a staging/isolated network for a monitoring window before pushing to production systems?
Not trying to pontificate here, these systems are more complicated than anything I have maintained. Just trying to think of best practices perhaps everyone can adopt.
What would you propose to fix it? The fixed cost of being DDoS-proof is in the hundreds of millions of dollars.
Hell, I would be very curious to know the costs to keep HackerNews running. They probably serve more users than my current client.
People want to chase the next big thing to write it on their CV, not architect simple systems that scale. (Do they even need to scale?)
I never said serving millions of requests is more expensive. Protecting your servers is more expensive.
> Hell, I would be very curious to know the costs to keep HackerNews running. They probably serve more users than my current client.
HN uses Cloudflare. You're making my point for me. If you included the fixed costs that Cloudflare's CDN/proxy is giving to HN incredibly cheaply, then running HN at the edge with good performance (and protecting it from botnets) would cost hundreds of millions of dollars.
> People want to chase the next big thing to write it on their CV, not architect simple systems that scale. (Do they even need to scale?)
Again, attacking your own straw men here.
Writing high-throughput web applications is easier than ever. Hosting them on the open web is harder than ever.
From the ping output, I can see HN is using m5hosting.com. This is why HN was up yesterday, even though everything on CF was down.
> Writing high-throughput web applications is easier than ever. Hosting them on the open web is harder than ever.
Writing proper high-throughput applications was never easy and will never be. It is a little bit easier because we have highly optimized tools like nginx or nodejs so we can offset critical parts. And hosting is "harder than ever" if you complicate the matter, which is a quite common pattern these days. I saw people running monstrosities to serve some html & js in the name of redundancy. You'd be surprised how much a single bare-metal (hell, even a proper VM from DigitalOcean or Vultr) can handle.
"Single" means "you only need one," not that there is only one.
I can imagine that this could easily lead to less visibility into issues.
I don’t think the infrastructure has been as fully recovered as they think yet…
> Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare.
also cloudflare:
> The Cloudflare Dashboard was also impacted due to both Workers KV being used internally and Cloudflare Turnstile being deployed as part of our login flow.
Cloudflare's status page: https://www.cloudflarestatus.com/
Cloudflare Dashboard: https://dash.cloudflare.com/
Unclear to me if it's an Atlassian-managed deployment they have, or if it's self-managed, I'm not familiar with Statuspage and their website isn't helping. Though if it's managed, I'm not sure how they can know for sure there's no interdependence. (Though I guess we could technically keep that rabbit hole going indefinitely.)
https://blog.cloudflare.com/18-november-2025-outage/#:~:text...
https://blog.cloudflare.com/18-november-2025-outage/#:~:text...
> As we wrote before, we believe Blackbird Tech's dangerous new model of patent trolling — where they buy patents and then act their own attorneys in cases — may be a violation of the rules of professional ethics.
https://blog.cloudflare.com/patent-troll-battle-update-doubl...
ChatGPT didn't invent the em dash, some people were always using it. But yeah, it's often one of the signs of AI.
The lack of canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful though, which in some ways they weren't.
The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.
The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.
So basically bad config should be explicitly processed and handled by rolling back to known working config.
I think that's explicitly a non-goal. My understanding is that Cloudflare prefers fail safe (blocking legitimate traffic) over fail open (allowing harmful traffic).
Crashing is not an outage. It’s a restart and a stack trace for you to fix.
But you’re still missing it. Crashing is not bad. It’s good. It’s how you leverage OS level security and reliability.
In fact I'd argue that crashing is bad. It means you failed to properly enumerate and express your invariants, hit an unanticipated state, and thus had to fail in a way that requires you to give up and fall back on the OS to clean up your process state.
[edit]
Sigh, HN and its "you're posting too much". Here's my reply:
> Why? The end user result is a safe restart and the developer fixes the error.
Look at the thread you're commenting on. The end result was a massive world-wide outage.
> That’s what it’s there for. Why is it bad to use its reliable error detection and recovery mechanism?
Because you don't have to crash at all.
> We don’t want to enumerate all possible paths. We want to limit them.
That's the exact same thing. Anything not "limited" is a possible path.
> If my program requires a config file to run, crash as soon as it can’t load the config file. There is nothing useful I can do (assuming that’s true).
Of course there's something useful you can do. In this particular case, the useful thing to do would have been to fall back on the previous valid configuration. And if that failed, the useful thing to do would be to log an informative, useful error so that nobody has to spend four hours during a worldwide outage to figure out what was going wrong.
Why? The end user result is a safe restart and the developer fixes the error.
> fall back on the OS to clean up your process state.
That’s what it’s there for. Why is it bad to use its reliable error detection and recovery mechanism?
> It means you failed to properly enumerate and express your invariants
We don’t want to enumerate all possible paths. We want to prune them.
If my program requires auth info to run, crash as soon as it can’t load it. There is nothing useful I can do (assuming that’s true).
The world wide outage was actually caused by deploying several incorrect programs in an incorrect system.
The root one was actually a bad query as outlined in the article.
Let’s get philosophical for a second. Programs WILL be written incorrectly - you will deploy to production something that can’t possibly work. What should you do with a program that can’t work? Pretend this can’t happen? Or let you know so you can fix it?
Type systems provide compile time guarantees of correctness such that systems cannot fail in ways covered by the type system.
In this case, they used an unsound hole in the type system to do something that unnecessarily abandoned those compile-time invariants and in the process caused a world-wide outage.
The answer is not to embrace poking unsound holes in your type system in the first place.
In this particular case, it was the "limit 200" because "performance reasons" so I think there was more space to implement the latter than the former.
Are you in the right thread?
The problem was a query producing incorrect data. The crash helped them find it.
What do you think happens when a program crashes?
Barely given the initial impression it was a DDoS.
Although you can argue that's bad observability.
But the architectural assumption that the bot file build logic can safely obtain this operationally critical list of features from derivative database metadata vs. a SSOT seems like a bigger problem to me.
Gonna use that one at $WORK.
All that said, to have an outage report turned around practically the same day, and one this detailed, is quite impressive. Here's to hoping they make their changes from this learning, and we don't see this exact failure mode again.
i think this is happening way too frequently
meanwhile VPS, dedicated servers hum along without any issues
i dont want to use kubernetes but if we have to build mission critical systems doesn't seem like building on cloudflare is going to cut it
Reputationally this is extremely embarrassing for Cloudflare, but imo they seem to get their feet back on the ground. I was surprised to see not just one, but two apologies to the internet. This just cements how professional and dedicated the Cloudflare team is to ensure stable resilient internet and how embarrassed they must have been.
A reputational hit for sure, but outcome is lessons learned and hopefully stronger resilience.
What I'm trying to say is that things would be much better if everyone took a chill pill and accepted the possibility that in rare instances, the internet doesn't work and that's fine. You don't need to keep scrolling TikTok 24/7.
> but my use case is especially important
Take a chill pill. Probably it isn't.
...
(I'd pick Haskell, cause I'm having fun with it recently :P)
I haven’t worked in Rust codebases, but I have never worked in a Go codebase where a `panic` in such a location would make it through code review.
Is this normal in Rust?
They just sell proxies, to whoever.
Why are they the only company doing ddos protection?
I just don't get it.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:
1. Push out files A/B to ensure the old file is not removed.
2. Handle the failure of loading the file (for whatever reason) by automatically reloading the old file instead and logging the error.
This seems like pretty basic SRE stuff.
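A minimal sketch of point 2, in Python for brevity: keep the last configuration that loaded cleanly, and when a new file fails validation, log the reason and keep serving with the old one. The JSON layout, the `MAX_FEATURES` value and the `FeatureStore` name are all assumptions made up for illustration, not Cloudflare's actual format or code.

```python
import json
import logging

log = logging.getLogger("feature_loader")

MAX_FEATURES = 200  # hypothetical limit standing in for the consumer's fixed capacity


def parse_feature_file(raw: str) -> list[str]:
    """Parse and validate a feature file; raise if it is unusable."""
    features = json.loads(raw)["features"]
    if len(features) > MAX_FEATURES:
        raise ValueError(f"{len(features)} features exceeds limit of {MAX_FEATURES}")
    return features


class FeatureStore:
    """Holds the last configuration that loaded successfully."""

    def __init__(self, initial: list[str]):
        self.active = initial

    def reload(self, raw: str) -> bool:
        """Try to swap in a new file; on failure keep the old one and log why."""
        try:
            candidate = parse_feature_file(raw)
        except (ValueError, KeyError, TypeError) as exc:
            # json.JSONDecodeError is a ValueError, so malformed input lands here too.
            log.error("rejected new feature file, keeping previous: %s", exc)
            return False
        self.active = candidate
        return True
```

Whether silently keeping stale bot-detection data is acceptable is a separate policy question; the point is only that the failure becomes a logged, explicit decision instead of a crash.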
Even if the servers weren't crashing it is possible that a bad set of parameters results in far too many false positives which may as well be complete failure.
Even if you want this data to be very fresh you can probably afford to do something like:
1. Push out data to a single location or some subset of servers.
2. Confirm that the data is loaded.
3. Wait to observe any issues. (Even a minute is probably enough to catch the most severe issues.)
4. Roll out globally.
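The four steps above can be sketched roughly like this. The `push` and `healthy` callables are hypothetical stand-ins for whatever deployment and monitoring hooks actually exist; nothing here reflects Cloudflare's real tooling.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    config: str,
    canaries: Sequence[str],
    fleet: Sequence[str],
    push: Callable[[str, str], None],
    healthy: Callable[[str], bool],
    soak_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Push to canaries, soak, check health, then push to the rest.

    Returns True if the config reached the whole fleet, False if the
    rollout was halted at the canary stage.
    """
    # Step 1: push to a small subset of servers.
    for host in canaries:
        push(host, config)
    # Steps 2 and 3: wait, then confirm the canaries loaded the data and look healthy.
    sleep(soak_seconds)
    if not all(healthy(host) for host in canaries):
        return False
    # Step 4: roll out to every host that has not received it yet.
    for host in fleet:
        if host not in canaries:
            push(host, config)
    return True
```

With a 60 second soak the data is at most about a minute staler than an immediate global push, which is the trade-off being proposed.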
It sounds like the change could've been rolled out more slowly, halted when the incident started and perhaps rolled back just in case.
Big tech is a fucking joke.
This has to sting a bit after that post.
What could have prevented this failure?
Cloudflare's software could have included a check that refused to generate the feature file if its size was higher than the limit.
A testcase could have caught this.
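Something along these lines, as a rough Python sketch: the generator and the consumer share one limit, the generator refuses to publish anything over it, and a test pins that behaviour. `FEATURE_LIMIT`, the exception name and the newline-separated file format are invented for the example.

```python
FEATURE_LIMIT = 200  # hypothetical shared constant; the consumer would import the same value


class FeatureFileTooLarge(Exception):
    """Raised instead of publishing a file the consumer cannot load."""


def generate_feature_file(rows: list[str], limit: int = FEATURE_LIMIT) -> str:
    """Build the file contents, refusing to emit anything over the limit."""
    if len(rows) > limit:
        raise FeatureFileTooLarge(f"{len(rows)} rows > limit {limit}; not publishing")
    return "\n".join(rows) + "\n"


def test_generator_refuses_oversized_input() -> None:
    """The test case: one row over the limit must not produce a file."""
    rows = [f"feature_{i}" for i in range(FEATURE_LIMIT + 1)]
    try:
        generate_feature_file(rows)
    except FeatureFileTooLarge:
        return
    raise AssertionError("oversized file was generated")
```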
These folks weren't operating for charity. They were highly paid so-called professionals.
Who will be held accountable for this?
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
but this is not mentioned at all in the timeline above. My best guess would be that the process got stuck in a tight restart loop and filled available disk space with logs, but I'm happy to hear other guesses from people more familiar with Rust.
> As well as returning HTTP 5xx errors, we observed significant increases in latency of responses from our CDN during the impact period. This was due to large amounts of CPU being consumed by our debugging and observability systems, which automatically enhance uncaught errors with additional debugging information.
(Just above https://blog.cloudflare.com/18-november-2025-outage/#how-clo...)
There are plenty of resources, yet it's somehow never enough. You do tons of pretty amazing things with pretty amazing tools that also have notable shortcomings.
You're surrounded by smart people who do lots of great work, but you also end up in incident reviews where you find facepalm-y stuff. Sometimes you even find out it was a known corner case that was deemed too unlikely to prioritize.
The last incident for my team that I remember dealing with there ended up with my coworker and I realizing the staging environment we'd taken down hours earlier was actually the source of data for a production dashboard, so we'd lost some visibility and monitoring for a bit.
I've also worked at Facebook (pre-Meta days) and at Datadog, and I'd say it was about the same. Most things are done quite well, but so much stuff is happening that you still end up with occasional incidents that feel like they shouldn't have happened.
I wrote a book on feature stores for O'Reilly. The failure triggered by their bad Clickhouse query could also have been caused by a different error: duplicate rows in materialized feature data. Hopsworks, for example, prevents duplicate rows by building on primary key uniqueness enforcement in Apache Hudi. In contrast, Delta Lake and Iceberg do not enforce primary key constraints, and neither does Clickhouse. So they could have the same bug again due to a bug in feature ingestion - and given they hacked together their feature store, it is not beyond the bounds of possibility.
Reference: https://www.oreilly.com/library/view/building-machine-learni...
I'm impressed they were able to corral people this quickly.
If that's true, is there a way to tell (easily) whether a site is using cloudflare or not?
Just ping the host and see if the ip belongs to CF.
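A rough Python version of that check. The three networks listed are only a partial snapshot of the IPv4 ranges Cloudflare publishes and are assumed to still be current, so pull the full list from Cloudflare before trusting a negative answer. It also only detects sites proxied through Cloudflare, not ones that merely use it for DNS.

```python
import ipaddress
import socket

# Partial snapshot of Cloudflare's published IPv4 ranges (assumed current).
CLOUDFLARE_V4 = [
    ipaddress.ip_network("104.16.0.0/13"),
    ipaddress.ip_network("172.64.0.0/13"),
    ipaddress.ip_network("173.245.48.0/20"),
]


def is_cloudflare_ip(ip: str) -> bool:
    """True if the address falls inside one of the listed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CLOUDFLARE_V4)


def site_uses_cloudflare(hostname: str) -> bool:
    """Resolve the host (needs network access) and test its first IPv4 address."""
    return is_cloudflare_ip(socket.gethostbyname(hostname))
```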
I'd agree that the use of `unwrap` could possibly make sense in a place where you do want the system to fail hard. There are lots of good reasons to make the system fail hard. I'd lean towards an `expect` here, but whatever.
That said, the function already returns a `Result` and we don't know what the calling code looks like. Maybe it does do an `unwrap` there too, or maybe there is a safe way for this to log and continue that we're not aware of because we don't have enough info.
Should a system as critical as the CF proxy fail hard? I don't know. I'd say yes if it was the kind of situation that could revert itself (like an incremental rollout), but this is such an interesting situation since it's a config being rolled out. Hindsight is 20:20 obviously, but it feels like there should've been better logging, deployment, rollback, and parsing/validation capabilities, no matter what the `unwrap`/`Result` option is.
Also, it seems like the initial Clickhouse changes could've been tested much better, but I'm sure the CF team realizes that.
On the bright side, this is a very solid write up so quickly after the outage. Much better than those times we get it two weeks later.
The report actually seems to confirm this - it was indeed a crash on ingesting the bad config. However I'm actually surprised that the long duration didn't come from "it takes a long time to restart the fleet manually" or "tooling to restart the fleet was bad".
The problem mostly seems to have been "we didn't know what's going on". A look into the proxy logs would hopefully have shown the stacktrace/unwrap, and metrics about the incoming requests would hopefully have shown that there was no abnormal amount of requests coming in.
Excuse me, what you've just said? Who decided on “Cloudflare's importance in the Internet ecosystem”? Some see it differently, you know, there's no need for that self-assured arrogance of an inseminating alpha male.
However, I have a question from a release deployment process perspective. Why was this issue not detected during internal testing? I didn't find the RCA covering this aspect. Doesn't Cloudflare have an internal test stage as part of its CI/CD pipeline? Looking at the description of the issue, it should have been immediately detected in an internal stage test environment.
How about:
1. The permissions change project is paused or rolled back until
2. All impacted database interactions (SQL queries) are evaluated for improper assumptions, or better
3. Their design that depends on database metainfo and schema is replaced with ones that use specific tables and rows in tables instead of using the meta info as part of their application.
4. All hard coded limits are centralized in a single global module and referenced from their users and then back propagated to any separate generator processes that validate against the limit before pushing generated changes
1) Lack of validation of the configuration file.
Rolling out a config file across the global network every 5 minutes is extremely high risk. Even without hindsight, surely one would see the need for very careful validation of this file before taking on that risk?
There were several things "obviously" wrong with the file that validation should have caught:
- It was much bigger than expected.
- It had duplicate entries.
- Most importantly, when loaded into the FL2 proxy, the proxy would panic on every request. At the very least, part of the validation should involve loading the file into the proxy and serving a request?
2) Very long time to identify and then fix such a critical issue.
I can't understand the complete lack of monitoring or reporting? A panic in Rust code, especially from an unwrap, is the application screaming that there's a logic error! I don't understand how that can be conflated with a DDoS attack. How are your logs not filled with backtraces pointing to the exact "unwrap" in question?
Then, once identified, why was it so hard to revert to a known good version of the configuration file? How did no one foresee the need to roll back this file when designing a feature that deploys a new one globally every 5 minutes?
Even a simple key-value map per feature should have allowed for insertions as simple as a put/replace of the value and not appending to the file. That was not the case here, where Cloudflare kept appending to the file for any feature to be added. And I am assuming the features are bot attack patterns as features. Anyway, there is something fundamental here that Cloudflare should rethink. If someone can educate me on the design, I can continue reading the next few lines.
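To illustrate the put/replace idea in the abstract (this says nothing about how Cloudflare's file is actually built): if features are stored in a map keyed by name, a duplicated input row overwrites the existing entry instead of growing the output. The function name and the string values are made up for the example.

```python
def upsert_features(current: dict[str, str], rows: list[tuple[str, str]]) -> dict[str, str]:
    """Return a new map in which each feature name appears once.

    Later rows replace earlier ones, so feeding the same rows twice
    leaves the size unchanged.
    """
    merged = dict(current)
    for name, value in rows:
        merged[name] = value  # put/replace, never append
    return merged
```

Feeding the doubled row set through this would have produced the same number of entries as the original set.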
Cloudflare's incident report is written clearly and explicitly, so based on my own understanding, I’m going to try reproducing this outage. Already completed:
- CK cluster
- Permission change triggering data doubling
- Cache propagation
- Unaffected proxy services
- Proxy services with bot score errors
TODO:
- unwrap panic during pre-allocation of cache
- Full demonstration of the entire outage process
What were the teams doing between 11:00 and 13:00? There's no explanation of what investigations were going on, or why they weren't able to figure out the root cause.
Didn't the services that were crashing due to OOM raise any alerts?
This is shitty at so many levels.
Go to jeffblearning on LinkedIn. I took it down with 253 copies of a text file delivered through a vulnerability in Novo’s systems.
I’ve documented all of it.
It’s not done yet…
Shouldn't the architecture be set up in such a way that subcomponents can fail without impacting the critical function of the component?
nawgz•2mo ago
> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail
A configuration error can cause internet-scale outages. What an era we live in
Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?
I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!
jmclnx•2mo ago
norskeld•2mo ago
mewpmewp2•2mo ago
I think it's quite rare for any company to have exact similar scale and size of storage in stage as in prod.
Aeolun•2mo ago
We’re like a millionth the size of cloudflare and we have automated tests for all (sort of) queries to see what would happen with 20x more data.
Mostly to catch performance regressions, but it would work to catch these issues too.
I guess that doesn’t say anything about how rare it is, because this is also the first company at which I get the time to go to such lengths.
mewpmewp2•2mo ago
In this case the database table in question seems to have been modest in size (the features for ML), so naively they could have kept stage features in sync with prod at the very least, but it could be they didn't consider that 55 rows vs 60 rows or similar could be a breaking point given a certain specific bug.
It is much easier to test with 20x data if you don't have the amount of data cloudflare probably handles.
Aeolun•2mo ago
Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.
mewpmewp2•2mo ago
tatersolid•2mo ago
Cloudflare doesn’t run in AWS. They are a cloud provider themselves and mostly run on bare metal. Where would these extra 100k physical servers come from?
Aeolun•2mo ago
Doing stuff at scale doesn’t suddenly mean you skip testing.
And just because they host stuff themselves doesn’t mean they couldn’t run on the cloud if they needed to.
mewpmewp2•2mo ago
Their main cost of revenue is these infra costs.
norskeld•2mo ago
Jach•2mo ago
gishh•2mo ago
NetMageSCW•2mo ago
shoo•2mo ago
I also found the "remediation and follow up" section a bit lacking, not mentioning how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.
Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests that the queries were returning functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe results behaviour.
In theory there's a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to test that the DB permission change did not regress query results for common DB queries, for changes that are expected to not cause functional changes in behaviour.
Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc) to known good golden outputs". This style of regression testing is brittle, burdensome to maintain and error prone when you need to make functional changes and update what the "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.
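A toy version of that golden-output test, using Python's built-in `sqlite3` in place of ClickHouse. The `columns_meta` table, the column names and types, and the `default`/`r0` database names are all stand-ins chosen for the sketch; the query is only loosely modelled on a metadata lookup that does not filter by database.

```python
import sqlite3

# Hypothetical metadata query: lists a table's columns without filtering by database.
FEATURE_QUERY = "SELECT name, type FROM columns_meta WHERE table_name = ? ORDER BY name"

# Known-good output for the toy dataset.
GOLDEN = [("bot_score", "Float64"), ("ua_hash", "UInt64")]


def make_db(visible_databases: list[str]) -> sqlite3.Connection:
    """Create a toy metadata table exposing the same columns once per visible database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE columns_meta (db_name TEXT, table_name TEXT, name TEXT, type TEXT)"
    )
    for db in visible_databases:
        conn.executemany(
            "INSERT INTO columns_meta VALUES (?, 'http_requests_features', ?, ?)",
            [(db, name, typ) for name, typ in GOLDEN],
        )
    return conn


def query_matches_golden(conn: sqlite3.Connection) -> bool:
    """Run the query and compare the result with the golden output."""
    rows = conn.execute(FEATURE_QUERY, ("http_requests_features",)).fetchall()
    return rows == GOLDEN
```

Run before the permission change (one visible database) the check passes; run after it (two visible databases) every row comes back twice and the comparison fails, which is the regression the test is meant to surface in CI.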