We must be at the trillion dollar mistake by now, right?
Zig is fairly easy to adopt in existing C systems and is guaranteed null-safe (although not use-after-free-safe).
Rust, although quite safe, bears a fairly high adoption cost as existing code often cannot be ported directly.
Borgo (https://github.com/borgo-lang/borgo) is a nil-safe language that compiles to Go, so it is easily adoptable in existing Go systems.
In Scala, Kotlin, and Java you just can’t get away from null :/
Even in safe languages like Rust, you can still introduce errors in application code with the same semantics as a null pointer error: doing `some_optional.unwrap().someMethod()` will crash your program just as surely as `someNullableObject.someMethod()` does in Java.
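To make that concrete, here is a minimal Rust sketch (the `find_user` function is hypothetical) showing that `.unwrap()` on a missing value crashes at runtime, the same failure mode as a Java NPE; the only difference is that the crash site is explicit in the source:

```rust
// A lookup that may legitimately find nothing.
fn find_user(id: u32) -> Option<String> {
    if id == 42 { Some("admin".to_string()) } else { None }
}

fn main() {
    // Panics with "called `Option::unwrap()` on a `None` value":
    // the moral equivalent of a NullPointerException, just opted into explicitly.
    let name = find_user(7).unwrap();
    println!("{}", name.to_uppercase());
}
```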
As one example, I just learned (by way of a nasty production app crash) that Kotlin chose to make all checked exceptions silently unchecked. Kind of a stunning own-goal for a language from 2010.
In everyday Kotlin code, I see either a sealed class or Result for cases where you'd find checked exceptions in Java, and otherwise normal unchecked exceptions from `require`, `check`, and other precondition assertions.
But it gets a lot of hate in PL circles for reasons I don't completely understand.
My tl;dr:
The vlang team has repeatedly made false promises and assertions about what the language can do, ranging from a "we put things that are on our roadmap in our features list" to "we've promised something that is outright impossible". Analyzing how this has changed over the years, the only reasonable conclusion is that they're grifters who are okay with lying.
In addition, they regularly engage in flamewars about it on this site, and their moderation practices of silencing any sort of criticism in their own spaces and calling anyone who dares to question it an idiot are well known and documented.
You don't need to rewrite everything to prevent the majority of new bugs; it's enough to protect new code and keep the battle-tested stuff around.
If you are talking about the issue of migrating to Rust, well, rewriting hundreds of millions of lines of code never makes sense.
Not only would this kill developer productivity if it were true, but how could it even be done? When the compiler version bumps, that's a lot of changes that would need to be gated. Every team that works on a dependency would have to add feature flags for every change and bug fix they make.
Rules in aerospace are written in blood. Rules of internet software are written in inconvenience, so productivity is usually given much higher priority than reducing risk of catastrophic failure.
This also got a laugh:
> We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage. For some customers, the monitoring infrastructure they had running on Google Cloud was also failing, leaving them without a signal of the incident or an understanding of the impact to their business and/or infrastructure.
You should always have at least some kind of basic monitoring that's on completely separate infrastructure, ideally from a different vendor. (And maybe Google should too)
> It took up to ~2h 40 mins to fully resolve in us-central-1
This would have cost their customers tens of millions, maybe north of $100M.
Not surprised they'd have an extensive write-up like this.
During the rollout period, the combinatorial explosion of code paths is rather annoying to deal with. On the other hand, it does encourage not having too many things going on at once: if some of the changes affect business metrics, it will be hard to glean any insights when several of them overlap.
I guess the issue here is that if you're crash looping, once a task comes up it will generate load retrying to fetch the config, so even if you're no longer crash looping (and hence no longer backing off at Borg) you're still causing overload.
As long as the initial rate of tasks coming up is enough to cause overload, this will result in persisting the outage even once all tasks are up (assuming that the overload is sufficient to bring goodput of tasks becoming healthy to near zero).
Interestingly, one of the mitigations they applied was to fan out config reads to the multiregional mirrors of the database instead of just the regional us-central1 mirror; presumably the multiregional mirrors brought in significantly more capacity than the regional one, spreading the load.
I'd be curious to know how much configuration they're loading for it to cause that much load.
So no, it doesn't.
Hey, it's better than lurking undefined behavior and data corruption issues!
Doesn't keep your binary from crashing, though.
It does. If a function returns a Result or Option type, you can't just use its data; you have to either automatically propagate the error case upward, pattern match on both the success and error cases, or just use `.unwrap()` and explicitly choose to crash in the case of an error. There's no implicit crash like there is in a language that lets you dereference null pointers.
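A minimal Rust sketch of those three options, using a hypothetical `parse_port` helper (nothing to do with the incident's actual code):

```rust
use std::num::ParseIntError;

// Option 1: propagate the error upward with `?`; the caller must handle it.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    let port: u16 = s.trim().parse()?;
    Ok(port)
}

fn main() {
    // Option 2: pattern match on both cases; no implicit crash is possible.
    match parse_port("8080") {
        Ok(port) => println!("listening on {port}"),
        Err(e) => eprintln!("bad port config: {e}"),
    }

    // Option 3: explicitly opt into crashing; the crash is visible in the source.
    let port = parse_port("not-a-number").unwrap();
    println!("{port}");
}
```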
Processes and systems have flaws that can be fixed; humans will always make mistakes.
1) A global feature release that went everywhere at the same time
2) Null pointer dereference
3) Lack of appropriate retry policies that resulted in a thundering herd problem
All of these are absolutely standard mistakes that everyone who's worked in the industry for some time has seen numerous times. There is nothing novel here, no weird distributed-systems logic, no Google scale, just rookie mistakes all the way.
This is 100% a process problem.
Most existing software stacks (including Google's C++, Go, Java) are by no means in a position to solve this problem, but that doesn't change that it is, in fact, a problem that fundamentally can be solved using types.
Of course, that'd require a full rewrite of the service.
And when the failures are request-scoped, you're back to the outage not being global but affecting only the customers using this feature with a bad config.
> We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests.
It's clear that they have issues in the service itself not related to language.
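For what "fails open" could look like concretely, here is a rough sketch (hypothetical names and shapes, not Google's actual code): if the policy cannot be loaded or is malformed, the check logs the problem and allows the request instead of crashing or rejecting everything.

```rust
struct QuotaPolicy {
    limit_per_minute: u64,
}

// Stand-in for the real lookup against the policy store.
fn load_policy(project: &str) -> Result<QuotaPolicy, String> {
    if project.is_empty() {
        Err("blank policy row".to_string())
    } else {
        Ok(QuotaPolicy { limit_per_minute: 1000 })
    }
}

fn check_quota(project: &str, used_this_minute: u64) -> bool {
    match load_policy(project) {
        Ok(policy) => used_this_minute < policy.limit_per_minute,
        Err(e) => {
            // Fail open: log the problem and serve the request anyway,
            // rather than taking every API call down with the check.
            eprintln!("quota policy unavailable for {project:?}: {e}; failing open");
            true
        }
    }
}

fn main() {
    assert!(check_quota("my-project", 10));
    assert!(check_quota("", 10)); // a blank/malformed policy still serves
}
```

Whether failing open is acceptable obviously depends on what the check protects; for authorisation you would presumably still want to fail closed.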
Even better if the type checker specifically highlights the fact that a value can be zero, and prevents compilation of code that doesn't specifically take this possibility into account.
* Not dealing with null data properly
* Not testing it properly
* Not having test coverage showing your new thing is tested
* Not exercising it on a subset of prod after deployment to show it works without falling over before it gets pushed absolutely everywhere
Standards in this industry have dropped over the years, but by this much? If you had done this 10 years ago as a Google customer for something far less critical, everyone on their side would have been smugly lolling at you, and rightly so.
> Not dealing with null data properly
This is _hardly_ a "junior level mistake". That kind of bug is pervasive in all the languages they're likely using for this service (Go, Java, C++) written even by the most "senior" developers.
This was reading fields from a database that were coming back null, not some pointer that becomes null after a series of nasty state transitions, so this is very much in the junior category.
I haven't made a mistake like that in many, many years. I'm still very much guilty at times of other kinds of null issues from internal logic, but definitely not "I forgot to handle a null from a data source I read from". It's so pervasive and common that you definitely learn from your mistakes.
Amateur hour in Mountain View.
For a company the size and quality of Google to be bringing down the majority of their stack with this type of error really suggests they do not implement appropriate mitigations after serious issues.
They rolled out a change without feature flagging, didn’t implement exponential backoffs in the clients, didn’t implement load shedding in the servers.
This is all in the Google SRE book from many years ago.
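For reference, the kind of client-side exponential backoff with jitter the SRE book describes looks roughly like this; a sketch only, not a claim about what Service Control's clients actually do:

```rust
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Retry an operation with capped exponential backoff plus jitter, so a fleet
// of restarting clients doesn't hammer the backend in lockstep.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
) -> Result<T, E> {
    assert!(max_attempts >= 1);
    let mut delay = Duration::from_millis(100);
    let cap = Duration::from_secs(30);
    for attempt in 1..=max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt == max_attempts => return Err(e),
            Err(_) => {
                // Cheap jitter without pulling in an RNG crate: derive it from the clock.
                let now_ms = SystemTime::now()
                    .duration_since(UNIX_EPOCH)
                    .unwrap_or_default()
                    .subsec_millis() as u64;
                let jitter = Duration::from_millis(now_ms % delay.as_millis().max(1) as u64);
                sleep(delay + jitter);
                delay = (delay * 2).min(cap);
            }
        }
    }
    unreachable!("the final attempt always returns above")
}
```

Feature flagging and server-side load shedding are separate mitigations, but the client half really is this small.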
I read their SRE books; all of this stuff is in there: https://sre.google/sre-book/table-of-contents/ https://google.github.io/building-secure-and-reliable-system...
Have standards slipped, or was the book just marketing?
HN likes to pretend that FAANG is the pinnacle of existence. The best engineers, the best standards, the most “that wouldn’t have happened here,” the yardstick by which all companies should be measured for engineering prowess.
Incidents like this repeatedly happening reveal that’s mostly a myth. They aren’t much smarter, their standards are somewhat wishful thinking, their accomplishments are mostly rooted in the problems they needed to solve just like any other company.
So I would say there is a difference between AWS architects and engineers (although I know firsthand that certain things are suboptimal, but...) and those of several other companies who have fewer customers but experienced successful attacks (or data loss). Even if you take Microsoft, there is a huge difference in security posture between AWS and Azure (and I say this as a big fan of the so-called "private cloud" (previously known as just your own infra)).
https://www.lastweekinaws.com/blog/azures_vulnerabilities_ar...
(2022)
I think you're only seeing what you want to see, because somehow bringing FANG engineers down a peg makes you feel better?
A broken deployment due to a once-in-a-lifetime configuration change in a project that wasn't allocated engineering effort to allow more robust and resilient deployment modes doesn't turn any engineer into an incompetent fool. Sometimes you need to flip a switch, and you can't spare a team working one year to refactor the whole thing.
This seems to imply that the person in charge at G was right to cause this outage... and that Google is very short-staffed and too poor to afford to do proper engineering work?
Somehow that doesn't inspire confidence in their engineering prowess. Sure seems to me that bad engineering leadership decisions are equivalent to bad engineering.
Opinions are my own.
But if you change a schema, be it DB, protobuf, whatever, this is the major thing your tests should be covering.
This is why people are so amazed by it.
The document also doesn't say there wasn't testing in staging or prod.
So do we. If you change the schema it doesn't matter whether it's unit testing: either you have a validation stage that checks these assumptions on all incoming data beforehand (the way many Python people use pydantic on incoming data, for example), and that must be updated for every change, or every unit must be defensive against this sort of thing.
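For a rough idea of what that validation stage can look like in a typed language, here is a hedged sketch (the `PolicyRow` shape is invented; it assumes serde with the derive feature plus serde_json as dependencies): non-optional fields make null or missing data fail at deserialization rather than deep in the serving path.

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct PolicyRow {
    project: String,       // non-optional: a null or missing field is rejected
    limit_per_minute: u64, // at the boundary instead of surfacing later
}

fn main() {
    let good = r#"{ "project": "my-project", "limit_per_minute": 1000 }"#;
    let blank = r#"{ "project": null, "limit_per_minute": 1000 }"#;

    match serde_json::from_str::<PolicyRow>(good) {
        Ok(row) => println!("accepted: {row:?}"),
        Err(e) => eprintln!("rejected: {e}"),
    }

    // The blank field is caught here ("invalid type: null, expected a string"),
    // which is where you want to find out about it.
    match serde_json::from_str::<PolicyRow>(blank) {
        Ok(row) => println!("accepted: {row:?}"),
        Err(e) => eprintln!("rejected: {e}"),
    }
}
```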
Google’s standards, and from what I can tell, most FAANG standards are like beauty filters on Instagram. Judging yourself, or any company, against them is delusional.
- And they used machines without ECC and their index got corrupted because of it? And instead of hanging their heads in shame and taking lessons from IBM old-timers, they published a paper about it?
- What really accelerated the demise of Google+ was that an API issue allowed the harvesting of private profile fields for millions of users, and they hid that for months fearing the backlash...
Don't worry, you will have plenty more outages from the land of "we only hire the best"...
On GKE, we see different services (like Postgres and NATS) running on the same VM in different containers receive/send stream contents (e.g. HTTP responses) where the packets of the stream have been mangled with the contents of other packets. We've been seeing it since 2024, and all the investigation we've done points to something outside our apps and deeper in the system. We've only seen it in one Kubernetes cluster, and it lasts 2-3 hours and then magically resolves itself; draining the node also fixes it.
If there are physical nodes with faulty RAM, I bet something like this could happen. Or there's a bug in their SDN or their patched version of the Linux kernel.
All the standard tools for binary rollouts and config pushes will typically do some kind of gradual rollout.
In some ways Google Cloud had actually greatly improved the situation since a bunch of global systems were forced to become regional and/or become much more reliable. Google also used to have short global outages that weren't publicly remarked on (at the time, if you couldn't connect to Google, you assumed your own ISP was broken), so this event wasn't as rare as you might think. Overall I don't think there is a worsening trend unless someone has a spreadsheet of incidents proving otherwise.
[I was an SRE at Google several years ago]
> a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds
If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag that “quota policy changes must use the standard gradual rollout tooling.”
The baseline can’t be “the trend isn’t worsening;” the baseline should be that if global config rollouts are commonly the cause of problems, there should be increasingly elevated standards for when config systems can bypass best practices. Clearly that didn’t happen here.
I hope the pendulum swings the other way around now in the discussion.
[disclaimer that I worked as a GCP SRE for a long time, but left a while back]
If I understood it correctly, this service checks proper authorisation among other things, so isn’t failing open a security risk?
The downstream effects tend to be pretty grim, and to make things worse, they start to show up only after six months. It's also a coin flip whether the decision will be reversed after another major outage, itself directly attributable to the decisions made in the aftermath of the previous one.
What makes these kinds of issues particularly challenging is that by their very definition, the conditions and rules will be codified deep inside nested error handling paths. As an engineer maintaining these systems, you are outside of the battle tested happy paths and first-level unhappy paths. The conditions to end up in these second/third-level failure modes are not necessarily well understood, let alone reproducible at will. It's like writing code in C, and having all your multi-level error conditions be declared 'volatile' because they may be changed by an external force at any time, behind your back.
I mean, it's not like they don't have that technology: the worldwide sync was exactly what caused the outage.
At $WORK we use Consul for this job.
It sounds to me like something needed to be recompiled and redeployed.
> Without the appropriate error handling, the null pointer caused the binary to crash.
Even worse if this was AI-generated C or C++ code; was this not tested before deployment?
This is why you write tests before the actual code, and why vibe-coding is a scam as well. This would also never have happened if it had been written in Rust.
I expect far better than this from Google, and yet we are still dealing with null pointer crashes to this day.
"If this was vibe coded, this is even worse. This proves that vibe coding is bad."
> policy change was inserted into the regional Spanner tables
> This policy data contained unintended blank fields
> Service Control... pulled in blank fields... hit null pointer causing the binaries to go into a crash loop
No one wondered, "hm, maybe this isn't a good idea"?
I've been there. The product guy needs the new feature enabled for everyone, and he needs it yesterday. Suggestions of feature flagging are ignored outright. The feature is then shipped for every user, and fun ensues.
Another example of Hoare's "billion-dollar mistake" in multiple Google systems:
- Why is it possible to insert unintended "blank fields" (nulls)? The configuration should have a schema type that doesn't allow unintended nulls. Unfortunately Spanner itself is SQL-like, so fields must be declared NOT NULL explicitly; the default is nullable fields.
- Even so, the program that manages these policies will have its own type system and possibly an application level schema language for the configuration. This is another opportunity to make invalid states unrepresentable.
- Then in Service Control, there's an opportunity to enforce "schema on read" as you deserialize policies from the data store into application objects; again, either a programming-language type or an application-level schema could be used to validate that policy rows have the expected shape before they leave the data layer. Perhaps the null pointer error occurred in this layer, but since this issue occurred in a new code path, it sounds more likely that the invalid data escaped the data layer into application code.
- Finally, the Service Control application is written in a language that allows for null pointer references.
If I were a maintainer of this system, the minimally invasive change I would be thinking about is how to introduce an application-level schema for the policy writer and the policy reader that uses a "tagged enum type" or "union type" or "sum type" to represent policies, so that null cannot be expressed (a rough sketch follows below). Ideally each new kind of policy could be expressed as a new variant added to the union type. You can add this in app code without rewriting the whole program in a safe language. Unfortunately it seems proto3, Google's usual schema language, doesn't have this constraint.
Example of one that does: https://github.com/stepchowfun/typical
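A rough Rust sketch of that sum-type idea (the policy variants are invented for illustration, not the actual Service Control schema): every variant carries only the non-nullable fields it needs, so a "blank" policy is unrepresentable, and the compiler forces every reader to handle every variant.

```rust
#[allow(dead_code)]
enum QuotaPolicy {
    PerMinuteLimit { project: String, limit: u64 },
    Unlimited { project: String },
    Blocked { project: String, reason: String },
}

fn allow_request(policy: &QuotaPolicy, used_this_minute: u64) -> bool {
    // Adding a new policy variant is a compile error here until it is handled;
    // there is no "null policy" case to forget about.
    match policy {
        QuotaPolicy::PerMinuteLimit { limit, .. } => used_this_minute < *limit,
        QuotaPolicy::Unlimited { .. } => true,
        QuotaPolicy::Blocked { .. } => false,
    }
}

fn main() {
    let p = QuotaPolicy::PerMinuteLimit { project: "my-project".into(), limit: 1000 };
    assert!(allow_request(&p, 10));
}
```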
- All the code has unit tests and integration tests
- Binary and config file changes roll out slowly job by job, region by region, typically over several days. Canary analysis verifies these slow rollouts.
- Even panic rollbacks are done relatively slowly to avoid making the situation worse. For example globally overloading databases with job restarts. A 40m outage is better than a 4 hour outage.
I have no insider knowledge of this incident, but my read of the PM is: The code was tested, but not this edge case. The quota policy config is not rolled out as a config file, but by updating a database. The database was configured for replication which meant the change appeared in all the databases globally within seconds instead of applying job by job, region by region, like a binary or config file change.
I agree on the frustration with null pointers, though if this was a situation the engineers thought was impossible, it could just as easily have been an assert() in another language making all the requests fail policy checks.
Rewriting a critical service like this in another language seems way higher risk than making sure all policy checks are flag guarded, that all quota policy checks fail open, and that db changes roll out slowly region by region.
Disclaimer: this is all unofficial and my personal opinions.
so... it wasn't tested
Asserts are much easier to forbid by policy.
Which definitely seems like a shortcut: "on to the next thing, and ignore QA due diligence."
It does say something that core Google products don’t or won’t take a dependency on Google Cloud…
This reads to me like someone finally won an argument they’d been having for some time.
The most important thing to look at is how much had to go wrong for this to surface. It had to be a bug without test coverage that wasn't covered by staged rollouts or guarded by a feature flag. That essentially means a config-in-db change. Detection was fast, but rolling out the fix was slow out of fear of making things worse.
The NPE aspect is less interesting. It could have been any number of similar "this can't happen" errors: mutually exclusive fields present in a JSON object, say, with the handling logic doing funny things. Validation during mutation makes sense, but the rollout strategy is more important, since it can catch and mitigate things you haven't thought of.
Is this information available anywhere?
Error handling is very basic; the only explanation for this kind of bad code getting pushed to prod is LLMs and high trust in LLM automation.
They won't admit this publicly anyway; there is too much money invested in LLMs.
This is a dumbfounding level of mistake for an organization such as Google.
In addition it looks like the code was not ready for production and the mistake was not gating it behind a feature flag. It didn't go through the normal release process.
I don't think "completely untested" is correct, but testing way below expectations for such a structural piece of code is a lesson they should learn; it does look like an amateur-hour mistake.