We must be at the trillion dollar mistake by now, right?
Zig is fairly easy to adopt in existing C systems and is guaranteed null-safe (although not use-after-free-safe).
Rust, although quite safe, bears a fairly high adoption cost as existing code often cannot be ported directly.
Borgo (https://github.com/borgo-lang/borgo) is a nil-safe language that compiles to Go, so it is easily adoptable in existing Go systems.
In Scala, Kotlin, and Java you just can’t get away from null :/
Even in safe languages like Rust, you can still introduce errors in application code with the same semantics as a null pointer error: doing `some_optional.unwrap().someMethod()` will crash your program just as surely as `someNullableObject.someMethod()` does in Java.
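To make that concrete, here is a minimal Rust sketch (the `find_user` function is hypothetical) showing that `.unwrap()` on a missing value crashes at runtime, the same failure mode as a Java NPE; the only difference is that the crash site is explicit in the source:

```rust
// A lookup that may legitimately find nothing.
fn find_user(id: u32) -> Option<String> {
    if id == 42 { Some("admin".to_string()) } else { None }
}

fn main() {
    // Panics with "called `Option::unwrap()` on a `None` value":
    // the moral equivalent of a NullPointerException, just opted into explicitly.
    let name = find_user(7).unwrap();
    println!("{}", name.to_uppercase());
}
```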
As one example, I just learned (by way of a nasty production app crash) that Kotlin chose to make all checked exceptions silently unchecked. Kind of a stunning own-goal for a language from 2010.
In everyday Kotlin code, I see either a sealed class or Result for cases where you'd find checked exceptions in Java, and otherwise normal unchecked exceptions from `require`, `check`, and other precondition assertions.
But it gets a lot of hate in PL circles for reasons I don't completely understand.
My tl;dr:
The vlang team has repeatedly made false promises and assertions about what the language can do, ranging from a "we put things that are on our roadmap in our features list" to "we've promised something that is outright impossible". Analyzing how this has changed over the years, the only reasonable conclusion is that they're grifters who are okay with lying.
In addition, they regularly engage in flamewars about it on this site, and their moderation practices of silencing any sort of criticism in their own spaces and calling anyone who dares to question it an idiot are well known and documented.
You don't need to rewrite everything to prevent the majority of new bugs; it's enough to protect new code and keep the battle-tested stuff around.
If you are talking about the issue of migrating to Rust, well, rewriting hundreds of millions of lines of code never makes sense.
Not only would this kill developer productivity if it were true, but how could it even be done? When the compiler version bumps, that's a lot of changes that would need to be gated. Every team that works on a dependency would have to add feature flags for every change and bug fix they make.
Rules in aerospace are written in blood. Rules of internet software are written in inconvenience, so productivity is usually given much higher priority than reducing risk of catastrophic failure.
This also got a laugh:
> We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage. For some customers, the monitoring infrastructure they had running on Google Cloud was also failing, leaving them without a signal of the incident or an understanding of the impact to their business and/or infrastructure.
You should always have at least some kind of basic monitoring that's on completely separate infrastructure, ideally from a different vendor. (And maybe Google should too)
> It took up to ~2h 40 mins to fully resolve in us-central-1
This would have cost their customers tens of millions, maybe north of $100M.
Not surprised they'd have an extensive write-up like this.
During the rollout period, the combinatorial explosion of code paths is rather annoying to deal with. On the other hand, it does encourage not having too many things going on at once: if some of the changes affect business metrics, it will be hard to glean any insights when several of them overlap.
I guess the issue here is that if you're crash looping, once a task comes up it will generate load retrying to fetch the config, so even if you're no longer crash looping (and hence no longer backing off at Borg) you're still causing overload.
As long as the initial rate of tasks coming up is enough to cause overload, this will result in persisting the outage even once all tasks are up (assuming that the overload is sufficient to bring goodput of tasks becoming healthy to near zero).
Interestingly, one of the mitigations they applied was to fan out config reads to the multiregional mirrors of the database instead of just the regional us-central1 mirror; presumably the multiregional mirrors brought in significantly more capacity than the regional one, spreading the load.
I'd be curious to know how much configuration they're loading for it to cause that much load.
So no, it doesn't.
Hey, it's better than lurking undefined behavior and data corruption issues!
Doesn't keep your binary from crashing, though.
It does. If a function returns a Result or Option type, you can't just use its data; you have to either automatically propagate the error case upward, pattern match on both the success and error cases, or just use `.unwrap()` and explicitly choose to crash in the case of an error. There's no implicit crash like there is in a language that lets you dereference null pointers.
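A minimal Rust sketch of those three options, using a hypothetical `parse_port` helper (nothing to do with the incident's actual code):

```rust
use std::num::ParseIntError;

// Option 1: propagate the error upward with `?`; the caller must handle it.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    let port: u16 = s.trim().parse()?;
    Ok(port)
}

fn main() {
    // Option 2: pattern match on both cases; no implicit crash is possible.
    match parse_port("8080") {
        Ok(port) => println!("listening on {port}"),
        Err(e) => eprintln!("bad port config: {e}"),
    }

    // Option 3: explicitly opt into crashing; the crash is visible in the source.
    let port = parse_port("not-a-number").unwrap();
    println!("{port}");
}
```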
Processes and systems have flaws that can be fixed; humans will always make mistakes.
1) A global feature release that went everywhere at the same time
2) Null pointer dereference
3) Lack of appropriate retry policies that resulted in a thundering herd problem
All of these are absolutely standard mistakes that everyone who's worked in the industry for some time has seen numerous times. There is nothing novel here, no weird distributed-systems logic, no Google scale, just rookie mistakes all the way.
This is 100% a process problem.
Most existing software stacks (including Google's C++, Go, Java) are by no means in a position to solve this problem, but that doesn't change that it is, in fact, a problem that fundamentally can be solved using types.
Of course, that'd require a full rewrite of the service.
And when the failures are request-scoped, you're back to the outage not being global but affecting only the customers using this feature with a bad config.
> We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests.
It's clear that they have issues in the service itself not related to language.
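For what "fails open" could look like concretely, here is a rough sketch (hypothetical names and shapes, not Google's actual code): if the policy cannot be loaded or is malformed, the check logs the problem and allows the request instead of crashing or rejecting everything.

```rust
struct QuotaPolicy {
    limit_per_minute: u64,
}

// Stand-in for the real lookup against the policy store.
fn load_policy(project: &str) -> Result<QuotaPolicy, String> {
    if project.is_empty() {
        Err("blank policy row".to_string())
    } else {
        Ok(QuotaPolicy { limit_per_minute: 1000 })
    }
}

fn check_quota(project: &str, used_this_minute: u64) -> bool {
    match load_policy(project) {
        Ok(policy) => used_this_minute < policy.limit_per_minute,
        Err(e) => {
            // Fail open: log the problem and serve the request anyway,
            // rather than taking every API call down with the check.
            eprintln!("quota policy unavailable for {project:?}: {e}; failing open");
            true
        }
    }
}

fn main() {
    assert!(check_quota("my-project", 10));
    assert!(check_quota("", 10)); // a blank/malformed policy still serves
}
```

Whether failing open is acceptable obviously depends on what the check protects; for authorisation you would presumably still want to fail closed.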
Even better if the type checker specifically highlights the fact that a value can be zero, and prevents compilation of code that doesn't specifically take this possibility into account.
* Not dealing with null data properly
* Not testing it properly
* Not having test coverage showing your new thing is tested
* Not exercising it on a subset of prod after deployment to show it works without falling over before it gets pushed absolutely everywhere
Standards in this industry have dropped over the years, but by this much? If you had done this 10 years ago as a Google customer for something far less critical, everyone on their side would have been smugly lolling at you, and rightly so.
> Not dealing with null data properly
This is _hardly_ a "junior level mistake". That kind of bug is pervasive in all the languages they're likely using for this service (Go, Java, C++) written even by the most "senior" developers.
This was reading fields from a database that were coming back null, not some pointer that becomes null after a series of nasty state transitions, so this is very much in the junior category.
I haven't made a mistake like that in many, many years. I'm still very much guilty at times of other kinds of null issues from internal logic, but definitely not "I forgot to handle a null from a data source I read from". It's so pervasive and common that you definitely learn from your mistakes.
Amateur hour in Mountain View.
For a company the size and quality of Google to be bringing down the majority of their stack with this type of error really suggests they do not implement appropriate mitigations after serious issues.
They rolled out a change without feature flagging, didn’t implement exponential backoffs in the clients, didn’t implement load shedding in the servers.
This is all in the Google SRE book from many years ago.
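For reference, the kind of client-side exponential backoff with jitter the SRE book describes looks roughly like this; a sketch only, not a claim about what Service Control's clients actually do:

```rust
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Retry an operation with capped exponential backoff plus jitter, so a fleet
// of restarting clients doesn't hammer the backend in lockstep.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
) -> Result<T, E> {
    assert!(max_attempts >= 1);
    let mut delay = Duration::from_millis(100);
    let cap = Duration::from_secs(30);
    for attempt in 1..=max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt == max_attempts => return Err(e),
            Err(_) => {
                // Cheap jitter without pulling in an RNG crate: derive it from the clock.
                let now_ms = SystemTime::now()
                    .duration_since(UNIX_EPOCH)
                    .unwrap_or_default()
                    .subsec_millis() as u64;
                let jitter = Duration::from_millis(now_ms % delay.as_millis().max(1) as u64);
                sleep(delay + jitter);
                delay = (delay * 2).min(cap);
            }
        }
    }
    unreachable!("the final attempt always returns above")
}
```

Feature flagging and server-side load shedding are separate mitigations, but the client half really is this small.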
I read their SRE books; all of this stuff is in there: https://sre.google/sre-book/table-of-contents/ https://google.github.io/building-secure-and-reliable-system...
Have standards slipped, or was the book just marketing?
HN likes to pretend that FAANG is the pinnacle of existence. The best engineers, the best standards, the most “that wouldn’t have happened here,” the yardstick by which all companies should be measured for engineering prowess.
Incidents like this repeatedly happening reveal that’s mostly a myth. They aren’t much smarter, their standards are somewhat wishful thinking, their accomplishments are mostly rooted in the problems they needed to solve just like any other company.
So I would say there is a difference between AWS architects and engineers (although I know firsthand that certain things are suboptimal, but...) and those of several other companies who have fewer customers but experienced successful attacks (or data loss). Even if you take Microsoft, there is a huge difference in security posture between AWS and Azure (and I say this as a big fan of the so-called "private cloud" (previously known as just your own infra)).
https://www.lastweekinaws.com/blog/azures_vulnerabilities_ar...
(2022)
I think you're only seeing what you want to see, because somehow bringing FANG engineers down a peg makes you feel better?
A broken deployment due to a once-in-a-lifetime configuration change in a project that wasn't allocated engineering effort to allow more robust and resilient deployment modes doesn't turn any engineer into an incompetent fool. Sometimes you need to flip a switch, and you can't spare a team working one year to refactor the whole thing.
This seems to imply that the person in charge at G was right to cause this outage... and that Google is very short-staffed and too poor to afford to do proper engineering work?
Somehow that doesn't inspire confidence in their engineering prowess. Sure seems to me that bad engineering leadership decisions are equivalent to bad engineering.
Opinions are my own.
But if you change a schema, be it DB, protobuf, whatever, this is the major thing your tests should be covering.
This is why people are so amazed by it.
The document also doesn't say there wasn't testing in staging or prod.
So do we. If you change the schema it doesn't matter whether it's unit testing: either you have a validation stage that checks these assumptions on all incoming data beforehand (the way many Python people use pydantic on incoming data, for example), and that must be updated for every change, or every unit must be defensive against this sort of thing.
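For a rough idea of what that validation stage can look like in a typed language, here is a hedged sketch (the `PolicyRow` shape is invented; it assumes serde with the derive feature plus serde_json as dependencies): non-optional fields make null or missing data fail at deserialization rather than deep in the serving path.

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct PolicyRow {
    project: String,       // non-optional: a null or missing field is rejected
    limit_per_minute: u64, // at the boundary instead of surfacing later
}

fn main() {
    let good = r#"{ "project": "my-project", "limit_per_minute": 1000 }"#;
    let blank = r#"{ "project": null, "limit_per_minute": 1000 }"#;

    match serde_json::from_str::<PolicyRow>(good) {
        Ok(row) => println!("accepted: {row:?}"),
        Err(e) => eprintln!("rejected: {e}"),
    }

    // The blank field is caught here ("invalid type: null, expected a string"),
    // which is where you want to find out about it.
    match serde_json::from_str::<PolicyRow>(blank) {
        Ok(row) => println!("accepted: {row:?}"),
        Err(e) => eprintln!("rejected: {e}"),
    }
}
```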
Google’s standards, and from what I can tell, most FAANG standards are like beauty filters on Instagram. Judging yourself, or any company, against them is delusional.
- And they used machines without ECC and their index got corrupted because of it? And instead of hanging their heads in shame and taking lessons from IBM old-timers, they published a paper about it?
- What really accelerated the demise of Google+ was that an API issue allowed the harvesting of private profile fields for millions of users, and they hid that for months fearing the backlash...
Don't worry, you will have plenty more outages from the land of "we only hire the best"...
On GKE, we see different services (like Postgres and NATS) running on the same VM in different containers receive/send stream contents (e.g. HTTP responses) where the packets of the stream have been mangled with the contents of other packets. We've been seeing it since 2024, and all the investigation we've done points to something outside our apps and deeper in the system. We've only seen it in one Kubernetes cluster, and it lasts 2-3 hours and then magically resolves itself; draining the node also fixes it.
If there are physical nodes with faulty RAM, I bet something like this could happen. Or there's a bug in their SDN or their patched version of the Linux kernel.
All the standard tools for binary rollouts and config pushes will typically do some kind of gradual rollout.
In some ways Google Cloud had actually greatly improved the situation since a bunch of global systems were forced to become regional and/or become much more reliable. Google also used to have short global outages that weren't publicly remarked on (at the time, if you couldn't connect to Google, you assumed your own ISP was broken), so this event wasn't as rare as you might think. Overall I don't think there is a worsening trend unless someone has a spreadsheet of incidents proving otherwise.
[I was an SRE at Google several years ago]
> a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds
If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag that “quota policy changes must use the standard gradual rollout tooling.”
The baseline can’t be “the trend isn’t worsening;” the baseline should be that if global config rollouts are commonly the cause of problems, there should be increasingly elevated standards for when config systems can bypass best practices. Clearly that didn’t happen here.
I hope the pendulum swings the other way around now in the discussion.
[disclaimer that I worked as a GCP SRE for a long time, but left a while back]
If I understood it correctly, this service checks proper authorisation among other things, so isn’t failing open a security risk?
The downstream effects tend to be pretty grim, and to make things worse, they start to show up only after six months. It's also a coin flip whether the decision will be reversed after another major outage, itself directly attributable to the decisions made in the aftermath of the previous one.
What makes these kinds of issues particularly challenging is that by their very definition, the conditions and rules will be codified deep inside nested error handling paths. As an engineer maintaining these systems, you are outside of the battle tested happy paths and first-level unhappy paths. The conditions to end up in these second/third-level failure modes are not necessarily well understood, let alone reproducible at will. It's like writing code in C, and having all your multi-level error conditions be declared 'volatile' because they may be changed by an external force at any time, behind your back.
I mean, it's not like they don't have that technology: the worldwide sync was exactly what caused the outage.
At $WORK we use Consul for this job.
It sounds to me like something needed to be recompiled and redeployed.
> Without the appropriate error handling, the null pointer caused the binary to crash.
Even worse if this was AI-generated C or C++ code; was this not tested before deployment?
This is why you write tests before the actual code, and why vibe-coding is a scam as well. This would also never have happened if it had been written in Rust.
I expect far better than this from Google, and yet we are still dealing with null pointer crashes to this day.
"If this was vibe coded, this is even worse. This proves that vibe coding is bad."
> policy change was inserted into the regional Spanner tables
> This policy data contained unintended blank fields
> Service Control... pulled in blank fields... hit null pointer causing the binaries to go into a crash loop
No one wondered, "hm, maybe this isn't a good idea"?
I've been there. The product guy needs the new feature enabled for everyone, and he needs it yesterday. Suggestions of feature flagging are ignored outright. The feature is then shipped for every user, and fun ensues.
Another example of Hoare's "billion-dollar mistake" in multiple Google systems:
- Why is it possible to insert unintended "blank fields" (nulls)? The configuration should have a schema type that doesn't allow unintended nulls. Unfortunately Spanner itself is SQL-like, so fields must be declared NOT NULL explicitly; the default is nullable fields.
- Even so, the program that manages these policies will have its own type system and possibly an application level schema language for the configuration. This is another opportunity to make invalid states unrepresentable.
- Then in Service Control, there's an opportunity to enforce "schema on read" as you deserialize policies from the data store into application objects; again, either a programming-language type or an application-level schema could be used to validate that policy rows have the expected shape before they leave the data layer. Perhaps the null pointer error occurred in this layer, but since this issue occurred in a new code path, it sounds more likely that the invalid data escaped the data layer into application code.
- Finally, the Service Control application is written in a language that allows for null pointer references.
If I were a maintainer of this system, the minimally invasive change I would be thinking about is how to introduce an application-level schema for the policy writer and the policy reader that uses a "tagged enum type" or "union type" or "sum type" to represent policies, so that null cannot be expressed (a rough sketch follows below). Ideally each new kind of policy could be expressed as a new variant added to the union type. You can add this in app code without rewriting the whole program in a safe language. Unfortunately it seems proto3, Google's usual schema language, doesn't have this constraint.
Example of one that does: https://github.com/stepchowfun/typical
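A rough Rust sketch of that sum-type idea (the policy variants are invented for illustration, not the actual Service Control schema): every variant carries only the non-nullable fields it needs, so a "blank" policy is unrepresentable, and the compiler forces every reader to handle every variant.

```rust
#[allow(dead_code)]
enum QuotaPolicy {
    PerMinuteLimit { project: String, limit: u64 },
    Unlimited { project: String },
    Blocked { project: String, reason: String },
}

fn allow_request(policy: &QuotaPolicy, used_this_minute: u64) -> bool {
    // Adding a new policy variant is a compile error here until it is handled;
    // there is no "null policy" case to forget about.
    match policy {
        QuotaPolicy::PerMinuteLimit { limit, .. } => used_this_minute < *limit,
        QuotaPolicy::Unlimited { .. } => true,
        QuotaPolicy::Blocked { .. } => false,
    }
}

fn main() {
    let p = QuotaPolicy::PerMinuteLimit { project: "my-project".into(), limit: 1000 };
    assert!(allow_request(&p, 10));
}
```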
- All the code has unit tests and integration tests
- Binary and config file changes roll out slowly job by job, region by region, typically over several days. Canary analysis verifies these slow rollouts.
- Even panic rollbacks are done relatively slowly to avoid making the situation worse. For example globally overloading databases with job restarts. A 40m outage is better than a 4 hour outage.
I have no insider knowledge of this incident, but my read of the PM is: The code was tested, but not this edge case. The quota policy config is not rolled out as a config file, but by updating a database. The database was configured for replication which meant the change appeared in all the databases globally within seconds instead of applying job by job, region by region, like a binary or config file change.
I agree on the frustration with null pointers, though if this was a situation the engineers thought was impossible, it could just as easily have been an assert() in another language making all the requests fail policy checks.
Rewriting a critical service like this in another language seems way higher risk than making sure all policy checks are flag guarded, that all quota policy checks fail open, and that db changes roll out slowly region by region.
Disclaimer: this is all unofficial and my personal opinions.
so... it wasn't tested
Asserts are much easier to forbid by policy.
Which definitely seems like a shortcut: "on to the next thing, and ignore QA due diligence."
It does say something that core Google products don’t or won’t take a dependency on Google Cloud…
This reads to me like someone finally won an argument they’d been having for some time.
The most important thing to look at is how much had to go wrong for this to surface. It had to be a bug without test coverage that wasn't covered by staged rollouts or guarded by a feature flag. That essentially means a config-in-db change. Detection was fast, but rolling out the fix was slow out of fear of making things worse.
The NPE aspect is less interesting. It could have been any number of similar "this can't happen" errors: mutually exclusive fields present in a JSON object, say, with the handling logic doing funny things. Validation during mutation makes sense, but the rollout strategy is more important, since it can catch and mitigate things you haven't thought of.
Is this information available anywhere?
Error handling is very basic; the only explanation for this kind of bad code getting pushed to prod is LLMs and high trust in LLM automation.
They won't admit this publicly anyway; there is too much money invested in LLMs.
This is a dumbfounding level of mistake for an organization such as Google.
In addition it looks like the code was not ready for production and the mistake was not gating it behind a feature flag. It didn't go through the normal release process.
I don't think "completely untested" is correct, but testing way below expectations for such a structural piece of code is a lesson they should learn; it does look like an amateur-hour mistake.