Snapchat heavily used Google AppEngine to scale. This was basically a magical Java runtime that would 'hot path split' the monolithic service into lambda-like worker pools. Pretty crazy, but it worked well.
Snapchat leaned very heavily on this though and basically let Google build the tech that allowed them to scale up instead of dealing with that problem internally. At one point, Snap was >70% of all GCP usage. And this was almost all concentrated on ONE Java service. Nuts stuff.
Anyway, eventually Google was no longer happy with supporting this, and the corporate way of breaking up is "hey, we're gonna charge you 10x what you paid last year for this, kay?" (I don't know if it was actually 10x. It was just a LOT more.)
So began the migration towards Kubernetes and AWS EKS. Snap was one of the pilot customers for EKS before it was generally available, iirc. (I helped work on this migration in 2018/2019)
Now, 6+ years later, I don't think Snap heavily uses GCP for traffic unless they migrated back. And this outage basically confirms that :P
GCP is behind in market share, but has the incredible cheat advantage of just not being Amazon. Most retailers won't touch Amazon services with a ten foot pole, so the choice is GCP or Azure. Azure is way more painful for FOSS stacks, so GCP has its own area with only limited competition.
Honestly as a (very small) shareholder in Amazon, they should spin off AWS as a separate company. The Amazon brand is holding AWS back.
Big monopolists do not unlock more stock market value, they hoard it and stifle it.
However, I have seen many people flee from GCP because Google lacks customer focus, is quick to kill services, seems not to care about external users, and because people plain don't trust Google with their code, data or reputation.
0: https://chrpopov.medium.com/scaling-cloud-infrastructure-5c6...
1: https://eng.snap.com/monolith-to-multicloud-microservices-sn...
The general idea being that you're losing money due to opportunity cost.
Personally, I think you're better off not laying people off and having them work on the less (but still) profitable stuff. But I'm not in charge.
If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
Exactly - I've had clients say, "We'll pay for hot standbys in the same region, but not in another region. If an entire AWS region goes down, it'll be in the news, and our customers will understand, because we won't be their only service provider that goes down, and our clients might even be down themselves."
Show up at a meeting where a whole bunch of people appear to have wet themselves, and we’ll all agree not to mention it ever again…
They measure uptime using averages of "any part of the chain is even marginally working".
People, however, experience downtime as "any part of the chain is degraded".
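A toy sketch of that gap (made-up numbers, not how any provider actually computes its SLA): score each component by whether it responded at all and average, versus score each minute by whether the whole chain was fully healthy.

    # Toy illustration of "averaged component uptime" vs. "whole chain healthy".
    # Status per minute: 1.0 = healthy, 0.5 = degraded, 0.0 = down (made-up data).
    chain = {
        "dns":      [1.0, 1.0, 0.5, 1.0, 1.0, 1.0],
        "api":      [1.0, 0.5, 0.5, 1.0, 1.0, 1.0],
        "database": [1.0, 1.0, 1.0, 0.5, 1.0, 1.0],
    }
    minutes = len(next(iter(chain.values())))

    # Provider-style view: a component counts as "up" if it responded at all (> 0),
    # and the per-component percentages are averaged.
    provider_uptime = sum(
        sum(1 for s in statuses if s > 0) / minutes for statuses in chain.values()
    ) / len(chain)

    # User view: a minute only counts if every component in the chain was fully healthy.
    user_uptime = sum(
        1 for minute in zip(*chain.values()) if all(s == 1.0 for s in minute)
    ) / minutes

    print(f"provider-style uptime:   {provider_uptime:.0%}")  # 100%
    print(f"user-experienced uptime: {user_uptime:.0%}")      # 50%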
They make lawyers happy, and they stop intelligence services from accessing the associated resources.
For example, no one would even consider accessing data from a European region without the right paperwork.
But yeah, that's pretty hard and there are other reasons customers might want to explicitly choose the region.
Yes, within the same region. Doing stuff cross-region takes a little bit more effort and cost, so many skip it.
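For a concrete flavour of that "little bit more effort and cost": replicating even just an S3 bucket into a second region needs a second bucket, versioning on both sides, an IAM role, and a replication rule. A boto3 sketch with placeholder bucket and role names, not a drop-in config:

    import boto3

    s3 = boto3.client("s3")

    # Both buckets must have versioning enabled before replication can be configured.
    for bucket in ("my-app-data", "my-app-data-dr"):  # placeholder names
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )

    # Replication rule: copy every new object into the DR bucket in the other region.
    # The role must allow reading from the source and replicating into the destination.
    s3.put_bucket_replication(
        Bucket="my-app-data",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder
            "Rules": [
                {
                    "ID": "replicate-to-dr-region",
                    "Priority": 1,
                    "Filter": {},  # empty filter = whole bucket
                    "Status": "Enabled",
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": "arn:aws:s3:::my-app-data-dr"},
                }
            ],
        },
    )

And that's only the data layer; failing compute and traffic over to the other region is where most of the extra effort and cost actually lands.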
It's ridiculous how everything is being stored in the cloud, even simple timers. It's long past time to move functionality back on-device, which would come with the advantage of making it easier to disconnect from big tech's capitalist surveillance state as well.
Assuming we're talking about hosting things for Internet users: my fiber internet connection has gone down multiple times, though it was restored relatively quickly. My power has gone out several times in the last year, with one storm knocking it out for nearly 24 hrs. I was asleep when it went out and didn't start the generator until it had been out for 3-4 hours already, far longer than my UPSes could hold up. I've had to do maintenance and updates, both physical and software.
All of those things contribute to downtime significantly higher than I see with my stuff running on Linode, Fly.io or AWS.
I run Proxmox and K3s at home and it makes things far more reliable, but it’s also extra overhead for me to maintain.
Most or all of those things could be mitigated at home, but at what cost?
If you had /two/ houses, in separate towns, you'd have better luck. Or, if you had cell as a backup.
Or: if you don't care about it being down for 12 hours.
These are the issues I've run into that have caused downtime in the last few years:
- 1x power outage: if I had set up automatic restart on power restore, it probably would have been down for 30-60 minutes; it ended up being a few hours (as I had to manually press the power button lol). Probably the longest non-self-inflicted issue.
- Twitch bot library issues: Just typical library bugs. Unrelated to self-hosting.
- IP changes: My IP actually barely ever changes, but I should set up DDNS (something like the sketch after this list). Fixable with self-hosting (but requires some amount of effort).
- Running out of disk space: Would be nice to be able to just increase it.
- Prooooooobably an internet outage or two, now that I think about it? Not enough that it's been a serious concern, though, as I can't think of a time that's actually happened. (Or I have a bad memory!)
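The DDNS item above is only a few lines of glue. A minimal sketch, assuming a Cloudflare-managed zone, with the zone ID, record ID and API token as placeholders (other DNS providers expose similar APIs):

    import requests

    # Placeholders: fill in your own zone/record IDs and an API token
    # scoped to DNS edits for that zone.
    ZONE_ID = "your-zone-id"
    RECORD_ID = "your-record-id"
    TOKEN = "your-api-token"
    HOSTNAME = "home.example.com"

    def update_ddns() -> None:
        # Ask a public "what is my IP" service for the current address.
        ip = requests.get("https://api.ipify.org", timeout=10).text.strip()

        # Point the A record at whatever the residential IP is right now.
        resp = requests.put(
            f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"type": "A", "name": HOSTNAME, "content": ip, "ttl": 300},
            timeout=10,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        update_ddns()  # run from cron every few minutes

Run it on a schedule and the hostname follows the residential IP whenever it changes.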
I think that's actually about it. I rely fairly heavily on my VPN+personal cloud as all my notes, todos, etc are synced through it (Joplin + Nextcloud), so I do notice and pay a lot of attention to any downtime, but this is pretty much all that's ever happened. It's remarkable how stable software/hardware can be. I'm sure I'll eventually have some hardware failure (actually, I upgraded my CPU 1-2 years ago because it turns out the Ryzen 1700 I was using before has some kind of extremely-infrequent issue with Linux that was causing crashes a couple times a month), but it's really nice.
To be clear, though, for an actual business project, I don't think this would be a good idea, mainly due to concerns around residential vs commercial IPs, arbitrary IPs connecting to your local network, etc that I don't fully pay attention to.
So blame humans even if an AI wrote some bad code.
Disagree; a human might be the cause/trigger, but the fault is pretty much always systemic. A whole lot of things have to happen for that last person to cause the problem.
Edit: and, more importantly, who governed the system, i.e. made decisions about maintenance, staffing, training, processes and so on.
"Oct 20 3:35 AM PDT The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution."
https://health.aws.amazon.com/health/status