They need to get a grip on this.
myrepo git:(fix/context-types-settings) gp
ERROR: user:1234567:user
fatal: Could not read from remote repository.
myrepo git:(fix/context-types-settings) ssh -o ProxyCommand=none git@github.com
PTY allocation request failed on channel 0
Hi user! You've successfully authenticated, but GitHub does not provide shell access.
Connection to github.com closed.
Remember hotmail :)
we need more sovereignty and decentralization.
The underlying tech is still decentralized, but what good does that do when we've made everything that uses it dependent on a few centralized services?
You lost me there
So I don't get why the project has "lost you", but I also suspect you're the kind of person any project could readily afford to lose as a user.
I do realize that we're trying to pack quite a bit of information in this sentence/tagline. I think it's reasonably well phrased, but for the uninitiated might require some "unpacking" on their end.
If we "lost you" on that tagline, and my explanation or that of hungariantoast (which is correct as well) helped you understand, I would appreciate if you could criticize more constructively and suggest a better way to introduce these features in a similarly dense tagline, or say what else you would think is a meaningful but short explanation of the project. If you don't care to do that, that's okay, but Radicle won't be able to improve just based on "you lost me there".
In case you actually understood the sentence just fine and we "lost you" for some other reason, I would appreciate if you could elaborate on the reason.
Edit: ugh... if you rely on GH Actions for workflows, though, actions/checkout@v4 is also currently experiencing the git issues, so no dice if you depend on that.
Where is your god now, proponents of immutable filesystems?!
I started packing things into docker containers because of that. Makes it a bit more of a hassle to change things in production.
At the largest place I did have prod creds for everything, because sometimes they are necessary and I had the seniority (sometimes you do need them in an "oh crap" scenario).
They were all set up on a second account on my work Mac, which had a "Danger, Will Robinson" wallpaper, because I know myself: far, far too easy to mentally fat-finger when you have two sets of creds.
I actually had the privilege of being sent to the server.
Because my suggestion that they have a spare ADSL connection for out-of-band stuff was an unnecessary expense... 'til he broke the firewall, knocked a bunch of folks offline across a huge physical site, and locked himself out of everything.
The spare line got fitted the next month.
They done borked it good.
ERROR: no healthy upstream
fatal: Could not read from remote repository.
"Why do we need so many people to keep things running!?! We never have downtime!!"
The funny thing is that the over hiring during the pandemic also had the predictable result of mass lay-offs.
Whoever manages HR should be the ones fired after two back-to-back disasters like this.
I utterly hate being at the mercy of a third party with an afterthought of a "status page" to stare at.
We self-host GitLab, but the team owning it is having a hard time scaling it. From my understanding talking to them, the design of gitaly makes it very hard to scale beyond a certain repo size and number of pushes per day (for reference: our repos are GBs in size, ~1M commits, hundreds of merges per day).
Self-hosted Gitlab periodically blocks access for auto-upgrades. Github.com upgrades are usually invisible.
Github.com is periodically hit with the broad/systemic cloud-outage. Self-hosted Gitlab is more decentralized infra, so you don't have the systemic outages.
With self-hosted Gitlab, you're likely to have to deal with rude bots on your own. Github.com has an ops team that deals with the rude bots.
I'm sure the list goes on. (shrug)
Microsoft CEO says up to 30% of the company’s code was written by AI https://techcrunch.com/2025/04/29/microsoft-ceo-says-up-to-3...
Time to leak that.
The enterprise cloud in EU, US, and Australia has no issues.
If you look at the incident history, disruptions have been happening often in the public cloud for years already, before AI wrote code for them.
It's not just HTTPS, I can't push via SSH either.
I'm not convinced it's just "some" operations either; every single one I've tried fails.
And that's if you get a status page update at all.
A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.
B. If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.
Operations budget cuts/layoffs? Replacing critical components/workflows with AI? Just overall growing pains, where a service has outgrown what it was engineered for?
Thanks
Someone answered this morning, during the Cloudflare outage, that it's AI vibe coding, and I tend to think there is something true in this. At some point there might be some tiny grain of AI-generated code involved which starts the avalanche ending like this.
However, this is an unexpected bell curve. I wonder if GitHub is seeing more frequent adversarial action lately. Alternatively, perhaps there is a premature reliance on new technology at play.
I was trying to do a 1.0 release today. Codeberg went down for "10 minutes maintenance" multiple times while I was running my CI actions.
And then github went down.
Cursed.
What changed is how many services are affected when GitHub is having issues.
A lot of people are pointing to AI vibe coding as the cause, but I think more often than not, incidents happen due to poor maintenance of legacy code. But I guess this may be changing soon as AI written code starts to become "legacy" faster than regular code.
FWIW Microsoft is convinced moving Github to Azure will fix these outages
Your second point is a little disingenuous. Yes, Microsoft and Windows have been wildly successful from a cultural adoption standpoint. But that's not the point I was trying to argue.
Even if Windows weren't a dogshit product, which it is, Microsoft is a lot more than just an operating system. In the 90's they actively tried to sabotage any competition in the web space, and held web standards back by refusing to make Internet Explorer actually work.
So many weird paths we could have gone down it's almost strange Microsoft won.
1.) It's already a miracle Xerox PARC escaped their parent company's management for as long as they did.
3.) IBM was playing catch-up on the supercomputer front since the CDC 6400 in 1964. Arguably, they did finally catch up in the mid-late 80's with the 3090.
Gary was on a flight when IBM called up Digital Research looking for an OS for the IBM PC. Gary’s wife, Dorothy, wouldn’t sign an NDA without it going through Gary, and supposedly they never got negotiations back on track.
Instead they actively tried to murder open standards [1] that they viewed as competitive and normalized the antitrust nightmare that we have now.
I think by nearly any measure, Microsoft is not a net good. They didn't invent the operating system, there were lots of operating systems that came out in the 80's and 90's, many of which were better than Windows, that didn't have the horrible anticompetitive baggage attached to them.
[1] https://en.wikipedia.org/wiki/Embrace,_extend,_and_extinguis...
A few decades back Microsoft were first to the prize with asynchronous JavaScript, Silverlight really was Flash done better and still missed, a proper extension of their VB6/MFC client & dev experience out to the web would have gobbled up a generation of SaaS offerings, and they had a first-in-class data analysis framework with integrated REPL that nailed the central demands of distributed/cloud-first systems and systems configuration (F#). That on top of near-perfect control of the document and consumer desktop ecosystems and some nutty visualization & storage capabilities.
Plug a few of their demos from 2002 - 2007 together and you’ve got a stack and customer experience we’re still hurting for.
In fact all of your points are only true if we accept that Windows would be the only operating system.
Microsoft half-asses most things. If they had taken over the internet, we would likely have the entirety of the internet be even more half-assed than it already is.
https://www.zdnet.com/article/ms-moving-hotmail-to-win2000-s...
> In 2002, the amusement continued when a network security outfit discovered an internal document server wide open to the public internet in Microsoft's supposedly "private" network, and found, among other things, a whitepaper[0] written by the hotmail migration team explaining why unix is superior to windows.
Hahaha, that whitepaper is pure gold!
[0]: https://web.archive.org/web/20040401182755/http://www.securi...
https://techrights.org/n/2025/08/12/Microsoft_Can_Now_Stop_R...
ever since Musk greenlit firing people again... CEOs can't wait to pull the trigger
2/ Then we cannot expect big tech to stay as sharp as in the 2000s and 2010s.
There was a time banks had all the smart people, then the telcos had them, etc. But people get older and too comfortable, layers of bad incentives and politics accumulate, and you just become a dysfunctional big mess.
I suspect (although have not researched) that global traffic is up, by throughput but also by session count.
This contributes to a lot more awareness. Slack being down wasn't impactful when most tech companies didn't use Slack. An AWS outage was less relevant when the 10 apps (used to be websites) you use most didn't rely on a single AZ in AWS or you were on your phone less.
I think as a society it just has more impact than it used to.
Be good to your site reliability engineers for the next few months... it's downtime season!
Among other mentioned factors like AI and layoffs: mass brain damage caused by never-ending COVID re-infections.
Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.
There are tons of studies to back this line of reasoning.
I think that "more coverage" is part of it, but also "more centralization." More and more of the web is centralized around a tiny number of cloud providers, because it's just extremely time-intensive and cost-prohibitive for all but the largest and most specialized companies to run their own datacenters and servers.
Three specific examples: Netflix and Dropbox do run their own datacenters and servers; Strava runs on AWS.
> If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.
I worked at AWS from 2020-2024, and saw several of these outages so I guess I'm "in the know."
My somewhat-cynical take is that a lot of these services have grown enormously in complexity, far outstripping the ability of their staff to understand them or maintain them:
- The OG developers of most of these cloud services have moved on. Knowledge transfer within AWS is generally very poor, because it's not incentivized, and has gotten worse due to remote work and geographic dispersion of service teams.
- Managers at AWS are heavily incentivized to develop "new features" and not to improve the reliability, or even security, of their existing offerings. (I discovered numerous security vulnerabilities in the very-well-known service that I worked for, and was regularly punished-rather-than-rewarded for trying to get attention and resources on this. It was a big part of what drove me to leave Amazon. I'm still sitting on a big pile of zero-day vulnerabilities in ______ and ______.)
- Cloud services in most of the world are basically a 3-way oligopoly between AWS, Microsoft/Azure, and Google. The costs of switching from one provider to another are often ENORMOUS due to a zillion fiddly little differences and behavior quirks ("bugs"). It's not apparent to laypeople — or even to me — that any of these providers are much more or less reliable than the others.
It's more work and slower. I'm convinced half of the reason they keep it that way is because the barrier to entry is higher and it scares contributors away.
You mean, assuming everyone in the conversation is using different email providers. (ie. Not the company wide one, and not gmail... I think that covers 90% of all email accounts in the company...)
Well you can with some effort. But there's certainly some inconvenience.
But yes ssh pushing was down, was my first clue.
My work laptop had just been rebooted (it froze...) and the CPU was pegged by security software doing a scan (insert :clown: emoji), so I just wandered over to HN and learned of the outage at that point :)
The downtime we do have each year is typically also on our terms, not in the middle of a work day or at a critical moment.
The reason for buying centralized cloud solutions is not uptime, it's to save the headache of developing and maintaining the thing.
Multi-AZ RDS is 100% higher availability than me managing something.
Anecdotal, but ¯\_(ツ)_/¯
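To make that concrete (a hedged sketch only; the instance identifier, class, and sizes below are made up, not from any setup mentioned here), the multi-AZ part is essentially one flag at creation time:

    aws rds create-db-instance \
      --db-instance-identifier app-db \
      --engine postgres \
      --db-instance-class db.t3.medium \
      --allocated-storage 100 \
      --master-username appuser \
      --master-user-password 'change-me' \
      --multi-az
    # --multi-az keeps a synchronous standby in another AZ and handles failover,
    # which is exactly the part that's painful to build and babysit yourself.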
Most software doesn’t need to be distributed. But it’s the growth paradigm where we build everything on principles that can scale to world-wide low-latency accessibility.
A UNIX pipe gets replaced with a $1200/mo. maximum IOPS RDS channel, bandwidth not included in price. Vendor lock-in guaranteed.
Meaning the cloud may go down more frequently than small-scale self-hosted deployments; however, downtimes are on average much shorter on the cloud. A lot of money is at stake for cloud providers, so GitHub et al. have the resources to put toward fixing a problem, compared to you or me when self-hosting.
On the other hand, when things go down self-hosted, it is far more difficult or expensive to have on-call engineers who can actually restore services quickly.
The skill to understand and fix a problem is limited, so it takes longer for semi-skilled talent to do so, while the failure modes are simpler but not simple.
The skill difference between setting up something that works locally and something that works reliably is vast. Talent capable of the latter is scarce to find or retain.
I’ve seen that work first hand to keep critical stuff deployable through several CI outages, and also has the upside of making it trivial to debug “CI issues”, since it’s trivial to run the same target locally
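For illustration, a minimal sketch of that pattern (the script name and the make targets here are hypothetical): CI invokes one entry point, and developers run the exact same thing locally when CI is down.

    #!/usr/bin/env sh
    # ci.sh -- single entry point that both the CI config and developers call
    set -eu
    case "${1:-all}" in
      lint)  make lint ;;
      test)  make test ;;
      build) make build ;;
      all)   "$0" lint; "$0" test; "$0" build ;;
      *)     echo "usage: $0 [lint|test|build|all]" >&2; exit 2 ;;
    esac

The CI config then shrinks to a single "./ci.sh all" step, so a provider outage never blocks running the same checks on a laptop.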
You should aim for this but there are some things that CI can do that you can't do on your own machine, for example running jobs on multiple operating systems/architectures. You also need to use CI to block PRs from merging until it passes, and for merge queues/trains to prevent races.
Ended up expanding this little quip into a blogpost to refer to in the future, feedback welcome! https://tech.davis-hansson.com/p/ci-offgrid/
We moved to GHA b/c nobody ever got fired ^W^W^W^W leadership thought eng running CI was not a good use of eng time. (Without much question into how much time was actually spent on it… which was pretty close to none. Self-hosted stuff has high initial cost for the setup … and then just kinda runs.)
Ironically, one of our self-hosted CI outages was caused by Azure — we have to get VMs from somewhere, and Azure … simply ran out. We had to swap to a different AZ to merely get compute.
The big upside to a self-hosted solution is that when stuff breaks, you can hold someone over the fire. (Above, that would be me, unfortunately.) With Github? Nobody really cares unless it is so big, and so severe, that they're more or less forced to, and even then, the response is usually lackluster.
It just workz [;
However, since we use github.com for more than just git hosting, it is a SPOF in most cases, and we treat it as a snow day.
I am fairly certain that the vast majority comes from improper use (bypassing security measures, like riding on top of the cabin) or something going wrong during maintenance.
Maybe I've just been unlucky, but so far my experience with CI pipelines that have extra steps in them for compliance reasons is that they are full of actual security problems (like curl | bash, or like how you can poison a CircleCI cache using a branch nobody reviewed and pick up the poisoned dependency on a branch which was reviewed but didn't contain the poison).
Plus, it's a high value target with an elevated threat model. Far more likely to be attacked than each separate dev machine. Plus, a motivated user might build the software themselves out of paranoia, but they're unlikely to securely self host all the infra necessary to also run it through CI.
If we want it to be secure, the automation you're talking about needs to be runnable as part of a local build with tightly controlled inputs and deterministic output; otherwise it breaks the chain of trust between user and developer by being a hop in the middle which is more about a pinky promise and less about something you can verify.
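As a rough sketch of what "tightly controlled inputs" can look like in practice (the URL and filenames are hypothetical), pin the dependency and verify it against a checksum that's committed and reviewed, instead of curl | bash:

    set -eu
    # pinned release URL; tool.sha256 lives in the repo and is reviewed like any other change
    TOOL_URL="https://example.com/releases/tool-1.2.3.tar.gz"
    curl -fsSLo tool.tar.gz "$TOOL_URL"
    sha256sum -c tool.sha256        # aborts if the download doesn't match what was reviewed
    tar -xzf tool.tar.gz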
It would be great to also have the continuous build and test and whatever else you “need” to keep the project going as local alternatives as well. Of course.
[1] Or maybe there is just that much downtime on GitHub now that it can’t be shrugged off
You can commit, branch, tag, merge, etc and be just fine.
Now, if you want to share that work, you have to push.
Yes, you lose some convenience (GitHub's pull request UI can't be used, but you can temporarily use the other Git server's UI for that).
I think their point was that you're not fully locked in to GitHub. You have the repo locally and can mirror it on any Git remote.
It is awfully convenient, web interface, per branch permissions and such.
But you can choose a different server.
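For anyone who hasn't done it, a minimal sketch (the "backup" remote name and URL are placeholders for whatever other host you pick):

    # add a second remote and mirror your work there
    git remote add backup git@codeberg.org:yourname/myrepo.git
    git push backup --all      # every branch
    git push backup --tags
    # collaborators pull from the mirror until github.com recovers
    git pull backup main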
Maybe this will push more places towards self-hosting?
* jobs not being picked up
* jobs not being able to be cancelled
* jobs running but showing up as failed
* jobs showing up as failed but not running
* jobs showing containers as pushed successfully to GitHub's registry, but then we get errors while pulling them
* ID token failures (E_FAIL) and timeouts.
I don't know if this is related to GitHub moving to Azure, or because they're allowing more AI-generated code to pass through without proper reviews, or something else, but as a paying customer I am not happy.
I have been thinking about this a lot lately. What would be a tweak that might improve this situation?
Even if a website is down, someone somewhere most likely has it cached. Why can't I read it from their cache? If I'm trying to reach a static image file, why do I have to get it from the source?
I guess I want torrent DHT for the web.
How can C-suite stock RSU/comp/etc be tweaked to make them give a crap about this, or security?
---
Decades ago, I was a teenager and I realized that going to fancy hotel bars was really interesting. I looked old enough, and I was dressed well. This was in Seattle. I once overheard a low-level cellular company exec/engineer complain about how he had to climb a tower, and check the radiation levels (yes non-ionizing). But this was a low level exec, who had to take responsibility.
He joked about how while checking a building on cap hill, he waved his wand above his head, and when he heard the beeps... he noped tf out. He said that it sucked that he had to do that, and sign-off.
That is actually cool, and real engineering/responsibility at the executive level.
Can we please get more of that type of thing?
It's no coincidence that the clueless MBA who takes pride in knowing nothing about the business they're a part of proliferated during economic "spring time" -- low interest rates, genuine technological breakthroughs to capitalize on, early mover advantage, etc. When everyone is swimming in money, it's easier to get a slice without adequately proving why you deserve it.
Now we're in "winter." Interest rates are high, innovation is debatably slowing, and the previous early movers are having to prove their staying power.
All that to say: the bright side, I hope, of this pretty shitty time is that hopefully we don't _need_ to "put all this nerd talk into terms that someone in the average C-suite could understand," because hopefully the kinds of executives who are simultaneously building and running _tech companies_ and who are allergic to "nerd talk" will very simply fail to compete.
That's the free market (myth as it may often be in practice) at work -- those who are totally uninterested in the subject matter of their own companies aren't rewarded for their ignorance.
At least Microsoft decided we all deserve a couple-hour break from work.
this has broken a few pipeline jobs for me, seems like they're underplaying this incident
My guess is that it has to do with the Cloudflare outage this morning.
These companies are supposed to have the top people on site reliability. That these things keep happening and no one really knows why makes me doubt them.
Alternatively,
The takeaway for today: clearly, Man was not meant to have networked, distributed computing resources.
We thought we could gather our knowledge and become omniscient, to be as the Almighty in our faculties.
The folly.
The hubris.
The arrogance.
Our git server is hosted by Atlassian. I think we've had one outage in several years?
Our self hosted Jenkins setup is similarly robust, we've had a handful of hours of "Can't build" in again, several years.
We are not a company made up of rockstars. We are not especially competent at infrastructure. None of the dev teams have ever had to care about our infrastructure (occasionally we read a wiki or ask someone a question).
You don't have to live in this broken world. It's pretty easy not to. We had self hosted Mercurial and jenkins before we were bought by the megacorp, and the megacorp's version was even better and more reliable.
Self host. Stop pretending that ignoring complexity is somehow better.
Gemini 3 Pro, after suggesting 3 random things, announced GitHub was the issue.
the problem isn’t with centralized internet services, the problem is a fundamental flaw with http and our centralized client server model. the solution doesn’t exist. i’ll build it in a few years if nobody else does.
How many more outages until people start to see that farming out every aspect of their operations maybe, might, could have a big effect on their overall business? What's the breaking point?
Then again, the skills to run this stuff properly are getting more and more rare so we'll probably see more and more big incidents popping up more frequently like this as time goes on.
The VCs look at stars before deciding which open-core startup to invest in.
The 4 or 5 9s of reliability simply do not matter as much.
It feels like resiliency is becoming a bit of a lost art in networked software. I've spent a good chunk of this year chasing down intermittent failures at work, and I really underestimated how much work goes into shrinking the "blast radius", so to speak, of any bug or outage. Even though we mostly run a monolith, we still depend on a bunch of external pieces like daemons, databases, Redis, S3, monitoring, and third-party integrations, and we generally assume that these things are present and working in most places, which wasn't always the case. My response was to better document the failure conditions, and once I did, realize that there was many more than we initially thought. Since then we've done things like: move some things to a VPS instead of cloud services, automate deployment more than we already had, greatly improve the test suite and docs to include these newly considered failure conditions, and generally cut down on moving parts. It was a ton of effort, but the payoff has finally shown up: our records show fewer surprises which means fewer distractions and a much calmer system overall. Without that unglamorous work, things would've only grown more fragile as complexity crept in. And I worry that, more broadly, we're slowly un-learning how to build systems that stay up even when the inevitable bug or failure shows up.
For completeness, here are the outages that prompted this: the AWS us-east-1 outage in October (took down the Lightspeed R series API), the Azure Front Door outage (prevented Playwright from downloading browsers for tests), today’s Cloudflare outage (took down Lightspeed’s website, which some of our clients rely on), and the Github outage affecting basically everyone who uses it as their git host.
That's why it's always DNS right?
> No one wants to pay for resilience/redundancy
These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:
Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.
There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.
Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.
Which is why the “art” of engineering is reducing complexity while retaining functionality.
100%
> No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.
Well, fly by night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.
Look at a big bank or a big corporation's accounting systems, they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required.
They do have multiple layers of redundancies, and thus have the big budgets, but they won't be kept hot, or there will be some critical flaws that all of the engineers know about but they haven't been given permission/funding to fix, and are so badly managed by the firm, they dgaf either and secretly want the thing to burn.
There will be sustained periods of downtime if their primary system blips.
They will all still be dependent on some hyper-critical system that nobody really knows how it works, the last change was introduced in 1988 and it (probably) requires a terminal emulator to operate.
They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration costs for it which would easily cost that much again.
Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so it all paid for itself many times over during the course of that bug. Cold standby would not have cut it, especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.
Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.
Not my experience. Any banking I used, in multiple countries, had multiple and significant outages and some of them where their cards have failed to function. Do a search of "U.S. Bank outage" to see how many outages have happened so far this year.
† Hopefully there aren’t any hospitals that depends on GitHub being continuously available?
Microsoft: the film Idiocracy was not supposed to be a manual
https://thenewstack.io/github-will-prioritize-migrating-to-a...
Performance issues always scare me. A lot of the time it's indicative of fragile systems. Like with a lot of banking software - the performance is often bad because the software relies on 10 APIs to perform simple tasks.
I doubt this is the case with GitHub, but it still makes you wonder about their code and processes. Especially when it's been a problem for many years, with virtually no improvement.
This is what happens when they decide that all the budget should be spent on AI stuff rather than solid infra and devops
Small and scrappy startup -> taking on bigger customers for greater profits / ARR -> re-architecting for "enterprise" customers and resiliency / scale -> more idealism in engineering -> profit chasing -> product bloat -> good engineers leave -> replaced by other engineers -> failures expand.
This may be an acceptable lifecycle for individual companies as they each follow the destiny of chasing profits ultimately. Now picture it though for all the companies we've architected on top of (AWS, CloudFlare, GCP, etc.) Even within these larger organizations, they are comprised of multiple little businesses (eg: EC2 is its own business effectively - people wise, money wise)
Having worked at a $big_cloud_provider for 7 yrs, I saw this internally on a service level. What started as a foundational service, grew in scale, complexity, and architected for resiliency, slowly eroded its engineering culture to chase profits. Fundamental services becoming skeletons of their former selves, all while holding up the internet.
There isn't a singular cause here, and I can't say I know what's best, but it's concerning as the internet becomes more centralized into a handful of players.
tldr: how much of one's architecture and resiliency is built on the trust of "well (AWS|GCP|CloudFlare) is too big to fail" or "they must be doing things really well"? The various providers are not all that different from other tech companies on the inside. Politics, pressure, profit seeking.
But the small product also would not be able to handle any real amount of growth as it was, because it was a mess of tech debt and security issues and manual one-off processes and fragile spaghetti code that only Jeff knows because he wrote it in a weekend, and now he’s gone.
So by definition, if a service is large enough to serve a zillion people, it is probably big and bloated and complex.
I’m not disagreeing with you, I liked your comment and I’m just rambling. I have worked with several startups and was surprised at how poorly their tech scaled (and how riddled with security issues they were) as we got into it.
Nothing will shine a flashlight on all the stress cracks of a system like large-scale growth on the web.
Totally agree with your take as well.
I think the unfortunate thing is that there can exist a "Goldilocks zone" for this, where the service is capable of serving a zillion people AND is well architected. Unfortunately it can't seem to last forever.
I saw this in my career. More product SKUs were developed, new features/services defined by non-technical PMs, MBAs entered the chat, sales became the new focus over availability, and the engineering culture that made this possible eroded day by day.
The years I worked in this "Goldilocks zone" I'd attribute to:
- strong technical leadership at the SVP+ level that strongly advocated for security, availability, then features (in that order).
- a strong operational culture. Incidents were exciting internally, post mortems shared at a company wide level, no matter how small.
- recognition for the engineers who chased ambulances and kept things running, beyond their normal job, this inspired others to follow in their footsteps.
Git is distributed, it should be possible to put something between our servers and github which pulls from github when it's running and otherwise serves whatever it used to have. A cache of some sort. I've found the five year old https://github.com/jonasmalacofilho/git-cache-http-server which is the same sort of idea.
I've run a git instance on a local machine which I pull from, where a cron job fetches from upstream into it, which solved the problem of cloning llvm over a slow connection, so it's doable on a per-repo basis.
I'd like to replace it globally though because CI looks like "pull from loads of different git repos" and setting it up once per-repo seems dreadful. Once per github/gitlab would be a big step forward.
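For the per-repo version, a rough sketch of what that looks like (paths, the cron schedule, and the llvm-project URL are just examples):

    # one-time: create a bare mirror of the upstream repo
    git clone --mirror https://github.com/llvm/llvm-project.git /srv/git-cache/llvm-project.git

    # cron: refresh every 15 minutes; if github.com is down, the last good copy still serves
    # */15 * * * *  git -C /srv/git-cache/llvm-project.git remote update --prune

    # serve the cache read-only; CI clones from here instead of github.com
    git daemon --base-path=/srv/git-cache --export-all --reuseaddr
    git clone git://cache-host/llvm-project.git

Doing this once per host rather than once per repo is exactly the gap described above; this is still the per-repo flavour.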
That larger view held only by a small sliver of employees is likely why reliability is not a concern. That leads to the every team for themselves mentality. “It’s not our problem, and we won’t make it our problem so we don’t get dinged at review time” (ok that is Microsoft attitude leaking)
Then there’s their entrenched status. Real talk, no one is leaving GitHub. So customers will suck it up and live with it while angry employees grumble on an online forum. I saw this same attitude in major companies like Verio and Verisign in the early 2000s. “Yeah we’re down but who else are you going to go to? Have a 20% discount since you complained. We will only be 1% less profitable this quarter due to it” The kang and kodos argument personified.
These views are my own and not related to my employer or anyone associated with me.
I worked for one of the largest companies in my country; they had a "catch-up" with GitHub, and it is no longer about GitHub as you folks are used to, but about AI, aka Copilot.
We are seeing major techs, such as but not limited to Google, AWS and Azure, going down after making public that their code is 30% AI-generated (Google).
Even Xbox (Microsoft) and its gaming studios got destroyed (COD BO7) for heavy dependency on AI.
Don't you find it a coincidence that all of these system outages worldwide are happening right after they proudly shared their heavy dependency on AI??
Companies aren't using AI/ML to improve processes but to replace people, full stop. The AI stock market is having a massive meltdown as we speak, with indications that the AI bubble is bursting.
If you as a company wanna keep your productivity at 99.99% from now on:
* GitLab: self-hosted GitLab/runners.
* Datacenter: AWS/GCP/Azure is no longer a safe option, or cheaper; we have data center companies such as Equinix which have a massive backup plan in place. I have visited one: they are prepared for a nuclear war, and I am not even being dramatic. If I was starting a new company in 2025, I would go back to a datacenter over AWS/GCP/Azure.
* Self-host everything you can, and no, it does not require 5 days in the office to manage all of that.
As the AI bubble goes sideways, you don't know how your company data is being held; Copilot uses GitHub to train its AI, for instance. Yes, the big company I work for had a clause that forbids GitHub from using the company's repos for AI training.
How many companies can afford having a dedicated GitHub team to speak to?? How many companies read the contracts or have any say??
Not many really.
Yeah sure, cloud is easier, you just pay the bills, but at what cost??
it's up now (the incident, not the outage)