
Show HN: Empusa – Visual debugger to catch and resume AI agent retry loops

https://github.com/justin55afdfdsf5ds45f4ds5f45ds4/EmpusaAI
1•justinlord•14s ago•0 comments

Show HN: Bitcoin wallet on NXP SE050 secure element, Tor-only open source

https://github.com/0xdeadbeefnetwork/sigil-web
1•sickthecat•2m ago•0 comments

White House Explores Opening Antitrust Probe on Homebuilders

https://www.bloomberg.com/news/articles/2026-02-06/white-house-explores-opening-antitrust-probe-i...
1•petethomas•2m ago•0 comments

Show HN: MindDraft – AI task app with smart actions and auto expense tracking

https://minddraft.ai
1•imthepk•7m ago•0 comments

How do you estimate AI app development costs accurately?

1•insights123•8m ago•0 comments

Going Through Snowden Documents, Part 5

https://libroot.org/posts/going-through-snowden-documents-part-5/
1•goto1•9m ago•0 comments

Show HN: MCP Server for TradeStation

https://github.com/theelderwand/tradestation-mcp
1•theelderwand•12m ago•0 comments

Canada unveils auto industry plan in latest pivot away from US

https://www.bbc.com/news/articles/cvgd2j80klmo
1•breve•13m ago•0 comments

The essential Reinhold Niebuhr: selected essays and addresses

https://archive.org/details/essentialreinhol0000nieb
1•baxtr•15m ago•0 comments

Rentahuman.ai Turns Humans into On-Demand Labor for AI Agents

https://www.forbes.com/sites/ronschmelzer/2026/02/05/when-ai-agents-start-hiring-humans-rentahuma...
1•tempodox•17m ago•0 comments

StovexGlobal – Compliance Gaps to Note

1•ReviewShield•20m ago•1 comments

Show HN: Afelyon – Turns Jira tickets into production-ready PRs (multi-repo)

https://afelyon.com/
1•AbduNebu•21m ago•0 comments

Trump says America should move on from Epstein – it may not be that easy

https://www.bbc.com/news/articles/cy4gj71z0m0o
5•tempodox•21m ago•1 comments

Tiny Clippy – A native Office Assistant built in Rust and egui

https://github.com/salva-imm/tiny-clippy
1•salvadorda656•26m ago•0 comments

LegalArgumentException: From Courtrooms to Clojure – Sen [video]

https://www.youtube.com/watch?v=cmMQbsOTX-o
1•adityaathalye•29m ago•0 comments

US moves to deport 5-year-old detained in Minnesota

https://www.reuters.com/legal/government/us-moves-deport-5-year-old-detained-minnesota-2026-02-06/
4•petethomas•32m ago•2 comments

If you lose your passport in Austria, head for McDonald's Golden Arches

https://www.cbsnews.com/news/us-embassy-mcdonalds-restaurants-austria-hotline-americans-consular-...
1•thunderbong•36m ago•0 comments

Show HN: Mermaid Formatter – CLI and library to auto-format Mermaid diagrams

https://github.com/chenyanchen/mermaid-formatter
1•astm•52m ago•0 comments

RFCs vs. READMEs: The Evolution of Protocols

https://h3manth.com/scribe/rfcs-vs-readmes/
2•init0•59m ago•1 comments

Kanchipuram Saris and Thinking Machines

https://altermag.com/articles/kanchipuram-saris-and-thinking-machines
1•trojanalert•59m ago•0 comments

Chinese chemical supplier causes global baby formula recall

https://www.reuters.com/business/healthcare-pharmaceuticals/nestle-widens-french-infant-formula-r...
2•fkdk•1h ago•0 comments

I've used AI to write 100% of my code for a year as an engineer

https://old.reddit.com/r/ClaudeCode/comments/1qxvobt/ive_used_ai_to_write_100_of_my_code_for_1_ye...
2•ukuina•1h ago•1 comments

Looking for 4 Autistic Co-Founders for AI Startup (Equity-Based)

1•au-ai-aisl•1h ago•1 comments

AI-native capabilities, a new API Catalog, and updated plans and pricing

https://blog.postman.com/new-capabilities-march-2026/
1•thunderbong•1h ago•0 comments

What changed in tech from 2010 to 2020?

https://www.tedsanders.com/what-changed-in-tech-from-2010-to-2020/
3•endorphine•1h ago•0 comments

From Human Ergonomics to Agent Ergonomics

https://wesmckinney.com/blog/agent-ergonomics/
1•Anon84•1h ago•0 comments

Advanced Inertial Reference Sphere

https://en.wikipedia.org/wiki/Advanced_Inertial_Reference_Sphere
1•cyanf•1h ago•0 comments

Toyota Developing a Console-Grade, Open-Source Game Engine with Flutter and Dart

https://www.phoronix.com/news/Fluorite-Toyota-Game-Engine
2•computer23•1h ago•0 comments

Typing for Love or Money: The Hidden Labor Behind Modern Literary Masterpieces

https://publicdomainreview.org/essay/typing-for-love-or-money/
1•prismatic•1h ago•0 comments

Show HN: A longitudinal health record built from fragmented medical data

https://myaether.live
1•takmak007•1h ago•0 comments

Behind the scenes: Redpanda Cloud's response to the GCP outage

https://www.redpanda.com/blog/gcp-outage-june-redpanda-cloud
77•eatonphil•7mo ago

Comments

RadiozRadioz•7mo ago
Hmm. Here's what I read from this article: RedPanda didn't happen to use any of the stuff in GCP that went down, so they were unaffected. They use a 3rd party for alerting and dashboarding, and that 3rd party went down, but RedPanda still had their own monitoring.

When I read "major outage for a large part of the internet was just another normal day for Redpanda Cloud customers", I expected a brave tale of RedPanda SREs valiantly fixing things, or some cool automatic failover tech. What I got instead was: Google told RedPanda there was an issue, RedPanda had a look and their service was unaffected, nothing needed failing over, then someone at RedPanda wrote an article bragging about their triple-nine uptime & fault tolerance.

I get it, an SRE is doing well if you don't notice them, but the only real preventative measure I saw here that directly helped with this issue is that they over-provision disk space. Which I'd be alarmed if they didn't do.

literallyroy•7mo ago
Yeah, I thought they were going to show something cool like multi-tenant architecture. Odd to write this article when it was clear they expected to be impacted, given that they were reaching out to customers.
dangoodmanUT•7mo ago
I think you're missing the point. What I took away was: "Because we design for zero dependencies for full operation, we didn't go down." Their extra features like tiered storage and monitoring going down didn't affect normal operations, whereas it seems like it did for similar solutions with similar features.
echelon•7mo ago
> triple-nine uptime & fault tolerance.

Haha, we used to joke that's how many nines our customer-facing Ruby on Rails services had compared against our resilient five nines payments systems. Our heavy infra handled billions in daily payment volume and couldn't go down.

With the Ruby teams, we often playfully quipped, "which nines are those?", implying the leading digit itself wasn't a nine.

sokoloff•7mo ago
AKA: "We're closing in on our third 8 of uptime..."
gopher_space•7mo ago
> I expected a brave tale of RedPanda SREs valiantly fixing things, or some cool automatic failover tech.

It's a tale of how they set things up so they wouldn't need to valiantly fix things, and I think the subtext is probably that Redpanda doesn't pass responsibility on to a third party.

There are plenty of domains and, more importantly, people who need uptime guarantees to mean "fix estimate from a real human working on the problem" and not eventual store credit. Payroll is a classic example.

RadiozRadioz•7mo ago
Nothing about the way they architected their system even mattered in this incident. Their service just wasn't using any of the infrastructure that failed - there was no event here that actually put their system design to the test. There just isn't a story here.

It's like if the power went out in the building next door, and you wrote a blog post about how amazing the reliability of your office computers is compared to your neighbor's. If your power had gone out too but you had provisioned a bunch of UPSs and been fine, then there's something to talk about.

To extend the analogy, if the neighborhood had a reputation for brown-outs and you deliberately chose not to build an office there, then maybe you have something. But here, RedPanda's GCP offering is inside GCP, this particular GCP failure had never happened before, and they just got lucky.

bdavbdav•7mo ago
“We got lucky as the way we designed it happened not to use the part of the service that was degraded”
smoyer•7mo ago
And we're oblivious enough about that luck that we're patting ourselves on the back in public.
belter•7mo ago
And we are linking our blog to the AWS doc on cell architectures, while talking about multi-AZ clusters on GCP AZs that are nothing like that...
rybosome•7mo ago
Must be hell inside GCP right now. That was a big outage, and they were tired of big outages years ago. It was already extremely difficult to move quickly and get things done due to the reliability red tape, and I have to imagine this will make it even harder.
siscia•7mo ago
In fairness, their design does not seem to be regional, with problems in one region bringing down another region that apparently wasn't as unrelated as you'd expect.

With this kind of architecture, this sort of problem is just bound to happen.

During my time at AWS, region independence was a must. Some services were able to operate, at least for a while, without degrading even when core dependencies were unavailable. Think losing S3.

And after that, the service would keep operating, but with a degraded experience.

I am stunned that this level of isolation is not common in GCP.
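Roughly, that degraded-mode idea looks like the sketch below; hypothetical names and a cache-backed fallback, not any specific AWS service's actual implementation:

    import time

    class CatalogService:
        """Serves catalog data; falls back to a cached copy if the object store is down."""

        def __init__(self, object_store):
            self.object_store = object_store  # hypothetical S3-like client
            self.cached = None                # last-known-good snapshot
            self.cached_at = 0.0

        def get_catalog(self):
            try:
                data = self.object_store.fetch("catalog.json")  # hypothetical call
                self.cached, self.cached_at = data, time.time()
                return {"data": data, "degraded": False}
            except ConnectionError:
                if self.cached is not None:
                    # Core dependency is down: keep serving, but flag the response as stale.
                    return {"data": self.cached, "degraded": True,
                            "stale_seconds": time.time() - self.cached_at}
                raise  # no snapshot yet, so this request genuinely fails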

valenterry•7mo ago
How does AWS do that though? Do they re-implement all the code in every region? Because even the slightest reuse of code could trigger simultaneous (possibly delayed) downtime across all regions.
crop_rotation•7mo ago
Reusing code doesn't trigger region dependencies.

> Do the re-implement all the code in every region?

Everyone does.

The difference is that AWS very strongly ensures that regions are independent failure domains. The GCP architecture is global, with all the pros and cons that implies, e.g. GCP has a truly global load balancer while AWS cannot, since everything there is regional at its core.

nijave•7mo ago
They definitely roll out code (at least for some services) one region at a time. That doesn't prevent old bugs/issues from coming up but it definitely helps prevent new ones from becoming global outages.
valenterry•7mo ago
Right, that makes sense. But if it's an evil bug that only triggers at, say, a year change, then that might not help.

So I suppose that, theoretically, AWS too can go down all at once, even if it's less likely.

cyberax•7mo ago
Regions (and even availability zones) in AWS are independent. The regions all have overlapping IPv4 addresses, so direct cross-region connectivity is impossible.

So it's actually really hard to accidentally make cross-region calls, if you're working inside the AWS infrastructure. The call has to happen over the public Internet, and you need a special approval for that.

Deployments also happen gradually, typically only a few regions at a time. There's an internal tool that allows things to be gradually rolled out and automatically rolled back if monitoring detects that something is off.
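Conceptually, that wave-based rollout loop is something like the sketch below (hypothetical helper functions; the real internal tooling is far more involved):

    import time

    # Each wave is one or a few regions; later waves only start if earlier ones stay healthy.
    WAVES = [["us-west-2"], ["eu-west-1", "ap-southeast-1"], ["us-east-1"]]

    def deploy_in_waves(build, deploy, healthy, rollback, bake_seconds=1800):
        """Deploy a build wave by wave, rolling a wave back if monitoring flags a problem."""
        for wave in WAVES:
            for region in wave:
                deploy(build, region)
            time.sleep(bake_seconds)  # let metrics and alarms bake before moving on
            if not all(healthy(region) for region in wave):
                for region in wave:
                    rollback(region)  # automatic rollback, and the pipeline stops here
                raise RuntimeError(f"rollout halted at wave {wave}")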

rybosome•7mo ago
Global dependencies were disallowed back in 2018 with a tiny handful of exceptions that were difficult or impossible to make fully regional. Chemist, the service that went down, was one of those.

Generally GCP wants regionality, but because it offers so many higher-level inter-region features, some kind of a global layer is basically inevitable.

dangoodmanUT•7mo ago
Does Route53 depend on services in us-east-1 though? Or maybe it's something else, but I recall us-east-1 downtime causing downtime for global services.
cyberax•7mo ago
As far as I remember, Route53 is semi-regional. The master copy is kept in us-east-1, but individual regions have replicated data. So if us-east-1 goes down, the individual regions will keep working with the last known state.

Amazon calls this "static stability".
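A toy illustration of what "static stability" means here, assuming a hypothetical writer/replica split (not Route53's actual implementation): the regional replica keeps answering from the last state it received, even when the writer region is unreachable.

    class RegionalReplica:
        """Data plane in one region: serves the last replicated state even if the writer is down."""

        def __init__(self):
            self.records = {}  # last-known-good records replicated from the writer region

        def apply_update(self, snapshot):
            # Called whenever replication from the writer region succeeds.
            self.records = dict(snapshot)

        def resolve(self, name):
            # No call back to the writer region on the request path: answers stay
            # available (possibly stale) while the writer region is down.
            return self.records.get(name)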

toast0•7mo ago
Static stability is a good start, but isn't enough.

In this outage, my service (on GCP) had static stability, which was great. However, some other similar services failed, and we got more load, but we couldn't start additional instances to handle the load because of the outage, and so we had overloaded servers and poor service quality.

Mayhaps we could have adjusted load across regions to manage instance load, but that's not something we normally do.

flaminHotSpeedo•7mo ago
One of the core pieces of static stability (at least in one definition, it's an overloaded term) is being able to handle failure scenarios from a steady state.

The classic example is overprovisioning so that you can handle the extra zonal load in the event of a zonal outage without needing to scale up.
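Back-of-the-envelope version of that rule: with N zones, each zone has to run with enough headroom that the survivors can absorb a failed zone's traffic without scaling up.

    def max_safe_utilization(num_zones: int) -> float:
        """Highest steady-state per-zone utilization that still survives losing one zone.

        If one of N zones fails, its traffic spreads across the remaining N-1 zones,
        so each survivor sees load * N / (N - 1); that must stay <= 1.0.
        """
        return (num_zones - 1) / num_zones

    # e.g. with 3 zones, run each at <= ~67% so a zonal outage needs no scale-up
    for n in (2, 3, 4):
        print(n, round(max_safe_utilization(n), 2))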

flaminHotSpeedo•7mo ago
AWS regions are fundamentally different from GCP regions. GCP marketing tries really hard to make it seem otherwise, or that GCP has all the advantages of AWS regions plus the advantages of their approach, which means leaning heavily on "effectively global" services. There are tradeoffs; for example, multi-region in GCP is often trivial and GCP can enforce fairness across regions, but that comes at the cost of availability. Which would be fine - GCP SLAs reflect the fact that they rarely consider regions to be reliable fault containers - but GCP marketing, IMO, creates a dangerous situation by pretending to be something they aren't.

Even in the mini incident report they were going through extreme linguistic gymnastics trying to claim they are regional. Describing the service that caused the outage, which is responsible for global quota enforcement and is configured using a data store that replicates data globally in near real time, with apparently no option to delay replication, they said:

   Service Control is a regional service that has a regional datastore that it reads quota and policy information from. This datastore metadata gets replicated almost instantly globally to manage quota policies for Google Cloud and our customers.
Not only would AWS call this a global service, the whole concept of global quotas would not fly at AWS.
buremba•7mo ago
I think making the identity piece regional hurts the UX a lot. I like GCP's approach, where you manage multiple regions with a single identity, but I'm not sure how they can make it resilient to regional failures.
nijave•7mo ago
Async replication? I think you could run semi-independent regions with an orchestrator that copies config to each one. You'd go into a degraded, read-only state, but it wouldn't be hard down.

Of course, bugs in the orchestrator could cause outages, but ideally that piece is a pretty simple "loop over regions and call each regional API update method with the same arguments".
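Something like this, conceptually (hypothetical regional client API, just a sketch):

    def propagate_config(regions, make_client, config):
        """Push the same config to each regional API independently.

        One region failing doesn't block the others; that region just keeps
        its old config (the degraded, read-only-ish state described above).
        """
        failures = {}
        for region in regions:
            try:
                make_client(region).update_config(config)  # hypothetical regional endpoint
            except Exception as exc:
                failures[region] = exc  # record and move on; retry later
        return failures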

delusional•7mo ago
> they were tired of big outages years ago

One could hope that they'd realize whatever red tape they've been putting up so far hasn't helped, and so more of it probably won't either.

If what you're doing isn't having an effect you need to do something different, not just more.

kubb•7mo ago
They’ll do more of the same. The leads are clueless and sensible voices of criticism are deftly squashed.
raverbashing•7mo ago
Lol, I love how they call "not spreading your services needlessly across many different servers" an "Architectural Pattern" (cell-based arch).

They are right, of course, but the way things are, the obvious needs to be said sometimes.

macintux•7mo ago
Years ago I had the misfortune of helping a company recover from an outage.

It turned out that they had services in two data centers for redundancy, but they divided their critical services between them.

So if either data center went offline, their whole stack was dead. Brilliant design. That was a very long week; fortunately by now I’ve forgotten most of it.

Peterpanzeri•7mo ago
“We got lucky as the way we designed it happened not to use the part of the service that was degraded” - this is a stupid statement from them; I hope they will be better prepared next time.
mankyd•7mo ago
Why is that stupid? They did get lucky. They are acknowledging that, had they used that, they would have had problems. And now they will work to be more prepared.

Acknowledging that one still has risks and that luck plays a factor is important.

beefnugs•7mo ago
I learned a lesson: "use less cloud"
zzyzxd•7mo ago
The article is unnecessarily long just to brag about "a service we didn't use went down so it didn't affect us". If I want to be picky, their architecture is also not perfect:

- Their alerts were not durable. The outage took out the alert system, so humans were just eyeballing dashboards during the outage. What if your critical system went down along with that alert system, in the middle of the night? (See the sketch after this list for the usual mitigation.)

- The cloud marketplace service was affected by the Cloudflare outage and there's nothing they could do.

- Tiered storage was down and disk usage went above its normal level, but there was no anomaly detection and no alerting. It survived because t0 storage was massively over-provisioned.

- They took pride in using well-known industry designs like cell-based architecture, redundancy, multi-AZ... ChatGPT would be able to give me a better list.
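On the first point, the usual mitigation is a dead-man's switch: the alert pipeline emits a heartbeat, and an independent watchdog in a different failure domain pages if the heartbeat stops. A minimal sketch, with made-up parameters:

    import time

    def heartbeat_is_stale(last_heartbeat_ts, max_silence_seconds=300, now=None):
        """Independent watchdog check: True means the alerting pipeline has gone silent
        for too long and someone should be paged, even though no "real" alert fired."""
        now = now if now is not None else time.time()
        return (now - last_heartbeat_ts) > max_silence_seconds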

And I don't get why they had to roast CrowdStrike at the end. I mean, the CrowdStrike incident was really amateur stuff, like, the absolute lowest bar I can think of.

diroussel•7mo ago
> Modern computer systems are complex systems — and complex systems are characterized by their non-linear nature, which means that observed changes in an output are not proportional to the change in the input. This concept is also known in chaos theory as the butterfly effect,

This isn't quite right. Linear systems can also be complex, and linear dynamic systems can also exhibit the butterfly effect.

That is why the butterfly effect is so interesting.

Of course non-linear systems can have a large change in output based on a small input, because they allow step changes and many other non-linear behaviors.