Tell HN: DigitalOcean's managed services broke each other after update

76•neilfrndes•3w ago

Yesterday my production app went down. The cause? DigitalOcean's managed PostgreSQL update broke private VPC connectivity to their managed Kubernetes.

Public endpoint worked. Private endpoint timed out. Root cause: a Cilium bug (#34503) where ARP entries go stale after infrastructure changes.

DO support responded relatively quickly (<12hrs). Their fix? Deploy a DaemonSet from a random GitHub user to ping stale ARP entries every 10 seconds. The upstream Cilium fix is merged but not yet deployed to DOKS. No ETA.

I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late night hours debugging VPC routing issues in a networking layer I don't control.

HN's usual advice is "just use managed services, focus on the business." Generally good advice. But managed doesn't mean worry-free, it means trading your failure modes for the vendor's failure modes. You're not choosing between problems and no problems. You're choosing between problems you control and (fewer?) problems you don't.

Still using DO. Still using managed services. Just with fewer illusions about what "managed" means.

Comments

cosmin800•3w ago

Lower prices come with a cost. I am not a fan of AWS but they have higher reliability.

delish•3w ago

The font color implies this comment is downvoted, but I earnestly encourage readers to take very seriously the difference in SLOs and SLAs between high-cost vendors like AWS and GCP and low-cost vendors like DigitalOcean. Read their docs; do not assume DO is "the same, but lower cost."

deathanatos•3w ago

… are the published SLAs worth more than use as toilet paper?

I think it boils down to who offers the highest quality / $, and that's an impossible metric to really measure except via experience.

But with a number of the "big" clouds, there's what the SLA says, and then the actual lived performance of the system. Half the time the SLA weasels out of the outage; e.g., "the API works" is not in SLA scope for a number of cloud services, only things like "the service is serving your data". E.g., your database is up? SLA. You can make API calls to modify it? Not so much. VMs are running? SLA. API calls to alloc/dealloc? No. Support responded to you? SLA. The response contains any meaningful content? Not so fast. Even if your outage is covered by SLA, getting that SLA to action often requires a mountain of work: I have to prove to the cloud vendor that they've strayed from their own SLA¹, and force them to issue a credit, and often the benefit of the credit is then outweighed by my time in salary. Oftentimes the exchanges in support town seem to reveal that the cloud provider has, apparently, no monitoring whatsoever to be able to see what actual perf I am experiencing. (E.g., I have had tickets with Azure where they seem blithely unaware their APIs are returning 500s …)

So, published is one thing. On paper, IDK, maybe Azure & GCP probably look pretty on par. In practice, I would laugh at that idea.

¹AWS is particularly guilty of this; I could summarize their support as "request ID or GTFO".

neilfrndes•3w ago

We were on AWS for a while. The complexity was way higher than what our team could manage. DOKS is simpler, and this is the first major issue we've hit in many months.

killingtime74•3w ago

AWS frequently has outages. us-east-1, anyone?

Nextgrid•3w ago

I've had an RDS instance fail in an equally-weird way that required manual AWS operator intervention 24 hours later, until which we were effectively locked out of our data and had to restore from a recent backup & rebuild the missing data from logs to bring the service back up in the meantime.

All the supposed "savings" of using managed services to save on staff costs evaporated immediately. No refund from the provider obviously despite it being an edge-case in their implementation.

sethops1•3w ago

Obligatory, do you actually need kubernetes? I struggle to imagine any tiny startup that does.

osigurdson•3w ago

Running Kubernetes in a managed environment like DO is no harder than using docker compose.

neilfrndes•3w ago

As the sibling comment already mentioned, k8s is not much more complexity once you're past the learning curve. I used to host with ec2 + scripts earlier. K8s actually solves a lot of problems that you will have to solve yourself anyway.

hdjrudni•3w ago

As a solo dev who just started his second cluster a few days ago... I like it.

Upfront costs a little higher than I'd like. I'm paying $24 for a droplet + $12 for a load balancer, plus maybe $1 for a volume.

I could probably run my current workload on a $12 droplet but apparently Cilium is a memory hog and that makes the smaller droplet infeasible, and it seems not practical to not run a load balancer.

But now I can run several distinct apps running different frameworks and versions of php, node, bun, nginx, whatever and spin them up and tear them down in minutes and I kind of love that. And if I ever get any significant amount of users I can press a button and scale up or horizontally.

I don't have to muck about with pm2 or supervisord or cronjobs, that's built in. I don't have to muck about with SSL certs/certbot, that's built in.

I have SSO across all my subdomains. That was a little annoying to get running, took a day and a half to figure out but it was a one time thing and the config is all committed in YAML so if I ever forget how it works I have something to reference instead of trying to remember 100 shell commands I randomly ran on a naked VPS.

Upgrades are easy. Can upgrade the distro or whatever package easily.

Downsides are deploys take a minute or two instead of sub-second.

It took weeks of tinkering to get a good DX going, but I've happily settled on DevSpace. Again it takes a couple minutes to start up and probably oodles of RAM instead of milliseconds but I can maintain 10 different projects without trying to keep my dev machine in sync with everything.

So some trade-offs but I've decided it's a net win after you're over the initial learning hump.

Nextgrid•3w ago

> I can run several distinct apps running different frameworks and versions

> don't have to muck about with pm2 or supervisord or cronjobs, that's built in. I don't have to muck about with SSL certs/certbot

But doesn't literally any PaaS or provider with a "run a container" feature (AWS Fargate/ECS, etc.) fit the bill without the complexity, moving parts and failure modes of K8s?

K8s makes sense when you need a control plane to orchestrate workloads on physical machines - its complexity and moving parts are somewhat justified there because that task is actually complex.

But to orchestrate VMs from a cloud provider, where the hypervisor and control plane already offer all of the above? Why take on the extra overhead by layering yet another orchestration layer on top?

sfifs•2w ago

Not the original poster, but I have tried all that. It's far easier with Kubernetes: just deployment, service, secret & ingress config, and stuff just works cleanly in namespaces without stuff at any risk of clobbering each other.

cadamsdotcom•3w ago

100% uptime is impossible of course, a 100% reliable service would survive the next ice age.

But reliability at the holy grails of 4 and 5 nines (99.99%, 99.999% uptime) means ever greater investment - geographically dispersing your service, distributed systems, dealing with clock drift, multi master, eventual consistency, replication, sharding.. it’s a long list.
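To put numbers on those nines, here is a small Python sketch that converts an availability target into a yearly downtime budget. It assumes a 365-day year and nothing else; the function name is made up for illustration. Four nines works out to roughly 53 minutes of allowed downtime per year, five nines to roughly 5 minutes.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes per year a service may be down at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

A single multi-hour incident already blows through the four-nines budget for the whole year, which is why each extra nine gets so expensive.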

Questions to ask: could you do better yourself - with the resources you have? Is it worth the investment of a migration to get there? What's the payoff period for that extra sliver of uptime? Will it cost you in focus over the longer term? Is the extra uptime worth all those costs?

Nextgrid•3w ago

> could you do better yourself

For this particular failure mode absolutely - this is amateur-level stuff that shouldn't have happened.

You know how to make something that works keep working? Not messing with it. Of course, this doesn't pay salaries if your entire career is based on "fixing" things that work until they don't.

There is no reason to hurry a Postgres upgrade - the thing shouldn't be internet accessible anyway, so no risk of security issues.

If you do want to update, it's best to test the update on a test/staging system. Which I'm sure they would have if they didn't have to pay a 10-90x markup on the compute price.

Finally, when you do the update, you'd do it manually during a time where you are present and outside of business hours to further minimize the impact of something going wrong, instead of the upgrade happening out of the blue at a random time.

cadamsdotcom•3w ago

One amateur moment doesn’t make a service’s management amateur.

If you run it yourself there’s a chance you will trade the mistakes made by DO for different mistakes made by your own team - and still have similar overall reliability.

Nextgrid•3w ago

Simply moving the time at which the mistakes occur can be extremely valuable. Doing it yourself means you can say "no touching the server during business hours". You can't guarantee that with a provider.

Even if you work out that you cannot do better, at least you are no longer paying the insane premium of the managed highly-available service (since it's not actually capable of delivering).

solaris2007•3w ago

AWS designs and implements their foundational services holistically. I can understand that the services "higher up the stack" may not feel this way to AWS customers sometimes. However, the foundation of VPCs, EC2, EBS and S3, are very strong.

If the word "production" is supposed to really mean something to you, move your workload to Google Cloud, or move it to AWS, or on https://cast.ai

Disclaimer: I have no commercial affiliation with Cast AI.

tatersolid•3w ago

> AWS designs and implements their foundational services holistically.

I’d say they implement their services circularly. The outage-inducing circular dependency between Dynamo and Route53 is not a “holistic” design.

kevin_nisbet•3w ago

> I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late night hours debugging VPC routing issues in a networking layer I don't control.

This happens with managed services and I understand the frustration, but vendors are just as fallible as the rest of us and are going to have wonky behaviour and outages, regardless of the stability they advertise. This is always part of build vs buy; buy doesn't always guarantee a friction-free result.

It happens with the big cloud providers as well, I've spent hours with AWS chasing why some VMs are missing routing table entries inside the VPC, or on GCP we had to just ban a class of VMs because the packet processing was so bad we couldn't even get a file copy to complete between VMs.

Nextgrid•3w ago

> vendors are just as fallible as the rest of us

One of the issues I have with this is the insane markups they're charging for services that ultimately aren't any better than what you can do yourself.

If they aren't any better at least save yourself some money.

Ma8ee•3w ago

> but vendors are just as fallible as the rest of us

Isn’t the point that they shouldn’t be? They should have specialists dedicated to running these kinds of things, test upgrades before rolling out, etc., while for the rest of us it’s just one of many things we try to handle.

mmh0000•3w ago

  > I chose managed services specifically to avoid ops emergencies

You may not be spending enough time on HN reading all the horror stories =P

The benefit of a managed service isn't that it doesn't go down; though it probably goes down less than something you self-manage, unless you're a full-time SRE with the experience to back it.

The benefit of a managed service is you say: "It's not my problem, I opened a ticket, now I'm going to get lunch, hope it's back up soon."

neilfrndes•3w ago

I've read a few horror stories, but I always thought it wouldn't happen to me :)

> It's not my problem, I opened a ticket, now I'm going to get lunch, hope it's back up soon.

That's a good way of thinking about it.

hdjrudni•3w ago

> though it probably goes down less than something you self-manage, unless you're a full-time SRE with the experience to back it.

I wonder how true that is. This went down because of a bad update, which is probably like 99.99% of outages. The other 0.01% is cosmic rays causing hardware failures.

My server was up for 3.5 years with no outages because I just didn't touch it. I had to take it offline a couple days ago to move it which made me a little sad. Took a snapshot and moved it to a new droplet, brought it back up as-is and it's running great again.

Anyway, emergencies are less emergy if things go down while you're upgrading and shuffling things around yourself. You expect hiccups if you're the one causing the hiccups. It's when someone else is tinkering on the other side of the country/planet and blows something up that suddenly you have an emergency.

Nextgrid•3w ago

I concur. I've seen a lot of companies outside the techbro world where the entire thing runs on a single VPS/dedicated server with a setup that would make any sysadmin squirm. And yet, it just works and makes them money?

Which isn't too surprising - hardware is extremely reliable nowadays. When's the last time your laptop broke? And that laptop lives a much harsher life than server HW in a datacenter. Obviously everyone is going to have their own anecdotes about this, but I think it's fair to say that overall the failure rates are quite low.

You know why their (often awful) setups work and consistently beat the major clouds in terms of uptime? No moving parts for K8s and all the "best practices", and most importantly, there is nobody "fixing" the working setup until it doesn't work. Ironically they are getting better uptime by avoiding all the things that are marketed as improving uptime.

kikimora•3w ago

>My server was up for 3.5 years with no outages because I just didn't touch it.

Problem #1: keeping the OS current. Chances are you run an outdated OS with some RCE vulnerabilities.

Problem #2: the setup is hard to scale organizationally. How to give access to the server to other people? How to monitor what they do? How to replicate server setup across teams and keep it in sync? So on and so forth.

In an org, something always changes, and you have to touch servers as a result.

ebiederm•3w ago

I don't know if this is realistic, but as a general rule, if I was contracting with someone so that my business would have higher reliability, I would ask for a service level agreement with an agreed-upon amount the vendor will pay you for every unit of time their service is not up.

At least then your pain is their pain, and they are incentivized to prevent problems and fix them quickly.

SahAssar•3w ago

Usually those agreements either just give you credits for the same service, pay way less than you lost or basically everything falls under force majeure.

If it works for you that's great, but when the actual shit hits the fan I don't think you should expect actual compensation.

neilfrndes•3w ago

At our scale I doubt if we can get any cloud provider to write custom contracts. But if I had negotiating power, I completely agree.

Nextgrid•3w ago

Nobody that uses Kubernetes and random shit from Github would sign such an agreement if they actually had to pay out and could not weasel their way out of it. That would be signing up for a near-unlimited liability and business suicide.

Let's assume an incident costs you (the customer) ~5k, just assuming the time it takes to get a professional on very short notice to debug (since the whole promise of managed services is that you no longer need technical staff at all). That's also ignoring the actual cost to your business (lost sales, reputational risk, or missing your own SLAs).

For the provider to be willing to pay out something like this they'd need to charge you monthly several times that amount (otherwise just one incident and they're forever underwater on the LTV). Yet such a monthly amount would make the service unaffordable to all but the most deep-pocketed customers... for whom the impact of an outage on their business would cost even more meaning they'd want the payouts to be even bigger, leading to a catch-22.

High-availability good enough for the provider to put 5-figure sums on the line is actually really hard (there's a reason actual critical stuff like stock exchange order processing or card transactions don't run on the "cloud", nor on Kubernetes for that matter), so the next best thing is make-believe "high availability" where everyone (except the occasional poor soul like you that actually believed the marketing) understands the charade and plays along (because their own SLAs are often make-believe too).

See also: the recent Cloudflare or AWS outages.

calvinmorrison•3w ago

At my work we pay a boring, regional VPS host that is not fancy. In fact it's maybe a few levels above "your 2000's web host, with a LAMP stack, an FTP login and a bad admin panel". Just a bit above that.

However, they ALWAYS pick up the phone on the 3rd ring with a capable, on call linux sysadmin with good general DB, services, networking, DNS, email knowledge.

abnercoimbre•3w ago

Wait, customer support with a competent sysadmin? You're not making this up? It sounds ethereal.

calvinmorrison•3w ago

Pair.com! Pittsburgh baby

Fhch6HQ•3w ago

That is a name I haven't heard in a long time, pleasantly surprised to hear they're still in business.

adityaathalye•3w ago

This is the way... you live in operator heaven.

Nextgrid•3w ago

Bonus point is that with such a simple stack you don't need to phone them often because the thing just works.

Most cloud outages are self-inflicted with the endless churn trying to reinvent things, not actual hardware failure. Just not touching the working system would boost their reliability and uptime, but then a lot of people would lose justification for their salaries so it can't happen.

calvinmorrison•3w ago

you get REALLY, REALLY far with a single server. Most SaaS products don't have a million users. You can get even further with just 2 servers.

AlbinoDrought•3w ago

Since this is about DO managed Postgres: if you're using it with replicas, they use async replication and RPO can be greater than 15 minutes. Since failover is triggered during upgrades, there ends up being a lot of periods where you can lose multiple minutes of committed data.

roryirvine•3w ago

Do they at least allow you to set your own schedule for upgrade windows? That way you could schedule them for quiet times of day, minimising the likelihood of there being significant replica lag.

It's common to do this on AWS and the other hyperscale providers (though, of course, they tend to do synchronous replication anyway, meaning that this particular failure mode wouldn't apply) - upgrades are a common source of unforeseen issues, so it makes sense to minimise the potential blast radius by running them out of hours.

itake•3w ago

I just had a 12hr outage due to flyio's quick and easy postgres minor patch update cooking my database.

I ended up downloading the entire volume, setting up my own docker container locally, exporting it, creating a new cluster (on the latest major patch).

Lost most of my day yesterday

hdjrudni•3w ago

Oof. I have a very similar set up except I'm using their managed MySQL instead of PostgreSQL. It appears I wasn't hit.

Same thought as you.. I just didn't want to figure out and manage MySQL-with-failover myself so I switched to their managed solution a year or two ago and my bill went up like 300% or more (was running fine on a ~$12 or maybe $24 droplet + $5 volume but now costs, I don't even remember, $150 or so).

yellow_lead•3w ago

Try a different managed service. We've been using Render for a year with no DB outages. Although, we have gone down with Cloudflare several times.

As far as dbs go, I believe Amazon RDS is quite reliable. I think Render uses it under the hood.

You could also consider AWS ECS directly with RDS.

anurag•3w ago

Render's built its own Postgres (we don't use RDS). Glad to hear it's working well for you!

yellow_lead•3w ago

That's pretty cool! Thanks for your work. No PaaS is perfect but quite satisfied with Render.

sfifs•3w ago

Oh I've run into exactly the same issue on my personal cluster and I had no clue what the issue was. Is this solvable?

mystraline•3w ago

I know it's not quite the same, but I've been moving some of my personal services off of docker, and back to a full VM.

I find less things that can go wrong with VMs. I can log and monitor them better, and increase resources as I see what's going on per machine.

Docker was smearing all the machines together. For early testing, it's great due to speed of redeploy and cleaning state. But once you want to start tuning, docker is pretty hard to get right.

Maybe I'm not a great systems engineer. But I do like my lower complexity systems. 1 service per machine is, in my opinion easier to get right.

atmosx•3w ago

“Welcome to the real world Neo!”

“There is no cloud, it’s just somebody else’s computer”

etc etc…

lep_qq•3w ago

This resonates. We run a similar setup (managed K8s + managed DBs) and hit a comparable issue last year with a cloud provider's CNI update that broke pod-to-service networking for 6 hours. The irony is that "managed" services often abstract away the problems you can fix (config, scaling, backups) while exposing you to problems you can't fix (vendor infrastructure bugs, dependency conflicts between their managed components). What helped us:

Redundancy across failure domains: We now run critical stateful workloads with connection pooling that can failover between private and public endpoints. Yes, it's more complexity, but it's complexity we control.

Synthetic monitoring for managed services: We probe not just our app, but also the managed service endpoints from multiple network paths. Catches these "infrastructure layer" failures faster.

Backup connectivity paths: For managed DBs, we keep both private VPC and public (firewalled) endpoints configured. If one breaks, we can switch in minutes via config.
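As an illustration of the private-then-public fallback idea, here is a minimal Python sketch. It is not anyone's actual production setup: it only does a plain TCP connect with a timeout (no TLS, no database handshake), the helper name `first_reachable` is made up, and the hostnames in the usage comment are hypothetical placeholders.

```python
import socket
from typing import Optional, Sequence, Tuple

Endpoint = Tuple[str, int]


def first_reachable(endpoints: Sequence[Endpoint], timeout: float = 2.0) -> Optional[Endpoint]:
    """Return the first endpoint that accepts a TCP connection, in priority order.

    A connect that times out (like the stale private VPC path in the original
    post) or is refused raises OSError, so the next endpoint is tried instead.
    """
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue
    return None


# Hypothetical usage: prefer the private VPC endpoint, fall back to the
# firewalled public one. Both hostnames are placeholders, not real DO names.
#
#   candidates = [("private-db.internal.example", 5432),
#                 ("public-db.example", 5432)]
#   target = first_reachable(candidates)
#   if target is None:
#       raise RuntimeError("no database endpoint reachable")
```

The same probe, run on a schedule from a few different network paths, doubles as a crude synthetic check: a private endpoint that stops answering while the public one still does points at the network layer rather than the database.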

The DaemonSet workaround is... alarming. It's essentially asking you to run production-critical infrastructure code from an untrusted source because their managed platform has a known bug with no ETA. Your point about trading failure modes is spot on. Managed services are still worth it for small teams, but the value prop is "fewer incidents" not "no incidents," and when they do happen, your MTTR is now bounded by vendor response time instead of your team's skills. Did DO at least provide the DaemonSet from an official source, or was it literally "here's a random GitHub link"?

neilfrndes•2w ago

> Did DO at least provide the DaemonSet from an official source, or was it literally "here's a random GitHub link"?

quoting verbatim from their email:

> For long-term remediation, our team has also created a DaemonSet that runs this flush command on all nodes automatically. You can find it at the link: https://github.com/okamidash/ARP-DOKS-FIX
