Public endpoint worked. Private endpoint timed out. Root cause: a Cilium bug (#34503) where ARP entries go stale after infrastructure changes.
DO support responded relatively quickly (<12hrs). Their fix? Deploy a DaemonSet from a random GitHub user to ping stale ARP entries every 10 seconds. The upstream Cilium fix is merged but not yet deployed to DOKS. No ETA.
I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late-night hours debugging VPC routing issues in a networking layer I don't control.
HN's usual advice is "just use managed services, focus on the business." Generally good advice. But managed doesn't mean worry-free; it means trading your failure modes for the vendor's failure modes. You're not choosing between problems and no problems. You're choosing between problems you control and (fewer?) problems you don't.
Still using DO. Still using managed services. Just with fewer illusions about what "managed" means.
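For readers wondering what "ping stale ARP entries every 10 seconds" might look like in practice, here is a minimal sketch. It is not the DaemonSet that support pointed to, and it makes no claim to resolve the Cilium issue. It assumes a Linux node with the `ip` and `ping` commands available; the function names `stale_neighbors`, `refresh_once`, and `main` are hypothetical, chosen for illustration.

```python
import subprocess
import time


def stale_neighbors(ip_neigh_output: str) -> list[str]:
    """Return the IPs whose neighbor-table entry ends in the STALE state.

    Expects text in the shape printed by `ip neigh show`, e.g.
    "10.0.0.1 dev eth0 lladdr aa:bb:cc:dd:ee:ff STALE".
    """
    stale = []
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # The first field is the IP address; the last field is the state.
        if len(fields) >= 2 and fields[-1] == "STALE":
            stale.append(fields[0])
    return stale


def refresh_once() -> None:
    """Send one ping to every neighbor currently marked STALE."""
    out = subprocess.run(
        ["ip", "-4", "neigh", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    for ip in stale_neighbors(out):
        # One packet, one-second timeout; failures are ignored on purpose.
        subprocess.run(
            ["ping", "-c", "1", "-W", "1", ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )


def main(interval_seconds: float = 10.0) -> None:
    """Loop forever, refreshing stale entries every `interval_seconds`."""
    while True:
        refresh_once()
        time.sleep(interval_seconds)
```

Calling `main()` on a node would run the loop with the 10-second interval mentioned above; packaging it as a DaemonSet is a separate step not shown here.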
cosmin800•13h ago
delish•12h ago
deathanatos•11h ago
I think it boils down to who offers the highest quality / $, and that's an impossible metric to really measure except via experience.
But with a number of the "big" clouds, there's what the SLA says, and then the actual lived performance of the system. Half the time the SLA weasels out of the outage; e.g., "the API works" is not in SLA scope for a number of cloud services, only things like "the service is serving your data". E.g., your database is up? SLA. You can make API calls to modify it? Not so much. VMs are running? SLA. API calls to alloc/dealloc? No. Support responded to you? SLA. The response contains any meaningful content? Not so fast. Even if your outage is covered by SLA, getting that SLA to action often requires a mountain of work: I have to prove to the cloud vendor that they've strayed from their own SLA¹, and force them to issue a credit, and often the benefit of the credit is then outweighed by the cost of my time in salary. Oftentimes the exchanges in support town seem to reveal that the cloud provider has, apparently, no monitoring whatsoever to be able to see what actual perf I am experiencing. (E.g., I have had tickets with Azure where they seem blithely unaware their APIs are returning 500s …)
So, published is one thing. On paper, IDK, maybe Azure & GCP look pretty much on par. In practice, I would laugh at that idea.
¹AWS is particularly guilty of this; I could summarize their support as "request ID or GTFO".
neilfrndes•9h ago
killingtime74•9h ago
Nextgrid•8h ago
All the supposed "savings" on staff costs from using managed services evaporated immediately. No refund from the provider, obviously, despite it being an edge case in their implementation.