This assumption is completely out of touch, and is especially funny when the goal is to build an extra large cluster.
https://github.com/k3s-io/kine is a reasonably adequate substitute for etcd; through it, SQLite, MySQL, or PostgreSQL can be substituted in. Etcd is built from the ground up to be more reliable when scaled out, and that rocks to have baked in. But given how easy it is to substitute etcd out, I feel like we are at least a little off if we're trying to say "etcd is also the entire point of k8s" (the API server is).
Holy Raft protocol is the blockchain of cloud.
k3s doesn't require etcd, and I'm pretty sure GKE uses Spanner and Azure uses Cosmos under the hood.
But the article suggests turning off fsync and expecting to lose only a few ms of updates. I've tried to recover etcd on volumes that lied about fsync and experienced a power outage, and I don't think we managed to recover it. There might be more options now to recover and ignore corrupted WAL entries, but at that time it was very difficult and I think we ended up just reinstalling from scratch. For clusters where this doesn't matter or the SLOs for recovery account for this, I'm totally on board, but only if you know what you're doing.
Similarly, the point from the article that "full control plane data loss isn’t catastrophic in some environments" is correct, depending on what the author means by "some environments". I don't think it's limited to those that are managed by GitOps, as suggested, but extends to any environment where there is enough resiliency and time to redeploy and do all the cleanup.
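To make the fsync trade-off concrete, here is a minimal sketch of an append-only log in Python. It is not etcd's WAL format: the length-prefixed record layout and the names `append_record` and `read_records` are made up for illustration. It only shows where the fsync call sits and how a reader can stop at a torn final record.

```python
import os


def append_record(path, payload, durable=True):
    """Append one length-prefixed record to a log file (hypothetical format)."""
    record = len(payload).to_bytes(4, "big") + payload
    with open(path, "ab") as f:
        f.write(record)
        f.flush()  # moves data from Python's buffer into the OS page cache
        if durable:
            # Ask the OS to push the data to the storage device.
            # Skipping this is the "fsync off" mode discussed above.
            os.fsync(f.fileno())


def read_records(path):
    """Read records back, stopping at the first truncated (torn) entry."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    pos = 0
    while pos + 4 <= len(data):
        size = int.from_bytes(data[pos:pos + 4], "big")
        body = data[pos + 4:pos + 4 + size]
        if len(body) < size:
            break  # incomplete tail, e.g. after a crash mid-write
        records.append(body)
        pos += 4 + size
    return records
```

A cleanly truncated tail like this is the mild case. A volume that acknowledges fsync without actually persisting can lose or damage records that were already reported as durable, which is much harder to recover from than dropping the last few entries.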
Anyways, like much advice on the internet, it's not good or bad, just highly situational, and some of the suggestions should only be applied if the implications are fully understood.
I would like to see how moving to a database that scales write throughput with replicas would behave, namely FoundationDB. I think this will require more than an intermediary like kine to be efficient, as the author illustrates that the apiserver does a fair bit of its own watching and state keeping. I also think there's a benefit, at least for blast radius, to sharding the server by API group or namespace.
I think years ago this would have been a non-starter with the community, but given AWS has replaced etcd (or at least aspects of it) with their internal log service for their large cluster offering, I bet there's some appetite for making this interchangeable and bringing an open source solution to market.
I share the author's viewpoint that for modern cloud-based deployments, you're probably best avoiding it and relying on VMs being stable and recoverable. I think reliability does matter if you want to actually realize the "borg" value and run it on bare metal across a serious fleet. I haven't found the business justification to work on that though!
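As a rough illustration of the sharding idea, here is a sketch in Python of routing by namespace. This is purely hypothetical and not how the Kubernetes apiserver works today; `shard_for` and the shard count are invented for the example. The point is only that a stable hash keeps each namespace on one backend, so a failure there affects only the namespaces mapped to it.

```python
import hashlib


def shard_for(namespace, num_shards):
    """Map a namespace name to a shard index with a stable hash (hypothetical)."""
    digest = hashlib.sha256(namespace.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; the mapping is the same on every
    # process and every run, unlike Python's built-in hash() for strings.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for ns in ["kube-system", "default", "team-a", "team-b"]:
        print(ns, "->", shard_for(ns, 4))
```

Simple modulo hashing reshuffles most namespaces when the shard count changes, so a real design would likely need consistent hashing or an explicit placement table.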
ktpsns•1d ago
[1] What is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing. Then we are talking about something on the order of 100k cores to be scheduled.
stackskipton•1d ago
osigurdson•1h ago
I'm sure they mean actual servers, not just cores. Even in traditional HPC it usually isn't abstracted to the level of individual cores, since most HPC jobs care about memory bandwidth: even with InfiniBand or other techniques, throughput and latency between machines are much worse than on a single machine. Of course, multiple machines are connected (usually using MPI over InfiniBand), but it is important to minimize communication between nodes where possible.
For AI workloads, they are running GPUs - so 10K+ cores on a single device so even less likely to be talking about cores here.