Unless you have a dedicated team to do the stuff for you.
Crunchydata is a good starting point
You want 3 9s of availability for your DBs, maybe more.
Then you need 4 9s for your cluster/orchestrator.
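For a sense of scale, here is a rough sketch of what those nines allow in downtime per year. The function name and the 365-day year are my own assumptions for illustration, not anything from a particular tool.

```python
# Back-of-the-envelope downtime budget for "N nines" of availability.
# Assumption: a 365-day year, and availability = 1 - 10^(-nines).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year, in minutes, for the given number of nines."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4):
        print(f"{n} nines: about {downtime_minutes_per_year(n):.1f} minutes/year")
```

Each extra nine cuts the yearly budget by a factor of ten, which is why asking for one more nine from the orchestrator than from the DBs is a real cost.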
If your team can make that cluster, then it makes more sense to put everything under one roof than to develop a whole new infrastructure with the same level of reliability or more.
But if your workloads stop and can't be started on the same node you've got a degradation if not an outage.
In real practice, it's so cheap to keep your operator running redundantly that it's probably going to have more nines than your workloads, but it doesn't need to.
In my world scaling is required. Meaning new nodes and new pods. Meaning you need a control plane.
Even in development, no control plane means no updates.
In production, no scaling means I'm going to have a user-facing issue at the next traffic spike.
What I'm saying is that the two probabilities are separate: possibly correlated, but one is not strictly dependent on the other. You need some number of nines in your control plane for scaling operations. You need some number of nines in your control plane for updates. These are very few, and they don't overly affect the serving plane, so long as the serving plane is itself resilient to the errors that happen even when a control plane is running, like sudden node failure.
Proper modeling of these failure conditions is not as simple as multiplying probabilities. The chance of failures in your serving path goes up as the time between control plane readiness goes up. You calculate (really, only ever guesstimate, but you can get some good information for those guesses) the probability of a failure in the serving plane (incl. increases in traffic to the point of overload) before the control plane has had a chance to take action again, and you worry about the MTTF and MTBR of the control plane more than its "reliability". You can have a control plane with 75% or less "uptime" by action failure rate that still takes actions on a regular cadence, and never notice.
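To make the "time between control plane readiness" point concrete, here is a minimal sketch. It assumes serving-plane failures that need intervention arrive at a constant rate (an exponential model), which is my simplification, and the rate and gap numbers are made up.

```python
import math


def p_failure_before_next_action(failure_rate_per_hour: float, gap_hours: float) -> float:
    """Probability of at least one serving-plane failure during a gap in
    which the control plane cannot act.

    Assumption: failures follow an exponential (memoryless) model with a
    constant rate. Real traffic spikes and node failures are burstier.
    """
    return 1.0 - math.exp(-failure_rate_per_hour * gap_hours)


if __name__ == "__main__":
    rate = 0.001  # hypothetical: one intervention-worthy failure per 1000 hours
    for gap in (0.1, 1.0, 24.0):  # hypothetical gaps between control plane actions
        p = p_failure_before_next_action(rate, gap)
        print(f"gap of {gap:>4} h -> P(failure before next action) = {p:.5f}")
```

The thing to look at is how the probability grows with the gap: what matters is how long the serving plane is left alone, not the headline uptime number of the control plane.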
You can build reliable infrastructure out of unreliable components. The control plane itself is an unreliable component, and you can serve traffic at massive scale with control planes faulty or down completely, without affecting serving traffic. You don't need more nines in your control plane than your serving cluster. That is the only point I am addressing/contesting. You can have many, many fewer and still be doing right fine.
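A rough sketch of why a flaky control plane on a regular cadence can still be good enough: if each attempt succeeds 75% of the time and attempts are independent (both assumptions of mine for illustration), the chance that an action lands within a handful of attempts gets high fast.

```python
def p_action_lands_within(attempts: int, success_rate: float) -> float:
    """Probability that at least one of `attempts` tries succeeds.

    Assumption: attempts are independent and each succeeds with the same
    probability. Correlated outages would make this optimistic.
    """
    return 1.0 - (1.0 - success_rate) ** attempts


if __name__ == "__main__":
    for k in (1, 2, 5):
        p = p_action_lands_within(k, 0.75)
        print(f"{k} attempt(s) at 75% success -> P(action lands) = {p:.6f}")
```

So the question becomes whether the serving plane can ride out a few missed ticks, not whether every tick succeeds.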
This can be solved by building statically (or using something like Nix) or by at least using containers.
justinclift•5mo ago
mdaniel•5mo ago
Fuck them