For instance, postgres can hit this sort of QPS easily, afaik. It’s not distributed, but I’m sure Vitess could do something similar. The query patterns don’t seem particularly complex either.
Not trying to be reductive - I’m sure there’s some complexity here I’m missing!
Projects like kine allow K8s users to swap sqlite or postgres in place of etcd, which (I assume, please correct me otherwise) would deliver better throughput since those backends don't need to perform consensus operations.
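A rough sketch of how that fits together (flag names are from memory and the connection string is a placeholder, so check kine's docs before copying):
# kine speaks the etcd API on one side and a SQL backend on the other
kine --endpoint "postgres://kine:secret@127.0.0.1:5432/kine?sslmode=disable" --listen-address 0.0.0.0:2379
# the apiserver then just thinks it's talking to etcd
kube-apiserver --etcd-servers=http://127.0.0.1:2379 ...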
A well-managed HA postgresql setup (active/passive) is going to run circles around etcd for kube control-plane operations.
The caveat here is an increased risk of downtime and much higher management overhead, which is why it's not the default.
> To support the cluster’s massive scale, we relied on a proprietary key-value store based on Google’s Spanner distributed database... We didn’t witness any bottlenecks with respect to the new storage system and it showed no signs of it not being able to support higher scales.
It's basically "take global state of node load and capacity, pick where to schedule it", and I'd imagine probably not running in parallel coz that would be far harder to manage.
Bin packing seems to be a well-defined NP-hard problem: https://en.wikipedia.org/wiki/Bin_packing_problem
We treat it as a best effort alternative when native GCS access isn't possible.
They’re wonderful for low-volume, low-performance, low-reliability operations (browsing, copying, integrating with legacy systems that do not permit native access), but beyond that they consume huge resources and do odd things when the backend is not in its most ideal state.
Fair, but it's far from common advice I'm willing to give other CTOs.
Honestly, I'd give FUSE a second chance, you'd be surprised at how useful it can be -- after all, it's literally running in userland so you don't need to do anything funky with privileges. However, if I were starting afresh on a similar project I'd probably be looking at using 9p2000.L instead.
I commented though because GCP highlights it in a few places as a component for AI workloads. I'm curious if anyone is using it in an important application and is happy with it.
I only skimmed the article, but I'm confident that it's more a physical hardware, time, space, and electricity problem than a software / orchestration one; the article mentions that a cluster that size already needs to be multi-datacenter given the sheer power requirements (2700 watts for one GPU in a single node).
So I guess the title is not true?
> Unfortunately running 1M real kubelets is beyond my budget.
One of the main issues with cilium is that the bpf maps scale with the number of nodes/pods in the cluster, so memory usage on every node running the cilium agent keeps growing as you add more nodes. https://docs.cilium.io/en/stable/operations/performance/scal...
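You can see the per-node footprint directly from the host, something like this (assuming bpftool is installed and run as root; cilium's map names and output format vary by version):
# Cilium's maps show up with a cilium_ prefix; the listing includes
# max_entries and, on newer kernels, per-map memlock usage
bpftool map show | grep -A1 cilium_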
I see the appeal of K8s in dividing raw, stateful hardware to run multiple parallel workloads, but if you're dealing with stateless cloud VMs, why would you need K8S and its overhead when the VM hypervisor already gives you all that functionality?
And if you insist anyway, run a few big VMs rather than many small ones, since K8s overhead is per-node.
(For the record, I’m not a k8s fanatic. Most of the time a regular VM is better. But a VM is not the same thing as a kubernetes cluster.)
Your advice about bigger machines is spot on - K8s's biggest problem is how relatively heavyweight the kubelet is, with memory requirements of roughly half a gig. On a modern 128g server node that's a reasonable overhead, and for small companies running a few workloads on 16g nodes it's a cost of doing business, but if you're running 8 or 4g nodes, it looks pretty grim for your utilization.
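Easy to sanity-check on your own nodes, e.g. (assuming the kubelet runs as a systemd unit on the host):
# Resident memory of the kubelet process, in KiB
ps -C kubelet -o rss=,comm=
# Or via systemd, if memory accounting is enabled
systemctl status kubelet | grep -i memory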
It really just depends on if you feel that you get value from the orchestration that full k8s offers.
Note that on k8s or podman, you can get rid of most of the 'cost' of that virtualization for single-placement and/or long-lived pods by simply sharing an emptyDir or volume between pod members.
# Create Pod
podman pod create --name pgdemo-pod
# Create client (PGPASSWORD is what psql actually reads, so the localhost connection below won't prompt)
podman run -dt -v pgdemo:/mnt --pod pgdemo-pod -e PGPASSWORD=password --name client docker.io/ubuntu:25.04
# Unsafe hack to fix permissions in quick demo and install packages
podman exec client /bin/bash -c 'chmod 0777 /mnt; apt update ; apt install -y postgresql-client'
# Create postgres server
podman run -dt -v pgdemo:/mnt --pod pgdemo-pod -e POSTGRES_PASSWORD=password --name pg docker.io/postgres:bookworm -c unix_socket_directories='/mnt,/var/run/postgresql/'
# Invoke client using unix socket
podman exec -it client /bin/bash -c "psql -U postgres -h /mnt"
# Invoke client using localhost network
podman exec -it client /bin/bash -c "psql -U postgres -h localhost"
There is enough there for you to test and see that sharing unix sockets that way is so close to native performance that there is very little cost, and a lot of security and workflow benefits to gain. As podman is daemonless, easily rootless, and on a mac even lets you ssh into the local linux VM with `podman machine ssh`, you aren't stuck with the hidden abstractions of docker-desktop, which hides all of that from you. There's a lot of value in that.
Plus you can dump a k8s like yaml to use for the above with:
podman kube generate pgdemo-pod
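And then round-trip it (on recent podman; older releases spell it `podman play kube`):
# Dump the pod to yaml, tear it down, and recreate it from the yaml
podman kube generate pgdemo-pod > pgdemo-pod.yaml
podman pod rm -f pgdemo-pod
podman kube play pgdemo-pod.yaml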
So you can gain the advantages of k8s without the overhead of the cluster, and there are ways to launch those pods from systemd, even from a local user that has zero sudo abilities. I am using it to validate that upstream containers don't dial home, by producing pcap files, and I would also typically run the above with no network on the pgsql host so it doesn't have internet access.
IMHO what gets missed is that k8s pods, while being the minimal unit of deployment, are in the general form just a collection of containers with specific shared namespaces.
As Red Hat gave podman to the CNCF in 2024, I have shifted to it, so I haven't checked whether rancher can do the same.
The point is that you don't even need the complexity of minikube on VMs; you can use most of the workflow even for the traditional model.
[0] https://kubernetes.io/blog/2025/04/25/userns-enabled-by-defa...
I think you're not familiar with Kubernetes and what features it provides.
For example, kubernetes supports blue-green deployments and rollbacks, software-defined networks, DNS, node-specific purges and taints, etc. Those are not hypervisor features.
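A couple of concrete examples, using standard kubectl (the deployment and node names are just placeholders):
# Roll back a bad deploy to the previous revision
kubectl rollout undo deployment/myapp
# Drain a node for maintenance and taint it so nothing schedules back onto it
kubectl drain node-1 --ignore-daemonsets
kubectl taint nodes node-1 maintenance=true:NoSchedule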
Also, VMs are the primitives of some cloud providers.
It sounds like you heard about how Borg/Kubernetes was used to simplify the task of putting together clusters with COTS hardware and you didn't bother to learn more about Kubernetes.
And in reality no one sizes their machines correctly. They always do some handwavey thing like "we need 4 cores, but maybe we'll burst and maybe there will be an outage, so let's double it." Now all that utilization can be watched and you can take advantage of oversubscription.
K8s is pallets; VMs are shipping containers.
The systems / storage / network teams can present a standardized set of primitives for any VM to consume that are more or less independent of the underlying bare metal.
Then the VMs can be live migrated when the inevitable hardware maintenance is needed (microcode patching, storage driver upgrades, etc.), with no downtime for the VM itself.
I did poke around a while ago to see what interfaces etcd has calling into boltdb, but the interface doesn't seem super clean right now, so the first step in getting off boltdb would be creating a clean interface that could be implemented by another db.
We have multiple hundreds of resources allocated for each build and do hundreds of builds a day. The current cluster has been doing this for a couple of years now.
There are also some issues with large Results, though I think you have to manually enable that. From their site:
> CAUTION: the larger you make the size, more likely will the CRD reach its max limit enforced by the etcd server leading to bad user experience.
And then if you use Chains you’re opening up a whole other can of worms.
I contracted with a large institution that was moving all of their cicd to Tekton and they hit scaling issues with etcd pretty early in the process and had to get Red Hat to address some of them. If they couldn’t get them addressed by RH they were going to scrap the whole project.
I'm not so sure. I mean, everything has tradeoffs, and what you need to do to put together the largest cluster known to man is not necessarily what you want to have to put together a mundane cluster.
etcd is fine for what it is, but that's a system meant to be reliable and simple to implement. Those are important qualities, but it wasn't built for scale or for speed. Ironically, etcd recommends 5 as the ideal number of cluster members and 7 as a maximum, based on Google's findings from running Chubby, because between-member latency gets too big otherwise. And since every member holds a full copy of the data, adding members doesn't add capacity: the recommended backend quota tops out around 8GiB per member. I have no idea what a typical ratio of cluster nodes to total data is, but 8GiB across 130,000 nodes only gives you about 64KiB per node, which doesn't seem like very much.
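For reference, that ceiling is set by etcd's per-member backend quota, e.g.:
# Raise the backend quota to 8GiB (the default is 2GiB); since every member
# holds the full dataset, this is also the whole cluster's effective capacity
etcd --quota-backend-bytes=8589934592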
There are other options. k3s made kine which acts as a shim intercepting the etcd API calls made by the apiserver and translating it into calls to some other dbms. Originally, this was to make a really small Kubernetes that used an embedded sqlite as its datastore, but you could do the same thing for any arbitrary backend by just changing one side of the shim.
It is kind of sad, as these nodes are running around 2k IOPS to the disk and are mostly sitting idle at the hardware level, but etcd still regularly chokes.
I did look into kine in the past, but I have no idea if it is suitable for running a high performance data store.
> When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle
The trick is you deploy your k8s clusters in multiple datacenters in the same region (think AZs in AWS terms). The control plane can span multiple AZs, which are in separate buildings but close in geography. From the setups I work on, the latency between datacenters in the same region is only about 500 microseconds.
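That's easy to verify from any two nodes in different buildings (addresses are placeholders):
# Sub-millisecond RTTs between AZs in the same region are typical, which is
# well inside etcd's default 100ms heartbeat interval
ping -c 20 10.0.2.15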
I’m not able to find the blogpost but maybe someone else can!
The whole thing stinks of: AI investors are throwing money at AI companies, so go to GCP and tell them to solve the problem at any price, so they can keep scaling without needing to build a scheduling layer above the Kubernetes control planes.
They convince every Series A startup that they need a multi-region federated control plane for their 50 microservices. I spend half my time convincing my team not to emulate Google, because we don't have Google's scale problems—we have velocity problems.
Complexity is an asset for Google (it's a moat), but a liability for the rest of us. I just want a cluster that doesn't require a dedicated ops team to upgrade.
rvz•2mo ago
Obviously this is a typical experiment at Google on running a K8s cluster at 130K nodes, but if there is a company out there that "requires" this scale, I must question their architecture and their infrastructure costs.
But of course someone will always claim that they somehow need this sort of scale to run their enterprise app. So once again, let's remind the pre-revenue startups talking about scale before they hit PMF:
Unless you are ready to donate tens of billions of dollars yearly, you do not need this.
You are not Google.
mlnj•2mo ago
It's literally Google coming out with this capability, so how is the criticism still "You are not Google"?
Rastonbury•2mo ago
jcims•2mo ago
Tostino•2mo ago
Are they cloud based VMs, or your own hardware? If cloud based, do you reprovision all of them daily and incur no cost when you are not running jobs? If it's your own hardware, what else do you do with it when not batch processing?
jcims•2mo ago
game_the0ry•2mo ago
100% agree.
People at my co are horny to adopt k8s. Really, tech leads want to put it on their resume ("resume driven development") and use a tool that was made to solve a particular problem we never had. The downside is that now we need to be proficient at it, know how to troubleshoot it, etc. It was sold to leadership as something that would make our lives easier, but the exact opposite has happened.
BruSwain•2mo ago
game_the0ry•2mo ago
scottyah•2mo ago
dilyevsky•2mo ago
You think they are just running it for fun? It's literally non-Google customers who wanted this, as was explained in the article.