I wrote this because alerting on CPU% / loadavg as a machine health indicator has burned me a few times.
The simple split I use now is:
- CPU% = how busy the cores are
- PSI (Pressure Stall Information) = how much time tasks are stalled waiting on CPU / memory / IO
In an eBPF agent I am working on (Linnix), I ended up looking at CPU and PSI together. High CPU + high PSI is interesting. High CPU + low PSI is usually just “busy”.
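For anyone who wants to try the split by hand, here is a rough sketch in Python. It is not how Linnix does it; it just parses the Linux PSI file format (`/proc/pressure/cpu`, available when the kernel has PSI enabled) and applies the two cases above. The function names and the 80% / 10% thresholds are made up for illustration, so tune them for your own hosts.

```python
from pathlib import Path


def parse_psi(text: str) -> dict:
    """Parse /proc/pressure/* contents into {"some": {...}, "full": {...}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def read_psi(resource: str = "cpu") -> dict:
    """Read PSI for one resource: "cpu", "memory", or "io" (Linux only)."""
    return parse_psi(Path(f"/proc/pressure/{resource}").read_text())


def classify(cpu_pct: float, psi_some_avg10: float,
             cpu_hi: float = 80.0, psi_hi: float = 10.0) -> str:
    """Combine CPU% with the PSI "some avg10" percentage.

    cpu_hi and psi_hi are hypothetical thresholds, not recommendations.
    """
    if cpu_pct >= cpu_hi and psi_some_avg10 >= psi_hi:
        return "investigate"  # high CPU + high PSI: tasks are being stalled
    if cpu_pct >= cpu_hi:
        return "busy"         # high CPU + low PSI: usually just busy
    return "other"            # cases the split above doesn't cover
```

On a Linux box you would call something like `classify(cpu_pct, read_psi("cpu")["some"]["avg10"])`, with `cpu_pct` coming from whatever you already use to measure CPU utilization.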
This obviously doesn’t replace latency/SLO alerts at the app level. It’s only about which host metric to look at.
parth21shah•2m ago