Keeping a GPU cluster healthy at scale isn't just a "nice to have": it's the difference between seamless training and a nightmare of idle nodes. That's why we built NVSentinel, our open-source system that detects, classifies, and auto-remediates hardware and software faults across Kubernetes nodes and NVSwitches.
mchmarny•1h ago