I am working on a platform for GPU rental and have recently encountered an extremely annoying issue.
On all machines with RTX 5090 and RTX PRO 6000 GPUs, the cards occasionally become completely unresponsive — usually after a few days of VM usage or at seemingly random times during startup/shutdown. Once it happens, the GPU can’t be reassigned. GPU is in a limbo state and doesn't respond to FLR. The only way out is a complete node reboot, which is undesirable, as it will stop VMs that are already running on the node.
H100s, B200s, and older RTX 4090s are solid, but these newer RTX cards are a menace. I understand that RTX cards are not designed for virtualization, and NVIDIA likely doesn't care; however, those cards are very well-suited for a variety of applications, and it would be nice to make virtualization work.
Is there a way to recover the GPU from this state without a complete node reboot?
We've put a $ 1,000 bounty on it if anyone is interested in helping. We also have an open position for an engineer who can help with problems like this.
dikobraz•2h ago
On all machines with RTX 5090 and RTX PRO 6000 GPUs, the cards occasionally become completely unresponsive — usually after a few days of VM usage or at seemingly random times during startup/shutdown. Once it happens, the GPU can’t be reassigned. GPU is in a limbo state and doesn't respond to FLR. The only way out is a complete node reboot, which is undesirable, as it will stop VMs that are already running on the node.
H100s, B200s, and older RTX 4090s are solid, but these newer RTX cards are a menace. I understand that RTX cards are not designed for virtualization, and NVIDIA likely doesn't care; however, those cards are very well-suited for a variety of applications, and it would be nice to make virtualization work.
Is there a way to recover the GPU from this state without a complete node reboot?
We've put a $ 1,000 bounty on it if anyone is interested in helping. We also have an open position for an engineer who can help with problems like this.