luiscape•20h ago
We have been using the new CUDA Checkpoint API (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CH...) in combination with gVisor's checkpoint/restore API and our custom file system to greatly reduce container cold-boot time. This is particularly impactful if you need to warm up GPUs, for example if you are using torch.compile (i.e. you skip torch.compile entirely on a restored cold boot).
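For readers who want to experiment with the GPU side of this on its own: NVIDIA ships a small `cuda-checkpoint` CLI that wraps the driver-level checkpoint API, and it is commonly paired with CRIU for the CPU-side process state. The sketch below shows that standalone flow. Note this is an illustrative approximation, not the commenter's actual pipeline (they use gVisor's checkpoint/restore rather than CRIU), and it assumes root, a running CUDA process, and the `cuda-checkpoint` binary installed. The guard makes it degrade to a no-op on machines without the tools.

```shell
#!/bin/sh
# Sketch: checkpoint/restore a CUDA process with cuda-checkpoint + CRIU.
# This approximates the flow described above; the commenters use gVisor's
# checkpoint/restore instead of CRIU.

PID=${1:-12345}            # PID of the running CUDA process (placeholder default)
CKPT_DIR=${2:-/tmp/ckpt}   # where CRIU writes its image files

# Guard: only attempt the real flow if both tools are present.
STATUS=DONE
for tool in cuda-checkpoint criu; do
  command -v "$tool" >/dev/null 2>&1 || STATUS=SKIP
done

if [ "$STATUS" = DONE ]; then
  mkdir -p "$CKPT_DIR"

  # 1. Drain GPU state (device memory, CUDA contexts) into host memory,
  #    leaving the process with no live GPU resources.
  cuda-checkpoint --toggle --pid "$PID"

  # 2. Checkpoint the now GPU-free process tree with CRIU.
  criu dump --tree "$PID" --images-dir "$CKPT_DIR" --shell-job

  # ... later, possibly after a container cold boot:

  # 3. Restore the process from the CRIU images (CRIU preserves the PID).
  criu restore --images-dir "$CKPT_DIR" --shell-job --restore-detached

  # 4. Toggle the saved state back onto the GPU; the process resumes with
  #    its warmed-up state (e.g. torch.compile artifacts already in place).
  cuda-checkpoint --toggle --pid "$PID"
else
  echo "SKIP: cuda-checkpoint/criu not installed; flow shown in comments only"
fi
```

The `--toggle` step is what makes this work across machines: once GPU state lives in host memory, the process looks like any other checkpointable process to CRIU (or gVisor).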