I am an early adopter of containers with a background in HPC. From the early days, I’ve tried to merge container tech into the HPC (and HTC) stack. Containers already make packaging and deployment easier, especially in AI/ML and data science. How about checkpoint/restore?
Over the last couple of months, we at MemVerge have developed a Kubernetes Operator for transparent checkpointing and restoring, allowing you to use discounted Spot instances for long-running workloads, like bioinformatics workflows or ML training.
Here’s how it works:
- The operator attaches a PVC to your pod.
- It intercepts the STOP signal to checkpoint the pod.
- If the attached PVC contains a checkpoint when the pod starts again, the pod is restored from that checkpoint instead of starting from scratch.
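To make that control flow concrete, here is a minimal application-level sketch in Python. It is not how the operator is implemented: the operator checkpoints the whole process tree transparently, while this toy only saves a small dict. The file name `checkpoint.json`, the function names, and the state layout are all hypothetical, and the sketch assumes the stop signal is SIGTERM, which is what Kubernetes sends by default when it terminates a pod.

```python
import json
import signal
import tempfile
from pathlib import Path

# Hypothetical file name; the post does not describe the operator's on-disk layout.
CHECKPOINT_FILE = "checkpoint.json"


def start_or_restore(volume_dir: Path) -> dict:
    """Return the saved state if the volume holds a checkpoint, else a fresh state."""
    ckpt = volume_dir / CHECKPOINT_FILE
    if ckpt.exists():
        return json.loads(ckpt.read_text())
    return {"step": 0}


def write_checkpoint(volume_dir: Path, state: dict) -> None:
    """Write the state to a temp file, then rename it so readers never see a partial file."""
    tmp = volume_dir / (CHECKPOINT_FILE + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(volume_dir / CHECKPOINT_FILE)


def install_stop_handler(volume_dir: Path, state: dict) -> None:
    """On SIGTERM (assumed stop signal), checkpoint the current state and exit."""
    def handler(signum, frame):
        write_checkpoint(volume_dir, state)
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, handler)


if __name__ == "__main__":
    # A temp directory stands in for the mounted PVC in this demo.
    volume = Path(tempfile.mkdtemp())
    state = start_or_restore(volume)       # fresh start: no checkpoint yet
    install_stop_handler(volume, state)
    state["step"] += 1                     # pretend to do one unit of work
    write_checkpoint(volume, state)
    print(start_or_restore(volume))        # a restarted pod would resume from here
```

In a real pod, `volume` would be the PVC mount path rather than a temp directory. The point of the operator is that none of this logic has to live in your application.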
Here’s a 2m30s video that demonstrates interrupting a small training workload: https://youtu.be/K9yY6_2255Y
A checkpoint can be triggered by someone draining the node (e.g., due to an EC2 Spot reclaim), deleting a pod, or another operator acting on its own logic. Our checkpoint engine captures every aspect of the process tree within the container: memory pages, file descriptors, and even TCP connections if you want us to. Until recently, it was targeted at CPU use cases only. We’ve now added support for NVIDIA GPUs, with AMD GPUs coming soon (via upstream CRIU plugins).
I’ve done some typical checkpoint/restore work (e.g., Jupyter notebooks, traditional jobs) and would love to hear what kinds of workloads you’re interested in checkpointing and restoring.
You can try it out in your Kubernetes environment with our 60-day trial: https://form.typeform.com/to/vZujMYxI