Why I built this: AWS Spot instances kill long-lived LLM inference jobs with a 2-minute warning. Losing gigabytes of KV cache and dropping client connections is painful.
How it works: CRIU migrates the entire process, which often breaks when that process holds GPU locks. libccmc instead surgically extracts only the TCP connection: it dumps 80 bytes of socket state via TCP_REPAIR. During the move, an eBPF XDP program replies to the client with Window=0 ACKs, which puts the client's TCP stack into persist-timer (zero-window probing) mode instead of letting it time out. We restore the socket on the target, move the virtual IP (VIP) over to it, and the stream continues seamlessly.
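To make the Window=0 reply concrete, here is a minimal userspace sketch of the TCP header such a reply would carry. It is not the actual XDP program: a real one would rewrite the frame in place and also fix up the Ethernet and IP headers. The `frozen_conn` struct and all function names are hypothetical, and I am assuming the reply uses the sequence numbers captured at dump time (seq = the server's next send sequence, ack = the next byte expected from the client).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TCP_HDR_LEN  20
#define TCP_FLAG_ACK 0x10
#define IPPROTO_TCP_NUM 6

/* Hypothetical frozen-connection record (host byte order), not libccmc's layout. */
struct frozen_conn {
    uint32_t server_ip, client_ip;      /* IPv4 addresses */
    uint16_t server_port, client_port;
    uint32_t snd_nxt;                   /* next sequence number the server would send */
    uint32_t rcv_nxt;                   /* next sequence number expected from the client */
};

/* Store 16-bit and 32-bit values in network (big-endian) byte order. */
static void put16(uint8_t *p, uint16_t v) { p[0] = v >> 8; p[1] = v & 0xff; }
static void put32(uint8_t *p, uint32_t v) { put16(p, v >> 16); put16(p + 2, v & 0xffff); }

/* Accumulate big-endian 16-bit words for the Internet checksum. */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t acc)
{
    while (len > 1) { acc += (uint32_t)p[0] << 8 | p[1]; p += 2; len -= 2; }
    if (len) acc += (uint32_t)p[0] << 8;
    return acc;
}

/* One's-complement checksum over the IPv4 pseudo-header plus the TCP header.
 * Returns 0 when called on a header whose checksum field is already correct. */
static uint16_t tcp_csum(const struct frozen_conn *c, const uint8_t hdr[TCP_HDR_LEN])
{
    uint8_t pseudo[12];
    put32(pseudo, c->server_ip);        /* source: we answer on the server's behalf */
    put32(pseudo + 4, c->client_ip);
    pseudo[8] = 0;
    pseudo[9] = IPPROTO_TCP_NUM;
    put16(pseudo + 10, TCP_HDR_LEN);    /* TCP length: header only, no payload */

    uint32_t acc = sum16(pseudo, sizeof pseudo, 0);
    acc = sum16(hdr, TCP_HDR_LEN, acc);
    while (acc >> 16) acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)~acc;
}

/* Build a bare ACK that advertises a zero receive window. */
static void build_zero_window_ack(const struct frozen_conn *c, uint8_t hdr[TCP_HDR_LEN])
{
    assert(c != NULL && hdr != NULL);
    memset(hdr, 0, TCP_HDR_LEN);
    put16(hdr + 0, c->server_port);     /* source port */
    put16(hdr + 2, c->client_port);     /* destination port */
    put32(hdr + 4, c->snd_nxt);         /* sequence number */
    put32(hdr + 8, c->rcv_nxt);         /* acknowledge nothing new */
    hdr[12] = (TCP_HDR_LEN / 4) << 4;   /* data offset: 5 words, no options */
    hdr[13] = TCP_FLAG_ACK;
    put16(hdr + 14, 0);                 /* Window = 0 */
    put16(hdr + 16, tcp_csum(c, hdr));  /* checksum field was zero while summing */
}
```

Calling `build_zero_window_ack(&conn, hdr)` with the dumped state fills `hdr` with the 20 bytes to send back. Because the ack number never advances and the window stays at zero, the client should hold its unsent data and keep probing until the restored socket starts answering for real.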
The library is pure C and open source. I’ve tested it by keeping live SSE streams suspended for 10 minutes without a single dropped connection.
I would love your feedback on the eBPF mechanics, the TCP_REPAIR sequence, or any TCP edge cases I might have missed!
sunchaodong•1h ago