Hey guys, I’d love to know how well checkpointing actually works when running Airflow on spot instances. Is it really worth it? [Checkpointing saves the state of a process during execution so it can be restored after a failure.]
I wrote an article on building fault-tolerant Airflow pipelines on spot instances (https://spot.rackspace.com/blog/building-fault-tolerant-airf...), and one decision I made was to use S3 as the external state layer and checkpoint task outputs there. Here’s a quick summary:
1. Each task writes its output to a specific S3 path.
2. When a worker node is preempted mid-task, Airflow retries the task, and the new pod reads directly from S3, picking up the last successfully written output from the upstream task.
3. Writes use replace=True, so if a task was interrupted mid-write and left a partial file, the retry simply overwrites it, keeping execution idempotent.
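The three steps above boil down to a small read-or-recompute loop keyed on a deterministic path. Here’s a minimal, runnable sketch of that pattern; all names (CheckpointStore, run_task, the key layout) are my own illustration, and an in-memory dict stands in for S3 so the logic can run locally. In the real pipeline the write would go through something like Airflow’s S3Hook.load_string(..., replace=True).

```python
class CheckpointStore:
    """Stand-in for S3: maps a deterministic per-task key to its output."""

    def __init__(self):
        self._objects = {}

    def write(self, key: str, data: bytes) -> None:
        # replace=True semantics: a partial object left by a preempted
        # worker is simply overwritten whole on retry, keeping it idempotent.
        self._objects[key] = data

    def read(self, key: str):
        return self._objects.get(key)


def run_task(store, dag_id, task_id, run_id, compute):
    # Deterministic key: the same task in the same run always maps to the
    # same path, so a retried pod finds (or rewrites) exactly that object.
    key = f"{dag_id}/{run_id}/{task_id}.json"
    cached = store.read(key)
    if cached is not None:
        return cached              # output already checkpointed upstream
    result = compute()
    store.write(key, result)       # full overwrite, never an append
    return result
```

A retry after preemption then becomes a no-op read: calling run_task again with the same dag_id/run_id/task_id returns the checkpointed bytes instead of recomputing.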
This is a very simple implementation, but I’m curious what checkpointing methods you all apply in production, or if it’s even something you bother with at all.
One big question I keep coming back to with this setup is whether the overhead of writing every task output to S3 ends up eating into the cost savings of using spot instances in the first place.
aleroawani•1h ago