Fault Tolerance Benchmark: Clockwork TorchPass, TorchFT and Checkpoint Restart

https://clockwork.io/blog/keeping-distributed-training-running-through-failures/

3•danzheng•1h ago

Comments

danzheng•1h ago

Hi all — I’m Dan, founding team at clockwork.io. Today we launched TorchPass! We'd love to get your feedback.

tl;dr: we built TorchPass because large distributed training jobs fail a lot, and checkpoint restart is expensive. TorchPass addresses this by migrating the state from failed resources to spares.

In large GPU clusters, even small failures (a GPU falling off the bus, a node crash, a network link flap) can bring down an entire distributed training job. And once you get into clusters with hundreds or thousands of GPUs, something is almost always failing. Research from Meta suggests mean time to failure drops to about 7.9 hours for a 1,024-GPU cluster. And when a single failure occurs, the entire distributed job crashes.
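A rough way to see how that figure scales with cluster size is sketched below. It assumes failures are independent, so the cluster failure rate grows linearly with GPU count. That is a simplification for illustration, not a claim from the Meta research, and `cluster_mttf_hours` is a hypothetical helper.

```python
# Back-of-the-envelope MTTF scaling.
# Assumption: failures are independent, so cluster MTTF is inversely
# proportional to GPU count. Real clusters will deviate from this.
REFERENCE_GPUS = 1024
REFERENCE_MTTF_HOURS = 7.9  # figure quoted above for a 1,024-GPU cluster


def cluster_mttf_hours(num_gpus: int) -> float:
    """Scale the reference MTTF inversely with cluster size."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return REFERENCE_MTTF_HOURS * REFERENCE_GPUS / num_gpus


for n in (64, 1024, 16384):
    print(f"{n:>6} GPUs: ~{cluster_mttf_hours(n):.2f} h between failures")
```

Under that assumption, a 16x larger cluster would see failures roughly 16x as often.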

The usual recovery model is to take frequent checkpoints during training, and recover from the most recent checkpoint when a failure occurs. But:

- all work since the last checkpoint is lost
- time is wasted replacing nodes and reloading the checkpoint
- more time is lost restarting the entire distributed job
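The three costs above can be added up in a simple per-failure model. This is only a sketch: the `RestartCosts` fields and the example values are hypothetical, not measurements from the benchmark, and it assumes failures land uniformly at random between checkpoints.

```python
from dataclasses import dataclass


@dataclass
class RestartCosts:
    """Hypothetical per-failure costs in minutes (illustrative values only)."""
    checkpoint_interval: float  # time between checkpoints
    node_replacement: float     # time to swap in a healthy node
    checkpoint_reload: float    # time to load the last checkpoint
    job_restart: float          # time to relaunch the distributed job


def expected_loss_per_failure(c: RestartCosts) -> float:
    # Assumption: a failure is equally likely at any point between two
    # checkpoints, so half an interval of work is lost on average.
    lost_work = c.checkpoint_interval / 2
    return lost_work + c.node_replacement + c.checkpoint_reload + c.job_restart


costs = RestartCosts(checkpoint_interval=30, node_replacement=5,
                     checkpoint_reload=10, job_restart=5)
print(expected_loss_per_failure(costs))  # 35.0
```

Checkpointing more often shrinks the lost-work term but leaves the other three untouched, which is why frequent checkpoints alone do not make restarts cheap.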

TorchPass uses a different approach: instead of restarting the job, it migrates the failed training rank to a spare GPU and resumes training at the same step.

TorchPass supports planned migration (triggered pre-emptively when an imminent failure is detected) or unplanned migration (triggered by a hard failure). Further details about how it works can be found here: https://clockwork.io/blog/torchpass-workload-fault-tolerance...

We ran a 3,000-step training benchmark using TorchTitan Llama-4 MoE Scout (109B) on 64 H200 GPUs with random failure injection to compare checkpoint restart, TorchPass and TorchFT.

- TorchPass completed in 405 min
- Checkpoint restart completed in 818 min
- TorchFT completed in 930 min
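For a quick comparison, the reported wall-clock times can be turned into ratios relative to TorchPass. This only restates the numbers above; the variable names are just for illustration.

```python
# Wall-clock times reported above, in minutes.
results = {"TorchPass": 405, "Checkpoint restart": 818, "TorchFT": 930}

baseline = results["TorchPass"]
# Ratio of each run's duration to the TorchPass run.
slowdown = {name: minutes / baseline for name, minutes in results.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {results[name]} min ({ratio:.2f}x TorchPass)")
```

So in this benchmark checkpoint restart took roughly 2.0x as long as TorchPass, and TorchFT roughly 2.3x.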

Checkpoint restart was slower mainly because of the time taken to restore from checkpoint, restart the training and recompute the work since the last checkpoint.

TorchFT lost almost no time to the failures themselves, but was slower overall because of a significant per-step overhead: it requires using Gloo (rather than NCCL) for cross-replica all-reduce operations.

Happy to answer questions about the implementation and benchmarks.

essekar•35m ago

One piece of feedback we got while testing was that we might even be able to reduce the duration of checkpointing, which was a huge insight.

We are still testing and would love more collaboration from the community. If your team is running training jobs, hit us up.
