
Fault Tolerance Benchmark: Clockwork TorchPass, TorchFT and Checkpoint Restart

https://clockwork.io/blog/keeping-distributed-training-running-through-failures/
3•danzheng•1h ago

Comments

danzheng•1h ago
Hi all — I’m Dan, founding team at clockwork.io. Today we launched TorchPass! We'd love to get your feedback.

tl;dr: we built TorchPass because large distributed training jobs fail a lot, and checkpoint restart is expensive. TorchPass addresses this by migrating the state from failed resources to spares.

In large GPU clusters, even a small failure (a GPU falling off the bus, a node crash, a network link flap) can bring down an entire distributed training job. Once you get into clusters with hundreds or thousands of GPUs, something is almost always failing: research from Meta suggests mean time to failure drops to about 7.9 hours for a 1,024-GPU cluster.
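To get a feel for how that number moves with cluster size, here is a back-of-the-envelope sketch. It assumes GPU failures are independent and occur at a constant rate, so cluster MTTF scales inversely with GPU count. That scaling model and the function name `cluster_mttf_hours` are illustrative assumptions, not something taken from the Meta research or from our benchmark.

```python
def cluster_mttf_hours(num_gpus: int,
                       ref_gpus: int = 1024,
                       ref_mttf_hours: float = 7.9) -> float:
    """Estimate cluster MTTF by scaling from a reference point.

    Assumption (not from the source): failures are independent with a
    constant per-GPU rate, so MTTF is inversely proportional to GPU count.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    per_gpu_mttf_hours = ref_mttf_hours * ref_gpus  # implied single-GPU MTTF
    return per_gpu_mttf_hours / num_gpus


if __name__ == "__main__":
    for n in (64, 1024, 16384):
        print(f"{n:>6} GPUs: ~{cluster_mttf_hours(n):.1f} h between failures")
```

Real clusters will deviate from this (correlated failures, infant mortality, network faults that are not per-GPU), so treat it as an order-of-magnitude guide only.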

The usual recovery model is to take frequent checkpoints during training, and recover from the most recent checkpoint when a failure occurs. But:

- all work since the last checkpoint is lost
- time is wasted replacing nodes and checkpoint reloading
- more time is lost restarting the entire distributed job
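The three costs above can be put into a rough per-failure estimate. This is a minimal sketch, assuming a failure is equally likely at any point within a checkpoint interval (so on average half an interval of work is lost). The function name and all the example numbers are hypothetical and are not measurements from our benchmark.

```python
def expected_loss_per_failure_min(checkpoint_interval_min: float,
                                  node_replace_min: float,
                                  reload_min: float,
                                  restart_min: float) -> float:
    """Expected wall-clock minutes lost to one failure under checkpoint restart.

    Assumption: failures land uniformly within a checkpoint interval, so the
    expected recompute time is half the interval.
    """
    recompute_min = checkpoint_interval_min / 2
    return recompute_min + node_replace_min + reload_min + restart_min


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the arithmetic.
    loss = expected_loss_per_failure_min(
        checkpoint_interval_min=30,
        node_replace_min=10,
        reload_min=5,
        restart_min=5,
    )
    print(f"expected loss per failure: {loss:.0f} min")
```

Checkpointing more often shrinks the recompute term but adds checkpoint overhead to every interval, which is the trade-off that makes this recovery model expensive at scale.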

TorchPass uses a different approach: instead of restarting the job, it migrates the failed training rank to a spare GPU and resumes training at the same step.

TorchPass supports both planned migration (triggered pre-emptively when an imminent failure is detected) and unplanned migration (triggered by a hard failure). Further details about how it works can be found here: https://clockwork.io/blog/torchpass-workload-fault-tolerance...

We ran a 3,000-step training benchmark using TorchTitan Llama-4 MoE Scout (109B) on 64 H200 GPUs with random failure injection to compare checkpoint restart, TorchPass and TorchFT.

- TorchPass completed in 405 min
- Checkpoint restart completed in 818 min
- TorchFT completed in 930 min
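For anyone who wants the relative numbers, here is a small sketch that turns the three completion times above into ratios against TorchPass. Only the three times come from the benchmark; the variable names are just for illustration.

```python
# Completion times in minutes, as reported for the 3,000-step benchmark.
results_min = {
    "TorchPass": 405,
    "Checkpoint restart": 818,
    "TorchFT": 930,
}

baseline = results_min["TorchPass"]

# Ratio of each method's wall-clock time to the TorchPass time.
slowdown = {name: minutes / baseline for name, minutes in results_min.items()}

if __name__ == "__main__":
    for name, ratio in slowdown.items():
        print(f"{name}: {results_min[name]} min ({ratio:.2f}x TorchPass)")
```

That works out to roughly 2.0x for checkpoint restart and 2.3x for TorchFT relative to TorchPass in this run.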

Checkpoint restart was slower mainly because of the time taken to restore from checkpoint, restart the training and recompute the work since the last checkpoint.

TorchFT lost almost no time to the failures, but was slower overall because it adds significant per-step overhead: it requires Gloo (rather than NCCL) for cross-replica all-reduce operations.

Happy to answer questions about the implementation and benchmarks.

essekar•35m ago
One piece of feedback we got while testing was that we might even be able to reduce the duration of checkpointing, which was a huge insight.

We are still testing and would love more collaboration from the community. If your team is running training jobs, hit us up.

Show HN: My 9-year, 4,500-song manual music archive (2017–2026)

https://michaelperry.org/music/archive.html
1•ffsoftboiled•20s ago•0 comments

I Was Interviewed by an AI Bot for a Job

https://schwarztech.net/snippets/i-was-interviewed-by-an-ai-bot-for-a-job
1•speckx•2m ago•0 comments

Reka Edge – 7B fast, efficient VLM (open-weights)

https://huggingface.co/RekaAI/reka-edge-2603
1•kwajiehao•2m ago•1 comments

Start at the Bottom of the Funnel

https://writealfa.com/blog/saas-content-marketing-strategy
1•fazkan•2m ago•0 comments

Using Unicode Half-Stars Symbols in Ratings

https://hyperborea.org/tech-tips/half-stars/
2•todsacerdoti•3m ago•0 comments

What Agentic Commerce Will Look Like

https://connordempsey.substack.com/p/what-agentic-commerce-will-actually
1•cdempsey44•3m ago•0 comments

Show HN: AgentOS- a memory system for AI agents that learns what it doesn't know

1•ajstars•3m ago•0 comments

Messenger RNA delivery to islet β cells using conjugated lipid nanoparticles

https://www.sciencedirect.com/science/article/pii/S2666379126000510
1•PaulHoule•4m ago•0 comments

Valve facing UK lawsuit over music rights in games Valve doesn't make or own

https://www.ign.com/articles/valve-facing-uk-lawsuit-over-music-rights-in-games-valve-doesnt-make...
2•anonymousab•4m ago•0 comments

LLM identifies it is being manipulated, predicts failure, then complies anyway

https://github.com/skavanagh/lebron-james-is-president
2•spkavanagh6•5m ago•1 comments

Protesters arrested under new Queensland hate speech laws

https://www.abc.net.au/news/2026-03-11/qld-protesters-arrested-hate-speech-laws/106443370
1•pseudalopex•6m ago•0 comments

About the New York Attorney General Lawsuit Against Valve

https://help.steampowered.com/en/faqs/view/6300-A6C4-519D-A3F5
2•haunter•7m ago•0 comments

Show HN: AgentClick – Human-in-the-loop review UI for AI coding agents

https://github.com/agentlayer-io/AgentClick
1•harvenstar•8m ago•1 comments

The Wiring Is More Dangerous Than the Weights

https://openguard.sh/blog/wiring-is-more-dangerous-than-the-weights/
1•jitera•9m ago•0 comments

Show HN: Prompt Engineering GUI – Become an Expert Fast

https://claude.ai/public/artifacts/159692b0-cf07-4acb-9c54-a8f478b914d2
1•logicallee•11m ago•0 comments

My PostgreSQL database got nuked lol

https://akselmo.dev/posts/they-broke-my-server/
1•todsacerdoti•11m ago•0 comments

Cost per outcome: measuring the real economics of AI workflows

1•deborahjacob•11m ago•0 comments

Nemotron 3 Super: An Open Hybrid Mamba-Transformer Moe for Agentic Reasoning

https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-m...
1•pr337h4m•11m ago•0 comments

The Debt Beneath the Dream

https://om.co/2026/03/09/the-debt-beneath-the-dream/
2•oumua_don17•11m ago•0 comments

Soviet Life – Cinema of the People (March 2026)

https://claude.ai/public/artifacts/df9a1f48-8906-4315-9069-59ec4683aa15
1•water_badger•12m ago•0 comments

xAI's Macrohard project stalls as Tesla ramps up a similar AI agent effort

https://www.businessinsider.com/xai-macrohard-project-tesla-ai-agent-stalls-2026-3
3•spenvo•13m ago•0 comments

Everyone is building AI trust frameworks; almost no one is reading the research

https://weightedthoughts.substack.com/p/everyones-building-trust-frameworks
2•ylliprifti•13m ago•1 comments

Perplexity Personal Computer

https://twitter.com/perplexity_ai/status/2031790180521427166
2•hmokiguess•14m ago•0 comments

Hustlers are cashing in on China's OpenClaw AI craze

https://www.technologyreview.com/2026/03/11/1134179/china-openclaw-gold-rush/
2•joozio•14m ago•0 comments

Applying Zipf's Law to grokking produces perpetual oscillations

https://jagilley.github.io/zipfian-grokking.html
2•threevox•14m ago•0 comments

Side chain conversations with Claude Code /btw

https://twitter.com/trq212/status/2031506296697131352
2•gbourne1•16m ago•0 comments

Replit raises $400M at a $9B valuation

https://blog.replit.com/replit-raises-400-million-dollars
4•meetpateltech•16m ago•0 comments

Show HN: Slate – Open-source AI workspace with a built-in browser

https://github.com/slate-ai/slate
3•meteor333•17m ago•0 comments

Wayfair boosts catalog accuracy and support speed with OpenAI

https://openai.com/index/wayfair
2•surprisetalk•17m ago•0 comments

Medical technology company in Michigan hit by suspected Iran-linked cyberattack

https://www.fox17online.com/news/local-news/kzoo-bc/kalamazoo/stryker-headquarters-in-portage-clo...
4•SteveNuts•18m ago•0 comments