Show HN: Autonomous recovery for distributed training jobs

https://docs.tensorpool.dev/features/agent
3•tsvoboda•1h ago
Hi HN! We’re TensorPool. We help companies access and optimize large scale compute for training foundation models.

The Problem

It’s been almost a year since we finished YC, and we’ve just crossed 100,000 multinode training GPU hours run on our platform.

On those training runs, we’ve seen countless 3am job crashes because of issues like an Xid error from a flaky GPU or an S3 timeout that corrupted a checkpoint save. By the time you wake up and notice, you've lost 8+ hours of compute. You scramble to diagnose the issue, manually restart from the last checkpoint, and hope it doesn't happen again. Rinse and repeat.

For training runs that take days to weeks, this constant babysitting is exhausting and expensive. The lost research iteration cycles can also make or break a model release, especially on short reservations.

What We Built

This agent monitors your training jobs and autonomously recovers them when things go wrong. It works with Kubernetes, Slurm, and TensorPool Jobs.

We originally built the TensorPool Agent as an internal tool to help us debug failures with our own customers. Over time, we realized its performance was so good that we could automate the entire triage process. We're now releasing a public beta for people to use.

Best case: The TensorPool Agent detects the failure, diagnoses the root cause, fixes it, and restarts your job from the last checkpoint – all while you sleep ;)

Worst case: If the TensorPool Agent can't fix the issue automatically, it delivers a preliminary root cause analysis (RCA) and a list of the actions it attempted, giving you a head start on debugging.

How It Works

1) Registration – You provide credentials to your job scheduler via our dashboard. Permissions are granted on a whitelist basis; you explicitly control what actions the agent can take.

2) Monitoring – The agent continuously monitors your job for failure conditions.

3) Recovery – On failure, the agent analyzes logs and attempts to diagnose the issue. If successful, it restarts the job from the last checkpoint and resumes monitoring. If not, you get an alert with full context.
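
To make the three steps above concrete, here is a minimal sketch of the recovery step with a whitelist check. This is not the TensorPool Agent's actual code or API: `recover`, `RecoveryResult`, and the `diagnose`/`actions` callables are hypothetical names invented for illustration, and the sketch assumes a diagnosis maps directly to one named action.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class RecoveryResult:
    # Hypothetical report handed back to the user after a failure.
    recovered: bool = False
    diagnosis: Optional[str] = None
    attempted: List[str] = field(default_factory=list)


def recover(
    log_text: str,
    diagnose: Callable[[str], Optional[str]],
    actions: Dict[str, Callable[[], bool]],
    allowed: Set[str],
) -> RecoveryResult:
    """Diagnose a failure, then run the suggested action only if whitelisted."""
    result = RecoveryResult()
    action = diagnose(log_text)  # assumed to return an action name or None
    result.diagnosis = action
    # Unknown failure, or an action the user never granted: do nothing and
    # let the caller alert a human with the partial report.
    if action is None or action not in allowed or action not in actions:
        return result
    result.attempted.append(action)
    result.recovered = bool(actions[action]())
    return result
```

In this sketch, an action that is not in `allowed` is never executed, so the report's `attempted` list stays empty and the caller falls through to the alert path.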

Target Failure Modes

The agent is specifically designed for runtime errors that occur deep into training, like:

- CUDA OOM: Memory leaks, gradient explosions

- Xid errors: GPU hardware faults (Xid 79, 63, 48, etc.)

- Distributed communication failures: NCCL timeouts, rank failures

- Storage I/O errors: Checkpoint corruption

- Network issues: S3 request timeouts on mounted object storage
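
As a rough illustration of how log lines could be bucketed into the failure modes listed above, here is a small regex-based sketch. The patterns are assumptions based on commonly seen message shapes (for example the NVIDIA kernel driver's `NVRM: Xid (...): <code>` lines and the S3 `RequestTimeout` error code), not the agent's real rules, and `classify` is a hypothetical helper. Checkpoint corruption is left out because it rarely shows up as a single recognizable log line.

```python
import re

# Illustrative patterns only; real logs vary by driver, framework, and version.
PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    ("xid", re.compile(r"Xid \([^)]*\): (\d+)")),
    ("nccl_timeout", re.compile(r"NCCL.*timeout|timeout.*NCCL", re.IGNORECASE)),
    ("s3_timeout", re.compile(r"RequestTimeout")),
]


def classify(line):
    """Return (failure_mode, detail) for the first matching pattern, else None."""
    for name, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            # Only the Xid pattern captures a group (the numeric Xid code).
            detail = match.group(1) if match.groups() else None
            return name, detail
    return None


print(classify("NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."))
# prints ('xid', '79')
```

Matching is first-hit-wins in list order, so more specific patterns should come before broader ones if the list grows.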

Comments

tsvoboda•1h ago
Would love to hear how you're handling recovery for long-running training jobs today, as well as what failure modes are most common/annoying for you.
