Show HN: Autonomous recovery for distributed training jobs

https://docs.tensorpool.dev/features/agent
6•tsvoboda•5h ago
Hi HN! We’re TensorPool. We help companies access and optimize large scale compute for training foundation models.

The Problem

It’s been almost a year since we finished YC, and we’ve just crossed 100,000 multinode training GPU hours run on our platform.

On those training runs, we’ve seen countless 3am job crashes because of issues like an Xid error from a flaky GPU or an S3 timeout that corrupted a checkpoint save. By the time you wake up and notice, you've lost 8+ hours of compute. You scramble to diagnose the issue, manually restart from the last checkpoint, and hope it doesn't happen again. Rinse and repeat.

For training runs that take days to weeks, this constant babysitting is exhausting and expensive. The research iteration cycles lost can also make or break a model release (especially for short reservations).

What We Built

This agent monitors your training jobs and autonomously recovers them when things go wrong. It works with Kubernetes, Slurm, and TensorPool Jobs.

We originally built the TensorPool Agent as an internal tool to help us debug failures with our own customers. Over time, we realized its performance was so good that we could automate the entire triage process. We're now releasing a public beta for people to use.

Best case: The TensorPool Agent detects the failure, diagnoses the root cause, fixes it, and restarts your job from the last checkpoint – all while you sleep ;)

Worst case: If the TensorPool Agent can't fix the issue automatically, it delivers a preliminary root cause analysis (RCA) and a list of the actions it attempted, giving you a head start on debugging.

How It Works

1) Registration – You provide credentials to your job scheduler via our dashboard. Permissions are granted on a whitelist basis; you explicitly control what actions the agent can take.

2) Monitoring – The agent continuously monitors your job for failure conditions.

3) Recovery – On failure, the agent analyzes logs and attempts to diagnose the issue. If successful, it restarts the job from the last checkpoint and resumes monitoring. If not, you get an alert with full context.
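For readers who want a concrete picture of the monitor/recover cycle, here is a minimal sketch in Python. It is not the TensorPool Agent's implementation: `supervise`, `run_job`, `diagnose`, and `SupervisionReport` are hypothetical names invented for illustration, and the scheduler and log analysis are reduced to two callbacks.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SupervisionReport:
    completed: bool                 # True if the job eventually finished
    restarts: int                   # number of checkpoint restarts performed
    actions: List[str] = field(default_factory=list)  # what was attempted


def supervise(
    run_job: Callable[[bool], Optional[str]],
    diagnose: Callable[[str], Optional[str]],
    max_restarts: int = 3,
) -> SupervisionReport:
    """Run a job, restarting from checkpoint on diagnosable failures.

    run_job(resume) returns None on success, or the failure log as text.
    diagnose(log) returns a short cause label, or None if the cause is unknown.
    Both callbacks are hypothetical stand-ins for a real scheduler and analyzer.
    """
    actions: List[str] = []
    restarts = 0
    resume = False
    while True:
        log = run_job(resume)
        if log is None:
            return SupervisionReport(True, restarts, actions)
        cause = diagnose(log)
        if cause is None or restarts >= max_restarts:
            # Give up and hand the collected context to a human.
            actions.append("escalated to human")
            return SupervisionReport(False, restarts, actions)
        actions.append(f"restart from checkpoint after: {cause}")
        restarts += 1
        resume = True
```

The restart cap matters in practice: without it, a failure that recurs on every resume (for example, a corrupted checkpoint) would loop forever instead of producing an alert.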

Target Failure Modes

The agent is specifically designed for runtime errors that occur deep into training, like:

- CUDA OOM: Memory leaks, gradient explosions

- Xid errors: GPU hardware faults (Xid 79, 63, 48, etc.)

- Distributed communication failures: NCCL timeouts, rank failures

- Storage I/O errors: Checkpoint corruption

- Network issues: S3 request timeouts on mounted object storage
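As a rough illustration of how explicit errors like these might be picked out of logs, here is a small Python sketch. The patterns are assumptions chosen for illustration, cover only some of the categories above, and are not the rules the agent actually uses; real log formats vary by driver, framework, and storage client. `classify` and `PATTERNS` are hypothetical names.

```python
import re
from typing import Optional

# Illustrative patterns only; order matters because the first match wins.
PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    # Kernel-log style line, e.g. "NVRM: Xid (PCI:0000:3b:00): 79, ..."
    ("xid", re.compile(r"Xid \(PCI:[^)]*\): (\d+)")),
    ("nccl_timeout", re.compile(r"NCCL.*(timeout|timed out)", re.IGNORECASE)),
    ("s3_timeout", re.compile(r"RequestTimeout|Read timed out", re.IGNORECASE)),
]


def classify(line: str) -> Optional[str]:
    """Return a failure label for a log line, or None if nothing matches."""
    for name, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            if name == "xid":
                # Keep the Xid code, since different codes call for different fixes.
                return f"xid_{match.group(1)}"
            return name
    return None
```

A label like `xid_79` could then drive the choice of recovery action, such as excluding the affected node before resuming.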

Comments

tsvoboda•5h ago
Would love to hear how you're handling recovery for long-running training jobs today, as well as what failure modes are most common/annoying for you.
hnotshe•3h ago
We're still figuring out how to detect "silent" failures where the job doesn't crash but stops making progress — like NCCL hangs where ranks are waiting indefinitely, or gradient norm explosions that don't trigger OOM but tank loss. Right now we rely on explicit errors in logs, but curious how others approach detecting "the job is technically running but something is very wrong" (if at all)?
jpollock•48m ago
Measurement and alerting are usually done on business metrics, not on the causes. That way you catch classes of problems.

Not sure about expected loss, that's a decay rate?

But stuck jobs are caught via tasks being processed and average latency.
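
One way to picture the progress-based detection discussed in this thread is a watchdog that tracks the training step counter and the gradient norm rather than log errors. The Python sketch below is a hypothetical example, not something either commenter described in detail: `ProgressWatchdog`, its thresholds, and the moving-average spike rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressWatchdog:
    stall_seconds: float = 600.0   # no new step for this long counts as a hang
    spike_factor: float = 10.0     # grad norm this many times the average is a spike
    ema_decay: float = 0.9         # smoothing for the running gradient-norm average
    _last_step: int = -1
    _last_progress_time: Optional[float] = None
    _grad_ema: Optional[float] = None

    def observe(self, now: float, step: int, grad_norm: float) -> Optional[str]:
        """Feed one sample; return "stalled", "grad_spike", or None."""
        if self._last_progress_time is None or step > self._last_step:
            # The job made progress: reset the stall timer.
            self._last_step = step
            self._last_progress_time = now
            if self._grad_ema is not None and grad_norm > self.spike_factor * self._grad_ema:
                # Do not fold the spike into the average, so repeats are still caught.
                return "grad_spike"
            if self._grad_ema is None:
                self._grad_ema = grad_norm
            else:
                self._grad_ema = (
                    self.ema_decay * self._grad_ema + (1 - self.ema_decay) * grad_norm
                )
            return None
        if now - self._last_progress_time > self.stall_seconds:
            return "stalled"
        return None
```

Because the stall check depends only on the step counter, it would catch a hang where ranks wait indefinitely and nothing is ever written to the logs.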
