frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

https://github.com/localgpt-app/localgpt
66•yi_wang•2h ago•23 comments

SectorC: A C Compiler in 512 bytes (2023)

https://xorvoid.com/sectorc.html
233•valyala•10h ago•45 comments

Haskell for all: Beyond agentic coding

https://haskellforall.com/2026/02/beyond-agentic-coding
24•RebelPotato•2h ago•4 comments

Speed up responses with fast mode

https://code.claude.com/docs/en/fast-mode
144•surprisetalk•10h ago•146 comments

Software factories and the agentic moment

https://factory.strongdm.ai/
175•mellosouls•13h ago•333 comments

Brookhaven Lab's RHIC concludes 25-year run with final collisions

https://www.hpcwire.com/off-the-wire/brookhaven-labs-rhic-concludes-25-year-run-with-final-collis...
62•gnufx•9h ago•55 comments

IBM Beam Spring: The Ultimate Retro Keyboard

https://www.rs-online.com/designspark/ibm-beam-spring-the-ultimate-retro-keyboard
19•rbanffy•4d ago•4 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
172•AlexeyBrin•15h ago•32 comments

Stories from 25 Years of Software Development

https://susam.net/twenty-five-years-of-computing.html
152•vinhnx•13h ago•16 comments

LLMs as the new high level language

https://federicopereiro.com/llm-high/
41•swah•4d ago•90 comments

First Proof

https://arxiv.org/abs/2602.05192
125•samasblack•12h ago•75 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
298•jesperordrup•20h ago•95 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
69•momciloo•10h ago•13 comments

FDA intends to take action against non-FDA-approved GLP-1 drugs

https://www.fda.gov/news-events/press-announcements/fda-intends-take-action-against-non-fda-appro...
96•randycupertino•5h ago•212 comments

Al Lowe on model trains, funny deaths and working with Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
98•thelok•12h ago•21 comments

Show HN: A luma dependent chroma compression algorithm (image compression)

https://www.bitsnbites.eu/a-spatial-domain-variable-block-size-luma-dependent-chroma-compression-...
35•mbitsnbites•3d ago•3 comments

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
566•theblazehen•3d ago•206 comments

Show HN: Axiomeer – An open marketplace for AI agents

https://github.com/ujjwalredd/Axiomeer
7•ujjwalreddyks•5d ago•2 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
286•1vuio0pswjnm7•16h ago•464 comments

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

https://www.windowscentral.com/microsoft/windows-11/windows-locked-me-out-of-notepad-is-the-thin-...
126•josephcsible•8h ago•154 comments

The silent death of good code

https://amit.prasad.me/blog/rip-good-code
81•amitprasad•4h ago•76 comments

Selection rather than prediction

https://voratiq.com/blog/selection-rather-than-prediction/
29•languid-photic•4d ago•9 comments

I write games in C (yes, C) (2016)

https://jonathanwhiting.com/writing/blog/games_in_c/
180•valyala•10h ago•165 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
899•klaussilveira•1d ago•275 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
225•limoce•4d ago•125 comments

Reinforcement Learning from Human Feedback

https://rlhfbook.com/
115•onurkanbkrc•15h ago•5 comments

The F Word

http://muratbuffalo.blogspot.com/2026/02/friction.html
111•zdw•3d ago•55 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
141•speckx•4d ago•224 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
143•videotopia•4d ago•48 comments

Vouch

https://twitter.com/mitchellh/status/2020252149117313349
34•chwtutha•1h ago•5 comments
Open in hackernews

Show HN: Autonomous recovery for distributed training jobs

https://docs.tensorpool.dev/features/agent
12•tsvoboda•1w ago
Hi HN! We’re TensorPool. We help companies access and optimize large scale compute for training foundation models.

The Problem

It’s been almost a year since we’ve finished YC, and we’ve just crossed 100,000 multinode training GPU hours run on our platform.

On those training runs, we’ve seen countless 3am job crashes because of issues like an Xid error from a flaky GPU or an S3 timeout that corrupted a checkpoint save. By the time you wake up and notice, you've lost 8+ hours of compute. You scramble to diagnose the issue, manually restart from the last checkpoint, and hope it doesn't happen again. Rinse and repeat.

For training runs that take days to weeks, this constant babysitting is exhausting and expensive. The research iteration cycles lost can also make or break a model release (especially for short reservations).

What We Built

This agent monitors your training jobs and autonomously recovers them when things go wrong. It works with Kubernetes, Slurm, and TensorPool Jobs.

We originally built the TensorPool Agent as an internal tool to help us debug failures with our own customers. Over time, we realized its performance was so good that we could automate the entire triage process. We're now releasing a public beta for people to use.

Best case: The TensorPool Agent detects the failure, diagnoses the root cause, fixes it, and restarts your job from the last checkpoint – all while you sleep ;)

Worst case: If the TensorPool agent can't fix the issue automatically, it delivers a preliminary RCA and a list of actions it attempted, giving you a head start on debugging.

How It Works

1) Registration – You provide credentials to your job scheduler via our dashboard. Perms are granted on a whitelist basis; you explicitly control what actions the agent can take.

2) Monitoring – The agent continuously monitors your job for failure conditions.

3) Recovery – On failure, the agent analyzes logs and attempts to diagnose the issue. If successful, it restarts the job from the last checkpoint and resumes monitoring. If not, you get an alert with full context.

Target Failure Modes

The agent is specifically designed for runtime errors that occur deep into training, like:

- CUDA OOM: Memory leaks, gradient explosions

- Xid errors: GPU hardware faults (Xid 79, 63, 48, etc.)

- Distributed communication failures: NCCL timeouts, rank failures

- Storage I/O errors: Checkpoint corruption

- Network issues: S3 request timeouts on mounted object storage

Comments

tsvoboda•1w ago
Would love to hear how you're handling recovery for long-running training jobs today, as well as what failure modes are most common/annoying for you.
hnotshe•1w ago
We're still figuring out how to detect "silent" failures where the job doesn't crash but stops making progress — like NCCL hangs where ranks are waiting indefinitely, or gradient norm explosions that don't trigger OOM but tank loss. Right now we rely on explicit errors in logs, but curious how others approach detecting "the job is technically running but something is very wrong" (if at all)?
jpollock•1w ago
Measurement and alerting is usually done in business metrics, not the causes. That way you catch classes of problems.

Not sure about expected loss, that's a decay rate?

But stuck jobs are via tasks being processed and average latency.