Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
2•momciloo•58m ago•0 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
270•isitcontent•21h ago•35 comments

Show HN: Stacky – certain block game clone

https://www.susmel.com/stacky/
3•Keyframe•1h ago•0 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
365•vecti•23h ago•167 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
342•eljojo•23h ago•211 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
18•sandGorgon•2d ago•10 comments

Show HN: A toy compiler I built in high school (runs in browser)

https://vire-lang.web.app
2•xeouz•1h ago•1 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
81•phreda4•20h ago•15 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
94•antves•2d ago•70 comments

Show HN: Nginx-defender – realtime abuse blocking for Nginx

https://github.com/Anipaleja/nginx-defender
3•anipaleja•3h ago•0 comments

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
52•nwparker•1d ago•11 comments

Show HN: BioTradingArena – Benchmark for LLMs to predict biotech stock movements

https://www.biotradingarena.com/hn
28•dchu17•1d ago•12 comments

Show HN: Artifact Keeper – Open-Source Artifactory/Nexus Alternative in Rust

https://github.com/artifact-keeper
154•bsgeraci•1d ago•64 comments

Show HN: Which chef knife steels are good? Data from 540 Reddit tread

https://new.knife.day/blog/reddit-steel-sentiment-analysis
2•p-s-v•1h ago•0 comments

Show HN: MCP App to play backgammon with your LLM

https://github.com/sam-mfb/backgammon-mcp
3•sam256•5h ago•1 comments

Show HN: ARM64 Android Dev Kit

https://github.com/denuoweb/ARM64-ADK
18•denuoweb•2d ago•2 comments

Show HN: I'm 75, building an OSS Virtual Protest Protocol for digital activism

https://github.com/voice-of-japan/Virtual-Protest-Protocol/blob/main/README.md
7•sakanakana00•6h ago•1 comments

Show HN: I built Divvy to split restaurant bills from a photo

https://divvyai.app/
3•pieterdy•6h ago•1 comments

Show HN: Gigacode – Use OpenCode's UI with Claude Code/Codex/Amp

https://github.com/rivet-dev/sandbox-agent/tree/main/gigacode
20•NathanFlurry•1d ago•9 comments

Show HN: XAPIs.dev – Twitter API Alternative at 90% Lower Cost

https://xapis.dev
3•nmfccodes•3h ago•1 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
2•melvinzammit•8h ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•8h ago•2 comments

Show HN: Micropolis/SimCity Clone in Emacs Lisp

https://github.com/vkazanov/elcity
173•vkazanov•2d ago•49 comments

Show HN: Daily-updated database of malicious browser extensions

https://github.com/toborrm9/malicious_extension_sentry
14•toborrm9•1d ago•8 comments

Show HN: Falcon's Eye (isometric NetHack) running in the browser via WebAssembly

https://rahuljaguste.github.io/Nethack_Falcons_Eye/
6•rahuljaguste•20h ago•1 comments

Show HN: Compile-Time Vibe Coding

https://github.com/Michael-JB/vibecode
10•michaelchicory•10h ago•3 comments

Show HN: Horizons – OSS agent execution engine

https://github.com/synth-laboratories/Horizons
24•JoshPurtell•1d ago•5 comments

Show HN: Slop News – HN front page now, but it's all slop

https://dosaygo-studio.github.io/hn-front-page-2035/slop-news
18•keepamovin•11h ago•6 comments

Show HN: Local task classifier and dispatcher on RTX 3080

https://github.com/resilientworkflowsentinel/resilient-workflow-sentinel
25•Shubham_Amb•1d ago•2 comments

Show HN: Fitspire – a simple 5-minute workout app for busy people (iOS)

https://apps.apple.com/us/app/fitspire-5-minute-workout/id6758784938
2•devavinoth12•14h ago•0 comments

Show HN: Autonomous recovery for distributed training jobs

https://docs.tensorpool.dev/features/agent
12•tsvoboda•1w ago
Hi HN! We’re TensorPool. We help companies access and optimize large scale compute for training foundation models.

The Problem

It’s been almost a year since we finished YC, and we’ve just crossed 100,000 multinode training GPU hours run on our platform.

On those training runs, we’ve seen countless 3am job crashes because of issues like an Xid error from a flaky GPU or an S3 timeout that corrupted a checkpoint save. By the time you wake up and notice, you've lost 8+ hours of compute. You scramble to diagnose the issue, manually restart from the last checkpoint, and hope it doesn't happen again. Rinse and repeat.

For training runs that take days to weeks, this constant babysitting is exhausting and expensive. The research iteration cycles lost can also make or break a model release (especially for short reservations).

What We Built

This agent monitors your training jobs and autonomously recovers them when things go wrong. It works with Kubernetes, Slurm, and TensorPool Jobs.

We originally built the TensorPool Agent as an internal tool to help us debug failures with our own customers. Over time, we realized its performance was so good that we could automate the entire triage process. We're now releasing a public beta for people to use.

Best case: The TensorPool Agent detects the failure, diagnoses the root cause, fixes it, and restarts your job from the last checkpoint – all while you sleep ;)

Worst case: If the TensorPool Agent can't fix the issue automatically, it delivers a preliminary root cause analysis (RCA) and a list of the actions it attempted, giving you a head start on debugging.

How It Works

1) Registration – You provide credentials to your job scheduler via our dashboard. Permissions are granted on a whitelist basis; you explicitly control which actions the agent can take.

2) Monitoring – The agent continuously monitors your job for failure conditions.

3) Recovery – On failure, the agent analyzes logs and attempts to diagnose the issue. If successful, it restarts the job from the last checkpoint and resumes monitoring. If not, you get an alert with full context.
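
To make the three steps above concrete, here is a minimal sketch of a monitor/recover loop. It is not the TensorPool Agent's implementation: the class name, the callbacks, the status strings, and the `max_restarts` cap are all assumptions made up for illustration. The only ideas taken from the post are the whitelist of allowed actions, restarting from the last checkpoint on a successful diagnosis, and alerting with context otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RecoveryAgent:
    """Hypothetical monitor/recover loop; every name here is illustrative."""

    poll_status: Callable[[], str]           # assumed to return "running", "failed", or "done"
    diagnose: Callable[[], Optional[str]]    # returns a diagnosis label, or None if unknown
    restart_from_checkpoint: Callable[[], None]
    alert: Callable[[str], None]
    allowed_actions: frozenset = frozenset({"restart"})  # whitelist chosen at registration
    max_restarts: int = 3                    # assumed safety cap, not from the post
    restarts: int = 0
    history: list = field(default_factory=list)

    def step(self) -> str:
        """Run one monitoring tick and return what happened."""
        status = self.poll_status()
        if status != "failed":
            return status

        # Recovery: try to diagnose, then act only if the action is whitelisted.
        diagnosis = self.diagnose()
        self.history.append(diagnosis)
        can_restart = (
            diagnosis is not None
            and "restart" in self.allowed_actions
            and self.restarts < self.max_restarts
        )
        if can_restart:
            self.restart_from_checkpoint()
            self.restarts += 1
            return "restarted"

        # Otherwise hand off to a human with the context gathered so far.
        self.alert(f"manual intervention needed; diagnoses so far: {self.history}")
        return "alerted"
```

In practice `step()` would be called on a timer, with the callbacks wired to whatever scheduler is in use (Kubernetes, Slurm, or something else). Capping restarts is a design choice in this sketch so that a job which fails immediately after every restart ends in an alert rather than a loop.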

Target Failure Modes

The agent is specifically designed for runtime errors that occur deep into training, like:

- CUDA OOM: Memory leaks, gradient explosions

- Xid errors: GPU hardware faults (Xid 79, 63, 48, etc.)

- Distributed communication failures: NCCL timeouts, rank failures

- Storage I/O errors: Checkpoint corruption

- Network issues: S3 request timeouts on mounted object storage
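
As a rough illustration of how the failure modes listed above might be told apart from logs, here is a small rule-based classifier. This is a sketch under assumptions, not how the agent works: the labels are invented, and the regular expressions are illustrative guesses at typical messages, so they would need tuning against the logs your driver, framework, and storage client actually produce.

```python
import re
from typing import Optional

# Illustrative patterns only; real log formats vary by driver and framework version.
# Order matters: the first matching pattern wins.
PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    ("xid_error", re.compile(r"NVRM: Xid \(.*?\): (\d+)")),
    ("nccl_failure", re.compile(r"NCCL.*(timeout|timed out|unhandled system error)", re.IGNORECASE)),
    ("storage_io", re.compile(r"(checkpoint.*(corrupt|truncated)|Input/output error)", re.IGNORECASE)),
    ("s3_timeout", re.compile(r"(RequestTimeout|Read timed out)", re.IGNORECASE)),
]


def classify(line: str) -> Optional[str]:
    """Return a hypothetical failure label for one log line, or None if nothing matches."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def classify_log(lines: list) -> dict:
    """Count matches per label across a whole log, skipping unmatched lines."""
    counts: dict = {}
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

A classifier like this only catches failures that leave an explicit message, so it would be a first pass before any deeper diagnosis.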

Comments

tsvoboda•1w ago
Would love to hear how you're handling recovery for long-running training jobs today, as well as what failure modes are most common/annoying for you.
hnotshe•1w ago
We're still figuring out how to detect "silent" failures where the job doesn't crash but stops making progress — like NCCL hangs where ranks are waiting indefinitely, or gradient norm explosions that don't trigger OOM but tank loss. Right now we rely on explicit errors in logs, but curious how others approach detecting "the job is technically running but something is very wrong" (if at all)?
jpollock•1w ago
Measurement and alerting are usually done on business metrics, not on the causes. That way you catch whole classes of problems.

Not sure about expected loss, that's a decay rate?

But stuck jobs show up via tasks being processed and average latency.
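
One way to sketch the progress-based alerting discussed in this thread is a watchdog that tracks training steps rather than error messages. Everything here is an assumption for illustration: the class name, the thresholds, and the idea of comparing each loss against a rolling mean (which presumes a positive loss). It flags a job that has stopped completing steps, as in an NCCL hang, and a loss that jumps or goes non-finite without crashing the job.

```python
import math
from collections import deque
from typing import Optional


class ProgressWatchdog:
    """Hypothetical detector for jobs that are running but not making progress."""

    def __init__(self, stall_seconds: float, loss_window: int = 50, spike_factor: float = 3.0):
        self.stall_seconds = stall_seconds
        self.spike_factor = spike_factor
        self.losses = deque(maxlen=loss_window)  # recent healthy losses
        self.last_step_time: Optional[float] = None

    def record_step(self, now: float, loss: float) -> Optional[str]:
        """Call after every training step; returns a warning label or None."""
        self.last_step_time = now
        if math.isnan(loss) or math.isinf(loss):
            return "non_finite_loss"
        # Only judge spikes once the window is full, so early noisy steps are ignored.
        if len(self.losses) == self.losses.maxlen:
            baseline = sum(self.losses) / len(self.losses)
            if loss > self.spike_factor * baseline:
                return "loss_spike"  # spike is not added to the baseline
        self.losses.append(loss)
        return None

    def check_stall(self, now: float) -> bool:
        """Call from a separate timer; True if no step finished within stall_seconds."""
        return self.last_step_time is not None and now - self.last_step_time > self.stall_seconds
```

Timestamps are passed in rather than read inside the class, so a caller could feed it `time.monotonic()` in production and fixed values in tests. The stall check has to run outside the training loop, since a hung rank never reaches `record_step`.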