frontpage.

Keeping 20k GPUs healthy

https://modal.com/blog/gpu-health
43•jxmorris12•4d ago

Comments

bflesch•2h ago
In his newsletter Ed Zitron hammered home the point that GPUs depreciate quickly, but these kinds of reliability issues are shocking to read. GPU failures are so common that they warrant a 24/7 Slack channel with customers like Meta (who apparently can't set up a cluster themselves..).

Ed Zitron also called out the business model of GPU-as-a-service middleman companies like Modal as deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds. Assuming they are VC funded, the VCs need returns for their funds.

Unlike the fiber laid during the dot-com boom, the GPUs in use today eventually end up in the trash bin. These GPUs are treated like toilet paper, you use them and throw them away, nothing you will give to the next generation.

Who will be the one who marks down these "assets"? Who is providing money to buy the next batch of GPUs, now that billions are already spent?

Maybe we'll see a wave of retirements soon.

> It’s underappreciated how unreliable GPUs are. NVIDIA’s hardware is a marvel, the FLOPs are absurd. But the reliability is a drag. A memorable illustration of how AI/ML development is hampered by reliability comes from Meta’s paper detailing the training process for the LLaMA 3 models: “GPU issues are the largest category, accounting for 58.7% of all unexpected issues.”

> Imagine the future we’ll enjoy when GPUs are as reliable as CPUs. The Llama3 team’s CPUs were the problem only 0.5% of the time. In my time at Modal we can’t remember finding a single degraded CPU core.

> For our Enterprise customers we use a shared private Slack channel with tight SLAs. Slack is connected to Pylon, tracking issues from creation to resolution. Because Modal is built on top of the cloud giants and designed for dynamic compute autoscaling, we can replace bad GPUs pretty fast!

ares623•1h ago
I suppose NVIDIA could invest in making their GPUs more reliable? But then that'll make everything else even more expensive lol. If only one of the companies in the chain could take one for the team.
touisteur•3m ago
And NVIDIA supposedly has the exact know-how for reliability, as their Jetson 'industrial' parts are qualified for 10-15 years at maximum temperature. Of course, Jetson is at another point on the FLOPs-and-watts curve.

Just wondering if reliability increases when you slow down your use of GPUs a bit, like pausing more often and not chasing every bubble and every nvlink-all-reduce optimization.
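One concrete knob in that direction is the board power limit, which nvidia-smi can lower at runtime. A sketch, not a recommendation: the 400 W cap and the GPU index are arbitrary placeholders, and whether a lower limit actually improves reliability is exactly the open question here.

```shell
# Inspect the current, default, and max enforceable power limits (read-only)
nvidia-smi -q -d POWER

# Enable persistence mode, then cap GPU 0 at 400 W (requires root)
sudo nvidia-smi -pm 1
sudo nvidia-smi -i 0 -pl 400
```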

pixl97•1h ago
>These GPUs are treated like toilet paper, you use them and throw them away, nothing you will give to the next generation.

I'm guessing this may be highly dependent on what the bathtub curve looks like, and how much the provider wants to spend on cooling.

Of course with Nvidia being a near monopoly here, they might just not give a fuck and will pump out cards/servers with shitty reliability rates simply because people keep buying them and they don't suffer any economic loss or have to sit in front of a judge.

It'd be interesting to see what the error rate per TFLOP (no /s, we're looking at operations, not time) is compared to older-generation cards.
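Since the bathtub curve came up: it's commonly modeled as a sum of Weibull hazards, one decreasing (infant mortality), one flat (random failures), one increasing (wear-out). A minimal sketch in Python; the shape and scale parameters below are illustrative, not fitted to any GPU data.

```python
import math

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k/lam) * (t/lam)**(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    # infant mortality (shape < 1) + random (shape = 1) + wear-out (shape > 1)
    return (weibull_hazard(t, 0.5, 10.0)
            + weibull_hazard(t, 1.0, 100.0)
            + weibull_hazard(t, 5.0, 60.0))

# Hazard falls early on, flattens, then climbs again as the part wears out
for t in [1, 5, 20, 40, 60]:
    print(f"t={t:>2} months  h(t)={bathtub_hazard(t):.3f}")
```

How hard the providers run the cards effectively shifts the wear-out term left, which is why the cooling budget matters.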

topaz0•25m ago
> Of course with Nvidia being a near monopoly here, they [...] will pump out cards/servers with shitty reliability rates simply because people keep buying them and they don't suffer any economic loss or have to sit in front of a judge.

Presumably this can't last that much longer, because the people that are buying/running these are already taking on loads of debt/venture capital to buy the past/current round of hardware without seeing much revenue from it. It's much harder to ask investors for multiples of your annual revenue just to maintain your current capabilities than it was a couple years ago to ask for many multiples of your revenue to expand your capabilities dramatically.

charles_irl•24m ago
> Ed Zitron also called out the business model of GPU-as-a-service middleman companies like Modal as deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds.

You got a link for that? I work on Modal and would be interested in seeing the argument!

We think building a proper software layer for multitenant demand aggregation on top of the public clouds is sufficient value-add to be a sustainable business (cf. DBRX and Snowflake).

bluedino•1h ago
I help run a fleet of GPU servers, and I might see 1 DIMM or SSD failure for every 50-100 GPU failures.

I realize NVIDIA is just cranking them out as fast as they can, but the quality is terrible. They overheat, disappear after a reboot, fall off the bus, throw memory failures, and then mix in all the software crashes your users generate...

Our current server vendor is actually good at replacing them, unlike our previous vendor, but the failure rates are just insane. If any other component failed this much we'd have the vendor buy the servers back.

dlcarrier•1h ago
They're also run far closer to the edge of their operational limits than CPUs, so you're far more likely to get one that barely passes manufacturing tests, then degrades just a little tiny bit and stops working.
bigwheels•1h ago
FWIW, NVIDIA enterprise hardware does come with good warranty and prompt RMA service.

A deep dive on why these beastly cards fail so frequently compared to all other common current day hardware would be fascinating!

indoordin0saur•17m ago
I don't know much about the subject, but GPUs were originally meant for gaming: they would run for a few to several hours a day and then get rest periods, and the power draw would vary while they were actively used. With constant 24/7 usage at max capacity, is it just possible that they are being pushed beyond what they were originally engineered for?
thundergolfer•39m ago
Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.

  Component                      Type       MTBF (yrs)  AFR
  ─────────────────────────────────────────────────────────
  SSD                            Hardware   ~100        ~1%
  RAM uncorrectable error        Hardware   ~75         ~1-4%
  NVIDIA A100 critical error†    Hardware   0.18 (65d)  -
  NVIDIA H100 critical error†    Hardware   0.15 (50d)  -

† “Critical error” refers to an NVIDIA Xid or sXid error which is not recoverable, requiring an application and GPU reset.

Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row remapping errors. A lot seem to be, like another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
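For what it's worth, the MTBF and AFR columns above tie together under the usual constant-hazard (exponential) assumption, AFR = 1 − exp(−1/MTBF). A quick sketch, reproducing the table's rough figures:

```python
import math

def afr_from_mtbf(mtbf_years):
    """Annualized failure rate assuming exponentially distributed failures."""
    return 1.0 - math.exp(-1.0 / mtbf_years)

for name, mtbf in [("SSD", 100), ("RAM uncorrectable", 75),
                   ("A100 critical error", 0.18), ("H100 critical error", 0.15)]:
    print(f"{name:<20} MTBF {mtbf:>6} yr  ->  AFR {afr_from_mtbf(mtbf):.1%}")
```

The same formula puts the implied H100 "AFR" above 99%, i.e. essentially every card hits at least one critical error per year.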

shrubble•21m ago
If you rebooted every server after 35 days, would that get rid of many of the problems?
layoric•4m ago
I'm quite surprised the A100 is not much better, since the power levels for the Ampere cards are, I believe, a lot lower.

Does this mean that even a model which fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
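A back-of-the-envelope answer to that question, assuming independent, exponentially distributed failures and the ~50-day H100 MTBF quoted upthread (both assumptions, not measurements): the chance an N-GPU job runs T days with no critical error is exp(−N·T/MTBF).

```python
import math

def p_clean_run(n_gpus, run_days, mtbf_days=50):
    """P(no critical GPU error) assuming independent exponential failures."""
    return math.exp(-n_gpus * run_days / mtbf_days)

# Single 8-GPU server, three-week training run
print(f"{p_clean_run(8, 21):.1%}")  # ~3.5% chance of finishing untouched
```

At those odds, checkpoint-and-resume looks effectively mandatory even at single-server scale.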

eleventyseven•9m ago
> Today, we’re sharing our GPU reliability system as both a demonstration of our commitment to Modal customers and as a guide for fellow travelers renting hyperscaler or neocloud cards. It’s dangerous to go alone! Take this.

> We’ve chosen not to refer to cloud providers directly, but instead give them anonymized A, B, C, D identifiers. If you want to know who’s who, track the clues or buy us a beer sometime.

Come on, either name names or admit it is pure PR.

Edit: or will someone who can decode the clues weigh in?

smsx•3m ago
Are the numbers in the H100 PCIe vs SXM table swapped for rows 3 onwards? It looks to me like the PCIe is showing higher GiB/s numbers, which is counter to expectations. Or am I misunderstanding those benchmarks?

GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers

https://gptzero.me/news/neurips/
561•segmenta•6h ago•285 comments

Show HN: isometric.nyc – giant isometric pixel art map of NYC

https://cannoneyed.com/isometric-nyc/
367•cannoneyed•4h ago•105 comments

Qwen3-TTS family is now open sourced: Voice design, clone, and generation

https://qwen.ai/blog?id=qwen3tts-0115
353•Palmik•7h ago•104 comments

Why does SSH send 100 packets per keystroke?

https://eieio.games/blog/ssh-sends-100-packets-per-keystroke/
142•eieio•2h ago•92 comments

CSS Optical Illusions

https://alvaromontoro.com/blog/68091/css-optical-illusions
91•ulrischa•3h ago•10 comments

Compiling Scheme to WebAssembly

https://eli.thegreenplace.net/2026/compiling-scheme-to-webassembly/
23•chmaynard•4d ago•3 comments

'Askers' vs. 'Guessers' (2010)

https://www.theatlantic.com/national/2010/05/askers-vs-guessers/340891/
36•BoorishBears•9h ago•16 comments

Recent discoveries on the acquisition of the highest levels of human performance

https://www.science.org/doi/abs/10.1126/science.adt7790
54•colincooke•3h ago•21 comments

I was banned from Claude for scaffolding a Claude.md file?

https://hugodaniel.com/posts/claude-code-banned-me/
213•hugodan•2h ago•155 comments

Tree-sitter vs. Language Servers

https://lambdaland.org/posts/2026-01-21_tree-sitter_vs_lsp/
177•ashton314•6h ago•48 comments

City Weather Explorer (3D comparison)

https://awjuliani.github.io/weather-explore/
15•emot•10h ago•2 comments

Launch HN: Constellation Space (YC W26) – AI for satellite mission assurance

https://constellation-io.com/
24•kmajid•4h ago•5 comments

Show HN: CLI for working with Apple Core ML models

https://github.com/schappim/coreml-cli
9•schappim•1h ago•0 comments

Reverse engineering Lyft Bikes for fun (and profit?)

https://ilanbigio.com/blog/lyft-bikes.html
24•ibigio•4h ago•5 comments

AnswerThis (YC F25) Is Hiring

https://www.ycombinator.com/companies/answerthis/jobs/r5VHmSC-ai-agent-orchestration
1•ayush4921•4h ago

'Active' sitting is better for brain health: review of studies

https://www.sciencealert.com/not-all-sitting-is-equal-one-type-was-just-linked-to-better-brain-he...
23•mikhael•2h ago•12 comments

Show HN: First Claude Code client for Ollama local models

https://twitter.com/serafimcloud/status/2014266928853110862
11•SerafimKorablev•4h ago•4 comments

Mote: An Interactive Ecosystem Simulation [video]

https://www.youtube.com/watch?v=Hju0H3NHxVI
42•evakhoury•22h ago•3 comments


My first year in sales as technical founder

https://www.fabiandietrich.com/blog/first-year-in-sales.html
8•f3b5•5d ago•2 comments

Design Thinking Books (2024)

https://www.designorate.com/design-thinking-books/
254•rrm1977•9h ago•115 comments

A Year of 3D Printing

https://brookehatton.com/blog/making/a-year-of-3d-printing/
54•nindalf•5d ago•57 comments

Your app subscription is now my weekend project

https://rselbach.com/your-sub-is-now-my-weekend-project
75•robteix•3d ago•84 comments

TTY and Buffering

https://mattrighetti.com/2026/01/12/tty-and-buffering
31•mattrighetti•5d ago•5 comments

Show HN: BrowserOS – "Claude Cowork" in the browser

https://github.com/browseros-ai/BrowserOS
29•felarof•4h ago•13 comments

Show HN: Synesthesia, make noise music with a colorpicker

https://visualnoise.ca
18•tevans3•15h ago•3 comments

It looks like the status/need-triage label was removed

https://github.com/google-gemini/gemini-cli/issues/16728
243•nickswalker•5h ago•61 comments

Vulnerable WhisperPair Devices – Hijack Bluetooth Accessories Using Fast Pair

https://whisperpair.eu/vulnerable-devices
10•gnabgib•4d ago•4 comments

Bootstrapping Bun

https://walters.app/blog/bootstrapping-bun
34•zerf•3d ago•0 comments

Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)

https://huggingface.co/collections/Linum-AI/linum-v2-2b-text-to-video
18•schopra909•4h ago•7 comments