The bug that taught me more about PyTorch than years of using it

https://elanapearl.github.io/blog/2025/the-bug-that-taught-me-pytorch/
80•bblcla•2d ago

Comments

brilee•2h ago
Great write-up, but I admit that I found the interweaving of human and AI-written content/headlines/summaries pretty distracting. I kept wanting to scroll past but had to keep backtracking to find the human thread again.

I think if you want to give your reader a quick intro to, e.g., what is the Adam optimizer, a simple link to Wikipedia is fine. No need to copy-paste an AI tutorial on Adam into the blog post.

CaptainOfCoit•1h ago
To be fair, you can easily click to hide those expanded sections. I found it a neat compromise between linking to a (usually) obtuse Wikipedia article that isn't written for laypeople and forcing me to read through material I already know: I just hid the sections I already understood and found value in the others.
cadamsdotcom•2h ago
Sounds like Placeholder should somehow be split into InputPlaceholder and OutputPlaceholder, based on the usage.

Even if the two classes were identical, the split could help future folks realize that copying back is platform-specific: “hm, we wrote to an OutputPlaceholder but didn’t read back from it, that seems wrong”.
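
A minimal sketch of what that split could look like; the class names and methods here are purely illustrative, not the actual MPS internals from the post:

    import torch

    class InputPlaceholder:
        # The kernel only reads this, so a gathered (contiguous) copy is enough.
        def __init__(self, tensor: torch.Tensor):
            self.storage = tensor.contiguous()

    class OutputPlaceholder:
        # The kernel writes into this, so a temporary buffer must be copied back.
        def __init__(self, tensor: torch.Tensor):
            self.original = tensor
            self.storage = tensor.contiguous()

        def copy_back(self):
            # If contiguous() had to allocate a new buffer, propagate results back.
            if self.storage.data_ptr() != self.original.data_ptr():
                self.original.copy_(self.storage)

Even with identical bodies, a write to an OutputPlaceholder that is never copy_back()'d sticks out in review, which is the point above.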

ramses0•1h ago
Apps Hungarian v. System Hungarian: https://herbsutter.com/2008/07/15/hungarian-notation-is-clea...
kccqzy•2h ago
This is a minor quibble but I don't really like the author calling Placeholder a leaky abstraction. It's just straight up an incomplete abstraction that only handles inputs but not outputs. As the author says, Placeholder should know about the difference and do the copy-back itself.
airza•2h ago
I too have been insanely burned by an MPS bug. I wish Apple would throw an engineer or two at making sure their hardware works with PyTorch.
montebicyclelo•1h ago
Incorrect PyTorch gradients with the Apple MPS backend...

Yep, this kind of thing can happen. I found and reported incorrect gradients in Apple's Metal-backed TensorFlow conv2d in 2021 [1].

(Pretty sure I've seen incorrect gradients with another PyTorch backend too, but that was a few years ago and I don't seem to have raised an issue to refer to...)

One might think this class of errors would be caught by a test suite. Autodiff can be tested quite comprehensively against numerical differentiation [2]. (Although that example is from a much simpler lib than PyTorch, so I could be missing something.)

[1] https://github.com/apple/tensorflow_macos/issues/230

[2] https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...
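
For readers who haven't seen the technique in [2], a minimal sketch of checking an autodiff gradient against central finite differences in PyTorch; the op and tolerances are arbitrary, and torch.autograd.gradcheck is the built-in, more thorough version of the same idea:

    import torch

    def finite_diff_grad(f, x, eps=1e-6):
        # Estimate df/dx element-wise with central differences.
        grad = torch.zeros_like(x)
        flat_x, flat_g = x.view(-1), grad.view(-1)
        for i in range(flat_x.numel()):
            orig = flat_x[i].item()
            flat_x[i] = orig + eps
            f_plus = f(x).item()
            flat_x[i] = orig - eps
            f_minus = f(x).item()
            flat_x[i] = orig
            flat_g[i] = (f_plus - f_minus) / (2 * eps)
        return grad

    x = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)
    loss_fn = lambda t: (t * t).sum()      # stand-in for the op under test

    loss_fn(x).backward()
    numeric = finite_diff_grad(loss_fn, x.detach().clone())
    print(torch.allclose(x.grad, numeric, atol=1e-5))   # expect True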

CaptainOfCoit•1h ago
Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on day 5 of debugging why the GPT-OSS implementation I've written from scratch (not using PyTorch) isn't working correctly. I have it somewhat working with some naive and slow methods, but I'm now writing a tensor-core implementation and have been stuck for 2-3 days on a small numerical difference I can't explain.

Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...

saagarjha•1h ago
How big is the numerical difference? If it's small it might be within the precision of the operation itself.
CaptainOfCoit•1h ago
Magnitudes away (maybe "small numerical difference" was an understatement). My current hypothesis is that I'm doing scaling wrong somewhere, but I can't help sliding into "maybe something deeper is wrong" territory in the evening after another day...
QuadmasterXLII•1h ago
You may be running into Jensen (Huang)'s inequality:

E(loss).cuda() <= E(loss.cuda())

CaptainOfCoit•1h ago
That would make sense, I suppose, if I were running the same thing on two different GPUs and getting two different outcomes. But instead I have two implementations (one naive, one using tensor cores) running on the same GPU and getting different outcomes, where they should be the same.

But then this joke might be flying over my head as well.

p1esk•14m ago
Tensor cores use lower precision, so small numerical differences should be expected.
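
A rough way to see how big "expected" tensor-core deviations are, assuming an Ampere-or-newer NVIDIA GPU and PyTorch's TF32 switch (sizes here are arbitrary):

    import torch

    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")

    torch.backends.cuda.matmul.allow_tf32 = False   # plain FP32 matmul
    ref = a @ b
    torch.backends.cuda.matmul.allow_tf32 = True    # tensor-core TF32 matmul
    fast = a @ b

    # Relative differences on the order of 1e-3 are normal for TF32;
    # results that are "magnitudes away" point at a real bug, not precision.
    print(((fast - ref).abs().max() / ref.abs().max()).item())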
hansvm•15m ago
They exist, but they're not that common (give or take the "expected" numerical deviations based on the order of summation and whatnot, which can both be nontrivial and propagate error further).

Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.

Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.

It sounds like a lot of work, but writing an optimized kernel is much slower than the numerical gradient checking and the simple kernel, and given how in numerical code it's basically impossible to identify the source of a bug without doing the equivalent of all of those checks, it only takes one bug in the whole project for the effort to pay off.
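
A sketch of that second check, with a matmul standing in for whichever optimized kernel you trust least; in float64 on tiny shapes the two paths should agree very tightly, so any large gap implicates the fast path:

    import torch

    def naive_matmul(a, b):
        # Textbook triple loop: no vectorization, no blocking, obviously correct.
        m, k = a.shape
        k2, n = b.shape
        assert k == k2
        out = torch.zeros(m, n, dtype=a.dtype)
        for i in range(m):
            for j in range(n):
                for p in range(k):
                    out[i, j] += a[i, p] * b[p, j]
        return out

    a = torch.randn(8, 16, dtype=torch.float64)
    b = torch.randn(16, 4, dtype=torch.float64)

    fast = a @ b                     # stand-in for the optimized kernel
    torch.testing.assert_close(fast, naive_matmul(a, b))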

CaptainOfCoit•10m ago
Thanks a lot for the pointers. I think I've taken a similar approach to what you suggest: lots of tiny (relatively speaking) tests for each step in the process, plus sanity checks between the naive version I wrote first, which works and does inference correctly, and the new kernel, which is much more performant but currently incorrect and produces incoherent output.

I'll try replacing bits with simplified versions though; that could probably help me at least get closer to knowing where the issue is.

If anyone has more debugging tips, I'd greatly appreciate them! Nothing is too small or "obvious", as I'm more or less about to lose my mind.

gugagore•1h ago
This is the first time I've seen "SGD" used to mean "standard gradient descent" rather than "stochastic gradient descent".
tavianator•1h ago
Presumably that's just a mistake; the author calls it "stochastic gradient descent" correctly elsewhere in the article.
saagarjha•1h ago
Non-contiguous tensors have to be the #1 source of bugs in PyTorch lol
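
For anyone who hasn't hit this: views like transpose() reorder strides without moving data, so anything that assumes a flat, contiguous layout can silently misread the buffer. A tiny illustration:

    import torch

    x = torch.arange(6).reshape(2, 3)
    t = x.t()                        # a view: same storage, swapped strides

    print(x.is_contiguous())         # True
    print(t.is_contiguous())         # False
    print(t.contiguous().is_contiguous())   # True: gathered into fresh memory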
jebarker•42m ago
This is a great write-up and I'd love to see more like it. Debugging this sort of thing in the Megatron -> PyTorch -> CUDA stack is what my team, an ML research team, spends more than half of its time on.
hobom•41m ago
What a fantastic way to write a post-mortem; pedagogically very useful.
