The bug that taught me more about PyTorch than years of using it

https://elanapearl.github.io/blog/2025/the-bug-that-taught-me-pytorch/

80•bblcla•2d ago

Comments

brilee•2h ago

Great write-up, but I admit that I found the interweaving of human and AI-written content/headlines/summaries pretty distracting. I kept on wanting to scroll past, but had to keep on backtracking to find the human thread again.

I think if you want to give your reader a quick intro to, e.g., what is the Adam optimizer, a simple link to Wikipedia is fine. No need to copy-paste an AI tutorial on Adam into the blog post.

CaptainOfCoit•1h ago

To be fair, you can easily click to hide those expanded sections. I found it a neat compromise between "Link to (usually) obtuse Wikipedia article" which aren't usually written for laypersons, and forcing me to read through stuff I already know about, I just hid the sections I already understood but found value in the others.

cadamsdotcom•2h ago

Sounds like Placeholder should somehow be split into InputPlaceholder and OutputPlaceholder, based on the usage.

Even identical classes could help future folks know copying back is platform specific: “hm, we wrote to an OutputPlaceholder but didn’t read back from it, that seems wrong”.

ramses0•1h ago

Apps Hungarian v. System Hungarian: https://herbsutter.com/2008/07/15/hungarian-notation-is-clea...

kccqzy•2h ago

This is a minor quibble but I don't really like the author calling Placeholder a leaky abstraction. It's just straight up an incomplete abstraction that only handles inputs but not outputs. As the author says, Placeholder should know about the difference and do the copy-back itself.

airza•2h ago

I too have been insanely burned by an MPS bug. I wish Apple would throw an engineer or two at making sure their hardware works with PyTorch.

montebicyclelo•1h ago

Incorrect Pytorch gradients with Apple MPS backend...

Yep this kind of thing can happen. I found and reported incorrect gradients for Apple's Metal-backed tensorflow conv2d in 2021 [1].

(Pretty sure I've seen incorrect gradients with another Pytorch backend, but that was a few years ago and I don't seem to have raised an issue to refer to... )

One might think this class of errors would be caught by a test suite. Autodiff can be tested quite comprehensively against numerical differentiation [2]. (Although this example is from a much simpler lib than Pytorch, so I could be missing something.)

[1] https://github.com/apple/tensorflow_macos/issues/230

[2] https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...

CaptainOfCoit•1h ago

Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on Day 5 of trying to debug why my GPT-OSS implementation (not using PyTorch) I've made from scratch isn't working correctly, and while I have it somewhat working with some naive and slow methods, I'm now doing an implementation of the tensor cores and have been just stuck for 2-3 days because of some small numerical difference I can't understand why it's happening.

Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...

saagarjha•1h ago

How big is the numerical difference? If it's small it might be within the precision of the operation itself.

CaptainOfCoit•1h ago

Magnitudes away (maybe "small numerical difference" was an understatement), my current hypothesis is that I'm doing scaling wrong somewhere, but I can't help but sometimes slide into the "maybe there is something deeper wrong" territory in the evening after another day...

QuadmasterXLII•1h ago

You may be running into jensen (huang)’s inequality,

E(loss).cuda() <= E(loss.cuda())

CaptainOfCoit•1h ago

Would make sense I suppose if I was using two different GPUs for the same thing and get two different outcomes. But instead I have two implementations (one naive, one tensor cores) running on the same GPU, but getting different outcomes, where they should be the same.

But then this joke might be flying above my head as well.

p1esk•14m ago

Tensor cores use lower precision, so small numerical differences should be expected.

hansvm•15m ago

They exist, but they're not that common (give or take the "expected" numerical deviations based on the order of summation and whatnot, which can both be nontrivial and propagate error further).

Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.

Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.

It sounds like a lot of work, but writing an optimized kernel is much slower than the numerical gradient checking and the simple kernel, and given how in numerical code it's basically impossible to identify the source of a bug without doing the equivalent of all of those checks, it only takes one bug in the whole project for the effort to pay off.

CaptainOfCoit•10m ago

Thanks a lot for the pointers, I think I've done a similar approach to what you suggest, lots of tiny (relative) tests for each step in the process, and doing sort of sanity checking between the naive stuff I first wrote which works and which does inference correctly, and the new kernel which is a lot more performant, but currently incorrect and produces incoherent outputs.

I'll try to replace bits by simplified versions though, probably could help at least getting closer to knowing where the issue is.

Anyone have more debugging tips I'd greatly appreciate it! Nothing is too small or "obvious", as I'm about to lose my mind more or less.

gugagore•1h ago

This is the first time I see "SGD" to mean "standard gradient descent" and not "stochastic gradient descent".

tavianator•1h ago

Presumably that's just a mistake. The author calls it "stochastic gradient descent" correctly elsewhere in the article

saagarjha•1h ago

Non-contiguous tensors have to be the #1 source of bugs in PyTorch lol

jebarker•42m ago

This is a great write up and I’d love to see more like it. Debugging this sort of thing in the megatron->pytorch->CUDA stack is what my team spends more than half of their time on as an ML research team.

hobom•41m ago

What a fantastic way to write a post mortem, pedagogically very useful.

Let's Help NetBSD Cross the Finish Line Before 2025 Ends

10k Downloadable Movie Posters From The 40s, 50s, 60s, and 70s

The bug that taught me more about PyTorch than years of using it

Asbestosis

Formal Reasoning [pdf]

A worker fell into a nuclear reactor pool

World Simulator: Create and Play Interactive AI Worlds

You Already Have a Git Server

Eavesdropping on Internal Networks via Unencrypted Satellites

Pico-Banana-400k

Show HN: W++ – Garbage-Collected Threads

Clojure Land – Discover open-source Clojure libraries and frameworks

The Linux Boot Process: From Power Button to Kernel

Writing a RISC-V Emulator in Rust

Advent of Code 2025: Number of puzzles reduce from 25 to 12 for the first time

Connect to a 1980s Atari BBS through the web

LaserTweezer – Optical Trap

D2: Diagram Scripting Language

The Journey Before main()

California invests in battery energy storage, leaving rolling blackouts behind

PCB Edge USB C Connector Library

Bitmovin (YC S15) Is Hiring Engineering ICs and Managers in Europe

The FSF considers large language models

Why I code as a CTO

Project Amplify: Powered footwear for running and walking

NextSilicon reveals new processor chip in challenge to Intel, AMD

Show HN: Diagram as code tool with draggable customizations

You Should Feed the Bots

How programs get run: ELF binaries (2015)

GenAI Image Editing Showdown

The bug that taught me more about PyTorch than years of using it

Comments

Let's Help NetBSD Cross the Finish Line Before 2025 Ends

10k Downloadable Movie Posters From The 40s, 50s, 60s, and 70s

The bug that taught me more about PyTorch than years of using it

Asbestosis

Formal Reasoning [pdf]

A worker fell into a nuclear reactor pool

World Simulator: Create and Play Interactive AI Worlds

You Already Have a Git Server

Eavesdropping on Internal Networks via Unencrypted Satellites

Pico-Banana-400k

Show HN: W++ – Garbage-Collected Threads

Clojure Land – Discover open-source Clojure libraries and frameworks

The Linux Boot Process: From Power Button to Kernel

Writing a RISC-V Emulator in Rust

Advent of Code 2025: Number of puzzles reduce from 25 to 12 for the first time

Connect to a 1980s Atari BBS through the web

LaserTweezer – Optical Trap

D2: Diagram Scripting Language

The Journey Before main()

California invests in battery energy storage, leaving rolling blackouts behind

PCB Edge USB C Connector Library

Bitmovin (YC S15) Is Hiring Engineering ICs and Managers in Europe

The FSF considers large language models

Why I code as a CTO

Project Amplify: Powered footwear for running and walking

NextSilicon reveals new processor chip in challenge to Intel, AMD

Show HN: Diagram as code tool with draggable customizations

You Should Feed the Bots

How programs get run: ELF binaries (2015)

GenAI Image Editing Showdown