frontpage.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
44•theblazehen•2d ago•5 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
636•klaussilveira•13h ago•187 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
934•xnx•18h ago•549 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
35•helloplanets•4d ago•30 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
112•matheusalmeida•1d ago•28 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
13•kaonwarb•3d ago•11 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
44•videotopia•4d ago•1 comment

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
222•isitcontent•13h ago•25 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
214•dmpetrov•13h ago•105 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
323•vecti•15h ago•142 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
372•ostacke•19h ago•94 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
359•aktau•19h ago•181 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
478•todsacerdoti•21h ago•236 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
277•eljojo•16h ago•165 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
406•lstoll•19h ago•273 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
85•quibono•4d ago•21 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
57•kmm•5d ago•3 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
26•romes•4d ago•3 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
16•jesperordrup•3h ago•10 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
245•i5heu•16h ago•193 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
13•bikenaga•3d ago•2 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
54•gfortaine•11h ago•22 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
143•vmatsiiako•18h ago•64 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
284•surprisetalk•3d ago•38 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1061•cdrnsf•22h ago•438 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
135•SerCe•9h ago•121 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
178•limoce•3d ago•96 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
70•phreda4•12h ago•14 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
28•gmays•8h ago•11 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
63•rescrv•21h ago•23 comments

A bug that taught me more about PyTorch than years of using it

https://elanapearl.github.io/blog/2025/the-bug-that-taught-me-pytorch/
465•bblcla•3mo ago

Comments

brilee•3mo ago
Great write-up, but I admit that I found the interweaving of human and AI-written content/headlines/summaries pretty distracting. I kept on wanting to scroll past, but had to keep on backtracking to find the human thread again.

I think if you want to give your reader a quick intro to, e.g., what is the Adam optimizer, a simple link to Wikipedia is fine. No need to copy-paste an AI tutorial on Adam into the blog post.

CaptainOfCoit•3mo ago
To be fair, you can easily click to hide those expanded sections. I found it a neat compromise between linking to (usually) obtuse Wikipedia articles, which aren't written for laypeople, and forcing me to read through stuff I already know: I just hid the sections I already understood but found value in the others.
reilly3000•3mo ago
I came here to say the same thing. Claude’s voice was pretty evident, but became actually grating when the header was “The Fix”.
cadamsdotcom•3mo ago
Sounds like Placeholder should somehow be split into InputPlaceholder and OutputPlaceholder, based on the usage.

Even if the two classes were otherwise identical, the split could help future folks remember that copying back is platform-specific: “hm, we wrote to an OutputPlaceholder but didn’t read back from it, that seems wrong”.
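
Roughly this shape, in a Python sketch (hypothetical names; the real Placeholder lives in the C++ MPS backend):

```python
import torch

class InputPlaceholder:
    """Wraps a tensor the kernel only reads; a contiguous copy is always safe."""
    def __init__(self, tensor: torch.Tensor):
        self.original = tensor
        self.staging = tensor.contiguous()  # no-op if already contiguous

class OutputPlaceholder:
    """Wraps a tensor the kernel writes into; owns the copy-back step."""
    def __init__(self, tensor: torch.Tensor):
        self.original = tensor
        self.staging = tensor.contiguous()

    def sync_back(self):
        # If contiguous() had to materialize a temporary buffer, the kernel's
        # writes never reached the original storage, so copy them back.
        if self.staging.data_ptr() != self.original.data_ptr():
            self.original.copy_(self.staging)
```

Then "wrote to an OutputPlaceholder but never called sync_back()" becomes something you can grep for, or even assert on.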

ramses0•3mo ago
Apps Hungarian v. System Hungarian: https://herbsutter.com/2008/07/15/hungarian-notation-is-clea...
kccqzy•3mo ago
This is a minor quibble but I don't really like the author calling Placeholder a leaky abstraction. It's just straight up an incomplete abstraction that only handles inputs but not outputs. As the author says, Placeholder should know about the difference and do the copy-back itself.
airza•3mo ago
I too have been insanely burned by an MPS bug. I wish Apple would throw an engineer or two at making sure their hardware works with PyTorch.
montebicyclelo•3mo ago
Incorrect Pytorch gradients with Apple MPS backend...

Yep this kind of thing can happen. I found and reported incorrect gradients for Apple's Metal-backed tensorflow conv2d in 2021 [1].

(Pretty sure I've seen incorrect gradients with another Pytorch backend, but that was a few years ago and I don't seem to have raised an issue to refer to... )

One might think this class of errors would be caught by a test suite. Autodiff can be tested quite comprehensively against numerical differentiation [2]. (Although this example is from a much simpler lib than Pytorch, so I could be missing something.)

[1] https://github.com/apple/tensorflow_macos/issues/230

[2] https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...
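
For anyone curious what that looks like in PyTorch terms, a minimal central-difference check (torch.autograd.gradcheck does a far more thorough version of the same idea):

```python
import torch

def numerical_grad(f, x, eps=1e-4):
    """Central-difference estimate of df/dx, one element at a time."""
    g = torch.zeros_like(x)
    flat = x.view(-1)
    for i in range(flat.numel()):
        orig = flat[i].item()
        flat[i] = orig + eps
        f_plus = f(x).item()
        flat[i] = orig - eps
        f_minus = f(x).item()
        flat[i] = orig
        g.view(-1)[i] = (f_plus - f_minus) / (2 * eps)
    return g

f = lambda t: (t.sin() * t).sum()
x = torch.randn(5, dtype=torch.float64, requires_grad=True)
f(x).backward()

# Autograd and the numerical estimate should agree closely in float64.
assert torch.allclose(x.grad, numerical_grad(f, x.detach().clone()), atol=1e-6)
```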

liuliu•3mo ago
Yeah, luckily, you can unit test these and fix them. They are not concurrency bugs (again, luckily).

BTW, numerical differentiation can only be used for testing in a limited way (due to its algorithmic complexity on big matrices). It is much easier and more effective to test against multiple implementations.

antoine-levitt•3mo ago
You can easily test a gradient using only the forward pass by doing f(x+h) ~ f(x) + dot(g, h) for a random h
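
e.g. in PyTorch (a minimal sketch):

```python
import torch

torch.manual_seed(0)
x = torch.randn(10, dtype=torch.float64, requires_grad=True)
f = lambda t: (t.sin() * t.exp()).sum()

(g,) = torch.autograd.grad(f(x), x)

# For a small random h, f(x+h) - f(x) should match the first-order term g.h.
h = 1e-6 * torch.randn(10, dtype=torch.float64)
lhs = f(x + h) - f(x)
rhs = torch.dot(g, h)
assert torch.allclose(lhs, rhs, rtol=1e-5, atol=1e-9)
```
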
gcr•3mo ago
I’ve also found that some versions of torch get quite different inference results on MPS, ignoring gradient. See https://gist.github.com/gcr/4d8833bb63a85fc8ef1fd77de6622770
CaptainOfCoit•3mo ago
Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on Day 5 of trying to debug why the GPT-OSS implementation I've made from scratch (not using PyTorch) isn't working correctly. I have it somewhat working with some naive and slow methods, but I'm now writing a tensor-core implementation and have been stuck for 2-3 days on a small numerical difference I can't explain.

Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...

saagarjha•3mo ago
How big is the numerical difference? If it's small it might be within the precision of the operation itself.
CaptainOfCoit•3mo ago
Orders of magnitude off (maybe "small numerical difference" was an understatement). My current hypothesis is that I'm doing scaling wrong somewhere, but I can't help occasionally sliding into "maybe there is something deeper wrong" territory in the evening, after another day of it...
QuadmasterXLII•3mo ago
You may be running into Jensen (Huang)’s inequality,

E(loss).cuda() <= E(loss.cuda())

CaptainOfCoit•3mo ago
Would make sense, I suppose, if I were using two different GPUs for the same thing and getting two different outcomes. But instead I have two implementations (one naive, one using tensor cores) running on the same GPU and getting different outcomes, where they should be the same.

But then this joke might be flying over my head as well.

p1esk•3mo ago
Tensor cores use lower precision, so small numerical differences should be expected.
hansvm•3mo ago
They exist, but they're not that common (give or take the "expected" numerical deviations based on the order of summation and whatnot, which can both be nontrivial and propagate error further).

Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.

Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.

It sounds like a lot of work, but writing an optimized kernel takes much longer than writing the numerical gradient check and the simple kernel, and since in numerical code it's basically impossible to identify the source of a bug without doing the equivalent of all of those checks anyway, it only takes one bug in the whole project for the effort to pay off.
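
A toy version of the "dead-simple reference" idea, with matmul standing in for whatever kernel is being optimized (sketch):

```python
import torch

def naive_matmul(a, b):
    """Triple loop, no vectorization, no blocking; slow but obviously right."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = torch.zeros(m, n, dtype=a.dtype)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i, p].item() * b[p, j].item()
            out[i, j] = acc
    return out

a = torch.randn(8, 16, dtype=torch.float64)
b = torch.randn(16, 4, dtype=torch.float64)

# The optimized path (here torch.matmul) must agree with the reference
# to within a tolerance chosen for the dtype in use.
assert torch.allclose(torch.matmul(a, b), naive_matmul(a, b), atol=1e-10)
```

The matmul itself is beside the point; the shape of the test is what matters: random inputs, a reference you trust, and a tolerance you've actually thought about.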

CaptainOfCoit•3mo ago
Thanks a lot for the pointers. I think I've taken a similar approach to what you suggest: lots of tiny tests for each step in the process, plus sanity checks between the naive version I wrote first (which works and does inference correctly) and the new kernel, which is a lot more performant but currently incorrect and produces incoherent outputs.

I'll try to replace bits by simplified versions though, probably could help at least getting closer to knowing where the issue is.

If anyone has more debugging tips, I'd greatly appreciate them! Nothing is too small or "obvious", as I'm about to lose my mind more or less.

hansvm•3mo ago
Beyond that, the tips get less general-purpose. The two big over-arching ideas are:

1. Numerical code is the canonical example of "functional" code. If you prove all the pieces correct then the result is also correct. If you prove one wrong then you know why your overall code is wrong. As such, focusing more heavily than normal on proving each piece correct is prudent. Use automated techniques (like numerical gradient checking), and use randomized inputs. It's easier than you'd think for your favorite special cases to be correct in both right and wrong algorithms. Your eyes will deceive you, so use the computer to do your spot checks.

2. I lied in (1). Especially when you start involving GPUs, it's easy to have to start worrying about variable lifetimes, UAF, double-free, un-initialized memory, accidental clobberings, and other ways in which an innocent "functional" computation can stomp on something else you're doing. Still start with all the checks from (1), and if the parts are correct and the whole is broken then you're messing up global state somewhere. Tracking that down is more art than science, but one technique is adding a "poison" field, tracking deinit count, and otherwise exposing metrics regarding those failure modes. Panic/crash when you hit an invalid state, and once you figure out where the issue happens you can triage as normal (working backward from the broken state to figure out how you got there). With a solid memory management strategy up-front you'll not see this sort of thing, but if it's not something you've thought about then I wouldn't rule it out.

3. Not really another point, just an extension of (2): corruption can show up in subtle ways (like stack-copied pointers inside a paused async function closure which occasionally gets copied by your event loop). If global state is the issue, it's worth a full audit of the application.
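
The poison-field idea, sketched in Python (hypothetical wrapper; in a real GPU runtime this would live wherever the allocations are owned):

```python
class TrackedBuffer:
    """Wraps an allocation and fails loudly on use-after-free or double-free."""
    def __init__(self, storage):
        self.storage = storage
        self.alive = True
        self.release_count = 0

    def data(self):
        assert self.alive, "use-after-free: buffer accessed after release"
        return self.storage

    def release(self):
        self.release_count += 1
        assert self.release_count == 1, "double-free detected"
        self.alive = False
        self.storage = None  # poison: any stale reference now fails loudly
```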

jjmarr•3mo ago
Consumer-visible hardware bugs are extremely uncommon nowadays. There are approximately 10x as many people working in design verification as in actual hardware design.

I say "consumer-visible" because the bugs still exist and people who can catch them early get promoted quickly and paid a lot. It's very exciting work if you can get it, since you really have to understand the full GPU to break it.

Good luck!!

gugagore•3mo ago
This is the first time I've seen "SGD" used to mean "standard gradient descent" and not "stochastic gradient descent".
tavianator•3mo ago
Presumably that's just a mistake. The author calls it "stochastic gradient descent" correctly elsewhere in the article
elanapearl•3mo ago
haha oops yeah the other comment is correct- that was just a mistake

I originally wrote "vanilla" there but didn't want to repeat that word twice in a row so swapped it for "standard" without realizing it now looked like the SGD acronym

just fixed that to avoid confusion- thanks for pointing it out!

saagarjha•3mo ago
Non-contiguous tensors have to be the #1 source of bugs in PyTorch lol
jebarker•3mo ago
This is a great write-up and I’d love to see more like it. Debugging this sort of thing in the Megatron -> PyTorch -> CUDA stack is what my team spends more than half of its time on as an ML research team.
ddelnano•3mo ago
Wouldn't the Nsight Systems suite provide coverage here? Are the tricky cases difficult to debug with the standard CUDA tooling stack?
jebarker•3mo ago
Yes, nsys is very helpful, especially when looking at perf issues. It’s often the case that bugs present the way this one did, though - you just notice that training curves have regressed somehow - so even with good tooling it can be hard to figure out where to start looking in these very complex systems. It only gets worse if the symptoms show up only when running for a long time and at scale on a cluster.
hobom•3mo ago
What a fantastic way to write a post mortem, pedagogically very useful.
dangoodmanUT•3mo ago
The tinygrad folks talk about this a lot.

Not that I understand much of what they say, but it appears there are a lot of correctness bugs in pytorch that are flying under the radar, probably having a measurable impact on the results of model quality.

It would be interesting to compare the weights of the same model trained with the two, to see if they exhibit meaningfully different behavior.

CaptainOfCoit•3mo ago
> Not that I understand much of what they say, but it appears there are a lot of correctness bugs in pytorch that are flying under the radar, probably having a measurable impact on the results of model quality.

Do you have any links to public discussion of this? If it were true, it could mean a lot of research is invalidated, which would obviously make huge news.

It also feels like something that would be relatively easy to build reproducible test cases for, so it should be easy to prove whether or not it's true.

And finally, if something is easy to validate and would make huge news, I'd expect someone to have already attempted to prove it, and, if it were true, to have published something a long time ago.

dangoodmanUT•3mo ago
Check their Twitter, I saw something either yesterday or earlier today iirc
Calavar•3mo ago
There are many more ways to degrade model performance than to enhance it, so I would expect the vast majority of bugs to lead to artificially reduced accuracy, not artificially increased accuracy.

So if PyTorch is full of numerical flaws, that would likely mean many models with mediocre/borderline performance were discarded (never published) because they just failed to meet the threshold where the authors felt it was worth their time to package it up for a mid-tier conference. A finding that many would-be mediocre papers are actually slightly less mediocre than believed would be an utterly unremarkable conclusion and I believe that's why we haven't seen a bombshell analysis of PyTorch flaws and reproducibility at NeurIPS.

A software error in, say, a stats routine or a data preprocessing routine would be a different story, because the degrees of freedom are fewer, leaving a greater probability of an error hitting a path that pushes a result to look artificially better as opposed to artificially worse.

bobbylarrybobby•3mo ago
Could this really invalidate research? Managing to produce a model that works (assuming you check all of the myriad modeling correctness checkboxes) is sufficient on its own. The fact that the modeling process itself was broken in some way — but not the assumptions made of the model inputs, or data leakage assumptions, or anything that fundamentally undermines any model produced — has no bearing on the outcome, which is the fact that you got a model that evidently did make accurate predictions.
Majromax•3mo ago
> Could this really invalidate research? Managing to produce a model that works (assuming you check all of the myriad modeling correctness checkboxes) is sufficient on its own.

In the academic sense, a model that happens to work isn't research; the product of research should be a technique or insight that generalizes.

"Standard technique X doesn't work in domain Y, so we developed modified technique X' that does better" is the fundamental storyline of many machine learning papers, and that could be 'invalidated' if the poor performance of X was caused by a hidden correctness bug avoided by X'.

p1esk•3mo ago
> a lot of research could be invalidated, so obviously would make huge news.

A lot of research is unreproducible crap. That’s not news to anyone. Plus, bugs usually make results worse, not better.

3abiton•3mo ago
That's why projects like nanochat are really cool: you can get around the limitations of such gigantic libraries while also understanding the underlying architecture.
woodson•3mo ago
Nanochat is using PyTorch under the hood. I don’t understand your point.
CamperBob2•3mo ago
They might be referring to Karpathy's earlier micrograd tutorial, where the whole thing is built from scratch. That was how I learned the basics myself.
coredog64•3mo ago
When we update Torch versions, we're required to run a test where the only change is the library change and compare the outputs. We saw a measurable improvement in accuracy by upgrading from torch 2.4.x to 2.7.x.
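
For anyone wanting to do the same, the minimal shape of that kind of test looks something like this (sketch; the file name and model are stand-ins):

```python
import os
import torch

REF_PATH = "reference_outputs.pt"  # baseline produced on the previous torch version

def model_under_test(x):
    # Stand-in for the real model; anything deterministic will do.
    torch.manual_seed(0)
    return torch.nn.Linear(16, 4)(x)

torch.manual_seed(0)
x = torch.randn(32, 16)
out = model_under_test(x)

if os.path.exists(REF_PATH):
    # Same inputs, same seeds; the only change is the library version.
    torch.testing.assert_close(out, torch.load(REF_PATH), rtol=1e-5, atol=1e-6)
else:
    torch.save(out, REF_PATH)  # first run on the old version pins the baseline
```
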
embedding-shape•3mo ago
> we're required to run a test

What do you mean by "we're required to"? Isn't that something you do with all libraries, and something you as an engineer want to do, at the very least to prove correctness? Personally, I couldn't imagine using a 3rd party library without at least having some basic tests to confirm correctness; even when I use PyTorch I do the same.

doctorpangloss•3mo ago
I see another commenter highlighted this:

> The exact same float32 code updates weights on CPU but fails on MPS

It's MPS... Exactly zero research is being impacted. Why doesn't the $3.9T corporation contribute more to torch?

ACCount37•3mo ago
I mean, some researchers clearly use Apple Silicon for their "cheap and cheerful" runs.
CrazyStat•3mo ago
As noted near the end of the article, an Apple employee had already contributed a fix to the bug:

> Checking the latest version revealed the bug was already fixed in v2.4, patched by an ML engineer at Apple last year using almost the exact same approach I’d used.

dapperdrake•3mo ago
https://moyix.blogspot.com/2022/09/someones-been-messing-wit...

TLDR: Python gevent compiled with -Ofast silently enables flush-to-zero/denormals-are-zero floating point behaviour for the whole process. Bad for PyTorch.

woodson•3mo ago
I thought the effect of these compiler flags was widely known in numerical computing. They allow, e.g., reordering of floating-point computations and in general disregard IEEE 754. As such, these results are expected, I’d think.
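
The reordering point in two lines of Python, independent of any compiler flag, just the reason reordering is not a free optimization for IEEE floats:

```python
# Floating-point addition is not associative, so a compiler allowed to
# reorder it (as -Ofast / -ffast-math permits) can change the result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0
```
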
tempay•3mo ago
Widely known amongst very niche groups, most of whom have either been burnt by the issue or heard about someone who has, and have it ingrained in their mind out of fear of ever having to debug such a thing.

I’d bet the majority of ML people are unaware, including those doing lower level stuff.

pm215•3mo ago
The unexpected thing in that particular case is that even if you were well aware and avoided the flag when building your numeric code, the way some other non-numeric-computing person compiled some unrelated non-numeric module like "gevent" could result in the fast-math behaviour being applied to your code too. (Happily gcc has now fixed this.)
Q6T46nT668w6i3m•3mo ago
The tinygrad folks talk too much
dataflow•3mo ago
Dumb question: why isn't there some kind of assertion to sanity-check some bits of the GPU results against CPU's?
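
There's no built-in global assertion, but nothing stops you from spot-checking the op you're suspicious of; a minimal sketch for MPS (the same shape works for CUDA):

```python
import torch

torch.manual_seed(0)
x_cpu = torch.randn(64, 64)

def grad_of_loss(device):
    x = x_cpu.clone().to(device).requires_grad_(True)
    loss = (x.sigmoid() ** 2).sum()
    loss.backward()
    return x.grad.cpu()

cpu_grad = grad_of_loss("cpu")
if torch.backends.mps.is_available():
    mps_grad = grad_of_loss("mps")
    # A healthy backend should agree with the CPU run to float32 tolerance.
    torch.testing.assert_close(mps_grad, cpu_grad, rtol=1e-4, atol=1e-5)
```
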
nraynaud•3mo ago
Naive question: ML tensor libraries don’t use a Z-order memory layout like textures do? It’s not beneficial like it is for textures?
empiko•3mo ago
I think that Z-order is used to increase the speed of loading textures from RAM. But this is not an issue in ML: you usually have all your model weights loaded directly into GPU memory, and you do not need caching for your inputs. At the same time, the entire ML stack is already heavily optimized for other memory layouts.
hinkley•3mo ago
Reminds me of the largest AJAX app I worked on, back when jquery was still hot and IE6 still existed as a problem.

The landing page in our app used jQuery UI’s drag-and-drop support, back around the time they declared bankruptcy on that confusing, buggy code and wouldn’t even accept bug fixes because they were replacing it component by component (which was taking almost 3x as long as predicted). We had columns you could drag items between, but they had a max height and scroll bars, and it turned out jQuery UI would let you drag items into different rows if the overflow area of adjacent drag targets overlapped your row.

The person who found it couldn’t fix it. The other fixer couldn’t fix it. I diagnosed it but the spaghetti code was a recursive mess and I could not find a spot where I could fix it. Especially given I couldn’t send in a patch to them.

So I spent half of my free time on the last day of every (2 week) sprint for almost six months before I finally found a small function I could monkey-patch to wrap it in a short-circuit check against the clipping region. I spent maybe 20-30 hours on this, a lot of it just getting back to the same situation to debug. But it felt like it took forever to fix.

The short circuit also made drag and drop faster, which had been just on the edge of distracting, particularly on a crowded page.

CaptainOfCoit•3mo ago
I remember many similar cycles of having different browsers open side-by-side, and trying to pinpoint (without the developer tools we know and love today) the exact reason why one border was one pixel in one browser, and two pixels in the other, throwing the whole layout off.

Also remembering when Firebug for Firefox appeared, and made so many things so much easier. Suddenly things that took hours took days, and it was so much easier when you had some introspection tools.

yard2010•3mo ago
* { border: red 1px solid } Remember when IE6 was a thing? The kids today are angry at chrome for good reasons and yet, there was a time in which the most popular browser didn't implement jack shit from the specs. And it was the kind of browser that ships with the OS.

God the bad karma for working with this crap. I'm glad it's over.

hinkley•3mo ago
I had to do a reflow reordering trick on a sibling page in that app and it doubled or tripled the speed on FF and safari, but on IE6 the test case went from 30s to 3.5s. Good Christ.
hinkley•3mo ago
That bug took me on a whirlwind tour of that code and I understand why they wanted to start over. Woof.
tosapple•3mo ago
> and made so many things so much easier. Suddenly things that took hours took days

Inverse? Shouldn't it be "things that took days took hours"?

ipsum2•3mo ago
Apple used to contribute to the PyTorch MPS backend, but decided to create their own framework (MLX) instead, fragmenting the ecosystem for very little gain. (MLX is basically PyTorch, but invented-at-apple)

Meta, the creator and main contributor to PyTorch, does not use Macs for their day-to-day ML work (they focus on GPUs and CPUs), so the MPS backend is sadly incomplete and has errors like the one you see here.

almostgotcaught•3mo ago
none of this is correct (except the part where FB doesn't use apple in prod).

EDIT: for the downvoters - i'll repeat, this is not a correct assessment of the relationship between Apple and PyTorch. but you can keep downvoting if you want <shrug>

ipsum2•3mo ago
Please be specific if you have anything to say. By the way, the co-creator and core maintainer of PyTorch has the same opinion as me.

https://x.com/soumithchintala/status/1978848796953161754

"MacStudio you ask?

Apple Engineering's *actual* time spent on PyTorch support hasn't given me confidence that PyTorch Mac experience would get anywhere close to NVIDIA's any time soon, if ever.

The Meta engineers continue to do a huge amount of heavy-lifting for improving the MPS backend, including feeling the responsibility for the Mac experience. Apple's priorities keep changing, the number of engineering hours they contribute keeps changing and their interest in actually and wholly owning the PyTorch MPS backend keeps varying.

If Apple wants MacStudio to become an actual AI devbox, and not just an AI inference machine, then prioritizing software support for PyTorch (>90% marketshare in AI) would probably be a good idea."

hedgehog•3mo ago
Apple has never cared about ML research on their hardware. I've never been able to pin down a specific reason why; the best I can figure is that they don't see it bringing in enough additional hardware sales to be a focus.
almostgotcaught•3mo ago
lol @ quoting soumith - the guy's sole job responsibility is tweeting.
ipsum2•3mo ago
If you have more knowledge than the core maintainer of pytorch, why are you unwilling to share, instead of snarking?
almostgotcaught•3mo ago
He's not a core maintainer and hasn't been for years - pytorch's contributors are completely public

https://github.com/pytorch/pytorch/graphs/contributors

ipsum2•3mo ago
Got it, you're a bad troll. He's listed as the "Lead Core Maintainer" on PyTorch.
almostgotcaught•3mo ago
i'm sure mark zuckerberg is as well <shrug>
Q6T46nT668w6i3m•3mo ago
Everyone I know at Meta uses a Mac
ipsum2•3mo ago
No one at Meta runs local inference on a Mac, unless it's for fun.
sampton•3mo ago
MLX and MPS are two completely different teams within Apple. It's more that the MPS team doesn't have control of or visibility into the PyTorch roadmap and can only contribute so much from their side.
mirekrusin•3mo ago
Nice work, and surprising. I'd have imagined implementations are cross-tested all the time and this kind of bug has no way of appearing?
modeless•3mo ago
Another reason people use Nvidia. You know that Nvidia is the most used backend and the most likely to have this kind of bug found and fixed before you encounter it.
cryber•3mo ago
this is a great writeup! methodical without being pedantic.
hershyb_•3mo ago
awesome read!
anal_reactor•3mo ago
If I understand correctly, the root cause of the bug was improper use of object-oriented programming. A `Placeholder` object behaves differently depending on how it was created, and requires the user to be aware of this. The check `if is_contiguous` should only ever exist inside the code of the `Placeholder` class.
albertzeyer•3mo ago
The bug was with non-contiguous data in tensors.

I also had a very similar bug a while ago, broken gradients due to non-contiguous data for masked_select: https://github.com/pytorch/pytorch/issues/99638

In my case it was easier to identify: I had another implementation of my loss function before that did not use masked_select. But then I thought I could be clever and use masked_select to take out the non-masked frames and calculate the loss only on those. But it wasn't working. Also, it only happened for some models, not all. It turned out it was always happening when the data coming out of the model was non-contiguous.

I think bugs with non-contiguous data are not so uncommon. I wonder how many of them we still have.
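
For anyone who hasn't hit this before: "non-contiguous" just means the view's strides no longer describe a plain row-major block of memory, which is easy to produce by accident:

```python
import torch

x = torch.randn(4, 6)
print(x.is_contiguous())          # True
print(x.t().is_contiguous())      # False: transpose only swaps strides
print(x[:, ::2].is_contiguous())  # False: strided slicing skips elements

# .contiguous() materializes a fresh row-major copy when needed; hidden
# copies like this are exactly what the buggy code paths mishandle.
y = x.t().contiguous()
print(y.is_contiguous())          # True
```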

dcl•3mo ago
Is this why I can't seem to fine-tune YOLO models on an Apple M4? The loss hits NaN after a few batches. The same code on a Windows PC and on Google Colab (CPU and GPU) is fine...
EdwardDiego•3mo ago
Kudos to Elana for a) such a thorough deep dive and b) a great write-up of it. I understand very little about ML libraries, but was able to follow this easily :)
farhanhubble•3mo ago
Great work hunting the bug down the stack. The write-up is top notch. I wish I had documented some of the nastiest bugs I've found in such detail.

Funnily enough, only a few days ago I was thinking about just how far the field has come since 2014 or so, when you'd build a computational graph, initialize weights manually, and so on, versus now, where most of the time you just use a library like Ultralytics or HuggingFace. Then I thought about just how many deep, undetected bugs there would be in this mountain of abstraction. Bugs that make the computation invalid.

Rileyen•3mo ago
Just read the article and it instantly brought back memories of when I spent days trying to fix a broken loss in a PyTorch model. Turned out I had passed the wrong optimizer parameters. I ended up digging all the way from the model to the CUDA kernel. Debugging took longer than training.

What’s the trickiest bug you’ve ever run into?