frontpage.
nvidia-smi hangs indefinitely after ~66 days

https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971
108•tosh•2h ago•17 comments

BirdyChat becomes first European chat app that is interoperable with WhatsApp

https://www.birdy.chat/blog/first-to-interoperate-with-whatsapp
464•joooscha•10h ago•289 comments

Adoption of EVs tied to real-world reductions in air pollution: study

https://keck.usc.edu/news/adoption-of-electric-vehicles-tied-to-real-world-reductions-in-air-poll...
242•hhs•5h ago•201 comments

Palantir has no place in UK public services

https://www.opendemocracy.net/en/zarah-sutlana-palantir-no-place-uk-public-services-ministry-of-d...
72•jethronethro•1h ago•13 comments

Two Weeks Until Tapeout

https://essenceia.github.io/projects/two_weeks_until_tapeout/
70•client4•4h ago•2 comments

The Responsibility of Intellectuals (1967)

https://www.nybooks.com/articles/1967/02/23/a-special-supplement-the-responsibility-of-intelle/
34•andsoitis•2h ago•15 comments

David Patterson: Challenges and Research Directions for LLM Inference Hardware

https://arxiv.org/abs/2601.05047
25•transpute•3h ago•1 comments

Show HN: VM-curator – a TUI alternative to libvirt and virt-manager

https://github.com/mroboff/vm-curator
14•theYipster•2h ago•2 comments

We X-Rayed a Suspicious FTDI USB Cable

https://eclypsium.com/blog/xray-counterfeit-usb-cable/
111•aa_is_op•6h ago•42 comments

Postmortem: Our first VLEO satellite mission (with imagery and flight data)

https://albedo.com/post/clarity-1-what-worked-and-where-we-go-next
153•topherhaddad•9h ago•50 comments

A Lament for Aperture

https://ikennd.ac/blog/2026/01/old-man-yells-at-modern-software-design/
17•firloop•4d ago•3 comments

Second Win11 emergency out of band update to address disastrous Patch Tuesday

https://www.windowscentral.com/microsoft/windows-11/windows-11-second-emergency-out-of-band-updat...
68•speckx•2h ago•20 comments

Claude Code's new hidden feature: Swarms

https://twitter.com/NicerInPerson/status/2014989679796347375
360•AffableSpatula•15h ago•243 comments

Raspberry Pi Drag Race: Pi 1 to Pi 5 – Performance Comparison

https://the-diy-life.com/raspberry-pi-drag-race-pi-1-to-pi-5-performance-comparison/
149•verginer•11h ago•77 comments

TikTok is officially US-owned for American users, here's what's changing

https://9to5mac.com/2026/01/23/tiktok-is-officially-us-owned-for-american-users-heres-whats-chang...
22•WaitWaitWha•1h ago•19 comments

The Temporal Consistency Challenge in Video Restoration

https://blog.videowatermarkremove.com/the-temporal-consistency-challenge-from-optical-flow-to-spa...
10•ilmj8426•4d ago•0 comments

Typography on Pencils (2023)

https://www.presentandcorrect.com/blogs/blog/typography-on-pencils-1-5
37•NaOH•4d ago•2 comments

Draig, a Welsh Programming Language

https://raku.land/zef:l10n/L10N::CY
25•librasteve•2d ago•19 comments

Memory layout in Zig with formulas

https://raymondtana.github.io/math/programming/2026/01/23/zig-alignment-and-sizing.html
91•raymondtana•14h ago•23 comments

Ask HN: Gmail spam filtering suddenly marking everything as spam?

163•goopthink•13h ago•102 comments

Small Kafka: Tansu and SQLite on a free t3.micro

https://blog.tansu.io/articles/broker-aws-free-tier
72•rmoff•4d ago•10 comments

Poland's energy grid was targeted by never-before-seen wiper malware

https://arstechnica.com/security/2026/01/wiper-malware-targeted-poland-energy-grid-but-failed-to-...
181•Bender•8h ago•56 comments

First Design Engineer Hire – Build Games at Gym Class (YC W22)

https://www.ycombinator.com/companies/gym-class-by-irl-studios/jobs/ywXHGBv-design-engineer-senio...
1•hackerews•8h ago

Show HN: Semantic search engine for Studio Ghibli movie

https://ghibli-search.anini.workers.dev/
35•aninibread•3d ago•9 comments

High-bandwidth flash progress and future

https://blocksandfiles.com/2026/01/19/a-window-into-hbf-progress/
22•tanelpoder•4d ago•4 comments

Maze Algorithms (2017)

http://www.jamisbuck.org/mazes/
113•surprisetalk•1d ago•27 comments

Agent orchestration for the timid

https://substack.com/inbox/post/185649875
86•markferree•10h ago•21 comments

Understanding Rust Closures

https://antoine.vandecreme.net/blog/rust-closures/
45•avandecreme•11h ago•16 comments

Shared Claude: A website controlled by the public

https://sharedclaude.com/
50•reasonableklout•21h ago•20 comments

I added a Bluesky comment section to my blog

https://micahcantor.com/blog/bluesky-comment-section.html
238•hydroxideOH-•9h ago•85 comments

nvidia-smi hangs indefinitely after ~66 days

https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971
106•tosh•2h ago

Comments

grayhatter•1h ago
a pet peeve of mine (along with people brigading on issues/threads, e.g. posting them to unrelated news sites... op....) is woefully incorrect language.

> at day 66 all our jobs started randomly failing

if there's a definable pattern, you can call it unpredictably, but you can't call it randomly.

paulddraper•1h ago
Unexpectedly is probably what they meant
JohnLeitch•40m ago
Seems quite predictable given the others in the bug report encountering the same.
toast0•31m ago
IMHO, what they said means that on day 65 all jobs work, on day 66, jobs work or don't, seemingly at random.

But what they seem to be indicating is that all jobs fail on day 66. There's no randomness in evidence.

stevenhuang•26m ago
It's from the perspective of not knowing anything about the issue. It would look like jobs failing randomly one day when everything was fine the day before. Not hard to understand.
wincy•1h ago
Crazy, so if I understand correctly, something with B200s and NVLink is causing issues where, after 66 days and 12 hours of uptime, nvidia-smi and other jobs start failing and timing out. Then once you restart the cluster, it starts working again.

They suspect jobs will work if you only use 1 B200, but one person power cycled so wasn’t able to test it. Hopefully they won’t have to wait another 66 days for further troubleshooting.

layla5alive•1h ago
Some 32-bit counter somewhere, used with NVLink, overflows?
mook•1h ago
Isn't a 32-bit counter 49 days? Assuming that one was counting milliseconds, at least.

Only remember that because that's the limit for Windows 95…
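
For anyone checking the 49-day figure, here is a minimal sketch of the arithmetic. The helper name `wrap_period_days` is made up for illustration, and nothing here is taken from the NVIDIA driver; it just converts "counter width plus tick size" into days until wraparound.

```python
SECONDS_PER_DAY = 86_400

def wrap_period_days(bits, tick_seconds):
    """Days until an unsigned counter of `bits` bits wraps,
    assuming it advances once every `tick_seconds` seconds."""
    return (2 ** bits) * tick_seconds / SECONDS_PER_DAY

# A 32-bit counter ticking once per millisecond:
print(round(wrap_period_days(32, 1e-3), 1))  # 49.7
```

So a plain 32-bit millisecond counter wraps well before the ~66 days reported in the issue.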

repiret•35m ago
100ns intervals. My favorite part of that story is how long after Windows 95 was released before anybody discovered the bug.
themafia•8m ago
66 days + 12 hours are 5,745,600,000,000,000 ns. The log2 of this is 52.351...

JavaScript and some other languages store every number as a 64-bit float, which has a 52-bit mantissa, so integers are only exact up to about 2^53.

Curious.
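
The numbers above are easy to reproduce. This is only a sketch of the arithmetic, with a hypothetical helper `uptime_ns`; it also checks whether a 64-bit float can still hold the value exactly, using Python floats as a stand-in for JavaScript numbers (both are IEEE 754 doubles).

```python
import math

NS_PER_HOUR = 3_600 * 10**9
NS_PER_DAY = 24 * NS_PER_HOUR

def uptime_ns(days, hours=0):
    """Uptime in nanoseconds, as an exact integer."""
    return days * NS_PER_DAY + hours * NS_PER_HOUR

ns = uptime_ns(66, 12)
print(ns)                       # 5745600000000000
print(round(math.log2(ns), 3))  # 52.351

# The value sits between 2**52 and 2**53, so a double still stores it exactly.
print(2**52 < ns < 2**53)       # True
print(int(float(ns)) == ns)     # True
```

In other words, 66.5 days in nanoseconds needs 53 bits, but it is not yet past the point where a double starts dropping integers.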

blackoil•1h ago
*China specific code leaked into mainline.
zeehio•1h ago
66 days 14 hours and 24 minutes (66.6 days) would have been a far more diabolical hang...
nulone•1h ago
NVLink postRxDetLinkMask errors show up right before the hang. Has anyone captured a bug report or stack trace while nvidia-smi is stuck to see what it's blocking on?
yoshicoder•41m ago
I wonder if the process for debugging this is just to search for what power of 2 times a time unit equals ~66 days
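
That search is quick to sketch. The tick sizes below are illustrative guesses, not values taken from the driver, and `nearest_overflows` is a made-up helper that ranks every "2^bits times tick" combination by how far it lands from the target uptime.

```python
SECONDS_PER_DAY = 86_400

# Candidate tick sizes in seconds (assumptions for illustration only).
TICKS = {"1 ns": 1e-9, "100 ns": 1e-7, "1 us": 1e-6, "1 ms": 1e-3, "1 s": 1.0}

def nearest_overflows(target_days, ticks=TICKS, max_bits=64):
    """Return (relative_error, bits, tick_name, days) tuples, closest first."""
    rows = []
    for name, tick in ticks.items():
        for bits in range(8, max_bits + 1):
            days = (2 ** bits) * tick / SECONDS_PER_DAY
            rel_err = abs(days - target_days) / target_days
            rows.append((rel_err, bits, name, days))
    return sorted(rows)

for rel_err, bits, name, days in nearest_overflows(66.5)[:5]:
    print(f"2^{bits} x {name}: {days:.1f} days ({rel_err:.0%} off)")
```

With these five tick sizes, the closest plain power of two is 2^52 ns at roughly 52 days, which is more than 20% off. If the cause really is a counter overflow, that would point to a scaled counter or a tick size not in this list.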
userbinator•37m ago
I think it's an overflow of a scaled counter.

Also, who else immediately noticed the AI-generated comment?

jorl17•19m ago
This is only very tangentially related, but I got flashbacks to a time when we had dozens of edge/IoT Raspberry Pi devices with completely unupgradeable kernels with a bug that would make the whole USB stack shut down after "roughly a week" (7-9 days) of uptime. Once it got shut down, the only way to fix it was to do a full restart, and, at the time, we couldn't really be restarting those devices (not even at night).

This means that every single device would seemingly randomly completely break: touchscreen, keyboard, modems, you name it. Everything broke. And since the modem was part of it, we would lose access to the device — very hard to solve because maintenance teams were sometimes hours (& flights!) away.

It seemed to happen at random, and it was very hard to trace it down because we were also gearing up for an absolutely massive (hundreds of devices, and then a couple of months later, thousands) launch, and had pretty much every conceivable issue thrown at us, from faulty USB hubs to broken modems (which would also kill the USB hub if they pulled too much power), plus a bunch of other issues I'm sure I've forgotten.

Plus, since the problem took a week to manifest, we couldn't really iterate on fixes quickly - after deploying a "potential fix", we'd have to wait a whole week to actually see if it worked. I can vividly remember the joy I had when I managed to get the issue to consistently happen only in the span of 2 hours instead of a week. I had no idea _why_, but at least I could now get serviceable feedback loops.

Eventually, after trying to mess with every variable we could, and isolating this specific issue from the other ones, we somehow figured out that the issue was indeed a bug in the kernel, or at least in one of its drivers: https://github.com/raspberrypi/linux/issues/5088 . We had many serial ports and a pattern of opening and closing them which triggered the issue. Upgrading the kernel was impossible due to a specific vendor lock-in, and we had to fix live devices and ship hundreds of them in less than a month.

In the end, we managed to build several layers on top of this unpatchable ever-growing USB-incapacitating bug: (i) we changed our serial port access patterns to significantly reduce the frequency of crashes; (ii) we adjusted boot parameters to make it much harder to trigger (aka "throw more memory at the memory leak"); (iii) we built a system that proactively detected the issue and triggered a USB reset in a very controlled fashion (this effectively killed the network of the device for a while, but we had no choice!); (iv) if, for some reason, all else failed, a watchdog would still reboot the system (but we really _really_ _reaaaally_ didn't want this to happen).

In a way, even though these issues suck, it's when we are faced with them that we really grow. We need to grab our whole troubleshooting arsenal, do things that would otherwise feel "wrong" or "inelegant", and push through the issues. Just thinking back to that period, I'm engulfed by a mix of gratitude for how much I learned, and an uneasy sense of dread (what if next time I won't be able to figure it out?).

pajko•5m ago
Timestamps should NOT be compared like this. Exactly this is why time_before() or time_after() exist.

https://elixir.bootlin.com/linux/v6.15.7/source/include/linu...
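
To illustrate the idea behind those helpers, here is a rough Python emulation. It is not the kernel code: the kernel macros work on C `unsigned long` values, while this sketch masks Python integers to a fixed width to mimic a wrapping counter. The trick is to look at the sign of the wrapped difference instead of comparing raw values, which stays correct as long as the two timestamps are less than half the counter range apart.

```python
def time_after(a, b, bits=32):
    """True if wrapping counter value `a` is later than `b`.

    Equivalent in spirit to checking that (b - a), reinterpreted as a
    signed integer of width `bits`, is negative."""
    mask = (1 << bits) - 1
    diff = (b - a) & mask
    return diff >= (1 << (bits - 1))  # sign bit set: b - a is "negative"

def time_before(a, b, bits=32):
    """True if wrapping counter value `a` is earlier than `b`."""
    return time_after(b, a, bits)

# A deadline set 0x20 ticks ahead, just before a 32-bit counter wraps.
start = 0xFFFFFFF0
deadline = (start + 0x20) & 0xFFFFFFFF  # wraps around to 0x10
now = 0xFFFFFFF8                        # deadline not reached yet

print(now > deadline)             # True  (naive comparison fires early)
print(time_after(now, deadline))  # False (wrap-safe comparison)
print(time_after(0x20, deadline)) # True  (after the wrap, past the deadline)
```

The naive `now > deadline` check misfires as soon as the deadline wraps past zero, which is the class of bug the wrap-safe helpers are meant to avoid.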