
Nvidia-smi hangs indefinitely after ~66 days

https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971
200•tosh•1w ago

Comments

grayhatter•1w ago
a pet peeve of mine, (along with people brigading on issues/threads e.g. posting them to unrelated news sites... op....) is woefully incorrect language.

> at day 66 all our jobs started randomly failing

if there's a definable pattern, you can say it happened unpredictably, but you can't say it happened randomly.

paulddraper•1w ago
Unexpectedly is probably what they meant
JohnLeitch•1w ago
Seems quite predictable given the others in the bug report encountering the same.
toast0•1w ago
IMHO, what they said means that on day 65 all jobs work, on day 66, jobs work or don't, seemingly at random.

But what they seem to be indicating is that all jobs fail on day 66. There's no randomness in evidence.

stevenhuang•1w ago
It's from the perspective of not knowing anything about the issue. It would look like jobs failing randomly one day when everything was fine the day before. Not hard to understand.
Joker_vD•1w ago
They meant something like "arbitrary", in its "without any good/justifiable reason" sense. The word "random" is also used in this sense, especially when talking about human-made decisions.
wincy•1w ago
Crazy, so if I understand correctly, something with B200s and nvlink is causing issues where after 66 days and 12 hours of uptime, nvidia-smi and other jobs start failing, timing out, then once you restart the cluster it starts working again.

They suspect jobs will work if you only use 1 B200, but one person power cycled so wasn’t able to test it. Hopefully they won’t have to wait another 66 days for further troubleshooting.

layla5alive•1w ago
Some 32-bit counter used somewhere in NVLink overflows?
mook•1w ago
Isn't a 32-bit counter good for 49 days? Assuming that one was counting milliseconds, at least.

Only remember that because that's the limit for Windows 95…

repiret•1w ago
100ns intervals. My favorite part of that story is how long after Windows 95 was released before anybody discovered the bug.
justsomehnguy•1w ago
That's because people actually powered off their computers after work/leisure sessions. Someone on unlimited night dial-up could have discovered it well before "anybody", but it's not like there was a built-in function to actually send a crash report to Redmond.

https://i.sstatic.net/p9hUgGfg.png

themafia•1w ago
66 days + 12 hours are 5,745,600,000,000,000 ns. The log2 of this is 52.351...

Javascript and some other languages only have integer precision up to 52 bits then they switch to floating point.

Curious.
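A quick way to check the figures above is a few lines of Python. This is only a sketch, assuming "66 days + 12 hours" is the exact uptime. One detail worth noting: an IEEE 754 double (the JavaScript number type) stores 52 fraction bits but has a 53-bit significand, so integers stay exact up to 2^53, and the nanosecond count above is still below that limit.

```python
import math

# Uptime from the report: 66 days + 12 hours, expressed in nanoseconds.
uptime_ns = (66 * 24 + 12) * 3600 * 10**9

# Number of bits needed to count that many nanoseconds.
bits_needed = math.log2(uptime_ns)

# Doubles represent every integer exactly up to 2**53.
exact_limit = 2**53
still_exact = int(float(uptime_ns)) == uptime_ns               # round-trips unchanged
first_loss = float(exact_limit) == float(exact_limit + 1)      # 2**53 + 1 is not representable

print(uptime_ns)  # 5745600000000000
print(round(bits_needed, 3))
print(uptime_ns < exact_limit, still_exact, first_loss)
```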

loeg•1w ago
It's 32 bits of milliseconds, right? Hm, no, that would overflow much sooner (49.7 days).
oasisaimlessly•1w ago
It's a uint32_t of 750 Hz "jiffies", which does overflow at ~66 days.
userbinator•1w ago
While that seems like a convincing explanation, 750Hz is a rather odd value to use for a timer, and more importantly the overflow would be at 66d6h43m43s instead of the reported ~66d12h.
fc417fc802•1w ago
66 days 12 hours would put it at 747.5 Hz. A different report had 66 days 10 hours 16 minutes which works out to 748 Hz.

Maybe the clock was just feeling a little sluggish? /s
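The arithmetic in this subthread can be reproduced with a short Python sketch. It assumes nothing about the driver beyond what the comments state (an unsigned 32-bit tick counter); the helper names are made up for illustration.

```python
def wrap_seconds(bits, hz):
    """Seconds until an unsigned counter of `bits` bits wraps at `hz` ticks per second."""
    return 2**bits / hz


def as_dhms(seconds):
    """Split a duration into whole (days, hours, minutes, seconds)."""
    s = int(seconds)
    days, s = divmod(s, 86400)
    hours, s = divmod(s, 3600)
    minutes, s = divmod(s, 60)
    return days, hours, minutes, s


def implied_hz(bits, days, hours=0, minutes=0):
    """Tick rate at which a `bits`-bit counter would wrap after the given uptime."""
    uptime = days * 86400 + hours * 3600 + minutes * 60
    return 2**bits / uptime


print(as_dhms(wrap_seconds(32, 1000)))  # (49, 17, 2, 47)  milliseconds: the 49.7-day case
print(as_dhms(wrap_seconds(32, 750)))   # (66, 6, 43, 43)  750 Hz ticks
print(implied_hz(32, 66, 12))           # roughly 747.5
print(implied_hz(32, 66, 10, 16))       # roughly 748.3
```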

i_am_proteus•1w ago
It is indeed the explanation: https://github.com/NVIDIA/open-gpu-kernel-modules/pull/1014
loeg•1w ago
Wild.
loegta3•1w ago
Bingo! Someone decided to store timestamps in float64 which has 52 bit mantissa, and the time functions break when losing precision.
blackoil•1w ago
*China specific code leaked into mainline.
zeehio•1w ago
66 days 14 hours and 24 minutes (66.6 days) would have been a far more diabolical hang...
nulone•1w ago
NVLink postRxDetLinkMask errors show up right before the hang. Has anyone captured a bug report or stack trace while nvidia-smi is stuck to see what it's blocking on?
yoshicoder•1w ago
I wonder if the process for debugging this is just to search for what power of 2 times a time unit equals ~66 days
userbinator•1w ago
I think it's an overflow of a scaled counter.

Also, who else immediately noticed the AI-generated comment?

reneberlin•1w ago
I mentioned at the top that I used AI, because (1) I am lazy and (2) I didn't want to recall it wrong.

I was feeling slightly bad about it, and you make me feel miserable now. I think I won't do it again - feels wrong.

jorl17•1w ago
This is only very tangentially related, but I got flashbacks to a time where we had dozens of edge/IoT raspberry pi devices with completely unupgradeable kernels with a bug that would make the whole USB stack shut down after "roughly a week" (7-9 days) of uptime. Once it got shut down, the only way to fix it was to do a full restart, and, at the time, we couldn't really be restarting those devices (not even at night).

This means that every single device would seemingly randomly completely break: touchscreen, keyboard, modems, you name it. Everything broke. And since the modem was part of it, we would lose access to the device — very hard to solve because maintenance teams were sometimes hours (& flights!) away.

It seemed to happen at random, and it was very hard to trace it down because we were also gearing up for an absolutely massive (hundreds of devices, and then a couple of months later, thousands) launch, and had pretty much every conceivable issue thrown at us, from faulty USB hubs, broken modems (which would also kill the USB hub if they pulled too much power), and I'm sure I've forgotten a bunch of other issues.

Plus, since the problem took a week to manifest, we couldn't really iterate on fixes quickly - after deploying a "potential fix", we'd have to wait a whole week to actually see if it worked. I can vividly remember the joy I had when I managed to get the issue to consistently happen only in the span of 2 hours instead of a week. I had no idea _why_, but at least I could now get serviceable feedback loops.

Eventually, after trying to mess with every variable we could, and isolating this specific issue from the other ones, we somehow figured out that the issue was indeed a bug in the kernel, or at least in one of its drivers: https://github.com/raspberrypi/linux/issues/5088 . We had many serial ports and a pattern of opening and closing them which triggered the issue. Upgrading the kernel was impossible due to a specific vendor lock-in, and we had to fix live devices and ship hundreds of them in less than a month.

In the end, we managed to build several layers on top of this unpatchable ever-growing USB-incapacitating bug: (i) we changed our serial port access patterns to significantly reduce the frequency of crashes; (ii) we adjusted boot parameters to make it much harder to trigger (aka "throw more memory at the memory leak"); (iii) we built a system that proactively detected the issue and triggered a USB reset in a very controlled fashion (this would sometimes kill the network of the device for a while, but we had no choice!); (iv) if, for some reason, all else failed, a watchdog would still reboot the system (but we really _really_ _reaaaally_ didn't want this to happen).
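The escalation order in layers (iii) and (iv) can be sketched as a small decision function. This is purely illustrative: the function name, the return values, and the retry limit are hypothetical, and the actual detection and reset mechanisms are not described in the comment.

```python
def choose_recovery(usb_healthy, resets_tried, max_resets=2):
    """Pick the least disruptive recovery step (hypothetical policy).

    usb_healthy  -- result of whatever health probe the device runs
    resets_tried -- controlled USB resets already attempted this incident
    max_resets   -- assumed limit before falling back to the watchdog reboot
    """
    if usb_healthy:
        return "none"
    if resets_tried < max_resets:
        # Layer (iii): controlled USB reset; may drop the network briefly.
        return "usb_reset"
    # Layer (iv): last resort, let the watchdog reboot the device.
    return "reboot"


print(choose_recovery(True, 0))   # none
print(choose_recovery(False, 0))  # usb_reset
print(choose_recovery(False, 2))  # reboot
```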

In a way, even though these issues suck, it's when we are faced with them that we really grow. We need to grab our whole troubleshooting arsenal, do things that would otherwise feel "wrong" or "inelegant", and push through the issues. Just thinking back to that period, I'm engulfed by a mix of gratitude for how much I learned, and an uneasy sense of dread (what if next time I won't be able to figure it out?).

nomel•1w ago
Even National Instruments had this type of bug in their nivisa driver, which powers a good portion of the world's lab and test equipment. Every 31 days our test equipment would stop working, which happens to be the overflow of one of the Windows timers. It was also one of the fastest bug-fix updates I ever saw, after reporting it!
nottorp•1w ago
A week? I've had some Pis lose usb in 1-2 days. Fortunately we could afford to make them self restart every couple hours.
locao•1w ago
I also had the same experience, but I could only make them restart during the night. So I wrote a monitor to check if any of the Pis lost USB before restarting.

When our business grew, even restarting every night, we would get one or two lost USB warnings every day. One day I didn't receive any warnings. I was really happy, I had fixed the issue! Three days later a client called, screaming that the service had not been working for two whole days and we had done nothing. After getting every Pi restarted, I went to check the monitor. Shut down. I asked my business partner about it. "The alarms made me anxious, so I decided to shut down the monitor".

Obviously I sold my shares and never looked back.

nottorp•1w ago
Ah well. Our project was Pis in a crappy mesh network so it lost data occasionally even if they stayed on, and it was not so important to have continuous data anyway. We rebooted them every like 3 or 6 hours.
BoredomIsFun•1w ago
I've always been sceptical of the modern tendency of throwing powerful hardware at every embedded project. In most cases a good old Atmel AVR or even an 8051 would suffice.
jorl17•1w ago
I think I used to have that view as well, and in a way still do, but this particular project proved otherwise.

The first version was built pretty much that way, with a tiny microcontroller and extremely optimized code. The problem then became that it was very hard to iterate quickly on it and prototype new features. Every new piece of hardware that was added (or just evaluated) would have to be carefully integrated and it really added to the mess. Maybe it would have been different if the code had been structured with more care from the get-go, who knows (I entered the project already in version 2).

For version 2, the micro-controller was thrown out, and raspberry-pi based solutions were brought in. Sure, it felt like carrying a shotgun to fire at a couple of flies, but having a linux machine with such a vast ecosystem was amazing. On top of that, it was much easier to hire people to work on the project because now they could get by with higher level languages like python and javascript. And it was much, much, much faster to develop on.

The usage of the raspberry pi was, in my view, one of the key details that allowed for what ultimately became an extremely successful product. It was much less energy-efficient, but it was very simple to develop and iterate on. In the span of months we experimented with many hardware addons, as product-market-fit was still being found out, and the plethora of online resources for everything else was a boon.

I'm pretty sure this was _the_ project that really made me realize that more often than not the right solution is the one that lets the right people make the right decisions. And for that particular team, this was, without a doubt, a remarkably successful decision. Most of the problems that typically come with it (such as bloat, and inefficiency) were eventually solved, something which would not have been possible by going slowly at first.

pajko•1w ago
Timestamps should NOT be compared like this. This is exactly why time_before() and time_after() exist.

https://elixir.bootlin.com/linux/v6.15.7/source/include/linu...

Joker_vD•1w ago
Do I understand it correctly that the logic is that if timestamp B is above timestamp A, but the difference is more than half of the unsigned range, B is considered to happen before A?
rcxdude•1w ago
Yes. When the timestamps wrap it's fundamentally ambiguous, but this will be correct unless the timestamps are very far apart (and the failure mode is more benign: a really long time difference being considered shorter is better than all time differences being considered zero after the timestamp wraps).
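The half-range rule described above can be modelled in a few lines of Python. This is a sketch of the idea rather than the kernel's code: it emulates a 32-bit unsigned counter with masking, and the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF      # emulate a 32-bit unsigned counter
HALF_RANGE = 0x80000000  # differences at or beyond this count as "negative"


def naive_after(a, b):
    """Plain comparison: gives wrong answers once the counter has wrapped."""
    return a > b


def wrap_safe_after(a, b):
    """True if tick count `a` is later than `b`, judged by the sign of (b - a)."""
    return ((b - a) & MASK32) >= HALF_RANGE


now = 0xFFFFFFF0                 # 16 ticks before the counter wraps
deadline = (now + 100) & MASK32  # wraps around to 0x54
later = (now + 200) & MASK32     # 100 ticks past the deadline

print(naive_after(now, deadline))        # True  (wrong: deadline has not passed)
print(wrap_safe_after(now, deadline))    # False (correct)
print(wrap_safe_after(later, deadline))  # True  (correct)
```

The trade-off is the one mentioned above: two timestamps more than half the range apart are ordered incorrectly, but anything closer survives the wrap.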
AceJohnny2•1w ago
Offtopic...

    * Do this with "<0" and ">=0" to only test the sign of the result. A
    * good compiler would generate better code (and a really good compiler
    * wouldn't care). Gcc is currently neither.

It's funny the love-hate relationship the Linux kernel has with GCC. It's the only supported compiler[1], and yet...

[1] can Clang fully compile Linux yet? I haven't followed the updates in a while.

EvgeniyZh•1w ago
Yes it can [1].

[1] https://docs.kernel.org/kbuild/llvm.html

rwmj•1w ago
To be fair this comment predates git history (before 2005) when GCC wasn't a very good compiler. The kernel developers at one point were sticking with a specific version of GCC because later versions would miscompile the kernel. Clang didn't exist then.

GCC is a different beast and far better nowadays.

foota•1w ago
Wow, someone in the github comments[1] noticed that one of the bug numbers assigned internally for the issue matches to the day the number of days the driver would stay up.

1: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971...

userbinator•1w ago
You mean number of seconds, but yes, I think everyone looking at this would be converting units to see if there was a particular boundary being met.
nurettin•1w ago
> we were hit with this on a 256 gpu b200 cluster -- at day 66 all our jobs started randomly failing

ouch

timzaman•1w ago
Known, pretty longstanding issue with Nvidia, still unresolved afaik
wvenable•1w ago
A few years ago, at my company, we would get random TPM crashes every few months on all our machines. You'd be working and the TPM would just disappear and then any apps that rely on it for key retrieval would error out. Even worse, since the TPM chip is always running, neither a reboot nor a shutdown would fix it -- you literally had to pull the plug.

This went on for months. Then one day we had a power outage. Two months later, every single machine failed at the same time. I checked the logs and it was 49 days and a few hours since that outage. It didn't take me too long to figure out what the underlying programming error inside the TPM was. At least we could then describe exactly what the problem was to our PC vendor.

spuz•1w ago
So what was the programming error in the TPM?
jcurtis•1w ago
49 days is a bit under 2^32 milliseconds... So unsigned int overflow?
swinglock•1w ago
Something breaking after 49.7 days is a classic. Someone counted milliseconds since start with a 32 bit unsigned int and some code assumed it couldn't wrap.
blell•1w ago
https://news.ycombinator.com/item?id=28340101
reneberlin•1w ago
In memory of the "49.7-day bug"

Back in the day it was Windows that had a hard limit on how long it could run in one go. I forgot when it began and ended, but happily AI helped me dig back in time.

The bug primarily affected the Windows 9x family of operating systems:

Windows 95 (all versions)

Windows 98 (original release)

Windows 98 Second Edition (SE)

While there were separate reports of similar 497-day overflows in Windows NT 4.0 and Windows 2000, the "classic" version of this bug that most people remember is the 49.7-day limit on Windows 95 and 98.

Why 49.7 days?

The issue was a classic integer overflow. Windows used a 32-bit counter to track the number of milliseconds since the system started. This counter was used by the Virtual Device Driver (VMM) to manage system timers.

The maximum value for a 32-bit unsigned integer is: 2^32 - 1, which equals: 4,294,967,295 millisec.

If you convert those milliseconds into days:

4,294,967,295 / 1,000 = 4,294,967 seconds

4,294,967 / 60 / 60 / 24 ~ 49.71 days

When the counter hit that maximum value, it would "wrap around" to zero. Because many system services and drivers were waiting for the counter to increase to a certain target time, they would suddenly find themselves waiting for a number that had already passed or was now mathematically impossible to reach in their logic. This caused the "hang"—the mouse might still move, but the OS could no longer process tasks.

When did it start and end?

Started: With the release of Windows 95 in August 1995.

Ended: Microsoft officially fixed the bug with a patch in 1999 (Knowledge Base article KB216641). Windows Me (released in 2000) was the first in that specific family to ship with the fix included, and the transition to the Windows NT architecture (Windows XP and later) eventually rendered the specific underlying cause obsolete for home users.

hulitu•1w ago
> Windows Me (released in 2000) was the first in that specific family to ship with the fix included,

With all this, Windows Me was the most unstable, crashing several times per day, although there were reports on HN that it was stable in some configurations.

userbinator•1w ago
> This counter was used by the Virtual Device Driver (VMM) to manage system timers.

VMM stands for Virtual Machine Manager. AI slop strikes again.

amelius•1w ago
66 days? This is obvious. It's the overflow of a 32-bit register counting imperial milliseconds.