frontpage.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
56•theblazehen•2d ago•11 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
637•klaussilveira•13h ago•188 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
935•xnx•18h ago•549 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
35•helloplanets•4d ago•30 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
113•matheusalmeida•1d ago•28 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
13•kaonwarb•3d ago•12 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
45•videotopia•4d ago•1 comment

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
222•isitcontent•13h ago•25 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
214•dmpetrov•13h ago•106 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
324•vecti•15h ago•142 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
374•ostacke•19h ago•94 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
478•todsacerdoti•21h ago•237 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
359•aktau•19h ago•181 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
278•eljojo•16h ago•166 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
407•lstoll•19h ago•273 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
17•jesperordrup•3h ago•10 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
85•quibono•4d ago•21 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
58•kmm•5d ago•4 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
27•romes•4d ago•3 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
245•i5heu•16h ago•193 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
14•bikenaga•3d ago•2 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
54•gfortaine•11h ago•22 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
143•vmatsiiako•18h ago•65 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1061•cdrnsf•22h ago•438 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
179•limoce•3d ago•96 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
284•surprisetalk•3d ago•38 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
137•SerCe•9h ago•125 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
70•phreda4•12h ago•14 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
28•gmays•8h ago•11 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
63•rescrv•21h ago•23 comments

Writing Speed-of-Light Flash Attention for 5090 in CUDA C++

https://gau-nernst.github.io/fa-5090/
159•dsr12•5mo ago

Comments

doctorpangloss•5mo ago
Hmm, but supposing the accelerated NVIDIA-specific inference data types were available in Triton, you would just use that, right? Why not contribute to Triton? They accept PRs. So what if you do free product-ecosystem development for NVIDIA and giant corporations by contributing to Triton?
qeternity•5mo ago
Second line of the post:

> The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.

doctorpangloss•5mo ago
Yes… I read it. If the feature is missing, why not contribute it instead?
almostgotcaught•5mo ago
How many PRs do you have landed in Triton that you can just blithely say "contribute it"?
saagarjha•5mo ago
I mean, you can look at the most recent commit and see that the infrastructure is being built out for this right now (of course OpenAI doesn't care about sm_120, though).
almostgotcaught•5mo ago
I don't know what this comment has to do with my point that OAI doesn't take commits from randoms, especially for infra code.
saagarjha•5mo ago
Yeah they do
doctorpangloss•5mo ago
By all means, the guy could have written the Triton fixes he needs and NOT sent them upstream. It would still make more sense to do that! He's obviously an expert, and I was sincerely wondering: why bother with the C++ stuff if he already knew the better way, and also has the chops to implement it?
almostgotcaught•5mo ago
There's an enormous difference between writing kernels and writing compiler infra.
steinvakt2•5mo ago
I had a 5090 some months ago but couldn't get flash attention to work. Does it now work natively? What about the 5080?
sigmoid10•5mo ago
PyTorch now has native support for the Blackwell architecture:

https://pytorch.org/blog/pytorch-2-7/
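
For anyone wanting to verify this on their own card, a minimal sketch (assuming a recent CUDA build of PyTorch with the torch.nn.attention API; the shapes are arbitrary) that forces the flash-attention backend of scaled_dot_product_attention and should error out if that kernel can't run on the GPU/build:

    import torch
    import torch.nn.functional as F
    from torch.nn.attention import sdpa_kernel, SDPBackend

    # batch=1, heads=16, seq=4096, head_dim=128 -- arbitrary shapes for illustration
    q, k, v = (torch.randn(1, 16, 4096, 128, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))

    # Force the flash-attention backend; SDPA should raise if it cannot
    # dispatch to that kernel on this GPU / build.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape, out.dtype)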

SynasterBeiter•5mo ago
It does, but the performance is pretty bad, worse than Hopper.
zackangelo•5mo ago
Curious what issues you were having. The kernel should compile natively if you pass nvcc the correct arch flags, although it probably won't take advantage of any new hardware features.
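
On the arch-flags side, a quick sketch for checking which compute capability PyTorch reports and which architectures the installed build was compiled for (sm_120 is the consumer Blackwell target the post mentions; the nvcc flag in the comment is an assumption about a hand-built kernel, not something from this thread):

    import torch

    print(torch.__version__, "CUDA", torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))  # (12, 0) expected on a 5090
    print("arch list in this build:", torch.cuda.get_arch_list())      # look for 'sm_120'
    # For a hand-built kernel, the nvcc equivalent would be something like
    # -gencode arch=compute_120,code=sm_120 (assuming a CUDA 12.8+ toolkit).
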
saagarjha•5mo ago
High-performance GPU code typically uses nonportable features that are not supported across generations.
ProofHouse•5mo ago
Damn awesome. This is going to take me 3 reads and a week to digest.
neilmovva•5mo ago
I was surprised to see 5090's theoretical BF16 TFLOPs at just 209.5. That's not even 10% of the server Blackwell (B200 is 2250, and GB200 is 2500). B200 costs around $30-40k per GPU, so they are pretty close in performance per dollar.

Starting with 4090, NVIDIA limits the performance of tensor cores on gaming cards, specifically for ops that might be used in ML training. FP8 and FP16 matmuls run at full speed if accumulating in FP16 (I've never seen anyone use this), but only half speed when accumulating in FP32. This restriction is not present for lower precision matmuls like FP4, and is removed entirely on the workstation-class cards like RTX Pro 6000.

It doesn't seem worth it to use NVIDIA gaming cards as a "cheaper FLOPs" alternative anymore (e.g. diffusion models could have been cheaper to run on 3090 than A100). They are generous with memory bandwidth though, nearly 2TB/s on 5090 is amazing!
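
A rough way to sanity-check that ~209.5 TFLOPS figure is to time a large BF16 GEMM, which cuBLAS accumulates in FP32. A sketch, assuming a CUDA build of PyTorch; achieved numbers will vary with clocks, cooling, and cuBLAS heuristics:

    import torch

    N, iters = 8192, 50
    a = torch.randn(N, N, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(N, N, device="cuda", dtype=torch.bfloat16)

    for _ in range(10):          # warm-up
        a @ b
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()

    ms = start.elapsed_time(end) / iters
    tflops = (2 * N**3) / (ms * 1e-3) / 1e12   # 2*N^3 FLOPs per GEMM
    print(f"{ms:.2f} ms/iter, {tflops:.1f} TFLOPS (vs ~209.5 theoretical on a 5090)")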

steinvakt2•5mo ago
Isn't the 5090 FE (roughly 2,500 USD in my country) pretty good FLOPS value? 32 GB of VRAM, and flash attention pushes it even faster compared to Apple/MPS and its relatively cheap "VRAM".
neilmovva•5mo ago
Not really:

5090: 210 TF / $2k == 105 TF/$k

B200: 2250 TF / $40k == 56 TF/$k

Getting only 2x the FLOPs per dollar probably isn't worth the hassle of having to rack 10x as many GPUs, while having no NVLink.

lossolo•5mo ago
That's one of the reasons they removed NVLink from consumer cards (they supported it before). There's also an issue with power consumption (1x B200 vs 10x 5090).
steinvakt2•5mo ago
Sure, but when spending 20x more, getting almost twice the compute per buck seems expected
gautamcgoel•5mo ago
Do you have a source for that B200 price?
mota7•5mo ago
Is there really that big a difference in TFLOPS between the GB100 and GB202 chips? The GB100 has fewer SMs than the GB202, so I'm confused about where the 10x performance would be coming from.
godelski•5mo ago
You're asking a really good question but it's not a question with an easy answer.

There's a lot more to performance computing than FLOPs. FLOPs are a good, high-level, easy-to-understand metric, but they're a small part of the story when you're in the weeds.

To help make sense of this, look at CPU frequencies. I think most people on HN know that two CPUs with the same frequency can have dramatically different outcomes on benchmarks, right? You might know how some of those differences come down to things like IPC (instructions per cycle) or the cache structures. There's even more, but we know it's not so easy to measure, right?

On a GPU all of that is true, but there's even more complexity. Your GPU is more similar to a whole motherboard, where your PCIe connection is a really, really fast network connection. There are plenty of faults to this analogy, but it's closer than just comparing TFLOPs.

Nvidia's moat has always been "CUDA". Quotes because even that is a messier term than most think (CUTLASS, cuBLAS, cuDNN, CuTe, etc.). The new cards are just capable of things the older ones aren't, a mix of hardware and software.

I know this isn't a great answer, but there isn't one. You'll probably get some responses, and many of them will have parts of the story, but it's hard to paint a really good picture in a comment. There's no answer that is both good and short.

saagarjha•5mo ago
No, GPUs are a lot simpler. You can mostly just take the clock rate and scale it directly for the instruction being compared.
saagarjha•5mo ago
There's a 2x performance hit from the weird restriction on FP32 accumulation, plus the fact that the 5090 has "fake" Blackwell (no tcgen05), which limits the size and throughput of matrix multiplication through the tensor cores.
laidoffamazon•5mo ago
Isn't the new trend to train in lower precision anyway?
storus•5mo ago
Only GPU-poors run Q-GaLore and similar tricks.
Twirrim•5mo ago
Even the large cloud AI services are focusing on this, because it drives down the average "cost per query", or whatever you want to call it. For inference, arguably even more than for training, the smaller and more efficient they can get it, the better their bottom line.
storus•5mo ago
For inference of course; the OP I replied to mentioned training though.
neilmovva•5mo ago
Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise yes this works, especially if we're talking about MXFP8 as supported on Blackwell [0].

What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway is running at full throughput on 5090.

[0]: https://arxiv.org/abs/2506.08027
[1]: https://arxiv.org/abs/2502.20586
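
To make the FP8 x FP8 -> FP32 idea concrete, a small emulation sketch: inputs are quantized to float8_e4m3fn with a single per-tensor scale (a simplification; MXFP8 uses per-block scales), and the products are summed in FP32 by upcasting rather than by real FP8 tensor-core hardware:

    import torch

    def quantize_fp8(x):
        # per-tensor scale into the float8_e4m3fn range (max ~448);
        # MXFP8 instead uses one scale per small block of elements
        scale = x.abs().max().clamp(min=1e-12) / 448.0
        return (x / scale).to(torch.float8_e4m3fn), scale

    torch.manual_seed(0)
    a, b = torch.randn(1024, 1024), torch.randn(1024, 1024)
    a_fp8, sa = quantize_fp8(a)
    b_fp8, sb = quantize_fp8(b)

    # "FP8 x FP8 -> FP32": the inputs are 8-bit, but products are summed in FP32.
    # Emulated here by upcasting; a real kernel keeps the FP32 accumulator in the tensor core.
    out = (a_fp8.float() @ b_fp8.float()) * (sa * sb)

    ref = a @ b
    print("relative error vs FP32 matmul:", ((out - ref).norm() / ref.norm()).item())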

laidoffamazon•5mo ago
Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was performing low-precision pretraining rather than just fine-tuning further. I didn't realize you still need accumulation at a higher precision anyway.
Scene_Cast2•5mo ago
My issue with upgrading to the 5090 for workstation ML use is that it has a higher TDP than the 4090 and can only be power-limited to 70% (not 50% like the 4090).
saagarjha•5mo ago
> Due to improvements in newer hardware, you might need to use more tricks to reach Speed-of-Light on older GPUs e.g. pipeline shared memory to register memory data movements.

On the contrary, older GPUs are a lot easier to hit rooflines on. Newer GPUs run so fast that NVIDIA keeps adding new tricks to remove bottlenecks. Not to discount the author's work here, but a 5090 is pretty bad on the FLOPs/memory-bandwidth ratio, so it's comparatively easier to get throttled by the tensor cores there; on datacenter hardware your tensor cores are so fast that you'll hit limits that were glossed over here.

For example, using Ampere "mma" instructions won't cut it, because they compute a really small MMA and force your inputs to live in registers. You'll need TMA to get data into shared memory and wgmma to do a matrix multiply out of it. At those speeds you will run into issues with dispatching instructions and computing addresses (and doing out-of-bounds calculations) fast enough, so you will need to offload that to specialized hardware to keep up with the tensor cores.
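
To put rough numbers on the FLOPs/memory-bandwidth point: the roofline ridge point is peak FLOPs divided by peak DRAM bandwidth, the arithmetic intensity above which the tensor cores, not DRAM, become the bottleneck. A back-of-the-envelope sketch; the 5090 figures come from this thread, the bandwidth and B200 figures are approximate assumptions:

    # Ridge point = peak FLOPs / peak DRAM bandwidth (FLOP/byte).
    cards = {
        "RTX 5090": (209.5e12, 1.79e12),   # BF16 FLOP/s, bytes/s -- figures from this thread
        "B200":     (2250e12, 8.0e12),     # assumed round numbers
    }
    for name, (flops, bw) in cards.items():
        print(f"{name}: ridge point ~{flops / bw:.0f} FLOP/byte")

The 5090 comes out around 117 FLOP/byte versus roughly 280 for the B200 under these assumptions, which is the sense in which it's easier to become tensor-core-bound on the consumer card.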

a_t48•5mo ago
Definitely going to save this for later and come back to it after I get some more CUDA experience under my belt. It feels so nice right now making beautiful, easy-to-use pipeline code with NPP and some CUDA kernels here and there; the code is much faster than what it's replacing. But then I look at this guy getting down into the weeds of memory bank contention, prefetching, loop invariance, etc., and it makes me feel like I'm playing with LEGO. I'm a little jealous.

The tip that Nsight can run on a Mac over SSH is great, too. I've been capturing and viewing data over RDP, basically; I'll have to give it a shot next week.