Writing Speed-of-Light Flash Attention for 5090 in CUDA C++

https://gau-nernst.github.io/fa-5090/
92•dsr12•5h ago

Comments

doctorpangloss•2h ago
Hmm, but supposing the accelerated NVIDIA-specific inference data types were available in Triton, you would just use that? Why not contribute to Triton? They accept PRs. So what if you end up doing free product ecosystem development for NVIDIA and giant corporations by contributing to Triton?
qeternity•2h ago
Second line of the post:

> The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.

steinvakt2•2h ago
I had a 5090 some months ago but couldn't get flash attention to work. Does it now work natively? What about the 5080?
sigmoid10•1h ago
PyTorch now has native support for the Blackwell architecture:

https://pytorch.org/blog/pytorch-2-7/

SynasterBeiter•4m ago
It does, but the performance is pretty bad, worse than Hopper.
zackangelo•45m ago
Curious what issues you were having. The kernel should compile natively if you pass nvcc the correct arch flags, although it probably won't take advantage of any new hardware features.
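
To make the arch-flag point concrete, here is a minimal Python sketch that builds an nvcc command line for a given compute capability. It assumes the 5090 reports compute capability 12.0 (the sm120 target mentioned in the post). The helper names `gencode_flag` and `nvcc_command` and the file names are hypothetical, not taken from the post.

```python
def gencode_flag(major: int, minor: int) -> str:
    """Return an nvcc -gencode flag that emits native code for one architecture."""
    cc = f"{major}{minor}"
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


def nvcc_command(source: str, output: str, capability: tuple) -> list:
    """Build an nvcc invocation (hypothetical helper; file names are placeholders)."""
    major, minor = capability
    return ["nvcc", "-O3", gencode_flag(major, minor), "-o", output, source]


if __name__ == "__main__":
    # Assumption: the RTX 5090 is compute capability 12.0 (sm_120).
    print(" ".join(nvcc_command("attention.cu", "attention", (12, 0))))
```

A kernel built this way targets the card natively, but, as noted above, it only uses newer hardware features if the source code itself is written to use them.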
ProofHouse•2h ago
Damn awesome. This is going to take me 3 reads and a week to digest.
neilmovva•7m ago
I was surprised to see 5090's theoretical BF16 TFLOPs at just 209.5. That's not even 10% of the server Blackwell (B200 is 2250, and GB200 is 2500). B200 costs around $30-40k per GPU, so that's almost in line with their relative performance.

Starting with the 4090, NVIDIA limits the performance of tensor cores on gaming cards, specifically for ops that might be used in training. FP8 and FP16 matmuls run at full speed if accumulating in FP16 (I've never seen anyone use this), but only at half speed when accumulating in FP32. This restriction does not apply to lower-precision matmuls like FP4, and it is removed entirely on workstation-class cards like the RTX Pro 6000.

It doesn't seem worth it to use NVIDIA gaming cards as a "cheaper FLOPs" alternative anymore (e.g. diffusion models could have been cheaper to run on 4090 than H100). They are generous with memory bandwidth though, nearly 2TB/s is amazing!
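
The ratios in the comment above can be checked with a few lines of Python. This is only a sketch using the figures as quoted in the comment, not independently verified. The doubled "full rate" number additionally assumes that the quoted 209.5 figure is the half-speed FP32-accumulate rate, which the comment does not state outright.

```python
# BF16 TFLOPS as quoted in the comment (not independently verified).
BF16_TFLOPS = {"RTX 5090": 209.5, "B200": 2250.0, "GB200": 2500.0}


def fraction_of(card: str, reference: str) -> float:
    """Throughput of `card` as a fraction of `reference`."""
    return BF16_TFLOPS[card] / BF16_TFLOPS[reference]


def full_rate_estimate(card: str) -> float:
    """Assumption: the quoted figure is the half-speed (FP32-accumulate) rate,
    so an unrestricted rate would be twice that."""
    return 2.0 * BF16_TFLOPS[card]


if __name__ == "__main__":
    for ref in ("B200", "GB200"):
        print(f"RTX 5090 vs {ref}: {fraction_of('RTX 5090', ref):.1%}")
    print(f"Hypothetical full-rate RTX 5090: {full_rate_estimate('RTX 5090'):.1f} TFLOPS")
```

Both fractions come out below 10%, consistent with the comment's claim.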

RFC 9839 and Bad Unicode

https://www.tbray.org/ongoing/When/202x/2025/08/14/RFC9839
161•Bogdanp•5h ago•81 comments

Libre – An anonymous social experiment without likes, followers, or ads

https://libreantisocial.com
22•rododecba•1h ago•12 comments

Librebox: An open source, Roblox-compatible game engine

https://github.com/librebox-devs/librebox-demo
156•libreboxdevs•7h ago•35 comments

Manim: Animation engine for explanatory math videos

https://github.com/3b1b/manim
292•pykello•10h ago•56 comments

I Made a Floppy Disk from Scratch

https://kottke.org/25/08/i-made-a-floppy-disk-from-scratch
127•bookofjoe•7h ago•53 comments

Rethinking the Linux cloud stack for confidential VMs

https://lwn.net/Articles/1030818/
84•Bogdanp•6h ago•26 comments

Developer's block

https://underlap.org/developers-block/
135•todsacerdoti•9h ago•72 comments

Bild AI (YC W25) Is Hiring Applied AI Founding Engineer

https://www.workatastartup.com/jobs/75647
1•rooppal•1h ago

450× Faster Joins with Index Condition Pushdown

https://readyset.io/blog/optimizing-straddled-joins-in-readyset-from-hash-joins-to-index-conditio...
52•marceloaltmann•4d ago•18 comments

WebR – R in the Browser

https://docs.r-wasm.org/webr/latest/
106•sieste•4d ago•22 comments

Determinants and causal effects of admission to selective private colleges [pdf]

https://www.nber.org/system/files/working_papers/w31492/w31492.pdf
4•EvgeniyZh•7h ago•1 comments

Lightning declines over shipping lanes following regulation of sulfur emissions

https://theconversation.com/the-world-regulated-sulfur-in-ship-fuels-and-the-lightning-stopped-24...
175•lentoutcry•4d ago•42 comments

Waitgroups: What they are, how to use them and what changed with Go 1.25

https://mfbmina.dev/en/posts/waitgroups/
46•mfbmina•3h ago•30 comments

David Klein's TWA Posters (2018)

https://flashbak.com/david-kleins-magnificent-twa-posters-404428/
73•NaOH•4d ago•6 comments

World Wide Lightning Location Network

https://wwlln.net/
79•perihelions•10h ago•31 comments

Converting an online game to work without any JavaScript

https://bejofo.com/blog/no-js-game-case-study
16•YannickR•4d ago•5 comments

Shader Academy: Learn computer graphics by solving challenges

https://shaderacademy.com/
219•pykello•3d ago•57 comments

My experience creating software with LLM coding agents – Part 2 (Tips)

https://efitz-thoughts.blogspot.com/2025/08/my-experience-creating-software-with_22.html
165•efitz•17h ago•77 comments

The first Media over QUIC CDN: Cloudflare

https://moq.dev/blog/first-cdn/
276•kixelated•1d ago•110 comments

The JWST Rocky Worlds DDT Program reveals GJ 3929B to likely be a bare rock

https://arxiv.org/abs/2508.12516
4•bikenaga•3h ago•0 comments

Show HN: JavaScript-free (X)HTML Includes

https://github.com/Evidlo/xsl-website
194•Evidlo•23h ago•104 comments

You can't grow cool-climate plants in hot climates

https://www.crimepaysbutbotanydoesnt.com/blog/why-you-cant-grow-cool-climate-plants-in-hot-climates
146•surprisetalk•3d ago•105 comments

Nitro: A tiny but flexible init system and process supervisor

https://git.vuxu.org/nitro/about/
222•todsacerdoti•23h ago•81 comments

The theory and practice of selling the Aga cooker (1935) [pdf]

https://comeadwithus.wordpress.com/wp-content/uploads/2012/08/the-theory-and-practice-of-selling-...
66•phpnode•3d ago•34 comments

Game math: precise control over numeric springing

https://allenchou.net/2015/04/game-math-precise-control-over-numeric-springing/
7•fanf2•2d ago•1 comments

The Fancy Rug Dilemma

https://epan.land/essays/2025-8_FancyRugDilemma
45•ericpan64•3d ago•30 comments

RFK Jr demanded a vaccine study be retracted – the journal said no

https://www.nature.com/articles/d41586-025-02682-9
27•rntn•1h ago•20 comments

Echidna Enters a New Era of Symbolic Execution

https://gustavo-grieco.github.io/blog/echidna-symexec/
16•galapago•3d ago•2 comments

I run a full Linux desktop in Docker just because I can

https://www.howtogeek.com/i-run-a-full-linux-desktop-in-docker-just-because-i-can/
163•redbell•4d ago•101 comments