I rebuilt FlashAttention in Triton to understand the performance archaeology

https://aminediro.com/posts/flash_attn/
35•amindiro•3d ago

Comments

amindiro•3d ago
I’ve spent the last few weeks deconstructing FlashAttention. The original paper is brilliant, but I found that just reading it didn't give me a "gut feeling" for why certain engineering choices were made (for example, in the transition from v1 to v2).

So I decided to rebuild it from scratch using Triton. This post is a chronicle of that journey, moving beyond the high-level algorithm and into the "performance archaeology" of the GPU:

- Profiling with Nsight Compute to find the real bottlenecks.

- Looking at the generated PTX and SASS code.

- Debugging shared memory bank conflicts and MIO bottlenecks.

- Iterating through the logic to see why tiling and online softmax are hardware-necessitated, not just mathematical tricks.
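For readers who haven't seen online softmax before, here is a minimal pure-Python sketch of the idea (not the Triton kernel from the post). It walks the scores one block at a time, keeps only a running max, a running denominator, and a running weighted sum, and rescales the old partial results whenever a larger score shows up. The function names, the scalar values, and the `block_size` parameter are my own illustrative choices, and it assumes a non-empty list of scores.

```python
import math


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_i softmax(scores)_i * values_i one block at a time.

    Illustrative only: assumes len(scores) == len(values) > 0.
    """
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, scaled by exp(-running_max)
    acc = 0.0                    # unnormalized weighted sum, same scaling

    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]

        new_max = max(running_max, max(s_blk))
        # Rescale everything accumulated so far to the new reference max.
        # On the first block this is exp(-inf) == 0.0, which is harmless
        # because both accumulators are still zero.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale

        for s, v in zip(s_blk, v_blk):
            p = math.exp(s - new_max)
            running_sum += p
            acc += p * v

        running_max = new_max

    return acc / running_sum


def reference_softmax_weighted_sum(scores, values):
    """Two-pass version that needs every score up front."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

The block-at-a-time version never holds more than one block of scores, which is the property that lets a tiled kernel keep its working set in on-chip memory instead of materializing the full score row.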

I’ve tried to keep it in the spirit of Simon Boehm’s matmul deep dive. Would love to hear from any GPU engineers on whether my interpretations of the SASS/bank conflict behavior match what you've seen in production.

npalli•50m ago
Seems very detailed and comprehensive. Maybe I missed it, but was there a performance comparison to the PyTorch version at the top?

raphaelty•47m ago
Very interesting. I wonder whether there are other heavily used algorithms that could benefit a lot from a "Flash" version but don't have one today.
