We saw different results from pipelining with the attention kernel vs. the MLP kernel (since the MLP W1 has to project the attention output into a much higher dimension, the arithmetic intensity shifts towards compute-bound characteristics).
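To make that concrete, here's a rough back-of-the-envelope sketch of why the W1 projection lands at a much higher arithmetic intensity than the attention-score matmul. The dimensions and the fp16 assumption are mine for illustration, not taken from the post:

def matmul_intensity(M, K, N, bytes_per_elem=2):  # assuming fp16 storage
    flops = 2 * M * K * N  # one multiply-add per (m, k, n) triple
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)  # read A and B, write C
    return flops / bytes_moved

T, D_h, E, E_ff = 2048, 64, 1024, 4096  # assumed seq len, head dim, model dim, MLP hidden dim
print(matmul_intensity(T, D_h, T))   # Q @ K^T: small inner dim, low intensity -> leans memory bound
print(matmul_intensity(T, E, E_ff))  # MLP W1: large inner and outer dims, much higher intensity -> leans compute bound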
It’s the equivalent of doing this for compound interest rate calculation:
# A = P * (1 + r/n)^(nt)
P = 10000
r = 0.06
n = 12
t = 5
A = P * (1 + r / n) ** (n * t)
Compared to this:
principal = 10_000
annual_interest_rate = 0.06
compounds_per_year = 12
years = 5

future_value = principal * (1 + annual_interest_rate / compounds_per_year) ** (compounds_per_year * years)
My question is partly rhetorical - I know the answer lies in its tight research and mathematical origins. But that makes it research code IMO, not what I would consider high-quality software code.
Let's use your example of `A = P * (1 + r / n) ** (n * t)` -- I can immediately see the shape of the function and how all the variables interrelate. If I'm comfortable in the domain, I also know what the variables mean. Finally, this maps perfectly to how the math is written.
If you look at everything in the post, all of the above apply. Everyone in the domain has seen Q = query, K = key, V = value a billion times, and some variation of (B, N_h, T, D_h). Frankly, I've had enough exposure that after I see (B, N_h, T, D_h) once, I can parse (32, 8, 16, 16) without thinking.
Like you, I found this insane when I started studying stats, but over time I realized there's a lot to be gained once you've trained yourself to speak the language.
B, T, E = x.size() # batch size, sequence length, embedding dimensionality
q, k, v = self.qkv(x).split(self.embedding, dim=-1)
q, k, v = map(lambda y: y.view(B, T, self.heads, E // self.heads).transpose(1, 2), (q, k, v))
attention = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
...
vs.
B, T, E = bteX.size()
iHeadSize = E // self.heads
bteQ, bteK, bteV = self.qkv_E_3E(bteX).split(E, dim=-1)
bhtiQ, bhtiK, bhtiV = map(lambda y: y.view(B, T, self.heads, iHeadSize).transpose(1, 2), (bteQ, bteK, bteV))
bhttAttention = (bhtiQ @ bhtiK.transpose(-2, -1)) * (1.0 / math.sqrt(iHeadSize))
Looks uglier but might be easier to reason about.
amindiro•1mo ago
I decided to rebuild it from scratch using Triton. This post is a chronicle of that journey—moving beyond the high-level algorithm and into the "performance archaeology" of the GPU:
- Profiling with Nsight Compute to find the real bottlenecks.
- Looking at the generated PTX and SASS code.
- Debugging shared memory bank conflicts and MIO bottlenecks.
- Iterating through the logic to see why tiling and online softmax are hardware-necessitated, not just mathematical tricks.
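In case it helps anyone skimming the thread, here's a minimal NumPy sketch of the online softmax recurrence that last bullet refers to; the block size and variable names are mine, not from the post:

import numpy as np

def online_softmax(scores, block=4):
    # running max and running normalizer, updated one block at a time
    m, l = -np.inf, 0.0
    for start in range(0, len(scores), block):
        chunk = scores[start:start + block]
        m_new = max(m, chunk.max())
        # rescale the old sum to the new max before folding in the new block
        l = l * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    # final pass over the scores just for this demo; a fused kernel would instead
    # rescale a running output accumulator so each tile only has to be read once
    return np.exp(scores - m) / l

x = np.random.randn(10)
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())

That rescaling trick is what lets each tile stay resident in shared memory/registers instead of requiring a second pass over global memory, which I take to be the "hardware-necessitated" part.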
I’ve tried to keep it in the spirit of Simon Boehm’s matmul deep dive. Would love to hear from any GPU engineers on whether my interpretations of the SASS/bank conflict behavior match what you've seen in production.
liuliu•1mo ago
Do you have problems accessing an H100 or similar chips? Wondering if there's anything that can help finish this write-up.
amindiro•1mo ago
You've hit the nail on the head regarding the missing pieces. I actually hit a bit of a wall with my current hardware; using an RTX 2070 made it difficult to meaningfully explore the async loading (TMA) and pipelining optimizations that were used in FA3 and FA4. I also felt the write-up was already pushing the limits of a single post's length, so I decided to "ship it" as a first part.
I would love to dive into TMA for Part 2. If I can get my hands on an H100 (or even an A100), that would be hugely appreciated on my end! If you have any leads on hardware access, please let me know—I’d love to finish the story!