Found this while fighting with ROCm last week. Some team called Aule Technologies built a FlashAttention implementation for AMD hardware that actually runs.
Anyone who's tried attention kernels on MI200/MI300 knows the ecosystem is a graveyard of abandoned CUDA ports. This one works out of the box.
Repo: https://github.com/AuleTechnologies/Aule-Attention
53 stars, still pretty new. Looked at the code and it's clean HIP, not some janky translation layer. Benchmarks look legit, though I haven't verified them independently.
Not affiliated, just surprised something in the ROCm attention space isn't broken. Figured others here running AMD might find it useful.