GPU Prefix Sums: A nearly complete collection

https://github.com/b0nes164/GPUPrefixSums

75•coffeeaddict1•12h ago

https://dl.acm.org/doi/10.1145/3694906.3743326

Comments

genpfault•11h ago

https://en.wikipedia.org/wiki/Prefix_sum#Applications

almostgotcaught•10h ago

this is missing the most important one (in today's world): extracting non-zero elements from a sparse vector/matrix

https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
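For readers unfamiliar with the technique, here is a minimal sequential sketch of how a prefix sum drives that extraction (often called stream compaction). It is an illustration, not code from the linked repository or from GPU Gems; the names `exclusive_scan` and `compact_nonzero` are hypothetical. On a GPU each loop would run in parallel, and the scan is what tells every thread where to write without coordinating with the others.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    if not values:
        return []
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]


def compact_nonzero(dense):
    """Return (indices, values) of the non-zero entries of `dense`.

    Hypothetical helper: a sequential stand-in for a GPU kernel.
    """
    # Step 1: flag each element that should be kept.
    flags = [1 if x != 0 else 0 for x in dense]
    # Step 2: the exclusive scan of the flags gives each kept element
    # its position in the compacted output.
    offsets = exclusive_scan(flags)
    count = offsets[-1] + flags[-1] if dense else 0
    # Step 3: scatter. Every iteration writes to a distinct slot, so on a
    # GPU these writes could all happen independently.
    indices = [0] * count
    values = [0] * count
    for i, x in enumerate(dense):
        if flags[i]:
            indices[offsets[i]] = i
            values[offsets[i]] = x
    return indices, values


if __name__ == "__main__":
    print(compact_nonzero([0, 3, 0, 0, 7, 1, 0]))
```

The `(indices, values)` pair is the coordinate form of a sparse vector; the same flag, scan, scatter pattern applies row by row when building sparse matrix formats.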

merope14•9h ago

Not even close. The most important application (in today's world) is radix sort.
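As context for why radix sort counts as a prefix sum application, here is a rough sequential sketch of a least-significant-digit radix sort. It is an assumption-laden illustration, not the implementation from GPUSorting or CUB: the function name `radix_sort` and its parameters are hypothetical, and keys are assumed to be non-negative integers below `2**key_bits`. In each pass, the exclusive prefix sum of the digit histogram gives the starting offset of every bucket.

```python
from itertools import accumulate


def radix_sort(keys, bits_per_pass=4, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits.

    Hypothetical sequential sketch; GPU versions parallelize each step.
    """
    radix = 1 << bits_per_pass
    mask = radix - 1
    for shift in range(0, key_bits, bits_per_pass):
        # Count how many keys fall into each digit bucket.
        histogram = [0] * radix
        for k in keys:
            histogram[(k >> shift) & mask] += 1
        # Exclusive prefix sum of the histogram: the first output slot
        # belonging to each bucket.
        offsets = [0] + list(accumulate(histogram))[:-1]
        # Stable scatter: keys with equal digits keep their relative order,
        # which is what makes the digit-by-digit passes compose correctly.
        out = [0] * len(keys)
        for k in keys:
            digit = (k >> shift) & mask
            out[offsets[digit]] = k
            offsets[digit] += 1
        keys = out
    return keys


if __name__ == "__main__":
    print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```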

WJW•8h ago

What specific application do you have in mind that radix sort is more important than matrix multiplication?

otherjason•6h ago

I think they were trying to say “radix sort is a more important application of prefix sum than extraction of values from a sparse matrix/vector is.”

WJW•5h ago

I understand what GP meant, but extraction of values from a sparse matrix is an essential operation in multiplying two sparse matrices. Sparse matmul in turn is an absolutely fundamental operation in everything from weather forecasting to logistics planning to electric grid control to training LLMs. Radix sort on the other hand is very nice but (as far as I know) not used nearly as widely. Matrix multiplication is just super fundamental to the modern world.

I would love to be enlightened about some real-world applications of radix sort I may have missed though, since it's a cool algorithm. Hence my question above.

littlestymaar•3h ago

> to training LLMs

LLMs are made from dense matrices, aren't they?

WJW•3h ago

Not always, or rather not exclusively. For example, some types of distillation benefit from sparse-ifying the dense-ish matrices the original was made of [1]. There's also a lot of benefit to be had from sparsity in finetuning [2]. LLMs were merely one of the examples though, don't focus too much on them. The point was that sparse matmul makes up the bulk of scientific computations and a huge amount of industrial computations too. It's probably second only to the FFT in importance, so it would be wild if radix sort managed to eclipse it somehow.

[1] https://developer.nvidia.com/blog/mastering-llm-techniques-i...

[2] https://arxiv.org/html/2405.15525v1

almostgotcaught•2h ago

Almost all performant kernels employ structured sparsity

woadwarrior01•6h ago

Top K sampling comes to mind, although it's nowhere near as important as matmul.

almostgotcaught•6h ago

ranking models benefit from gpu impls of sort but yup they're not nearly as common/important as spmm/spmv

m-schuetz•6h ago

Is that relevant for 4x4 multiplications? Because at least for me, radix sort is way more important than multiplying matrices beyond 4x4. E.g. for Gaussian Splatting.

coffeeaddict1•10h ago

Related paper by the authors: https://dl.acm.org/doi/10.1145/3694906.3743326

dang•5h ago

We'll put that link in the top text too. Thanks!

m-schuetz•8h ago

That and https://github.com/b0nes164/GPUSorting have been a tremendous help for me, since CUB does not work nicely with the CUDA Driver API. The author is doing amazing work.

luizfelberti•7h ago

This looks amazing. I've been shopping for an implementation of this that I could play around with for a while now.

They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?
