Faster Argmin on Floats

https://algorithmiker.github.io/faster-float-argmin/

22•return_to_monke•4mo ago

Comments

why_only_15•4mo ago

This trick is very useful on Nvidia GPUs for calculating mins and maxes in some cases, e.g. atomic mins (better u32 support than f32) or warp-wide mins with `redux.sync` (only supports u32, not f32).

TheDudeMan•4mo ago

How fast is it if you write a plain for loop that keeps track of the index and value of the smallest element (possibly treating the values as ints)?

nine_k•4mo ago

I'd hazard a guess that it would be the same, because the compiler would produce a loop out of .iter(), would expose the loop index via .enumerate(), and would keep track of that index in .min_by(). I suppose the lambda would be inlined, maybe even along with the comparisons.
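To make the comparison concrete, here is a minimal sketch of the two variants being discussed. It is not the blog's exact code; the function names `argmin_iter` and `argmin_loop` are made up for illustration, and both assume every input is finite and non-negative (no NaN, no -0.0), which is what makes comparing the raw bit patterns as u32 valid.

```rust
/// Iterator version: compare the raw IEEE 754 bit patterns as u32.
/// Assumption: all inputs are finite and non-negative, so the bit
/// patterns sort in the same order as the float values.
fn argmin_iter(xs: &[f32]) -> Option<usize> {
    xs.iter()
        .enumerate()
        .min_by(|a, b| a.1.to_bits().cmp(&b.1.to_bits()))
        .map(|(i, _)| i)
}

/// Hand-written loop that tracks the smallest key and its index.
/// Same assumption about the inputs as above.
fn argmin_loop(xs: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, u32)> = None;
    for (i, x) in xs.iter().enumerate() {
        let key = x.to_bits();
        match best {
            // Keep the current best on ties, so the first minimum wins.
            Some((_, k)) if k <= key => {}
            _ => best = Some((i, key)),
        }
    }
    best.map(|(i, _)| i)
}
```

Both return the index of the first minimum and `None` for an empty slice. Whether they compile to the same machine code depends on the compiler version and flags, so it is worth checking the assembly rather than assuming.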

I wonder whether that could be made faster with AVX instructions; they let you find the minimum among several u32 values, but not immediately its index.

anonymoushn•4mo ago

you can have some vector registers n_acc, ns, idx_acc, idxs, then you can do

  // (initialize ns and idxs by reading from the array
  //  and adding the appropriate constant to the old value of idxs.)
  n_acc = min(n_acc, ns);
  const is_new_min = eq(n_acc, ns);
  idx_acc = blend(idx_acc, idxs, is_new_min);

Edit: I wrote this with min, eq, blend, but you can actually use cmpgt, min, blend to avoid having a dependency chain through all three instructions. I am just used to using min, eq, blend because of working on unsigned values that don't have cmpgt.

you can consult the list of toys here: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
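The pattern above can be sketched in plain Rust, with fixed-size arrays standing in for the vector registers. This is only an emulation of the lane-wise logic, not real intrinsics: `argmin_lanes` and `LANES` are hypothetical names, and whether the compiler turns the inner loop into actual vector instructions is not guaranteed. It uses the compare, min, blend ordering from the edit, so each lane's three operations do not form a single dependency chain.

```rust
const LANES: usize = 8;

/// Lane-wise argmin over u32 keys. Each lane keeps its own running
/// minimum and the index where that minimum was first seen.
fn argmin_lanes(keys: &[u32]) -> Option<usize> {
    let mut n_acc = [u32::MAX; LANES];
    // Start each lane's index at its own position in the first chunk.
    let mut idx_acc: [usize; LANES] = std::array::from_fn(|lane| lane);
    let mut idxs = idx_acc;

    let chunks = keys.chunks_exact(LANES);
    let tail = chunks.remainder();
    let full = keys.len() - tail.len();

    for ns in chunks {
        for lane in 0..LANES {
            let is_new_min = ns[lane] < n_acc[lane]; // compare
            n_acc[lane] = n_acc[lane].min(ns[lane]); // min
            if is_new_min {
                idx_acc[lane] = idxs[lane]; // blend
            }
            idxs[lane] += LANES; // advance this lane's index
        }
    }

    // Horizontal reduction: smallest value wins, ties go to the lower index.
    let mut best: Option<(u32, usize)> = None;
    if full > 0 {
        for lane in 0..LANES {
            let cand = (n_acc[lane], idx_acc[lane]);
            if best.map_or(true, |b| cand < b) {
                best = Some(cand);
            }
        }
    }
    // Elements that did not fill a whole chunk are handled one by one.
    for (offset, &n) in tail.iter().enumerate() {
        let cand = (n, full + offset);
        if best.map_or(true, |b| cand < b) {
            best = Some(cand);
        }
    }
    best.map(|(_, i)| i)
}
```

For non-negative floats the keys could come from `f32::to_bits`. The final reduction compares `(value, index)` pairs, so the result is the first minimal index even though the lanes are scanned independently.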

shoo•4mo ago

Even without AVX it seems possible to do better than a naive C-style for loop argmax by manually unrolling the loop a bit and maintaining multiple accumulators.

e.g. using 4 accumulators instead of 1 accumulator in the naive for loop gives me around a 15%-20% speedup (Not using rust, extremely scalar terrible naive C code via g++ with -funroll-all-loops -march=native -O3)

If we're expressing argmax via the obvious naive C-style for loop, or a functional reduce, with a single accumulator, we're forcing a dependency chain that isn't really part of the problem. But if we don't care which index we get when there are multiple maximal elements in the array, then instead of evaluating the reductions in a single rigid chain bound by a single accumulator, we can break the chain and get our hardware to do more work in parallel, even if we're only single-threaded.

anonymoushn is doing something much cleverer again using intrinsics but there's still that idea of "how do we break the dependency chain between different operations so the cpu can kick them off in parallel"

TinkersW•4mo ago

Yes, this is fairly easy to write in AVX, and you can track the index as well. Honestly, the code is cleaner and nicer to read than this mildly obfuscated Rust.

imtringued•4mo ago

You're referring to nothing and nothing. What exactly are you talking about? It certainly can't be the trivial-to-understand one-liners in the blog.

TheDudeMan•4mo ago

But how is that slower than sorting the list?!

teo_zero•4mo ago

I had expected something about algorithms, not Rust-specific implementations.

why_only_15•4mo ago

Doing a u32 compare instead of an f32 compare is not Rust-specific, or indeed CPU-specific.

meisel•4mo ago

Another speed-up here would be using SIMD, although it would be interesting to see in the assembly whether it was auto-vectorized already.

This reminds me of a trick to sort floats faster, even if they include negatives, NaNs, and infinities: map each float to a sortable int version of itself, so that the values can be compared as ints (the precise mapping depends on how you want to order things like NaN). The one-time conversion is fast and pays off over the roughly lg(n) comparisons each element goes through. Then, after sorting, map them back.
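One common version of that mapping, sketched here for f32 under the assumption that NaNs should simply sort by their bit pattern (negative NaNs first, positive NaNs last, the same order as `f32::total_cmp`). The helper names `f32_to_key`, `key_to_f32`, and `sort_floats` are made up for illustration.

```rust
/// Map an f32 to a u32 key whose unsigned order matches the float order.
fn f32_to_key(x: f32) -> u32 {
    let bits = x.to_bits();
    // Negative floats: flip every bit, so larger magnitudes sort lower.
    // Non-negative floats: flip only the sign bit, so they sort above negatives.
    let mask = if bits >> 31 == 1 { u32::MAX } else { 0x8000_0000 };
    bits ^ mask
}

/// Inverse of `f32_to_key`.
fn key_to_f32(key: u32) -> f32 {
    // A key with the top bit set came from a non-negative float.
    let mask = if key >> 31 == 1 { 0x8000_0000 } else { u32::MAX };
    f32::from_bits(key ^ mask)
}

/// Sort floats by converting to keys once, sorting the integers,
/// and converting back.
fn sort_floats(xs: &mut [f32]) {
    let mut keys: Vec<u32> = xs.iter().map(|&x| f32_to_key(x)).collect();
    keys.sort_unstable();
    for (x, k) in xs.iter_mut().zip(keys) {
        *x = key_to_f32(k);
    }
}
```

With this mapping -0.0 sorts just below +0.0. If NaNs need a different position, the mapping is the place to change it, as the comment notes.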
