Faster sorting with SIMD CUDA intrinsics (2024)

https://winwang.blog/posts/bitonic-sort/
92•winwang•8mo ago
Code at https://github.com/wiwa/blog-code/

Comments

ashvardanian•8mo ago
The article covers extremely important CUDA warp-level synchronization/exchange primitives, but that's not what is generally called SIMD in CUDA land.

Most "CUDA SIMD" intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values (<https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/gro...>). That significantly shrinks their applicability in most domains outside of video and string processing. I had pretty high hopes for the DPX instructions on Hopper (<https://developer.nvidia.com/blog/boosting-dynamic-programmi...>) and started integrating them in StringZilla last year, but the gains aren't huge.
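To make the "32-bit data pack" idea concrete, here is a minimal CPU-side sketch of a per-byte add over four 8-bit lanes packed in one 32-bit word. It is plain C++ rather than CUDA, and the function name `add_u8x4` is hypothetical. It only illustrates the kind of lane-wise behavior that packed intrinsics such as `__vadd4` provide; it is not how the hardware implements it.

```cpp
#include <cstdint>

// Adds four unsigned 8-bit lanes packed into a 32-bit word.
// Each lane wraps around on its own: no carry crosses a byte boundary.
// (Hypothetical helper name; a sketch of the packed-lane idea only.)
constexpr std::uint32_t add_u8x4(std::uint32_t a, std::uint32_t b) {
    constexpr std::uint32_t kHigh = 0x80808080u;  // top bit of every lane
    // Add the low 7 bits of each lane; the carry out of bit 6 lands in bit 7
    // of the same lane and cannot spill into the neighbouring lane.
    const std::uint32_t low = (a & ~kHigh) + (b & ~kHigh);
    // Fold the original top bits back in with XOR (a carry-less add).
    return low ^ ((a ^ b) & kHigh);
}
```

For example, `add_u8x4(0x01FF7F80u, 0x01010101u)` yields `0x02008081u`: the `0xFF + 0x01` lane wraps to `0x00` without disturbing its neighbours.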

winwang•8mo ago
Oh wow, TIL, thanks. I usually call stuff like that SWAR, and every now-and-then I try to think of a way to (fruitfully) use it. The "SIMD" in this case was just an allusion to warp-wide functions looking like how one might use SIMD in CPU code, as opposed to typical SIMT CUDA.

Also, StringZilla looks amazing -- I just became your 1000th Github follower :)

ashvardanian•8mo ago
Thanks, appreciate the gesture :)

Traditional SWAR on GPUs is a fascinating topic. I've begun assembling a set of synthetic benchmarks to compare DP4A vs. DPX (<https://github.com/ashvardanian/less_slow.cpp/pull/35>), but it feels incomplete without SWAR. My working hypothesis is that 64-bit SWAR on properly aligned data could be very useful in GPGPU, though FMA/MIN/MAX operations in that PR might not be the clearest showcase of its strengths. Do you have a better example or use case in mind?
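For readers unfamiliar with DP4A, the operation is a four-way dot product of packed signed 8-bit values with a 32-bit accumulator. A scalar reference, sketched in plain C++ with the hypothetical name `dp4a_ref`, might look like the following. It assumes the signed variant and the usual two's-complement conversion, and it says nothing about performance.

```cpp
#include <cstdint>

// Scalar reference for a 4-way signed 8-bit dot product with accumulate.
// a and b each hold four packed int8 lanes; acc is the running 32-bit sum.
// (Hypothetical helper name; assumes two's-complement narrowing.)
inline std::int32_t dp4a_ref(std::uint32_t a, std::uint32_t b, std::int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        // Extract lane i and reinterpret it as a signed byte.
        const auto ai = static_cast<std::int8_t>((a >> (8 * i)) & 0xFFu);
        const auto bi = static_cast<std::int8_t>((b >> (8 * i)) & 0xFFu);
        acc += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return acc;
}
```

With lanes `(1, 2, 3, 4)` in `a = 0x04030201u` and `(1, 1, 2, -1)` in `b = 0xFF020101u`, `dp4a_ref(a, b, 10)` returns `15`.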

winwang•8mo ago
I don't -- unfortunately not too well-versed in this field! But I was a bit fascinated with SWAR after I randomly thought of how to prefix-sum with int multiplication, later finding out that it is indeed an old trick as I suspected (I'm definitely not on this thread btw): https://mastodon.social/@dougall/109913251096277108

As for 64-bit... well, I mostly avoid using high-end GPUs, but I was under the impression that i64 is just simulated. In fact, I was thinking of using the full warp as a "pipeline" to implement u32 division (mostly as a joke), almost like anti-SWAR. There was some old-ish paper detailing arithmetic latencies in GPUs, and division took roughly 32x as long as multiplication or more (...or I could be misremembering).
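The prefix-sum-by-multiplication trick mentioned above can be sketched in a few lines. This is a generic illustration in plain C++ with the hypothetical name `prefix_sum_u8x4`, not necessarily the exact formulation in the linked thread. It assumes every running sum fits in 8 bits; otherwise carries leak into the next lane.

```cpp
#include <cstdint>

// Inclusive prefix sum over four 8-bit lanes packed in a 32-bit word.
// Multiplying by 0x01010101 computes x + (x << 8) + (x << 16) + (x << 24),
// so lane k ends up holding the sum of lanes 0..k.
// Assumption: every running sum is below 256, so no lane overflows.
constexpr std::uint32_t prefix_sum_u8x4(std::uint32_t x) {
    return x * 0x01010101u;
}
```

For lanes `(1, 2, 3, 4)`, i.e. `0x04030201u`, the result is `0x0A060301u`, which is lanes `(1, 3, 6, 10)`.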

bobmcnamara•8mo ago
Parallel compares: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInW...

DennisL123•8mo ago
Interesting stuff. Not sure if I read this right: is it 16- and 32-bit integer values that get sorted? If so, I'd love to see whether the GPU implementation can beat a competitive radix sort implementation on a CPU.

winwang•8mo ago
It's 32 32-bit values which get sorted. I don't think a GPU sort would beat a CPU sort at this scale, even if you don't take kernel launch time into account. CPUs are simply too fast for (super-)small data, especially with AVX-512. But if we're talking about a larger amount of data, that would be a different story, i.e. as part of a normal gpu mergesort.
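As a rough picture of what sorting 32 values across a warp involves, here is a CPU-side simulation of a 32-lane bitonic sorting network in plain C++. Each array slot stands in for one lane, and `lane ^ j` picks the partner a lane would exchange with; on a GPU that exchange would typically be a warp shuffle such as `__shfl_xor_sync`. The names `Warp` and `bitonic_sort32` are hypothetical, and the article's actual kernel may be structured differently.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using Warp = std::array<std::uint32_t, 32>;  // one value per simulated lane

// Sorts 32 values ascending with a bitonic network (15 compare-exchange steps).
inline Warp bitonic_sort32(Warp v) {
    for (std::size_t k = 2; k <= 32; k <<= 1) {         // size of sequences being merged
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // distance to the partner lane
            Warp next = v;  // all lanes read old values, like a warp-wide exchange
            for (std::size_t lane = 0; lane < 32; ++lane) {
                const std::size_t partner = lane ^ j;
                const bool ascending = (lane & k) == 0;  // direction of this block
                const bool lower = (lane & j) == 0;      // lane is the lower of the pair
                // The lower lane keeps the min in ascending blocks, the upper
                // lane keeps the min in descending blocks.
                const bool keep_min = (ascending == lower);
                next[lane] = keep_min ? std::min(v[lane], v[partner])
                                      : std::max(v[lane], v[partner]);
            }
            v = next;
        }
    }
    return v;
}
```

Every lane does the same work at every step, with no data-dependent branching beyond a min/max select, which is why the network maps naturally onto a warp.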
maeln•8mo ago
It is also useful if your data already lives in GPU memory. For example, when you need to z-sort a bunch of particles in a 3D renderer's particle system.

exDM69•8mo ago
A 32-way GPU sorting algorithm might be just what I need for sorting and deduplicating triangle IDs in a visibility buffer renderer I am working on.

Thanks for sharing.

winwang•8mo ago
As someone who doesn't know very much about graphics (ironically), you're welcome and hope it helps!

fourseventy•8mo ago
What are the biggest use cases of GPU accelerated sorting?
