Faster sorting with SIMD CUDA intrinsics (2024)

https://winwang.blog/posts/bitonic-sort/
92•winwang•7mo ago
Code at https://github.com/wiwa/blog-code/

Comments

ashvardanian•7mo ago
The article covers extremely important CUDA warp-level synchronization/exchange primitives, but that is not what is generally called SIMD in CUDA land.

Most "CUDA SIMD" intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values (<https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/gro...>). That significantly shrinks their applicability in most domains outside of video and string processing. I had pretty high hopes for the DPX instructions on Hopper (<https://developer.nvidia.com/blog/boosting-dynamic-programmi...>) and started integrating them into StringZilla last year, but the gains aren't huge.
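
For readers unfamiliar with the packed style described above, here is a minimal sketch in plain C++ (not CUDA) of a lane-wise add over four 8-bit values stored in one 32-bit word. The helper name `add_4x8` is made up for illustration. It emulates with ordinary integer operations roughly what a packed per-byte add does, and it is not taken from the article or from StringZilla.

```cpp
#include <cstdint>

// Hypothetical helper (name is an assumption): wrapping add of four 8-bit
// lanes packed into one 32-bit word, using only scalar integer operations.
constexpr std::uint32_t add_4x8(std::uint32_t a, std::uint32_t b) {
    constexpr std::uint32_t high = 0x80808080u;  // top bit of every byte
    // Add the low 7 bits of each byte. With the top bits cleared, a carry
    // can reach bit 7 of its own byte but can never spill into the next byte.
    std::uint32_t low_sum = (a & ~high) + (b & ~high);
    // Fold the top bits back in with XOR, i.e. add them and drop the carry-out.
    return low_sum ^ ((a ^ b) & high);
}
```

Each byte wraps independently: `add_4x8(0x01FF7F10u, 0x01010110u)` yields `0x02008020u`, where the `0xFF + 0x01` lane wraps to `0x00` without disturbing its neighbour.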

winwang•7mo ago
Oh wow, TIL, thanks. I usually call stuff like that SWAR, and every now and then I try to think of a way to (fruitfully) use it. The "SIMD" in this case was just an allusion to warp-wide functions looking like how one might use SIMD in CPU code, as opposed to typical SIMT CUDA.

Also, StringZilla looks amazing -- I just became your 1000th GitHub follower :)

ashvardanian•7mo ago
Thanks, appreciate the gesture :)

Traditional SWAR on GPUs is a fascinating topic. I've begun assembling a set of synthetic benchmarks to compare DP4A vs. DPX (<https://github.com/ashvardanian/less_slow.cpp/pull/35>), but it feels incomplete without SWAR. My working hypothesis is that 64-bit SWAR on properly aligned data could be very useful in GPGPU, though FMA/MIN/MAX operations in that PR might not be the clearest showcase of its strengths. Do you have a better example or use case in mind?

winwang•7mo ago
I don't -- unfortunately not too well-versed in this field! But I was a bit fascinated with SWAR after I randomly thought of how to prefix-sum with int multiplication, later finding out that it is indeed an old trick as I suspected (I'm definitely not on this thread btw): https://mastodon.social/@dougall/109913251096277108

As for 64-bit... well, I mostly avoid using high-end GPUs, but I was under the impression that i64 is just simulated. In fact, I was thinking of using the full warp as a "pipeline" to implement u32 division (mostly as a joke), almost like anti-SWAR. There was some old-ish paper detailing arithmetic latencies on GPUs, and division latency was more than roughly 32x that of multiplication (...or I could be misremembering).
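
The multiplication-based prefix sum mentioned above can be sketched in a few lines of C++. This is the common textbook form of the trick and not necessarily the exact formulation in the linked Mastodon thread. The name `prefix_sum_4x8` is hypothetical, and the result is only meaningful under the assumption that every running sum fits in 8 bits, since otherwise carries leak into the next lane.

```cpp
#include <cstdint>

// Hypothetical helper (name is an assumption): inclusive prefix sum over four
// 8-bit lanes packed into a 32-bit word, lowest byte first.
// Multiplying by 0x01010101 computes x + (x << 8) + (x << 16) + (x << 24),
// so byte k of the product is the sum of input bytes 0..k.
// Assumption: the total of all four bytes is below 256, so no lane overflows.
constexpr std::uint32_t prefix_sum_4x8(std::uint32_t x) {
    return x * 0x01010101u;
}
```

With the bytes 1, 2, 3, 4 packed as `0x04030201u`, the product is `0x0A060301u`, i.e. the running sums 1, 3, 6, 10.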

bobmcnamara•7mo ago
Parallel compares: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInW...

DennisL123•7mo ago
Interesting stuff. Not sure if I read this right: is it 16- and 32-bit integer values that get sorted? If so, I'd love to see whether the GPU implementation can beat a competitive radix sort implementation on a CPU.

winwang•7mo ago
It's 32 32-bit values that get sorted. I don't think a GPU sort would beat a CPU sort at this scale, even if you don't take kernel launch time into account. CPUs are simply too fast for (super-)small data, especially with AVX-512. With a larger amount of data it would be a different story, e.g. as part of a normal GPU mergesort.
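
To make "32 values, one per warp lane" concrete, here is a rough CPU-side sketch in C++ of a bitonic sorting network over 32 values. It is not the article's code. The `Lanes` struct and the function names are assumptions for illustration, and the array index stands in for the lane ID. On a GPU, the partner read at `i ^ j` is the kind of exchange a warp shuffle such as `__shfl_xor_sync` would perform.

```cpp
#include <cstdint>

// Hypothetical container: one 32-bit value per emulated warp lane.
struct Lanes {
    std::uint32_t v[32];
};

// Bitonic sorting network over 32 elements, ascending order.
// k is the size of the bitonic sequences being merged, j is the partner
// distance; every lane i is compared with lane i ^ j.
constexpr Lanes bitonic_sort_32(Lanes lanes) {
    for (unsigned k = 2; k <= 32; k <<= 1) {
        for (unsigned j = k >> 1; j > 0; j >>= 1) {
            for (unsigned i = 0; i < 32; ++i) {
                unsigned partner = i ^ j;
                if (partner <= i) continue;  // handle each pair only once
                bool ascending = (i & k) == 0;
                bool out_of_order = ascending ? lanes.v[i] > lanes.v[partner]
                                              : lanes.v[i] < lanes.v[partner];
                if (out_of_order) {
                    std::uint32_t tmp = lanes.v[i];
                    lanes.v[i] = lanes.v[partner];
                    lanes.v[partner] = tmp;
                }
            }
        }
    }
    return lanes;
}

// Small checker used to validate the sketch.
constexpr bool is_ascending(const Lanes& lanes) {
    for (unsigned i = 1; i < 32; ++i) {
        if (lanes.v[i - 1] > lanes.v[i]) return false;
    }
    return true;
}
```

The network always performs the same 15 compare-exchange steps regardless of the input, which is what makes it a natural fit for lockstep execution across a warp. On a CPU, the innermost loop here runs serially, so this sketch says nothing about relative performance.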

maeln•7mo ago
It is also useful if your data already lives in GPU memory, for example when you need to z-sort a bunch of particles in a 3D renderer's particle system.

exDM69•7mo ago
A 32-way GPU sorting algorithm might be just what I need for sorting and deduplicating triangle IDs in a visibility buffer renderer I am working on.

Thanks for sharing.

winwang•7mo ago
As someone who doesn't know very much about graphics (ironically), you're welcome and hope it helps!

fourseventy•7mo ago
What are the biggest use cases of GPU accelerated sorting?