frontpage.

Japanese game devs face font dilemma as license increases from $380 to $20k

https://www.gamesindustry.biz/japanese-devs-face-font-licensing-dilemma-as-leading-provider-incre...
127•zdw•2h ago•47 comments

Anthropic acquires Bun

https://bun.com/blog/bun-joins-anthropic
1653•ryanvogel•12h ago•786 comments

AI Is Breaking the Moral Foundation of Modern Society

https://eyeofthesquid.com/ai-is-breaking-the-moral-foundation-of-modern-society-a145d471694f
28•TinyBig•43m ago•7 comments

IBM CEO says there is 'no way' spending on AI data centers will pay off

https://www.businessinsider.com/ibm-ceo-big-tech-ai-capex-data-center-spending-2025-12
406•nabla9•12h ago•491 comments

AI Agents Break Rules Under Everyday Pressure

https://spectrum.ieee.org/ai-agents-safety
60•pseudolus•5d ago•10 comments

Paged Out

https://pagedout.institute
339•varjag•10h ago•33 comments

Understanding ECDSA

https://avidthinker.github.io/2025/11/28/understanding-ecdsa/
32•avidthinker•2h ago•3 comments

OpenAI declares 'code red' as Google catches up in AI race

https://www.theverge.com/news/836212/openai-code-red-chatgpt
581•goplayoutside•15h ago•649 comments

Sending DMARC reports is somewhat hazardous

https://utcc.utoronto.ca/~cks/space/blog/spam/DMARCSendingReportsProblems
17•zdw•1h ago•2 comments

I designed and printed a custom nose guard to help my dog with DLE

https://snoutcover.com/billie-story
475•ragswag•2d ago•57 comments

Interview with RollerCoaster Tycoon's Creator, Chris Sawyer (2024)

https://medium.com/atari-club/interview-with-rollercoaster-tycoons-creator-chris-sawyer-684a0efb0f13
17•areoform•2h ago•1 comments

Counter Galois Onion: Improved encryption for Tor circuit traffic

https://blog.torproject.org/introducing-cgo/
56•wrayjustin•1w ago•5 comments

All Sources of DirectX 12 Documentation

https://asawicki.info/news_1794_all_sources_of_directx_12_documentation
14•ibobev•1w ago•5 comments

Quad9 DOH HTTP/1.1 Retirement, December 15, 2025

https://quad9.net/news/blog/doh-http-1-1-retirement/
8•pickledoyster•46m ago•0 comments

Super fast aggregations in PostgreSQL 19

https://www.cybertec-postgresql.com/en/super-fast-aggregations-in-postgresql-19/
12•jnord•1w ago•2 comments

Amazon launches Trainium3

https://techcrunch.com/2025/12/02/amazon-releases-an-impressive-new-ai-chip-and-teases-a-nvidia-f...
162•thnaks•11h ago•64 comments

Learning music with Strudel

https://terryds.notion.site/Learning-Music-with-Strudel-2ac98431b24180deb890cc7de667ea92
445•terryds•1w ago•108 comments

Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

https://the-decoder.com/qwen3-vl-can-scan-two-hour-videos-and-pinpoint-nearly-every-detail/
163•thm•2d ago•51 comments

Zig's new plan for asynchronous programs

https://lwn.net/SubscriberLink/1046084/4c048ee008e1c70e/
259•messe•16h ago•198 comments

All about automotive lidar

https://mainstreetautonomy.com/blog/2025-08-29-all-about-automotive-lidar/
137•dllu•1d ago•61 comments

School cell phone bans and student achievement

https://www.nber.org/digest/202512/school-cell-phone-bans-and-student-achievement
127•harias•12h ago•125 comments

Load ZX Spectrum – first Museum dedicated to our first personal computer

https://loadzx.com/en/
33•elvis70•6d ago•5 comments

Free static site generator for small restaurants and cafes

https://lite.localcafe.org/
113•fullstacking•10h ago•72 comments

Kohler Can Access Pictures from "End-to-End Encrypted" Toilet Camera

https://varlogsimon.leaflet.pub/3m6zrw6k2bs2p?interactionDrawer=quotes
130•TimDotC•4h ago•125 comments

100k TPS over a billion rows: the unreasonable effectiveness of SQLite

https://andersmurphy.com/2025/12/02/100000-tps-over-a-billion-rows-the-unreasonable-effectiveness...
331•speckx•12h ago•116 comments

Delty (YC X25) Is Hiring

https://www.ycombinator.com/companies/delty/jobs/aPWMaiq-full-stack-software-engineer
1•lalitkundu•9h ago

DOOM could have had PC Speaker Music

https://lenowo.org/viewtopic.php?t=45
68•minki_the_avali•7h ago•47 comments

Python Data Science Handbook

https://jakevdp.github.io/PythonDataScienceHandbook/
255•cl3misch•18h ago•45 comments

YesNotice

https://infinitedigits.co/docs/software/yesnotice/
166•surprisetalk•1w ago•57 comments

Practical Intro to Operational Transformation

https://archive.casouri.cc/note/2025/practical-intro-ot/
30•casouri•6d ago•3 comments

Faster sorting with SIMD CUDA intrinsics (2024)

https://winwang.blog/posts/bitonic-sort/
92•winwang•7mo ago
Code at https://github.com/wiwa/blog-code/

Comments

ashvardanian•7mo ago
The article covers extremely important CUDA warp-level synchronization/exchange primitives, but it's not what is generally called SIMD in CUDA land.

Most "CUDA SIMD" intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values (<https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/gro...>). That significantly shrinks their applicability in most domains outside of video and string processing. I had pretty high hopes for the DPX instructions on Hopper (<https://developer.nvidia.com/blog/boosting-dynamic-programmi...>) and started integrating them in StringZilla last year, but the gains aren't huge.
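To make the "32-bit pack of 4x 8-bit values" idea concrete, here is a minimal CPU-side sketch in Python of a per-byte wrapping add across four lanes, the kind of operation CUDA exposes through intrinsics such as `__vadd4`. The name `add_4x8` is made up for illustration, and the code models only the arithmetic, not the hardware instruction.

```python
LOW7 = 0x7F7F7F7F   # low 7 bits of every byte lane
HIGH = 0x80808080   # top bit of every byte lane

def add_4x8(a, b):
    """Add four packed unsigned 8-bit lanes, wrapping within each lane."""
    # Adding only the low 7 bits keeps every carry inside its own lane.
    partial = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is restored with XOR, so no carry leaks out.
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF

# Lanes 0xFF+0x01, 0x01+0x02, 0x02+0x03, 0x03+0x04
print(hex(add_4x8(0xFF010203, 0x01020304)))  # 0x30507
```

The first lane wraps from 0xFF to 0x00 without disturbing its neighbor, which is the property that separates a packed add from a plain 32-bit add.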

winwang•7mo ago
Oh wow, TIL, thanks. I usually call stuff like that SWAR, and every now and then I try to think of a way to (fruitfully) use it. The "SIMD" in this case was just an allusion to warp-wide functions looking like how one might use SIMD in CPU code, as opposed to typical SIMT CUDA.

Also, StringZilla looks amazing -- I just became your 1000th GitHub follower :)

ashvardanian•7mo ago
Thanks, appreciate the gesture :)

Traditional SWAR on GPUs is a fascinating topic. I've begun assembling a set of synthetic benchmarks to compare DP4A vs. DPX (<https://github.com/ashvardanian/less_slow.cpp/pull/35>), but it feels incomplete without SWAR. My working hypothesis is that 64-bit SWAR on properly aligned data could be very useful in GPGPU, though FMA/MIN/MAX operations in that PR might not be the clearest showcase of its strengths. Do you have a better example or use case in mind?
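For readers unfamiliar with DP4A: it is a 4-way dot product of packed 8-bit integers accumulated into a 32-bit integer. Below is a rough Python model of the signed variant. The names `dp4a_model`, `unpack_s8x4`, and `to_int32` are illustrative, and wrapping the result to 32 bits is an assumption of this sketch, not something taken from the linked PR.

```python
import struct

def to_int32(v):
    """Wrap a Python int to a signed 32-bit value (assumed overflow behavior)."""
    return ((v + 0x80000000) & 0xFFFFFFFF) - 0x80000000

def unpack_s8x4(word):
    """Split a 32-bit word into four signed 8-bit lanes, lowest byte first."""
    return struct.unpack("<4b", struct.pack("<I", word & 0xFFFFFFFF))

def dp4a_model(a, b, c):
    """Return c + sum of lane-wise products of the signed bytes in a and b."""
    dot = sum(x * y for x, y in zip(unpack_s8x4(a), unpack_s8x4(b)))
    return to_int32(c + dot)

print(dp4a_model(0x01020304, 0x01010101, 0))  # 10
```

One instruction replaces four multiplies and four adds, which is why it is a natural baseline when benchmarking other packed-integer approaches.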

winwang•7mo ago
I don't -- unfortunately not too well-versed in this field! But I was a bit fascinated with SWAR after I randomly thought of how to prefix-sum with int multiplication, later finding out that it is indeed an old trick as I suspected (I'm definitely not on this thread btw): https://mastodon.social/@dougall/109913251096277108

As for 64-bit... well, I mostly avoid using high-end GPUs, but I was under the impression that i64 is just simulated. In fact, I was thinking of using the full warp as a "pipeline" to implement u32 division (mostly as a joke), almost like anti-SWAR. There was some old-ish paper detailing arithmetic latencies in GPUs, and division had more than roughly 32x the latency of multiplication (...or I could be misremembering).
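The prefix-sum-by-multiplication idea mentioned above can be sketched in a few lines of Python. This is a generic illustration of the trick, not necessarily the exact formulation in the linked thread, and it assumes every running total fits in 8 bits so that no carry spills into the next lane. The helper names are made up.

```python
def prefix_sum_bytes(x):
    """Inclusive prefix sum over four packed bytes via one multiplication.

    x * 0x01010101 equals x + (x << 8) + (x << 16) + (x << 24), so byte k
    of the product holds the sum of bytes 0..k of x.
    """
    return (x * 0x01010101) & 0xFFFFFFFF

def lanes(word):
    """Unpack a 32-bit word into four bytes, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

packed = 0x04030201                      # lanes 1, 2, 3, 4
print(lanes(prefix_sum_bytes(packed)))   # [1, 3, 6, 10]
```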

bobmcnamara•6mo ago
Parallel compares: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInW...
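The link above is truncated, so the exact section is not certain, but a well-known trick in this family tests whether any byte of a word is zero, and from that whether any byte equals a given value. A minimal Python sketch with illustrative function names:

```python
ONES = 0x01010101
HIGH = 0x80808080

def has_zero_byte(v):
    """True if any of the four bytes in the 32-bit word v is zero."""
    # A lane's top bit survives only if the lane was 0x00, or a borrow
    # from a lower zero lane reached it; either way a zero byte exists.
    return ((v - ONES) & ~v & HIGH) != 0

def has_byte(v, n):
    """True if any byte of v equals n (0 <= n <= 255)."""
    # XOR with n broadcast to every lane turns matching bytes into zero.
    return has_zero_byte(v ^ (n * ONES))

print(has_zero_byte(0x12003456))   # True
print(has_byte(0x12345678, 0x56))  # True
```

All four lanes are compared in a handful of integer operations, with no per-byte loop.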
DennisL123•7mo ago
Interesting stuff. Not sure if I read this right that it's 16- and 32-bit integer values that get sorted. If so, I'd love to see whether the GPU implementation can beat a competitive radix sort implementation on a CPU.
winwang•7mo ago
It's 32 32-bit values which get sorted. I don't think a GPU sort would beat a CPU sort at this scale, even if you don't take kernel launch time into account. CPUs are simply too fast for (super-)small data, especially with AVX-512. But if we're talking about a larger amount of data, that would be a different story, e.g. as part of a normal GPU mergesort.
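For anyone who wants to see the shape of a 32-value sort, here is a CPU simulation in Python of a bitonic sorting network over 32 "lanes". It is a sketch and not the article's code (that lives in the linked repository): each step reads a partner value at lane index `i ^ j`, which is the exchange pattern a CUDA kernel would typically express with `__shfl_xor_sync`. The function name is hypothetical.

```python
def warp_bitonic_sort(lanes):
    """Sort 32 values ascending with a bitonic network, one value per lane."""
    n = len(lanes)
    if n != 32:
        raise ValueError("this sketch models a single 32-lane warp")
    vals = list(lanes)
    k = 2
    while k <= n:                      # size of the bitonic runs being merged
        j = k // 2
        while j >= 1:                  # distance to the exchange partner
            # Every lane reads its partner's value (models an XOR shuffle).
            partner = [vals[i ^ j] for i in range(n)]
            nxt = []
            for i in range(n):
                ascending = (i & k) == 0
                is_lower = (i & j) == 0
                # Lower lane keeps the min in ascending runs, the max otherwise.
                keep_min = is_lower == ascending
                pick = min if keep_min else max
                nxt.append(pick(vals[i], partner[i]))
            vals = nxt
            j //= 2
        k *= 2
    return vals

data = [(i * 37 + 11) % 101 for i in range(32)]
print(warp_bitonic_sort(data) == sorted(data))  # True
```

The network always runs the same 15 compare-exchange steps regardless of input, which suits lockstep execution but also hints at why a CPU wins at this tiny size.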
maeln•7mo ago
It is also useful if your data already lives in GPU memory. For example, when you need to z-sort a bunch of particles in a 3D renderer's particle system.
exDM69•7mo ago
A 32-way GPU sorting algorithm might be just what I need for sorting and deduplicating triangle IDs in a visibility buffer renderer I am working on.

Thanks for sharing.
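Once the IDs are sorted, deduplication reduces to comparing each element with its left neighbor. The Python sketch below shows that step on the CPU. It is a hypothetical illustration, not code from any renderer, and on a GPU the neighbor read would presumably be a shuffle from the adjacent lane followed by a compaction.

```python
def dedup_sorted(ids):
    """Drop repeated values from an already sorted list of IDs."""
    # An element is kept if it is first or differs from its left neighbor.
    keep = [i == 0 or ids[i] != ids[i - 1] for i in range(len(ids))]
    return [value for value, flag in zip(ids, keep) if flag]

print(dedup_sorted([1, 1, 2, 5, 5, 5, 9]))  # [1, 2, 5, 9]
```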

winwang•7mo ago
As someone who doesn't know very much about graphics (ironically), you're welcome and hope it helps!
fourseventy•7mo ago
What are the biggest use cases of GPU accelerated sorting?