Why bother? CUB's radix sort is already memory-bound, which is theoretically optimal. But CUB's codebase prioritizes backward compatibility and generality across GPU generations, leaving room for targeted optimizations on modern hardware.
Benchmarks (RTX 5090, 67M elements, libcusort vs CUB):

- int32_t keys: 1.53ms vs 1.62ms (+6%)
- int8_t keys: 0.19ms vs 0.25ms (+32%)
- int32_t pairs: 3.07ms vs 3.10ms (+1%)
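For context, the CUB side of a comparison like this is the library's standard device-level radix sort entry point. The snippet below is a minimal sketch of that baseline call shape; the exact benchmark harness isn't reproduced here, the input size is reduced, and the use of cub::DeviceRadixSort::SortKeys is my assumption about what was timed.

```cuda
// Minimal CUB radix sort baseline of the kind such comparisons typically time.
// Assumption: the benchmark calls cub::DeviceRadixSort::SortKeys (CUB's standard entry point).
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1 << 20;                     // smaller than the 67M in the post, same call shape
    std::vector<int32_t> h(n);
    for (int i = 0; i < n; ++i) h[i] = n - i;  // reverse-sorted input

    int32_t *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(int32_t));
    cudaMalloc(&d_out, n * sizeof(int32_t));
    cudaMemcpy(d_in, h.data(), n * sizeof(int32_t), cudaMemcpyHostToDevice);

    // CUB's two-phase pattern: the first call only queries temp storage size, the second sorts.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h.data(), d_out, n * sizeof(int32_t), cudaMemcpyDeviceToHost);
    printf("first=%d last=%d\n", h.front(), h.back());   // expect 1 and n

    cudaFree(d_temp); cudaFree(d_out); cudaFree(d_in);
    return 0;
}
```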
What's different:

- Implements OneSweep with decoupled lookback
- PDL (Programmatic Dependent Launch) overlaps the histogram with the first sort pass on Hopper+
- Fine-grained PTX cache hints to avoid polluting L2
- CUDA Graph caching drops launch overhead from ~100μs to ~5μs (see the sketch after this list)
- Auto-tuned tile sizes for 48 type combinations
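Of these, the CUDA Graph caching is the easiest to show in isolation. Below is a minimal sketch of the general technique, not libcusort's actual code: a multi-kernel pipeline is captured into a graph once and then replayed with a single cudaGraphLaunch per sort, which is what collapses per-call launch overhead. The pass kernel, the 4-pass loop, and the CUDA 12.x cudaGraphInstantiate signature are illustrative assumptions.

```cuda
// Sketch: cache a multi-kernel pipeline in a CUDA Graph and replay it cheaply.
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for one sorting pass; a real radix sort would launch several distinct kernels.
__global__ void pass(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int* d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the pipeline once. Kernels are recorded, not executed, during capture.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int p = 0; p < 4; ++p)   // e.g. one launch per digit pass
        pass<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12.x signature

    // Replay: one cudaGraphLaunch submits the whole pipeline, so the per-kernel
    // launch cost is paid once at capture/instantiation time, not on every sort.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    int first;
    cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("data[0]=%d (expected %d)\n", first, 4 * 100);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

In a real sorter the instantiated graph would presumably be cached and keyed on the sort configuration, so repeated calls skip capture and instantiation entirely.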
The other goal is education. CUB's source is hard to follow. I have tried to make every optimization decision explicit in the comments: not just what the code does, but why the GPU hardware forces that choice.
GitHub: https://github.com/IlyaGrebnov/libcusort
Happy to answer questions about GPU radix sort internals or the specific optimizations.