Faster sorting with SIMD CUDA intrinsics (2024)

https://winwang.blog/posts/bitonic-sort/
92•winwang•1y ago
Code at https://github.com/wiwa/blog-code/

Comments

ashvardanian•1y ago
The article covers extremely important CUDA warp-level synchronization/exchange primitives, but that's not what is generally called SIMD in CUDA land.

Most "CUDA SIMD" intrinsics are designed to process a 32-bit word packing 2x 16-bit or 4x 8-bit values (<https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/gro...>). That significantly limits their applicability outside of video and string processing. I had pretty high hopes for the DPX instructions on Hopper (<https://developer.nvidia.com/blog/boosting-dynamic-programmi...>) and started integrating them into StringZilla last year, but the gains aren't huge.
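
For readers who haven't met these packed intrinsics, here is a rough host-side C++ sketch of what a per-byte add such as CUDA's `__vadd4` computes: four independent 8-bit additions inside one 32-bit word, each wrapping modulo 256. The name `add4_swar` is made up for illustration, and this is plain SWAR emulation on the CPU, not a claim about how the hardware does it.

```cpp
#include <cstdint>

// Hypothetical helper: adds the four bytes of a and b lane by lane,
// with no carry crossing from one byte into the next.
std::uint32_t add4_swar(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t high_bits = 0x80808080u;  // top bit of every byte
    const std::uint32_t low_bits  = 0x7F7F7F7Fu;  // lower 7 bits of every byte

    // Add the lower 7 bits of each byte; the sum is at most 0xFE per byte,
    // so it cannot overflow into the neighbouring byte.
    std::uint32_t sum = (a & low_bits) + (b & low_bits);

    // Fold the top bits back in with XOR, which is addition without carry-out.
    return sum ^ ((a ^ b) & high_bits);
}
```

With the bytes written high to low, `add4_swar(0x01FF7F10, 0x0101017F)` gives `0x0200808F`: the `0xFF + 0x01` lane wraps to `0x00` without disturbing its neighbours.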

winwang•1y ago
Oh wow, TIL, thanks. I usually call stuff like that SWAR, and every now and then I try to think of a way to (fruitfully) use it. The "SIMD" in this case was just an allusion to warp-wide functions looking like how one might use SIMD in CPU code, as opposed to typical SIMT CUDA.

Also, StringZilla looks amazing -- I just became your 1000th Github follower :)

ashvardanian•1y ago
Thanks, appreciate the gesture :)

Traditional SWAR on GPUs is a fascinating topic. I've begun assembling a set of synthetic benchmarks to compare DP4A vs. DPX (<https://github.com/ashvardanian/less_slow.cpp/pull/35>), but it feels incomplete without SWAR. My working hypothesis is that 64-bit SWAR on properly aligned data could be very useful in GPGPU, though FMA/MIN/MAX operations in that PR might not be the clearest showcase of its strengths. Do you have a better example or use case in mind?

winwang•1y ago
I don't -- unfortunately not too well-versed in this field! But I was a bit fascinated with SWAR after I randomly thought of how to prefix-sum with int multiplication, later finding out that it is indeed an old trick as I suspected (I'm definitely not on this thread btw): https://mastodon.social/@dougall/109913251096277108
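
A minimal sketch of the multiplication trick as I understand it (I have not checked it against the linked post, so treat the details as an assumption): multiplying a 64-bit word by `0x0101010101010101` adds every shifted copy of the word, so byte i of the product holds the sum of bytes 0 through i. It only works while every running total fits in 8 bits; otherwise carries leak into the next byte. `prefix_sum_bytes` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical helper: inclusive prefix sum over the 8 bytes of x,
// lowest byte first. Assumes every running total is below 256.
std::uint64_t prefix_sum_bytes(std::uint64_t x) {
    // x * 0x0101...01 == x + (x << 8) + (x << 16) + ... + (x << 56) mod 2^64,
    // so byte i of the result accumulates bytes 0..i of x.
    return x * 0x0101010101010101ull;
}
```

For example, the bytes 1, 2, ..., 8 (lowest first, i.e. `0x0807060504030201`) become 1, 3, 6, 10, 15, 21, 28, 36, which is `0x241C150F0A060301`.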

As for 64-bit... well, I mostly avoid using high-end GPUs, but I was under the impression that i64 is just emulated. In fact, I was thinking of using the full warp as a "pipeline" to implement u32 division (mostly as a joke), almost like anti-SWAR. There was some old-ish paper detailing arithmetic latencies in GPUs, and division latency was more than roughly 32x that of multiplication (...or I could be misremembering).

bobmcnamara•1y ago
Parallel compares: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInW...
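
The link above is truncated here, so this is a guess at what it points to: the classic SWAR test for whether any byte of a word is zero, which extends to "does any byte equal n" by XORing with n broadcast to every byte first. The helper names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: true if at least one of the four bytes of v is 0x00.
bool has_zero_byte(std::uint32_t v) {
    // Subtracting 1 from each byte sets the top bit of a byte that was zero
    // (via borrow); "& ~v" discards bytes whose top bit was already set.
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

// Hypothetical helper: true if at least one byte of v equals n.
bool has_byte(std::uint32_t v, std::uint8_t n) {
    // XOR with n in every byte turns matching bytes into zero.
    return has_zero_byte(v ^ (0x01010101u * n));
}
```

So `has_byte(0x12345678, 0x34)` is true and `has_byte(0x12345678, 0x35)` is false, with all four bytes compared in a handful of integer ops.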
DennisL123•1y ago
Interesting stuff. Not sure if I read this right that it's 16- and 32-bit integer values that get sorted. If so, I'd love to see whether the GPU implementation can beat a competitive radix sort implementation on a CPU.
winwang•1y ago
It's 32 32-bit values which get sorted. I don't think a GPU sort would beat a CPU sort at this scale, even if you don't take kernel launch time into account. CPUs are simply too fast for (super-)small data, especially with AVX-512. But if we're talking about a larger amount of data, that would be a different story, e.g. as part of a normal GPU mergesort.
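
To make "32 values, one per lane" concrete, here is a CPU-side C++ simulation of a bitonic sorting network over 32 lanes. It is a sketch under my own assumptions, not the article's code: each inner step pairs a lane with `lane ^ j`, which is the exchange pattern one would express on a GPU with `__shfl_xor_sync`, and every lane decides for itself whether to keep the smaller or the larger value. `warp_bitonic_sort` is a hypothetical name.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical helper: sorts 32 values ascending with a bitonic network,
// simulating one value per warp lane.
std::array<std::uint32_t, 32> warp_bitonic_sort(std::array<std::uint32_t, 32> lanes) {
    for (unsigned k = 2; k <= 32; k <<= 1) {        // size of sorted runs being built
        for (unsigned j = k >> 1; j > 0; j >>= 1) { // partner distance in this step
            std::array<std::uint32_t, 32> next = lanes;
            for (unsigned lane = 0; lane < 32; ++lane) {
                std::uint32_t mine   = lanes[lane];
                std::uint32_t theirs = lanes[lane ^ j];  // partner's value

                bool ascending = (lane & k) == 0;  // direction of this lane's run
                bool lower     = (lane & j) == 0;  // lower index of the pair?

                // The lower lane of an ascending pair keeps the minimum, and so
                // does the upper lane of a descending pair; the rest keep the max.
                bool keep_min = (ascending == lower);
                next[lane] = keep_min ? std::min(mine, theirs)
                                      : std::max(mine, theirs);
            }
            lanes = next;  // all lanes update together, as in a warp
        }
    }
    return lanes;
}
```

The network takes 15 compare-exchange steps for 32 lanes (1 + 2 + 3 + 4 + 5), with no data-dependent branching between lanes.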
maeln•1y ago
It is also useful if your data already lives in GPU memory. For example, when you need to z-sort a bunch of particles in a 3D renderer's particle system.
exDM69•1y ago
A 32-way GPU sorting algorithm might be just what I need for sorting and deduplicating triangle IDs in a visibility buffer renderer I am working on.

Thanks for sharing.
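
For what it's worth, the sort-then-deduplicate step looks like this on the CPU in C++. It is only a reference sketch with a hypothetical function name, not visibility-buffer code: once the IDs are sorted, duplicates sit next to each other, so each element only has to be compared with its left neighbour. On a GPU that neighbour comparison could presumably be done per lane after a warp-wide sort.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the distinct triangle IDs in ascending order.
std::vector<std::uint32_t> unique_triangle_ids(std::vector<std::uint32_t> ids) {
    std::sort(ids.begin(), ids.end());
    // std::unique keeps the first element of every run of equal neighbours
    // and returns the new logical end; erase drops the leftover tail.
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
    return ids;
}
```

Given the IDs 7, 3, 7, 1, 3, 3 this returns 1, 3, 7.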

winwang•1y ago
As someone who doesn't know very much about graphics (ironically), you're welcome and hope it helps!
fourseventy•1y ago
What are the biggest use cases of GPU accelerated sorting?