Unweaving warp specialization on modern tensor core GPUs

https://rohany.github.io/blog/warp-specialization/
20•rohany•2h ago

Comments

liuliu•1h ago
My understanding is that you cannot talk about warp specialization without talking about the alternative: multi-stage pipelining. And the final example code given is a multi-stage pipeline with double buffers.

Here is my understanding of where they differ:

1. A multi-stage pipeline requires careful hand-tuning, even at the PTX level, to make sure your async waits are interleaved properly to maximize overlap.

2. Since register files are now huge, a multi-stage pipeline is difficult to write at the intrinsics level in a way that makes efficient use of them.

3. Warp specialization delegates most of this scheduling to runtime, so it adapts better to the hardware (and has more information to make scheduling decisions at runtime). Although this is a bit moot because we write different code for different hardware anyway.

Anything else I am missing?
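
To make the overlap argument concrete, here is a toy timing model in Python. It is not code from the article, and every name in it (`pipeline_time`, `load_latency`, `compute_time`, `stages`) is hypothetical. It assumes a single thread of control that issues asynchronous tile loads, that each load has a fixed latency and unlimited bandwidth, and that tile `j` reuses the buffer last used by tile `j - stages`. With `stages=1` nothing overlaps; with `stages=2` (double buffering) the load of the next tile is in flight while the current tile is being computed.

```python
def pipeline_time(num_tiles, load_latency, compute_time, stages):
    """Finish time of a software-pipelined loop under a simplified model.

    Assumptions (illustrative only): loads are asynchronous with a fixed
    latency, any number of loads may be in flight, and compute on tile j
    cannot start before tile j's load has landed.
    """
    compute_end = []  # compute_end[j] = time at which tile j finishes computing
    for tile in range(num_tiles):
        # The buffer for this tile was last used by tile (tile - stages).
        # The load can only be issued once that earlier compute has finished.
        freed_by = tile - stages
        issue = compute_end[freed_by] if freed_by >= 0 else 0

        # Compute is serialized, and must also wait for the load to arrive.
        prev_end = compute_end[-1] if compute_end else 0
        start = max(prev_end, issue + load_latency)
        compute_end.append(start + compute_time)
    return compute_end[-1] if compute_end else 0


if __name__ == "__main__":
    # 4 tiles, load latency 3, compute time 2 (arbitrary units).
    print(pipeline_time(4, 3, 2, stages=1))  # 20: load and compute fully serialized
    print(pipeline_time(4, 3, 2, stages=2))  # 12: double buffering hides part of the latency
    print(pipeline_time(4, 3, 2, stages=3))  # 11: latency paid once, then compute-bound
```

The model says nothing about register pressure or about who issues the loads, which is where hand-written pipelines and warp-specialized kernels actually differ; it only shows why more stages help until the loop becomes compute-bound.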

rohany•58m ago
Author here! I think that warp specialization is inherently related to multi-stage pipelining; they aren't really alternatives to each other. Warp specialization is a way to realize a multi-stage pipeline in the face of hazards that may cause the pipeline to spill out of the register file or prevent parts of the pipeline from running concurrently as desired.

The fact that we tend to need different warp specialization strategies for different hardware is a consequence of the capabilities of that hardware (e.g. different asynchronous instruction types), and contributes to the complexity of targeting new hardware.

majke•1h ago
I always assumed that when one warp waits for results from a long-latency instruction, another warp, potentially from another block, can be scheduled in.

I guess this post assumes the need to use all the GPU resources from within a single block.

rohany•57m ago
> I always assumed that when one warp waits for results from a long-latency instruction, another warp, potentially from another block, can be scheduled in.

Yes, that is correct. However, most MMA-style kernels that use the Tensor Core need enough resources per block that only one block fits on each SM.
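
The "only one block fits" point is simple arithmetic: the number of resident blocks per SM is the minimum of what each resource allows. The Python sketch below illustrates this under stated assumptions. The SM limits are roughly those of a recent data-center GPU and should be checked against the vendor's documentation for a given part; the kernel figures (384 threads, 168 registers per thread, 200 KB of shared memory) are hypothetical; and register and shared-memory allocation granularity is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Assumed per-SM limits, for illustration only.
    registers: int = 65536              # 32-bit registers
    shared_mem_bytes: int = 228 * 1024
    max_threads: int = 2048
    max_blocks: int = 32


def blocks_per_sm(threads_per_block, regs_per_thread, smem_bytes_per_block,
                  sm=SMLimits()):
    """Resident blocks per SM: the tightest of the per-resource limits."""
    by_threads = sm.max_threads // threads_per_block
    by_regs = sm.registers // (threads_per_block * regs_per_thread)
    # A block that uses no shared memory is not limited by it.
    by_smem = (sm.shared_mem_bytes // smem_bytes_per_block
               if smem_bytes_per_block else sm.max_blocks)
    return min(by_threads, by_regs, by_smem, sm.max_blocks)


if __name__ == "__main__":
    # Hypothetical MMA-style kernel: large tiles staged in shared memory.
    print(blocks_per_sm(384, 168, 200 * 1024))  # 1
    # Hypothetical small kernel: many blocks can share the SM.
    print(blocks_per_sm(128, 32, 16 * 1024))    # 14
```

With one resident block, the hardware cannot hide a stalled warp behind warps from another block, so any overlap has to come from within the block itself.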

Qwen3-Omni: Native Omni AI model for text, image and video

https://github.com/QwenLM/Qwen3-Omni
223•meetpateltech•4h ago•61 comments

Show HN: Perfect your presentation with a panel of AI reviewers

https://review.thorntale.com/
6•ellenfkh•22m ago•0 comments

Choose Your Own Adventure

https://www.filfre.net/2025/09/choose-your-own-adventure/
96•naves•4h ago•52 comments

Cap'n Web: a new RPC system for browsers and web servers

https://blog.cloudflare.com/capnweb-javascript-rpc-library/
305•jgrahamc•9h ago•154 comments

Why haven't local-first apps become popular?

https://marcobambini.substack.com/p/why-local-first-apps-havent-become
232•marcobambini•9h ago•254 comments

Linux: Make the Kernel Cute

https://github.com/torvalds/linux/pull/1290
6•sergiotapia•18m ago•3 comments

OpenAI and Nvidia announce partnership to deploy 10GW of Nvidia systems

https://openai.com/index/openai-nvidia-systems-partnership/
351•meetpateltech•6h ago•472 comments

I'm spoiled by Apple Silicon but still love Framework

https://simonhartcher.com/posts/2025-09-22-why-im-spoiled-by-apple-silicon-but-still-love-framework/
126•deevus•9h ago•183 comments

Diffusion Beats Autoregressive in Data-Constrained Settings

https://blog.ml.cmu.edu/2025/09/22/diffusion-beats-autoregressive-in-data-constrained-settings/
40•djoldman•4h ago•8 comments

Jailhouse confessions of a teen hacker

https://www.bloomberg.com/news/features/2025-09-19/multimillion-dollar-hacking-spree-scattered-sp...
42•wslh•3d ago•7 comments

Paper2Agent: Stanford Reimagining Research Papers as Interactive AI Agents

https://arxiv.org/abs/2509.06917
3•Gaishan•33m ago•2 comments

Is a movie prop the ultimate laptop bag?

https://blog.jgc.org/2025/09/is-movie-prop-ultimate-laptop-bag.html
120•jgrahamc•10h ago•129 comments

A board member's perspective of the RubyGems controversy

https://apiguy.substack.com/p/a-board-members-perspective-of-the
55•Qwuke•1d ago•77 comments

Testing is better than data structures and algorithms

https://nedbatchelder.com/blog/202509/testing_is_better_than_dsa.html
70•rsyring•6h ago•56 comments

SWE-Bench Pro

https://github.com/scaleapi/SWE-bench_Pro-os
80•tosh•6h ago•18 comments

Transforming recursion into iteration for LLVM loop optimizations

https://dspace.mit.edu/handle/1721.1/162684
21•matt_d•1d ago•2 comments

Mentra (YC W25) is hiring to build smart glasses

1•caydenpiercehax•5h ago

Categorical Foundations for Cute Layouts

https://research.colfax-intl.com/categorical-foundations-for-cute-layouts/
24•charles_irl•17h ago•4 comments

What happens when coding agents stop feeling like dialup?

https://martinalderson.com/posts/what-happens-when-coding-agents-stop-feeling-like-dialup/
76•martinald•1d ago•69 comments

Cloudflare is sponsoring Ladybird and Omarchy

https://blog.cloudflare.com/supporting-the-future-of-the-open-web/
564•jgrahamc•9h ago•358 comments

Easy Forth (2015)

https://skilldrick.github.io/easyforth/
169•pkilgore•10h ago•94 comments

AI-generated “workslop” is destroying productivity?

https://hbr.org/2025/09/ai-generated-workslop-is-destroying-productivity
162•McScrooge•4h ago•92 comments

Beyond the Front Page: A Personal Guide to Hacker News

https://hsu.cy/2025/09/how-to-read-hn/
187•firexcy•12h ago•79 comments

Show HN: Python Audio Transcription: Convert Speech to Text Locally

https://www.pavlinbg.com/posts/python-speech-to-text-guide
20•Pavlinbg•4h ago•14 comments

PlanetScale for Postgres is now GA

https://planetscale.com/blog/planetscale-for-postgres-is-generally-available
240•munns•7h ago•137 comments

SGI demos from long ago in the browser via WASM

https://github.com/sgi-demos
226•yankcrime•14h ago•59 comments

CompileBench: Can AI Compile 22-year-old Code?

https://quesma.com/blog/introducing-compilebench/
113•jakozaur•9h ago•46 comments

What is algebraic about algebraic effects?

https://interjectedfuture.com/what-is-algebraic-about-algebraic-effects/
70•iamwil•8h ago•31 comments

The Beginner's Textbook for Fully Homomorphic Encryption

https://arxiv.org/abs/2503.05136
151•Qision•1d ago•28 comments