Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust

https://github.com/nvlabs/cutile-rs
17•melihelibol•3h ago

Comments

melihelibol•3h ago
Hello,

I built cuTile Rust and just posted the paper preprint. Happy to answer questions.

TL;DR: Rust gives you fearless concurrency on the CPU, but GPU kernel programming still requires unsafe code. cuTile Rust carries Rust's ownership model across the launch boundary and maintains it via a safe tile-based programming model that compiles to Tile IR. The host-side GPU work you write composes into synchronous launches, async pipelines, or a CUDA graph you capture once and replay.

It works by letting you partition mutable output tensors into disjoint pieces on the host: Each tile program gets an exclusive `&mut` view of its piece and the inputs as shared `&` reads. Kernels are written with single-threaded semantics, and the compiler maps that to thread blocks and manages shared memory. Because the pieces are provably disjoint and ordering is threaded through mutable references, you get compile-time data-race freedom when using the safe surface API. As with any other Rust tool, safety remains extensible: Any functionality not yet exposed safely can be made available by writing your own safe abstractions over unsafe code where you supply the invariants yourself. The lower-level Tile IR operation surface (as of CUDA 13.3) is exposed through unsafe intrinsics.
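As a rough CPU-side analogy of the partitioning idea above, here is a minimal sketch in plain standard-library Rust. It does not use the cuTile Rust API: the function name `add_tiled`, the tile size, and the use of scoped threads are assumptions chosen only for illustration. It shows the same ownership shape: the output is split into disjoint `&mut` pieces, the inputs are shared `&` reads, and the borrow checker rejects any attempt to hand the same output piece to two workers.

```rust
use std::thread;

/// CPU-only analogy (NOT the cuTile API): `out` is split into disjoint tiles,
/// and each worker gets an exclusive `&mut` tile plus shared `&` input views.
fn add_tiled(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [f32]` pieces, so the
        // compiler can prove that no two workers write the same element.
        for (i, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = i * tile;
            let a_tile = &a[start..start + out_tile.len()];
            let b_tile = &b[start..start + out_tile.len()];
            // Each worker body is written as ordinary single-threaded code.
            s.spawn(move || {
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = x + y;
                }
            });
        }
    });
}

fn main() {
    let a = vec![1.0_f32; 10];
    let b = vec![2.0_f32; 10];
    let mut out = vec![0.0_f32; 10];
    add_tiled(&a, &b, &mut out, 4);
    println!("{:?}", out);
}
```

The last tile is shorter than the others when the length is not a multiple of the tile size, which is why the input slices are cut to `out_tile.len()` rather than to `tile`.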

On the B200, an optimized safe GEMM kernel is competitive with cuBLAS: It's within 0.3% of a hand-written low-level (Tile IR) variant at ~92% of the GPU's dense f16 peak. Element-wise is ~7 TB/s. So on these kernels the safety is effectively free. We also worked with Hugging Face to evaluate Grout, a Qwen3 inference engine built on cuTile Rust. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on an NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, showing strong performance on memory-bound inference tasks, consistent with our HBM roofline analysis. Benchmarks (harnesses + CSVs), hardware, and clock settings can be found in the repo. The full methodology is available in the paper: https://arxiv.org/abs/2606.15991
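For readers unfamiliar with roofline reasoning on memory-bound decode, here is a small back-of-the-envelope sketch. It assumes the simplest possible model, in which every generated token streams all model weights from device memory exactly once and KV-cache and activation traffic are ignored. The paper's actual methodology may differ. The function names and the input numbers are hypothetical and are not taken from the benchmarks above.

```rust
/// Upper bound on batch-1 decode throughput under the assumption that each
/// token requires reading all weights from device memory once.
fn decode_tokens_per_sec_bound(mem_bandwidth_bytes_per_s: f64, weight_bytes: f64) -> f64 {
    mem_bandwidth_bytes_per_s / weight_bytes
}

/// Fraction of that bound achieved by a measured throughput.
fn roofline_fraction(measured_tokens_per_s: f64, bound_tokens_per_s: f64) -> f64 {
    measured_tokens_per_s / bound_tokens_per_s
}

fn main() {
    // Hypothetical inputs for illustration only: 1 TB/s of memory bandwidth
    // and 10 GB of weights give a bound of 100 tokens/s.
    let bound = decode_tokens_per_sec_bound(1.0e12, 1.0e10);
    // A hypothetical measurement of 75 tokens/s would sit at 75% of the bound.
    let fraction = roofline_fraction(75.0, bound);
    println!("bound = {bound} tok/s, fraction of roofline = {fraction}");
}
```

Plugging in a device's measured memory bandwidth and a model's weight footprint at the precision actually used gives a ceiling against which a tokens/s figure can be compared.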

This is an early-stage research release: the tensor API is young (a few patterns still need raw pointers), GEMM trails cuBLAS at some sizes, and the tile model gives up SIMT-level control (explicit warp primitives, manual shared memory) in exchange for the semantics that make safety checkable. The recent 0.2.0 release added low-precision support (FP4 packing and block-scaled MMA on CUDA 13.3). These only just landed, so we haven't benchmarked them yet, but we expect them to perform well. Also, while Tile IR is not portable across GPU vendors, it is portable across NVIDIA GPU architectures: What you write in cutile-rs will work on sm_80+ (Ampere and up) with CUDA 13.3, but hardware-specific features such as native FP4 require architectures that support them.

The latest release is on crates.io. After setting up CUDA 13, you can `cargo add cutile` to pull it into your own project, or clone the repo and `cargo run -p cutile-examples --example hello_world` to try it out. Most of what's here has co-evolved with community feedback and contributions since we made the repo public. We read everything that comes in, so anything you raise will shape the direction of this project. If you have a cool feature idea, open an issue or a PR and let's discuss.

lmeyerov•11m ago
Any thoughts on layering on-GPU work stealing or cudf on top?

For gfql (a graph query language that maps down to cudf calls), we're trying to jettison the Python -> CPU -> GPU hot loop, so we've been loosely watching cuTile evolve!

binarybana•2h ago
I'm excited to see what cuTile-rs unlocks. I like the direction of Hugging Face's Grout project (https://github.com/huggingface/grout) for local LLM inference:

- state of the art performance

- codebase that fits in a context window (including kernel definitions!)

- single binary deployment

Similar to antirez's ds4.c, but in Rust and with cuTile making kernels both easier to author and higher performance.

the__alchemist•1h ago
How does this compare to NVIDIA's CUDA-oxide? The latter is similar in syntax to cudarc on the host side, but replaces the normal CUDA kernel (written in C++-ish code) on the device side with Rust.

I ask because I use CUDA in Rust (kernels via cudarc; ML with burn and candle; cuFFT via FFI), so I am trying to figure out how I would fit this into a workflow.

melihelibol•1h ago
You got it right: This project exposes our tile programming model, whereas cuda-oxide exposes our lower level CUDA-like programming model. Our tile programming model is higher-level: It compiles to what looks like the CUDA-like programming model.

If you're using burn and candle, and you're writing custom kernels, you can probably write most of your kernels in cutile-rs and let the Tile IR compiler optimize your kernel.

That said, if you're used to writing CUDA, then there is a bit of a learning curve. We have tutorials available that walk you through how it works here: https://nvlabs.github.io/cutile-rs/0.2.0/index.html

Familiarity with numpy helps substantially (it's supposed to have a numpy-like feel), but if you're coming from CUDA and want to leverage the safety features this project provides, then you should jump straight to the "useful mental models" page, which touches on how this compares to CUDA: https://nvlabs.github.io/cutile-rs/main/guide/useful-mental-...
