Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust

17•melihelibol•3h ago

Comments

melihelibol•3h ago

Hello,

I built cuTile Rust and just posted the paper preprint. Happy to answer questions.

TL;DR: Rust gives you fearless concurrency on the CPU, but GPU kernel programming still requires unsafe code. cuTile Rust carries Rust's ownership model across the launch boundary and maintains it via a safe tile-based programming model that compiles to Tile IR. The host-side GPU work you write composes into synchronous launches, async pipelines, or a CUDA graph you capture once and replay.

It works by letting you partition mutable output tensors into disjoint pieces on the host: Each tile program gets an exclusive `&mut` view of its piece and the inputs as shared `&` reads. Kernels are written with single-threaded semantics, and the compiler maps that to thread blocks and manages shared memory. Because the pieces are provably disjoint and ordering is threaded through mutable references, you get compile-time data-race freedom when using the safe surface API. As with any other Rust tool, safety remains extensible: Any functionality not yet exposed safely can be made available by writing your own safe abstractions over unsafe code where you supply the invariants yourself. The lower-level Tile IR operation surface (as of CUDA 13.3) is exposed through unsafe intrinsics.
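As a rough CPU-side analogy of the partitioning idea above (this uses only the Rust standard library and is not the cuTile Rust API; `tiled_add` and the tile size are hypothetical names chosen for illustration), each worker gets an exclusive `&mut` slice of the output and shared `&` slices of the inputs:

```rust
use std::thread;

/// Hypothetical helper: element-wise add of `a` and `b` into `out`,
/// with one scoped thread per output tile (a CPU stand-in for a tile program).
fn tiled_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0);
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut` slices, so the borrow
        // checker can see that no two workers write the same element.
        for (i, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = i * tile;
            // Inputs are only read, so they can be shared as plain `&` slices.
            let a_tile = &a[start..start + out_tile.len()];
            let b_tile = &b[start..start + out_tile.len()];
            s.spawn(move || {
                // The body is written as ordinary single-threaded code.
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = x + y;
                }
            });
        }
    });
}

fn main() {
    let a = vec![1.0f32; 10];
    let b: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let mut out = vec![0.0f32; 10];
    tiled_add(&a, &b, &mut out, 4);
    println!("{:?}", out);
}
```

Handing the same output tile to two workers in this sketch is rejected at compile time, which is the same kind of guarantee the post describes for the safe surface API.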

On the B200, an optimized safe GEMM kernel is competitive with cuBLAS: It's within 0.3% of a hand-written low-level (Tile IR) variant at ~92% of the GPU's dense f16 peak. Element-wise is ~7 TB/s. So on these kernels the safety is effectively free. We also worked with Hugging Face to evaluate Grout, a Qwen3 inference engine built on cuTile Rust. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on an NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, showing strong performance on memory-bound inference tasks, consistent with our HBM roofline analysis. Benchmarks (harnesses + CSVs), hardware, and clock settings can be found in the repo. The full methodology is available in the paper: https://arxiv.org/abs/2606.15991
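For readers who want to sanity-check the memory-bound claim, here is a back-of-the-envelope bandwidth ceiling for batch-1 decode. It is a simplified sketch, not the paper's roofline methodology: it assumes every weight byte is read once per token and ignores KV-cache and activation traffic. The parameter count, bytes per weight, and memory bandwidth in `main` are illustrative assumptions, not figures from the post; only the 171 tokens/s measurement is.

```rust
/// Upper bound on batch-1 decode rate if generating one token requires
/// streaming all model weights from memory exactly once.
fn decode_ceiling_tokens_per_s(weight_bytes: f64, bandwidth_bytes_per_s: f64) -> f64 {
    bandwidth_bytes_per_s / weight_bytes
}

fn main() {
    // ASSUMPTIONS (not from the post): about 4e9 parameters stored at
    // 2 bytes each, and about 1.792e12 bytes/s of memory bandwidth.
    let weight_bytes = 4.0e9 * 2.0;
    let bandwidth = 1.792e12;

    let ceiling = decode_ceiling_tokens_per_s(weight_bytes, bandwidth);
    let measured = 171.0; // tokens/s reported in the post for Qwen3-4B

    println!(
        "ceiling: {:.0} tokens/s, measured/ceiling: {:.2}",
        ceiling,
        measured / ceiling
    );
}
```

Swapping in the actual weight footprint and the measured bandwidth of a given card gives a quick estimate of how close a decode loop is to the memory-bound limit.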

This is an early-stage research release: the tensor API is young (a few patterns still need raw pointers), GEMM trails cuBLAS at some sizes, and the tile model gives up SIMT-level control (explicit warp primitives, manual shared memory) in exchange for the semantics that make safety checkable. The recent 0.2.0 release added low-precision support (FP4 packing and block-scaled MMA on CUDA 13.3). These only just landed, so we haven't benchmarked them yet, but we expect them to perform well. Also, while Tile IR is not portable across GPU vendors, it is portable across NVIDIA GPU architectures: What you write in cutile-rs will work on sm_80+ (Ampere and up) with CUDA 13.3, but hardware-specific features such as native FP4 require architectures that support them.

The latest release is on crates.io. After setting up CUDA 13, you can `cargo add cutile` to pull it into your own project, or clone the repo and `cargo run -p cutile-examples --example hello_world` to try it out. Most of what's here has co-evolved with community feedback and contributions since we made the repo public. We read everything that comes in, so anything you raise will shape the direction of this project. If you have a cool feature idea, open an issue or a PR and let's discuss.

lmeyerov•11m ago

Any thoughts on layering on-GPU work stealing or cudf on top?

For gfql (graph query language mapping down to cudf calls), we're trying to jettison the hot loop of python->cpu->gpu, so been loosely watching cuTile evolve!

binarybana•2h ago

I'm excited to see what cuTile-rs unlocks. I like the direction of Hugging Face's Grout project (https://github.com/huggingface/grout) for local LLM inference:

- state of the art performance

- codebase that fits in a context window (including kernel definitions!)

- single binary deployment

Similar to antirez's ds4.c, but in Rust and with cuTile making kernels both easier to author and higher performance.

the__alchemist•1h ago

How does this compare to NVIDIA's cuda-oxide? The latter is similar in syntax to cudarc on the host side, but replaces the usual CUDA kernel (C++-ish) on the device side with Rust.

I ask, because I use CUDA in rust (kernels via cudarc; ML with burn and candle, and cuFFT with FFI), so I am trying to figure out how I would fit this into a workflow.

melihelibol•1h ago

You got it right: This project exposes our tile programming model, whereas cuda-oxide exposes our lower-level, CUDA-like programming model. Our tile programming model is higher-level: It compiles to what looks like the CUDA-like programming model.

If you're using burn and candle, and you're writing custom kernels, you can probably write most of your kernels in cutile-rs and let the Tile IR compiler optimize your kernel.

That said, if you're used to writing CUDA, then there is a bit of a learning curve. We have tutorials available that walk you through how it works here: https://nvlabs.github.io/cutile-rs/0.2.0/index.html

Familiarity with numpy helps substantially (it's supposed to have a numpy-like feel), but if you're coming from CUDA and want to leverage the safety features this project provides, then you should jump straight to the "useful mental models" page, which touches on how this compares to CUDA: https://nvlabs.github.io/cutile-rs/main/guide/useful-mental-...
