Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust

8•melihelibol•2h ago

Comments

melihelibol•2h ago

Hello,

I built cuTile Rust and just posted the paper preprint. Happy to answer questions.

TL;DR: Rust gives you fearless concurrency on the CPU, but GPU kernel programming still requires unsafe code. cuTile Rust carries Rust's ownership model across the launch boundary and maintains it via a safe tile-based programming model that compiles to Tile IR. The host-side GPU work you write composes into synchronous launches, async pipelines, or a CUDA graph you capture once and replay.

It works by letting you partition mutable output tensors into disjoint pieces on the host: Each tile program gets an exclusive `&mut` view of its piece and the inputs as shared `&` reads. Kernels are written with single-threaded semantics, and the compiler maps that to thread blocks and manages shared memory. Because the pieces are provably disjoint and ordering is threaded through mutable references, you get compile-time data-race freedom when using the safe surface API. As with any other Rust tool, safety remains extensible: Any functionality not yet exposed safely can be made available by writing your own safe abstractions over unsafe code where you supply the invariants yourself. The lower-level Tile IR operation surface (as of CUDA 13.3) is exposed through unsafe intrinsics.
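To make the ownership idea concrete, here is a minimal CPU-only analogy using nothing but the Rust standard library. It is not the cuTile Rust API: `tiled_add` and its parameters are hypothetical names chosen for illustration, and OS threads stand in for tile programs. The point it sketches is the same one described above: the output is split into disjoint pieces, each worker holds an exclusive `&mut` to its own piece, and the inputs are only ever borrowed as shared `&`.

```rust
use std::thread;

/// CPU analogy of the partitioning scheme (hypothetical helper, not cuTile's API).
/// `out` is split into disjoint tiles; each worker gets an exclusive `&mut`
/// view of one tile and shared `&` views of the matching input ranges.
pub fn tiled_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    thread::scope(|s| {
        // `chunks_mut` hands out non-overlapping mutable slices, so the
        // borrow checker can see that no two workers alias the same output.
        for (i, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = i * tile;
            let end = start + out_tile.len();
            let a_tile = &a[start..end]; // shared read
            let b_tile = &b[start..end]; // shared read
            s.spawn(move || {
                // Each worker is written with single-threaded semantics.
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = *x + *y;
                }
            });
        }
    });
}
```

Calling `tiled_add(&a, &b, &mut out, 2)` on five-element inputs spawns three workers (tiles of 2, 2, and 1 elements). Trying to hand the same output tile to two workers is rejected at compile time, which is the property the post describes carrying across the kernel launch boundary.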

On the B200, an optimized safe GEMM kernel is competitive with cuBLAS: It's within 0.3% of a hand-written low-level (Tile IR) variant at ~92% of the GPU's dense f16 peak. Element-wise kernels reach ~7 TB/s. So on these kernels, safety is effectively free. We also worked with Hugging Face to evaluate Grout, a Qwen3 inference engine built on cuTile Rust. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on an NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, showing strong performance on memory-bound inference tasks, consistent with our HBM roofline analysis. Benchmarks (harnesses + CSVs), hardware, and clock settings can be found in the repo. The full methodology is available in the paper: https://arxiv.org/abs/2606.15991
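For readers unfamiliar with the roofline argument for memory-bound decode, here is a rough back-of-the-envelope sketch. It assumes the simplest possible model: every generated token streams all weights from memory once, and KV-cache and activation traffic are ignored. The function names are hypothetical, and the inputs in the usage note (4e9 parameters, 2 bytes per parameter, ~1.79 TB/s of memory bandwidth) are illustrative assumptions, not figures taken from the paper or the repo.

```rust
/// Upper bound on batch-1 decode throughput under a weights-only roofline:
/// each token requires reading every weight once, so
/// tokens/s <= bandwidth / (params * bytes_per_param).
/// (Hypothetical helper; ignores KV-cache and activation traffic.)
pub fn roofline_tokens_per_s(
    params: f64,
    bytes_per_param: f64,
    bandwidth_bytes_per_s: f64,
) -> f64 {
    bandwidth_bytes_per_s / (params * bytes_per_param)
}

/// Fraction of the roofline bound that a measured throughput achieves.
pub fn fraction_of_roofline(measured_tokens_per_s: f64, bound_tokens_per_s: f64) -> f64 {
    measured_tokens_per_s / bound_tokens_per_s
}
```

With the assumed inputs, `roofline_tokens_per_s(4.0e9, 2.0, 1.792e12)` gives a bound of 224 tokens/s, and the 171 tokens/s quoted above would sit at roughly 76% of it. Different weight precision or bandwidth assumptions shift that ratio, so treat it as an order-of-magnitude check rather than a reproduction of the paper's analysis.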

This is an early-stage research release: the tensor API is young (a few patterns still need raw pointers), GEMM trails cuBLAS at some sizes, and the tile model gives up SIMT-level control (explicit warp primitives, manual shared memory) in exchange for the semantics that make safety checkable. The recent 0.2.0 release added low-precision support (FP4 packing and block-scaled MMA on CUDA 13.3). These only just landed, so we haven't benchmarked them yet, but we expect them to perform well. Also, while Tile IR is not portable across GPU vendors, it is portable across NVIDIA GPU architectures: What you write in cutile-rs will work on sm_80+ (Ampere and up) with CUDA 13.3, but hardware-specific features such as native FP4 require architectures that support them.

The latest release is on crates.io. After setting up CUDA 13, you can `cargo add cutile` to pull it into your own project, or clone the repo and `cargo run -p cutile-examples --example hello_world` to try it out. Most of what's here has co-evolved with community feedback and contributions since we made the repo public. We read everything that comes in, so anything you raise will shape the direction of this project. If you have a cool feature idea, open an issue or a PR and let's discuss.

binarybana•38m ago

I'm excited to see what cuTile-rs unlocks. I like the direction of Hugging Face's Grout project (https://github.com/huggingface/grout) for local LLM inference:

- state of the art performance

- codebase that fits in a context window (including kernel definitions!)

- single binary deployment

Similar to antirez's ds4.c, but in Rust and with cuTile making kernels both easier to author and higher performance.

the__alchemist•13m ago

How does this compare to NVIDIA's CUDA-oxide? The latter is similar in syntax to cudarc on the host side, but replaces the usual C++-style CUDA kernel on the device side with Rust.

I ask because I use CUDA in Rust (kernels via cudarc, ML with burn and candle, and cuFFT via FFI), so I'm trying to figure out how this would fit into my workflow.

melihelibol•2m ago

You got it right: This project exposes our tile programming model, whereas cuda-oxide exposes our lower-level, CUDA-like programming model. The tile programming model is higher-level: It compiles down to something that looks like the CUDA-like programming model.

If you're using burn and candle, and you're writing custom kernels, you can probably write most of your kernels in cutile-rs and let the Tile IR compiler optimize your kernel.

That said, if you're used to writing CUDA, then there is a bit of a learning curve. We have tutorials available that walk you through how it works here: https://nvlabs.github.io/cutile-rs/0.2.0/index.html

Familiarity with numpy helps substantially (it's supposed to have a numpy-like feel), but if you're coming from CUDA and want to leverage the safety features this project provides, then you should jump straight to the "useful mental models" page, which touches on how this compares to CUDA: https://nvlabs.github.io/cutile-rs/main/guide/useful-mental-...
