Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust

https://github.com/nvlabs/cutile-rs
2•melihelibol•1h ago

Comments

melihelibol•1h ago
Hello,

I built cuTile Rust and just posted the paper preprint. Happy to answer questions.

TL;DR: Rust gives you fearless concurrency on the CPU, but GPU kernel programming still requires unsafe code. cuTile Rust carries Rust's ownership model across the launch boundary and maintains it via a safe tile-based programming model that compiles to Tile IR. The host-side GPU work you write composes into synchronous launches, async pipelines, or a CUDA graph you capture once and replay.

It works by letting you partition mutable output tensors into disjoint pieces on the host: Each tile program gets an exclusive `&mut` view of its piece and the inputs as shared `&` reads. Kernels are written with single-threaded semantics, and the compiler maps that to thread blocks and manages shared memory. Because the pieces are provably disjoint and ordering is threaded through mutable references, you get compile-time data-race freedom when using the safe surface API. As with any other Rust tool, safety remains extensible: Any functionality not yet exposed safely can be made available by writing your own safe abstractions over unsafe code where you supply the invariants yourself. The lower-level Tile IR operation surface (as of CUDA 13.3) is exposed through unsafe intrinsics.
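The ownership idea in the paragraph above can be sketched on the CPU with nothing but the standard library. This is an analogy, not the cuTile Rust API: `saxpy_tile` and `launch_saxpy` are hypothetical names, scoped threads stand in for tile programs, and `chunks_mut` stands in for the host-side partition. It only shows why disjoint `&mut` pieces plus shared `&` inputs let the borrow checker rule out data races at compile time.

```rust
use std::thread;

/// Hypothetical "tile program" (not a cuTile API): computes one output tile.
/// `out` is an exclusive view of this tile; `x` and `y` are shared read-only tiles.
fn saxpy_tile(a: f32, x: &[f32], y: &[f32], out: &mut [f32]) {
    for ((o, xi), yi) in out.iter_mut().zip(x).zip(y) {
        *o = a * xi + yi;
    }
}

/// Hypothetical host-side "launch": splits `out` into disjoint tiles and runs
/// one tile program per tile on scoped CPU threads.
fn launch_saxpy(a: f32, x: &[f32], y: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert!(x.len() == out.len() && y.len() == out.len());
    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [f32]` slices, so each
        // thread owns its piece exclusively; inputs are only borrowed shared.
        for ((o, xt), yt) in out.chunks_mut(tile).zip(x.chunks(tile)).zip(y.chunks(tile)) {
            s.spawn(move || saxpy_tile(a, xt, yt, o));
        }
    });
}
```

Calling `launch_saxpy(2.0, &x, &y, &mut out, 2)` fills `out` tile by tile. Trying to hand the same `&mut` tile to two threads, or writing to `x` from inside a tile, is rejected by the compiler rather than caught at run time.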

On the B200, an optimized safe GEMM kernel is competitive with cuBLAS: it's within 0.3% of a hand-written low-level (Tile IR) variant at ~92% of the GPU's dense f16 peak. Element-wise kernels reach ~7 TB/s, so on these kernels the safety is effectively free. We also worked with Hugging Face to evaluate Grout, a Qwen3 inference engine built on cuTile Rust. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on an NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, showing strong performance on memory-bound inference tasks, consistent with our HBM roofline analysis. Benchmarks (harnesses + CSVs), hardware, and clock settings can be found in the repo. The full methodology is available in the paper: https://arxiv.org/abs/2606.15991
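For readers unfamiliar with the roofline argument for memory-bound decode, here is a rough sketch of the arithmetic. It is not the paper's methodology: it assumes every generated token streams all weights from memory exactly once and ignores KV-cache and activation traffic. The function names are hypothetical, and the parameter count and bytes per weight are inputs you supply, since the post does not state the precision used.

```rust
/// Upper bound on batch-1 decode throughput, assuming each token reads all
/// weights once (weights-only traffic; KV cache and activations ignored).
fn roofline_tokens_per_s(bandwidth_bytes_per_s: f64, params: f64, bytes_per_param: f64) -> f64 {
    bandwidth_bytes_per_s / (params * bytes_per_param)
}

/// Effective memory bandwidth implied by a measured decode rate under the
/// same weights-only assumption.
fn implied_bandwidth_bytes_per_s(tokens_per_s: f64, params: f64, bytes_per_param: f64) -> f64 {
    tokens_per_s * params * bytes_per_param
}
```

As an illustration only, if one assumes roughly 4e9 parameters at 2 bytes each, `implied_bandwidth_bytes_per_s(171.0, 4.0e9, 2.0)` corresponds to about 1.37 TB/s of effective weight traffic. Comparing that figure against a device's peak memory bandwidth gives the fraction of the roofline reached.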

This is an early-stage research release: the tensor API is young (a few patterns still need raw pointers), GEMM trails cuBLAS at some sizes, and the tile model gives up SIMT-level control (explicit warp primitives, manual shared memory) in exchange for the semantics that make safety checkable. The recent 0.2.0 release added low-precision support (FP4 packing and block-scaled MMA on CUDA 13.3). These only just landed, so we haven't benchmarked them yet, but we expect them to perform well. Also, while Tile IR is not portable across GPU vendors, it is portable across NVIDIA GPU architectures: What you write in cutile-rs will work on sm_80+ (Ampere and up) with CUDA 13.3, but hardware-specific features such as native FP4 require architectures that support them.

The latest release is on crates.io. After setting up CUDA 13, you can `cargo add cutile` to pull it into your own project, or clone the repo and `cargo run -p cutile-examples --example hello_world` to try it out. Most of what's here has co-evolved with community feedback and contributions since we made the repo public. We read everything that comes in, so anything you raise will shape the direction of this project. If you have a cool feature idea, open an issue or a PR and let's discuss.
