Show HN: RunMat – runtime with auto CPU/GPU routing for dense math

21•nallana•2mo ago

Hi, I’m Nabeel. In August I released RunMat as an open-source runtime for MATLAB code that was already much faster than GNU Octave on the workloads I tried. https://news.ycombinator.com/item?id=44972919

Since then, I’ve taken it further with RunMat Accelerate: the runtime now automatically fuses operations and routes work between CPU and GPU. You write MATLAB-style code, and RunMat runs your computation across CPUs and GPUs for speed. No CUDA, no kernel code.

Under the hood, it builds a graph of your array math, fuses long chains into a few kernels, keeps data on the GPU when that helps, and falls back to CPU JIT / BLAS for small cases.
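Here is a rough illustration of the fusion idea in Python. It is not RunMat's implementation. It treats an elementwise chain as a plain list of unary ops, which is a hypothetical stand-in for a real graph IR. It then compares eager evaluation, which makes one pass and one temporary array per op, with a fused version that applies the whole chain to each element in a single pass. The names `run_unfused`, `fuse`, and `run_fused` are made up for this sketch, and no GPU code is involved.

```python
import math
from typing import Callable, List

# Hypothetical IR: an elementwise chain is an ordered list of unary ops.
UnaryOp = Callable[[float], float]


def run_unfused(xs: List[float], ops: List[UnaryOp]) -> List[float]:
    """Eager evaluation: one full pass per op, each producing a new list."""
    out = xs
    for op in ops:
        out = [op(v) for v in out]  # materializes an intermediate array
    return out


def fuse(ops: List[UnaryOp]) -> UnaryOp:
    """Compose the whole chain into one per-element function (the 'kernel')."""
    def fused(v: float) -> float:
        for op in ops:
            v = op(v)
        return v
    return fused


def run_fused(xs: List[float], ops: List[UnaryOp]) -> List[float]:
    """Fused evaluation: a single pass over the data, no intermediate lists."""
    kernel = fuse(ops)
    return [kernel(v) for v in xs]


if __name__ == "__main__":
    chain = [math.sin, math.exp, math.cos, math.tanh]
    data = [i / 10 for i in range(10)]
    # Same ops in the same order per element, so the results match exactly.
    assert run_fused(data, chain) == run_unfused(data, chain)
```

Fusion pays off because an elementwise chain is usually limited by memory traffic rather than arithmetic. Reading and writing the data once, instead of once per op, is where most of the savings come from.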

On an Apple M2 Max (32 GB), here are some current benchmarks (median of several runs):

* 5M-path Monte Carlo
  * RunMat ≈ 0.61 s
  * PyTorch ≈ 1.70 s
  * NumPy ≈ 79.9 s
  → ~2.8× faster than PyTorch and ~130× faster than NumPy on this test.

* 64 × 4K image preprocessing pipeline (mean/std, normalize, gain/bias, gamma, MSE)
  * RunMat ≈ 0.68 s
  * PyTorch ≈ 1.20 s
  * NumPy ≈ 7.0 s
  → ~1.8× faster than PyTorch and ~10× faster than NumPy.

* 1B-point elementwise chain (sin / exp / cos / tanh mix)
  * RunMat ≈ 0.14 s
  * PyTorch ≈ 20.8 s
  * NumPy ≈ 11.9 s
  → ~140× faster than PyTorch and ~80× faster than NumPy.
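The multipliers above are ratios of the median times. The sketch below recomputes them from the numbers in this post. `median_runtime` is a hypothetical helper that shows one way to take a median over several runs with `time.perf_counter`. It is not the harness from the RunMat repo. The reported times are themselves rounded (for example, ≈ 0.14 s), so recomputed ratios can drift somewhat from the quoted ones.

```python
import time
from statistics import median
from typing import Callable


def median_runtime(fn: Callable[[], object], runs: int = 5) -> float:
    """Hypothetical helper: median wall-clock time of fn over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Median times in seconds, copied from the post (Apple M2 Max, 32 GB).
REPORTED = {
    "5M-path Monte Carlo": {"RunMat": 0.61, "PyTorch": 1.70, "NumPy": 79.9},
    "64 x 4K image preprocessing": {"RunMat": 0.68, "PyTorch": 1.20, "NumPy": 7.0},
    "1B-point elementwise chain": {"RunMat": 0.14, "PyTorch": 20.8, "NumPy": 11.9},
}

if __name__ == "__main__":
    for name, t in REPORTED.items():
        vs_torch = speedup(t["PyTorch"], t["RunMat"])
        vs_numpy = speedup(t["NumPy"], t["RunMat"])
        print(f"{name}: {vs_torch:.1f}x vs PyTorch, {vs_numpy:.0f}x vs NumPy")
```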

If you want more detail on how the fusion and CPU/GPU routing work, I wrote up a longer post here: https://runmat.org/blog/runmat-accel-intro-blog

You can run the same benchmarks yourself from the GitHub repo in the main HN link. Feedback, bug reports, and “here’s where it breaks or is slow” examples are very welcome.

Comments

constantcrying•2mo ago

Writing a (somewhat?) Matlab-compatible interpreter and runtime, which targets GPU and CPU simultaneously, is certainly impressive.

But, who is this for? Matlab users? Python users? Julia users? Do you have an aim with this project or is it just for fun?

salvesefu•2mo ago

From the Website: "If you write math in MATLAB and hit performance walls on CPU, RunMat is built for you."

nallana•2mo ago

Thanks!! It was originally for Octave users whose scripts were running painfully slow.

The goal was to keep MATLAB syntax as the frontend, but run it fast.

When we dug into why people were still using Octave, the answer was that it let them focus on their math and was easier for them to read. That was especially important for people who aren’t programmers, e.g. scientists and engineers.

I suppose this is also why we write in higher level languages than assembly.

The goal of this project is now to build the fastest runtime in the world for running math.

It turned out that MATLAB syntax carries a lot of compile-time hints (it is meant for capturing mathematical intent, after all).

As we built this, we found that if we take a domain-specific approach (i.e. we make every optimization that is best for people who want to focus on the math), we can outperform general-purpose languages like Python by a mile on the math itself.

For example, keeping tensor shapes and broadcasting intent in the AST, and having the computation graph available to detect when running on the GPU instead of the CPU is profitable, aren’t things that make practical sense to build into a general-purpose runtime like Python, but:

It lets RunMat speed up elementwise math by orders of magnitude (e.g. 1B points going through 5-6 elementwise ops like sin/cos/+/- run at least ~80× faster on my MBP than the NumPy and PyTorch versions).
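Here is a toy sketch of why having shapes in the AST helps. It is not RunMat's code. If the frontend already knows the operand shapes, it can apply MATLAB-style implicit expansion to get the result size at compile time. It can then feed the element count into a cost model that chooses between CPU and GPU. `broadcast_shape`, `numel`, and `choose_device` are hypothetical names. Every cost constant is invented for illustration, and a real runtime would have to measure these on the target machine.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def broadcast_shape(a: Shape, b: Shape) -> Shape:
    """Result size under MATLAB-style implicit expansion: each dimension must
    match or be 1, and missing trailing dimensions count as 1."""
    n = max(len(a), len(b))
    a = a + (1,) * (n - len(a))
    b = b + (1,) * (n - len(b))
    out = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            out.append(da)
        elif da == 1:
            out.append(db)
        else:
            raise ValueError(f"incompatible sizes {a} and {b}")
    return tuple(out)


def numel(shape: Shape) -> int:
    """Number of elements in an array of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n


def choose_device(n_elements: int, n_ops: int, *, data_on_gpu: bool = False,
                  cpu_ns_per_op: float = 1.0, gpu_ns_per_op: float = 0.05,
                  gpu_launch_ns: float = 50_000.0,
                  transfer_ns_per_elem: float = 0.5) -> str:
    """Toy cost model with made-up constants. It assumes the chain is fused
    into one kernel, so the GPU pays a single fixed launch overhead."""
    cpu_cost = n_elements * n_ops * cpu_ns_per_op
    gpu_cost = gpu_launch_ns + n_elements * n_ops * gpu_ns_per_op
    if not data_on_gpu:
        # Copy inputs to the device and results back to the host.
        gpu_cost += 2 * n_elements * transfer_ns_per_elem
    return "gpu" if gpu_cost < cpu_cost else "cpu"


if __name__ == "__main__":
    shape = broadcast_shape((4000, 1), (1, 4000))  # column .* row -> 4000x4000
    print(shape, choose_device(numel(shape), n_ops=5))
```

The `data_on_gpu` flag mirrors the idea of keeping data resident on the GPU. Once the copy-in and copy-out cost disappears, arrays that would otherwise stay on the CPU can become worth offloading.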

So, TL;DR: it started out for Octave users. The goal is to build the fastest math runtime for anyone who wants to use computers to do math.

Obligatory disclaimer, since we’re engineers: you can still go faster by writing your own CUDA / GPU code. We’re betting that 99% of the people trying to run math on computers don’t want to do that (the ML community notwithstanding).

ardata•2mo ago

I've built trading bots that run Monte Carlo sims on historical data... NumPy works but gets slow on large backtests, and PyTorch feels like overkill when I just want fast array math without managing GPU memory. If this can drop in and handle the heavy lifting automatically, I could see a use for it.
