So I built Axiom to make that rewrite mechanical. The API mirrors NumPy/PyTorch as closely as I could — same method names, broadcasting rules, operator overloading, dynamic shapes, runtime dtypes. Code that looks like this in PyTorch:
  scores = Q.matmul(K.transpose(-2, -1)) / math.sqrt(64)
  output = scores.softmax(-1).matmul(V)

looks like this in Axiom:

  auto scores = Q.matmul(K.transpose(-2, -1)) / std::sqrt(64.0f);
  auto output = scores.softmax(-1).matmul(V);
No mental translation. No debugging subtle API differences.
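For reference, here is that snippet as a full translation unit. Treat it as a sketch: the header path, the ax namespace, and the randn factory are my assumptions, not taken from the project; only the matmul/transpose/softmax calls come from the snippet above.

  // Minimal compilable sketch. Header, namespace, and randn() are assumed.
  #include <axiom/axiom.hpp>  // assumed umbrella header
  #include <cmath>

  int main() {
      // Hypothetical factory, named by analogy with torch.randn.
      auto Q = ax::randn({8, 128, 64});
      auto K = ax::randn({8, 128, 64});
      auto V = ax::randn({8, 128, 64});

      // Scaled dot-product attention, as in the snippet above.
      auto scores = Q.matmul(K.transpose(-2, -1)) / std::sqrt(64.0f);
      auto output = scores.softmax(-1).matmul(V);
      (void)output;
      return 0;
  }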
What's in the box (28k LOC):

- 100+ operations: arithmetic, reductions, activations (relu, gelu, silu, softmax), pooling, FFT, full LAPACK linear algebra (SVD, QR, Cholesky, eigendecomposition, solvers)
- Metal GPU via MPSGraph — all ops run on GPU, not just matmul. Compiled graphs are cached by (shape, dtype) to avoid recompilation
- Seamless CPU ↔ GPU: `auto g = tensor.gpu();` — unified memory on Apple Silicon avoids copies entirely (see the sketch after this list)
- Built-in einops: `tensor.rearrange("b h w c -> b c h w")`
- Highway SIMD across architectures (NEON, AVX2, AVX-512, SSE, WASM, RISC-V)
- Runtime dtypes via variant (readable errors, not template explosions)
- Row-major default, column-major supported via as_f_contiguous()
- Works on macOS, Linux, Windows, and WebAssembly
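Here's a minimal sketch of how the GPU, einops, and layout calls compose. Only .gpu(), .rearrange(), and as_f_contiguous() come from the list above; the header, namespace, and randn factory are assumptions:

  #include <axiom/axiom.hpp>  // assumed header

  int main() {
      // Hypothetical factory; an NHWC image batch.
      auto t = ax::randn({2, 224, 224, 3});

      // Move to Metal. On Apple Silicon, unified memory avoids a copy.
      auto g = t.gpu();

      // Built-in einops-style rearrange: NHWC -> NCHW.
      auto nchw = g.rearrange("b h w c -> b c h w");

      // Column-major layout for Fortran-order (LAPACK-style) consumers.
      auto f = nchw.as_f_contiguous();
      (void)f;
      return 0;
  }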
Performance on M4 Pro (vs Eigen with OpenBLAS, PyTorch, NumPy):
- Matmul 2048×2048: 3,196 GFLOPS (Eigen 2,911 / PyTorch 2,433)
- ReLU 4096×4096: 123 GB/s (Eigen 117 / PyTorch 70)
- FFT2 2048×2048: 14.9 ms (PyTorch 27.6 ms / NumPy 63.5 ms)
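If you want to sanity-check the matmul number yourself, a rough harness could look like this (ax::randn is a hypothetical factory; a serious run would add GPU synchronization and more iterations):

  #include <axiom/axiom.hpp>  // assumed header
  #include <chrono>
  #include <cstdio>

  int main() {
      const int N = 2048;
      auto A = ax::randn({N, N});  // hypothetical factory
      auto B = ax::randn({N, N});
      auto C = A.matmul(B);        // warmup (and graph compile, on GPU)

      const int iters = 20;
      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < iters; ++i) C = A.matmul(B);
      auto t1 = std::chrono::steady_clock::now();

      double secs = std::chrono::duration<double>(t1 - t0).count() / iters;
      // An N x N matmul costs 2*N^3 floating-point operations.
      std::printf("%.0f GFLOPS\n", 2.0 * N * N * N / secs / 1e9);
      return 0;
  }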
To try it:
  git clone https://github.com/frikallo/axiom.git
  cd axiom && make release
Or add it to your CMake project via FetchContent (sketch below). Example files live in examples/.
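FetchContent_Declare and FetchContent_MakeAvailable are standard CMake; the exported target name here is an assumption, so check the project's CMakeLists:

  include(FetchContent)
  FetchContent_Declare(
    axiom
    GIT_REPOSITORY https://github.com/frikallo/axiom.git
    GIT_TAG        main  # pin a release tag in practice
  )
  FetchContent_MakeAvailable(axiom)

  # "axiom" as a link target is an assumption.
  target_link_libraries(my_app PRIVATE axiom)

Happy to answer questions about the internals or take feedback on the API.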