
Show HN: AxonML – A PyTorch-equivalent ML framework written in Rust

https://github.com/AutomataNexus/AxonML
4•AutomataNexus•1h ago

Comments

AutomataNexus•1h ago
Hi HN. I've been building AxonML for a while, testing as I go, and it's now at v0.3.3 -- 22 crates, 336 Rust source files, 1,076+ passing tests. It's a from-scratch ML framework in pure Rust aiming for PyTorch parity, dual-licensed MIT/Apache-2.0.

I'm sharing it because I think the "Rust for ML" space is still underexplored relative to its potential, and I wanted to show what one person building full-time can produce.

### What's built

The full stack, bottom to top:

*Core compute:* N-dimensional tensors with broadcasting (NumPy rules), arbitrary shapes, views, slicing. Reverse-mode automatic differentiation with a tape-based computational graph. GPU backends for CUDA (GPU-resident tensors, cuBLAS GEMM, 20+ element-wise kernels with automatic dispatch), Vulkan, Metal, and WebGPU.
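
To make the tape idea concrete, here is a minimal scalar-only sketch of tape-based reverse-mode AD in general -- illustrative only, not AxonML's actual code (the real thing records tensor ops, but the record-then-reverse-walk structure is the core of the technique):

```rust
// Minimal tape-based reverse-mode autodiff over scalars.
// Each op appends an entry recording its inputs and the local
// partial derivative w.r.t. each input; backward() walks the
// tape in reverse, accumulating gradients.

#[derive(Clone, Copy)]
struct Var(usize); // index into the tape

struct Entry {
    value: f64,
    grad: f64,
    // (parent index, local partial derivative) pairs
    parents: Vec<(usize, f64)>,
}

struct Tape {
    entries: Vec<Entry>,
}

impl Tape {
    fn new() -> Self {
        Tape { entries: Vec::new() }
    }

    fn var(&mut self, value: f64) -> Var {
        self.entries.push(Entry { value, grad: 0.0, parents: vec![] });
        Var(self.entries.len() - 1)
    }

    fn value(&self, v: Var) -> f64 { self.entries[v.0].value }
    fn grad(&self, v: Var) -> f64 { self.entries[v.0].grad }

    fn add(&mut self, a: Var, b: Var) -> Var {
        let val = self.value(a) + self.value(b);
        // d(a+b)/da = 1, d(a+b)/db = 1
        self.entries.push(Entry { value: val, grad: 0.0,
            parents: vec![(a.0, 1.0), (b.0, 1.0)] });
        Var(self.entries.len() - 1)
    }

    fn mul(&mut self, a: Var, b: Var) -> Var {
        let (va, vb) = (self.value(a), self.value(b));
        // d(a*b)/da = b, d(a*b)/db = a
        self.entries.push(Entry { value: va * vb, grad: 0.0,
            parents: vec![(a.0, vb), (b.0, va)] });
        Var(self.entries.len() - 1)
    }

    // Construction order is a topological order, so walking the
    // tape backwards propagates gradients correctly.
    fn backward(&mut self, out: Var) {
        self.entries[out.0].grad = 1.0;
        for i in (0..self.entries.len()).rev() {
            let g = self.entries[i].grad;
            let parents = self.entries[i].parents.clone();
            for (p, local) in parents {
                self.entries[p].grad += g * local;
            }
        }
    }
}

fn main() {
    // z = x * y + x  =>  dz/dx = y + 1, dz/dy = x
    let mut t = Tape::new();
    let x = t.var(2.0);
    let y = t.var(3.0);
    let xy = t.mul(x, y);
    let z = t.add(xy, x);
    t.backward(z);
    assert_eq!(t.value(z), 8.0);
    assert_eq!(t.grad(x), 4.0); // y + 1
    assert_eq!(t.grad(y), 2.0); // x
    println!("z = {}, dz/dx = {}, dz/dy = {}", t.value(z), t.grad(x), t.grad(y));
}
```

Arena-style indices instead of `Rc<RefCell<...>>` node pointers are one common way to keep the borrow checker happy with a mutable graph.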

*Neural networks:* Linear, Conv1d/2d, MaxPool, AvgPool, AdaptiveAvgPool, BatchNorm1d/2d, LayerNorm, GroupNorm, InstanceNorm2d, Dropout, RNN/LSTM/GRU (with cell variants), MultiHeadAttention, CrossAttention, full Transformer encoder/decoder, Seq2SeqTransformer, Embedding. Loss functions: MSE, CrossEntropy, BCE, BCEWithLogits, L1, SmoothL1, NLL. Initialization: Xavier, Kaiming, Orthogonal.
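
As one concrete example from that layer list, LayerNorm over a single feature vector fits in a few lines -- a generic sketch of the standard formulation, not AxonML's implementation:

```rust
// LayerNorm forward pass over one feature vector: normalize to
// zero mean / unit variance, then apply learnable scale (gamma)
// and shift (beta). eps guards against division by zero.

fn layer_norm(x: &[f64], gamma: &[f64], beta: &[f64], eps: f64) -> Vec<f64> {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| (v - mean) * inv_std * g + b)
        .collect()
}

fn main() {
    let x = [1.0, 2.0, 3.0, 4.0];
    let gamma = [1.0; 4];
    let beta = [0.0; 4];
    let y = layer_norm(&x, &gamma, &beta, 1e-5);
    // With identity gamma/beta, the output has (near-)zero mean.
    let mean: f64 = y.iter().sum::<f64>() / 4.0;
    assert!(mean.abs() < 1e-9);
    println!("{:?}", y);
}
```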

*Optimizers:* SGD (with momentum/Nesterov), Adam, AdamW, RMSprop, Adagrad, LBFGS, LAMB. GradScaler for mixed precision. LR schedulers: Step, Cosine, OneCycle, Warmup, ReduceLROnPlateau, MultiStep, Exponential.
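
For reference, the momentum-SGD update these optimizers build on is tiny -- a generic sketch of the standard formulation (v ← μv + g, w ← w − lr·v), not AxonML's internal code:

```rust
// One SGD-with-momentum step over a flat parameter slice.
// v accumulates an exponentially weighted sum of past gradients.

fn sgd_momentum_step(w: &mut [f64], v: &mut [f64], g: &[f64], lr: f64, mu: f64) {
    for i in 0..w.len() {
        v[i] = mu * v[i] + g[i]; // update velocity
        w[i] -= lr * v[i];       // descend along velocity
    }
}

fn main() {
    let mut w = [1.0];
    let mut v = [0.0];
    let g = [0.5];
    sgd_momentum_step(&mut w, &mut v, &g, 0.1, 0.9);
    assert!((w[0] - 0.95).abs() < 1e-9); // v = 0.5
    sgd_momentum_step(&mut w, &mut v, &g, 0.1, 0.9);
    assert!((w[0] - 0.855).abs() < 1e-9); // v = 0.9*0.5 + 0.5 = 0.95
    println!("w = {:?}", w);
}
```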

*Distributed training:* DDP, Fully Sharded Data Parallel (ZeRO-2/ZeRO-3), Pipeline Parallelism with microbatching, Tensor Parallelism.

*LLM architectures:* BERT (encoder, sequence classification, masked LM), GPT-2 (decoder, LM head), LLaMA (RMSNorm, RotaryEmbedding, GroupedQueryAttention), Mistral, Phi. Text generation with top-k, top-p, temperature sampling. Pretrained model hub configs.
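
To illustrate the sampling side, here is a generic sketch of top-p (nucleus) filtering with temperature -- the standard technique, not AxonML's code:

```rust
// Top-p filtering with temperature: scale logits, softmax, keep the
// smallest set of tokens whose cumulative probability reaches p,
// then renormalize so the kept probabilities sum to 1. A sampler
// would then draw from the returned (token, probability) pairs.

fn top_p_filter(logits: &[f64], temperature: f64, p: f64) -> Vec<(usize, f64)> {
    // Temperature-scaled softmax (subtract max for numerical stability).
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter()
        .map(|l| ((l - max) / temperature).exp())
        .collect();
    let sum: f64 = exps.iter().sum();
    let mut probs: Vec<(usize, f64)> =
        exps.iter().enumerate().map(|(i, e)| (i, e / sum)).collect();

    // Sort descending by probability and take the nucleus.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut cum = 0.0;
    let mut kept = Vec::new();
    for (i, pr) in probs {
        kept.push((i, pr));
        cum += pr;
        if cum >= p { break; }
    }

    // Renormalize over the kept tokens.
    let z: f64 = kept.iter().map(|(_, pr)| pr).sum();
    kept.into_iter().map(|(i, pr)| (i, pr / z)).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.1, -1.0];
    let kept = top_p_filter(&logits, 1.0, 0.9);
    // The lowest-probability tail is cut; kept weights sum to 1.
    assert_eq!(kept.len(), 3);
    let total: f64 = kept.iter().map(|(_, p)| p).sum();
    assert!((total - 1.0).abs() < 1e-9);
    println!("{:?}", kept);
}
```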

*Ecosystem tooling:* ONNX import/export (40+ operators, opset 17), model quantization (INT4/INT5/INT8/F16, block-based with calibration, ~8x size reduction at Q4), kernel fusion (automatic pattern detection, FusedLinear, up to 2x on memory-bound ops), JIT compilation (graph optimization, Cranelift foundation), profiling (timeline with Chrome trace export, bottleneck analyzer).
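
The block-based quantization idea is simple to show in miniature -- here is a generic per-block INT8 absmax sketch (not AxonML's actual scheme, which also does calibration and sub-8-bit packing): each block stores one f32 scale plus one i8 per weight, roughly a 4x reduction versus f32.

```rust
// Per-block INT8 absmax quantization: scale the block so its largest
// magnitude maps to 127, round to i8, and keep the scale for dequant.

fn quantize_block(block: &[f32]) -> (f32, Vec<i8>) {
    let absmax = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let q = block.iter().map(|&x| (x / scale).round() as i8).collect();
    (scale, q)
}

fn dequantize_block(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights = [0.51f32, -0.23, 0.04, 0.99];
    let (scale, q) = quantize_block(&weights);
    let restored = dequantize_block(scale, &q);
    // Round-trip error is bounded by half a quantization step.
    for (w, r) in weights.iter().zip(&restored) {
        assert!((w - r).abs() <= scale / 2.0 + 1e-6);
    }
    println!("scale = {scale}, q = {q:?}");
}
```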

*Vision/Audio/NLP:* ResNet, VGG, ViT architectures, image transforms, MFCC/spectrogram, BPE tokenizer, vocabulary management.

*Full application stack:* CLI with 50+ commands, terminal UI (ratatui-based dashboard), web dashboard (Leptos/WASM with WebSocket), Axum REST API server with JWT auth, MFA (TOTP + WebAuthn), model registry, inference endpoint deployment, in-browser terminal via WebSocket PTY, Prometheus metrics, Weights & Biases integration, Kaggle integration.

I estimate PyTorch parity at roughly 92-95% for the core training loop and standard layer types.

### Production deployment -- this is the part I'm most proud of

AxonML is running live production inference right now. 12 HVAC predictive maintenance models (LSTM autoencoders for anomaly detection + GRU failure predictors) are deployed across 6 Raspberry Pi edge controllers, monitoring commercial building equipment across 5 facilities. Each model is cross-compiled to `armv7-unknown-linux-musleabihf` (static musl), runs as a PM2-managed daemon at ~2-3 MB RSS, and exposes predictions via REST API at 1 Hz.

Beyond those initial 6 controllers, I've built out models for 35 HVAC areas across 7 facilities (FCOG, Warren, Huntington, Akron, Hopebridge, NE Realty, and a unified NexusBMS system with 22 trained models covering air handlers, boilers, chillers, VAVs, fan coils, make-up air units, DOAS units, pumps, and steam systems). 69 `.axonml` model files total.

The deployment pipeline: AxonML training on CPU --> `.axonml` serialized weights --> cross-compiled ARM inference binary (pure tensor ops, no autograd overhead) --> PM2 process management on the Pi --> HTTP endpoints for integration with the building management system.
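
In shell terms, a run of that pipeline looks roughly like this (binary names, paths, and hosts are illustrative, not AxonML's actual CLI):

```shell
# Cross-compile the inference-only binary against the static musl target
# (requires: rustup target add armv7-unknown-linux-musleabihf).
cargo build --release --target armv7-unknown-linux-musleabihf

# Ship the static binary and serialized weights to the edge controller.
scp target/armv7-unknown-linux-musleabihf/release/hvac-infer \
    models/ahu1.axonml pi@controller:/opt/hvac/

# Run it under PM2 and poll the REST endpoint.
ssh pi@controller 'pm2 start /opt/hvac/hvac-infer --name hvac-ahu1'
curl http://controller:8080/predict
```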

This is the use case that drove most of the framework's development. The models needed to be small, fast, and run on constrained hardware without Python.

### Kaggle competition usage

I'm also using AxonML for the Deep Past Initiative Kaggle competition -- machine translation from Akkadian cuneiform to English. Full seq2seq Transformer (encoder-decoder with multi-head attention, sinusoidal positional encoding, BPE tokenization) trained on ~1,561 parallel sentence pairs. It compiles and trains end-to-end through AxonML. Evaluated on BLEU + chrF++.
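
For instance, the sinusoidal positional encoding such a Transformer uses is easy to show standalone -- a generic sketch of the standard Vaswani et al. formulation, not AxonML's code:

```rust
// Sinusoidal positional encoding: each position gets a d_model-dim
// vector of interleaved sin/cos values at geometrically spaced
// frequencies, so relative offsets are linearly recoverable.

fn positional_encoding(max_len: usize, d_model: usize) -> Vec<Vec<f64>> {
    (0..max_len)
        .map(|pos| {
            (0..d_model)
                .map(|i| {
                    // Paired dims share a frequency: 10000^(2k/d_model).
                    let freq = 10000f64.powf((2 * (i / 2)) as f64 / d_model as f64);
                    let angle = pos as f64 / freq;
                    if i % 2 == 0 { angle.sin() } else { angle.cos() }
                })
                .collect()
        })
        .collect()
}

fn main() {
    let pe = positional_encoding(16, 8);
    // Position 0 encodes as alternating 0 (sin) and 1 (cos).
    assert_eq!(pe[0][0], 0.0);
    assert_eq!(pe[0][1], 1.0);
    // All entries lie in [-1, 1].
    assert!(pe.iter().flatten().all(|v| v.abs() <= 1.0));
    println!("pe[1] = {:?}", pe[1]);
}
```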

### Honest limitations

- *Ecosystem maturity.* PyTorch has thousands of contributors, Hugging Face, torchvision's pretrained zoo, and a decade of Stack Overflow answers. AxonML has one developer and a growing but small set of pretrained weights. If you need a specific pretrained model, you'll probably need to convert it yourself via ONNX.
- *GPU kernel coverage.* CUDA support works -- cuBLAS GEMM, 20+ element-wise kernels, GPU-resident tensors -- but the coverage is nowhere near cuDNN-backed PyTorch. Some operations will fall back to CPU. The Vulkan/Metal/WebGPU backends are implemented but less battle-tested than CUDA.
- *Python interop doesn't exist.* If your workflow depends on pandas, scikit-learn preprocessing, or Jupyter notebooks, you'll need to handle data prep separately. This is a Rust-native framework.

### Why Rust for ML?

Three reasons from practical experience:

1. *Single-binary deployment.* `cargo build --release --target armv7-unknown-linux-musleabihf` gives you a statically linked inference binary. No Python runtime, no pip, no conda, no Docker. Copy it to a Raspberry Pi and it runs. This is why my HVAC models actually work in production.
2. *Compile-time safety.* Dimension mismatches, type errors, and lifetime issues are caught before you start a training run, not 3 hours into one.
3. *Memory predictability.* No GC pauses, no reference-counting overhead on the hot path, deterministic memory layout. On a Raspberry Pi with 1 GB of RAM running at 2-3 MB RSS, this matters.

GitHub: https://github.com/AutomataNexus/AxonML

Happy to answer questions about the architecture, the borrow-checker-vs-autograd challenges, the edge deployment pipeline, or the Kaggle experience.

jacobn•1h ago
Cool! How do you actually implement “Reverse-mode automatic differentiation with a tape-based computational graph” in rust?
