Our goal is to understand the ML stack end-to-end. To force ourselves to use it for something real, we also trained a 12M-parameter transformer with it.
We wrote custom CUDA kernels for operations such as flash attention, AdamW, layer norm, and GELU. We also added a WebGPU path so it can still run on machines without NVIDIA GPUs, though that comes with tradeoffs.
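For reference, here's a minimal sketch in Python of the elementwise math a GELU kernel computes, using the common tanh approximation (whether the repo's kernel uses this approximation or the exact erf form is an assumption on our part):

```python
import math

def gelu(x: float) -> float:
    """Tanh approximation of GELU (Hendrycks & Gimpel).

    Assumption: the repo's CUDA kernel may instead use the exact
    erf-based form, 0.5 * x * (1 + erf(x / sqrt(2))).
    """
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    ))
```

The CUDA version is essentially this formula applied per element, one thread per value; the interesting parts are memory access patterns and fusing it with adjacent ops, not the math itself.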
If you work on ML systems, compilers, kernels, training infrastructure, or anything adjacent, we’d genuinely love to be grilled on it.
In particular, we’d love people to point out the naive parts, the places our abstractions break down, the optimizations that don’t actually matter, and the first things that would need to change if this had to scale beyond a learning project.
Don’t hold back. Thanks!
Demo: https://youtu.be/JB3YpV3gpnI
Repo: https://github.com/mni-ml/framework
Blog: https://mni-ml.github.io/