I'm not an ML engineer. I used Claude Code (Opus 4.6) to get LoRA fine-tuning gradients running on Apple's Neural Engine, the dedicated ML accelerator in every Apple Silicon Mac, which has no public training API.
192 gradient dispatches, zero GPU fallbacks, converging loss, all at ~2.8W.
Three discoveries found through iteration on real hardware: (1) ANE's matmul op compiles but never executes — everything must be rewritten as 1x1 convolutions, (2) spatial dimensions must be multiples of 16, (3) the ANE compiler leaks handles and silently fails after ~119 compiles.
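For anyone curious what discovery (1) looks like in practice, here's a minimal sketch of the matmul-as-1x1-conv rewrite using coremltools' MIL builder. The shapes, the placeholder weights, and the choice to lay tokens out on a 16x16 spatial grid (to satisfy discovery (2)) are my illustrative assumptions, not the repo's actual kernel generator:

    import numpy as np
    import coremltools as ct
    from coremltools.converters.mil import Builder as mb

    IN_F, OUT_F = 64, 32
    H = W = 16  # discovery (2): keep spatial dims multiples of 16

    @mb.program(input_specs=[mb.TensorSpec(shape=(1, IN_F, H, W))])
    def linear_as_conv(x):
        # Tokens live on the 16x16 spatial grid; a (OUT_F, IN_F, 1, 1)
        # kernel computes the same per-token dot products mb.matmul
        # would have, but in a form the ANE actually executes.
        w = np.zeros((OUT_F, IN_F, 1, 1), dtype=np.float32)  # placeholder weights
        return mb.conv(x=x, weight=w, strides=[1, 1], pad_type="valid")

    model = ct.convert(linear_as_conv, convert_to="mlprogram",
                       compute_units=ct.ComputeUnit.CPU_AND_NE)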
Built on maderix's ANE reverse engineering work. The repo includes the full MIL kernel generator, subprocess isolation for the compile limit, and integration with MLX for hybrid GPU+ANE training.
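The compile-limit workaround is simpler than it sounds: spawn a fresh process per kernel compile so any leaked compiler handles die with the child. A rough sketch of the pattern (function names and plumbing here are mine, not the repo's):

    import multiprocessing as mp

    def _compile_in_child(build_kernel, out_path):
        # Import inside the child so all ANE compiler state, including
        # leaked handles, is confined to this short-lived process.
        import coremltools as ct
        prog = build_kernel()  # a module-level function returning a MIL Program
        model = ct.convert(prog, convert_to="mlprogram",
                           compute_units=ct.ComputeUnit.CPU_AND_NE)
        model.save(out_path)

    def compile_isolated(build_kernel, out_path):
        ctx = mp.get_context("spawn")  # fresh interpreter, fresh compiler state
        p = ctx.Process(target=_compile_in_child, args=(build_kernel, out_path))
        p.start()
        p.join()
        if p.exitcode != 0:
            raise RuntimeError(f"ANE compile failed for {out_path}")

With "spawn", build_kernel has to be a module-level function so it pickles cleanly into the child process.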