Hi HN, I built NeuralForge because I wanted to fine-tune small LLMs on my MacBook without renting cloud GPUs or setting up CUDA.
It uses Apple's Neural Engine directly (not Metal, not CPU) to hit ~1.2 TFLOPS on a consumer Mac. The app wraps a C/Obj-C training engine in a SwiftUI GUI with live loss curves, LoRA support, and one-click export to GGUF/CoreML.
What actually works today:
- Real training on ANE (110M parameter llama2.c models)
- LoRA fine-tuning (rank 4-64 on attention weights)
- Cosine LR schedule with warmup
- Checkpoint save/resume (survives crashes)
- Text generation at 66ms/token with top-p sampling
- Export to GGUF, llama2c, CoreML formats
- 648 automated tests (unit + integration on real hardware)
Current limitations:
- Only llama2.c format models (110M tested, larger planned)
- macOS 14+ on Apple Silicon only
- Some UI features are still stubs (being honest)
Setup: git clone + bash setup.sh (one command)
The hardest part was ANE itself. There's basically zero documentation on using it for training — it's designed for inference only. I had to reverse-engineer the MIL compiler, figure out the 119-kernel compilation limit per process, and build an exec() restart mechanism that transparently re-launches the training process to get fresh kernel budget.
MIT licensed: https://github.com/Khaeldur/NeuralForge
Happy to answer questions about ANE internals.