Hi HN,
I built OpenGraviton, an open-source AI inference engine for running extremely large LLMs on consumer hardware. It combines 1.58-bit ternary quantization, dynamic sparsity (Top-K pruning and MoE routing), and mmap-based layer streaming, so it can run models far larger than your system RAM, even on a Mac Mini.
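For readers new to ternary quantization, here is a minimal sketch of the idea, not OpenGraviton's actual code. It assumes an absmean scale (one common choice for ternary weights) and a simple 2-bits-per-weight packing with four values per byte. The function names are hypothetical.

```python
def ternary_quantize(weights):
    """Map floats to {-1, 0, +1} plus a single scale (assumed absmean scheme)."""
    # Scale is the mean absolute weight; fall back to 1.0 for an all-zero input.
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    # Round to the nearest integer, then clip into the ternary range.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def pack_ternary(q):
    """Pack four ternary values per byte, 2 bits each (stored as codes 0, 1, 2)."""
    out = bytearray()
    for i in range(0, len(q), 4):
        byte = 0
        for j, v in enumerate(q[i:i + 4]):
            byte |= (v + 1) << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_ternary(data, n):
    """Recover the first n ternary values from packed bytes."""
    q = []
    for byte in data:
        for j in range(4):
            q.append(((byte >> (2 * j)) & 0b11) - 1)
    return q[:n]
```

Dequantizing is then just `value * scale` per weight. Reaching the theoretical 1.58 bits per weight needs a denser encoding than this 2-bit layout.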
Early benchmarks:
TinyLlama-1.1B drops from ~2GB (FP16) to ~0.24GB with ternary quantization.
At 140B scale, models that normally require ~280GB in FP16 fit within ~35GB packed.
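The arithmetic behind these numbers is easy to check. The helper below is a hypothetical illustration that counts weight storage only, ignoring scales, embeddings, and file metadata. The bits-per-weight values are my assumptions about the storage format.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 140B parameters: FP16 vs. an assumed 2-bit packing (four ternary values per byte)
fp16_140b = model_size_gb(140e9, 16)
packed_140b = model_size_gb(140e9, 2)

# TinyLlama-1.1B: FP16, 2-bit packing, and the theoretical 1.58 bits per weight
fp16_tiny = model_size_gb(1.1e9, 16)
packed_tiny = model_size_gb(1.1e9, 2)
ideal_tiny = model_size_gb(1.1e9, 1.58)
```

Under these assumptions the 140B case comes out to 280 GB and 35 GB. TinyLlama comes out to about 2.2 GB in FP16, about 0.28 GB at 2 bits, and about 0.22 GB at 1.58 bits, so the quoted ~0.24GB falls between the last two.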
It is optimized for Apple Silicon, with Metal and C++ tensor unpacking, and uses speculative decoding for faster generation.
Check benchmarks, architecture, and details here:
https://opengraviton.github.io
GitHub:
https://github.com/opengraviton
This project isn’t just about squeezing massive models onto tiny hardware; it’s about democratizing access to giant LLMs without cloud costs. Feedback, forks, and ideas are very welcome!