I built OpenGraviton, an open-source AI inference engine designed to push the limits of running extremely large models on consumer hardware.
The system combines several techniques to drastically reduce memory and compute requirements:
• 1.58-bit ternary quantization ({-1, 0, +1}) for ~10x compression
• dynamic sparsity with Top-K pruning and MoE routing
• mmap-based layer streaming to load weights directly from NVMe SSDs
• speculative decoding to improve generation throughput
Together, these techniques allow models far larger than system RAM to run locally.
In early benchmarks, ternary quantization in OpenGraviton reduced TinyLlama-1.1B from ~2.05GB (FP16) to ~0.24GB. Synthetic stress tests at the 140B scale suggest that models which would normally require ~280GB in FP16 can fit within ~35GB when packed in the ternary format.
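A quick back-of-envelope check on those numbers (decimal GB = 1e9 bytes here; the reported ~0.24GB for TinyLlama is slightly above the pure-weight estimate, which is consistent with some tensors staying in higher precision):

```cpp
// Estimate model size from parameter count and bits per weight.
// FP16 stores 16 bits per weight; packed ternary stores ~1.6-2 bits.
double model_gb(double params, double bits_per_weight) {
    return params * bits_per_weight / 8.0 / 1e9;  // bytes -> decimal GB
}
```

At 2 bits per weight, a 140B model lands at exactly 35GB, matching the stress-test figure (an 8x reduction from 280GB FP16).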
The project is optimized for Apple Silicon and currently uses custom Metal + C++ tensor unpacking.
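For the layer-streaming piece, the core idea is to mmap each layer's weight blob so the OS pages it in from the SSD on demand instead of loading everything up front. A minimal POSIX sketch (the `MappedLayer`/`map_layer` names are illustrative, not OpenGraviton's actual API):

```cpp
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// A read-only view of one layer's weights, backed directly by the file on disk.
struct MappedLayer {
    const uint8_t* data = nullptr;
    size_t size = 0;
};

// Map a weight file; resident memory stays bounded by the working set,
// so total model size can exceed system RAM.
MappedLayer map_layer(const char* path) {
    MappedLayer layer;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return layer;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            // Hint that the layer is read front-to-back during a forward pass.
            madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);
            layer.data = static_cast<const uint8_t*>(p);
            layer.size = (size_t)st.st_size;
        }
    }
    close(fd);  // the mapping outlives the file descriptor
    return layer;
}
```

Pairing this with `madvise(MADV_DONTNEED)` on layers that have already been consumed is one way to keep the page cache from ballooning, though eviction policy is a tuning question.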
Benchmarks, architecture, and details: https://opengraviton.github.io
GitHub: https://github.com/opengraviton
fatihturker•1h ago
The architecture page explains how ternary quantization, dynamic sparsity, and mmap layer streaming work together to push models far beyond normal RAM limits.
Happy to answer questions about the implementation or benchmarks.