I’ve been exploring how far large language models can be pushed on machines with limited memory.
I built an experimental runtime and model architecture aimed at making very large models feasible on systems with around 32GB of RAM.
The core idea is combining several efficiency techniques:
ternary weight representation {-1, 0, +1} (~1.58 bits per weight), sparse execution that skips zero weights, memory-mapped layer streaming from NVMe storage, and lightweight tensor unpacking optimized for Apple Silicon.
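To make the first two techniques concrete, here's a minimal sketch of one plausible ternary encoding: since 3^5 = 243 ≤ 256, five weights in {-1, 0, +1} fit in a single byte (~1.6 bits per weight), and a dot product over unpacked weights can skip zeros and replace multiplies with add/subtract. This is my illustration, not the runtime's actual format.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1/0/+1) into bytes, 5 per byte (base-3)."""
    packed = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        for w in reversed(weights[i:i + 5]):
            value = value * 3 + (w + 1)  # map -1/0/+1 -> digits 0/1/2
        packed.append(value)
    return bytes(packed)

def unpack_ternary(packed, count):
    """Recover `count` ternary weights from the packed bytes."""
    out = []
    for byte in packed:
        for _ in range(5):
            out.append(byte % 3 - 1)  # map digits 0/1/2 back to -1/0/+1
            byte //= 3
    return out[:count]

def ternary_dot(weights, x):
    """Sparse dot product: zero weights are skipped entirely, and the
    survivors need only additions and subtractions, no multiplies."""
    return sum(xi if w > 0 else -xi for w, xi in zip(weights, x) if w != 0)
```

A real kernel would unpack with lookup tables or SIMD rather than per-trit division, but the encoding idea is the same.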
Instead of keeping the entire model in RAM, weights can be streamed from fast SSD storage and unpacked during execution. This shifts the bottleneck from memory capacity toward storage bandwidth and compute efficiency.
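The streaming idea can be sketched with a plain `mmap`: map the weight file read-only and slice out one layer's bytes at a time, letting the OS page data in from the SSD on demand instead of resident RAM holding the whole model. Names and the toy file layout below are hypothetical.

```python
import mmap
import os
import tempfile

def read_layer(path, offset, nbytes):
    """Memory-map the weight file and view one layer's bytes; only the
    touched pages are faulted in from storage, not the whole file."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return bytes(mm[offset:offset + nbytes])
        finally:
            mm.close()

# Usage: write a fake 3-layer weight file, then stream only the middle layer.
layers = [b"\x01" * 8, b"\x02" * 8, b"\x03" * 8]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"".join(layers))
layer1 = read_layer(tmp.name, offset=8, nbytes=8)
os.unlink(tmp.name)
```

In practice you would keep the mapping open across layers and rely on the page cache (plus hints like `madvise`) rather than remapping per read, but the access pattern is the same.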
Early experiments show significant compression compared to FP16 weights (for example TinyLlama-1.1B shrinking from ~2.05GB to ~0.24GB with ternary packing).
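A quick back-of-envelope check shows those figures are about what the encoding predicts (assuming 1.1e9 parameters; real checkpoints keep some tensors, e.g. embeddings, in higher precision, so the measured packed size lands slightly above this lower bound):

```python
# FP16 is 2 bytes/weight; 5-per-byte ternary packing is ~1.6 bits/weight.
params = 1.1e9
fp16_gib = params * 2 / 2**30
ternary_gib = params * 1.6 / 8 / 2**30
print(f"FP16: {fp16_gib:.2f} GiB, ternary-packed: {ternary_gib:.2f} GiB")
# prints: FP16: 2.05 GiB, ternary-packed: 0.20 GiB
```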
The project is still experimental, but the goal is to explore whether extreme compression + sparsity + SSD streaming can make much larger models practical on consumer machines.
Paper: https://opengraviton.github.io/paper.html
Runtime: https://github.com/opengraviton/graviton-native
I’d really appreciate feedback from people working on inference engines, quantization, or efficient model architectures.
fatihturker•2h ago
If models become heavily compressed and streamed from SSD, where do people think the real bottleneck moves to — storage bandwidth, memory bandwidth, or kernel efficiency?