saznlamsal•1h ago
I built this to challenge the assumption that we are strictly bound by the "Memory Wall." My hypothesis was that modern consumer silicon (like Apple M-series) has enough spare compute to decompress weights procedurally faster than it can read them from RAM.
The architecture uses a "Predator-Prey" method:
Predators: We identify and preserve the high-magnitude outliers (the "Alpha" weights) in FP16.
Prey: The remaining weights are compressed into ternary masks {-1, 0, 1} and a block-wise scalar.
Reconstruction: A bitwise kernel reconstructs the layer in L2 cache during the forward pass.
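For the curious, here's a rough sketch of the three steps above in plain NumPy (the repo itself is MLX). `outlier_frac` and `block` are illustrative parameters of my own choosing, not the repo's actual values, and a real implementation would store the outliers sparsely rather than as a dense FP16 array:

```python
import numpy as np

def predator_prey_compress(W, outlier_frac=0.01, block=64):
    """Split a weight matrix into FP16 outliers ("predators") and
    ternary {-1, 0, 1} codes with per-block scales ("prey")."""
    W = W.astype(np.float32)
    assert W.size % block == 0, "sketch assumes block-aligned layers"
    # Predators: keep the top-magnitude fraction of weights in FP16.
    k = max(1, int(outlier_frac * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    outlier_mask = np.abs(W) >= thresh
    outliers = np.where(outlier_mask, W, 0.0).astype(np.float16)
    # Prey: ternarize the remaining weights against a block-wise scale.
    prey = np.where(outlier_mask, 0.0, W).reshape(-1, block)
    scales = np.abs(prey).mean(axis=1, keepdims=True) + 1e-8
    ternary = np.clip(np.round(prey / scales), -1, 1).astype(np.int8)
    return outliers, ternary, scales.astype(np.float16), outlier_mask

def reconstruct(outliers, ternary, scales, outlier_mask):
    """Rebuild the dense layer: prey = ternary * scale, predators overwrite."""
    prey = ternary.astype(np.float32) * scales.astype(np.float32)
    prey = prey.reshape(outlier_mask.shape)
    return np.where(outlier_mask, outliers.astype(np.float32), prey)
```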
Results: I validated this on Llama-3-8B (Layer 20, SwiGLU).
Compression: ~3.0 bits per weight (effective).
Fidelity: 0.915 cosine similarity (weights) / 0.912 (outputs).
Size: Brings an 8B model down to ~3GB.
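You can sanity-check the ~3.0 effective bits/weight with a rough storage accounting. The layout below (2-bit packed ternary codes, an FP16 value plus a 32-bit position index per outlier, one FP16 scale per block) is my assumption for illustration, not the repo's documented format:

```python
def effective_bits_per_weight(outlier_frac=0.02, block=64):
    """Back-of-envelope storage cost per weight, in bits."""
    # Prey: every non-outlier weight stores a packed 2-bit ternary code.
    ternary_bits = (1.0 - outlier_frac) * 2
    # Predators: FP16 value + 32-bit position index per outlier.
    outlier_bits = outlier_frac * (16 + 32)
    # One FP16 scale amortized over each block of `block` weights.
    scale_bits = 16 / block
    return ternary_bits + outlier_bits + scale_bits
```

Under this accounting an outlier fraction in the low single-digit percents lands in the ~3 bits/weight range, consistent with an 8B model fitting in roughly 3 GB.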
The repo is currently a "Proof of Engine" in Python/MLX: the math works, but to realize the theoretical speed gain (125 t/s) I am porting the decompression kernels to Metal/CUDA.
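The port mostly means rewriting the bitwise decode loop. Here is a minimal NumPy sketch of 2-bits-per-weight packing and the shift-and-mask decode a GPU kernel would perform per block; the exact packing layout (four codes per byte, little-end first) is my assumption:

```python
import numpy as np

def pack_ternary(t):
    """Pack ternary values {-1, 0, 1} into 2-bit codes, four per byte."""
    codes = (t.astype(np.int8) + 1).astype(np.uint8).reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed, n):
    """Bitwise decode back to {-1, 0, 1}. This shift-and-mask loop is
    the part that would run per block in cache on Metal/CUDA."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.astype(np.int8).ravel()[:n] - 1
```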
Happy to answer questions about the compression logic or the "Predator" selection algorithm!