I wrote a dependency-free C kernel (sparse-ternary-fma) using 2-bit encoding and AVX-512 instructions.
Benchmarks on Intel Xeon (N=4096):
Throughput (Dense): 2.38x faster (8.21 GFLOPS vs 3.45 GFLOPS for the AVX2 baseline)
Throughput (Sparse, 80% zeros): 26.12x faster (23.25 GFLOPS vs 0.89 GFLOPS for the scalar baseline)
Memory: 4x denser (2 bits per weight vs the standard 8 bits)
This approach packs 4 trits per byte and leverages sparsity-aware FMA to skip zero-valued weights, which is critical for 1.58-bit quantization efficiency.
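To make the packing and zero-skipping concrete, here is a minimal scalar sketch in plain C. It is not the code from the PR: the 2-bit codes (00 = 0, 01 = +1, 10 = -1), the function names `pack_trits` and `ternary_dot`, and the byte-level skip are all assumptions chosen for illustration, and the real kernel does this work with AVX-512 rather than a scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 2-bit codes (illustrative, not taken from the PR):
 * 00 = 0, 01 = +1, 10 = -1. Code 11 is unused. */
enum { TRIT_ZERO = 0, TRIT_POS = 1, TRIT_NEG = 2 };

/* Pack n trits (each -1, 0, or +1) into (n + 3) / 4 bytes.
 * Trit i lands in byte i / 4 at bit offset 2 * (i % 4). */
void pack_trits(const int8_t *trits, size_t n, uint8_t *packed)
{
    for (size_t i = 0; i < (n + 3) / 4; i++)
        packed[i] = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t code = trits[i] > 0 ? TRIT_POS
                     : trits[i] < 0 ? TRIT_NEG
                                    : TRIT_ZERO;
        packed[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
}

/* Scalar reference dot product of packed ternary weights with x.
 * A byte equal to 0 means all four weights are zero, so it is skipped
 * without touching x. Nonzero weights need only an add or a subtract. */
float ternary_dot(const uint8_t *packed, const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        uint8_t byte = packed[i / 4];
        if (byte == 0)
            continue; /* sparsity skip: four zero weights at once */
        for (size_t j = 0; j < 4 && i + j < n; j++) {
            unsigned code = (byte >> (2 * j)) & 3u;
            if (code == TRIT_POS)
                acc += x[i + j];
            else if (code == TRIT_NEG)
                acc -= x[i + j];
        }
    }
    return acc;
}
```

With this layout, N=4096 weights fit in 1024 bytes instead of 4096. Calling `pack_trits` once on the weight vector and then `ternary_dot` per input vector gives a reference result to check a vectorized kernel against.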
A PR is pending on the Microsoft BitNet repo. The code is open source here: https://github.com/microsoft/BitNet/pull/365