I built this because I was frustrated with constant OOM crashes trying to run Qwen-2.5-7B and Llama-3 models on my local GTX 1050 (4GB).
I realized that standard GGUF quantization tools often add unnecessary padding to tensors, pushing VRAM usage just over the 4GB limit due to fragmentation. QKV Core solves this with what I call "Surgical Alignment": it analyzes each layer's entropy to choose between dictionary coding and raw storage, then trims padding bytes so the payload ends exactly on a block boundary (e.g., the 110-byte block size of Q3_K).
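For the curious, here's a rough Python sketch of the core idea (simplified, with helper names like shannon_entropy, ENTROPY_THRESHOLD, and pack_layer made up for this post; the real code handles more cases):

    import numpy as np

    Q3_K_BLOCK_BYTES = 110   # GGUF Q3_K packs each 256-weight block into 110 bytes
    ENTROPY_THRESHOLD = 6.0  # bits/byte; illustrative cutoff, tuned per model in practice

    def shannon_entropy(buf):
        # Shannon entropy of a byte buffer, in bits per byte.
        counts = np.bincount(buf, minlength=256)
        probs = counts[counts > 0] / buf.size
        return float(-(probs * np.log2(probs)).sum())

    def pack_layer(raw: bytes):
        # Pick an encoding for one layer, then trim trailing padding so the
        # payload ends exactly on a Q3_K block boundary.
        buf = np.frombuffer(raw, dtype=np.uint8)
        # Low-entropy layers compress well under dictionary coding; near-random
        # high-entropy layers gain nothing, so they're stored raw.
        mode = "dict" if shannon_entropy(buf) < ENTROPY_THRESHOLD else "raw"
        n_blocks = buf.size // Q3_K_BLOCK_BYTES         # whole blocks only
        trimmed = buf[: n_blocks * Q3_K_BLOCK_BYTES]    # drop the padding bytes
        return mode, trimmed.tobytes()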
This approach shaved off enough overhead to fit 7B models comfortably into 4GB of VRAM, and the aligned memory blocks plus Numba-accelerated kernels brought a ~34% improvement in I/O load times.
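The load-time win comes from the fixed 110-byte stride: every block starts at a known offset, so a Numba prange loop can index straight into the buffer with no offset table or per-block branching. A heavily simplified sketch (the decode arithmetic below is a placeholder, not the real Q3_K unpacking of 3-bit weights and scales):

    import numpy as np
    from numba import njit, prange

    BLOCK_BYTES = 110    # Q3_K block size
    BLOCK_WEIGHTS = 256  # weights decoded per block

    @njit(parallel=True, cache=True)
    def dequant_blocks(packed, out):
        n_blocks = packed.size // BLOCK_BYTES
        for b in prange(n_blocks):  # blocks are independent, so this parallelizes cleanly
            base = b * BLOCK_BYTES
            scale = packed[base] / 255.0  # placeholder scale, not the real decode
            for i in range(BLOCK_WEIGHTS):
                out[b * BLOCK_WEIGHTS + i] = scale * packed[base + 1 + i % (BLOCK_BYTES - 1)]

    # Usage (hypothetical file path):
    #   packed = np.fromfile("layer.bin", dtype=np.uint8)
    #   out = np.empty((packed.size // BLOCK_BYTES) * BLOCK_WEIGHTS, dtype=np.float32)
    #   dequant_blocks(packed, out)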
The project is open source (MIT). I'd love to hear your feedback on the quantization logic or answer any questions about the implementation!