"GGUF quantization" is the most popular tech stack for quantizing Llama-like models for CPU. But the documentation is very sparse, and the maintainers made it clear that writing a paper is not their priority. So I spent like a week reading through the code and understanding the various concepts (K-quants, I-quants, importance matrix, etc) and put together this (unofficial) repo with explainers.
It was mostly written by hand, not standard AI slop. I mainly used Claude Code to interrogate the llama.cpp codebase and help me understand it.
It's possible I made mistakes or missed things here and there. If you have in-depth knowledge of the codebase, I'd love your contributions!