We quantized OpenAI’s Whisper model to 1.58 bits using Quantization-Aware Training (QAT) to run speech recognition on resource-constrained embedded CPUs. Post-Training Quantization (PTQ) failed below 4 bits, so we turned to QAT with a replicated dataset. To make inference feasible on-device, we also implemented custom low-bit kernels optimized for edge deployment. This post walks through the technical challenges of extreme quantization in a real-world setting and how we addressed them.
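The post itself doesn't show code, but the core of 1.58-bit quantization can be sketched. "1.58 bits" refers to ternary weights (log2(3) ≈ 1.58), as popularized by BitNet b1.58: each weight is mapped to {-1, 0, +1} with a per-tensor scale, and during QAT the quantized weights are used in the forward pass while gradients flow through via a straight-through estimator. The function below is a minimal illustrative sketch of the absmean ternary scheme, not the authors' actual implementation:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize weights to {-1, 0, +1} with a per-tensor absmean scale
    (BitNet b1.58-style; names here are illustrative, not from the post)."""
    scale = np.mean(np.abs(w)) + eps           # per-tensor absmean scale
    q = np.clip(np.round(w / scale), -1, 1)    # ternary codes in {-1, 0, +1}
    return q, scale                            # dequantized weight ~ q * scale

# Example: quantize a small weight matrix.
w = np.array([[0.9, -0.05, -1.1],
              [0.4,  0.0,  -0.6]])
q, s = ternary_quantize(w)
w_hat = q * s  # dequantized approximation used in the QAT forward pass
```

In QAT, `w_hat` replaces `w` in the forward pass while the optimizer updates the full-precision `w`; at deployment only `q` and `s` are stored, which is what makes the custom low-bit kernels worthwhile.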
coolhanhim•6h ago