I have an old project that relies on AWS transcription and I'd love to migrate it to something local.
- Converts to 16kHz WAV
- Transcribes using ggerganov's native whisper.cpp
- Calls out to a local LLM to clean the text
- Prints out the final cleaned up transcription
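For anyone who wants to see the shape of those steps, here is a minimal sketch, not the actual script from the gist. The ffmpeg flags are the usual ones for producing 16 kHz mono 16-bit PCM. Everything else is an assumption for illustration: `whisper_cmd` is whatever command prefix runs your whisper.cpp build and prints text to stdout, `clean_with_llm` is a hypothetical callable wrapping your local model, and the prompt wording is made up.

```python
import subprocess
from pathlib import Path


def ffmpeg_to_wav_cmd(src: str, dst: str) -> list:
    """Build an ffmpeg command that writes 16 kHz mono 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", src,
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]


def build_cleanup_prompt(raw_text: str) -> str:
    """Wrap the raw transcript in cleanup instructions (wording is an assumption)."""
    return (
        "Fix punctuation, casing, and obvious mis-transcriptions in the text "
        "below. Do not add or remove information.\n\n" + raw_text.strip()
    )


def transcribe_and_clean(src, whisper_cmd, clean_with_llm):
    """Convert, transcribe, clean, and return the final text.

    whisper_cmd:    hypothetical command prefix (list of str) for your
                    whisper.cpp binary; the WAV path is appended to it.
    clean_with_llm: hypothetical callable str -> str for the local LLM.
    """
    wav = str(Path(src).with_name(Path(src).stem + "_16k.wav"))
    subprocess.run(ffmpeg_to_wav_cmd(src, wav), check=True)
    raw = subprocess.run(whisper_cmd + [wav], check=True,
                         capture_output=True, text=True).stdout
    return clean_with_llm(build_cleanup_prompt(raw))
```

Keeping the LLM behind a plain callable means you can swap models without touching the rest of the pipeline.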
I found that accuracy increased significantly when I added the LLM post-processor, even with modestly sized 12-14B-parameter models.
I've been using it with great success to convert very old dictated memos from over a decade ago despite a lot of background noise (wind, traffic, etc).
[1] https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
I'm sure there are use cases where using Whisper directly is better, but it's a great addition to an already versatile tool.
For preprocessing, I found it best to convert files to 16 kHz WAV. I also apply low-pass and high-pass filters to remove non-speech sounds. To avoid hallucinations, I run Silero VAD on the entire audio file to find the timestamps where someone is speaking. A side note on this: Silero requires careful tuning to prevent speech segments from being chopped up and clipped. I also use a post-processing step to merge adjacent VAD chunks, which gives Whisper more cohesive segments to transcribe.
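The merge step is easy to get subtly wrong, so here is a rough sketch of one way to do it. It assumes the VAD output has already been converted to `(start, end)` pairs in seconds. The `max_gap` and `pad` defaults are placeholder values to tune, not recommendations.

```python
def merge_segments(segments, max_gap=0.5, pad=0.2, total_duration=None):
    """Pad each (start, end) segment, then merge neighbours that are close.

    segments:       iterable of (start_sec, end_sec) speech regions
    max_gap:        merge two segments if the silence between them is
                    at most this many seconds (after padding)
    pad:            seconds added on both sides so words are not clipped
    total_duration: optional file length used to clamp the last segment
    """
    merged = []
    for start, end in sorted(segments):
        start = max(0.0, start - pad)
        end = end + pad
        if total_duration is not None:
            end = min(end, total_duration)
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, with `pad=0.0` and `max_gap=0.5`, the segments `(1.0, 2.0)` and `(2.3, 3.0)` collapse into one, while `(6.0, 7.0)` stays separate.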
For the Whisper task, I run Whisper on small audio chunks that correspond to the VAD timestamps. Otherwise, it will hallucinate during silences and regurgitate the passed-in prompt. If you're on a Mac, use the whisper-mlx models from Hugging Face to speed up transcription. I ran a performance benchmark, and using a model designed for the Apple Neural Engine made a 22x difference.
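A sketch of the chunking idea, independent of which Whisper backend you use. The `transcribe` argument is a hypothetical callable that takes a chunk of samples and returns `(rel_start, rel_end, text)` tuples relative to the chunk, so you would wrap your actual model call to match. The only real work here is slicing by sample index and shifting timestamps back to file time.

```python
def transcribe_by_segments(samples, segments, transcribe, sample_rate=16000):
    """Transcribe only the speech regions and return file-relative timestamps.

    samples:    sequence of audio samples at `sample_rate`
    segments:   list of (start_sec, end_sec) from the VAD step
    transcribe: hypothetical callable, chunk -> [(rel_start, rel_end, text)]
    """
    results = []
    for seg_start, seg_end in segments:
        lo = int(seg_start * sample_rate)
        hi = int(seg_end * sample_rate)
        chunk = samples[lo:hi]
        if len(chunk) == 0:
            continue  # segment lies outside the audio
        for rel_start, rel_end, text in transcribe(chunk):
            # Shift chunk-relative times back onto the original timeline.
            results.append((seg_start + rel_start, seg_start + rel_end, text))
    return results
```

Since silence never reaches the model, there is nothing for it to hallucinate over between segments.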
For post-processing, I've found that running the generated SRT files through ChatGPT to identify and remove hallucination chunks has a better yield.
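The LLM call itself depends on your setup, so it is left out here. Assuming the model hands back the cue numbers it considers hallucinated, a rough sketch of the cleanup half looks like this: drop those cues and renumber the rest. `parse_srt` and `drop_cues` are illustrative names, and the parser only handles plain, well-formed SRT.

```python
import re


def parse_srt(text):
    """Return a list of (cue_number, timing_line, body) from SRT text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.strip().splitlines()
        if len(lines) >= 3 and lines[0].strip().isdigit() and "-->" in lines[1]:
            cues.append((int(lines[0]), lines[1].strip(), "\n".join(lines[2:])))
    return cues


def drop_cues(srt_text, bad_numbers):
    """Remove cues whose original number is in bad_numbers, then renumber."""
    kept = [c for c in parse_srt(srt_text) if c[0] not in bad_numbers]
    blocks = [
        f"{n}\n{timing}\n{body}"
        for n, (_, timing, body) in enumerate(kept, start=1)
    ]
    return "\n\n".join(blocks) + "\n"
```

Matching on the original cue number rather than list position keeps the removal correct even if a malformed block gets skipped during parsing.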
drewbuschhorn•1h ago
Pavlinbg•1h ago
nvdnadj92•24m ago
- https://huggingface.co/pyannote/speaker-diarization-3.1
- https://github.com/narcotic-sh/senko
I personally love senko since it runs in seconds, whereas pyannote took hours, but there is a 10% WER (word error rate) that is tough to get around.
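For reference, WER is normally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. Here is a minimal sketch using word-level edit distance. It splits on whitespace only, so normalize casing and punctuation beforehand if you want comparable numbers.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

One wrong word in a ten-word reference gives 0.1, which is the 10% figure.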