This post is mostly about how surprisingly hard it is to cut filler words out of speech cleanly. Apparently, stripping ums isn't a find and replace type thing, because Whisper's timestamps are off by up to a few hundred ms and cutting on them chops syllables or leaves stutters. So, I built a tool, erm, that starts from Whisper's guess, finds where each word actually starts and stops in the audio, and snaps the cuts to silence so there's no click, with ffmpeg doing the splicing.
dougcalobrisi•1h ago
https://github.com/dougcalobrisi/erm