So I built Whisnap. Hold a hotkey, talk, release - text just appears where your cursor is. Transcription is local Whisper with Metal acceleration on Apple Silicon; nothing leaves your machine if you don't want it to.
I built fallbacks on top of fallbacks. If a model fails on your audio, Whisnap tries a different one, and you can always retranscribe a recording later. Even the optional cloud mode has its own fallback chain: WebSocket streaming, then batch upload, then local Whisper. Something always works.
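For the curious, the shape of that fallback logic is roughly this. A minimal Rust sketch, not Whisnap's actual code; the trait and names are hypothetical:

```rust
// Hypothetical sketch: try each transcription backend in order
// until one succeeds, keeping the last error for reporting.
trait Transcriber {
    fn name(&self) -> &str;
    fn transcribe(&self, audio: &[f32]) -> Result<String, String>;
}

fn transcribe_with_fallback(
    backends: &[Box<dyn Transcriber>],
    audio: &[f32],
) -> Result<String, String> {
    let mut last_err = String::from("no backends configured");
    for b in backends {
        match b.transcribe(audio) {
            Ok(text) => return Ok(text),
            Err(e) => {
                // Log the failure and move on to the next backend.
                eprintln!("{} failed: {e}, trying next", b.name());
                last_err = e;
            }
        }
    }
    Err(last_err)
}
```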
One thing I spent a bunch of time on: a post-processing pipeline for Whisper's hallucination problem. Anyone who's worked with Whisper knows it hallucinates "Thanks for watching, don't forget to like and subscribe" from silent audio, or loops the same phrase endlessly. The filter strips bracketed artifacts and known hallucination phrases, collapses word repetition and sentence loops, and deduplicates across consecutive transcriptions. Not perfect, but it catches most of it.
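To give a flavor of it, here's a hedged sketch of a few of those passes. The phrase list and structure are illustrative, not the real implementation:

```rust
// Illustrative phrase list; the real filter would carry many more.
const KNOWN_HALLUCINATIONS: &[&str] = &[
    "thanks for watching",
    "don't forget to like and subscribe",
];

fn filter_transcript(text: &str) -> String {
    // 1. Strip bracketed artifacts like "[Music]" or "(applause)".
    let mut cleaned = String::new();
    let mut depth = 0usize;
    for c in text.chars() {
        match c {
            '[' | '(' => depth += 1,
            ']' | ')' => depth = depth.saturating_sub(1),
            _ if depth == 0 => cleaned.push(c),
            _ => {} // inside brackets: drop the character
        }
    }

    // 2. Drop the output entirely if it's just a known hallucination.
    let lower = cleaned.to_lowercase();
    if KNOWN_HALLUCINATIONS.iter().any(|p| lower.trim() == *p) {
        return String::new();
    }

    // 3. Collapse immediate word repetition ("the the the" -> "the").
    let mut out: Vec<&str> = Vec::new();
    for w in cleaned.split_whitespace() {
        if out.last().map_or(true, |&prev| !prev.eq_ignore_ascii_case(w)) {
            out.push(w);
        }
    }
    out.join(" ")
}
```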
The same binary also works as a CLI: "whisnap recording.wav" just works. I run an AI agent (OpenClaw) on the same Mac, and instead of paying for ElevenLabs or another cloud transcription API, it just calls Whisnap's CLI and gets clean text back. Same models, no extra setup.
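Calling it from your own tooling is just a subprocess away. A sketch in Rust, assuming the plain "whisnap <file>" invocation above and the transcript coming back on stdout:

```rust
use std::io::{Error, ErrorKind};
use std::process::Command;

// Run the CLI on an audio file and capture the transcript.
fn transcribe_file(path: &str) -> std::io::Result<String> {
    let output = Command::new("whisnap").arg(path).output()?;
    if !output.status.success() {
        // Surface stderr as the error message.
        return Err(Error::new(
            ErrorKind::Other,
            String::from_utf8_lossy(&output.stderr).into_owned(),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout).trim().to_string())
}
```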
Stack: Tauri v2, whisper-rs, RNNoise for denoising, SIMD audio mixing, and rubato for resampling.
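The resampling step looks roughly like this with rubato. A sketch under assumed parameters: 48 kHz mono capture down to the 16 kHz mono f32 that Whisper expects:

```rust
use rubato::{FftFixedIn, Resampler};

// Downsample a 48 kHz mono buffer to 16 kHz for Whisper.
fn resample_to_16k(input_48k: &[f32]) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    const CHUNK: usize = 1024;
    // 48_000 -> 16_000 Hz, 2 FFT sub-chunks per call, 1 channel.
    let mut resampler = FftFixedIn::<f32>::new(48_000, 16_000, CHUNK, 2, 1)?;
    let mut out = Vec::with_capacity(input_48k.len() / 3);
    for chunk in input_48k.chunks_exact(CHUNK) {
        // process() takes one slice per channel and returns one Vec per
        // channel. (Any tail shorter than CHUNK is dropped in this sketch.)
        let mut frames = resampler.process(&[chunk], None)?;
        out.extend(frames.remove(0));
    }
    Ok(out)
}
```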
It's free and Mac-only for now. Would love to know if the hallucination filter holds up for anyone else's use cases. https://whisnap.com/