What is this? A local dictation app for macOS. It’s a free alternative to Wispr Flow, SuperWhisper, or MacWhisper. Since it runs entirely on YOUR device, we made it free. There are no servers to maintain, so we couldn’t find anything to charge you for. We were playing with Apple Silicon and it turned into something usable, so we’re releasing it.
If you've written off on-device transcription before, it’s worth another look. Apple Silicon + MLX is seriously fast. We've been using it daily for the past few weeks. It's replaced our previous setups.
The numbers that surprised us:

- <500ms results if you disable LLM post-processing (from settings) or use our fine-tuned 1B model (more on this below). This feels instant. You stop talking and the text is THERE.
- With LLM Cleanup, p50 latency for a sentence is ~800ms (transcription + LLM post-processing combined). In practice, it feels quick!
- Tested on M1, M2, and M4!
Technical Details:

- Models: Parakeet 0.6B (transcription) + Llama 3B (cleanup), both running via MLX
- Cleanup model has 8 tasks: remove filler words (ums and uhs) and stutters/repeats, convert numbers, special characters, acronyms (A P I → API), emails (hi at example dot com → hi@example.com), currency (two ninety nine → $2.99), and time (three oh two → 3:02). We’d like to add more, but each task increases latency (more on this below), so we settled here for now.
- Cleanup model uses a simple few-shot algorithm to pull in relevant examples before processing your input. The current implementation sets N=5. A rough sketch of the idea is below.
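For the curious, here’s a minimal Python sketch of that few-shot step. It is not our exact implementation: the word-overlap similarity, the example pairs, and the model path are stand-ins, and we’re assuming the mlx-lm package’s load/generate API.

    # Sketch: pick the N most relevant labeled examples, then prompt the cleanup LLM.
    # Similarity here is plain word overlap; embeddings would work too.
    from mlx_lm import load, generate  # assumes the mlx-lm Python package

    EXAMPLES = [
        {"raw": "um so the A P I is uh down", "clean": "So the API is down."},
        {"raw": "send two ninety nine to hi at example dot com",
         "clean": "Send $2.99 to hi@example.com."},
        # ... more labeled pairs ...
    ]

    def top_n_examples(transcript, n=5):
        words = set(transcript.lower().split())
        scored = sorted(EXAMPLES,
                        key=lambda ex: len(words & set(ex["raw"].lower().split())),
                        reverse=True)
        return scored[:n]

    def cleanup(transcript, model, tokenizer):
        shots = "\n".join(f"Input: {ex['raw']}\nOutput: {ex['clean']}"
                          for ex in top_n_examples(transcript))
        prompt = ("Clean up the dictated text. Fix fillers, numbers, and emails.\n"
                  f"{shots}\nInput: {transcript}\nOutput:")
        return generate(model, tokenizer, prompt=prompt, max_tokens=128)

    # Hypothetical model path, just for illustration.
    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
    print(cleanup("um send two ninety nine to uh hi at example dot com", model, tokenizer))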
Challenges:

- Cleanup Hallucinations: Out of the box, small LLMs (3B, 1B) still make mistakes. They can hallucinate long, unrelated responses and occasionally repeat back a few-shot example. We had to add scaffolding to fall back to the raw transcript when such cases are detected (a rough sketch is below), so some “ums” and “ahs” still make it through.
- Cleanup Latency: We can get better cleanup results by providing longer instructions or more few-shot examples (N=20 is better than N=5), but every input token hurts latency. At N=20, for example, LLM latency climbs to 1.5-3s. We decided the delays weren’t worth the marginally better results.
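To give a flavor of that fallback scaffolding, here’s a toy version. The specific thresholds and checks are illustrative, not the ones we ship:

    # Sketch: reject suspicious LLM cleanups and fall back to the raw transcript.
    def accept_cleanup(raw: str, cleaned: str, few_shot_examples: list[str]) -> bool:
        # Hallucination check: output shouldn't be much longer than the input.
        if len(cleaned.split()) > 1.5 * len(raw.split()) + 5:
            return False
        # Echo check: output shouldn't reproduce one of the few-shot examples.
        if any(ex.lower() in cleaned.lower() for ex in few_shot_examples):
            return False
        return True

    def final_text(raw: str, cleaned: str, few_shot_examples: list[str]) -> str:
        # If the cleanup looks wrong, ship the raw transcript (ums and all).
        return cleaned if accept_cleanup(raw, cleaned, few_shot_examples) else raw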
Experimental:

- Corrections: Since local models aren't perfect, we’ve added a feedback loop. When your transcript isn’t right, there’s a simple interface to correct it. Each correction becomes a fine-tuning example (stored locally on your machine, of course; a sketch of the idea is below). We’re working on a one-click "Optimize" flow that will use DSPy locally to adjust the LLM cleanup prompt and fine-tune the transcription model and LLM on your examples. We want to see if personalization can close the accuracy gap. We’re still experimenting, but early results are promising!
- Fine-tuned 1B model: Per the above, we’ve fine-tuned a cleanup model on our own labeled data. There’s a toggle to try it in settings. It’s blazing fast, under 500ms. Because it’s fine-tuned to the use case, it doesn’t require a long system prompt (which consumes input tokens and slows things down). If you try it, let us know what you think. We’re curious how well our model generalizes to other setups.
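Roughly, storing a correction looks like this. The file path and prompt/completion schema are made up for the example; the real format differs:

    # Sketch: append each user correction as a local prompt/completion pair.
    import json, pathlib

    CORRECTIONS = pathlib.Path.home() / ".dictation" / "corrections.jsonl"  # illustrative path

    def record_correction(raw_transcript: str, corrected_text: str) -> None:
        CORRECTIONS.parent.mkdir(parents=True, exist_ok=True)
        example = {"prompt": raw_transcript, "completion": corrected_text}
        with CORRECTIONS.open("a") as f:
            f.write(json.dumps(example) + "\n")

    # Later, the "Optimize" flow can read corrections.jsonl as few-shot or
    # fine-tuning data for the cleanup model.
    record_correction("um run the bests again", "Run the tests again.")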
*Product details*

- Universal hotkey (CapsLock default)
- Works in any text field via simulated paste events (rough sketch below)
- Access point from the menu bar & right edge of your screen (the latter can be disabled in settings)
- It pairs well with our other tool, QuickEdit, if you want to polish dictated text further.
- If it wasn’t clear, yes, it’s Mac only. Linux folks, please roast us in the comments.
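"Simulated paste" just means: put the text on the pasteboard, then synthesize Cmd+V for the frontmost app. Here’s a Python/pyobjc sketch of the mechanism (illustrative only, not how the app itself is written; needs Accessibility permission):

    # Sketch: insert text into the focused field via clipboard + synthetic Cmd+V.
    from AppKit import NSPasteboard, NSPasteboardTypeString
    import Quartz

    KEY_V = 9  # kVK_ANSI_V

    def paste_text(text: str) -> None:
        pb = NSPasteboard.generalPasteboard()
        pb.clearContents()
        pb.setString_forType_(text, NSPasteboardTypeString)

        for key_down in (True, False):
            event = Quartz.CGEventCreateKeyboardEvent(None, KEY_V, key_down)
            Quartz.CGEventSetFlags(event, Quartz.kCGEventFlagMaskCommand)
            Quartz.CGEventPost(Quartz.kCGHIDEventTap, event)

    paste_text("Hello from dictation!")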
mkw5053•1h ago
My main gripe with Wispr Flow is that it's slow and does the entire transcription in one pass after you finish speaking. Does this stream and transcribe as you talk?
I really want to see the transcription in progress while I'm speaking.
telenardo•33m ago
The issues I see are:

- Transcription models use beam search to choose the most likely words at each step, taking the surrounding words into account. Accuracy drops a lot if you commit to each top word individually as it’s spoken; the surrounding context matters a lot.
- To that point, transcription models do get things wrong (e.g. "best" instead of "test"). The LLM post-processing can help here by taking in the top-N hypotheses from the transcription model and determining which makes the most sense (e.g. "run the tests", not "run the bests"), adding another layer of semantic understanding. Again, the surrounding context really matters here. A toy version of that reranking step is below.
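Something like this, roughly (toy sketch; the prompt and the LLM call are placeholders, not the app’s actual reranker):

    # Toy sketch: ask the cleanup LLM to pick the most plausible of the
    # top-N transcription hypotheses.
    def pick_hypothesis(hypotheses, llm_generate):
        options = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
        prompt = ("Which transcription is most plausible? Reply with the number only.\n"
                  f"{options}\nAnswer:")
        reply = llm_generate(prompt).strip()
        idx = int(reply[0]) - 1 if reply[:1].isdigit() else 0
        return hypotheses[idx]

    print(pick_hypothesis(["run the bests", "run the tests"],
                          llm_generate=lambda p: "2"))  # stand-in for the real LLM call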
Do you need each word to stream individually? Or would it be sufficient for short phrases to stream?
The MLX inference is so fast that you could accomplish something like the latter by releasing and re-pressing the shortcut every 5-10 words; it honestly feels like streaming. In practice, I tend to do something like this anyway, because I find it easier to review shorter transcripts!