What is this? A local dictation app for macOS. It’s a free alternative to Wispr Flow, SuperWhisper, or MacWhisper. Since it runs entirely on YOUR device, we made it free. There are no servers to maintain, so we couldn’t find anything to charge you for. We were playing with Apple Silicon and it turned into something usable, so we’re releasing it.
If you've written off on-device transcription before, it’s worth another look. Apple Silicon + MLX is seriously fast. We've been using it daily for the past few weeks. It's replaced our previous setups.
The numbers that surprised us:

- <500ms results if you disable LLM post-processing (from settings) or use our fine-tuned 1B model (more on this below). This feels instant. You stop talking and the text is THERE.
- With LLM Cleanup, p50 latency for a sentence is ~800ms (transcription + LLM post-processing combined). In practice, it feels quick!
- Tested on M1, M2, and M4!
Technical Details:

- Models: Parakeet 0.6B (transcription) + Llama 3B (cleanup), both running via MLX
- Cleanup model has 8 tasks: remove filler words (ums and uhs) and stutters/repeats, convert numbers, special characters, acronyms (A P I → API), emails (hi at example dot com → hi@example.com), currency (two ninety nine → $2.99), and time (three oh two → 3:02). We’d like to add more, but each task increases latency (more on this below), so we settled here for now.
- Cleanup model uses a simple few-shot algorithm to pull in relevant examples before processing your input. The current implementation sets N=5. A rough sketch of the idea is below.
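For the curious, here’s a minimal Python sketch of that few-shot step. It is not our exact implementation: the word-overlap similarity, the example pairs, and the model path are stand-ins, and we’re assuming the mlx-lm package’s load/generate API.

    # Sketch: pick the N most relevant labeled examples, then prompt the cleanup LLM.
    # Similarity here is plain word overlap; embeddings would work too.
    from mlx_lm import load, generate  # assumes the mlx-lm Python package

    EXAMPLES = [
        {"raw": "um so the A P I is uh down", "clean": "So the API is down."},
        {"raw": "send two ninety nine to hi at example dot com",
         "clean": "Send $2.99 to hi@example.com."},
        # ... more labeled pairs ...
    ]

    def top_n_examples(transcript, n=5):
        words = set(transcript.lower().split())
        scored = sorted(EXAMPLES,
                        key=lambda ex: len(words & set(ex["raw"].lower().split())),
                        reverse=True)
        return scored[:n]

    def cleanup(transcript, model, tokenizer):
        shots = "\n".join(f"Input: {ex['raw']}\nOutput: {ex['clean']}"
                          for ex in top_n_examples(transcript))
        prompt = ("Clean up the dictated text. Fix fillers, numbers, and emails.\n"
                  f"{shots}\nInput: {transcript}\nOutput:")
        return generate(model, tokenizer, prompt=prompt, max_tokens=128)

    # Hypothetical model path, just for illustration.
    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
    print(cleanup("um send two ninety nine to uh hi at example dot com", model, tokenizer))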
Challenges:

- Cleanup Hallucinations: Out of the box, small LLMs (3B, 1B) still make mistakes. They can hallucinate long, unrelated responses and occasionally repeat back a few-shot example. We had to add scaffolding to fall back to the raw transcript when such cases are detected (a rough sketch is below), so some “ums” and “ahs” still make it through.
- Cleanup Latency: We can get better cleanup results by providing longer instructions or more few-shot examples (N=20 is better than N=5), but every input token hurts latency. At N=20, for example, LLM latency climbs to 1.5-3s. We decided the delays weren’t worth the marginally better results.
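To give a flavor of that fallback scaffolding, here’s a toy version. The specific thresholds and checks are illustrative, not the ones we ship:

    # Sketch: reject suspicious LLM cleanups and fall back to the raw transcript.
    def accept_cleanup(raw: str, cleaned: str, few_shot_examples: list[str]) -> bool:
        # Hallucination check: output shouldn't be much longer than the input.
        if len(cleaned.split()) > 1.5 * len(raw.split()) + 5:
            return False
        # Echo check: output shouldn't reproduce one of the few-shot examples.
        if any(ex.lower() in cleaned.lower() for ex in few_shot_examples):
            return False
        return True

    def final_text(raw: str, cleaned: str, few_shot_examples: list[str]) -> str:
        # If the cleanup looks wrong, ship the raw transcript (ums and all).
        return cleaned if accept_cleanup(raw, cleaned, few_shot_examples) else raw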
Experimental:

- Corrections: Since local models aren't perfect, we’ve added a feedback loop. When your transcript isn’t right, there’s a simple interface to correct it. Each correction becomes a fine-tuning example (stored locally on your machine, of course; a sketch of the idea is below). We’re working on a one-click "Optimize" flow that will use DSPy locally to adjust the LLM cleanup prompt and fine-tune the transcription model and LLM on your examples. We want to see if personalization can close the accuracy gap. We’re still experimenting, but early results are promising!
- Fine-tuned 1B model: Per the above, we’ve fine-tuned a cleanup model on our own labeled data. There’s a toggle to try it in settings. It’s blazing fast, under 500ms. Because it’s fine-tuned to the use case, it doesn’t require a long system prompt (which consumes input tokens and slows things down). If you try it, let us know what you think. We’re curious how well our model generalizes to other setups.
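Roughly, storing a correction looks like this. The file path and prompt/completion schema are made up for the example; the real format differs:

    # Sketch: append each user correction as a local prompt/completion pair.
    import json, pathlib

    CORRECTIONS = pathlib.Path.home() / ".dictation" / "corrections.jsonl"  # illustrative path

    def record_correction(raw_transcript: str, corrected_text: str) -> None:
        CORRECTIONS.parent.mkdir(parents=True, exist_ok=True)
        example = {"prompt": raw_transcript, "completion": corrected_text}
        with CORRECTIONS.open("a") as f:
            f.write(json.dumps(example) + "\n")

    # Later, the "Optimize" flow can read corrections.jsonl as few-shot or
    # fine-tuning data for the cleanup model.
    record_correction("um run the bests again", "Run the tests again.")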
*Product details*

- Universal hotkey (CapsLock default)
- Works in any text field via simulated paste events (rough sketch below)
- Access point from the menu bar & right edge of your screen (the latter can be disabled in settings)
- It pairs well with our other tool, QuickEdit, if you want to polish dictated text further.
- If it wasn’t clear, yes, it’s Mac only. Linux folks, please roast us in the comments.
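"Simulated paste" just means: put the text on the pasteboard, then synthesize Cmd+V for the frontmost app. Here’s a Python/pyobjc sketch of the mechanism (illustrative only, not how the app itself is written; needs Accessibility permission):

    # Sketch: insert text into the focused field via clipboard + synthetic Cmd+V.
    from AppKit import NSPasteboard, NSPasteboardTypeString
    import Quartz

    KEY_V = 9  # kVK_ANSI_V

    def paste_text(text: str) -> None:
        pb = NSPasteboard.generalPasteboard()
        pb.clearContents()
        pb.setString_forType_(text, NSPasteboardTypeString)

        for key_down in (True, False):
            event = Quartz.CGEventCreateKeyboardEvent(None, KEY_V, key_down)
            Quartz.CGEventSetFlags(event, Quartz.kCGEventFlagMaskCommand)
            Quartz.CGEventPost(Quartz.kCGHIDEventTap, event)

    paste_text("Hello from dictation!")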
mkw5053•1h ago
My main gripe with Wispr Flow is that it's slow and does the entire transcription in one pass after you finish speaking. Does this stream and transcribe as you talk?
I really want to see the transcription in progress while I'm speaking.
telenardo•33m ago
The issues I see are:

- Transcription models use beam search to choose the most likely words at each step, taking the surrounding words into account. Accuracy drops a lot if you commit to each top word individually as it’s spoken; the surrounding context matters a lot.
- To that point, transcription models do get things wrong (e.g. "best" instead of "test"). The LLM post-processing can help here by taking in the top-N hypotheses from the transcription model and determining which makes the most sense (e.g. "run the tests", not "run the bests"), adding another layer of semantic understanding. Again, the surrounding context really matters here. A toy version of that reranking step is below.
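Something like this, roughly (toy sketch; the prompt and the LLM call are placeholders, not the app’s actual reranker):

    # Toy sketch: ask the cleanup LLM to pick the most plausible of the
    # top-N transcription hypotheses.
    def pick_hypothesis(hypotheses, llm_generate):
        options = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
        prompt = ("Which transcription is most plausible? Reply with the number only.\n"
                  f"{options}\nAnswer:")
        reply = llm_generate(prompt).strip()
        idx = int(reply[0]) - 1 if reply[:1].isdigit() else 0
        return hypotheses[idx]

    print(pick_hypothesis(["run the bests", "run the tests"],
                          llm_generate=lambda p: "2"))  # stand-in for the real LLM call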
Do you need each word to stream individually? Or would it be sufficient for short phrases to stream?
The MLX inference is so fast that you could accomplish something like the latter by releasing and re-pressing the shortcut every 5-10 words; it honestly feels like streaming. In practice, I tend to do something like this anyway, because I find it easier to review shorter transcripts!