Most dictation tools, even local ones, use Whisper or similar offline models: you record, then wait for the transcript. Localvoxtral uses Mistral's Voxtral Realtime, one of the first open-source speech models with a natively streaming architecture. Words appear as you speak, not after you stop. It feels closer to someone typing along as you talk.
Press a shortcut, speak, and text gets typed directly into whatever app you're in. No cloud, no subscription, no data leaving your machine.
Two backend options:

- voxmlx on Apple Silicon: I forked voxmlx to add a WebSocket server and memory optimizations. Runs a 4-bit quantized model on an M1 Pro. Audio and inference stay fully on-device.

- vLLM on NVIDIA GPU: tested on an RTX 3090, noticeably faster.
The app is native Swift (~97%), lives in the menu bar, and stays out of your way. Configurable shortcut, mic selection, auto-paste. GitHub: https://github.com/T0mSIlver/localvoxtral
A pre-built DMG is available in Releases.
T0mSIlver•2h ago
Why streaming matters for dictation. Whisper and most open-source STT models use bidirectional attention, meaning they need the full audio clip before they can transcribe anything. You get your text after you stop talking, usually with a noticeable delay. Voxtral Realtime takes a different approach: it has a causal audio encoder that processes audio left-to-right as it arrives. At 480ms delay it matches offline models on accuracy (FLEURS benchmark), but you see text appearing while you're still mid-sentence. For dictation this changes a lot. You can catch mistakes in real time, and the feedback loop feels natural instead of disconnected.
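The bidirectional-vs-causal distinction comes down to the attention mask. Here's a toy illustration (not Voxtral's actual implementation): an offline encoder lets every audio frame attend to every other frame, so the whole clip must exist first, while a causal encoder only lets frame t see frames up to t, so it can run as audio arrives.

```python
import numpy as np

def bidirectional_mask(n):
    # Offline encoders (Whisper-style): every frame attends to every
    # other frame, so transcription can't start until the clip ends.
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    # Streaming encoders: frame t attends only to frames <= t,
    # so partial transcripts can be emitted mid-utterance.
    return np.tril(np.ones((n, n), dtype=bool))

# Frame 0 can't see frame 2 in the causal case: that's what makes
# left-to-right streaming possible.
print(causal_mask(3).astype(int))
```

The 480ms figure is the lookahead budget: how much "future" audio the model waits for before committing to a token, trading a little latency for accuracy.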
The app connects to backends via the OpenAI Realtime API WebSocket protocol. It captures audio from your mic, streams it over the WebSocket, and receives partial transcripts that get inserted into your active text field live. Any OpenAI Realtime-compatible server works.
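As a sketch of what goes over the wire: the Realtime protocol wraps base64-encoded PCM16 audio in JSON events. The `input_audio_buffer.append` event type is standard; the exact transcript-delta event name varies by server, so the one matched below is an assumption for illustration.

```python
import base64
import json

def audio_append_event(pcm16_chunk: bytes) -> str:
    # Realtime protocol: raw PCM16 audio is base64-encoded and wrapped
    # in an input_audio_buffer.append event, sent over the WebSocket.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })

def extract_delta(message: str):
    # Servers push partial transcripts as delta events; the event name
    # checked here is an assumption, not a fixed part of the spec.
    event = json.loads(message)
    if event.get("type", "").endswith("transcription.delta"):
        return event.get("delta")
    return None
```

The client loop is then: read a mic buffer, send `audio_append_event(chunk)`, and insert whatever `extract_delta` returns into the focused text field.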
The voxmlx fork. The original voxmlx by Awni Hannun does local Voxtral inference on Apple Silicon via MLX, but it was CLI-only. I added a WebSocket server that speaks the OpenAI Realtime protocol so localvoxtral (or any compatible client) can connect to it. I also added memory management to avoid OOM on longer sessions. Fork is here: https://github.com/T0mSIlver/voxmlx. I'd like to get the server piece upstreamed eventually.
Latency. On M1 Pro with a 4-bit quantized model, first words appear within roughly 200 to 400ms. On RTX 3090 via vLLM it's faster. Both feel responsive enough for natural dictation.

What's next. Right now you have to start the server yourself before using the app. I want to add app-managed local serving (start/stop/model download) so it's truly one-click. If anyone has experience bundling Python/MLX processes into macOS apps cleanly, I'd love to hear your approach.
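The supervision part of app-managed serving is straightforward; the hard part is bundling. A minimal sketch of the supervisor idea in Python (in the app itself this would be Swift's `Process`; the command is a placeholder, not the real voxmlx invocation):

```python
import subprocess

class ManagedServer:
    """Supervise a local inference server as a child process.
    cmd is a placeholder here; the real thing would launch the
    voxmlx or vLLM server."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.proc = None

    def start(self):
        # Only spawn if not already running.
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.cmd)

    def stop(self, timeout=5.0):
        if self.proc and self.proc.poll() is None:
            self.proc.terminate()  # polite SIGTERM first
            try:
                self.proc.wait(timeout)
            except subprocess.TimeoutExpired:
                self.proc.kill()   # escalate if it hangs
        self.proc = None
```

The open question is less the lifecycle and more how to ship the Python/MLX runtime inside a signed .app bundle without the user installing anything.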
Happy to answer questions.