EdgeWhisper runs Voxtral Mini 4B Realtime (Mistral AI, Apache 2.0) locally on Apple Silicon via the MLX framework. Hold a key, speak, release — text appears at your cursor in whatever app has focus.
Architecture: - Native Swift (SwiftUI + AppKit). No Electron. - Voxtral 4B inference via MLX on the Neural Engine. ~3GB model, runs in ~2GB RAM on M1+. - Dual text injection: AXUIElement (preserves undo stack) with NSPasteboard+CGEvent fallback. - 6-stage post-processing pipeline: filler removal → dictionary → snippets → punctuation → capitalization → formatting. - Sliding window KV cache for unlimited streaming without latency degradation. - Configurable transcription delay (240ms–2.4s). Sweet spot at 480ms.
What it does well: - Works in 20+ terminals/IDEs (VS Code, Xcode, iTerm2, Warp, JetBrains). Most dictation tools break in terminals — we detect them and switch injection strategy. - Removes filler words automatically ("um", "uh", "like"). - 13 languages with auto-detection. - Personal dictionary + snippet expansion with variable support ({{date}}, {{clipboard}}). - Works fully offline after model download. No accounts, no telemetry, no analytics.
What it doesn't do (yet): - No file/meeting transcription (coming) - No translation (coming) - No Linux/Windows (macOS only, Apple Silicon required)
Pricing: Free tier (5 min/day, no account needed). Pro at $7.99/mo or $79.99/yr.
I'd love feedback on: 1. Would local LLM post-processing (e.g., Phi-4-mini via MLX) for grammar/tone be worth the extra ~1GB RAM? 2. For developers using voice→code workflows: what context would you want passed to your editor? 3. Anyone else building on Voxtral Realtime? Curious about your experience with the causal audio encoder.
Dumbledumb•1h ago
Not knocking your app, but asking before your app seems very focused on one model, while others allow the user to pick according to their needs.