Architecture: two-stage pipeline. Stage 1 is speech recognition via Whisper (whisper-rs, 7 model variants, DTW timestamps) or Qwen3-ASR. I quantized the Qwen3-ASR model myself and wrote the inference pipeline in pure Rust. It handles accented speech and dialects better than Whisper in my testing, likely because of broader training data. Silero VAD pre-filters audio before either engine runs.
Stage 2 is text polish via candle (HuggingFace's Rust ML framework). Available models: Phi 4 Mini (2.5 GB), Ministral 3B/14B, Qwen 3 4B/8B. All Q4_K_M GGUF. Metal on macOS, CUDA on Windows.
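To make the two-stage flow concrete, here's a minimal sketch in Rust. All trait and type names are hypothetical illustrations, not Sumi's actual API; the real engines (whisper-rs/Qwen3-ASR and candle) would sit behind these traits.

```rust
// Hypothetical sketch of the two-stage pipeline. Trait and type names
// are illustrative, not Sumi's actual API.

/// Stage 1: speech-to-text (Whisper or Qwen3-ASR in the real app).
trait SttEngine {
    fn transcribe(&self, audio: &[f32]) -> String;
}

/// Stage 2: LLM text polish (candle-backed in the real app).
trait Polisher {
    fn polish(&self, raw: &str, prompt: &str) -> String;
}

struct Pipeline<S: SttEngine, P: Polisher> {
    stt: S,
    polisher: P,
}

impl<S: SttEngine, P: Polisher> Pipeline<S, P> {
    fn run(&self, audio: &[f32], prompt: &str) -> String {
        // In the real app, Silero VAD drops non-speech audio
        // before the STT engine ever sees it.
        let raw = self.stt.transcribe(audio);
        self.polisher.polish(&raw, prompt)
    }
}
```

Keeping both stages behind traits is also what makes the BYOK cloud option cheap to support: a cloud STT or polish backend is just another implementation.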
The polish step does context detection: reads the active app and URL (NSWorkspace + osascript on Mac, GetForegroundWindow on Windows) and selects a prompt accordingly. You can define custom rules keyed on app name, bundle ID, or URL regex.
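Rule matching might look roughly like this. Names are hypothetical, and where the real app matches URLs by regex, the sketch uses a substring check so it stays dependency-free:

```rust
/// A hypothetical prompt-selection rule keyed on app name, bundle ID,
/// or a URL pattern. (The real app uses a URL regex; a substring match
/// here keeps the sketch free of the `regex` dependency.)
struct PromptRule {
    app_name: Option<String>,
    bundle_id: Option<String>,
    url_pattern: Option<String>,
    prompt: String,
}

/// The foreground context read via NSWorkspace + osascript on macOS
/// or GetForegroundWindow on Windows.
struct AppContext {
    app_name: String,
    bundle_id: String,
    url: Option<String>,
}

/// First matching rule wins; unset fields match anything.
/// Falls back to a default prompt when no rule matches.
fn select_prompt<'a>(rules: &'a [PromptRule], ctx: &AppContext, default: &'a str) -> &'a str {
    for rule in rules {
        let app_ok = rule.app_name.as_deref().map_or(true, |n| n == ctx.app_name);
        let bundle_ok = rule.bundle_id.as_deref().map_or(true, |b| b == ctx.bundle_id);
        let url_ok = rule.url_pattern.as_deref().map_or(true, |p| {
            ctx.url.as_deref().map_or(false, |u| u.contains(p))
        });
        if app_ok && bundle_ok && url_ok {
            return &rule.prompt;
        }
    }
    default
}
```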
Other things:

- Meeting mode: background transcription to SQLite. Start before a call, stop when done.
- Edit by Voice: select text, speak an instruction ("translate to English", "make this shorter"), and the LLM rewrites it in place.
- Two local STT engines with 100+ languages and automatic code-switching.
- Optional BYOK cloud: STT via Groq/OpenAI/Deepgram/Azure, polish via OpenRouter/Groq/Gemini/SambaNova.
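Edit by Voice mostly reduces to building one rewrite prompt from the selection and the spoken instruction. A minimal sketch; the function name and template are illustrative, not Sumi's actual prompt:

```rust
/// Build an LLM rewrite prompt from a spoken instruction and the
/// currently selected text. Hypothetical template for illustration.
fn edit_by_voice_prompt(instruction: &str, selection: &str) -> String {
    format!(
        "Apply this instruction to the text and return only the rewritten text.\n\
         Instruction: {instruction}\n\
         Text:\n{selection}"
    )
}
```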
I built this because the existing tools (Wispr Flow, SuperWhisper) are cloud-only for AI processing and subscription-based. I wanted local inference for both stages, custom prompt rules per app, and source code I could actually read.
Rust, GPLv3.
Website: https://sumivoice.com/en/?utm_source=hackernews&utm_medium=forum&utm_campaign=launch_2026q1&utm_content=show_hn
Source: https://github.com/alan890104/sumi