It's context-aware: it reads your screen, documents, and active app to understand what you're working on. You can ask about PDFs, reply to emails, create calendar events, and use web search, all by voice.
It supports Gemma 4 and Qwen 3.5 for text generation, plus multiple STT backends (Parakeet, Whisper, Qwen3-ASR).
Examples:
- Gemma 4 in action: https://www.youtube.com/watch?v=OgfI-3YjEVU
- query a PDF document: https://www.youtube.com/watch?v=ggaDhut7FnU
- reply to an email: https://www.youtube.com/watch?v=QFnHXMBp1gA
- and the usual voice dictation (with optional polishing)
I currently use it a lot with Claude Code, Obsidian, and Apple Notes, or just to read papers.
Code: https://github.com/Saladino93/hitokudraft/tree/litert
Binary download: https://hitoku.me/draft/ (free with code HITOKUHN2026)
I am looking for feedback. My goal is to do AI research while interfacing with clients, and I thought this would be a nice little experiment to iterate/fail quickly on.
P.S. (if anyone has tips about this)
The current Gemma 4 implementation (with small models) has some problems:
- It hallucinates easily with long contexts, so I have to reset the session often. I've tuned some parameters, but I still need to find a sweet spot.
- Gemma 4 with LiteRT is currently fast compared to the MLX implementation of Qwen 3.5 (roughly 3x faster on my machine when dealing with images). But that speed comes at the price of memory spikes. I believe this is because LiteRT's WebGPU backend can allocate significantly more GPU memory than the model weights alone (I saw 38 GB of memory used for the ~4 GB E4B model!). I guess we need to wait for Google on this one.
- App size: because there is no official Swift package from Google yet, I have to bundle some files (the LiteRT dylibs), which adds ~98 MB over the previous MLX-only version (the app goes from ~50 MB to ~150 MB).
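For the first issue above, the workaround is resetting the session before the context gets long enough to degrade. A minimal sketch of that idea, as a wrapper that tracks an estimated token budget and clears history when it's exceeded; the session API (`reset`, `generate`) and the 4-chars-per-token heuristic are illustrative assumptions, not the actual LiteRT interface:

```python
class BoundedChat:
    """Wrap a chat session and reset it before the context grows too long."""

    def __init__(self, session, token_budget=2048):
        self.session = session          # hypothetical backend with reset()/generate()
        self.token_budget = token_budget
        self.used = 0                   # rough running token count

    def _estimate_tokens(self, text):
        # Crude heuristic: ~4 characters per token for English text.
        return max(1, len(text) // 4)

    def ask(self, prompt):
        cost = self._estimate_tokens(prompt)
        if self.used + cost > self.token_budget:
            self.session.reset()        # drop history before hallucinations set in
            self.used = 0
        self.used += cost
        reply = self.session.generate(prompt)
        self.used += self._estimate_tokens(reply)
        return reply
```

The "sweet spot" mentioned above would then be the `token_budget` value: too low and you lose useful context on every reset, too high and long-context hallucination returns.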
If any of this bothers you: use Qwen 3.5 instead (pure MLX), or wait for the upstream fixes from Google :)
Otherwise, in the mid-term I plan to switch to a potentially slower, but safer, MLX version of Gemma 4 (hopefully this weekend).