I tried to benchmark Google’s new on-device dictation app (Eloquent) and basically couldn’t. It drops about half of my dictations.
Background: Google shipped a new fully‑local dictation app yesterday with proprietary new models, so I was excited to benchmark it against the leading open models (Qwen3‑ASR, NVIDIA Parakeet V3, etc.).
I have a harness that drives a dictation app by playing an audio file through a virtual input device and captures the app’s pasted output, so I can compare different apps on the same clips. I also have ~1,500 manually corrected clips from my daily engineering work.
What happened: I couldn’t get a clean eval, because about half of the dictations come back missing a large number of words. A clip with 20+ words routinely returns just 5-10 words. I assumed my harness was broken, so I used the app manually, speaking slowly and clearly into the mic. Same thing: roughly half the time, I only get a small fraction of what I actually said.
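For anyone who wants to flag these dropped-word cases automatically, here is a minimal sketch of a length-ratio check. It is not my harness code: the function names and the 0.5 threshold are assumptions for illustration, and a real eval would probably want to tune the cutoff.

```python
def word_count(text: str) -> int:
    # Whitespace tokenization is a simplification; good enough for a length check.
    return len(text.split())


def is_truncated(reference: str, hypothesis: str, min_ratio: float = 0.5) -> bool:
    """Flag a transcript that has far fewer words than the corrected reference.

    min_ratio is an assumed threshold, not a measured one.
    """
    ref_words = word_count(reference)
    if ref_words == 0:
        return False  # nothing to compare against
    return word_count(hypothesis) / ref_words < min_ratio
```

With this cutoff, a 20-word clip that comes back as 8 words is flagged, while one that comes back as 19 words is not.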
When Eloquent did return a complete transcript (15 of 50 tests), its accuracy was actually competitive: ~24% WER vs. ~21% for Qwen3-ASR on the same clips. The problem isn't the recognition. It's that for most dictations, you don't get your words back at all!
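For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of that calculation is below; the normalization step (lowercasing, dropping punctuation) is an assumption on my part, and different normalization will shift the numbers.

```python
import re


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase, keep letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra word in hypothesis)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Dropped words count as deletions, so a transcript that returns only part of a clip scores badly even if every returned word is correct.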
My theory: The transcriber is a chat‑style AI model, and chat models sometimes reply about your audio instead of transcribing it.
To test this, I ran Gemma 3n (Google's open model from the same family) directly on the same clips, bypassing the Eloquent app. On 11 of 44 attempts it responded with something like “I’m sorry, I can’t transcribe this,” instead of producing a transcript. Gemma had the same ~60% word error rate as Eloquent. My guess is that Eloquent’s model has the same issue and the app just hides it.
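Counting the "reply instead of transcript" cases can be done with a simple phrase match. This is a rough sketch rather than what I actually ran: the pattern list is hypothetical, based only on the kind of response quoted above, and it will misfire on a dictation that genuinely contains "I'm sorry".

```python
import re

# Hypothetical patterns for chat-style refusals; extend after reading real outputs.
REFUSAL_PATTERNS = [
    r"\bi'm sorry\b",
    r"\bi am sorry\b",
    r"\bcan(?:'t|not) transcribe\b",
    r"\bunable to transcribe\b",
]


def looks_like_refusal(output: str) -> bool:
    # Fold curly apostrophes to straight ones so both spellings match.
    text = output.lower().replace("\u2019", "'")
    return any(re.search(pattern, text) for pattern in REFUSAL_PATTERNS)


def refusal_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that look like a refusal rather than a transcript."""
    if not outputs:
        return 0.0
    return sum(looks_like_refusal(o) for o in outputs) / len(outputs)
```

Anything this flags is worth checking by hand against the reference clip before counting it.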
Has anyone been able to get good results with this app? Or are others seeing this issue?
Disclosure: I build a competing local dictation app, so I'm not a neutral party!
telenardo•1h ago