I was just wishing they would make a new flash-lite model; these things are so fast. Unfortunately, 2.5-flash (and therefore 2.5-flash-lite) failed some of my agentic workflows.
If 3.1-flash-lite can do the job, this solves basically all latency issues for agentic workflows.
I publish my benchmarks here in case anyone is interested:
https://upmaru.com/llm-tests/simple-tama-agentic-workflow-q1...
P.S.: The pricing bump is quite significant, but still stomachable if it performs well.
We priced an enterprise contract using Flash 1.5 pricing last summer; today that contract would be unit-economics negative if we used Flash 3, and Flash 2.5 and now Flash 3.1 Lite barely break even.
I predict open-source models and fine-tuning are going to make a real comeback this year for economic reasons.
Interesting. Flash 1.5 was already a year old at that point.
We benchmarked it for real-life voice-to-text use cases:
             <10s    10-30s   30s-1m   1-2m    2-3m
Flash        2548    2732     3177     4583    5961
Flash Lite   1390    1468     1772     2362    3499
Faster by    1.83x   1.86x    1.79x    1.94x   1.70x
(latency in ms, median over 5 runs per sample, non-streaming)
Key takeaways:
- 1.8x faster than Gemini 3 Flash on average
- 1-2 sec transcription time for short recordings
- ~$0.50/mo for heavy users (10h+ transcription)
- Best-in-class WER and formatting instruction following
- Multilingual: one model, 100+ languages
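For anyone wanting to reproduce this kind of comparison, here's a minimal sketch of the timing loop (median over 5 non-streaming runs, as in the table above). transcribe() is just a stand-in for whatever non-streaming transcription call you make (e.g. the google-genai SDK with an audio part), and the model names in the usage comment are placeholders, not exact model IDs.

    import time
    import statistics

    def transcribe(model: str, audio_path: str) -> str:
        # Stand-in: replace with your actual non-streaming transcription call.
        raise NotImplementedError

    def median_latency_ms(model: str, audio_path: str, runs: int = 5) -> float:
        # Median wall-clock latency (ms) over `runs` non-streaming calls.
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            transcribe(model, audio_path)
            samples.append((time.perf_counter() - start) * 1000)
        return statistics.median(samples)

    # Hypothetical usage: compare two models on the same clip.
    # flash = median_latency_ms("gemini-3-flash", "clip_10s.wav")
    # lite  = median_latency_ms("gemini-3-flash-lite", "clip_10s.wav")
    # print(f"Flash {flash:.0f} ms, Lite {lite:.0f} ms, {flash / lite:.2f}x faster")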
Gemini is slowly making $15/month voice apps obsolete.
That much is easy, but what if you could also speak to and interrupt the main voice model and keep giving it instructions? Like speaking to customer support, except that instead of putting you on hold, the agent can take several questions and give you live updates.
Actually, I'm experimenting with this kind of stuff and trying to find a nice UX to make Ottex a voice command center - to trigger AI agents like Claude, open code to work on something, execute simple commands, etc.
I haven't found official benchmarks yet, but you can find Gemini 3 Flash word error rate benchmarks here: https://artificialanalysis.ai/speech-to-text/models/gemini — they are close to SOTA.
I speak daily in both English and Russian and have been using Gemini 3 Flash as my main transcription model for a few months. I haven't seen any model that provides better overall quality in terms of understanding, custom dictionary support, instruction following, and formatting. It's the best STT model in my experience. Gemini 3 Flash has somewhat uncomfortable latency though, and Flash Lite is much better in this regard.
This will likely bring the cost below 2.5 Flash-Lite's for many tasks (it depends on the ratio of input to output tokens).
That said, AA also reports that 3.1 FL was 20% more expensive to run for their complete Intelligence index benchmark.
The overall point is that cost is extremely task-dependent, and it doesn't work to just compare per-token prices: reasoning can burn a lot of tokens, reasoning-token usage varies by both task and model, and the input/output ratios likewise vary by task.
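To make that concrete, here's a toy calculation. Every price and token count below is made up, purely to show how the input/output ratio and reasoning tokens dominate the effective per-task cost (I'm also assuming reasoning tokens are billed at the output rate):

    def cost_per_request(in_tokens: int, out_tokens: int, reasoning_tokens: int,
                         in_price: float, out_price: float) -> float:
        # Cost in USD; prices are per 1M tokens; reasoning assumed billed as output.
        return (in_tokens * in_price
                + (out_tokens + reasoning_tokens) * out_price) / 1_000_000

    # Same (made-up) per-token prices, two very different tasks:
    summarize = cost_per_request(20_000, 500, 0, in_price=0.10, out_price=0.40)
    agentic   = cost_per_request(2_000, 1_500, 6_000, in_price=0.10, out_price=0.40)
    print(f"long-input summarization: ${summarize:.5f}")  # input-dominated
    print(f"reasoning-heavy step:     ${agentic:.5f}")    # reasoning-dominated

Flip the ratios or the thinking budget and the model that is cheaper per token can easily be the more expensive one per task.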