Both models show improved intelligence on the Artificial Analysis index with lower end-to-end response time, plus 24% to 50% better output token efficiency (resulting in lower cost).
Gemini 2.5 Flash-Lite improvements include better instruction following, reduced verbosity, and stronger multimodal & translation capabilities. Gemini 2.5 Flash improvements include better agentic tool use and more token-efficient reasoning.
Model strings: gemini-2.5-flash-lite-preview-09-2025 and gemini-2.5-flash-preview-09-2025
How long Google can keep this going while cannibalizing how they make money is another question...
This involves having it identify all potential keywords and distinct entities, determine their approximate gender (important for languages with ambiguous gender pronouns), and then perform a line-by-line analysis of each chapter. For each line, it identifies the speaking entity, determines whose POV the line represents, and identifies the subject entity. While I didn't need or expect perfection, Gemini Flash 2.5 was the only model I tested that could not only follow all these instructions, but follow them well. The cheap price was a bonus.
I was thoroughly impressed; it's now my go-to for any JSON-formatted analysis reports.
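For a sense of the output shape, a per-line record in such a report might look something like this (the field names and values here are purely illustrative, not the exact schema):

    # Hypothetical shape of the JSON analysis report (illustrative fields only).
    entity_record = {
        "name": "Mara",
        "aliases": ["the sergeant"],
        "approx_gender": "female",     # helps disambiguate gendered pronouns in translation
    }

    line_record = {
        "chapter": 3,
        "line": 42,
        "text": "She handed him the letter without a word.",
        "speaker": None,               # entity speaking on this line, if any
        "pov_entity": "Mara",          # whose point of view the line represents
        "subject_entity": "Mara",      # the entity the line is primarily about
    }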
The first chart implies the gains are minimal for non-thinking models.
- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition
- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much)
- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat
- not-quite-deterministic GPU acceleration
Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and with no additions to the benchmark prompt except necessary formatting and things like end-of-turn tokens. They're also usually multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
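To make the temperature point concrete, here's a toy sketch of next-token selection (made-up logits, not tied to any real model or API): greedy decoding at temperature zero is fully deterministic, while any nonzero temperature samples from a flattened distribution.

    import numpy as np

    def sample_next_token(logits, temperature, rng=None):
        """Toy next-token selection: temperature 0 is greedy (deterministic);
        higher temperatures flatten the distribution and add randomness."""
        rng = rng or np.random.default_rng()
        logits = np.asarray(logits, dtype=float)
        if temperature == 0:
            return int(np.argmax(logits))            # always take the most likely token
        scaled = logits / temperature                # higher T -> flatter distribution
        probs = np.exp(scaled - scaled.max())        # softmax, shifted for numerical stability
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))  # random draw -> run-to-run variance

    toy_logits = [2.0, 1.5, 0.1]
    print(sample_next_token(toy_logits, temperature=0))    # same token every run
    print(sample_next_token(toy_logits, temperature=1.0))  # varies between runs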
Of course a benchmark still can't tell you everything - real-world performance can be very different.
From OpenRouter last week:
* xAI: Grok Code Fast 1: 1.15T
* Anthropic: Claude Sonnet 4: 586B
* Google: Gemini 2.5 Flash: 325B
* Sonoma Sky Alpha: 227B
* Google: Gemini 2.0 Flash: 187B
* DeepSeek: DeepSeek V3.1 (free): 180B
* xAI: Grok 4 Fast (free): 158B
* OpenAI: GPT-4.1 Mini: 157B
* DeepSeek: DeepSeek V3 0324: 142B
People are lazy about pointing to the latest model name.
I would rather use a model that is good than a model that is free, but different people have different priorities.
Y'know, with all these latest models the lines are kinda blurry actually; the definition of "good" is getting foggy.
So it might as well be free, since the definition of money is clear as crystal.
I also used it for a while on something really, really niche (building a Telegram bot on Cloudflare Workers), and grok-4-fast was kinda decent at that for the most part. So that's nice.
Also cheap enough to not really matter.
A bad model with good automated tooling and prompts will beat a good model without them, and if your goal is to build good tooling and prompts you need a tighter iteration loop.
Both apps have offered usage for free for a limited time:
https://blog.kilocode.ai/p/grok-code-fast-get-this-frontier-...
For all I know there are a couple of enormous whales on there who, should they decide to switch from one model to another, will instantly impact those overall ratings.
I'd love to have a bit more transparency about volume so I can tell if that's what is happening or not.
A "weekly active API Keys" faceted by models/app would be a useful data point to measure real-world popularity though.
gemini-2.5-flash-preview-09-2025 - what are they thinking?
I thought about joking that they had AI name it for them, but when I asked Gemini, it said that this name was confusing, redundant, and leads to unnecessarily high cognitive load.
Maybe Googlers should learn from their own models.
Something that distinguishes between a completely new pre-training process/architecture and standard RLHF cycles/optimizations.
Flash is super fast, gets straight to the point.
Pro takes ages to even respond, then starts yapping endlessly, usually confuses itself in the process and ends up with a wrong answer.
On the other hand, I do prefer using Claude 4 Sonnet on very open-ended agentic programming tasks because it seems to have better integration with VS Code Copilot. Gemini 2.5 Pro bugs out much more often, whereas Claude works fine almost every time.
It's a delicate balance, because these Gemini models sometimes feel downright lobotomized compared to Claude or GPT-5.
It's bad at agentic stuff, especially coding; it doesn't compare to Claude and now GPT-5. But if it's just about asking it random stuff, and especially going on for very long in the same conversation (which non-tech users have a tendency to do), Gemini wins. It's still the best at long context, noticing things said long ago.
Earlier this week I was doing some debugging. For debugging especially I like to run Sonnet/GPT-5/2.5 Pro in parallel with the same prompt/convo. Gemini was the only one that, 4 or so messages in, pointed out something very relevant buried in the logs from the very first message. GPT and Sonnet both failed to notice, leading them to give wrong sample code. I would've wasted more time if I hadn't used Gemini.
It's also still the best at a good number of low-resource languages. It doesn't glaze too much (like Sonnet and ChatGPT) while also not being overly stubborn (like the raw GPT-5 API). It's by far the best at OCR and image recognition, which a lot of average users rely on quite a bit.
Google's ridiculously bad at marketing and AI UX, but they'll get there. They're already much more than just a "bang for the buck" player.
FWIW I use all 3 above mentioned on a daily basis for a wide variety of tasks, often side-by-side in parallel to compare performance.
My understanding is that Gemini is not far behind on "intelligence", certainly not in a way that casts obvious doubt on where they will be over the next few model cycles, where I would expect them to at least keep closing the gap. I'd be curious if you have benchmarks to share that suggest otherwise.
Meanwhile, something Google has done that other providers aren't doing as much (afaik), and which perhaps relates back to your point re "latency/TPS/cost dimensions", is integrating their model into interesting products beyond chat, at a pace that seems surprising given how much criticism they had been taking for being "slow" to react to the LLM trend.
Besides the Google Workspace surface and Google Search, which now seem obvious, there are other interesting places where Gemini will surface: https://jules.google/ for one, to say nothing of their experiments/betas in the creative space (https://labs.google/flow/about).
Another I noticed today: https://www.google.com/finance/beta
I would have thought putting Gemini on a finance dashboard like this would be inviting all sorts of regulatory (and other) scrutiny... and wouldn't be in keeping with a "slow" incumbent. But given the current climate, it seems Google is plowing ahead just as much as anyone else - with a lot more resources and surface to bring to bear. Imagine Gemini integration on Youtube. At this point it just seems like counting down the days...
Can I get the sources for your rumour please? (Yes, I know I can search for it, but I would honestly prefer it if you could share them; thanks in advance!)
To be honest, I hadn't heard that elsewhere, but I haven't been following it massively this week.
I AM LAUGHING SO HARD RIGHT NOWWWWW
LMAOOOO
I wish to upvote this twice lol
Same way that OpenAI updated their 4o models and the like, which didn't turn out so well when it started glazing everyone and they had to revert it (maybe that was just ChatGPT and not the API).
Anthropic kind of did the same thing [1] except it back-fired recently with the cries of "nerfing".
We buy these tokens, which is very hard to do at the limited tiers; they expire after only a year; and we don't even know how often the responses are changing in the background. I would want even a 1% improvement or regression disclosed.
It's a really scary foundation AI companies are building on, IMO. Transparency and access are important.
I've been running into it consistently: responses that just stop mid-sentence, not because of token limits or content filters, but because of what appears to be a bug in how the model signals completion. It's been documented on their GitHub and dev forums for months as a P2 issue.
The frustrating part is that when you compare a complete Gemini response to Claude or GPT-4, the quality is often quite good. But reliability matters more than peak performance. I'd rather work with a model that consistently delivers complete (if slightly less brilliant) responses than one that gives me half-thoughts I have to constantly prompt to continue.
It's a shame because Google clearly has the underlying tech. But until they fix these basic conversation flow issues, Gemini will keep feeling broken compared to the competition, regardless of how it performs on benchmarks.
https://github.com/googleapis/js-genai/issues/707
https://discuss.ai.google.dev/t/gemini-2-5-pro-incomplete-re...
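Until it's fixed, one crude workaround is to detect an apparently truncated reply and ask the model to continue. The sketch below is just an illustration of that idea; generate is a stand-in for whatever client call you actually make, and the truncation heuristic is obviously imperfect.

    def looks_truncated(text: str) -> bool:
        # Crude heuristic: a reply that doesn't end in terminal punctuation
        # (or a closing code fence) was probably cut off mid-sentence.
        text = text.rstrip()
        return not text.endswith((".", "!", "?", '"', ")", "```"))

    def generate_with_continuation(generate, prompt: str, max_rounds: int = 3) -> str:
        # `generate` is a placeholder for your real model call (prompt -> text).
        reply = generate(prompt)
        parts = [reply]
        while looks_truncated(reply) and max_rounds > 0:
            reply = generate("Continue exactly where you left off:\n" + "".join(parts))
            parts.append(reply)
            max_rounds -= 1
        return "".join(parts)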
1. Using the "Projects" thing (Folder organization) makes my browser tab (on Firefox) become unusably slow after a while. I'm basically forced to use the default chats organization, even though I would like to organize my chats in folders.
2. After editing a message that you already sent, you get to select between the different branches of the chat (1/2, and so on), which is cool, but when ChatGPT fails to generate a response in this "branched conversation" context, it will continue failing forever. When your conversation is a single thread and a ChatGPT message fails with an error, retrying usually works and the chat continues normally.
Typo in the first sentence? "... improving the efficiency." Gemini 2.5 Pro says this is perfectly good phrasing, whereas ChatGPT and Claude recognize that it's awkward or just incorrect. Hmm...
export LLM_GEMINI_KEY='...'
uvx --isolated --with llm-gemini llm -m gemini-flash-lite-latest 'An epic poem about frogs at war with ducks'
Release notes: https://github.com/simonw/llm-gemini/releases/tag/0.26
Pelicans: https://github.com/simonw/llm-gemini/issues/104#issuecomment-...
This industry desperately needs a Steve Jobs to bring some sanity to the marketing.
Anthropic learned this lesson. Google, Deepseek, Kimi, OpenAI and others keep repeating it. This feels like Gemini_2.5_final_FINAL_FINAL_v2.
Could there theoretically be something like a semver that could be autogenerated from that defined, regular version scheme you shared?
Like, honestly my idea is that I could use something like OpenRouter and just change the semver, without having to worry about all the many parts of the schema you shared, y'know?
A website/tool that could convert this defined scheme to a semver and vice versa would be really cool actually :>
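As a toy illustration of that mapping (the naming convention assumed here is a guess, and the resulting version string is made up):

    import re

    def model_string_to_semver(model: str) -> str:
        """Hypothetical mapping from a dated model string to a semver-ish version.
        The pattern below is a guess at the convention, not anything Google documents."""
        m = re.fullmatch(
            r"(?P<family>[a-z]+)-(?P<major>\d+)\.(?P<minor>\d+)-(?P<variant>[a-z-]+)-(?P<month>\d{2})-(?P<year>\d{4})",
            model,
        )
        if not m:
            raise ValueError(f"unrecognized model string: {model!r}")
        pre = m["variant"].replace("-", ".")   # e.g. 'flash-preview' -> 'flash.preview'
        return f"{m['major']}.{m['minor']}.0-{pre}+{m['year']}.{m['month']}"

    print(model_string_to_semver("gemini-2.5-flash-preview-09-2025"))
    # -> 2.5.0-flash.preview+2025.09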