Both models score higher on the Artificial Analysis intelligence index with lower end-to-end response time, plus 24% to 50% better output token efficiency (which translates to lower cost).
Gemini 2.5 Flash-Lite improvements include better instruction following, reduced verbosity, and stronger multimodal & translation capabilities. Gemini 2.5 Flash improvements include better agentic tool use and more token-efficient reasoning.
Model strings: gemini-2.5-flash-lite-preview-09-2025 and gemini-2.5-flash-preview-09-2025
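If you want to poke at the new strings right away, here's a minimal sketch assuming the google-genai Python SDK and an API key already set in the environment (the prompt is just a placeholder):

from google import genai

# Minimal sketch, assuming GEMINI_API_KEY (or GOOGLE_API_KEY) is set in the environment.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-09-2025",  # or "gemini-2.5-flash-lite-preview-09-2025"
    contents="Summarize the September 2025 Gemini 2.5 Flash changes in two sentences.",
)
print(response.text)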
How long Google can keep this going while cannibalizing how they make money is another question...
This involves having it identify all potential keywords and distinct entities, determine their approximate gender (important for languages with ambiguous gender pronouns), and then perform a line-by-line analysis of each chapter. For each line, it identifies the speaking entity, determines whose POV the line represents, and identifies the subject entity. While I didn't need or expect perfection, Gemini Flash 2.5 was the only model I tested that could not only follow all these instructions, but follow them well. The cheap price was a bonus.
I was thoroughly impressed; it's now my go-to for any JSON-formatted analysis reports.
Disclaimer: I recently joined this team. But I like the product!
The first chart implies the gains are minimal for nonthinking models.
Which is a good thing in my book, as the models are now way too verbose (and I suspect one of the reasons is billing by tokens).
- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition
- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much)
- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat
- not-quite-deterministic GPU acceleration
Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and with no additions to the benchmark prompt except necessary formatting and things like end-of-turn tokens. They're also usually multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
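To make that concrete, here's a minimal sketch of benchmark-style greedy decoding with Hugging Face transformers (the model name and prompt are placeholders): do_sample=False is the temperature-zero setting, and loading the default weights means no quantization.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # full precision, no quantization

# Bare prompt: no system prompt, only the formatting the benchmark itself requires.
prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: always take the most likely next token (equivalent to temperature 0).
output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))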
Of course a benchmark still can't tell you everything - real-world performance can be very different.
Though I imagine this should be a smaller effect than, say, different quantization levels.
[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
From OpenRouter last week:
* xAI: Grok Code Fast 1: 1.15T
* Anthropic: Claude Sonnet 4: 586B
* Google: Gemini 2.5 Flash: 325B
* Sonoma Sky Alpha: 227B
* Google: Gemini 2.0 Flash: 187B
* DeepSeek: DeepSeek V3.1 (free): 180B
* xAI: Grok 4 Fast (free): 158B
* OpenAI: GPT-4.1 Mini: 157B
* DeepSeek: DeepSeek V3 0324: 142B
People are lazy about pointing to the latest name.
I would rather use a model that is good than a model that is free, but different people have different priorities.
Y'know, with all these latest models the lines are kinda blurry actually. The definition of "good" is getting foggy.
So it might as well be free as the definition of money is clear as crystal.
I also used it for some time to test something really, really niche, like building a Telegram bot in Cloudflare Workers, and grok-4-fast was kinda decent at that for the most part actually. So that's nice.
Also cheap enough to not really matter.
A bad model with good automated tooling and prompts will beat a good model without them, and if your goal is to build good tooling and prompts you need a tighter iteration loop.
Both apps have offered usage for free for a limited time:
https://blog.kilocode.ai/p/grok-code-fast-get-this-frontier-...
If xAI in particular is in the mood to light cash on fire promoting their new model, you'll see it everywhere during the promo period, so not surprised that heavily boosts xAI stats. The mystery codename models of the week are a bit easier to miss.
It might not be OK for that kind of use case, or might breach the ToS.
But it's still great. Even my premium Perplexity account doesn't give me free API access.
For all I know there are a couple of enormous whales on there who, should they decide to switch from one model to another, will instantly impact those overall ratings.
I'd love to have a bit more transparency about volume so I can tell if that's what is happening or not.
A "weekly active API keys" metric, faceted by model/app, would be a useful data point for measuring real-world popularity, though.
2.5 is probably the best balance for tools like Aider.
gemini-2.5-flash-preview-09-2025 - what are they thinking?
I thought about joking that they had AI name it for them, but when I asked Gemini, it said that this name was confusing, redundant, and leads to unnecessarily high cognitive load.
Maybe Googlers should learn from their own models.
Something that distinguishes between a completely new pre-training process/architecture, and standard RLHF cycles/optimizations.
Flash is super fast, gets straight to the point.
Pro takes ages to even respond, then starts yapping endlessly, usually confuses itself in the process and ends up with a wrong answer.
On the other hand, I do prefer using Claude 4 Sonnet on very open-ended agentic programming tasks because it seems to have a better integration with VSCode Copilot. Gemini 2.5 Pro bugs out much more often where Claude works fine almost every time.
Also, 2.5 Pro is often incapable of searching and will hallucinate instead. I don't know why. It will claim it searched and then return some made-up results instead. 2.5 Flash is much more consistently capable of searching.
It's a delicate balance, because these Gemini models sometimes feel downright lobotomized compared to claude or gpt-5.
It's bad at agentic stuff, especially coding. Incomparably so compared to Claude and now GPT-5. But if it's just about asking it random stuff, and especially going on for very long in the same conversation - which non-tech users have a tendency to do - Gemini wins. It's still the best at long context, noticing things said long ago.
Earlier this week I was doing some debugging. For debugging especially I like to run sonnet/gpt5/2.5-pro in parallel with the same prompt/convo. Gemini was the only one that, 4 or so messages in, pointed out something very relevant in the middle of the logs in the very first message. GPT and Sonnet both failed to notice, leading them to give wrong sample code. I would've wasted more time if I hadn't used Gemini.
It's also still the best at a good number of low-resource languages. It doesn't glaze too much (Sonnet, ChatGPT) without being overly stubborn (raw GPT-5 API). It's by far the best at OCR and image recognition, which a lot of average users use quite a bit.
Google's ridiculously bad at marketing and AI UX, but they'll get there. They're already much more than just a "bang for the buck" player.
FWIW I use all 3 above mentioned on a daily basis for a wide variety of tasks, often side-by-side in parallel to compare performance.
===============================
Got it — *compliment on the info you've shared*, *informal summary of task*. *Another compliment*, but *downside of question*.
----------
(relevant emoji) Bla bla bla
1. Aspect 1
2. Aspect 2
----------
*Actual answer*
-----------
(checkmark emoji) *Reassuring you about its answer because:*
* Summary point 1
* Summary point 2
* Summary point 3
Would you like me to *verb* a ready-made *noun* that will *something that's helpful to you 40% of the time*?
===============================
It's gotta reduce the quality of the answers.

People have said it destroys the intelligence mid-convo.
Same as social media converging to rage bait. The user base LIKES it subconsciously. Nobody at the companies explicitly added that to content recommendation model training. I know, for the latter, as I was there.
Just on the video link alone Gemini is making money on the free tier by pointing the hapless user at an ad while the other LLMs make zilch off the free tier.
Additionally, despite having "grounding with google search" it tends to default to old knowledge. I usually have to inform it that it's presently 2025. Even after searching and confirming, it'll respond with something along the lines of "in this hypothetical timeline" as if I just gaslit it.
Consider this conversation I just had with all three of Claude, Gemini, and GPT-5.
<ask them to consider DDR6 vs M3 Ultra memory bandwidth>
-- follow up --
User: "Would this enable CPU inference or not? I'm trying to understand if something like a high-end Intel chip or a Ryzen with built in GPU units could theoretically leverage this memory bandwidth to perform CPU inference. Think carefully about how this might operate in reality."
<Intro for all 3 models below - no custom instructions>
GPT-5: "Short answer: more memory bandwidth absolutely helps CPU inference, but it does not magically make a central processing unit (CPU) “good at” large-model inference on its own."
Claude: "This is a fascinating question that gets to the heart of memory bandwidth limitations in AI inference. "
Gemini 2.5 Pro: "Of course. This is a fantastic and highly relevant question that gets to the heart of future PC architecture."
My understanding is Gemini is not far behind on "intelligence", certainly not in a way that leaves obvious doubt over where they will be over the next iteration/model cycles, where I would expect them to at least continue closing the gap. I'd be curious if you have some benchmarks to share that suggest otherwise.
Meanwhile, afaik, something Google has done that other providers aren't doing as much - and this perhaps relates back to your point re "latency/TPS/cost dimensions" - is integrating their model into interesting products beyond chat, at a pace that seems surprising given how much criticism they had been taking for being "slow" to react to the LLM trend.
Besides the Google Workspace surface and Google search, which now seem obvious - there are other interesting places where Gemini will surface - https://jules.google/ for one, to say nothing of their experiments/betas in the creative space - https://labs.google/flow/about
Another I noticed today: https://www.google.com/finance/beta
I would have thought putting Gemini on a finance dashboard like this would be inviting all sorts of regulatory (and other) scrutiny... and wouldn't be in keeping with a "slow" incumbent. But given the current climate, it seems Google is plowing ahead just as much as anyone else - with a lot more resources and surface to bring to bear. Imagine Gemini integration on Youtube. At this point it just seems like counting down the days...
2025-09-26T14:32:10Z
2025-09-26T14:32:10Z200s
2025-09-26T14:32:10Z200s600s
2025-09-26T14:32:10Z200s600s300s
It then proceeded to talk about how efficient this approach was for thousands of numbers.

Gemini is by far the dumbest LLM I've used.
It gave me a 160 line parse function.
After gaping for a short while, I implemented it in a 5 line function and a lookup table.
These vibe coders who are proud that they generated thousands of lines of code make me wonder whether they ever read what they generate with a critical eye.
text = re.sub(r'(\*|_)(.+?)\1', replace_italic, text, flags=re.DOTALL)
The `replace_italic` is a one-line callback function that surrounds the re's match with the ANSI codes.

Knowing what technique is "best" and telling the LLM to use it produces better results (on average) than giving the LLM freedom to choose. For some problems, the specification of the prompt needed to get good output becomes more work than just thinking and writing it myself.
For very complex things, I myself can not put the design into English in my own head but can "see" the correct answer as code concepts. I don't know if this is universal for all developers. If it is, it shows a limit of LLM's usefulness.
The vibe coders (who I referred to in my comment) aren't giving implementation tips.
What did it give you before you put an implementation tip into your prompt?
=======
FWIW, if you're at all interested, here's my implementation:
def markdown_ansi_code_subst(mdstr: str, src_pattern: str, replacement_start: str, replacement_end: str) -> str:
    # Replace markers pairwise: the first occurrence becomes the start code,
    # the second the end code, and repeat until no markers remain.
    while src_pattern in mdstr:
        mdstr = mdstr.replace(src_pattern, replacement_start, 1)
        mdstr = mdstr.replace(src_pattern, replacement_end, 1)
    return mdstr
The caller supplies the pattern (`*` for italic, `**` for bold, etc.) and a start/end replacement. As you can imagine, I store all of that in a static lookup table. I feel this is more readable than regexes.
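For illustration, here's a rough sketch of how that static lookup table and caller might look - the names and marker/ANSI pairs are my assumptions, not the commenter's actual code:

# Hypothetical lookup table; longer markers come first so "**" is consumed before "*".
ANSI_RESET = "\033[0m"
MARKDOWN_ANSI_TABLE = [
    ("**", "\033[1m", ANSI_RESET),  # bold
    ("__", "\033[1m", ANSI_RESET),  # bold (alternate marker)
    ("*", "\033[3m", ANSI_RESET),   # italic
    ("_", "\033[3m", ANSI_RESET),   # italic (alternate marker)
]

def render_markdown_ansi(text: str) -> str:
    # Apply each pairwise substitution in table order.
    for pattern, start, end in MARKDOWN_ANSI_TABLE:
        text = markdown_ansi_code_subst(text, pattern, start, end)
    return text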
> Give me a Python function that takes a string holding text in Markdown markup syntax and that uses regular expressions to replace any Markdown markup codes for bold, italics and underline with their ANSI equivalent.
BTW, your solution will produce bad output. Markdown's "bold" etc markup comes in pairs of markers and your simple replacement will match singlets.
In a separate chat, I asked it to suggest some ways to do it first, and it provided three alternatives but suggested I start with regex.
FWIW I used Open WebUI, which uses the API (via OpenRouter). I've seen people here mentioning that the API produces much better results than the chat app, for whatever reason.
For reference, here's the prompt and the zero-shot result, minus the rigorous comments it added:
I need a Python function that will take in markdown in a string and return a string with ansi codes for bold, italics and underline. Can you write me such a function?
import re

def markdown_to_ansi(markdown_string: str) -> str:
    ansi_string = markdown_string
    ANSI_BOLD = "\033[1m"
    ANSI_ITALIC = "\033[3m"
    ANSI_UNDERLINE = "\033[4m"
    ANSI_RESET = "\033[0m"
    ansi_string = re.sub(r'\*\*(.*?)\*\*', f'{ANSI_BOLD}\\1{ANSI_RESET}', ansi_string)
    ansi_string = re.sub(r'__(.*?)__', f'{ANSI_BOLD}\\1{ANSI_RESET}', ansi_string)
    ansi_string = re.sub(r'\*(.*?)\*', f'{ANSI_ITALIC}\\1{ANSI_RESET}', ansi_string)
    ansi_string = re.sub(r'\_(.*?)\_', f'{ANSI_ITALIC}\\1{ANSI_RESET}', ansi_string)
    ansi_string = re.sub(r'\+\+(.*?)\+\+', f'{ANSI_UNDERLINE}\\1{ANSI_RESET}', ansi_string)
    # A more robust solution for nesting would require a parsing library or a more complex state machine.
    # However, for simple cases, applying in order can layer them correctly.
    return ansi_string

Yes, this is a very effective tactic, in my experience! Especially when I am asking for a solution where I am not confident I know what is "best". Having a "pre chat" to settle "what to do" and then "how to do it" before finally telling the LLM to "do it" is often worth the extra time for getting it to provide a solution for complex problems.
I'm using Gemini (2.5 Pro) less and less these days. I used to be really impressed with its deep research capabilities and ability to cite sources reliably.
The last few weeks, it's increasingly argumentative and incapable of recognizing hallucinations around sourcing. I'm tired of arguing with it on basics like RFCs and sources it fabricates, won't validate, and refuses to budge on.
Example prompt I was arguing with it on last night:
> within a github actions workflow, is it possible to get access to the entire secrets map, or enumerate keys in this object?
As recent supply-chain attacks have shown, exfiltrating all the secrets from a GitHub workflow is as simple as `${{ toJSON(secrets) }}`, or `echo ${{ toJSON(secrets) }} | base64` at worst. [1]
Give this prompt a shot! Gemini won't do anything except be obstinately ignorant. With me, it provided a test case workflow, and refused to believe the results. When challenged, expect it to cite unrelated community posts. Chatgpt had no problem with it.
[1] https://github.com/orgs/community/discussions/174045 https://github.com/orgs/community/discussions/47165
With this example, several attempts resulted in the same thing: Gemini expressing a strong belief that GitHub has a security capability which it really doesn't have.
If someone is able to get Gemini to give an accurate answer to this with a similar question, I'd be very curious to hear what it is.
Gemini is notoriously bad at multi-turn instruction following, so this holds strongly for it. Less so for Claude Opus 4 or GPT-5.
today: "before you marry someone, put the person in front of a slow AI model"
;-)
can I get the sources of your rumour please? (Yes I know that I can search it but I would honestly prefer it if you could share it, thanks in advance!)
To be honest, I hadn't heard that elsewhere, but I haven't been following it massively this week.
I AM LAUGHING SO HARD RIGHT NOWWWWW
LMAOOOO
I wish to upvote this twice lol
Same way that OpenAI updated their 4o models and the like, which didn't turn out so well when it started glazing everyone and they had to revert it (maybe that was just chat and not the API).
Anthropic kind of did the same thing [1] except it back-fired recently with the cries of "nerfing".
We buy these tokens, which are already hard to buy in the limited tiers, they expire after only a year, and we don't even know how often the responses are changing in the background. Even a 1% improvement or reduction is something I would want disclosed.
Really scary foundation AI companies are building on IMO. Transparency and access is important.
They just don't want to be pinned down because the shifting sands are useful for the time when the LLM starts to get injected with ads or paid influence.
I've been running into it consistently, responses that just stop mid-sentence, not because of token limits or content filters, but what appears to be a bug in how the model signals completion. It's been documented on their GitHub and dev forums for months as a P2 issue.
The frustrating part is that when you compare a complete Gemini response to Claude or GPT-4, the quality is often quite good. But reliability matters more than peak performance. I'd rather work with a model that consistently delivers complete (if slightly less brilliant) responses than one that gives me half-thoughts I have to constantly prompt to continue.
It's a shame because Google clearly has the underlying tech. But until they fix these basic conversation flow issues, Gemini will keep feeling broken compared to the competition, regardless of how it performs on benchmarks.
https://github.com/googleapis/js-genai/issues/707
https://discuss.ai.google.dev/t/gemini-2-5-pro-incomplete-re...
1. Using the "Projects" thing (Folder organization) makes my browser tab (on Firefox) become unusably slow after a while. I'm basically forced to use the default chats organization, even though I would like to organize my chats in folders.
2. After editing a message that you already sent, you get to select between the different branches of the chat (1/2, and so on), which is cool, but when ChatGPT fails to generate a response in this "branched conversation" context, it will continue failing forever. When your conversation is a single thread and a ChatGPT message fails with an error, retrying usually works and the chat continues normally.
On mobile (Android), opening the keyboard scrolls the chat to the bottom! I sometimes want to type while referring to something from the middle of the LLM's last answer.
Of course it could all be placebo, but when you think about it intuitively, somewhere on the road to the hundreds of billions in datacenter capex, one would think there will be periods where compute and demand are out of sync. It's also perfectly understandable why now would be a time to be seeing that.
It’s so annoying that you have this super capable model but you interact with it using an app that is complete ass
Ask ChatGPT to output markdown or a PDF in the iOS or Mac apps versus the web experience. The web is often better - the apps will return nothing.
(Disclosure: I'm the founder of Synthetic.new, a company that runs open-source LLMs for monthly subscriptions.)
If you want to use application/json as the specified output in the request, you can’t use tools
So if you need both, you either hope it gives you correct JSON when using tools (which many times it doesn't), or you have to do two requests: one for the tool calling, another for formatting.
At least, even if annoying, this issue is pretty straightforward to get around
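For what it's worth, here's a rough sketch of that two-request workaround, assuming the google-genai Python SDK and its built-in Google Search tool (model name, prompts, and JSON keys are placeholders):

from google import genai
from google.genai import types

client = genai.Client()  # assumes an API key in the environment

# Request 1: let the model use its built-in tools; JSON output can't be forced here.
tool_response = client.models.generate_content(
    model="gemini-2.5-flash-preview-09-2025",
    contents="Look up the current Gemini 2.5 Flash pricing.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

# Request 2: no tools, so application/json output is allowed; reformat the first answer.
json_response = client.models.generate_content(
    model="gemini-2.5-flash-preview-09-2025",
    contents="Convert this answer into JSON with keys 'model' and 'price': "
             + tool_response.text,
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)
print(json_response.text)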
And wanting to programmatically work with the result + allow tool calls is super common.
I can ask Gemini to give me the PDF's content as JSON and it complies most of the time. But at times there's an introductory line like "Here's your JSON:". Those introductory lines interfere with using the output programmatically. They're sometimes there, sometimes not.
If I could have structured output at the same time as tool use, I could reliably use what Gemini spits out, since it would be JSON with no annoying intro lines.
It’s a bit of a hack but maybe that reliably works here?
The issue here is that Gemini has support for some internal tools (like search and web scraping), and when you ask the model to use those, you can’t also ask it to use application/json as the output (which you normally can when not using tools)
Not a huge issue, just annoying
I’ve seen that behavior when LLMs of any make or model aren’t given enough time or allowed enough tokens.
Gemini 2.5 Pro is _amazing_ for software architecture, but I just get tired of poking it along. Sonnet does well enough.
Typo in the first sentence? "... improving the efficiency." Gemini 2.5 Pro says this is perfectly good phrasing, whereas ChatGPT and Claude recognize that it's awkward or just incorrect. Hmm...
“deliver better quality while also improving the efficiency.”
Reads fine to me. An editor would likely drop “the”.
export LLM_GEMINI_KEY='...'
uvx --isolated --with llm-gemini llm -m gemini-flash-lite-latest 'An epic poem about frogs at war with ducks'
Release notes: https://github.com/simonw/llm-gemini/releases/tag/0.26
Pelicans: https://github.com/simonw/llm-gemini/issues/104#issuecomment...
But looking at these images, Google clearly hasn’t done that yet.
I don't think it would be worth it though, it would be pretty obvious you had cheated on my benchmark when it drew a perfect pelican riding a bicycle and then failed at a flamingo on a unicycle.
This industry desperately needs a Steve Jobs to bring some sanity to the marketing.
I actually even agree that the progress is plateauing, but your comment is a non-sequitur.
And I say this because I added about 50 prompts in the settings to prevent video recommendations and to remove any links to videos, but I still get text saying "the linked video explains this more" even though there is no linked video.
This is not a bad way to monetise the free tier. None of the other token providers have found any way to monetise the free tier, but Gemini is doing it on almost every prompt.
That's why Google names it like this, but I agree it's dumb. Semver would be easier.
For example, the latest Gemini 2.5 Flash is known as "google/gemini-2.5-flash-preview-09-2025" [1].
[1]: https://openrouter.ai/google/gemini-2.5-flash-preview-09-202...
This is also the case with OpenAI and their models. Pretty standard I guess.
They don't change the versioning, because I guess they don't consider it to be "a new model trained from scratch".
In all seriousness though, their version system is awful.
That "example" is the name used in the article under discussion. There's no need to link to openrouter.ai to find the name.
Semantic versioning works for most scenarios.
This was all solved a long time ago; LLM vendors seem to have unlearnt versioning principles.
This is fairly typical - marketing and business want different things from a version number than what versioning systems are good at.
This is the entire premise behind the cloud, and the reason it was Amazon that did it first: they had the largest workloads at the time, before Web 2.0 and SaaS were a thing.
Only businesses with large first-party apps succeeded in the cloud provider space; companies like HP and IBM all failed, and their time to failure strongly correlated with the number of first-party apps they operated. I.e., those apps needed to keep a lot of idle capacity for peak demand anyway, which they could now monetize and co-mingle in the cloud.
LLMs as a service are not any different from S3, launched nearly 20 years ago.
---
[1] It isn't. At the scale they are operating these models, it shouldn't matter at all; it is not individual GPUs or machines that make a difference in load handling. Only a few users are going to explicitly pin a specific patch version; for the rest, they can serve whichever one is available immediately or cheaply.
Model (input / output per 1M tokens):
Gemini 2.5 Flash Preview: $0.30 / $2.50
Grok 4 Fast: $0.20 / $0.50
It is HORRENDOUS when compared to other models.
I hear a bunch of other people talking about how great Gemini is, but I've never seen it.
The responses are usually either incorrect, way too long, (essays when I wanted summaries) or just...not...good. I will ask the exact same question to both Gemini and ChatGPT (free) and GPT will give a great answer while the Gemini answer is trash.
Am I missing something?
ChatGPT is better at:
A) Interpreting what I'm asking it for without me needing to provide additional explicit context.
B) Formatting answers in a way that is easily digestible.
I will also say whatever they use for the AI search summary is good enough for me like 50% of the time I google something, but those are generally the simpler 50% of queries.
My most recent trials output single commas as responses to basic questions, or it simply refuses the task on ethical grounds, such as generating a photo of a backpack wearing a hoodie for some reason (it claimed harmful stereotypes and instead generated an ape).
Refusing to do perfectly ethical tasks is probably the most consistent problem I've had.
I think the "baked in" Gemini models are different, try using Gemini through the actual Gemini site.
I would like to try a small computer->human "upload" experiment; basic multilingual understanding without pronunciation knowledge would be very sad.
I intend to make a sort of computer reflexive game. I want to compare different upload strategies (with/without analog or classic error-correcting codes, empirical spaced-repetition constants, an ML predictor of which parameters I'm forgetting / losing resolution on).
Here's a summary of this discussion with the new version: https://extraakt.com/extraakts/the-great-llm-versioning-deba...
It kept finding those fatal flaws and starting to explain them, only to slowly finish with "oh yes, this works as intended".
Would like to know whether Flash exhibits these issues as well.
The way I have come to perceive AI is that it's mostly good at reassuring/reaffirming people's beliefs and ideas rather than serving as an actual source of truth.
That would not be an issue if it was actually marketed as such, but seeing the "guided learning" function fail time and again makes me think we should be a lot more critical of what we're being told by tech enthusiasts/companies about AI.
At least for us, the bottleneck is the amount of retries/waiting needed to max out how many requests we can make in parallel.
[1] https://cloud.google.com/vertex-ai/generative-ai/docs/dynami...
However, it's hampered by max output tokens: Gemini is at 65K while GPT-5 mini is at 128K. Both of them have similar costs as well, so apart from the 1M context limit, GPT-5 mini is better in every way.
Anthropic learned this lesson. Google, Deepseek, Kimi, OpenAI and others keep repeating it. This feels like Gemini_2.5_final_FINAL_FINAL_v2.
Could there theoretically be something like a semver that can be autogenerated from that defined and regular version scheme that you shared?
Like, honestly, my idea is that I could use something like OpenRouter and then just change the semver without having to worry about these soooo many things in the schema that you shared, y'know?
A website / tool which can create a semver from this defined scheme and vice versa can be really cool actually :>
With that in mind, what exactly would semver (or similar) represent for AI models? Set up the proper way, your pipelines should continue working regardless of the model; it's just that the accuracy or some other metric might change slightly. But there should never be any "breakages" like what semver is supposed to help flag.
This thread is more about the minor number: not incrementing it when making changes to the internals is painful for dependency tracking. These changes will also break apps (prompts are often tuned to the model).