I find Flash and Flash Lite are more consistent than others as well as being really fast and cheap.
I could swap to other providers fairly easily, but don't intend to at this point. I don't operate at a large scale.
In general, when I need "cheap and fast" I choose Gemini.
I also believe Gemini is much better than ChatGPT at generating deep research reports. Google has an edge in web search and it shows: Gemini's reports draw on a vast number of sources and thus tend to be more accurate. I generally even prefer its writing style, and I like being able to export reports to Google Docs.
One thing that I don’t like about Gemini is its UI, which is miles behind the competition. Custom instructions, projects, temporary chats… these things either have no equivalent in Gemini or are underdeveloped.
What I like most about translating with Gemini is that its default performance is already good enough, and it can be improved further via the one-million-token context window. I load my private databases of idiomatic translations into the context, separated by language pair and subject area. After doing that, the need to manually review Gemini's translations is greatly diminished.
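To make that concrete, here is a minimal sketch of the setup, assuming the google-genai Python SDK; the file name, language pair, and model string are placeholders, not my actual databases:

    # Sketch only: file name, language pair, and model are illustrative placeholders.
    from google import genai

    client = genai.Client()  # reads the API key from the environment

    # One private glossary per language pair / subject area, loaded as reference
    # material into the long context.
    with open("glossary_en_pt_legal.txt", encoding="utf-8") as f:
        glossary = f.read()

    source_text = "Text to translate goes here."

    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=(
            "Reference glossary of approved idiomatic translations (EN->PT, legal):\n"
            + glossary
            + "\n\nTranslate the following text, preferring the glossary's phrasing "
            "wherever it applies:\n"
            + source_text
        ),
    )
    print(response.text)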
My favorite personal use of Gemini right now is basically as a book club. Of course it's not as good as my real one, but I often can't get them to read the books I want, and Gemini is always ready when I want to explore themes. It's often more profound than the book club too, and seems a bit less likely to tunnel-vision. Before LLMs I found exploring book themes pretty tedious: often I had to wait a while to find someone who had read the book, but now I can get into it as soon as I'm done reading.
But the long context eval they used (MRCR) is limited. It's multi-needle, so that's a start, but it's not evaluating long-range dependency resolution or topic modeling, which are the things you actually care about beyond raw retrieval for downstream tasks. Better than nothing, but not great for just throwing a pile of text at it and hoping for the best, particularly for out-of-distribution token sequences.
I do give Google some credit, though: they didn't try to hide how poorly they did on that eval. But there's a reason you don't see them adding RULER, HELMET, or LongProc to this. The performance is abysmal after ~32k.
EDIT: I still love using 2.5 Pro for a ton of different tasks. I just tend to have all my custom agents compress the context aggressively for any long context or long horizon tasks.
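Roughly, the compression step looks something like this (just a sketch: the chunk size, prompt, and model choice are illustrative, not my exact agent setup):

    # Sketch of aggressive context compression: summarize long material with a
    # cheap model, then hand only the digest to the long-horizon task.
    # Chunk size, prompt, and model name are illustrative assumptions.
    from google import genai

    client = genai.Client()

    def compress(text: str, chunk_chars: int = 20_000) -> str:
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        summaries = []
        for chunk in chunks:
            resp = client.models.generate_content(
                model="gemini-2.5-flash",  # cheap/fast model does the compression
                contents="Summarize the following, keeping names, numbers, and any "
                         "facts needed for downstream reasoning:\n\n" + chunk,
            )
            summaries.append(resp.text)
        return "\n\n".join(summaries)

    # The compressed digest, not the raw documents, goes into the main agent's context.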
Huh. We've not seen this in real-world use. 2.5 Pro has been the only model where you can throw a bunch of docs into it, give it a "template" document (report, proposal, etc.), even some other-project-example stuff, and tell it to gather all relevant context from each file and produce the "template", and it does surprisingly well. Couldn't reproduce this with any other top-tier model at this level of quality.
We have long context evals using internal data that we leverage for this (modeled after LongProc specifically), and performance across the board is pretty bad. Task-wise it's about as real-world as it gets for us, using production data: summarization, Q&A, coding, reasoning, etc.
But I think this is where the in-distribution vs out-of-distribution distinction really carries weight. If the model has seen more instances of your token sequences in training and thus has more stable semantic representations of them in latent space, it would make sense that it would perform better on average.
In my case, the public evals align very closely with performance on internal enterprise data. They both tank pretty hard. Notably, this is true for all models after a certain context cliff. The flagship frontier models predictably do the best.
I absolutely do swap out models sometimes, but Gemini 2.0 Flash is the right price/performance mix for me right now. I will test Gemini 2.5 Flash-Lite tomorrow though.
I also really like to use it for self-reflection, where I just input my thoughts and concerns and see what it has to say.
Somehow it's gotten worse since then, and I'm back to using Claude for serious work.
Gemini is like that guy who keeps talking but has no idea what he's actually talking about.
I still use Gemini for brainstorming, though I take its suggestions with several grains of salt. It's also useful for generating prompts that I can then refine and use with Claude.
Local models are mostly for hobby and privacy, not really efficiency.
ChatGPT is better but tends to be too agreeable, never disagreeing with what you say even when it's stupid, so you end up shooting yourself in the foot.
Claude seems like the best compromise.
Just my two kopecks.
Overall though my primary concern is the UX, and Claude Code is the UX of choice for me currently.
I use only the APIs directly with Aider (so no experience with AI Studio).
My feeling with Claude is that it still performs well with weak prompts; its "taste" is maybe a little better when the prompter doesn't quite know the direction.
When the direction is known, I see Gemini 2.5 Pro (with thinking) ahead of Claude, with code that does not break. With o4-mini and o3 I see more "smart" thinking (as if there were a little bit of brain inside these models), at the expense of less stable code (Gemini produces more stable code).
I see problems with Claude when complexity increases and I would put it behind Gemini and o3 in my personal ranking.
So far I had no reason to go back to Claude since o3-mini was released.
I was much more satisfied with o3 and Aider. I haven't tried them on this specific problem, but I did quite a bit of work on the same project with them last night. I think I'm being a bit unfair, because what Claude got stuck on seems to be a hard problem, but I don't like how it will happily consume all my money trying the same things over and over and never say "yeah, I give up".
My only complaint about 2.5 Pro is around the inane comments it leaves in the code (// Deleted varName here).
The same happened with GPT-3.5. It was so good early on and got worse as OpenAI began to cut costs. I feel like when GPT-4.1 was cloaked as Optimus on OpenRouter it was really good, but once it launched, it also got worse.
LLMs, on the other hand, operate under different incentives. It's in a company's best interest to release the strongest model initially, top the benchmarks, and then quietly degrade performance over time. Unlike traditional software, LLMs have low switching costs: users can easily jump to a better alternative. That makes it more tempting for companies to conceal model downgrades to prevent user churn.
Counterexample: 99% of average Joes have no idea how incredibly enshittified Google Maps has become, to just name one app. These companies intentionally boil the frog very slowly, and most people are incredibly bad at noticing gradual changes (see global warming).
Sure, they could tell by comparing, but you could also tell whether models are changing behind the scenes by maintaining sets of evals.
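Even a fixed "canary" set re-run on a schedule would surface a silent swap. A toy sketch (the prompts, model name, and threshold are made up):

    # Toy regression eval: fixed prompts with expected answers, re-run periodically.
    # A drop in pass rate against your historical baseline flags a possible change.
    from google import genai

    client = genai.Client()

    EVAL_SET = [
        {"prompt": "What is 17 * 23? Answer with the number only.", "expected": "391"},
        {"prompt": "Name the capital of Australia in one word.", "expected": "Canberra"},
    ]

    def run_evals(model: str = "gemini-2.5-pro") -> float:
        passed = 0
        for case in EVAL_SET:
            resp = client.models.generate_content(model=model, contents=case["prompt"])
            if case["expected"].lower() in resp.text.lower():
                passed += 1
        return passed / len(EVAL_SET)

    score = run_evals()
    if score < 0.9:  # illustrative threshold
        print(f"Possible model regression: pass rate {score:.0%}")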
We can tell it’s getting worse because of UI changes, slower load times, and more ads. The signs are visible.
With LLMs, it’s different. There are no clear cues when quality drops. If responses seem off, users often blame their own prompts. That makes it easier for companies to quietly lower performance.
That said, many of us on HN use LLMs mainly for coding, so we can tell when things get worse.
Both cases involve the “boiling frog” effect, but with LLMs, users can easily jump to another pot. With traditional software, switching is much harder.
I also have a personal conspiracy theory, i.e., that once a user exceeds a certain use threshold of 2.5 Pro in the Google Gemini app, they start serving a quantized version. Of course, I have no proof, but it certainly feels that way.
Although, given that I rapidly went from +4 to 0 karma, a few other comments in this topic are grey, and at least one is missing, I am getting suspicious. (Or maybe it is just lunch time in MTV.)
Claude 4 Sonnet with thinking just has a bit of a think and then does it.
Brokk (https://brokk.ai/) currently uses Flash 2.0 (non-Lite) for Quick Edits, we'll evaluate 2.5 Lite now.
ETA: I don't have a use case for a thinking model that is dumber than Flash 2.5, since thinking negates the big speed advantage of small models. Curious what other people use that for.
I tried a lot of free tools to refactor it, but they all lose the context window quickly.
What kind of rate limits do these new Gemini models have?
[edit] I'm less excited about this because it looks like their solution was to dramatically raise the base price on the non-thinking variant.
For 2.5 Flash Preview https://web.archive.org/web/20250616024644/https://ai.google...
$0.15/million input text / image / video
$1.00/million audio
Output: $0.60/million non-thinking, $3.50/million thinking
The new prices for Gemini 2.5 Flash ditch the difference between thinking and non-thinking and are now: https://ai.google.dev/gemini-api/docs/pricing
$0.30/million input text / image / video (2x more)
$1.00/million audio (same)
$2.50/million output - significantly more than the old non-thinking price, less than the old thinking price.
And Gemini 2.0 Flash was $0.10/$0.40.
Now 2.0 -> 2.5 is another hefty price increase.
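To put numbers on it, here is a back-of-the-envelope comparison for a hypothetical request with 10k text input tokens and 1k output tokens (the token counts are made up, and this ignores that thinking mode also bills its extra reasoning tokens as output):

    # Hypothetical request: 10k input tokens, 1k output tokens.
    in_tok, out_tok = 10_000, 1_000

    old_nonthinking = in_tok / 1e6 * 0.15 + out_tok / 1e6 * 0.60  # ~$0.0021
    old_thinking    = in_tok / 1e6 * 0.15 + out_tok / 1e6 * 3.50  # ~$0.0050
    new_unified     = in_tok / 1e6 * 0.30 + out_tok / 1e6 * 2.50  # ~$0.0055
    flash_2_0       = in_tok / 1e6 * 0.10 + out_tok / 1e6 * 0.40  # ~$0.0014

    # For an input-heavy call like this, the new price lands above even the old
    # thinking total, because the doubled input rate dominates.
    print(old_nonthinking, old_thinking, new_unified, flash_2_0)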
But why is there only thinking flash now?
As they've become legitimately competitive they have moved towards the pricing of their competitors.
https://developers.googleblog.com/en/gemini-2-5-thinking-mod...
How cute they are with their phrasing:
> $2.50 / 1M output tokens (*down from $3.50 output)
Which should be "up from $0.60 (non-thinking)/down from $3.50 (thinking)"
It's also relevant because people have been building things on top of this for the last three months.
Has anyone here had any luck working around this problem?
I have about 500,000 news articles I am parsing. OpenAI models work well, but I found Gemini made fewer mistakes.
Problem is, they give me a terrible 10k RPD (requests per day) limit. To increase to the next tier, they require a minimum amount of spending, but I can't reach that amount even when maxing out the RPD limit for multiple days in a row.
I emailed them twice and completed their forms, but everyone knows how that goes. So now I'm back at OpenAI, with a model that makes a few more mistakes but won't 403 me after half an hour of use because of their limits.
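One way to at least stretch the 10k RPD budget is to count Gemini requests and route the overflow (or anything that errors out) to OpenAI. A rough sketch, with placeholder model names and minimal error handling:

    # Sketch: spend the Gemini daily budget first, fall back to OpenAI afterwards.
    # Model names and client setup are placeholders; reset the counter when
    # Google's daily quota resets.
    from google import genai
    from google.genai import errors as genai_errors
    from openai import OpenAI

    gemini = genai.Client()
    oai = OpenAI()
    DAILY_BUDGET = 10_000
    requests_today = 0

    def parse_article(article: str) -> str:
        global requests_today
        prompt = "Extract the key facts from this news article:\n\n" + article
        if requests_today < DAILY_BUDGET:
            requests_today += 1
            try:
                resp = gemini.models.generate_content(
                    model="gemini-2.5-flash", contents=prompt
                )
                return resp.text
            except genai_errors.APIError:
                pass  # quota or transient error: fall through to OpenAI
        resp = oai.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content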
Gemini 2.5 Flash Lite (Audio Input) - $0.5/million tokens
Gemini 2.0 Flash Lite (Audio Input) - $0.075/million tokens
I wonder what led to such a big bump in audio token processing.
One odd disconnect that still exists in LLM pricing is that providers charge linearly with respect to token count, while the underlying compute cost grows roughly quadratically with sequence length.
At this point, since a lot of models have converged on the same architectures, inference algorithms, and hardware, the chosen prices are likely based on a historical, statistical analysis of the shape of customer requests. In other words, I'm not surprised to see prices increase as providers gather more data about real-world consumption patterns.
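A toy illustration of the shape of the two curves (constants ignored, numbers illustrative):

    # Self-attention does ~n^2 pairwise token interactions for an n-token prompt,
    # while the bill grows strictly linearly in n.
    def relative_attention_cost(n_tokens: int) -> int:
        return n_tokens ** 2

    def billed_amount(n_tokens: int, price_per_million: float = 0.30) -> float:
        return n_tokens / 1e6 * price_per_million

    base = 10_000
    for n in (10_000, 100_000, 1_000_000):
        print(n,
              relative_attention_cost(n) / relative_attention_cost(base),
              billed_amount(n) / billed_amount(base))
    # 10x the tokens -> ~100x the attention compute, but only 10x the bill.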