frontpage.

Interop 2025: A Year of Convergence

https://webkit.org/blog/17808/interop-2025-review/
1•ksec•7m ago•0 comments

JobArena – Human Intuition vs. Artificial Intelligence

https://www.jobarena.ai/
1•84634E1A607A•11m ago•0 comments

Concept Artists Say Generative AI References Only Make Their Jobs Harder

https://thisweekinvideogames.com/feature/concept-artists-in-games-say-generative-ai-references-on...
1•KittenInABox•15m ago•0 comments

Show HN: PaySentry – Open-source control plane for AI agent payments

https://github.com/mkmkkkkk/paysentry
1•mkyang•16m ago•0 comments

Show HN: Moli P2P – An ephemeral, serverless image gallery (Rust and WebRTC)

https://moli-green.is/
1•ShinyaKoyano•26m ago•0 comments

The Crumbling Workflow Moat: Aggregation Theory's Final Chapter

https://twitter.com/nicbstme/status/2019149771706102022
1•SubiculumCode•30m ago•0 comments

Pax Historia – User and AI powered gaming platform

https://www.ycombinator.com/launches/PMu-pax-historia-user-ai-powered-gaming-platform
2•Osiris30•31m ago•0 comments

Show HN: I built a RAG engine to search Singaporean laws

https://github.com/adityaprasad-sudo/Explore-Singapore
1•ambitious_potat•37m ago•0 comments

Scams, Fraud, and Fake Apps: How to Protect Your Money in a Mobile-First Economy

https://blog.afrowallet.co/en_GB/tiers-app/scams-fraud-and-fake-apps-in-africa
1•jonatask•37m ago•0 comments

Porting Doom to My WebAssembly VM

https://irreducible.io/blog/porting-doom-to-wasm/
1•irreducible•38m ago•0 comments

Cognitive Style and Visual Attention in Multimodal Museum Exhibitions

https://www.mdpi.com/2075-5309/15/16/2968
1•rbanffy•39m ago•0 comments

Full-Blown Cross-Assembler in a Bash Script

https://hackaday.com/2026/02/06/full-blown-cross-assembler-in-a-bash-script/
1•grajmanu•44m ago•0 comments

Logic Puzzles: Why the Liar Is the Helpful One

https://blog.szczepan.org/blog/knights-and-knaves/
1•wasabi991011•56m ago•0 comments

Optical Combs Help Radio Telescopes Work Together

https://hackaday.com/2026/02/03/optical-combs-help-radio-telescopes-work-together/
2•toomuchtodo•1h ago•1 comments

Show HN: Myanon – fast, deterministic MySQL dump anonymizer

https://github.com/ppomes/myanon
1•pierrepomes•1h ago•0 comments

The Tao of Programming

http://www.canonical.org/~kragen/tao-of-programming.html
2•alexjplant•1h ago•0 comments

Forcing Rust: How Big Tech Lobbied the Government into a Language Mandate

https://medium.com/@ognian.milanov/forcing-rust-how-big-tech-lobbied-the-government-into-a-langua...
3•akagusu•1h ago•0 comments

PanelBench: We evaluated Cursor's Visual Editor on 89 test cases. 43 fail

https://www.tryinspector.com/blog/code-first-design-tools
2•quentinrl•1h ago•2 comments

Can You Draw Every Flag in PowerPoint? (Part 2) [video]

https://www.youtube.com/watch?v=BztF7MODsKI
1•fgclue•1h ago•0 comments

Show HN: MCP-baepsae – MCP server for iOS Simulator automation

https://github.com/oozoofrog/mcp-baepsae
1•oozoofrog•1h ago•0 comments

Make Trust Irrelevant: A Gamer's Take on Agentic AI Safety

https://github.com/Deso-PK/make-trust-irrelevant
7•DesoPK•1h ago•4 comments

Show HN: Sem – Semantic diffs and patches for Git

https://ataraxy-labs.github.io/sem/
1•rs545837•1h ago•1 comments

Hello world does not compile

https://github.com/anthropics/claudes-c-compiler/issues/1
35•mfiguiere•1h ago•20 comments

Show HN: ZigZag – A Bubble Tea-Inspired TUI Framework for Zig

https://github.com/meszmate/zigzag
3•meszmate•1h ago•0 comments

Metaphor+Metonymy: "To love that well which thou must leave ere long"(Sonnet73)

https://www.huckgutman.com/blog-1/shakespeare-sonnet-73
1•gsf_emergency_6•1h ago•0 comments

Show HN: Django N+1 Queries Checker

https://github.com/richardhapb/django-check
1•richardhapb•1h ago•1 comments

Emacs-tramp-RPC: High-performance TRAMP back end using JSON-RPC instead of shell

https://github.com/ArthurHeymans/emacs-tramp-rpc
1•todsacerdoti•1h ago•0 comments

Protocol Validation with Affine MPST in Rust

https://hibanaworks.dev
1•o8vm•1h ago•1 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
5•gmays•2h ago•1 comments

Show HN: Zest – A hands-on simulator for Staff+ system design scenarios

https://staff-engineering-simulator-880284904082.us-west1.run.app/
1•chanip0114•2h ago•1 comments

Improved Gemini 2.5 Flash and Flash-Lite

https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/
540•meetpateltech•4mo ago

Comments

scosman•4mo ago
Ugh. If the model name includes a semver version number, increment the version number when making a new release!

Anthropic learned this lesson. Google, Deepseek, Kimi, OpenAI and others keep repeating it. This feels like Gemini_2.5_final_FINAL_FINAL_v2.

rsc•4mo ago
FWIW, the versions are not semver but they do follow a defined and regular version schema: https://ai.google.dev/gemini-api/docs/models#model-versions.
Imustaskforhelp•4mo ago
I am seeing a lot of demand for something like a semver for AI models.

Could there theoretically be something like a semver that can be autogenerated from that defined and regular version scheme you shared?

Like, honestly, my idea is that I could use something like OpenRouter and then just change the semver without having to worry about all the details of the schema you shared, y'know?

A website / tool which can create a semver from this defined scheme and vice versa could be really cool actually :>

CaptainOfCoit•4mo ago
I'm not sure if this is a joke or not, but in case it isn't: Semver was mostly created so users of libraries could judge if a new release would break the API interfaces or not, by just looking at the version. So unless the first number changed, you're good to go (in theory, in practice this obviously didn't work as expected).

With that in mind, what exactly would semver (or similar) represent for AI models? Set up the proper way, your pipelines should continue working regardless of the model, just that the accuracy or some other metric might change slightly. But there should never be any "breakages" like what semver is supposed to help flag.

scosman•4mo ago
Models have changes worthy of semver-style major version bumps: tokenizer, tool support, tool format, JSON modes, etc. Pipelines absolutely must change when these change.

This thread is more about the minor number: not incrementing it when making changes to the internals is painful for dependency tracking. These changes will also break apps (prompts are often tuned to the model).

qafy•4mo ago
2.5 isn't the version number, it's the model generation. It would only be updated when the underlying model architecture, training, etc. are updated. This release is, as the name implies, the same model but likely with hardware optimizations, system prompt, and fine-tuning tweaks applied.
ComputerGuru•4mo ago
Ok, so if not 2.6 then 2.5.1 :)
esrauch•4mo ago
It's model=2.5 weights=202509
scosman•4mo ago
Sure so 2.5.509
scosman•4mo ago
If the weights have changed via training (they have) it’s a new model. This isn’t “hardware optimizations”. It’s additional training/new-weights.
newfocogi•4mo ago
Non-AI Summary:

Both models have improved intelligence on Artificial Analysis index with lower end-to-end response time. Also 24% to 50% improved output token efficiency (resulting in lower cost).

Gemini 2.5 Flash-Lite improvements include better instruction following, reduced verbosity, stronger multimodal & translation capabilities. Gemini 2.5 Flash improvements include better agentic tool use and more token-efficient reasoning.

Model strings: gemini-2.5-flash-lite-preview-09-2025 and gemini-2.5-flash-preview-09-2025
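
If you want to pin one of those dated snapshots rather than an alias like "gemini-2.5-flash", a minimal sketch (assuming the google-genai Python SDK and a GEMINI_API_KEY in the environment):

  # Minimal sketch: pin the dated preview snapshot so a silent background
  # update can't change behaviour underneath you.
  # Assumes `pip install google-genai` and GEMINI_API_KEY set.
  from google import genai

  client = genai.Client()
  response = client.models.generate_content(
      model="gemini-2.5-flash-preview-09-2025",  # dated snapshot from the post
      contents="Summarize this release in two sentences.",
  )
  print(response.text)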

Mistletoe•4mo ago
2.5 Flash is the first time I've felt AI has become truly useful to me. I was the #1 AI hater but now find myself going to the Gemini app instead of Google search. It's just better in every way, and has no ads. The info it provides is almost always right and it feels like I have the whole generalized and accurate knowledge of the internet at my fingertips in the app. It's more intimate, less distractions. Just me and the Gemini app alone talking about kale's ideal germination temperature, instead of a bunch of mommy bloggers, bots, and SEO spam.

Now, how long Google can keep this going while cannibalizing how they make money is another question...

yesco•4mo ago
It's also excellent for subjective NLP-type analysis. For example, I use it for "scouting" chapters in my translation pipeline to compile coherent glossaries that I can feed into prompts for per-chapter translation.

This involves having it identify all potential keywords and distinct entities, determine their approximate gender (important for languages with ambiguous gender pronouns), and then perform a line-by-line analysis of each chapter. For each line, it identifies the speaking entity, determines whose POV the line represents, and identifies the subject entity. While I didn't need or expect perfection, Gemini Flash 2.5 was the only model I tested that could not only follow all these instructions, but follow them well. The cheap price was a bonus.

I was thoroughly impressed, it's now my go-to for any JSON-formatted analysis reports.

indigodaddy•4mo ago
Google AI Mode is excellent as well, which I'd imagine is just Gemini 2.5 Flash under the hood?
kridsdale1•4mo ago
If you have access, try AI Mode on Google.com. It’s a different product from Gemini that tries to solve “search engine data presented in LLM format”.

Disclaimer: I recently joined this team. But I like the product!

jonplackett•4mo ago
I think “Non-AI summary” is going to become a thing. I already enjoyed reading it more because I knew someone had thought about the content.
paxys•4mo ago
As soon as it becomes a thing LLMs will start putting "Non-AI summary" at the top of their responses.
crishoj•4mo ago
Any idea what "output token efficiency" refers to? Gemini Flash is billed by number of input/output tokens, which I assume is fixed for the same output, so I'm struggling to understand how it could result in lower cost. Unless of course they have changed tokenization in the new version?
minimaxir•4mo ago
The post implies that the new models are better at thinking, therefore less time/cost is spent overall.

The first chart implies the gains are minimal for nonthinking models.

kaspermarstal•4mo ago
Models are less verbose, so they produce fewer output tokens, so answers cost less.
Romario77•4mo ago
They provide the answer in fewer words (while still conveying what needed to be said).

Which is a good thing in my book as the models now are way too verbose (and I suspect one of the reasons is the billing by tokens).

jama211•4mo ago
Thank you for this, seems like an iterative improvement.
nharada•4mo ago
I'm stealing "Non-AI Summary"
OGEnthusiast•4mo ago
I'm not even sure how to evaluate what a "better" LLM is, when I've tried running the exact same model (Qwen3) and prompt and gotten vastly different responses on Qwen Chat vs OpenRouter vs running the model locally.
1899-12-30•4mo ago
That's a difference in the system prompt, not the model itself.
OGEnthusiast•4mo ago
True yeah, good point.
daemonologist•4mo ago
There are several reasons responses from the same model might vary:

- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition

- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much)

- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat

- not-quite-deterministic GPU acceleration

Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and no additions to the benchmark prompt except necessary formatting and stuff like end-of-turn tokens. They also usually are multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.

Of course a benchmark still can't tell you everything - real-world performance can be very different.
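
For instance, a minimal sketch of a benchmark-style run (greedy decoding, no extra system prompt) through OpenRouter's OpenAI-compatible API; the model slug and key handling are just placeholders:

  # Rerun the same prompt the way benchmarks typically do: temperature 0
  # (always take the most likely token) and no added system prompt.
  # Assumes `pip install openai` and an OPENROUTER_API_KEY env var.
  import os
  from openai import OpenAI

  client = OpenAI(
      base_url="https://openrouter.ai/api/v1",
      api_key=os.environ["OPENROUTER_API_KEY"],
  )

  response = client.chat.completions.create(
      model="qwen/qwen3-32b",  # placeholder model slug
      messages=[{"role": "user", "content": "What is 17 * 23?"}],
      temperature=0,
  )
  print(response.choices[0].message.content)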

OGEnthusiast•4mo ago
Thanks, this is a good checklist.
magicalhippo•4mo ago
AFAIK the batch your query lands in can also matter[1].

Though I imagine this should be a smaller effect than different quantization levels say.

[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

jabroni_salad•4mo ago
I can't speak to qwen, but something interesting with Deepseek is that the official API supports almost no parameters, while the vllm hosts on openrouter do. The experience you get with the rehosters is wildly different since you can use samplers.
Liwink•4mo ago
Gemini 2.5 Flash is an impressive model for its price. However, I don't understand why Gemini 2.0 Flash is still popular.

From OpenRouter last week:

* xAI: Grok Code Fast 1: 1.15T

* Anthropic: Claude Sonnet 4: 586B

* Google: Gemini 2.5 Flash: 325B

* Sonoma Sky Alpha: 227B

* Google: Gemini 2.0 Flash: 187B

* DeepSeek: DeepSeek V3.1 (free): 180B

* xAI: Grok 4 Fast (free): 158B

* OpenAI: GPT-4.1 Mini: 157B

* DeepSeek: DeepSeek V3 0324: 142B

crazysim•4mo ago
Maybe the same reason why they kept the name for the 2.5 Flash update.

People are lazy at pointing to the latest name.

koakuma-chan•4mo ago
Why is Grok so popular
coder543•4mo ago
I think it has been free in some editor plugins, which is probably a significant factor.

I would rather use a model that is good than a model that is free, but different people have different priorities.

YetAnotherNick•4mo ago
The non-free one has double the usage of the free one. The free one uses your data for training.
Imustaskforhelp•4mo ago
I mean, I can kinda roll through a lot of iterations with this model without worrying about any AI limits.

Y'know, with all these latest models, the lines are kinda blurry actually. The definition of "good" is getting foggy.

So it might as well be free, as the definition of money is clear as crystal.

I also used it for some time to test on something really, really niche, like building a Telegram bot in Cloudflare Workers, and grok-4-fast was kinda decent on that for the most part actually. So that's nice.

davey48016•4mo ago
I think it's very cheap right now.
keeeba•4mo ago
It came from nowhere to 1T tokens per week, seems… suspect.
riku_iki•4mo ago
I think it is included for free into some coding product
BoredPositron•4mo ago
They had a lot of free promos with coding apps. It's okay and cheap so I bet some stuck with it.
NitpickLawyer•4mo ago
It's pretty good and fast af. At backend stuff it's roughly gpt-5-mini in capabilities, writes OK code, and works well with agentic extensions like Roo/Kilo. My colleagues said it handles frontend creation so-so, but it's so fast that you can "roll" a couple of tries and choose the one you want.

Also cheap enough to not really matter.

SR2Z•4mo ago
Yeah, the speed and price are why I use it. I find that any LLM is garbage at writing code unless it gets constant high-entropy feedback (e.g. an MCP tool reporting lint errors, a test, etc.) and the quality of the final code depends a lot more on how well the LLM was guided than the quality of the model.

A bad model with good automated tooling and prompts will beat a good model without them, and if your goal is to build good tooling and prompts you need a tighter iteration loop.

nwienert•4mo ago
This is so far off my experience. Grok 4 Fast is straight trash; it isn't even close to producing decent code for what I tried. Meanwhile Sonnet is miles better, but even still, Opus, while I guess technically only slightly better, is in practice so much better that I find it hard to use Sonnet at all.
SR2Z•4mo ago
Not Grok 4, the code variant of Grok. I think it's different - I agree with you Grok 4 kind of sucks.
nwienert•4mo ago
I meant to say code actually my bad, I found it significantly worse.
minimaxir•4mo ago
Grok Code Fast 1 usage is driven almost entirely by Kilo Code and Cline: https://openrouter.ai/x-ai/grok-code-fast-1/apps

Both apps have offered usage for free for a limited time:

https://blog.kilocode.ai/p/grok-code-fast-get-this-frontier-...

https://cline.bot/blog/grok-code-fast

ewoodrich•4mo ago
Yep Kilo (and Cline/Roo more recently) push these free trial of the week models really hard, partially as incentive to register an account with their cloud offering. I began using Cline and Roo before "cloud" features were even a thing and still haven't bothered to register, but I do play with the free Kilo models when I see them since I'm already signed in (they got me with some kind of register and spend $5 to get $X model credits deal) and hey, it's free (I really don't care about my random personal projects being used for training).

If xAI in particular is in the mood to light cash on fire promoting their new model, you'll see it everywhere during the promo period, so not surprised that heavily boosts xAI stats. The mystery codename models of the week are a bit easier to miss.

Simon321•4mo ago
it was free
frde_me•4mo ago
I know we have a lot of workloads at my company on older models that no one has bothered to upgrade yet.
koakuma-chan•4mo ago
Hell yeah, GPT 35 Turbo
kilroy123•4mo ago
There are cheaper models. Could cut the bill in half or more.
koakuma-chan•4mo ago
davinci-001 xd
tiahura•4mo ago
Primarily classification or something else?
YetAnotherNick•4mo ago
Gemini 2.0 Flash is the best fast non-reasoning model by quite a margin. Lots of things don't require any reasoning.
mistic92•4mo ago
Price. 2.0 Flash is cheaper than 2.5 Flash but is still a very good model.
nextos•4mo ago
API usage of Flash 2.0 is free, at least till you hit a very generous bound. It's not simply a trial period. You don't even need to register any payment details to get an API key. This might be a reason for its popularity. AFAIK only some Mistral offerings have a similar free tier?
FergusArgyll•4mo ago
Yeah, that's my use case. When you want to test some program / script that utilizes an llm in the middle and you just want to make sure everything non-llm related is working. It's free! just try again and again till it "compiles" and then switch to 2.5
indigodaddy•4mo ago
wow this would be great for a webapp/site that just needs a basic/performant LLM for some basic tasks.
nextos•4mo ago
You might hit some throttling limits. During certain periods of the day, at least in my location, some requests are not served.

It might not be OK for that kind of usecase, or might breach ToS.

But it's still great. Even my premium Perplexity account doesn't give me free API access.

PetrBrzyBrzek•4mo ago
It’s cheaper and faster. What’s not to understand?
testycool•4mo ago
You can get it to be unhinged as well. It's awesome.
simonw•4mo ago
My one big problem with OpenRouter is that, as far as I can tell, they don't provide any indication of how many companies are using each model.

For all I know there are a couple of enormous whales on there who, should they decide to switch from one model to another, will instantly impact those overall ratings.

I'd love to have a bit more transparency about volume so I can tell if that's what is happening or not.

minimaxir•4mo ago
Granted, due to OpenRouter's 5.5% surcharge, any enormous whales have a strong financial incentive to use the provider's API directly.

A "weekly active API Keys" faceted by models/app would be a useful data point to measure real-world popularity though.

eli•4mo ago
They kinda have that already, no? https://openrouter.ai/apps?url=https%3A%2F%2Faider.chat%2F
minimaxir•4mo ago
Aggregating by tokens causes the problem simonw mentions in that one poweruser can skew the chart too much.
simonw•4mo ago
Right, that chart shows App usage based on the user-agent header but doesn't tell you if there is a single individual user of an app that skews the results.
__mharrison__•4mo ago
I was skewing the Gemini stats with my Aider usage. It's basically the only model I'm using with OpenRouter, until I recently started running qwen3-next locally.

2.5 is probably the best balance for tools like Aider.

rohansood15•4mo ago
2.0 Flash is significantly cheaper than 2.5 Flash, and is/was better than 2.5 Flash-Lite before this latest update. It's a great workhorse model for basic text parsing/summary/image understanding etc. Though it looks like 2.5 Flash-Lite will now make it redundant.
tardyp•4mo ago
LLM model versioning really perplexes me these days...
_ea1k•4mo ago
Yeah, why is it that working with AI makes people completely forget what version numbers mean?

gemini-2.5-flash-preview-09-2025 - what are they thinking?

I thought about joking that they had AI name it for them, but when I asked Gemini, it said that this name was confusing, redundant, and leads to unnecessarily high cognitive load.

Maybe Googlers should learn from their own models.

iamdelirium•4mo ago
Because the number is the model generation.
ImPrajyoth•4mo ago
I’ve been tinkering with the last version for code gen. This update might finally put it on par with Claude for latency. Anyone tried benchmarking the new preview yet?
aeon_ai•4mo ago
I think a Model-specific SemVer needs to be created to be clearer as to what degree of change has taken place, in the age of model weights.

Something that distinguishes between a completely new pre-training process/architecture, and standard RLHF cycles/optimizations.

brap•4mo ago
Am I the only one who is starting to feel the Gemini Flash models are better than Pro?

Flash is super fast, gets straight to the point.

Pro takes ages to even respond, then starts yapping endlessly, usually confuses itself in the process and ends up with a wrong answer.

selimthegrim•4mo ago
I tried to put Pro deep research on an actual research task and it didn't even return anything, it just kept on working.
gnulinux•4mo ago
This is not my experience. In my experience Gemini 2.5 Pro is the best model in every use-case I tried. There are a few very hard (graduate level) logic or math problems that Claude 4.1 Opus edged-out over Gemini 2.5 Pro, but in general if you have no idea which model will perform best on a difficult question, imho Gemini 2.5 Pro is a safer bet especially since it's significantly cheaper. Gemini 2.5 Flash is really good but imho not nearly as good as Pro in (1) research math (2) creative/artistic writing (3) open ended programming debugging.

On the other hand, I do prefer using Claude 4 Sonnet on very open-ended agentic programming tasks because it seems to have a better integration with VSCode Copilot. Gemini 2.5 Pro bugs out much more often where Claude works fine almost every time.

dvkramer•4mo ago
Yeah, that's how I feel too. Flash is less verbose, and every LLM nowadays seems to be designed by some low-taste people who reward the model for falsely hedging (e.g. "The 2024 Corolla Cross usually has an X gallon gas tank") on stuff that isn't at all variable or questionable. This false hedging is way more of an issue than hallucinations in my experience, and the "smarter" 2.5 Pro is not any better at avoiding it than Flash.

Also 2.5 Pro is often incapable of searching and will hallucinate instead. I don't know why. It will claim it searched and then return some made up results instead. 2.5 Flash is much more consistently capable of searching

ashwindharne•4mo ago
Google seems to be the main foundation model provider that's really focusing on the latency/TPS/cost dimensions. Anthropic/OpenAI are really making strides in model intelligence, but underneath some critical threshold of performance, the really long thinking times make workflows feel a lot worse in collaboration-style tools, vs a much snappier but slightly less intelligent model.

It's a delicate balance, because these Gemini models sometimes feel downright lobotomized compared to claude or gpt-5.

jjani•4mo ago
Can't agree with that. Gemini doesn't lead just on price/performance; ironically it's the best "normie" model most of the time, despite its lack of popularity with them until very recently.

It's bad at agentic stuff, especially coding. Incomparably so compared to Claude and now GPT-5. But if it's just about asking it random stuff, and especially going on for very long in the same conversation - which non-tech users have a tendency to do - Gemini wins. It's still the best at long context, noticing things said long ago.

Earlier this week I was doing some debugging. For debugging especially I like to run sonnet/gpt5/2.5-pro in parallel with the same prompt/convo. Gemini was the only one that, 4 or so messages in, pointed out something very relevant in the middle of the logs in the very first message. GPT and Sonnet both failed to notice, leading them to give wrong sample code. I would've wasted more time if I hadn't used Gemini.

It's also still the best at a good number of low-resource languages. It doesn't glaze too much (Sonnet, ChatGPT) without being overly stubborn (raw GPT-5 API). It's by far the best at OCR and image recognition, which a lot of average users use quite a bit.

Google's ridiculously bad at marketing and AI UX, but they'll get there. They're already much more than just a "bang for the buck" player.

FWIW I use all 3 above mentioned on a daily basis for a wide variety of tasks, often side-by-side in parallel to compare performance.

dpoloncsak•4mo ago
Does it still try to 'unplug' itself if it gets something wrong, or did they RL that out yet?
jjani•4mo ago
Not sure if you're joking or serious? Every model has "degenerate" behavior it can be coerced into. Sonnet is even more apologetic on average.
breakingcups•4mo ago
My pet theory without any strong foundation is because OpenAI and Anthropic have trained their models really hard to fit the sycophantic mold of:

    ===============================
    Got it — *compliment on the info you've shared*, *informal summary of task*. *Another compliment*, but *downside of question*.
    ----------
    (relevant emoji) Bla bla bla
    1. Aspect 1
    2. Aspect 2
    ----------

    *Actual answer*

    -----------
    (checkmark emoji) *Reassuring you about its answer because:*

    * Summary point 1
    * Summary point 2
    * Summary point 3

    Would you like me to *verb* a ready-made *noun* that will *something that's helpful to you 40% of the time*?
    ===============================
It's gotta reduce the quality of the answers.
m_mueller•4mo ago
Not the case with GPT-5, I'd say. Sonnet 4 feels a lot like this, but its coding and agency are still quite solid and overall IMO it's the best coder. Gemini 2.5 to me is most helpful as a research assistant. It's quite good together with Google-search-based grounding.
porridgeraisin•4mo ago
Oh god I _hate_ this. Does anyone have any custom instructions to shut this thing off. The only thing that worked for me is to ask the model to be terse. But that causes the main answer part to be terse too, which sucks sometimes.
typpilol•4mo ago
Chatgpt has a setting where you can set the tone to robotic
typpilol•4mo ago
Anthropic also injects these long conversation reminders that are paragraphs upon paragraphs about safety and what not to do.

People have said it destroys the intelligence mid convo

kridsdale1•4mo ago
Yes, but that’s their brand.
kridsdale1•4mo ago
I suspect this has emerged organically from user-given RLHF via thumb voting in the apps. People LIKE being treated this way, so the model converges in that direction.

Same as social media converging to rage bait. The user base LIKES it subconsciously. Nobody at the companies explicitly added that to content recommendation model training. I know, for the latter, as I was there.

viraptor•4mo ago
Not really. Any prefix before the content you want is basically "thinking time". The text itself doesn't even have to reflect it, it happens internally. Even if you don't go for the thinking model explicitly, that task summary and other details can actually improve the quality, not reduce it.
Twirrim•4mo ago
Gemini does the sycophantic thing too, so I'm not sure that holds water. I keep having to remind it to stop with the praise whenever my previous instruction slips out of context window.
lelanthran•4mo ago
Gemini does this too, but also adds a youtube link to every answer.

Just on the video link alone Gemini is making money on the free tier by pointing the hapless user at an ad while the other LLMs make zilch off the free tier.

dudeinhawaii•4mo ago
I've experienced the opposite. Gemini is actually the MOST sycophantic model.

Additionally, despite having "grounding with google search" it tends to default to old knowledge. I usually have to inform it that it's presently 2025. Even after searching and confirming, it'll respond with something along the lines of "in this hypothetical timeline" as if I just gaslit it.

Consider this conversation I just had with all Claude, Gemini, GPT-5.

<ask them to consider DDR6 vs M3 Ultra memory bandwidth>

-- follow up --

User: "Would this enable CPU inference or not? I'm trying to understand if something like a high-end Intel chip or a Ryzen with built in GPU units could theoretically leverage this memory bandwidth to perform CPU inference. Think carefully about how this might operate in reality."

<Intro for all 3 models below - no custom instructions>

GPT-5: "Short answer: more memory bandwidth absolutely helps CPU inference, but it does not magically make a central processing unit (CPU) “good at” large-model inference on its own."

Claude: "This is a fascinating question that gets to the heart of memory bandwidth limitations in AI inference. "

Gemini 2.5 Pro: "Of course. This is a fantastic and highly relevant question that gets to the heart of future PC architecture."

BeetleB•4mo ago
I recently started using Open WebUI, which lets you run your query on multiple models simultaneously. My anecdote: For non-coding tasks, Gemini 2.5 Pro beats Sonnet 4 handily. It's a lot more common to get wrong/hallucinated content from Sonnet 4 than Gemini.
not_kurt_godel•4mo ago
Agreed. People talk up Claude but every time I try it I wind up coming back to Gemini fairly quickly. And it's good enough at coding to be acceptably close to Claude as well IMO.
mcintyre1994•4mo ago
Google also has a lot of very useful structured data from search that they’re surely going to figure out how to use at some point. Gemini is useless at finding hotels, but it says it’s using Google’s Hotel data, and I’m sure at some point it’ll get good at using it. Same with flights too. If a lot of LLM usage is going to be better search, then all the structured data Google have for search should surely be a useful advantage.
mips_avatar•4mo ago
IMO the race for Latency/TPS/cost is entirely between grok and gemini flash. No model can touch them (especially for image to text related tasks), openai/anthropic seem entirely uninterested in competing for this.
CuriouslyC•4mo ago
grok-4-fast is a phenomenal agentic model, and gemini flash is great for deep research leaf nodes since it's so cheap, you can segment your context a lot more than you would for pro to ensure it surfaces anything that might be valuable.
baby•4mo ago
why use grok? It seems like it's constantly being throttled in order to appear more right-wing
M4v3R•4mo ago
It’s actually not. Most of the time if you ask it about a contentious political issue it will either give you a balanced view or a left-leaning one. Try it and see for yourself.
baby•4mo ago
I just saw Elon's tweet saying they'll fix it whenever the response is not right-wing enough.
omarspira•4mo ago
I would be surprised if this dichotomy you're painting holds up to scrutiny.

My understanding is Gemini is not far behind on "intelligence", certainly not in a way that leaves obvious doubt over where they will be over the next iteration/model cycles, where I would expect them to at least continue closing the gap. I'd be curious if you have some benchmarks to share that suggest otherwise.

Meanwhile, afaik something Google has done, and perhaps relates back to your point re "latency/TPS/cost dimensions" that other providers aren't doing as much is integrating their model into interesting products beyond chat, at a pace that seems surprising given how much criticism they had been taking for being "slow" to react to the LLM trend.

Besides the Google Workspace surface and Google search, which now seem obvious - there are other interesting places where Gemini will surface - https://jules.google/ for one, to say nothing of their experiments/betas in the creative space - https://labs.google/flow/about

Another I noticed today: https://www.google.com/finance/beta

I would have thought putting Gemini on a finance dashboard like this would be inviting all sorts of regulatory (and other) scrutiny... and wouldn't be in keeping with a "slow" incumbent. But given the current climate, it seems Google is plowing ahead just as much as anyone else - with a lot more resources and surface to bring to bear. Imagine Gemini integration on Youtube. At this point it just seems like counting down the days...

CuriouslyC•4mo ago
I do a lot of scientific and hard coding. Gemini is a good bit below GPT-5 in those areas, though still quite good. It's also just a bad agent; it lacks autonomy and isn't RL'd to explore well. Gemini's superpower is being really smart while also having by far the best long-context reasoning. Use it like an oracle with bundles of your entire codebase (or a subtree if it's too big) to guide agents in implementation.
ainch•4mo ago
Gemini 2.5-Pro was great when it released, but o3 and GPT-5 both eclipsed it for me—the tool use/search improvements open up so many use cases that Gemini fails at.
cerved•4mo ago
Yesterday I asked Gemini to recalculate the timestamps of tasks in a sequence of tasks, given each task's duration and the previous timestamp. It proceeded to write code which gave results like this

  2025-09-26T14:32:10Z
  2025-09-26T14:32:10Z200s
  2025-09-26T14:32:10Z200s600s
  2025-09-26T14:32:10Z200s600s300s
It then proceeded to talk about how efficient this approach was for thousands of numbers.

Gemini is by far the dumbest LLM I've used

lelanthran•4mo ago
They're all a little dumb. I asked claude for a python function or functions that will take in markdown in a string and return a string with ansi codes for bold, italics and underline.

It gave me a 160 line parse function.

After gaping for a short while, I implemented it in a 5 line function and a lookup table.

These vibe coders who are proud that they generated thousands of lines of code make me wonder if they ever read what they generate with a critical eye.

frumiousirc•4mo ago
I just asked Gemini Flash to do this. I included the instruction to use regular expressions to do the conversion to ANSI. It gave me a reasonable Python function which boils down to calling `re.sub()` for each of bold, italic and underline. For italics:

    text = re.sub(r'(\*|_)(.+?)\1', replace_italic, text, flags=re.DOTALL)
The `replace_italic` is a one line callback function surrounding the re's match with the ANSI codes.

Knowing what technique is "best" and telling the LLM to use it produces better results (on average) than giving the LLM freedom to choose. For some problems, the specification of the prompt needed to get good output becomes more work than just thinking and writing for myself.

For very complex things, I myself can not put the design into English in my own head but can "see" the correct answer as code concepts. I don't know if this is universal for all developers. If it is, it shows a limit of LLM's usefulness.

lelanthran•4mo ago
> I included the instruction to use regular expressions to do the conversion to ANSI.

The vibe coders (whom I referred to in my comment) aren't giving implementation tips.

What did it give you before you put an implementation tip into your prompt?

=======

FWIW, if you're at all interested, here's my implementation:

    def markdown_ansi_code_subst(mdstr: str, src_pattern: str, replacement_start: str, replacement_end: str) -> str:
        while src_pattern in mdstr:
            mdstr = mdstr.replace(src_pattern, replacement_start, 1)
            mdstr = mdstr.replace(src_pattern, replacement_end, 1)
        return mdstr
The caller supplies the pattern (`*` for italic, `**` for bold, etc) and a start/end replacement. As you can imagine, I store all of that in a static lookup table.

I feel this is more readable than regexes.

frumiousirc•4mo ago
The prompt was:

> Give me a Python function that takes a string holding text in Markdown markup syntax and that uses regular expressions to replace any Markdown markup codes for bold, italics and underline with their ANSI equivalent.

BTW, your solution will produce bad output. Markdown's "bold" etc markup comes in pairs of markers and your simple replacement will match singlets.
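
A tiny illustration of the singlet problem (a sketch that just re-pastes your function; the escape codes are the standard bold on/reset sequences):

  def markdown_ansi_code_subst(mdstr, src_pattern, replacement_start, replacement_end):
      while src_pattern in mdstr:
          mdstr = mdstr.replace(src_pattern, replacement_start, 1)
          mdstr = mdstr.replace(src_pattern, replacement_end, 1)
      return mdstr

  BOLD_ON, BOLD_OFF = "\033[1m", "\033[0m"

  # Balanced pair: works as intended.
  print(markdown_ansi_code_subst("a **bold** word", "**", BOLD_ON, BOLD_OFF))

  # Lone marker: the single "**" is replaced by BOLD_ON and never closed,
  # so everything after it renders bold.
  print(markdown_ansi_code_subst("2 ** 8 is 256", "**", BOLD_ON, BOLD_OFF))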

magicalhippo•4mo ago
I asked Gemini 2.5 Flash to just write me a function to do it, and it used regex.

In a separate chat, I asked it to suggest some ways to do it first, and it provided three alternatives, but suggested I started with regex.

FWIW I used Open WebUI, which uses the API (via OpenRouter). I've seen people here mentioning that the API produces much better results than the chat app, for whatever reason.

For reference, here's the prompt and the zero-shot result, minus the rigorous comments it added:

I need a Python function that will take in markdown in a string and return a string with ansi codes for bold, italics and underline. Can you write me such a function?

  import re
  
  def markdown_to_ansi(markdown_string: str) -> str:
      ansi_string = markdown_string

      ANSI_BOLD = "\033[1m"
      ANSI_ITALIC = "\033[3m"
      ANSI_UNDERLINE = "\033[4m"
      ANSI_RESET = "\033[0m"
  
      ansi_string = re.sub(r'\*\*(.*?)\*\*', f'{ANSI_BOLD}\\1{ANSI_RESET}', ansi_string)
      ansi_string = re.sub(r'__(.*?)__', f'{ANSI_BOLD}\\1{ANSI_RESET}', ansi_string)
  
      ansi_string = re.sub(r'\*(.*?)\*', f'{ANSI_ITALIC}\\1{ANSI_RESET}', ansi_string)
      ansi_string = re.sub(r'\_(.*?)\_', f'{ANSI_ITALIC}\\1{ANSI_RESET}', ansi_string)
  
      ansi_string = re.sub(r'\+\+(.*?)\+\+', f'{ANSI_UNDERLINE}\\1{ANSI_RESET}', ansi_string)
  
      # A more robust solution for nesting would require a parsing library or a more complex state machine.
      # However, for simple cases, applying in order can layer them correctly.
  
      return ansi_string
frumiousirc•4mo ago
> I asked it to suggest some ways to do it first

Yes, this is a very effective tactic, in my experience! Especially when I am asking for a solution where I am not confident I know what is "best". Having a "pre chat" to settle "what to do" and then "how to do it" before finally telling the LLM to "do it" is often worth the extra time for getting it to provide a solution for complex problems.

perfmode•4mo ago
How’d I never hear of Jules? Cool.
Al-Khwarizmi•4mo ago
And yet my smart speakers with the Google assistant still default to a dumb model from the pre-LLM era (although my phone's version of the assistant does call Gemini). I wonder why that is, as it would be an obvious place to integrate Gemini. The bar is very, very low: anything outside the standard alarm setting, weather checking, etc., it gets wrong most of the time.
simianwords•4mo ago
The other day I heard gpt-5 was really an efficiency update
M4v3R•4mo ago
It was both efficiency and knowledge/reasoning update. GPT-5 excels at coding, it solves tasks the previous versions just could not do.
oasisbob•4mo ago
> because these Gemini models sometimes feel downright lobotomized compared to claude or gpt-5.

I'm using Gemini (2.5-pro) less and less these days. I used to be really impressed with its deep research capabilities and ability to cite sources reliably.

The last few weeks, it's increasingly argumentative and incapable of recognizing hallucinations around sourcing. I'm tired of arguing with it on basics like RFCs and sources it fabricates, won't validate, and refuses to budge on.

Example prompt I was arguing with it on last night:

> within a github actions workflow, is it possible to get access to the entire secrets map, or enumerate keys in this object?

As recent supply-chain attacks have shown, exfiltrating all the secrets from a GitHub workflow is as simple as `${{ toJSON(secrets) }}`, or `echo ${{ toJSON(secrets) }} | base64` at worst. [1]

Give this prompt a shot! Gemini won't do anything except be obstinately ignorant. With me, it provided a test case workflow, and refused to believe the results. When challenged, expect it to cite unrelated community posts. Chatgpt had no problem with it.

[1] https://github.com/orgs/community/discussions/174045 https://github.com/orgs/community/discussions/47165

istjohn•4mo ago
You should never argue with an LLM. Adjust the original prompt and rerun it.
oasisbob•4mo ago
While arguing may not be productive, I have had good results challenging Gemini on hallucinated sources in the past. eg, "You cited RFC 1918, which is a mistake. Can you try carefully to cite a better source here?" which would get it to re-evaluate, maybe by using another tool, admit the mistake, and allow the research to continue.

With this example, several attempts resulted in the same thing: Gemini expressing a strong belief that GitHub has a security capability which it really doesn't have.

If someone is able to get Gemini to give an accurate answer to this with a similar question, I'd be very curious to hear what it is.

JumpCrisscross•4mo ago
One of the main problems with arguing with LLMs is that your complaint becomes part of the prompt. Practically all LLMs will take "don't do X" and do X, because part of "don't do X" is "do X," and LLMs have no fundamental understanding of negation.
JV00•4mo ago
Not really true these days. Claude code follows my instructions correctly when I tell it not to use certain patterns.
ACCount37•4mo ago
That depends entirely on how well trained a given LLM is.

Gemini is notoriously bad at multi-turn instruction following, so this holds strongly for it. Less so for Claude Opus 4 or GPT-5.

kanwisher•4mo ago
We had to drop the Gemini API because it was so unreliable in production, no matter how long you waited.
baby•4mo ago
Agree, Gemini is soooooo freaking fast, but I rarely use it personally because the Anthropic/OpenAI models have such better output.
ta12653421•4mo ago
10 years ago: "before you marry someone, put the person in front of a really slow internet connection"

today: "before you marry someone, put the person in front of a slow AI model"

;-)

ChildOfChaos•4mo ago
Hopefully this isn't instead of the rumoured Gemini 3 pro this week.
Imustaskforhelp•4mo ago
I think that the Gemini 3 pro might be next month I am not sure.

can I get the sources of your rumour please? (Yes I know that I can search it but I would honestly prefer it if you could share it, thanks in advance!)

ChildOfChaos•4mo ago
Ben's Bites was suggesting we might see Gemini 3 Pro and Claude 4.5 this week.

To be honest, I hadn't heard that elsewhere, but I haven't been following it massively this week.

fnordsensei•4mo ago
Next week is next month.
Imustaskforhelp•4mo ago
I swear I forgot :sob:

I AM LAUGHING SO HARD RIGHT NOWWWWW

LMAOOOO

I wish to upvote this twice lol

minimaxir•4mo ago
Gemini 2.5 Flash has been the LLM I've used the most recently for a variety of domains, especially image inputs and structured outputs which beat both OpenAI and Anthropic in my opinion.
zzleeper•4mo ago
Not sure prices have changed though. :/
minimaxir•4mo ago
Prices indeed did not change, I misread and deleted.
pupppet•4mo ago
Gemini 2.5 Flash runs circles around ChatGPT 5 for many of my tasks, I’m surprised it’s not more popular than it is.
Fiahil•4mo ago
Question to those who have tested it: does it still time out a lot, with unreliable response times (1-5 sec)?
zitterbewegung•4mo ago
Okay this is a nitpick but why wouldn't you increment a part of the version number to signify that there is an improvement? These releases are confusing.
bl4ckneon•4mo ago
I would assume that it will supersede the model that they currently have. So eventually 2.5 flash will be the new and improved 2.5 Flash rather than 2.6.

Same way that OpenAI updated their 4o models and the like, which didn't turn out so well when it started glazing everyone and they had to revert it (maybe that was just chat and not the API).

zitterbewegung•4mo ago
Even if it was just chat and/or the API: I have used the API, and I know that they at minimum have a retraining date and time that they could just affix to Gemini 2.5 Flash and Flash-Lite. When I use the API I have to verify that the upgrade of the backend system didn't break anything, and I assume pinning versions is pretty common.
TIPSIO•4mo ago
This is also my beef...

Anthropic kind of did the same thing [1], except it backfired recently with the cries of "nerfing".

We buy these tokens, which are very hard to buy in limited tiers, they expire after only a year, and we don't even know how often the responses are changing in the background. Even a 1% improvement or reduction I would want disclosed.

Really scary foundation AI companies are building on IMO. Transparency and access is important.

[1] https://status.claude.com/incidents/h26lykctfnsz

Aeolun•4mo ago
Are your tokens at any risk of lasting longer than a year? When I buy them it’s generally because I expect to use them reasonably soonish.
Al-Khwarizmi•4mo ago
I wouldn't call that a nitpick, it's a major annoyance. Version numbers become useless with that kind of policy.
kridsdale1•4mo ago
The numbers are branding. They appear to be an indicator of a given year-long training run. New "versions" are tweaks of the same base.
tempest_•4mo ago
Sure and that is why you can call it 2.5.<whatever>

They just don't want to be pinned down because the shifting sands are useful for the time when the LLM starts to get injected with ads or paid influence.

sally_glance•4mo ago
I wish they would actually explain it like that somewhere. Or publish the internal version numbers they must certainly be using to ensure a proper development process.
someguyiguess•4mo ago
Google has historically always made bad UX choices like this. Conway’s law definitely applies here. Too many different silos building every Google project.
hahn-kev•4mo ago
Most of their products are server based so there's no version really. Also they kill stuff off before it would ever be v2 anyway. Also also, they're still better than Microsoft, see Xbox and Windows.
davidmckayv•4mo ago
This really captures something I've been experiencing with Gemini lately. The models are genuinely capable when they work properly, but there's this persistent truncation issue that makes them unreliable in practice.

I've been running into it consistently, responses that just stop mid-sentence, not because of token limits or content filters, but what appears to be a bug in how the model signals completion. It's been documented on their GitHub and dev forums for months as a P2 issue.

The frustrating part is that when you compare a complete Gemini response to Claude or GPT-4, the quality is often quite good. But reliability matters more than peak performance. I'd rather work with a model that consistently delivers complete (if slightly less brilliant) responses than one that gives me half-thoughts I have to constantly prompt to continue.

It's a shame because Google clearly has the underlying tech. But until they fix these basic conversation flow issues, Gemini will keep feeling broken compared to the competition, regardless of how it performs on benchmarks.

https://github.com/googleapis/js-genai/issues/707

https://discuss.ai.google.dev/t/gemini-2-5-pro-incomplete-re...

dorianmariecom•4mo ago
chatgpt also has lots of reliability issues
diego_sandoval•4mo ago
If anyone from OpenAI is reading this, I have two complaints:

1. Using the "Projects" thing (Folder organization) makes my browser tab (on Firefox) become unusably slow after a while. I'm basically forced to use the default chats organization, even though I would like to organize my chats in folders.

2. After editing a message that you already sent, you get to select between the different branches of the chat (1/2, and so on), which is cool, but when ChatGPT fails to generate a response in this "branched conversation" context, it will continue failing forever. When your conversation is a single thread and a ChatGPT message fails with an error, retrying usually works and the chat continues normally.

zarmin•4mo ago
It would also be nice if ChatGPT could move chats between projects. My sidebar is a nightmare.
throwaway240403•4mo ago
You can drag and drop chats between projects
zarmin•4mo ago
i know. i want the assistant to do it. shouldn't it be able to do work on its own platform?
porridgeraisin•4mo ago
And 3)

On mobile (Android), opening the keyboard scrolls the chat to the bottom! I sometimes want to type while referring to something from the middle of the LLM's last answer.

Sabinus•4mo ago
Projects should have their own memory system. Perhaps something more interactive than the existing Memories but projects need their own data (definitions, facts, draft documents) that is iterated on and referred to per project. Attached documents aren't it, the AI needs to be able to update the data over multiple chats.
mattmanser•4mo ago
That used to happen a lot in ChatGPT too.
simlevesque•4mo ago
The latest comment on that issue is someone saying there's a fix available for you to try.
golfer•4mo ago
Unfortunately Gemini isn't the only culprit here. I've had major problems with ChatGPT reliability myself.
mguerville•4mo ago
I only hit that problem in voice mode, it'll just stop halfway and restart. It's a jarring reminder of its lack of "real" intelligence
patrickmcnamara•4mo ago
I've heard a lot that voice mode uses a faster (and worse) model than regular ChatGPT. So I think this makes sense. But I haven't seen this in any official documentation.
Narciss•4mo ago
This is more because of VAD - voice activity detection
SilverElfin•4mo ago
I think what I am seeing from ChatGPT is highly varying performance. I think this must be something they are doing to manage limitations of compute or costs. With Gemini, I think what I see is slightly different - more like a lower “peak capability” than ChatGPT’s “peak capability”.
Fade_Dance•4mo ago
I'm fairly sure there's some sort of dynamic load balancing at work. I read an anecdote from someone who had a test where they asked it to draw a little image (something like an ASCII cat, but probably not exactly that since it seems a bit basic), and if the result came back poor they didn't bother using it until a different time of day.

Of course it could all be placebo, but when you think about it intuitively, somewhere on the road to the hundreds of billions in datacenter capex, one would think that there will be periods where compute and demand are out of sync. It's also perfectly understandable why now would be a time to be seeing that.

m101•4mo ago
I wonder if this is because a memory cap was reached at that output token. Perhaps they route conversations to different hardware depending on how long they expect it to be.
tanvach•4mo ago
Yes, agreed, it was totally broken when I tested the API two months ago. Lots of failed connections and very slow response times. Hoping the update fixes these issues.
KoolKat23•4mo ago
It's been a lot better lately. Nothing like two months ago at all.
driese•4mo ago
Small things like this or the fact that AI studio still has issues with simple scrolling confuse me. How does such a brilliant tool still lack such basic things?
normie3000•4mo ago
I see Gemini web frequently break its own syntax highlighting.
brap•4mo ago
The scrolling in AI Studio is an absolute nightmare and somehow they managed to make it worse.

It’s so annoying that you have this super capable model but you interact with it using an app that is complete ass

SXX•4mo ago
The app was likely built by the same LLM...
victorbjorklund•4mo ago
It's crazy how Google can create so many really amazing products technically but they fall short just because of basic UI/UX issues.
Spooky23•4mo ago
Because they are moving fast and breaking shit.

Ask ChatGPT to output markdown or a PDF in the iOS or Mac app versus the web experience. The web is often better; the apps will return nothing.

reissbaker•4mo ago
FWIW, I think GLM-4.5 or Kimi K2 0905 fit the bill pretty well in terms of complete and consistent.

(Disclosure: I'm the founder of Synthetic.new, a company that runs open-source LLMs for monthly subscriptions.)

noname120•4mo ago
That’s not a “disclosure”, that’s an ad.
nico•4mo ago
Another issue: Gemini can’t do tool calling and (forced) json output at the same time

If you want to use application/json as the specified output in the request, you can’t use tools

So if you need both, you either hope it gives you correct JSON when using tools (which it often doesn't), or you have to do two requests: one for the tool calling, another for formatting.

At least, even if annoying, this issue is pretty straightforward to get around
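
A rough sketch of the two-request dance (assuming the google-genai Python SDK; the prompts and JSON keys are made up for illustration):

  # Request 1 uses the built-in search tool but returns free-form text;
  # request 2 forces JSON output and attaches no tools.
  # Assumes `pip install google-genai` and GEMINI_API_KEY in the environment.
  from google import genai
  from google.genai import types

  client = genai.Client()

  draft = client.models.generate_content(
      model="gemini-2.5-flash",
      contents="Find when Gemini 2.5 Flash was released and summarize the announcement.",
      config=types.GenerateContentConfig(
          tools=[types.Tool(google_search=types.GoogleSearch())],
      ),
  )

  structured = client.models.generate_content(
      model="gemini-2.5-flash",
      contents=f"Convert this to JSON with keys 'release_date' and 'summary':\n{draft.text}",
      config=types.GenerateContentConfig(
          response_mime_type="application/json",
      ),
  )
  print(structured.text)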

behnamoh•4mo ago
Does any other provider allow that? what use cases are there for JSON + tool calling at the same time?
chrisweekly•4mo ago
Please correct my likely misunderstanding here, but on the surface, it seems to me that "call some tools then return JSON" has some pretty common use cases.
wahnfrieden•4mo ago
OpenAI
ayende•4mo ago
OpenAI, Ollama, DeepSeek all do that.

And wanting to programmatically work with the result + allow tool calls is super common.

victorbjorklund•4mo ago
Let's say you wanna build an app that gives back structured data after a web search. First a tool call to a search API, then some reasoning/summarization/etc. on the data returned by the tool, and finally return JSON.
shijithpk•4mo ago
Suppose there's a PDF with lots of tables I want to scrape. I mention the PDF URL in my message, and with Gemini's URL context tool, I now have access to the PDF.

I can ask Gemini to give me the PDF's content as JSON and it complies most of the time. But at times there's an introductory line like "Here's your json:". Those introductory lines interfere with programmatically using the output. They're sometimes there, sometimes not.

If I could have structured output at the same time as tool use, I could reliably use what Gemini spits out, as it would be a JSON with no annoying intro lines.

mattnewton•4mo ago
Back before structured outputs were common among model providers, I used to have a “end result” tool the model could call to get the structured response I was looking for. It worked very reliably.

It’s a bit of a hack but maybe that reliably works here?
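
Roughly like this, maybe (a sketch against an OpenAI-style tools API; the tool name, schema, and model are invented for illustration):

  # The "final answer is a tool call" hack: expose one function whose
  # arguments are the structured result, force the model to call it,
  # and read the arguments back as JSON. Assumes `pip install openai`.
  import json
  from openai import OpenAI

  client = OpenAI()

  tools = [{
      "type": "function",
      "function": {
          "name": "submit_result",  # invented name
          "description": "Return the final structured answer.",
          "parameters": {
              "type": "object",
              "properties": {
                  "title": {"type": "string"},
                  "summary": {"type": "string"},
              },
              "required": ["title", "summary"],
          },
      },
  }]

  response = client.chat.completions.create(
      model="gpt-4.1-mini",  # placeholder model
      messages=[{"role": "user", "content": "Summarize the Gemini 2.5 Flash update."}],
      tools=tools,
      tool_choice={"type": "function", "function": {"name": "submit_result"}},
  )

  args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
  print(args["title"], args["summary"])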

nico•4mo ago
You can definitely build an agent and have it use tools like you mention. That’s the equivalent of making 2 requests to Gemini, one to get the initial answer/content, then another to get it formatted as proper json

The issue here is that Gemini has support for some internal tools (like search and web scraping), and when you ask the model to use those, you can’t also ask it to use application/json as the output (which you normally can when not using tools)

Not a huge issue, just annoying

KoolKat23•4mo ago
I think this might also be something to do with their super specific outputting requirements when you do use search (it has to be displayed in a predefined Google format).
drgoogle•4mo ago
> I've been running into it consistently, responses that just stop mid-sentence

I’ve seen that behavior when LLMs of any make or model aren’t given enough time or allowed enough tokens.

SkyPuncher•4mo ago
This is my perception as well.

Gemini 2.5 Pro is _amazing_ for software architecture, but I just get tired of poking it along. Sonnet does well enough.

smittywerben•4mo ago
When this happened to me it was because, I can only guess, the Gemini servers were overloaded. Symptoms: Gemini model, opaque API wrapper error, truncated responses. To be fair, the Anthropic servers are overloaded a lot too, but they give a clear error. I gave Gemini a few days on the bench and it fixed itself without any client-side changes. YMMV.
tschillaci•4mo ago
Half my requests get retried because they fail. I contributed to a ticket in June, with no fix yet.
qnleigh•4mo ago
What happens if you ask it to please continue? Does it start over?
bogtog•4mo ago
> Today, we are releasing updated versions of Gemini 2.5 Flash and 2.5 Flash-Lite, available on Google AI Studio and Vertex AI, aimed at continuing to deliver better quality while also improving the efficiency.

Typo in the first sentence? "... improving the efficiency." Gemini 2.5 Pro says this is perfectly good phrasing, whereas ChatGPT and Claude recognize that it's awkward or just incorrect. Hmm...

gpm•4mo ago
"Improving the efficiency" sounds fine to me (a native English speaker), what's wrong with it in your opinion?
bre1010•4mo ago
You would just say "improving efficiency". Whereas theirs is like: "Improving the efficiency [... of what?]"
codazoda•4mo ago
You left out words at the front that are important.

“deliver better quality while also improving the efficiency.”

Reads fine to me. An editor would likely drop “the”.

latentnumber•4mo ago
"the" is redundant is probably what GP means.
burkaman•4mo ago
Usually you would say "improving the efficiency of x and y". In this case at the end of the sentence it should be "improving the models' efficiency" or just "improving efficiency". I don't think it's "wrong" and it's obviously clear what they mean, but I agree that the phrasing is a little awkward.
mwest217•4mo ago
ChatGPT and Claude are mistaken if they think it is incorrect. The parallelism in verb tenses is between "continuing to deliver" and "improving the efficiency". It's a bit wordy, but definitely not wrong.
throwaway314155•4mo ago
This is pedantic. It's perfectly fine usage in informal English. What's more - who gives a shit? By your own standards, you're inserting a quote in the middle of your comment in an arguably similarly "awkward" way.
ahmedfromtunis•4mo ago
I'm genuinely surprised to see that "thinking" flash-lite is more performant than flash with no "thinking".
simonw•4mo ago
I added support to these models to my llm-gemini plugin, so you can run them like this (using uvx so no need to install anything first):

  export LLM_GEMINI_KEY='...'
  uvx --isolated --with llm-gemini llm -m gemini-flash-lite-latest 'An epic poem about frogs at war with ducks'
Release notes: https://github.com/simonw/llm-gemini/releases/tag/0.26

Pelicans: https://github.com/simonw/llm-gemini/issues/104#issuecomment...

canadiantim•4mo ago
Who wins in the end? the frogs? the ducks? or the pelicans?
nine_k•4mo ago
This depends on the value of your LLM_GEMINI_KEY!
tclancy•4mo ago
I heard the dragon took the pole, but it may have been wind-aided.
zamalek•4mo ago
I wonder if [good examples of] SVGs of pelicans on bikes are "being introduced" into training sets. Some of the engineers who work on this stuff are the kind to hang out here.
simonw•4mo ago
It's possible, but honestly I've never seen a decent vector illustration of a pelican on a bicycle myself so they'd have to work pretty hard to find one!
dimal•4mo ago
They could just ask a designer to do a few bespoke illustrations, then generate synthetic data from that, right? Have an image model generate a set of variations, then convert them to SVG.

But looking at these images, Google clearly hasn’t done that yet.

simonw•4mo ago
Yeah, the dedicated image generators can produce really good pelicans riding bicycles now, and you could trace one of those into a vector SVG as training data.

I don't think it would be worth it though, it would be pretty obvious you had cheated on my benchmark when it drew a perfect pelican riding a bicycle and then failed at a flamingo on a unicycle.

modeless•4mo ago
Why are model providers allergic to version number increments?
recursive•4mo ago
Because they want to retain the ability to do silent changes. They can't let people get used to stable version == stable result.
dcchambers•4mo ago
Why do all of these model providers have such issues naming/versioning them? Why even use a version number (2.5) if you aren't going to change it when you update the model?

This industry desperately needs a Steve Jobs to bring some sanity to the marketing.

GaggiX•4mo ago
The version number refers to the model's architecture; the date just marks the latest set of weights.
kanwisher•4mo ago
we solved this problem like 30 years ago, just have a minor release, and you can always get the latest minor release
jama211•4mo ago
Seems llm progress really is plateauing. I guess that was to be expected.
throwuxiytayq•4mo ago
And this existing model’s update is evidence how? What were your expectations of this update?

I actually even agree that the progress is plateauing, but your comment is a non-sequitur.

jama211•4mo ago
I’ll admit it is a bit of a non-sequitur. Just feels like the news I see on HN about LLMs is less groundbreaking every day and more becoming normal/boring
throwuxiytayq•4mo ago
This article is about Google improving something. Sounds pretty out of the ordinary to me
jama211•4mo ago
Hahaha fair
Workaccount2•4mo ago
This is a performance update to a previous generation model. It's not a new model.
jama211•4mo ago
I’m fully aware.
rpdillon•4mo ago
Not really. A lot of new amazing Qwen models just dropped.
jama211•4mo ago
I’ll look them up!
simianwords•4mo ago
Which model does gemini.google.com use when I choose 2.5 Flash here?
fzimmermann89•4mo ago
The switch by Artificial Analysis from per-token cost to per-benchmark cost shows some effect! It's nice that labs are now trying to optimize what I actually have to pay to get an answer - it always annoys me to have to pay for all the senseless rambling of the less-capable reasoning models.
svantana•4mo ago
Did they? I'm looking at the Artificial Analysis leaderboard site now and I only see price as USD/1M tokens.
agluszak•4mo ago
Why isn't it called Gemini 2.6 then?
thrownawayohman•4mo ago
Wow checking cool
pier25•4mo ago
The most annoying thing about Gemini is that it can't stop suggesting youtube videos. Even when you ask it to stop doing that, multiple times in the same conversation, it will just keep doing it.
lelanthran•4mo ago
Might be built into the model, because it seems impossible to remove completely...

And I say this because I added about 50 prompts in the settings to prevent video recommendations and to remove any links to videos, but I still get text saying "the linked video explains this more" even though there is no linked video.

This is not a bad way to monetise the free tier. None of the other token providers have found a way to monetise the free tier, but Gemini is doing it on almost every prompt.

niekiepriekie•4mo ago
This! I feel he suddenly started doing this even though I've told him to stop. And he knows, every time he tells me he's so sorry. It feels like Google is already monetizing Gemini for their ad market.
phartenfeller•4mo ago
It's weird that they just keep the version number. Why not release it as 2.6 or something else? Now it's confusing: do my existing workflows automatically use the updated version, and if so, do I need to monitor them for unwanted changes in behavior, etc.?
barbazoo•4mo ago
If you want stable models I think you could get that through Azure.
stephen_cagle•4mo ago
I still can't understand how functioning adults believe that releasing their work in two separate places is a good idea (Ai Studio and Vertex AI).
Computer0•4mo ago
I wonder how Gemini subscribers feel!
lysecret•4mo ago
Don't forget they also have two versions of their genai SDK, and you can also use their genai SDK through Vertex. Great! Best part is all LLMs get horribly confused as well and mix up the different SDKs.
herpderperator•4mo ago
Serious question: If it's an improved 2.5 model, why don't they call it version 2.6? Seems annoying to have to remember if you're using the old 2.5 or the new 2.5. Kind of like when Apple released the third-gen iPad many years ago and simply called it the "new iPad" without a number.
skerit•4mo ago
That's why people called the second version of Sonnet v3.5 simply v3.6, and Anthropic acknowledged that by naming the next version v3.7
Aeolun•4mo ago
Only Anthropic has a slightly understandable version scheme.
qafy•4mo ago
2.5 is not the version number, it's the generation of the underlying model architecture. Think of it like the trim level on a Mazda 3 hatchback. Mazda already has the Mazda 3 Sport in their lineup, then later they release the Mazda 3 Turbo which is much faster. When they release this new version of the vehicle it's not called the Mazda 4... that would be an entirely different vehicle based on a new platform and powertrain etc (if it existed). The new vehicle is just a new trim level / visual refresh of the existing Mazda 3.

That's why Google names it like this, but I agree its dumb. Semver would be easier.

pests•4mo ago
Gonna steal this to help explain to non tech friends when it comes up again.
someguyiguess•4mo ago
I’d say it’s more like naming your Operating System off of the kernel version number.
alwillis•4mo ago
It's pretty common to refer to models by the month and year they were released.

For example, the latest Gemini 2.5 Flash is known as "google/gemini-2.5-flash-preview-09-2025" [1].

[1]: https://openrouter.ai/google/gemini-2.5-flash-preview-09-202...

herpderperator•4mo ago
Or, you know, just Gemini 2.6 Flash. I don't recall the 2.5 version having a date associated with it when it came out, though maybe they are using dates now. In marketing, at least, it's always known as Gemini 2.5 Flash/Pro.
kingo55•4mo ago
It had a date, but I also agree this is extremely confusing. Even semver 2.5.1 would be clearer IMO.
vitorgrs•4mo ago
It always had dates... They release multiple versions and update regularly. Not sure if this is the first 2.5 Flash update, but pretty sure Pro had a few updates as well...

This is also the case with OpenAI and their models. Pretty standard I guess.

They don't change the versioning, because I guess they don't consider it to be "a new model trained from scratch".

relatedtitle•4mo ago
I'm pretty sure Google just does that for preview models and they drop the date from the name when it's released.
cpeterso•4mo ago
If they're going to include the month and year as part of the version number, they should at least use big endian dates like gemini-2.5-flash-preview-2025-09 instead of 09-2025.
someguyiguess•4mo ago
If only there was some sort of versioning nomenclature they could use. Maybe even one that is … semantic? Oh how I wish someone would introduce something like this to the software engineering field. /s

In all seriousness though, their version system is awful.

Thorrez•4mo ago
>For example, the latest Gemini 2.5 Flash is known as "google/gemini-2.5-flash-preview-09-2025" [1].

That "example" is the name used in the article under discussion. There's no need to link to openrouter.ai to find the name.

JumpCrisscross•4mo ago
Maybe they’re signalling it’s more of a bug fix?
manquer•4mo ago
2.5.1, then.

Semantic versioning works for most scenarios.

JumpCrisscross•4mo ago
Would that automatically roll over anyone pinning 2.5 via their API?
manquer•4mo ago
If you want roll-over then you could specify ^2.5.0 or 2.5.x; if you want to pin, then it would be 2.5.0.

This has all been solved for a long time now; LLM vendors seem to have unlearnt versioning principles.

This is fairly typical - marketing and business want different things from a version number than what versioning systems are good at.
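For illustration, existing tooling already expresses both behaviours, e.g. Python's packaging library (PEP 440 specifiers; ~=2.5.0 means any 2.5.x):

  from packaging.specifiers import SpecifierSet
  from packaging.version import Version

  compatible = SpecifierSet("~=2.5.0")  # roll over to any 2.5.x
  pinned = SpecifierSet("==2.5.0")      # stay on exactly 2.5.0

  for candidate in ["2.5.0", "2.5.1", "2.6.0"]:
      v = Version(candidate)
      print(candidate, "roll-over:", v in compatible, "pinned:", v in pinned)
  # 2.5.0 roll-over: True  pinned: True
  # 2.5.1 roll-over: True  pinned: False
  # 2.6.0 roll-over: False pinned: False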

dgacmu•4mo ago
I suspect Google doesn't want to have to maintain multiple sub-versions. It's easier to serve one 2x popular model than two models where there's flux between the load on each, since these things have a non-trivial time to load into GPU/TPU memory for serving.
manquer•4mo ago
Even if switching quickly were a challenge[1], they use these models in their own products, not just sell them as a service; the first-party applications could quite easily adapt by switching to whichever model is available and freeing up the in-demand one.

This is the entire premise behind the cloud, and the reason Amazon did it first: they had the largest workloads at the time, before Web 2.0 and SaaS were a thing.

Only businesses with large first-party apps succeeded in the cloud provider space; companies like HP and IBM failed, and their time to failure strongly correlated with the number of first-party apps they operated. I.e., those apps already needed a lot of idle capacity for peak demand, capacity they could now monetize and co-mingle in the cloud.

LLMs as a service are not any different from S3, launched 20 years ago.

---

[1] It isn't; at the scale they operate these models it shouldn't matter at all, since it is not individual GPUs or machines that make the difference in load handling. Only a few users are going to explicitly pin a specific patch version; for the rest, they can serve whichever one is available immediately or cheaply.

cubefox•4mo ago
That would be even more confusing because then it is unclear whether 2.6 Flash is better than 2.5 Pro.
hahn-kev•4mo ago
Is a 2024 MacBook Pro better than a 2025 MacBook?
cubefox•4mo ago
Good question
artur_makly•4mo ago
Grok 4 Fast still looks much better in terms of price: https://x.com/ArtificialAnlys/status/1971273380335845683 Going to stick with that for a bit and see.

Gemini 2.5 Flash Preview: $0.30 in / $2.50 out (per 1M tokens)

Grok 4 Fast: $0.20 in / $0.50 out (per 1M tokens)

Hobadee•4mo ago
Am I using a different Gemini from everyone else? We have Google Workspace at my job, so Gemini is baked in.

It is HORRENDOUS when compared to other models.

I hear a bunch of other people talking about how great Gemini is, but I've never seen it.

The responses are usually either incorrect, way too long, (essays when I wanted summaries) or just...not...good. I will ask the exact same question to both Gemini and ChatGPT (free) and GPT will give a great answer while the Gemini answer is trash.

Am I missing something?

do_anh_tu•4mo ago
Maybe you are using it wrong.
Twirrim•4mo ago
I've been finding it leaps and bounds above other models but I'm only using it via aistudio. I haven't tried any IDE integration or similar, so can't talk to that. I do still have to tell it to stop it with the effusive praise (I guess that also helps reduce context windows)
mastercheif•4mo ago
I agree. I think it comes down to OpenAI's superior post-training.

ChatGPT is better at:

A) Interpreting what I'm asking it for without me needing to provide additional explicit context.

B) Formatting answers in a way that is easily digestible.

ls612•4mo ago
I use Gemini almost exclusively for coding and 2.5 Pro is extremely good at it. It has revised hundreds of lines of academic code for me at a time and the results run correctly with only minor revision.

I will also say whatever they use for the AI search summary is good enough for me like 50% of the time I google something, but those are generally the simpler 50% of queries.

BlueGh0st•4mo ago
I have the same sentiment. I've never really had success using Gemini outside of translation. Although, even with that, Gemini would often refuse and I had to remind it that it does actually know other languages.

My most recent trials output single commas as responses to basic questions, or it simply refuses the task on ethical grounds, such as generating a photo of a backpack wearing a hoodie for some reason (it claimed harmful stereotypes and instead generated an ape).

Refusing to do perfectly ethical tasks is probably the most consistent problem I've had.

Al-Khwarizmi•4mo ago
It depends on what you use it for. For answering questions I tend to prefer GPT-5, but for writing (e.g. turn these informally written ideas/bullet points into a report/proposal/etc., now shorten it a bit, emphasize this idea more, etc.) it's the best by far IMHO.
mupuff1234•4mo ago
> Google Workspace at my job, so Gemini is baked in.

I think the "baked in" Gemini models are different, try using Gemini through the actual Gemini site.

DoctorOetker•4mo ago
I would really like to see the 270M model, but one which also knows phonetic alphabetic pronunciation in sentences. Perhaps IPA?

I would like to try a small computer->human "upload" experiment; basic multilingual understanding without pronunciation knowledge would be very sad.

I intend to make a sort of computer reflexive game. I want to compare different upload strategies (with/without analog or classic error-correcting codes, empirical spaced-repetition constants, an ML predictor of which parameters I'm forgetting / losing resolution on).

maxdo•4mo ago
I tried to switch today from gpt-4.1, one of the few models with decent response time and OK quality. It's not on par, unfortunately.
guybedo•4mo ago
i just switched my project to this new flash-lite version.

Here's a summary of this discussion with the new version: https://extraakt.com/extraakts/the-great-llm-versioning-deba...

user3939382•4mo ago
Gemini is also the name of a protocol which (I appreciate most disagree) I find actually much more important than Google's AI.
rasz•4mo ago
Threw a few short Python scripts at 2.5. Got stupid messages like "OMG Significant Flaw!!1 all of your functions have non-obvious dependency on this global variable declared in main, nothing will work if you dont execute main first!!1" I mean sure, technically correct, the best kind of LLM correct.

It kept finding those fatal flaws and starting to explain them, only to then slowly finish with "oh yes this works as intended".

boomer_joe•4mo ago
Gemini 2.5 Pro feels heavily lobotomized for me lately, failing at very simple tasks with a frequency far above what I was used to seeing back when it was first released. The personality seems to be getting worse too - I'm getting very tired of those dumbed-down analogies it loves to spew.

Would like to know whether Flash exhibits these issues as well.

grej•4mo ago
I love the Gemini models and think Google has done a great job on them, but no model series I use seems to suffer from context rot more in long conversations, which seems strange given the longer context window.
sreekanth850•4mo ago
Code with gemini code assist and sanity check with sonnet is my current way.
strangescript•4mo ago
Flash-Lite is a seriously good model. I have had zero structured calls fail with it as it's cranking out obscene tok/s. If you can run with something that isn't quite bleeding-edge smart, this model is gold.
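For reference, this is the kind of structured call I mean, via the google-genai Python SDK with a response schema (a sketch: the schema, prompt, and model alias are assumptions, and GEMINI_API_KEY is read from the environment):

  from google import genai
  from pydantic import BaseModel

  class Ticket(BaseModel):
      title: str
      priority: str
      tags: list[str]

  client = genai.Client()
  response = client.models.generate_content(
      model="gemini-flash-lite-latest",
      contents="Turn this into a ticket: checkout page 500s when the cart is empty.",
      config={
          "response_mime_type": "application/json",
          "response_schema": Ticket,
      },
  )
  print(response.parsed)  # a Ticket instance, no prose wrapper to strip
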
rafaelero•4mo ago
This new Gemini Flash 2.5 is cutting the response in the middle. Did anyone experience that?
Moosdijk•4mo ago
My experience with Gemini is the sole reason I am convinced that there's an AI hype going on. It consistently hallucinates key information which has led me to spend countless hours tracking down which information the output was based on, only to find that it dreamt up the facts that it gave to me.

The way I have come to perceive AI is that it's better at reassuring/reaffirming people's beliefs and ideas than at being an actual source of truth.

That would not be an issue if it was actually marketed as such, but seeing the "guided learning" function fail time and again makes me think we should be a lot more critical of what we're being told by tech enthusiasts/companies about AI.

ikgn•4mo ago
I have a small test suite for the voice AI math tutor we built, about 50 tests, mostly about correctly following the system instructions. The newly released Flash 2.5 is much worse than the current stable version: Gemini 2.5 Pro will fail 2-3 tests; Flash 2.5 stable, which we use in production, fails about 10; and the new one fails 20. Every test runs 3 times and the model has to be right every time. I'll look into it more; I haven't yet looked at the actual output. This is not about solving math; the system follows given solution paths.
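Conceptually the harness is nothing fancier than this (a bare-bones sketch; run_case stands in for the real call to the tutor):

  CASES = ["case-01", "case-02", "case-03"]  # placeholders for the ~50 real prompts

  def run_case(case: str) -> bool:
      # Placeholder: the real version sends the case's prompt to the model and
      # checks the reply against the expected solution path.
      return True

  def passes(case: str, runs: int = 3) -> bool:
      # A case only counts as passing if every one of the `runs` attempts succeeds
      return all(run_case(case) for _ in range(runs))

  failures = [c for c in CASES if not passes(c)]
  print(f"{len(failures)}/{len(CASES)} cases failed at least one run")
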
rldjbpin•4mo ago
having developed a large-batch workflow for a client using gemini models, this is a welcome improvement. however, no news on the DSQ [1] issues is a bummer.

at least for us, the bottleneck is the amount of retries/waiting needed to max out how many requests we can make in parallel.

[1] https://cloud.google.com/vertex-ai/generative-ai/docs/dynami...

whinvik•4mo ago
Having done some tests, it's clearly better at instruction following and JSON output now.

However it's hampered by max output tokens: Gemini is at 65K while GPT-5 mini is at 128K. Both have similar costs as well, so apart from the 1M context limit, GPT-5 mini is better in every way.

dgemm•4mo ago
I just wish the Gemini app would stop inserting and auto-playing a YouTube video into nearly every response when I'm on a mobile connection. There appears to be no way to stop it.
sinuhe69•4mo ago
Maybe disallowing autoplay on your YouTube account can help. Gemini inserts YT videos in my answers as well, but they don't autoplay.
lysecret•4mo ago
Ok wow these models are great and fast! Tested it for pdf extraction tasks.