One month Gemini is on top, then ChatGPT, then Anthropic. Not sure why everyone gets FOMO whenever a new version gets released.
I don't think any other company has all these ingredients.
Microsoft has the best chance of changing habits, by virtue of being bundled into business contracts at companies whose policies don't allow any other product in the workplace.
Even other search competitors have not proven to be a danger to Google. There is nothing stopping that search money from coming in.
Or maybe Google just benchmaxxed and this doesn't translate at all to real-world performance.
TBD if that performance generalizes to other real world tasks.
2) Google's search revenue last quarter was $56 billion, a 14% increase over Q3 2024.
2) I'm not suggesting this will happen overnight, but younger people in particular gravitate towards LLMs for information search and actively use some form of ad blocking. In the long run it doesn't look great for Google.
[1] Binomial formula gives a confidence interval of 3.7%, using p=0.77, N=500, confidence=95%
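If anyone wants to check that arithmetic, here's a quick sketch using the normal approximation to the binomial; p, N, and the 1.96 z-value for 95% are just the numbers from the footnote above.

    from math import sqrt

    p, n, z = 0.77, 500, 1.96   # accuracy, sample size, z for a 95% two-sided interval

    # Normal approximation to the binomial proportion: margin = z * sqrt(p*(1-p)/n)
    margin = z * sqrt(p * (1 - p) / n)
    print(f"+/- {margin:.1%}")   # ~3.7%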
Also, models are already pretty good but product/market fit (in terms of demonstrated economic value delivered) remains elusive outside of a couple domains. Does a model that's (say) 30% better reach an inflection point that changes that narrative, or is a more qualitative change required?
But we'll have to wait a few weeks to see if the nerfed model post-release is still as good.
Having said that, OpenAI's ridiculous hype cycle has been living on borrowed time. OpenAI has zero moat, and are just one vendor in a space with many vendors, and even incredibly competent open source models by surprise Chinese entrants. Sam Altman going around acting like he's a prophet and they're the gatekeepers of the future is an act that should be super old, but somehow fools and their money continue to be parted.
So far, IMHO, Claude Code remains significantly better than Gemini CLI. We'll see whether that changes with Gemini 3.
EDIT: Don't disagree that Gemini CLI has a lot of rough edges, though.
Claude Code seems to be more compatible with the model (or the reverse), whereas gemini-cli still feels a bit awkward (as of 2.5 Pro). I'm hoping it's better with 3.0!
https://www.reddit.com/r/Bard/comments/1p093fb/gemini_3_in_c...
| Benchmark | 3 Pro | 2.5 Pro | Sonnet 4.5 | GPT-5.1 |
|-----------------------|-----------|---------|------------|-----------|
| Humanity’s Last Exam | 37.5% | 21.6% | 13.7% | 26.5% |
| ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% |
| GPQA Diamond | 91.9% | 86.4% | 83.4% | 88.1% |
| AIME 2025 | | | | |
| (no tools) | 95.0% | 88.0% | 87.0% | 94.0% |
| (code execution) | 100% | — | 100% | — |
| MathArena Apex | 23.4% | 0.5% | 1.6% | 1.0% |
| MMMU-Pro | 81.0% | 68.0% | 68.0% | 80.8% |
| ScreenSpot-Pro | 72.7% | 11.4% | 36.2% | 3.5% |
| CharXiv Reasoning | 81.4% | 69.6% | 68.5% | 69.5% |
| OmniDocBench 1.5 | 0.115 | 0.145 | 0.145 | 0.147 |
| Video-MMMU | 87.6% | 83.6% | 77.8% | 80.4% |
| LiveCodeBench Pro | 2,439 | 1,775 | 1,418 | 2,243 |
| Terminal-Bench 2.0 | 54.2% | 32.6% | 42.8% | 47.6% |
| SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% |
| t2-bench | 85.4% | 54.9% | 84.7% | 80.2% |
| Vending-Bench 2 | $5,478.16 | $573.64 | $3,838.74 | $1,473.43 |
| FACTS Benchmark Suite | 70.5% | 63.4% | 50.4% | 50.8% |
| SimpleQA Verified | 72.1% | 54.5% | 29.3% | 34.9% |
| MMLU | 91.8% | 89.5% | 89.1% | 91.0% |
| Global PIQA | 93.4% | 91.5% | 90.1% | 90.9% |
| MRCR v2 (8-needle) | | | | |
| (128k avg) | 77.0% | 58.0% | 47.1% | 61.6% |
| (1M pointwise) | 26.3% | 16.4% | n/s | n/s |
n/s = not supported
EDIT: formatting, hopefully a bit more mobile friendly
What do you mean? These coding leaderboards were at single digits about a year ago and are now in the seventies. These frontier models are arguably already better at the benchmark than any single human - it's unlikely that any particular human dev is knowledgeable enough to tackle the full range of diverse tasks even in the smaller SWE-Bench Verified within a reasonable time frame; to the best of my knowledge, no one has tried that.
Why should we expect this to be the limit? Once the frontier labs figure out how to train these fully with self-play (which shouldn't be that hard in this domain), I don't see any clear limit to the level they can reach.
What makes me even more curious is the following
> Model dependencies: This model is not a modification or a fine-tune of a prior model
So did they start from scratch with this one?
My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.
Anyone with money can trivially catch up to a state of the art model from six months ago.
And as others have said, being "late" is really a function of spigot, guardrails, branding, and UX as much as it is being a laggard under the hood.
How come Apple is struggling then?
They may want to use a 3rd party, or just wait for AI to be more stable and see how people actually use it, instead of adding slop into the core of their product.
On Terminal-Bench 2 for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.
I can't wait to try 3.0; hopefully it continues this trend. Raw numbers in a table don't mean much, you can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.
I hope it isn't as sycophantic as the current Gemini 2.5 models; that makes me doubt its output, which is maybe a good thing now that I think about it.
What's with the hyperbole? It'll tighten the screws, but saying that it's "over for the other labs" might be a tad premature.
It's not over, and never will be, for two-decade-old accounting software; it definitely will not be over for other AI labs.
I feel like many will be pretty disappointed by their self created expectations for this model when they end up actually using it and it turns out to be fairly similar to other frontier models.
Personally I'm very interested in how they end up pricing it.
Because it seems to lead by a decent margin on the former and trails behind on the latter.
Anyone happen to know why? Is this website by any chance sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?
I actually never discovered who was responsible for the blockade, until I read this comment. I'm going to look into Allot and send them an email.
EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.
Yeah, that was via my ISPs DNS resolver (Vodafone), switching the resolver works :)
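In case it's useful to anyone else hitting this, here's a minimal sketch for comparing what your ISP's resolver returns against a public one (Python with dnspython; 1.1.1.1 and the domain are just placeholder examples):

    import dns.resolver  # pip install dnspython

    DOMAIN = "example.org"  # substitute the blocked site

    # Whatever the system is configured with (i.e. the ISP's resolver by default).
    isp = dns.resolver.Resolver()

    # An explicit public resolver, bypassing the ISP's DNS.
    public = dns.resolver.Resolver(configure=False)
    public.nameservers = ["1.1.1.1"]

    for label, resolver in (("system/ISP", isp), ("public", public)):
        try:
            answers = resolver.resolve(DOMAIN, "A")
            print(label, [rr.to_text() for rr in answers])
        except Exception as exc:  # NXDOMAIN, timeout, or a block-page answer
            print(label, "failed:", exc)

If the two disagree (or the ISP one times out while the public one resolves), it's DNS-level blocking and switching resolvers is enough.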
The responsible party is ultimately our government who've decided it's legal to block a wide range of servers and websites because some people like to watch illegal football streams. I think Allot is just the provider of the technology.
---
But seriously, I find it helps to set a custom system prompt that tells Gemini to be less sycophantic and to be more succinct and professional while also leaving out those extended lectures it likes to give.
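If you're using it through the API rather than the web app, here's a rough sketch of the same idea with the google-generativeai Python SDK; the model name and the prompt wording are just my own placeholders:

    import google.generativeai as genai  # pip install google-generativeai

    genai.configure(api_key="YOUR_API_KEY")

    SYSTEM_PROMPT = (
        "Be succinct and professional. Do not praise the user or the question. "
        "Skip preambles and closing lectures; just answer."
    )

    model = genai.GenerativeModel(
        "gemini-2.5-pro",              # placeholder; use whichever model you're on
        system_instruction=SYSTEM_PROMPT,
    )

    print(model.generate_content("Explain HTTP caching headers.").text)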
https://www.google.com/search?q=gemini+u.s.+senator+rape+all...
Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.
Org was created on 2025-11-04T19:28:13Z (https://api.github.com/orgs/Google-Antigravity)
Space? (Google Cloud, Google Antigravity?)
Speed? (Flash, Flash-Lite, Antigravity? meh)
Clothes? (Google Antigravity, a wearable?)
Perhaps SWE-Bench just doesn't capture a lot of the improvement? If the web design improvements people have been posting on Twitter are representative, I suspect this will be a huge boon for developers. SWE-Bench is really testing bugfixing/feature dev more.
Anyway let's see. I'm still hyped!
This model is not a modification or a fine-tune of a prior model
Is it common to mention that? Feels like they built something from scratch.
Evals are hard.
GPT 5.1 Codex beats Gemini 3 on Terminal Bench specifically on Codex CLI, but that's apples-to-oranges (hard to tell how much of that is a Codex-specific harness vs model). Look forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins given how close it comes in these benchmarks.
I did not bother verifying the other claims.
It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.
Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.
My point is, although the model itself may have performed well in benchmarks, I feel like there are other tools that are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.
I do, however, use Gemini CLI for the most part just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.
The bucket name "deepmind-media" has been used in the past on the deepmind official site, so it seems legit.
I wonder how significant this is. DeepMind was always more research-oriented than OpenAI, which mostly scaled things up. They may have come up with a significantly better architecture (Transformer MoE still leaves a lot of room).