The Wayback Machine still has it: https://web.archive.org/web/20251118111103/https://storage.g...
here’s the archived pdf: https://web.archive.org/web/20251118111103/https://storage.g...
Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.
https://firstpagesage.com/reports/top-generative-ai-chatbots... suggests 0.6% of chat use cases, well below the other big names, and I suspect those stats for chat are higher than other scenarios like business usage. Given all that, I can see how Gemini might not be focused on competing with them.
It is understandable that Grok is not popular.
I would want to hear more detail about prompts, frameworks, thinking time, etc., but they don't matter too much. The main caveat would be that this is probably on the public test set, so could be in pretraining, and there could even be some ARC-focussed post-training - I think we don't know yet and might never know.
But for any reasonable setup, assuming no egregious cheating, that is an amazing score on ARC-AGI-2.
One month Gemini is on top, then ChatGPT, then Anthropic. Not sure why everyone gets FOMO whenever a new version gets released.
I don't think any other company has all these ingredients.
Microsoft has the best chance of changing habits, by virtue of being bundled into business contracts at companies whose policies don't allow any other product in the workplace.
Elaborate please. Are you saying that MS is forcing customers to make Copilot the only allowed LLM product?
Microsoft has contracts to provide software to companies. Companies have policies that only the provided software and AI are allowed. Ipso facto.
They have a long way to go to become profitable, though. Those users will get less sticky when OpenAI starts upping their pricing/putting ads everywhere/making the product worse to save money/all of the above.
Even other search competitors have not proven to be a danger to Google. There is nothing stopping that search money coming in.
Or maybe Google just benchmaxxed and this doesn't translate at all in real world performance.
TBD if that performance generalizes to other real world tasks.
2) Google's search revenue last quarter was $56 billion, a 14% increase over Q3 2024.
2) I'm not suggesting this will happen overnight, but younger people especially gravitate towards LLMs for information search + actively use some sort of ad blocking. In the long run it doesn't look great for Google.
[1] Binomial formula gives a confidence interval of 3.7%, using p=0.77, N=500, confidence=95%
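For anyone who wants to check that footnote, a minimal sketch of the calculation (normal approximation to the binomial, plugging in the quoted p, N, and confidence level):

```python
import math

p = 0.77   # observed score
n = 500    # number of benchmark items, per the footnote
z = 1.96   # z value for ~95% confidence

margin = z * math.sqrt(p * (1 - p) / n)  # normal approximation to the binomial
print(f"{margin:.1%}")                   # prints 3.7%, matching the figure above
```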
Also, models are already pretty good but product/market fit (in terms of demonstrated economic value delivered) remains elusive outside of a couple domains. Does a model that's (say) 30% better reach an inflection point that changes that narrative, or is a more qualitative change required?
But we'll have to wait a few weeks to see if the nerfed model post-release is still as good.
Having said that, OpenAI's ridiculous hype cycle has been living on borrowed time. OpenAI has zero moat, and are just one vendor in a space with many vendors, and even incredibly competent open source models by surprise Chinese entrants. Sam Altman going around acting like he's a prophet and they're the gatekeepers of the future is an act that should be super old, but somehow fools and their money continue to be parted.
Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.
So far, IMHO, Claude Code remains significantly better than Gemini CLI. We'll see whether that changes with Gemini 3.
Not that Google didn't use to have problems shipping useful things. But it's gotten a lot worse.
EDIT: Don't disagree that Gemini CLI has a lot of rough edges, though.
That's because coding is currently the only reliable benchmark where reasoning capabilities transfer to predicting capabilities for other professions like law. Coding is also the only area where they are shy about releasing numbers. All these exam scores are fakeable by gaming those benchmarks.
Claude Code seems to be more compatible with the model (or the reverse), whereas gemini-cli still feels a bit awkward (as of 2.5 Pro). I'm hoping it's better with 3.0!
https://www.reddit.com/r/Bard/comments/1p093fb/gemini_3_in_c...
| Benchmark | 3 Pro | 2.5 Pro | Sonnet 4.5 | GPT-5.1 |
|-----------------------|-----------|---------|------------|-----------|
| Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% |
| ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% |
| GPQA Diamond | 91.9% | 86.4% | 83.4% | 88.1% |
| AIME 2025 | | | | |
| (no tools) | 95.0% | 88.0% | 87.0% | 94.0% |
| (code execution) | 100% | - | 100% | - |
| MathArena Apex | 23.4% | 0.5% | 1.6% | 1.0% |
| MMMU-Pro | 81.0% | 68.0% | 68.0% | 80.8% |
| ScreenSpot-Pro | 72.7% | 11.4% | 36.2% | 3.5% |
| CharXiv Reasoning | 81.4% | 69.6% | 68.5% | 69.5% |
| OmniDocBench 1.5 | 0.115 | 0.145 | 0.145 | 0.147 |
| Video-MMMU | 87.6% | 83.6% | 77.8% | 80.4% |
| LiveCodeBench Pro | 2,439 | 1,775 | 1,418 | 2,243 |
| Terminal-Bench 2.0 | 54.2% | 32.6% | 42.8% | 47.6% |
| SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% |
| t2-bench | 85.4% | 54.9% | 84.7% | 80.2% |
| Vending-Bench 2 | $5,478.16 | $573.64 | $3,838.74 | $1,473.43 |
| FACTS Benchmark Suite | 70.5% | 63.4% | 50.4% | 50.8% |
| SimpleQA Verified | 72.1% | 54.5% | 29.3% | 34.9% |
| MMLU | 91.8% | 89.5% | 89.1% | 91.0% |
| Global PIQA | 93.4% | 91.5% | 90.1% | 90.9% |
| MRCR v2 (8-needle) | | | | |
| (128k avg) | 77.0% | 58.0% | 47.1% | 61.6% |
| (1M pointwise) | 26.3% | 16.4% | n/s | n/s |
n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly
What do you mean? These coding leaderboards were at single digits about a year ago and are now in the seventies. These frontier models are arguably already better at the benchmark than any single human - it's unlikely that any particular human dev is knowledgeable enough to tackle the full range of diverse tasks even in the smaller SWE-Bench Verified within a reasonable time frame; to the best of my knowledge, no one has tried that.
Why should we expect this to be the limit? Once the frontier labs figure out how to train these fully with self-play (which shouldn't be that hard in this domain), I don't see any clear limit to the level they can reach.
Whether an individual human could do well across all tasks in a benchmark is probably not the right question to ask of a benchmark. It's quite easy to construct benchmark tasks that a human can't do well on; you don't even need AI to do better.
What field are you in where you feel that there might not have been any growth in capabilities at all?
EDIT: Typo
What makes me even more curious is the following
> Model dependencies: This model is not a modification or a fine-tune of a prior model
So did they start from scratch with this one?
My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.
As of a couple weeks ago (the last time I checked) if you are signed in to multiple Google accounts and you cannot accept the non-commercial terms for one of them for AI Studio, the site is horribly broken (the text showing which account they’re asking you to agree to the terms for is blurred, and you can’t switch accounts without agreeing first).
In Google’s very slight defense, Anthropic hasn’t even tried to make a proper sign in system.
Like, kind of unreasonably good. You'd expect some perfunctory Electron app that just barely wraps the website. But no, you get something that feels incredibly polished…more so than a lot of recent apps from Apple…and has powerful integrations into other apps, including text editors and terminals.
Gemini 1.0 was strictly worse than GPT-3.5 and was unusable due to "safety" features.
Google followed that up with 1.5 which was still worse than GPT-3.5 and unbelievably far behind GPT-4. At this same time Google had their "black nazi" scandals.
With Gemini 2.0, Google finally had a model that was at least useful for OCR and, with their fash series, a model that, while not up to par in capabilities, was sufficiently inexpensive that it found uses.
Only with Gemini-2.5 did Google catch up with SoTA. It was within "spitting distance" of the leading models.
Google did indeed drop the ball, very, very badly.
I suspect that Sergey coming back helped immensely, somehow. I suspect that he was able to tame some of the more dysfunctional elements of Google, at least for a time.
Unfortunate typo.
Anyone with money can trivially catch up to a state of the art model from six months ago.
And as others have said, late is really a function of spigot, guardrails, branding, and ux, as much as it is being a laggard under the hood.
How come Apple is struggling, then?
They may want to use a 3rd party, or just wait for AI to be more stable and see how people actually use it, instead of adding slop to the core of their product.
Announcing a load of AI features on stage and then failing to deliver them doesn't feel very strategic.
Enter late, enter great.
To be fair to Apple, so far the only mass-market LLM use case is just a simple chatbot, and they don't seem to be interested in that. It remains to be seen if what Apple wants to do ("private" LLMs with access to your personal context, acting as intimate personal assistants) is even possible to do reliably. It sounds useful, and I do believe it will eventually be possible, but no one is there yet.
They did botch the launch by announcing the Apple Intelligence features before they were ready, though.
Their major version number bumps are a new pre-trained model. Minor bumps are changes/improvements to post-training on the same foundation.
On Terminal-Bench 2 for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.
This chart (comparing base models to base models) probably gives a better idea of the total strength of each model.
I can't wait to try 3.0; hopefully it continues this trend. Raw numbers in a table don't mean much; you can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.
I hope it isn't such a sycophant like the current Gemini 2.5 models; the sycophancy makes me doubt the output, which is maybe a good thing now that I think about it.
What's with the hyperbole? It'll tighten the screws, but saying that it's "over for the other labs" might be a tad premature.
It's not over, and never will be, for two-decade-old accounting software; it definitely will not be over for other AI labs.
The new Gemini is not THAT far of a jump to switch your org to a new model if you've already invested in e.g. OpenAI.
The difference must be night and day to call it "it's over".
Right, they're all marginally different. Today Google fine-tuned their model to be better; tomorrow it will be a new Kimi, after that DeepSeek.
I feel like many will be pretty disappointed by their self created expectations for this model when they end up actually using it and it turns out to be fairly similar to other frontier models.
Personally I'm very interested in how they end up pricing it.
Because it seems to lead by a decent margin on the former and trails behind on the latter
LCB Pro consists of LeetCode-style questions, and SWE-Bench Verified is heavily benchmaxxed, very old Python tasks.
However, going above 75%, it is likely about the same. The remaining instances are likely underspecified despite the effort of the authors that made the benchmark "verified". From what I have seen, these are often cases where the problem statement says implement X for Y, but the agent has to simply guess whether to implement the same for other case Y' - which leads to losing or winning an instance.
Models have begun to fairly thoroughly saturate "knowledge" and such, but there are still considerable bumps there
But the _big news_, and the demonstration of their achievement here, are the incredible scores they've racked up here for what's necessary for agentic AI to become widely deployable. t2-bench. Visual comprehension. Computer use. Vending-Bench. The sorts of things that are necessary for AI to move beyond an auto-researching tool, and into the realm where it can actually handle complex tasks in the way that businesses need in order to reap rewards from deploying AI tech.
Will be very interesting to see what papers are published as a result of this, as they have _clearly_ tapped into some new avenues for training models.
And here I was, all wowed, after playing with Grok 4.1 for the past few hours! xD
While the private questions don't seem to be included in the performance results, HLE will presumably flag any LLM that appears to have gamed its scores based on the differential performance on the private questions. Since they haven't yet, I think the scores are relatively trustworthy.
This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.
If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation), and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.
Benchmark                  | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1   | GPT-5.1 Thinking
---------------------------|--------------|----------------|-------------------|-----------|------------------
Humanity's Last Exam       | 37.5%        | 21.6%          | 13.7%             | 26.5%     | 52%
ARC-AGI-2                  | 31.1%        | 4.9%           | 13.6%             | 17.6%     | 28%
GPQA Diamond               | 91.9%        | 86.4%          | 83.4%             | 88.1%     | 61%
AIME 2025                  | 95.0%        | 88.0%          | 87.0%             | 94.0%     | 48%
MathArena Apex             | 23.4%        | 0.5%           | 1.6%              | 1.0%      | 82%
MMMU-Pro                   | 81.0%        | 68.0%          | 68.0%             | 80.8%     | 76%
ScreenSpot-Pro             | 72.7%        | 11.4%          | 36.2%             | 3.5%      | 55%
CharXiv Reasoning          | 81.4%        | 69.6%          | 68.5%             | 69.5%     | N/A
OmniDocBench 1.5           | 0.115        | 0.145          | 0.145             | 0.147     | N/A
Video-MMMU                 | 87.6%        | 83.6%          | 77.8%             | 80.4%     | N/A
LiveCodeBench Pro          | 2,439        | 1,775          | 1,418             | 2,243     | N/A
Terminal-Bench 2.0         | 54.2%        | 32.6%          | 42.8%             | 47.6%     | N/A
SWE-Bench Verified         | 76.2%        | 59.6%          | 77.2%             | 76.3%     | N/A
t2-bench                   | 85.4%        | 54.9%          | 84.7%             | 80.2%     | N/A
Vending-Bench 2            | $5,478.16    | $573.64        | $3,838.74         | $1,473.43 | N/A
FACTS Benchmark Suite      | 70.5%        | 63.4%          | 50.4%             | 50.8%     | N/A
SimpleQA Verified          | 72.1%        | 54.5%          | 29.3%             | 34.9%     | N/A
MMLU                       | 91.8%        | 89.5%          | 89.1%             | 91.0%     | N/A
Global PIQA                | 93.4%        | 91.5%          | 90.1%             | 90.9%     | N/A
MRCR v2 (8-needle)         | 77.0%        | 58.0%          | 47.1%             | 61.6%     | N/A
Argh, it doesn't come out right on HN.
Benchmark..................Description...................Gemini 3 Pro....GPT-5.1 (Thinking)....Notes
Humanity's Last Exam.......Academic reasoning.............37.5%..........52%....................GPT-5.1 shows 7% gain over GPT-5's 45%
ARC-AGI-2...................Visual abstraction.............31.1%..........28%....................GPT-5.1 multimodal improves grid reasoning
GPQA Diamond................PhD-tier Q&A...................91.9%..........61%....................GPT-5.1 strong in physics (72%)
AIME 2025....................Olympiad math..................95.0%..........48%....................GPT-5.1 solves 7/15 proofs correctly
MathArena Apex..............Competition math...............23.4%..........82%....................GPT-5.1 handles 90% advanced calculus
MMMU-Pro....................Multimodal reasoning...........81.0%..........76%....................GPT-5.1 excels visual math (85%)
ScreenSpot-Pro..............UI understanding...............72.7%..........55%....................Element detection 70%, navigation 40%
CharXiv Reasoning...........Chart analysis.................81.4%..........69.5%.................N/A
That's a scandal, IMO.
Given that Gemini-3 seems to do "fine" against the thinking versions, why didn't they post those results? I get that PMs like to make a splash, but that's shockingly dishonest.
> For Claude Sonnet 4.5, and GPT-5.1 we default to reporting high reasoning results, but when reported results are not available we use best available reasoning results.
https://storage.googleapis.com/deepmind-media/gemini/gemini_...
The 17.6% is for 5.1 Thinking High.
I'll wait for the official blog with benchmark results.
I suspect that our ability to benchmark models is waning. Much more investment is required in this area, but how does that play out?
Anyone happen to know why? Is this website by any chance sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?
I actually never discovered who was responsible for the blockade, until I read this comment. I'm going to look into Allot and send them an email.
EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.
Yeah, that was via my ISP's DNS resolver (Vodafone); switching the resolver works :)
The responsible party is ultimately our government who've decided it's legal to block a wide range of servers and websites because some people like to watch illegal football streams. I think Allot is just the provider of the technology.
---
But seriously, I find it helps to set a custom system prompt that tells Gemini to be less sycophantic and to be more succinct and professional while also leaving out those extended lectures it likes to give.
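For what it's worth, here is a rough sketch of wiring such a system prompt through the google-generativeai Python SDK; the prompt wording and model name are just my own example, so adjust for whichever client and model you actually use:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Illustrative de-sycophancy instruction, not an official recommendation.
SYSTEM_PROMPT = (
    "Be succinct and professional. Do not praise the user or the question. "
    "Skip preambles, disclaimers, and unsolicited lectures; answer directly."
)

# In this SDK the system instruction is attached to the model object.
model = genai.GenerativeModel("gemini-2.5-pro", system_instruction=SYSTEM_PROMPT)
print(model.generate_content("Review this function for subtle bugs: ...").text)
```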
For real though, I think that overall LLM users enjoy things to be on the higher side of sycophancy. Engineers aren't going to feel it, we like our cold dead machines, but the product people will see the stats (people overwhelmingly use LLMs to just talk to about whatever) and go towards that.
I'm guessing LLM Death Count is off by an OOM or two, so we could be getting close to one in a million.
https://www.google.com/search?q=gemini+u.s.+senator+rape+all...
Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.
Org was created on 2025-11-04T19:28:13Z (https://api.github.com/orgs/Google-Antigravity)
Speed? (Flash, Flash-Lite, Antigravity) this is my guess. Bonus: maybe Gemini Diffusion soon?
Space? (Google Cloud, Google Antigravity?)
Clothes? (A light wearable -> Antigravity?)
Gaming? (Ghosting/nontangibility -> antigravity?)
"Google Antigravity" refers to a new AI software platform announced by Google designed to help developers write and manage code.
The term itself is a bit of a placeholder or project name, combining the brand "Google" with the concept of "antigravity"—implying a release from the limitations of traditional coding.
In simple terms, Google Antigravity is a sophisticated tool for programmers that uses powerful AI systems (called "agents") to handle complex coding tasks automatically. It takes the typical software workbench (an IDE) and evolves it into an "agent-first" system.
Agentic Platform: It's a central hub where many specialized AI helpers (agents) live and work together. The goal is to let you focus on what to build, not how to build it.
Task-Oriented: The platform is designed to be given a high-level goal (a "task") rather than needing line-by-line instructions.
Autonomous Operation: The AI agents can work across all your tools—your code editor, the command line, and your web browser—without needing you to constantly supervise or switch between them.
Now the page is somewhat live on that URL
[1] https://blog.google/technology/ai/introducing-pathways-next-...
Perhaps SWE-Bench just doesn't capture a lot of the improvement? If the web design improvements people have been posting on Twitter are any indication, I suspect this will be a huge boon for developers. SWE-Bench is really testing bugfixing/feature dev more.
Anyway let's see. I'm still hyped!
According to at least OpenAI, who probably produces the most tokens (if we don't count google AI overviews and other unrequested AI bolt-ons) out of all the labs, programming tokens account for ~4% of total generations.
That's nothing. The returns will come from everyone and their grandma paying $30-100/mo to use the services, just like everyone pays for a cell phone and electricity.
Don't be fooled, we are still in the "Open hands" start-up business phase of LLMs. The "enshitification" will follow.
This model is not a modification or a fine-tune of a prior model
Is it common to mention that? Feels like they built something from scratch.

Evals are hard.
My take would be that coding itself is hard, but I'm a software engineer myself so I'm biased.
GPT 5.1 Codex beats Gemini 3 on Terminal Bench specifically on Codex CLI, but that's apples-to-oranges (hard to tell how much of that is a Codex-specific harness vs model). Look forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins given how close it comes in these benchmarks.
I did not bother verifying the other claims.
It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.
Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.
That'd be a bad idea: models are often trained for specific tools (GPT Codex is trained for Codex, and Sonnet has been trained with Claude Code in mind), and vice versa, the tools are built with a specific model in mind, as they all work differently.
Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage, but instead arbitrarily measure how well a model works with the "standard harness", which, if people start caring about it, will start to be gamed instead.
What do you mean by "standard eval harness"?
The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.
My point is, although the model itself may have performed well in benchmarks, I feel like other tools are doing better just by adapting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.
I do, however, use Gemini CLI for the most part just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.
Gemini is very good at pointing out flaws that are subtle and not noticeable at first or second glance.
It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment.
It's probably pretty liberating, because you can make a "spikey" intelligence with only one spike to really focus on.
I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding, which is to do fairly small units of work with multiple revisions, it is good enough for me not to even consider the competition.
Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model.
It may be cheaper but it's much, much slower, which is a total flow killer in my experience.
The bucket name "deepmind-media" has been used in the past on the deepmind official site, so it seems legit.
I wonder how significant this is. DeepMind was always more research-oriented than OpenAI, which mostly scaled things up. They may have come up with a significantly better architecture (Transformer MoE still leaves a lot of room).
That seems like a low bar. Who's training frontier LLMs on CPUs? Surely they meant to compare TPUs to GPUs. If "this is faster than a CPU for massively parallel AI training" is the best you can say about it, that's not very impressive.
Though come on... Even with proofreading, this is an easy one to miss.
They are both designed to do massively parallel operations. TPUs are just a bit more specific to matrix multiply+adds while GPUs are more generic.
For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.
The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.
If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.
I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.
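A rough sketch of what such a benchmark could look like, purely as an illustration (the constraint pool and `call_model` stub are hypothetical, not an existing benchmark): sweep k upward and watch where the compliance fraction starts to drop.

```python
import random

# Tiny pool of mechanically checkable rules; a real benchmark would need
# hundreds of non-conflicting ones.
CONSTRAINTS = [
    ("do not use the word 'very'", lambda out: "very" not in out.lower()),
    ("include the token XYZZY exactly once", lambda out: out.count("XYZZY") == 1),
    ("keep the response under 120 words", lambda out: len(out.split()) < 120),
    ("use no exclamation marks", lambda out: "!" not in out),
]

def instruction_capacity_trial(call_model, k):
    """Prompt with k rules at once; return the fraction the model respected."""
    rules = random.choices(CONSTRAINTS, k=k)  # with replacement, for illustration
    prompt = "Write a short product description. Follow ALL of these rules:\n"
    prompt += "\n".join(f"{i + 1}. {text}" for i, (text, _) in enumerate(rules))
    output = call_model(prompt)
    return sum(check(output) for _, check in rules) / k
```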
If you've ever played a competitive game, the difference between these tiers is insane.
Our most intelligent model with SOTA reasoning and multimodal understanding, and powerful agentic and vibe coding capabilities
<=200K tokens • Input: $2.00 / Output: $12.00
> 200K tokens • Input: $4.00 / Output: $18.00
Knowledge cutoff: Jan. 2025
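If those list prices hold, a quick cost helper is easy to sketch (assuming, as with 2.5 Pro, that the higher tier kicks in when the input exceeds 200K tokens; that part is my assumption):

```python
def gemini3_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from the tiered list prices quoted above."""
    long_context = input_tokens > 200_000
    in_rate = 4.00 if long_context else 2.00     # $ per 1M input tokens
    out_rate = 18.00 if long_context else 12.00  # $ per 1M output tokens
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

print(f"${gemini3_pro_cost(50_000, 4_000):.3f}")  # ~$0.148 for a typical call
```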
NVDA is down 3.26%
I think specialized hardware for training models is the next big wave in China.
I already enjoy Gemini 2.5 pro for planning and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait and switch on pricing and usage, so happy to see Google take the crown here.
For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
Still taking nearly a year to train and run post-training safety and stability tuning.
With 10x the infrastructure they could iterate much faster, I don't see AI infrastructure as a bubble, it is still a bottleneck on pace of innovation at today's active deployment level.
And I really don't think I'm alone in this.
Who is training LLMs with CPUs?
https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...
Well, don't complain when you are using Gmail and your emails are being used to train Gemini.
I don't expect them to follow their own privacy policies.
[0] https://www.yahoo.com/news/articles/google-sued-over-gemini-...