The Wayback Machine still has it: https://web.archive.org/web/20251118111103/https://storage.g...
Here’s the archived PDF: https://web.archive.org/web/20251118111103/https://storage.g...
Also notable is which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub of Grok 4 / Grok 4.1.
https://firstpagesage.com/reports/top-generative-ai-chatbots... suggests Grok accounts for about 0.6% of chat use cases, well below the other big names, and I suspect its share in chat is higher than in other scenarios like business usage. Given all that, I can see how Gemini might not be focused on competing with them.
I would want to hear more detail about prompts, frameworks, thinking time, etc., but they don't matter too much. The main caveat is that this was probably run on the public test set, so it could be in the pretraining data, and there could even have been some ARC-focused post-training - I think we don't know yet and might never know.
But for any reasonable setup, assuming no egregious cheating, that is an amazing score on ARC-AGI-2.
Also, I really hoped for a 2M+ context window. I'm living on the context edge even with 1M.
rvz•1h ago
Well, don't complain when you're using Gmail and your emails are being used to train Gemini.
patates•4m ago