Gemini 3 Deep Think

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/

189•tosh•2h ago

https://twitter.com/GoogleDeepMind/status/202198151040070909...

Comments

Metacelsus•1h ago

According to benchmarks in the announcement, healthily ahead of Claude 4.6. I guess they didn't test ChatGPT 5.3 though.

Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).

simianwords•1h ago

The comparison should be with GPT 5.2 pro which has been used successfully to solve open math problems.

throwup238•1h ago

The general-purpose ChatGPT 5.3 hasn’t been released yet, just 5.3-codex.

neilellis•1h ago

It's ahead in raw power but not in function. Like it's got the world's fastest engine but one gear! Trouble is some benchmarks only measure horse power.

NitpickLawyer•1h ago

> Trouble is some benchmarks only measure horse power.

IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.

Davidzheng•1h ago

I gather that 4.6's strengths are in long-context agentic workflows? At least over Gemini 3 Pro preview, Opus 4.6 seems to have a lot of advantages.

verdverm•53m ago

It's a giant game of leapfrog, shift or stretch time out a bit and they all look equivalent

CuriouslyC•19m ago

Google is way ahead in visual AI and world modelling. They're lagging hard in agentic AI and autonomous behavior.

nkzd•16m ago

Google's models and CLI harness feel behind in agentic coding compared to OpenAI and Anthropic.

sigmar•1h ago

Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

gs17•1h ago

Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.

sigmar•1h ago

I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.

edit: they just removed the reference to "3.1" from the pdf

josalhor•18m ago

I think this is 3.1 (3.0 Pro with the RL improv of 3.0 Flash). But they probably decided to market it as Deep Think because why not charge more for it.

WarmWash•50m ago

The rumor was that 3.1 was today's drop

staticman2•47m ago

That's odd considering 3.0 is still labeled a "preview" release.

riku_iki•39m ago

> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

They will never run it on the private set, because that would mean the set being leaked to Google.

lukebechtel•1h ago

Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6)

Wow.

https://blog.google/innovation-and-ai/models-and-research/ge...

karmasimida•1h ago

It is over

baal80spam•1h ago

I for one welcome our new AI overlords.

mnicky•1h ago

Well, fair comparison would be with GPT-5.x Pro, which is the same class of a model as Gemini Deep Think.

nubg•1h ago

Weren't we barely scraping 1-10% on this with state-of-the-art models a year ago, and wasn't it considered the final boss, i.e. solve this and it's almost AGI-like?

I ask because I cannot distinguish all the benchmarks by heart.

verdverm•56m ago

Here's a good thread over 1+ month, as each model comes out

https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...

tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark

Aperocky•33m ago

If you look at the problem space it is easy to see why it's toast, maybe there's intelligence in there, but hardly general.

verdverm•11m ago

the best way I've seen this described is "spikey" intelligence, really good at some points, those make the spikes

humans are the same way, we all have a unique spike pattern, interests and talents

ai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see "spikey clones"

Aperocky•8m ago

You can get more spiky with AIs, whereas with the human brain we are more hard-wired.

So maybe we are forced to be more balanced and general whereas AI don't have to.

fishpham•55m ago

Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - ie improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step function increase in intelligence for the Gemini line of models)

jstummbillig•19m ago

Could it also be that we are just a lot better than a year ago?

layer8•18m ago

Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?

theywillnvrknw•15m ago

* that you weren't supposed to be able to

XenophileJKO•8m ago

https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3

aeyes•32m ago

https://arcprize.org/leaderboard

$13.62 per task - so we need another 5-10 years for the price to run this to become reasonable?

But the real question is if they just fit the model to the benchmark.
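A rough way to sanity-check the "5-10 years" guess above, as a sketch only: it assumes the per-task price falls by a constant factor every year, which nothing in the leaderboard guarantees. The function name, the target price, and the yearly drop factor are all hypothetical inputs, not figures from ARC Prize or Google.

```python
from math import log


def years_until_affordable(cost_now: float, target: float, yearly_drop: float) -> float:
    """Years until cost_now reaches target, ASSUMING cost divides by yearly_drop each year."""
    if cost_now <= 0 or target <= 0 or yearly_drop <= 1:
        raise ValueError("need positive costs and yearly_drop > 1")
    if cost_now <= target:
        return 0.0  # already at or below the target price
    # cost_now / yearly_drop**t == target  =>  t = log(cost_now / target) / log(yearly_drop)
    return log(cost_now / target) / log(yearly_drop)


if __name__ == "__main__":
    # $13.62 per task is from the comment; $0.10 and the drop factors are made-up assumptions.
    for drop in (2.0, 10.0):
        years = years_until_affordable(13.62, 0.10, drop)
        print(f"{drop:g}x cheaper per year -> {years:.1f} years")
```

Under these made-up inputs the answer swings from roughly two years (10x cheaper per year) to roughly seven (2x cheaper per year), so the estimate depends almost entirely on the assumed rate of price decline.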

igravious•17m ago

That's not a long time in the grand scheme of things.

throwup238•6m ago

Speak for yourself. Five years is a long time to wait for my plans of world domination.

saberience•26m ago

Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.

It's completely misnamed. It should be called useless visual puzzle benchmark 2.

Firstly, it's a visual puzzle, making it way easier for humans than for models trained on text. Secondly, it's not really that obvious or easy for humans to solve themselves!

So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve "Arc-AGI"

CuriouslyC•19m ago

The puzzles are calibrated for human solve rates, but otherwise I agree.

saberience•13m ago

My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc.

I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"

simianwords•1h ago

OT but my intuition says that there’s a spectrum

- non thinking models

- thinking models

- best of N models like deep think and gpt pro

Each one is of a certain computational complexity. Simplifying a bit, I think they map to - linear, quadratic and n^3 respectively.

I think there is a certain class of problems that can’t be solved without thinking because they necessarily involve writing in a scratchpad. And the same goes for best of N, which involves exploring.

Two open questions

1) what’s the higher level here, is there a 4th option?

2) can a sufficiently large non thinking model perform the same as a smaller thinking?

NitpickLawyer•1h ago

> best of N models like deep think and gpt pro

Yeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.

What's even more interesting than maj@n or best of n is pass@n. For a lot of applications you can frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is, all you care about is that you find more as you spend more time. Literally throwing money at the problem.
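To make the pass@n point concrete, here is a minimal sketch. It uses the commonly cited combinatorial estimator for pass@k, which estimates the metric from n sampled attempts of which c passed an automatic check. It also includes the simpler closed form that holds only if attempts are independent with a fixed success rate, which is an assumption. The function names are illustrative, not from any lab's tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which passed the checker."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n attempts contains no passing attempt (comb returns 0 if k > n - c).
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_n_independent(p: float, n: int) -> float:
    """pass@n ASSUMING each attempt succeeds independently with probability p."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Even a weak pass@1 compounds if you can afford many verified attempts.
    for n in (1, 10, 100):
        print(n, round(pass_at_n_independent(0.02, n), 3))
```

This only pays off when a cheap, reliable checker exists (an exploit that runs, a kernel that benchmarks faster). Without one, you are back to maj@n or best of n and need some way to pick a winner.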

mnicky•1h ago

> can a sufficiently large non thinking model perform the same as a smaller thinking?

Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).

simianwords•1h ago

It's interesting that Opus 4.6 added a parameter to make it think extra hard.

syntaxing•1h ago

Why a Twitter post and not the official Google blog post… https://blog.google/innovation-and-ai/models-and-research/ge...

meetpateltech•1h ago

The official blog post was submitted earlier (https://news.ycombinator.com/item?id=46990637), but somehow this story ranked up quickly on the homepage.

verdverm•58m ago

@dang will often replace the post url & merge comments

HN guidelines prefer the original source over social posts linking to it.

aavci•57m ago

Agreed - blog post is more appropriate than a twitter post

dang•40m ago

Just normal randomness I suppose. I've put that URL at the top now, and included the submitted URL in the top text.

jonathanstrange•1h ago

Unfortunately, it's only available in the Ultra subscription if it's available at all.

xnx•1h ago

Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind.

dfdsf2•1h ago

Trick? Lol not a chance. Alphabet is a pure play tech firm that has to produce products to make the tech accessible. They really lack in the latter and this is visible when you see the interactions of their VP's. Luckily for them, if you start to create enough of a lead with the tech, you get many chances to sort out the product stuff.

dakolli•22m ago

You sound like Russ Hanneman from SV

amunozo•1h ago

Those black nazis in the first image model were a cause of inside trading.

neilellis•1h ago

Less than a year to destroy Arc-AGI-2 - wow.

Davidzheng•1h ago

I unironically believe that arc-agi-3 will have an introduction-to-solved time of 1 month

etyhhgfff•36m ago

The AGI bar has to be set even higher, yet again.

saberience•9m ago

It's a useless meaningless benchmark though, it just got a catchy name, as in, if the models solve this it means they have "AGI", which is clearly rubbish.

Arc-AGI score isn't correlated with anything useful.

jabedude•6m ago

how would we actually objectively measure a model to see if it is AGI if not with benchmarks like arc-AGI?

XCSme•6m ago

But why only a +0.5% increase for MMMU-Pro?

vessenes•1h ago

Not trained for agentic workflows yet unfortunately - this looks like it will be fantastic when they have an agent friendly one. Super exciting.

dakolli•15m ago

It's really weird how you all are begging to be replaced by LLMs. You think if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?

If agents get good enough, they're not going to build some profitable startup for you (or whatever people think they're doing with the LLM slot machines), because that implies that anyone else with access to that agent can just copy you. It's what they're designed to do... launder IP/copyright. It's weird to see people get excited for this technology.

None of this is good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note of how all these CEOs are trying to make it sound cool to "go to trade school" or how we need "strong American workers to work in factories".

ramshanker•54m ago

Do we get any model architecture details like parameter count etc.? A few months back, we used to talk more about this; now it's mostly about model capabilities.

Davidzheng•52m ago

I'm honestly not sure what you mean? The frontier labs have kept arch as secrets since gpt3.5

sinuhe69•46m ago

I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].

And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.

[1] https://1stproof.org/

zozbot234•30m ago

The 1st proof original solutions are due to be published in about 24h, AIUI.

simonw•45m ago

The pelican riding a bicycle is excellent. I think it's the best I've seen.

https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

deron12•43m ago

It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!

Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?

gs17•17m ago

It depends, if you meant from a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.

dfdsf2•9m ago

Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.

Manabu-eo•42m ago

How likely is it that this problem is already in the training set by now?

verdverm•39m ago

I've heard it posited that the reason the frontier companies are frontier is because they have custom data and evals. This is what I would do too

throwup238•37m ago

For every combination of animal and vehicle? Very unlikely.

The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.

recursive•33m ago

No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.

zarzavat•35m ago

You can always ask for a tyrannosaurus driving a tank.

simonw•33m ago

If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.

throwup238•41m ago

The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)

saberience•24m ago

Do you have to still keep trying to bang on about this relentlessly?

It was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.

Again, like I said before, it's also a terrible benchmark.

Davidzheng•20m ago

Eh, I find it more of a not very informative but lighthearted commentary

dfdsf2•14m ago

Highly disagree.

I was expecting something more realistic... the true test of what you are doing is how representative the thing is of the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view.

If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.

peaseagee•7m ago

The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.

okokwhatever•27m ago

I need to test the sketch creation ASAP. I need this in my life because learning to use FreeCAD is too difficult for a busy person like me (and frankly, also quite lazy)

sho_hn•25m ago

FWIW, the FreeCAD 1.1 nightlies are much easier and more intuitive to use due to the addition of many on-canvas gizmos.

ismailmaj•20m ago

top 10 elo in codeforces is pretty absurd

siva7•17m ago

I can't shake off the feeling that Google's Deep Think models are not really different models but just the old ones being run with a higher number of parallel subagents, something you can do by yourself with their base model and opencode.

Davidzheng•16m ago

And after I do that, how do I combine the output of 1000 subagents into one output? (I'm not being snarky here, I think it's a nontrivial problem)
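For short, checkable answers, the two textbook aggregation schemes can be sketched as below. This is an illustration under assumptions, not a description of how Deep Think works: the function names are made up, and `score` stands in for a hypothetical verifier or reward model that you would have to supply yourself.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """maj@n: return the most common answer after light normalisation."""
    if not answers:
        raise ValueError("no answers to aggregate")
    normalised = [a.strip().lower() for a in answers]
    # most_common orders equal counts by first appearance, so ties go to the earliest answer.
    return Counter(normalised).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """best-of-n: return the answer that the supplied (hypothetical) scorer rates highest."""
    if not answers:
        raise ValueError("no answers to aggregate")
    return max(answers, key=score)


if __name__ == "__main__":
    samples = ["42", " 42 ", "41", "forty-two"]
    print(majority_vote(samples))
    # len is a toy stand-in for a real scorer; it just prefers longer answers.
    print(best_of_n(samples, score=len))
```

Neither scheme covers the hard case raised here: free-form outputs such as proofs or code, where no two of the 1000 results match exactly and the scorer is itself a model. That case needs a separate synthesis step that reads the candidates and writes one answer.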