The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview
edit: they just removed the reference to "3.1" from the pdf
They never will do on private set, because it would mean its being leaked to google.
Wow.
https://blog.google/innovation-and-ai/models-and-research/ge...
I ask because I cannot distinguish all the benchmarks by heart.
https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark
humans are the same way, we all have a unique spike pattern, interests and talents
ai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see "spikey clones"
So maybe we are forced to be more balanced and general whereas AI don't have to.
$13.62 per task - so we need another 5-10 years for the price to run this to become reasonable?
But the real question is if they just fit the model to the benchmark.
It's completely misnamed. It should be called useless visual puzzle benchmark 2.
It's a visual puzzle, making it way easier for humans than for models trained on text firstly. Secondly, it's not really that obvious or easy for humans to solve themselves!
So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve "Arc-AGI"
I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"
- non thinking models
- thinking models
- best of N models like deep think an gpt pro
Each one is of a certain computational complexity. Simplifying a bit, I think they map to - linear, quadratic and n^3 respectively.
I think there are certain class of problems that can’t be solved without thinking because it necessarily involves writing in a scratchpad. And same for best of N which involves exploring.
Two open questions
1) what’s the higher level here, is there a 4th option?
2) can a sufficiently large non thinking model perform the same as a smaller thinking?
Yeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.
What's even more interesting than maj@n or best of n is pass@n. For a lot of applications youc an frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is, all you care is that you find more as you spend more time. Literally throwing money at the problem.
Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).
HN guidelines prefer the original source over social posts linking to it.
Arc-AGI score isn't correlated with anything useful.
If Agents get good enough it's not going to build some profitable startup for you (or whatever people think they're doing with the llm slot machines) because that implies that anyone else with access to that agent can just copy you, its what they're designed to do... launder IP/Copyright. Its weird to see people get excited for this technology.
None of this good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note on how all these CEOs are trying to make it sound cool to "go to trade school" or how we need "strong American workers to work in factories".
And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.
Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?
The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
It was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.
Again, like I said before, it's also a terrible benchmark.
I was expecting something more realistic... the true test of what you are doing is how representative is the thing in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesnt pass muster in my view.
If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.
Metacelsus•1h ago
Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).
simianwords•1h ago
throwup238•1h ago
neilellis•1h ago
NitpickLawyer•1h ago
IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.
Davidzheng•1h ago
verdverm•53m ago
CuriouslyC•19m ago
nkzd•16m ago