The Wayback Machine still has it: https://web.archive.org/web/20251118111103/https://storage.g...
Here’s the archived PDF: https://web.archive.org/web/20251118111103/https://storage.g...
Also notable is which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub of Grok 4 / Grok 4.1.
https://firstpagesage.com/reports/top-generative-ai-chatbots... suggests Grok accounts for about 0.6% of chat use cases, well below the other big names, and I suspect its share in chat is higher than in other scenarios like business usage. Given all that, I can see how Gemini might not be focused on competing with them.
I would want to hear more detail about prompts, frameworks, thinking time, etc., but they don't matter too much. The main caveat is that this was probably run on the public test set, so it could be in the pretraining data, and there could even have been some ARC-focused post-training - I think we don't know yet and might never know.
But for any reasonable setup, assuming no egregious cheating, that is an amazing score on ARC-AGI-2.
Also, I really hoped for a 2M+ context window. I'm living on the context edge even with 1M.
rvz•1h ago
Well, don't complain when you're using Gmail and your emails are being used to train Gemini.
patates•4m ago