Now we can create new samples and evals for more complex tasks to train up the next gen: more planning, decomposition, context handling, and agentic work.
OpenAI has largely fumbled their early lead; the exciting stuff is happening elsewhere.
From what I understand, nobody has done any real scaling since the GPT-4 era. 4.5 was a bit larger than 4, but not as much as the orders of magnitude difference between 3 and 4, and 5 is smaller than 4.5. Google and Anthropic haven't gone substantially bigger than GPT-4 either. Improvements since 4 are almost entirely from reasoning and RL. In 2026 or 2027, we should see a model that uses the current datacenter buildout and actually scales up.
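For a sense of scale: training compute is commonly approximated as C ≈ 6·N·D (FLOPs ≈ 6 × parameters × training tokens). A minimal sketch, using GPT-3's published figures and widely-circulated but unconfirmed rumors for GPT-4 (OpenAI has never disclosed its size):

    # Back-of-the-envelope training compute via the standard C ≈ 6·N·D rule.
    # GPT-3 figures are published; the GPT-4 numbers are rumors, used here
    # purely for illustration.

    def train_flops(params: float, tokens: float) -> float:
        """Approximate training FLOPs: C ≈ 6 * N * D."""
        return 6 * params * tokens

    gpt3 = train_flops(175e9, 300e9)    # 175B params, ~300B tokens (published)
    gpt4 = train_flops(1.8e12, 13e12)   # ~1.8T params, ~13T tokens (rumored)

    print(f"GPT-3: ~{gpt3:.1e} FLOPs")  # ~3.2e23
    print(f"GPT-4: ~{gpt4:.1e} FLOPs")  # ~1.4e26, roughly a 400x jump

Nothing released since is rumored to have made a comparable leap, which is consistent with the improvements coming from reasoning and RL rather than raw scale.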
ARC AGI v2: 17.6% -> 52.9%
SWE Verified: 76.3% -> 80%
That's pretty good!
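Putting the two deltas side by side (numbers from the comment above):

    arc_before, arc_after = 17.6, 52.9
    swe_before, swe_after = 76.3, 80.0

    print(f"ARC AGI v2:   +{arc_after - arc_before:.1f} pts ({arc_after / arc_before:.2f}x)")
    print(f"SWE Verified: +{swe_after - swe_before:.1f} pts ({swe_after / swe_before:.2f}x)")
    # ARC AGI v2:   +35.3 pts (3.01x)
    # SWE Verified:  +3.7 pts (1.05x)

The ARC jump is the striking one; SWE Verified is close to saturated.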
It'll be interesting to see the cost per task on ARC AGI v2.
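A minimal sketch of that arithmetic; per-task token usage and pricing for this model aren't public, so every number below is an assumption:

    # Hypothetical cost-per-task estimate. Prices and token counts are
    # assumptions for illustration, not the model's actual figures.

    price_in = 1.25      # assumed $ per 1M input tokens
    price_out = 10.00    # assumed $ per 1M output tokens

    tok_in = 5_000       # assumed input tokens per ARC task
    tok_out = 50_000     # assumed output tokens (reasoning traces dominate)

    cost = tok_in / 1e6 * price_in + tok_out / 1e6 * price_out
    print(f"~${cost:.2f} per task")  # ~$0.51 under these assumptions

Since reasoning tokens are billed as output, the output side dominates the cost at almost any plausible ratio.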
(edit: my apologies, I hadn't read enough on the topic)
No wall yet, and I think we might have already crossed the threshold of models being as good as or better than most engineers.
GDPval will be an interesting benchmark, and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can keep going like this just a little bit further, many office workers will stop being useful... I don't know yet how to feel about this.
Great for humanity, probably, but what about the individuals?
I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.
brokencode•7m ago
I don’t think it’s publicly known for sure how different the models really are. You can improve a lot just by improving the post-training set.