I've thought before that AI is as "intelligent" as your smartphone is "smart," but I didn't think "reasoning" would be just another buzzword.
> Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?
Define reasoning, define generalizable, define pattern matching.
For additional credits after you have done so, show humans are capable of what you just defined as generalizable reasoning.
Flash answered correctly in ~2 seconds, at most. Pro answered very wrongly after thinking and elaborating for ~5 minutes.
Flash was also giving a wrong answer for the same string in the past, but it improved.
Prompt was the same: "Hey, can you decode $BASE64_string?"
I have no further comments.
the easy part is done but the hard part is so hard it takes years to progress
Besides, AI already passes the Turing test (or at least, is most likely to fail because it is too articulate and reasonable). There is a pretty good argument we've already achieved AGI and now we're working on achieving human- and superhuman-level intelligence in AGI.
It's better today. Hoping that LLMs can get us to AGI in one hop was naive. Depending on definition of AGI we might be already there. But for superhuman level in all possible tasks there are many steps to be done. The obvious way is to find a solution for each type of tasks. We have already for math calculations, it's using tools. Many other types can be solved the same way. After a while we'll gradually get to well rounded 'brain', or model(s) + support tools.
So, so far future looks bright, there is progress, problems, but not deadlocks.
PS: Turing test is a <beep> nobody seriously talks about today.
The first Boeing 747 was rolled out in 1968, only 65 years after the first successful heavier-than-air flight. If you told people back then that not much will fundamentally change in civil aviation over the next 57 years, no one would have believed you.
Time and again, for centuries - with the pace picking up dramatically in recent decades - we thought we were special and we were wrong. Sun does not rotate around the earth, which is a pretty typical planet, with the same chemical composition of any other planet. All of a sudden we're not the only ones who could calculate, then solve symbolic equations, then play chess, then compose music, then talk, then reason (up to a point, for some definition of "reason"). You get my point.
And when we were not only matched, but dramatically surpassed in these tasks (and not a day earlier), we concluded that they weren't _really_ what made us special.
At this point, it seems to me reasonable to assume we're _not_ special, and the onus should be on anybody claiming that we are to at least attempt to mention in passing what is the secret sauce that we have (even if we can't quite say what it is without handwaving or using concepts that by definition can not be defined - "qualia is the indescribable feeling of red - its redness (?)).
Oh, and sorry, I could never quite grasp what "sentient" is supposed to mean - would we be able to tell we're not sentient if we weren't?
Spooky stuff.
Very clever, I must say. Kudos to folks who made this particular choice.
> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
This is fascinating! We need more "mapping" of regimes like this!
What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to economic value of the task.
For that, the eval needs to go beyond puzzles but the complexity of the tasks still need to be controllable.
In 2025 they got a 313% gain (4.13 output factor).
Fusion is actually here and working. It’s not cost effective yet but to pretend there has been no progress or achievements is fundamentally false.
Fusion News, May 28th, 2025 https://www.youtube.com/watch?v=1YHcI-SfKx8
And also; the frontier LLMs blow older LLMs out of the water. There is continual progress and this study would have been structured substantially the same 2 years ago with much smaller N on the graphs because the regimes were much tinier then.
I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well-enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a biproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
> because omniscient-yet-dim-witted models terminate at "superhumanly assistive"
It might be that with dim wits + enough brute force (knowledge, parallelism, trial-and-error, specialisation, speed) models could still substitute for humans and transform the economy in short order.
I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.
What scares me is that I think there's a reasoning/agency capabilities overhang. ie. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only through dint of applying parallelism to actually competent outcome-modelling and strategic decision making).
That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans (in terms of speed and quality) in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.
This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.
Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:
1. Reasoning/strategising step-by-step for very long periods
2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)
Getting good at the long range thinking might require more substantial architectural changes (eg. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.
Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.
behnamoh•5h ago
It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".
suddenlybananas•5h ago
>It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".
This is a patently absurd thing to write about a research paper.
bwfan123•3h ago
this work balances the hype and shows fundamental limitations so the AI hypesters are checked.
why be salty ?