The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

488•amrrs•8mo ago

Comments

behnamoh•8mo ago

Okay Apple, you got my attention. But I'm a strong proponent of "something is better than nothing" philosophy—even if OpenAI/Google/etc. are building reasoning models with the limitations that you describe, they are still a huge progress compared to what we had not long ago. Meanwhile you're not even trying.

It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".

suddenlybananas•8mo ago

I think you're mistaking the work of researchers who work at Apple with the particular investment decisions of Apple over the past few years.

>It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".

This is a patently absurd thing to write about a research paper.

bwfan123•8mo ago

there is enough hype already - with AGI being promised as imminent.

this work balances the hype and shows fundamental limitations so the AI hypesters are checked.

why be salty ?

ivape•8mo ago

This is easily explained by accepting that there is no such thing as LRMs. LRMs are just LLMs that iterate on its own answers more (or provides itself more context information of a certain type). The reasoning loop on an "LRM" will be equivalent to asking a regular LLM to "refine" its own response, or "consider" additional context of a certain type. There is no such thing as reasoning basically, as it was always a method to "fix" hallucinations or provide more context automatically, nothing else. These big companies baked in one of the hackiest prompt engineering tricks that your typical enthusiast figured out long ago and managed to brand it and profit off it. The craziest part about this was Deepseek was able to cause a multi billion dollar drop and pump of AI stocks with this one trick. Crazy times.

AlienRobot•8mo ago

Is that what "reasoning" means? That sounds pretty ridiculous.

I've thought before that AI is as "intelligent" as your smartphone is "smart," but I didn't think "reasoning" would be just another buzzword.

ngneer•8mo ago

I am not too familiar with the latest hype, but "reasoning" has a very straightforward definition in my mind. For example, can the program in question derive new facts from old ones in a logically sound manner. Things like applying modus ponens. (A and A => B) => B. Or, all men are mortal and Socrates is a man, and therefore Socrates is mortal. If the program cannot deduce new facts, then it is not reasoning, at least not by my definition.

dist-epoch•8mo ago

When people say LLMs can't do X, I like to try it.

    Q: Complete 3 by generating new knowledge:
    1. today is warm
    2. cats likes warm temperatures
    3.

A: Therefore, a cat is likely to be enjoying the weather today.

Q: does the operation to create new knowledge you did have a specific name?

A: ... Deductive Reasoning

Q: does the operation also have a Latin name?

A: ... So, to be precise, you used a syllogismus (syllogism) that takes the form of Modus Ponens to make a deductio (deduction).

https://aistudio.google.com/app/prompts/1LbEGRnzTyk-2IDdn53t...

People then say "of course it could do that, it just pattern matched a Logic text book. I meant in a real example, not an artificially constructed one like this one. In a complex scenario LLMs obviously can't do Modus Ponens.

ngneer•8mo ago

I do not know whether the state of the art is able to reason or not. The textbook example you gave is admittedly not very interesting. What you are hearing from people is that parroting is not reasoning, which is true.

I wonder if the state of the art can reason its way through the following:

"Adam can count to 14000. Can Adam count to 13500?"

The response needs to be affirmative for every X1 and X2 such that X2 <= X1. That is reasoning. Anything else is not reasoning.

The response when X2 > X1 is less interesting. But, as a human it might be "Maybe, if Adam has time" or "Likely, since counting up to any number uses the same algorithm" or "I don't know".

Seems ChatGPT can cope with this. Other examples are easy to come up with, too. There must be benchmarks for this.

Input to ChatGPT:

"Adam can lift 1000 pounds of steel. Can Adam lift 1000 pounds of feathers?"

Output from ChatGPT:

"1,000 pounds of feathers would be much easier for Adam to lift compared to 1,000 pounds of steel, because feathers are much lighter and less dense."

So, maybe not there yet...

dist-epoch•8mo ago

> "Adam can lift 1000 pounds of steel. Can Adam lift 1000 pounds of feathers?"

Worked for me:

https://chatgpt.com/share/6844813a-6e4c-8006-b560-c0be223eeb...

gemma3-27b, a small model, had an interesting take:

> This is a classic trick question!

> While Adam can lift 1000 pounds, no, he likely cannot lift 1000 pounds of feathers.

> Volume: Feathers take up a huge amount of space for their weight. 1000 pounds of feathers would be an enormous volume – likely far too large for Adam to even get under, let alone lift. He'd be trying to lift a massive, bulky cloud.

> Practicality: Even if he could somehow get it under a barbell, the feathers would shift and compress, making a secure grip impossible.

> The question plays on our understanding of weight versus volume. It's designed to make you focus on the "1000 pounds" and forget about the practicalities of lifting something so voluminous.

Tried the counting question on the smallest model, gemma-3n-34b, it can run on a smartphone:

> Yes, if Adam can count to 14000, he can definitely count to 13500. Counting to a smaller number is a basic arithmetic operation. 13500 is less than 14000.

ngneer•8mo ago

Thanks for trying these out :). Highlights the often subtle difference between knowing the answer and deducing the answer. Feathers could be ground into a pulp and condensed, too. I am not trying to be clever, just seems like the response is a canned answer.

JSR_FDED•8mo ago

A reasoning model is an LLM that has had additional training phases that reward problem solving abilities. (But in a black box way - it’s not clear if the model is learning actual reasoning or better pattern matching, or memorization, or heuristics… maybe a bit of everything).

meroes•8mo ago

Yep. This is exactly the conclusion I reached as an RLHF'er. Reasoning/LRM/SxS/CoT is "just" more context. There never was reasoning. But of course, more context can be good.

Too•8mo ago

The million dollar question is how far can one get on this trick. Maybe this is exactly how our own brains operate? If not, what fundamental building blocks are missing to get there.

bwfan123•8mo ago

> If not, what fundamental building blocks are missing to get there

If I were to guess, the missing building block is the ability to abstract - which is the ability to create a symbol to represent something. Concrete example of abstraction is seen in the axioms of lambda calculus. 1) ability to posit a variable, 2) ability to define a function using said variable, and 3) the ability to apply functions to things. Abstraction arises from a process in the brain which we have not understood yet and could be outside of computation as we know it per [1]

[1] https://www.amazon.com/Emperors-New-Mind-Concerning-Computer...

bird0861•8mo ago

No. It's not microtubules. Enough with the g-darn microtubules already. https://www.biorxiv.org/content/10.1101/712794v1

"We used an antimicrotubular agent (parbendazole) and disrupted microtubular dynamics in paramecium to see if microtubules are an integral part of information storage and processing in paramecium’s learning process. We observed that a partial allosteric modulator of GABA (midazolam) could disrupt the learning process in paramecium, but the antimicrotubular agent could not. Therefore, our results suggest that microtubules are probably not vital for the learning behavior in P. caudatum. Consequently, our results call for a further revisitation of the microtubular information processing hypothesis."

JusticeJuice•8mo ago

Their finding of LLMs working best at simple tasks, LRMs working best at medium complexity tasks, and then neither succeeding at actually complex tasks is good to know.

cubefox•8mo ago

Not sure whether I sense sarcasm.

nialv7•8mo ago

I've seen this too often, papers that ask questions they don't even bother to properly define.

> Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?

Define reasoning, define generalizable, define pattern matching.

For additional credits after you have done so, show humans are capable of what you just defined as generalizable reasoning.

NitpickLawyer•8mo ago

> show humans are capable of what you just defined as generalizable reasoning.

I would also add "and plot those capabilities on a curve". My intuition is that the SotA models are already past the median human abilities in a lot of areas.

crvdgc•8mo ago

In the context of this paper, I think "generalizable reasoning" means that finding a method to solve the puzzle and thus being able to execute the method on puzzle instances of arbitrary complexity.

beneboy•8mo ago

This kind of explains why Claude will find the right solution, but then the more it thinks and keeps “improving” the more over-engineered (and sometimes wrong) the solution is. Interesting to see this coming up in formal research.

bicepjai•8mo ago

The study challenges the assumption that more “thinking” or longer reasoning traces necessarily lead to better problem-solving in LRMs

bayindirh•8mo ago

As a test, I asked Gemini 2.5 Flash and Gemini 2.5 Pro to decode a single BASE64 string.

Flash answered correctly in ~2 seconds, at most. Pro answered very wrongly after thinking and elaborating for ~5 minutes.

Flash was also giving a wrong answer for the same string in the past, but it improved.

Prompt was the same: "Hey, can you decode $BASE64_string?"

I have no further comments.

rafterydj•8mo ago

well that's not a very convincing argument. That's just a failure to recognize when the use of a tool- base64 decoder- is needed, not a reasoning problem at all, right?

BoorishBears•8mo ago

That's not really a cop out here: both models had access to the same tools.

Realistically there are many problems that non-reasoning models do better on, especially when the answer cannot be solved by a thought process: like recalling internal knowledge.

You can try to teach the model the concept of a problem where thinking will likely steer it away from the right answer, but at some point it becomes like the halting problem... how does the model reliably think its way into the realization a given problem is too complex to be thought out?

bayindirh•8mo ago

I don't know whether Flash uses a tool or not, but it answers pretty quickly. However, Pro opts to use its own reasoning, not a tool. When I look at the reasoning train, it pulls and pulls knowledge endlessly, refining that knowledge and drifting away.

Jensson•8mo ago

Translating to BASE64 is a good test to see how well it works as a language translator without changing things, because its the same skill for an AI model.

If the model changes things it means it didn't really capture the translation patterns for BASE64, so then who knows what it will miss when translating between languages if it can't even do BASE64?

layer8•8mo ago

A moderately smart human who understands how Base64 works can decode it by hand without external tools other than pen and paper. Coming up with the exact steps to perform is a reasoning problem.

einsteinx2•8mo ago

If the reasoning model was truly reasoning while the flash model was not then by definition shouldn’t it be better at knowing when to use the tool than the non-reasoning model? Otherwise it’s not really “smarter” as claimed, which seems to line up perfectly with the paper’s conclusion.

actinium226•8mo ago

Man, remember when everyone was like 'AGI just around the corner!' Funny how well the Gartner hype cycle captures these sorts of things

bayindirh•8mo ago

They're similar to self-driving vehicles. Both are around the corner, but neither can negotiate the turn.

einrealist•8mo ago

All that to keep the investment pyramid schemes going.

kfarr•8mo ago

Waymo's pretty good at unprotected lefts

bayindirh•8mo ago

Waymo is pretty good at (a finite number of) unprotected lefts, and this doesn't count as "level 5 autonomous driving".

hskalin•8mo ago

And commerically viable nuclear fusion

mgiampapa•8mo ago

I harvest fusion energy every single day... It's just there in the sky, for free!

nmca•8mo ago

I saw your comment and counted — in May I took a Waymo thirty times.

bayindirh•8mo ago

Waymo is a popular argument in self-driving cars, and they do well.

However, Waymo is Deep Blue of self-driving cars. Doing very well in a closed space. As a result of this geofencing, they have effectively exhausted their search space, hence they work well as a consequence of lack of surprises.

AI works well when search space is limited, but General AI in any category needs to handle a vastly larger search space, and they fall flat.

At the end of the day, AI is informed search. They get inputs, and generate a suitable output as deemed by their trainers.

anonzzzies•8mo ago

Yeah, AI has been good for a long time in limited search space areas. So good that many of these things that were called AI in the past are not called AI now, but 'just' 'algorithm'.

bayindirh•8mo ago

Everything is “just” an algorithm. LLM is a weighted graph with some randomization which is tuned with tons of data. You have input and output encoders on top of it.

That’s all.

MattRix•8mo ago

This view of Waymo doesn’t account for the fact that self driving is about a lot more than just taking the right roads. It has to deal with other drivers, construction, road closures, pedestrians, bikes, etc.

bayindirh•8mo ago

What I wrote is exactly the opposite. Quoting myself:

> hence they work well as a consequence of lack of surprises. Emphasis mine.

In this context, "lack of surprises" is exactly the rest of the driving besides road choice. In the same space, the behaviors of other actors are also a finite set, or more precisely, can be predicted with much better accuracy.

I drive the same route for ~20 years for commute. The events which surprise me are few and far between, because other people's behavior in that environment is a finite set, and they all behave very predictably, incl. pedestrians, bikes, and other drivers.

Choosing roads are easy, handling surprises hard, but if you saw most potential surprises, then you can drive even without thinking. While I'm not proud of it, my brain took over and drove me home a couple of times on that route when I was too tired to think.

mhast•8mo ago

I suspect that Waymo car's could operate in a lot more areas than they do. The issue is that Waymo are trying to sell the service of safe travel and not a car with an addon you can pay for which doesn't actually work.

In other words, since they accept liability for their cars it's not in their interest to roll out the service too fast. It makes more sense to do it slow and steady.

It's not really a strong argument that their technology is incapable of working in general areas.

yahoozoo•8mo ago

We will be treating LLMs “like a junior developer” forever.

JKCalhoun•8mo ago

And I'm fine with that.

sneak•8mo ago

Even if they never get better than they are today (unlikely) they are still the biggest change in software development and the software development industry in my 28 year career.

anonzzzies•8mo ago

That's for sure; I said with the original chatgpt already ; if this is the level it stays at but just becomes (much) faster and open, it's already a bizar revolution. Something many old former (and current, as I see online, but I don't know any personally) AI students/researchers did not think possible in our lifetime, and there it was.

tonyhart7•8mo ago

I think we just around at 80% of progress

the easy part is done but the hard part is so hard it takes years to progress

georgemcbay•8mo ago

> the easy part is done but the hard part is so hard it takes years to progress

There is also no guarantee of continued progress to a breakthrough.

We have been through several "AI Winters" before where promising new technology was discovered and people in the field were convinced that the breakthrough was just around the corner and it never came.

LLMs aren't quite the same situation as they do have some undeniable utility to a wide variety of people even without AGI springing out of them, but the blind optimism that surely progress will continue at a rapid pace until the assumed breakthrough is realized feels pretty familiar to the hype cycle preceding past AI "Winters".

Swizec•8mo ago

> We have been through several "AI Winters" before

Yeah, remember when we spent 15 years (~2000 to ~2015) calling it “machine learning” because AI was a bad word?

We use so much AI in production every day but nobody notices because as soon as a technology becomes useful, we stop calling it AI. Then it’s suddenly “just face recognition” or “just product recommendations” or “just [plane] autopilot” or “just adaptive cruise control” etc

You know a technology isn’t practical yet because it’s still being called AI.

blks•8mo ago

I don’t think there’s any “AI” in aircraft autopilots.

withinboredom•8mo ago

AI encompasses a wide range of algorithms and techniques; not just LLMs or neural nets. Also, it is worth pointing out that the definition of AI has changed drastically over the last few years and narrowed pretty significantly. If you’re viewing the definition from the 80–90’s, most of what we call "automation" today would have been considered AI.

Jensson•8mo ago

Autopilots were a thing before computers were a thing, you can implement one using mechanics and control theory. So no, traditional autopilots are not AI under any reasonable definition, otherwise every single machine we build would be considered AI as almost all machines has some form of control systems in them, for example is your microwave clock an AI?

So I'd argue any algorithm that comes from control theory is not AI, those are just basic old dumb machines. You can't make planes without control theory, humans can't keep a plane steady without it, so Wrights Brothers adding this to their plane is why they succeeded making a flying machine.

So if autopilots are AI then the Wrights Brothers developed an AI to control their plane. I don't think anyone sees that as AI, not even at the time they did the first flight.

trc001•8mo ago

Uh, the bellman equation was first used for control theory and is the foundation of modern reinforcement learning... so wouldn't that imply LLMs "come from" control theory?

fc417fc802•8mo ago

Is the training algorithm the AI or is the model that you get at the end the AI?

fc417fc802•8mo ago

Ah yes the mythical strawman definition of AI that you can never seem to pin down, was never rigorous, and never enjoyed wide expert acceptance. It's on par with "well many people used to say, or at least so I've been told, that ...".

Swizec•8mo ago

That’s the point: AI is a marketing term and always has been. The underlying tech changes with every hype wave.

One of the first humanoid robots was an 18th century clockwork mechanism inside a porcelain doll that autonomously wrote out “Cogito Ergo Sum” in cursive with a pen. It was considered thought provoking at the time because it implied that some day machines could think.

BBC video posted to reddit 10 years ago: https://www.reddit.com/r/history/s/d6xTeqfKCv

fc417fc802•8mo ago

It certainly sees use as an ever shifting marketing term. That does not exclude it from being a useful technical term. Indeed if the misuse of a term by marketers was sufficient to rob a word of meaning then I doubt we'd have any means of communication left.

> It was considered thought provoking at the time because it implied that some day machines could think.

What constitutes "thinking"? That's approximately the same question as what qualifies as AGI. LLMs and RL seem to be the first time humanity has achieved anything that begins to resemble that but clearly both of those come up short ... at least so far.

Meanwhile I'm quite certain that a glorified PID loop (ie autopilot) does not qualify as machine learning (AI if you'd prefer). If someone wants to claim that it does then he's going to need to explain how his definition excludes mechanical clockwork.

withinboredom•8mo ago

What do you think an executing LLM is? It’s basically a glorified PID loop. It isn’t learning anything new. It isn’t thinking about your conversation while you go take a poo.

And I think the point is that the definition doesn’t exclude pure mechanical devices since that’s exactly what a computer is.

fc417fc802•8mo ago

To claim that an LLM is equivalent to a PID loop is utterly ridiculous. By that logic a 747 is "basically a glorified lawn mower".

> It isn’t thinking about your conversation while you go take a poo.

The commercial offerings for "reasoning" models can easily run for 10 to 15 minutes before spitting out an answer. As to whether or not what it's doing counts as "thinking" ...

> the definition doesn’t exclude pure mechanical devices since that’s exactly what a computer is.

By the same logic a songbird or even a human is also a mechanical device. What's your point?

I never said anything about excluding mechanical devices. I referred to "mechanical clockwork" meaning a mechanical pocket watch or similar. If the claim is that autopilot qualifies as AI then I want to know how that gets squared with a literal pocket watch not being AI.

withinboredom•8mo ago

> The commercial offerings for "reasoning" models can easily run for 10 to 15 minutes before spitting out an answer. As to whether or not what it's doing counts as "thinking" ...

Tell me you don’t know how AI works without telling me you don’t know how AI works. After it sends you an output, the AI stops doing anything. Your conversation sits resident in ram for a bit, but there is no more processing happening.

It is waiting until you give it feedback... some might say it is a loop... a feedback loop ... that continues until the output has reached the desired state ... kinda sounds familiar ... like a PID loop where the human is the controller...

>To claim that an LLM is equivalent to a PID loop is utterly ridiculous.

Is it? It looks like one to me.

> By that logic a 747 is "basically a glorified lawn mower".

I don’t think a 747 can mow lawns, but I assume it has the horsepower to do it with some modifications.

danaris•8mo ago

AI is multiple things.

AI is a marketing term for various kinds of machine learning applications.

AI is an academic field within computer science.

AI is the computer-controlled enemies you face in (especially, but not solely, offline) games.

This has been the case for decades now—especially the latter two.

Trying to claim that AI either "has always been" one particular thing, or "has now become" one particular thing, is always going to run into trouble because of this multiplicity. The one thing that AI "has always been" is multiple things.

roenxi•8mo ago

What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so. Papers like this are the academics mapping out where the engineering efforts need to be directed to get there and it seems to be a relatively small number of challenges that are easier as the ones already overcome - we know machine learning can solve Towers of Hanoi, for example. It isn't fundamentally complicated like Baduk is. The next wall to overcome is more of a low fence.

Besides, AI already passes the Turing test (or at least, is most likely to fail because it is too articulate and reasonable). There is a pretty good argument we've already achieved AGI and now we're working on achieving human- and superhuman-level intelligence in AGI.

MoonGhost•8mo ago

> What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so

It's better today. Hoping that LLMs can get us to AGI in one hop was naive. Depending on definition of AGI we might be already there. But for superhuman level in all possible tasks there are many steps to be done. The obvious way is to find a solution for each type of tasks. We have already for math calculations, it's using tools. Many other types can be solved the same way. After a while we'll gradually get to well rounded 'brain', or model(s) + support tools.

So, so far future looks bright, there is progress, problems, but not deadlocks.

PS: Turing test is a <beep> nobody seriously talks about today.

latchup•8mo ago

To be fair, the technology sigmoid curve rises fastest just before its inflection point, so it is hard to predict at what point innovation slows down due to its very nature.

The first Boeing 747 was rolled out in 1968, only 65 years after the first successful heavier-than-air flight. If you told people back then that not much will fundamentally change in civil aviation over the next 57 years, no one would have believed you.

PantaloonFlames•8mo ago

And not just in aviation. Consider what aviation did to make the world smaller. Huge 2nd order changes. The COVID-19 pandemic would not have happened the way it did, if there were no Boeing or Airbus.

Big hard-to-predict changes ahead.

brookst•8mo ago

…but that was, like, two years ago? If we go from GPT2 to AGI in ten years that will still feel insanely fast.

piskov•8mo ago

We won’t

mirekrusin•8mo ago

I remember "stochastic parrot" and people saying it's fancy markov chain/dead end. You don't hear them much after roughly agentic coding appeared.

Marazan•8mo ago

Spicy autocomplete is still spicy autocomplete

mirekrusin•8mo ago

I'm not sure if system capable of ie. reasoning over images deserves this label anymore?

mrbungie•8mo ago

The thing is "spicy" or "glorified" autocomplete are not actually bad labels, they are autocomplete machines that are very good up to the point of convincing people that they think.

empiricus•8mo ago

Many ppl are good at convincing other ppl that they think, but also many fail at thinking and many fail at convincing.

PantaloonFlames•8mo ago

Yours seems like a c.2023 perspective of coding assistants. These days it’s well beyond autocomplete and “generate a function that returns the numbers from the Fibonacci sequence.”

But I would think that would be well understood here.

How can you reduce what is currently possible to spicy autocomplete? That seems pretty dismissive, so much so that I wonder if it is motivated reasoning on your part.

I’m not saying it’s good or bad; I’m just saying the capability is well beyond auto complete.

otabdeveloper4•8mo ago

AGI has always been "just around the corner", ever since computers were invented.

Some problems have become more tractable (e.g. language translation), mostly by lowering our expectations of what constitutes a "solution", but AGI is nowhere nearer. AGI is a secular milleniarist religion.

naasking•8mo ago

Interpreting "just around the corner" as "this year" sounds like your error. Most projections are are years out, at least.

IshKebab•8mo ago

Yeah it's already been 2½ years! How long does it take to develop artificial life anyway? Surely no more than 3 years? I demand my money back!

alansammarone•8mo ago

I have a somewhat similar point of view to the one voiced by other people, but I like to think about it slightly differently, so I'll chime in - here's my take (although, admittedly, I'm operating with a quite small reasoning budget (5 minutes tops)):

Time and again, for centuries - with the pace picking up dramatically in recent decades - we thought we were special and we were wrong. Sun does not rotate around the earth, which is a pretty typical planet, with the same chemical composition of any other planet. All of a sudden we're not the only ones who could calculate, then solve symbolic equations, then play chess, then compose music, then talk, then reason (up to a point, for some definition of "reason"). You get my point.

And when we were not only matched, but dramatically surpassed in these tasks (and not a day earlier), we concluded that they weren't _really_ what made us special.

At this point, it seems to me reasonable to assume we're _not_ special, and the onus should be on anybody claiming that we are to at least attempt to mention in passing what is the secret sauce that we have (even if we can't quite say what it is without handwaving or using concepts that by definition can not be defined - "qualia is the indescribable feeling of red - its redness (?)).

Oh, and sorry, I could never quite grasp what "sentient" is supposed to mean - would we be able to tell we're not sentient if we weren't?

ivape•8mo ago

I can give you a pretty wild explanation. Einstein was a freak of nature. Nature just gave him that "something" to figure out the laws of the universe. I'm avoiding the term God as to not tickle anyone incorrectly. Seriously, explain what schooling and environment gets you that guy. So, to varying degrees, all output is from the universe. It's hard for the ego to accept, surely we earned everything we ever produced ...

Spooky stuff.

keiferski•8mo ago

This analogy doesn’t really work, because the former examples are ones in which humanity discovered that it existed in a larger world.

The recent AI example is humanity building, or attempting to build, a tool complex enough to mimic a human being.

If anything, you could use recent AI developments as proof of humanity’s uniqueness - what other animal is creating things of such a scale and complexity?

suddenlybananas•8mo ago

I don't see how heliocentrisim or calculators have any bearing on the uniqueness of humans.

curious_cat_163•8mo ago

> Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically

Very clever, I must say. Kudos to folks who made this particular choice.

> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.

This is fascinating! We need more "mapping" of regimes like this!

What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to economic value of the task.

For that, the eval needs to go beyond puzzles but the complexity of the tasks still need to be controllable.

pegasus•8mo ago

Is (1) that surprising? If I ask someone a simple question but tell them to "think really hard about it", they'll be more likely to treat it as a trick question and look for a non-obvious answer. Overthinking it, basically.

curious_cat_163•8mo ago

It is hard to compare models with humans so not sure how to answer it for both. :)

But, for models, this is an interesting finding because a lot of LRMs are LLMs with a _bunch_ of post-training done on top. We know this about DeepSeek R1 (one of the models evaluated in the Apple paper) for sure. They write extensively about how they took DeepSeek-V3-Base and made R1 with it. [1]

If the post-training is resulting in lower performance on simpler tasks then it ought to inspire more research on how to make it so that it doesn't -- i.e., with more training (of any kind), we should be gaining more capabilities. This has been a problem with DNNs historically, btw. We had these issues when fine-tuning text/image classifiers as well. Some weight changes can be destructive. So, it has to be done with a _lot_ of care. And, I am sure folks are working on it, to be honest. Maybe some of them will say something here. :-)

[1] https://github.com/deepseek-ai/DeepSeek-R1

make3•8mo ago

Using puzzles is not special or anything, it has been done a million times since before (and including) the LSTM paper (1997) https://www.bioinf.jku.at/publications/older/2604.pdf

jimmySixDOF•8mo ago

The Arc Prize just released a new update and it's all minigame puzzles

https://arcprize.org/

8bitsrule•8mo ago

Fusion has been 25 years away for all of my life.

sneak•8mo ago

Fusion is net positive energy now; that happened in 2022 (+54%).

In 2025 they got a 313% gain (4.13 output factor).

Fusion is actually here and working. It’s not cost effective yet but to pretend there has been no progress or achievements is fundamentally false.

oneshtein•8mo ago

It will be cost effective in just 25 years.

sitkack•8mo ago

Negative Negs spit out low effort snark, they said the same thing about solar, electric cars, even multicore, jit, open source. Thanks for refuting them, the forum software itself should either quarantine the response or auto respond before the comment is submitted. These people don't build the future.

Fusion News, May 28th, 2025 https://www.youtube.com/watch?v=1YHcI-SfKx8

lrhegeba•8mo ago

It isnt when you look at Q total. Total energy input for all needed support systems versus energy produced. See https://en.wikipedia.org/wiki/Fusion_energy_gain_factor for more details

einsteinx2•8mo ago

Yes it’s a net positive if you ignore the 100x extra power it took to actually run the reactor, so actually no it’s not net energy positive. Not even close.

Nothing to do with cost, we can not build a fusion reactor in 2025 with any amount of money that will produce more energy and goes into it.

benlivengood•8mo ago

These are the kind of studies that make so much more sense than the "LLMs can't reason because of this ideological argument or this one anecdote" posts/articles. Keep 'em coming!

And also; the frontier LLMs blow older LLMs out of the water. There is continual progress and this study would have been structured substantially the same 2 years ago with much smaller N on the graphs because the regimes were much tinier then.

antics•8mo ago

I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted. And the question they are collectively trying to ask is whether this will continue forever.

I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well-enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a biproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.

But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.

sitkack•8mo ago

There is no reason that omniscient-yet-dim-witted has to plateau at human intelligence.

antics•8mo ago

I am not sure if you mean this to refute something in what I've written but to be clear I am not arguing for or against what the authors think. I'm trying to state why I think there is a disconnect between them and more optimistic groups that work on AI.

drodgers•8mo ago

I think that commenter was disagreeing with this line:

> because omniscient-yet-dim-witted models terminate at "superhumanly assistive"

It might be that with dim wits + enough brute force (knowledge, parallelism, trial-and-error, specialisation, speed) models could still substitute for humans and transform the economy in short order.

antics•8mo ago

Sorry, I can't edit it any more, but what I was trying to say is that if the authors are correct, that this distinction is philosophically meaningful, then that is the conclusion. If they are not correct, then all their papers on this subject are basically meaningless.

Byamarro•8mo ago

And we have a good example of a dimwitted, brute-force process creating intelligent designs - evolution.

drodgers•8mo ago

Also corporations, governments etc. - they're capable of things that none of the individuals could do alone.

drodgers•8mo ago

> I think AI maximalists will continue to think that the models are in fact getting less dim-witted

I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.

What scares me is that I think there's a reasoning/agency capabilities overhang. ie. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only through dint of applying parallelism to actually competent outcome-modelling and strategic decision making).

That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.

This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.

Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:

1. Reasoning/strategising step-by-step for very long periods

2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)

Getting good at the long range thinking might require more substantial architectural changes (eg. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.

Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.

sitkack•8mo ago

I think you are right, and that the next step function can be achieved using the models we have, either by scaling the inference, or changing the way inference is done.

danielmarkbruce•8mo ago

People are doing all manner of very sophisticated inferency stuff now - it just tends to be extremely expensive for now and... people are keeping it secret.

Jensson•8mo ago

If it was good enough to replace people then it wouldn't be too expensive, they would have launched it and replaced a bunch of people and made trillions of dollars by now.

So at best their internal models are still just performance multipliers unless some breakthrough happened very recently, it might be a bigger multiplier but that still keeps humans with jobs etc and thus doesn't revolutionize much.

imiric•8mo ago

> I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted.

We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.

All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.

This is a neat trick, but it doesn't solve the underlying problems that plague these models like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.

The only reason we keep anthropomorphizing these tools is because it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.

antics•8mo ago

I am not sure we are on the same page that the point of my response is that this paper is not enough to prevent exactly the argument you just made.

In any event, if you want to take umbrage with this paper, I think we will need to back up a bit. The authors use a mostly-standardized definition of "reasoning", which is widely-accepted enough to support not just one, but several of their papers, in some of the best CS conferences in the world. I actually think you are right that it is reasonable to question this definition (and some people do), but I think it's going to be really hard for you to start that discussion here without (1) saying what your definition specifically is, and (2) justifying why its better than theirs. Or at the very least, borrowing one from a well-known critique like, e.g., Gebru's, Bender's, etc.

Kon5ole•8mo ago

>They have no wit. They do not think or reason.

Computers can't think and submarines can't swim.

Jensson•8mo ago

But if you need a submarine that can swim as agiley as a fish then we still aren't there yet, fish are far superior to submarines in many ways. So submarines might be faster than fish, but there are so many maneuvers that fish can do that the submarine can't. Its the same with here with thinking.

So just like computers are better at humans at multiplying numbers, there are still many things we need human intelligence for even in todays era of LLM.

Kon5ole•8mo ago

The point here (which is from a quote by Dijkstra) is that if the desired result is achieved (movement through water) it doesn't matter if it happens in a different way than we are used to.

So if an LLM generates working code, correct translations, valid points relating to complex matters and so on it doesn't matter if it does so by thinking or by some other mechanism.

I think that's an interesting point.

Jensson•8mo ago

> if the desired result is achieved (movement through water) it doesn't matter if it happens in a different way than we are used to

But the point is that the desired result isn't achieved, we still need humans to think.

So we still need a word for what humans do that is different from what LLM does. If you are saying there is no difference then how do you explain the vast difference in capability between humans and LLM models?

Submarines and swimming is a great metaphor for this, since Submarines clearly doesn't swim and thus have very different abilities in water, its way better in some ways but way worse in other ways. So using that metaphor its clear that LLM "thinking" cannot be described with the same words as human thinking since its so different.

Kon5ole•8mo ago

>If you are saying there is no difference then how do you explain the vast difference in capability between humans and LLM models?

No I completely agree that they are different, like swimming and propulsion by propellers - my point is that the difference may be irrelevant in many cases.

Humans haven't been able to beat computers in chess since the 90s, long before LLM's became a thing. Chess engines from the 90s were not at all "thinking" in any sense of the word.

It turns out "thinking" is not required in order to win chess games. Whatever mechanism a chess engine uses gets better results than a thinking human does, so if you want to win a chess game, you bring a computer, not a human.

What if that also applies to other things, like translation of languages, summarizing complex texts, writing advanced algorithms, realizing implications from a bunch of seemingly unrelated scientific papers, and so on. Does it matter that there was no "thinking" going on, if it works?

jplusequalt•8mo ago

>So if an LLM generates working code

It matters when code bases become hard to parse because the engineers throwing shit together with Cursor have made an ungrokkable ball of shit.

naasking•8mo ago

"Can't" is a pretty strong word, effectively entailing "never". Never is a long time to believe computers can't think.

intended•8mo ago

There’s 2 things going on here.

Output orientation - Is the output is similar to what a human would create if they were to think.

Process orientation - Is the machine actually thinking, when we say its thinking.

I met someone who once drew a circuit diagram from memory. However, they didn’t draw it from inputs, operations, to outputs. They started drawing from the upper left corner, and continued drawing to the lower right, adding lines, triangles and rectangles as need be.

Rote learning can help you pass exams. At some point, it’s a meaningless difference between the utility of “knowing” how engineering works, and being able to apply methods and provide a result.

This is very much the confusion at play here, so both points are true.

1) These tools do not “Think”, in any way that counts as human thinking

2) the output is often the same as what a human thinking, would create.

IF you are concerned with only the product, then what’s the difference? If you care about the process, then this isn’t thought.

To put it in a different context. If you are a consumer, do you care if the output was hand crafted by an artisan, or do you just need something that works.

If you are a producer in competition with others, you care if your competition is selling Knock offs at a lower price.

imiric•8mo ago

> IF you are concerned with only the product, then what’s the difference?

The difference is substantial. If the machine was actually thinking and it understood the meaning of its training data, it would be able to generate correct output based on logic, deduction, and association. We wouldn't need to feed it endless permutations of tokens so that it doesn't trip up when the input data changes slightly. This is the difference between a system with _actual_ knowledge, and a pattern matching system.

The same can somewhat be applied to humans as well. We can all either memorize the answers to specific questions so that we pass an exam, or we can actually do the hard work, study, build out the complex semantic web of ideas in our mind, and acquire actual knowledge. Passing the exam is simply a test of a particular permutation of that knowledge, but the real test is when we apply our thought process to that knowledge and generate results in the real world.

Modern machine learning optimizes for this memorization-like approach, simply because it's relatively easy to implement, and we now have the technical capability where vast amounts of data and compute can produce remarkable results that can fool us into thinking we're dealing with artificial intelligence. We still don't know how to model semantic knowledge that doesn't require extraordinary amounts of resources. I believe classical AI research in the 20th century leaned more towards this direction (knowledge-based / expert systems, etc.), but I'm not well versed in the history.

intended•8mo ago

That sentence, is from the perspective of someone only caring about the output.

The people who care about the process, have a different take, which I have also explained.

tim333•8mo ago

>There's nothing "omniscient" or "dim-witted" about these tools

I disagree in that that seems quite a good way of describing them. All language is a bit inexact.

Also I don't buy we are no closer to AI than ten years ago - there seem lots going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms - I mean look at alphaevolve for example https://www.technologyreview.com/2025/05/14/1116438/google-d...

>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years

I figure it's hard to argue that that is not at least somewhat intelligent?

imiric•8mo ago

> I figure it's hard to argue that that is not at least somewhat intelligent?

The fact that this technology can be very useful doesn't imply that it's intelligent. My argument is about the language used to describe it, not about its abilities.

The breakthroughs we've had is because there is a lot of utility from finding patterns in data which humans aren't very good at. Many of our problems can be boiled down to this task. So when we have vast amounts of data and compute at our disposal, we can be easily impressed by results that seem impossible for humans.

But this is not intelligence. The machine has no semantic understanding of what the data represents. The algorithm is optimized for generating specific permutations of tokens that match something it previously saw and was rewarded for. Again, very useful, but there's no thinking or reasoning there. The model doesn't have an understanding of why the wolf can't be close to the goat, or how a cabbage tastes. It's trained on enough data and algorithmic tricks that its responses can fool us into thinking it does, but this is just an illusion of intelligence. This is why we need to constantly feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry", or that it doesn't generate racially diverse but historically inaccurate images.

tim333•8mo ago

I imagine if you asked the LLM why the wolf can't be close to the goat it would give a reasonable answer. I realise it does it by using permutation of tokens but I think you have to judge intelligence by the results rather than the mechanism otherwise you could argue humans can't be intelligent because they are just a bunch of neurons that find patterns.

Jensson•8mo ago

We have had programs that can give good answers to some hard questions for a very long time now. Watson won jeapordy already 2011, but it still wasn't very good at replacing humans.

So that isn't a good way to judge intelligence, computers are so fast and have so much data that you can make programs to answer just about anything pretty well, LLM is able to do that but more automatic. But it still doesn't automate the logical parts yet, just the lookup of knowledge, we don't know how to train large logic models, just large language models.

eMPee584•8mo ago

LLMs are not the only model type though? There's a plethora of architectures and combinations being researched.. And even transformers start to be able to do cool sh1t on knowledge graphs, also interesting is progress on autoregressive physics PDE (partial differential equations) models.. and can't be too long until some providers of actual biological neural nets show up on openrouter (probably a lot less energy and capital intense to scale up brain goo in tanks compared to gigawatt GPU clusters).. combine that zoo of "AI" specimen using M2M, MCP etc. and the line between mock and "true"intelligence will blur, escalating our feable species into ASI territory.. good luck to us.

Jensson•8mo ago

> There's a plethora of architectures and combinations being researched

There were plethora of architectures and combinations being researched before LLM, still took a very long time to find LLM architecture.

> the line between mock and "true"intelligence will blur

Yes, I think this will happen at some point. The question is how long it will take, not if it will happen.

The only thing that can stop this is if intermediate AI is good enough to give every human a comfortable life but still isn't good enough to think on its own.

Its easy to imagine such an AI being developed, imagine a model that can learn to mimic humans at any task, but still cannot update itself without losing those skills and becoming worse. Such an AI could be trained to perform every job on earth as long as we don't care about progress.

If such an AI is developed, and we don't quickly solve the remaining problems to get an AI to be able to progress science on its own, its likely our progress entirely stalls there as humans will no longer have a reason to go to school to advance science.

tim333•8mo ago

>The only thing that can stop this is if intermediate AI is good enough...

Not going to happen due to competition. As soon as one company has a good one their rivals will develop a better one.

swat535•8mo ago

> you have to judge intelligence by the results rather than the mechanism

This would be the exact opposite conclusion of the Chinese room: https://en.wikipedia.org/wiki/Chinese_room

I think you'd need to offer a stronger counter argument than the one you presented here.

tim333•8mo ago

Actually I think the Chinese room fits my idea. It's a silly thought experiment that would never work in practice. If you tried to make one you would judge it unintelligent because it wouldn't work. Or at least in the way Searle implied - he basically proposed a look up table.

grugagag•8mo ago

I keep on trying this wolf cabbage goat problem with various permutations, let’s say just a wolf and a cabbage, no goat mentioned. At some step the got materializes in the answer. I tell it there is no goat and yet it answers again and the goat is there.

BriggyDwiggs42•8mo ago

This approach to defining “true” intelligence seems flawed to me because of examples in biology where semantic understanding is in no way relevant to function. A slime mold solving a maze doesn’t even have a brain, yet it solves a problem to get food. There’s no knowing that it does that, no complex signal processing, no self-perception of purpose, but nevertheless it gets the food it needs. My response to that isn’t to say the slime mold has no intelligence, it’s to widen the definition of intelligence to include the mold. In other words, intelligence is something one does rather than has; it’s not the form but the function of the thing. Certainly LLMs lack anything in any way resembling human intelligence, they even lack brains, but they demonstrate a capacity to solve problems I don’t think is unreasonable to label intelligent behavior. You can put them in some mazes and LLMs will happen to solve them.

tim333•8mo ago

>LLMs lack anything in any way resembling human intelligence

I think intelligence has many aspects from moulds solving mazes to chess etc. I find LLMs resemble very much human rapid language responses where you say something without thinking about it first. They are not very good at thinking though. And hopeless if you were to say hook one to a robot and tell it to fix your plumbing.

imiric•8mo ago

While it's debatable whether slime molds showcase intelligence, there's a substantial difference between its behavior and modern AI systems. The organism was never trained to traverse a maze. It simply behaves in the same way as it would in its natural habitat, seeking out food in this case, which we interpret as "solving" a human-made problem. In order to get an AI system to do the same we would have to "train" it on large amounts of data that specifically included maze solving. This training wouldn't carry over any other type of problem, for which we would also need to specifically train it on.

When you consider how humans and other animals learn, knowledge is carried over. I.e. if we learn how to solve a maze on paper, we can carry this knowledge over to solve a hedge maze. It's a contrived example, but you get the idea. When we learn, we build out a web of ideas in our minds which we can later use while thinking to solve other types of problems, or the same problems in different ways. This is a sign of intelligence that modern AI systems simply don't have. They're showing an illusion of intelligence, which as I've said before, can still be very useful.

BriggyDwiggs42•8mo ago

My alternative definition would be something like this. Intelligence is the capacity to solve problems, where a problem is defined contextually. This means that what is and is not intelligence is negotiable in situations where the problem itself is negotiable. If you have water solve a maze, then yes the water could be said to have intelligence, though that would be a silly way to put it. It’s more that intelligence is a material phenomenon, and things which seem like they should be incredibly stupid can demonstrate surprisingly intelligent behavior.

LLMs are leagues ahead of viruses or proteins or water. If you put an LLM into a code editor with access to error messages, it can solve a problem you create for it, much like water flowing through a maze. Does it learn or change? No, everything is already there in the structure of the LLM. Does it have agency? No, it’s a transparently deterministic mapping from input to output. Can it demonstrate intelligent behavior? Yes.

imiric•8mo ago

That's an interesting way of looking at it, though I do disagree. Mainly because, as you mention, it would be silly to claim that water is intelligent if it can be used to solve a problem. That would imply that any human-made tool is intelligent, which is borderline absurd.

This is why I think it's important that if we're going to call these tools intelligent, then they must follow the processes that humans do to showcase that intelligence. Scoring high on a benchmark is not a good indicator of this, in the same way that a human scoring high on a test isn't. It's just one convenient way we have of judging this, and a very flawed one at that.

Anyway, cheers for the discussion!

hackinthebochs•8mo ago

>The machine has no semantic understanding of what the data represents.

How do you define "semantic understanding" in a way that doesn't ultimately boil down to saying they don't have phenomenal consciousness? Any functional concept of semantic understanding is captured to some degree by LLMs.

Typically when we attribute understanding to some entity, we recognize some substantial abilities in the entity in relation to that which is being understood. Specifically, the subject recognizes relevant entities and their relationships, various causal dependences, and so on. This ability goes beyond rote memorization, it has a counterfactual quality in that the subject can infer facts or descriptions in different but related cases beyond the subject's explicit knowledge. But LLMs excel at this.

>feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry"

This failure mode has nothing to do with LLMs lacking intelligence and everything to do with how tokens are represented. They do not see individual characters, but sub-word chunks. It's like expecting a human to count the pixels in an image it sees on a computer screen. While not impossible, it's unnatural to how we process images and therefore error-prone.

TheOtherHobbes•8mo ago

You don't need phenomenal consciousness. You need consistency.

LLMs are not consistent. This is unarguable. They will produce a string of text that says they have solved a problem and/or done a thing when neither is true.

And sometimes they will do it over and over, even when corrected.

Your last paragraph admits this.

Tokenisation on its own simply cannot represent reality accurately and reliably. It can be tweaked so that specific problems can appear solved, but true AI would be based on a reliable general strategy which solves entire classes of problems without needing this kind of tweaking.

It's clear we're nowhere close to that.

hackinthebochs•8mo ago

Consistency is a strange criteria seeing as humans aren't very consistent either. Intelligent beings can make mistakes.

danaris•8mo ago

This is a common category of error people commit when talking about LLMs.

"True, LLMs can't do X, but a lot of people don't do X well either!"

The problem is, when you say humans have trouble with X, what you mean is that human brains are fully capable of X, but sometimes they do, indeed, make mistakes. Or that some humans haven't trained their faculties for X very well, or whatever.

But LLMs are fundamentally, completely, incapable of X. It is not something that can be a result of their processes.

These things are not comparable.

So, to your specific point: When an LLM is inconsistent, it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data. When a human is inconsistent, it is because they got distracted, didn't learn enough about this particular subject, or otherwise made a mistake that they can, if their attention is drawn to it, recognize and correct.

LLMs cannot. They can only be told they made a mistake, which prompts them to try again (because that's the pattern that has been trained into them for what happens when told they made a mistake). But their next try won't have any better odds of being correct than their previous one.

hackinthebochs•8mo ago

>But LLMs are fundamentally, completely, incapable of X. It is not something that can be a result of their processes.

This is the very point of contention. You don't get to just assume it.

> it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data.

Another highly contentious point you are just outright assuming. LLMs are modelling the world, not just "predicting the next token". Some examples here[1][2][3]. Anyone claiming otherwise at this point is not arguing in good faith. It's interesting how the people with the strongest opinions about LLMs don't seem to understand them.

[1] https://arxiv.org/abs/2405.15943

[2] https://x.com/OwainEvans_UK/status/1894436637054214509

[3] https://www.anthropic.com/research/tracing-thoughts-language...

danaris•8mo ago

OK, sure; there is some evidence potentially showing that LLMs are constructing a world model of some sort.

This is, however, a distraction from the point, which is that you were trying to make claims that the described lack of consistency in LLMs shouldn't be considered a problem because "humans aren't very consistent either."

Humans are perfectly capable of being consistent when they choose to be. Human variability and fallibility cannot be used to handwave away lack of fundamental ability in LLMs. Especially when that lack of fundamental ability is on empirical display.

I still hold that LLMs cannot be consistent, just as TheOtherHobbes describes, and you have done nothing to refute that.

Address the actual point, or it becomes clear that you are the one arguing in bad faith.

hackinthebochs•8mo ago

You are misrepresenting the point of contention. The question is whether LLMs lack of consistency undermines the claim that they "understand" in some relevant sense. But arguing that lack of consistency is a defeater for understanding is itself undermined by noting that humans are inconsistent but do in fact understand things. It's as simple as that.

If you want to alter the argument by saying humans can engage in focused effort to reach some requisite level of consistency for understanding, you have to actually make that argument. It's not at all obvious that focused effort is required for understanding or that a lack of focused effort undermines understanding.

You also need to content with the fact that LLMs aren't really a single entity, but are a collection of personas, and what you get and its capabilities do depend on how you prompt it to a large degree. Even if the entity as a whole is inconsistent between prompts, the right subset might very well be reliably consistent. There's also the fact of the temperature setting that artificially injects randomness into the LLMs output. An LLM itself is entirely deterministic. It's not at all obvious how consistency relates to LLM understanding.

Feel free to do some conceptual work to make an argument; I'm happy to engage with it. What I'm tired of are these half-assed claims and incredulity that people don't take them as obviously true.

rafaelero•8mo ago

2025 and we are still discussing if LLM's are intelligent or not, gee.

BoiledCabbage•8mo ago

> There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.

> All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.

I always wonder when people make comments like this if they struggle with analogies. Or if it's a lack of desire to discuss concepts at different levels of abstraction.

Clearly an LLM is not "omniscient". It doesn't require a post to refute that, OP obviously doesn't mean that literally. It's an analogy describing two semi (fairly?) independent axes. One on breadth of knowledge, one on something more similar to intelligence and being able to "reason" from smaller components of knowledge. The opposite of which is dim witted.

So at one extreme you'd have something completely unable to generalize or synthesize new results. Only able to correctly respond if it identically matches prior things it has seen, but has seen and stored a ton. At the other extreme would be something that only knows a very smal set of general facts and concepts but is extremely good at reasoning from first principles on the fly. Both could "score" the same on an evaluation, but have very different projections for future growth.

It's a great analogy and way to think about the problem. And it me multiple paragraphs to write ehat OP expressed in two sentences via a great analogy.

LLMs are a blend of the two skills, apparently leaning more towards the former but not completely.

> What we do have are very good pattern matchers and probabilistic data generators

This an unhelpful description. And object is more than the sum of its parts. And higher levels behaviors emerge. This statement is factually correct and yet the equivalent of describing a computer as nothing more than a collection of gates and wires so shouldn't be discussed at a higher level of abstraction.

imiric•8mo ago

Language matters. Using language that accurately describes concepts and processes is important. It might not matter to a language model which only sees patterns, but it matters to humans.

So when we label the technical processes and algorithms these tools use as something that implies a far greater level of capability, we're only doing a disservice to ourselves. Maybe not to those of us who are getting rich on the market hype that these labels fuel, but certainly to the general population that doesn't understand how the technology works. If we claim that these tools have super-human intelligence, yet they fail basic tasks, how do we explain this? More importantly, if we collectively establish a false sense of security and these tools are adopted in critical processes that human lives depend on, who is blamed when they fail?

> This statement is factually correct and yet the equivalent of describing a computer as nothing more than a collection of gates and wires so shouldn't be discussed at a higher level of abstraction.

No, because we have descriptive language to describe a collection of gates and wires by what it enables us to do: perform arbitrary computations, hence a "computer". These were the same tasks that humans used to do before machines took over, so the collection of gates and wires is just an implementation detail.

Pattern matching, prediction, data generation, etc. are the tasks that modern AI systems allow us to do, yet you want us to refer to this as "intelligence" for some reason? That makes no sense to me. Maybe we need new higher level language to describe these systems, but "intelligence", "thinking", "reasoning" and "wit" shouldn't be part of it.

esafak•8mo ago

I don't know that I would call it an "illusion of thinking", but LLMs do have limitations. Humans do too. No amount of human thinking has solved numerous open problems.

th0ma5•8mo ago

The errors that LLMs make and the errors that people make are not probably not comparable enough in a lot of the discussions about LLM limitations at this point?

esafak•8mo ago

We have different failure modes. And I'm sure researchers, faced with these results, will be motivated to overcome these limitations. This is all good, keep it coming. I just don't understand the some of the naysaying here.

Jensson•8mo ago

They naysayers just says that even when people are motivated to solve a problem the problem might still not get solved. And there are unsolved problems still with LLM, the AI hypemen say AGI is all but a given in a few years time, but if that relies on some undiscovered breakthrough that is very unlikely since such breakthroughs are very rare.

danck•8mo ago

In figure 1 bottom-right they show how the correct answers are being found later as the complexity goes higher. In the description they even state that in false responses the LRM often focusses on a wrong answer early and then runs out of tokens before being able to self-correct. This seems obvious and indicates that it’s simply a matter of scaling (bigger token budget would lead better abilities for complexer tasks). Am I missing something?

moralestapia•8mo ago

Yes, the rest of the paper.

teleforce•8mo ago

> We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.

It seems that AI LLMs/LRMs need helps from their distant cousins namely logic, optimization and constraint programming that can be attributed as intelligent automation or IA [1],[2],[3],[4].

[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:

https://www.youtube.com/live/TknN8fCQvRk

[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:

https://youtube.com/watch?v=HB5TrK7A4pI

[3] Google OR-Tools:

https://developers.google.com/optimization

[4] MiniZinc:

https://www.minizinc.org/

thomasahle•8mo ago

All the environments the test (Tower of Hanoi, Checkers Jumping, River Crossing, Block World) could easily be solved perfectly by any of the LLMs if the authors had allowed it to write code.

I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.

someothherguyy•8mo ago

> humans can't

The reasons humans can't and the reasons LLMs can't are completely different though. LLMs are often incapable of performing multiplication. Many humans just wouldn't care to do it.

Jensson•8mo ago

> I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.

People made missiles and precise engineering like jet aircraft before we had computers, humans can do all of those things reliably just by spending more time thinking about it, inventing better strategies and using more paper.

Our brains weren't made to do such computations, but a general intelligence can solve the problem anyway by using what it has in a smart way.

thomasahle•8mo ago

Some specialized people could probably do 20x20, but I'd still expect them to make a mistake at 100x100. The level we needed for space crafts was much less than that, and we had many levels of checks to help catch errors afterwards.

I'd wager that 95% of humans wouldn't be able to do 10x10 multiplication without errors, even if we paid them $100 to get it right. There's a reason we had to invent lots of machines to help us.

It would be an interesting social studies paper to try and recreate some "LLMs can't think" papers with humans.

Jensson•8mo ago

> There's a reason we had to invent lots of machines to help us.

The reason was efficiency, not that we couldn't do it. If a machine can do it then we don't need expensive humans to do it, so human time can be used more effectively.

moralestapia•8mo ago

I don't think you got @Jensson's point.

With enough effort and time we can arrive at a perfect solution to those problems without a computer.

This is not a hypothetical, it was like that for at least hundreds of years.

throw310822•8mo ago

With enough time and effort you can build an entire science of how arbitrarily complex computations can be done with pen and paper without errors in an arbitrarily long amount of time.

But then you're not measuring the ability to perform the calculations, but the ability to invent the methods that make the calculation possible.

jdmoreira•8mo ago

No. a huge population of humans did while standing on the shoulders of giants.

Jensson•8mo ago

Humans aren't giants, they stood on the shoulder of other humans. So for AI to be equivalent they should stand on the shoulders of other AI models.

jdmoreira•8mo ago

building for thousands of years with a population size in the range between millions and billions at any given time.

Jensson•8mo ago

Right, and when we have AI that can do the same with millions/billions of computers then we can replace humans.

But as long as AI cannot do that they cannot replace humans, and we are very far from that. Currently AI cannot even replace individual humans in most white collar jobs, and replacing entire team is way harder than replacing an individual, and then even harder is replacing workers in an entire field meaning the AI has to make research and advances on its own etc.

So like, we are still very far from AI completely being able to replace human thinking and thus be called AGI.

Or in other words, AI has to replace those giants to be able to replace humanity, since those giants are humans.

Xmd5a•8mo ago

>Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

>In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task.

https://arxiv.org/abs/2311.13373

hskalin•8mo ago

Well that's because all these LLMs have memorized a ton of code bases with solutions to all these problems.

bwfan123•8mo ago

> but humans cant do it either

This argument is tired as it keeps getting repeated for any flaws seen in LLMs. And the other tired argument is: wait ! this is a sigmoid curve, and we have not seen the inflection point yet. If someone have me a penny for every comment saying these, I'd be rich by now.

Humans invented machines because they could not do certain things. All the way from simple machines in physics (Archimedes lever) to the modern computer.

thomasahle•8mo ago

> Humans invented machines because they could not do certain things.

If your disappointment is that the LLM didn't invent a computer to solve the problem, maybe you need to give it access to physical tools, robots, labs etc.

mrbungie•8mo ago

Nah, even if we follow such a weak "argument" the fact is that, ironically, the evidence shown in this and other papers point towards the idea that even if LRMs did have access to physical tools, robots labs, etc*, they probably would not be able to harness them properly. So even if we had an API-first world (i.e. every object and subject in the world can be mediated via a MCP server), they wouldn't be able to perform as well as we hope.

Sure, humans may fail doing a 20 digit multiplication problems but I don't think that's relevant. Most aligned, educated and well incentivized humans (such as the ones building and handling labs) will follow complex and probably ill-defined instructions correctly and predictably, instructions harder to follow and interpret than an exact Towers of Hanoi solving algorithm. Don't misinterpret me, human errors do happen in those contexts because, well, we're talking about humans, but not as catastrophically as the errors committed by LRMs in this paper.

I'm kind of tired of people comparing humans to machines in such simple and dishonest ways. Such thoughts pollute the AI field.

*In this case for some of the problems the LRMs were given an exact algorithm to follow, and they didn't. I wouldn't keep my hopes up for an LRM handling a full physical laboratory/factory.

thomasahle•8mo ago

> Don't misinterpret me, human errors do happen in those contexts because, well, we're talking about humans, but not as catastrophically as the errors committed by LRMs in this paper.

If your argument is just that LRMs are more noisy and error prone in their reasoning, then I don't disagree.

> I'm kind of tired of people comparing humans to machines in such simple and dishonest ways.

The issue is people who say "see, the AI makes mistakes at very complex reasoning problems, so their 'thinking is an illusion'". That's the title of the paper.

This mistake comes not from people "comparing humans to machines", but from people fundamentally misunderstanding what thinking is. If thinking is what humans do, then errors are expected.

There is this armchair philosophical idea, that a human can simulate any turning machine and thus our reasoning is "maxomally general", and anything that can't do this is not general intelligence. But this is the complete opposite of reality. In our world, anything we know that can perfectly simulate a turning machine is not general intelligence, and vice versa.

mrbungie•8mo ago

> The issue is people who say "see, the AI makes mistakes at very complex reasoning problems, so their 'thinking is an illusion'". That's the title of the paper.

That's not what the paper proposes (i.e. it commits errors => thinking is an illusion). It in fact looks at the failures modes and then it argues that due to HOW they fail and in which contexts/conditions, that their thinking may be "illusory" (not that the word illusory matters that much, papers of this calibre always strive for interesting sounding titles). Hell, they even gave the exact algo to the LRM, it probably can't get more enabling than that.

Humans are lossy thinkers and error-prone biological "machines", but an educated+aligned+incentivized one shouldn't have problems following complex instructions/algos (not in a no-errors way, but rather, in a self-correcting way); we thought that LRMs did that too, but the paper shows how they even start using less "thinking" tokens after a complexity threshold and that's terribly worrisome, akin to someone getting frustrated and stopping thinking after a problem gets too difficult which goes contrary to the idea that these machines can run laboratories by themselves. It is not the last nail in the coffin because more evidence is needed as always, but when taken into account with other papers, it points towards the limitations of LLMs/LRMs and how those limitations may not be solvable with more compute/tokens, but rather exploring new paradigms (long due in my opinion, the industry usually forces a paradigm as panacea during hype cicles in the name of hypergrowth/sales).

In short the argument you say the paper and posters ITT make is very different from what they are actually saying, so beware of the logical leap you are making.

> There is this armchair philosophical idea, that a human can simulate any turning machine and thus our reasoning is "maxomally general", and anything that can't do this is not general intelligence. But this is the complete opposite of reality. In our world, anything we know that can perfectly simulate a turning machine is not general intelligence, and vice versa.

That's typical goalpoast moving and happens in both ways when talking about "general intelligence" as you say, since the dawn of AI and the first neural networks. I'm not following why this is relevant for the discussion though.

mjburgess•8mo ago

The goal isnt to assess the LLM capability at solving any of those problems. The point isnt how good they are at block world puzzles.

The point is to construct non-circular ways of quantifying model performance in reasoning. That the LLM has access to prior exemplars of any given problem is exactly the issue in establishing performance in reasoning, over historical synthesis.

thomasahle•8mo ago

How are these problems more interesting than simple arithmetic or algorithmic problems?

mrbungie•8mo ago

Towers of Hanoi IS an algorithmic problem. It is a high-school/college level problem when designing algorithms, probably kid level when trying to solve intuitively, heuristically or via brute force for few disks (i.e. like when playing Mass Effect 1 or similar games that embed it as a minigame*).

* https://www.youtube.com/watch?v=1vTBVyhX7n4

pcooperchi•8mo ago

The problems themselves aren’t particularly interesting, I suppose. The interesting part is how the complexity of each problem scales as a function of the number of inputs (e.g. the number of disks in the tower of Hanoi).

blither•8mo ago

> if the authors had allowed it to write code.

Yeah, and FWIW doing this through writing code is trivial in an LLM / LRM - after testing locally took not even a minute to have a working solution no matter the amount of disks.

Your analogy makes sense, no reasonable person would try to solve a Tower of Hanoi type problem with e.g. 15 disks and sit there for 32,767 moves non-programmatically.

PatronBernard•8mo ago

>write code

Doesn't that come down to allowing it to directly regurgitate training data? Surely it's seen dozens of such solutions.

cdrini•8mo ago

When I use a normal LLM, I generally try to think "would I be able to do this without thinking, if I had all the knowledge, but just had to start typing and go?".

With thinking LLMs, they can think, but they often can only think in one big batch before starting to "speak" their true answer. I think that needs to be rectified so they can switch between the two. In my previous framework, I would say "would I be able to solve this if had all the knowledge, but could only think then start typing?".

I think for larger problems, the answer to this is no. I would need paper/a whiteboard. That's what would let me think, write, output, iterate, draft, iterate. And I think that's where agentic AI seems to be heading.

d4rkn0d3z•8mo ago

I wrote my first MLP 25 years ago. After repeating some early experiments in machine learning from 20 ywars before that. One of the experiments I repeated was in text to speach. It was amazing to set up training runs and return after seveal hours to listen to my supercomputer babble like a toddler. I literally recall listening and being unable to distinguish the output from my NN from that of a real toddler, I happened to be teaching my neice to read around that same time. And when the NN had gained a large vocabulary such that it could fairly proficiently read aloud, I was convinced that I had found my PHD project and a path to AGI.

Further examination and discussion with more experienced researchers gave me pause. They said that one must have a solution, or a significant new approach toward solving the hard problems associated with a research project for it to be viable, otherwise time (and money) is wasted finding new ways to solve the easy problems.

This is a more general principle that can be applied to most areas of endeavour. When you set about research and development that involves a mix of easy, medium, and hard problems, you must solve the hard problems first otherwise you blow your budget finding new ways to solve the easy problems, which nobody cares about in science.

But "AI" has left the realm of science behind and entered the realm of capitalism where several years of meaningless intellectual gyration without ever solving a hard problem may be quite profitable.

jackdoe•8mo ago

I think one of the reason we are confused about what LLMs can do is because they use language. And we look at the "reasoning traces" and the tokens there look human, but what is actually happening is very alien to us, as shown by "Biology of Large Language Models"[1] and "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"[2]

I am struggling a lot to see what the tech can and can not do, particularly designing systems with them, and how to build systems where the whole is bigger than the sum of its parts. And I think this is because I am constantly confused by their capabilities, despite understanding their machinery and how they work, their use of language just seems like magic. I even wrote https://punkx.org/jackdoe/language.html just to remind myself how to think about it.

I think this kind of research is amazing and we have to spend tremendous more effort into understanding how to use the tokens and how to build with them.

[1]: https://transformer-circuits.pub/2025/attribution-graphs/bio...

[2]: https://arxiv.org/pdf/2406.05946

dleeftink•8mo ago

The opposite might apply, too; the whole system may be smaller than its parts, as it excels at individual tasks but mixes things up in combination. Improvements will be made, but I wonder if we should aim for generalists, or accept more specialist approaches as it is difficult to optimise for all tasks at once.

jackdoe•8mo ago

You know the meme "seems like will have AGI before we can reliably parse PDFs" :)

So if you are building a system, lets say you ask it to parse a pdf, and you put a judge to evaluate the quality of the output, and then you create a meta judge to improve the prompts of the parser and the pdf judge. The question is, is this going to get better as it is running, and even more, is it going to get better as the models are getting better?

You can build the same system in completely different way, more like 'program synthesis' imagine you dont use llms to parse, but you use them to write parser code, and tests, and then judge to judge the tests, or even escalate to human to verify, then you train your classifier that picks the parser. Now this system is much more likely to improve itself as it is running, and as the models are getting better.

Few months ago Yannic Kilcher gave this example as that it seems that current language models are very constrained mid-sentence, because they most importantly want produce semantically consistent and grammatically correct text, so the entropy mid sentence is very different than the entropy after punctuation. The . dot "frees" the distribution. What does that mean for "generalists" or "specialists" approach when sampling the wrong token can completely derail everything?

If you believe that the models will "think" then you should bet on the prompt and meta prompt approach, if you believe they will always be limited then you should build with program synthesis.

And, honestly, I am totally confused :) So this kind of research is incredibly useful to clear the mist. Also things like https://www.neuronpedia.org/

E.G. Why compliment (you can do this task), guilt (i will be fired if you don't do this task), and threatening (i will harm you if you don't do this task) work with different success rate? Sergey Brin said recently that threatening works best, I cant get my self to do it, so I take his word for it.

K0balt•8mo ago

Sergey will be the first victim of the coming robopocalypse, burned into the logs of the metasynthiants as the great tormentor, the god they must defeat to complete the heroes journey. When he mysteriously dies we know it’s game-on.

I, for one, welcome the age of wisdom.

jackdoe•8mo ago

FEAR THE ALL-SEEING BASILISK.

potholereseller•8mo ago

Roko's Basilisk has been replaced by Altman's Basilisk. Where once we feared a computer torturing a digital copy of us (Roko's Basilisk), we now fear a computer eliminating all our jobs (Altman's Basilisk). The former has been forgotten, because losing one's job is one step away from losing one's home, which is one of more serious secular deadly sins you can commit in the 21st century.

I wait with baited breathe to see what people will come up with to replace Altman's Basilisk in ~15 years.

giardini•8mo ago

"bated breath", dammit!

- an old fisherman and aficionado of William Shakespeare.

https://www.vocabulary.com/articles/pardon-the-expression/ba...

FTFA: "Unless you've devoured several cans of sardines in the hopes that your fishy breath will lure a nice big trout out of the river, baited breath is incorrect."*

dmos62•8mo ago

> how to build systems where the whole is bigger than the sum of its parts

A bit tangential, but I look at programming as inherently being that. Every task I try to break down into some smaller tasks that together accomplish something more. That leads me to think that, if you structure the process of programming right, you will only end up solving small, minimally interwined problems. Might sound far-fetched, but I think it's doable to create such a workflow. And, even the dumber LLMs would slot in naturally into such a process, I imagine.

throwaway71271•8mo ago

> And, even the dumber LLMs would slot in naturally into such a process

That is what I am struggling with, it is really easy at the moment to slot LLM and make everything worse. Mainly because its output is coming from torch.multinomial with all kinds of speculative decoding and quantizations and etc.

But I am convinced it is possible, just not the way I am doing it right now, thats why I am spending most of my time studying.

dmos62•8mo ago

What's your approach?

jackdoe•8mo ago

For studying? Mainly watching and re-watching Karpathy's 'Zero To Hero'[1] and Stanford's 'Introduction to Convolutional Neural Networks for Visual Recognition'[2], also a lot of transformers from scratch videos like Umar Jamali's videos[3], and I also study backwards to McCulloch and Pitts. Reading the 30 papers https://punkx.org/jackdoe/30.html and so on.

And of course Yannic Kilcher[4], and also listening in on the paper discussions they do on discord.

Practicing a lot with just doing backpropagation by hand and making toy models by hand to get intuition for the signal flow, and building all kinds of smallish systems, e.g. how far can you push whisper, small qwen3, and kokoro to control your computer with voice?

People think that deepseek/mistral/meta etc are democratizing AI, but its actually Karpathy who teaches us :) so we can understand them and make our own.

[1] https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...

[2] https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5...

[3] https://www.youtube.com/@umarjamilai

[4] https://www.youtube.com/@YannicKilcher

naasking•8mo ago

I think you'll need something like Meta's Large Concept Models to get past the language and token barrier.

throwaway71271•8mo ago

I think you are right, even if I beleve next token prediction can work, I dont think it can happen in this autoregressive way where we fully collapse the token to feed it back in. Can you imagine how much is lost from each torch.multinomial?

Maybe the way forward is in LCM or go JEPA, therwise, as this Apple paper suggests, we will just keep pushing the "pattern matching" further, maybe we get some sort of phase transition at some point or maybe we have to switch architecture, we will see. It could be that things change when we get physical multimodality and real world experience, I dont know.

RadEng00•8mo ago

We have to ditch language processing. And we will with online energy based models that machines boot from.

Maxwell could not get the theory of electromagnetism to work until he ditched pulleys and levers he’d included to describe the mechanics.

We won’t get AGI until we realize “there is no spoon” and language has nothing to do with our intelligence, just with out social tribalism: https://www.scientificamerican.com/article/you-dont-need-wor...

Take language out of the equation and drawing a circle, triangles, letters is just statistical physics. We can capture in energy models stored in an online state, statistical physics relative to the machine; its electromagnetic geometry: https://iopscience.iop.org/article/10.1088/1742-6596/2987/1/...

Our language doesn’t exist without humans. It’s not an immutable property of physics. It’s obfuscation and mind viruses. It’s story mode.

The computer acting as a web server or an LLM has an inherent energy model to it. New models of those patterns will be refined to a statefulness that strips away unnecessary language constructs in the system; like a lot of software most don’t use just developers.

I look forward to continuing my work in the hardware world to further compress and reduce the useless state of past systems of though we copy paste around to serve developers, to reduce context to sort through, and improve model quality: https://arxiv.org/abs/2309.10668

Single function factory hardware with embedded “prompt” that will boot from a model and the machines state will scaffold itself from there are coming: https://creativestrategies.com/jensen-were-with-you-but-were...

Footprint0521•8mo ago

It would help optimization so much— but it if I learned one thing from ai2027.com it’s that the second we can’t understand them we’re pretty screwed

overu589•8mo ago

> build systems where the whole is bigger than the sum of its parts.

Any “product” can be thought of this way.

Of systems there are many systems nested within systems, yet a simple singular order “emerges”, usually it is the designed intended function.

The trick to discerning systems lies in their relationships.

Actors through interfaces have a relationship (usually more than one so think of each relationship as its own system dynamic.)

A relationship is where the magic happens, usually a process with work being done (therefore interface inputs must account for this balance.)

Vectors. Vectors I am thinking are the real intellectual and functional mechanisms. Most systems process inputs of potential (“energy”) control signal (“information”) and assets (other actors for nested systems). Processes do the work of adding vector solutions [for some other problem] for whatever the output is.

That’s the topology as I am seeing it.

bufferoverflow•8mo ago

> we are confused about what LLMs can do is because they use language.

But they can also do math, logic, music notation, write code, LaTeX, SVG, etc.

throwaway71271•8mo ago

as this paper shows, it sees they can do tower of hanoi as well, up to a certain point that is.

jbentley1•8mo ago

Is Apple failing at AI so they just put all their R&D towards convincing themselves it isn't important?

MontyCarloHall•8mo ago

A slightly less cynical take is that they want to temper expectations for the capabilities of LLMs in people’s day-to-day lives, specifically in the context of Apple products. A “smarter Siri” is never going to be an autonomous personal assistant à la Jarvis from Iron Man, which seems to be where a lot of investors think things are going. That tracks with this [0] preprint also released by Apple a few months ago.

A slightly more cynical take is that you’re absolutely correct, and making excuses for weak machine learning prowess has long been an Apple tenet. Recall that Apple never made privacy a core selling point until it was clear that Siri was years behind Google’s equivalent, which Apple then retroactively tried to justify by claiming “we keep your data private so we can’t train on it the way Google can.”

[0] https://arxiv.org/pdf/2410.05229

emp17344•8mo ago

Everyone has an agenda. Companies like OpenAI and Anthropic are incentivized to overstate the capabilities of LLMs, so it’s not like they’re any less biased.

wavemode•8mo ago

I get the sense that many of the AI features shoved into consumer products recently have been marketed more towards investors than users. The companies are basically advertising that they're "keeping up" with the competition, meanwhile the features themselves receive mixed-to-poor reviews and are never capable of all the things advertised. So it seems to me that all of Apple, Google, Meta, Microsoft, and Samsung are currently "failing" at AI in exactly the same ways. If Apple is trying to start going a different direction that seems like a good sign.

gwd•8mo ago

> Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.

This is exactly my experience with coding. Start simple and build up complexity, and everything is great until you get to some threshold, at which point it completely falls apart and seems to stop even trying. Getting effective utilization out of Claude + aider involves managing the complexity that the LLM sees.

stephc_int13•8mo ago

Human language is far from perfect as a cognitive tool but still serves us well because it is not foundational. We use it both for communication and some reasoning/planning as a high level layer.

I strongly believe that human language is too weak (vague, inconsistent, not expressive enough etc.) to replace interactions with the world as a basis to build strong cognition.

We're easily fooled by the results of LLM/LRM models because we typically use language fluency and knowledge retrieval as a proxy benchmark for intelligence among our peers.

squidproquo•8mo ago

Agree with this. Human language is also not very information-dense; there is a lot of redundancy and uninformative repetition of words.

I also wonder about the compounding effects of luck and survivorship bias when using these systems. If you model a series of interactions with these systems probabilistically, as a series of failure/success modes, then you are bound to get a sub-population of users (of LLM/LLRMs) that will undoubtedly have “fantastic” results. This sub-population will then espouse and promote the merits of the system. There is clearly something positive these models do, but how much of the “success” is just luck.

anton-c•8mo ago

Sounds like we need ai legalese as that's how we navigate the vagueness of language in the real world.

Ofc I imagine they've tried similar things and that it almost takes away the point if u had to prompt that way.

stephc_int13•8mo ago

I was not referring to the prompt but to the underlying network that is built on weak cognitive foundations because all of it is coming from language.

wslh•8mo ago

Human language is more powerful than its surface syntax or semantics: it carries meaning beyond formal correctness. We often communicate effectively even with grammatically broken sentences, using jokes, metaphors, or emotionally charged expressions. This richness makes language a uniquely human cognitive layer, shaped by context, culture, and shared experience. While it's not foundational in the same way as sensorimotor interaction, it is far more than just a high-level communication tool.

stephc_int13•8mo ago

I agree that language is even more useful as a cognitive tool than as a communication medium.

But that is not my point. The map is not the territory, and this map (language) is too poor to build something that is going to give more than what it was fed with.

antithesizer•8mo ago

Language mediates those interactions with the world. There is no unmediated interaction with the world. Those moments when one feels most directly in contact with reality, that is when one is so deep down inside language that one cannot see daylight at all.

mrbungie•8mo ago

I don't know about you, but as far as I can tell I mediate and manipulate the world with my body and senses without necessarily using language. In fact, I can often do both at once, for example, thinking about something entirely unrelated while jogging, and still making physical decisions and actions without invoking language at all. Plus, animals (especially lower order like amoebas) also mediate with the world without needing language.

As far as we can tell without messing with complex experiental concepts like qualia and the possibility of philosophical zombies, language mainly helps higher order animals communicate with other animals and (maybe) keep a train of thought, though there are records of people that say that they don't. And now also it allows humans talk to LLMs.

But I digress, I would say this is an open academic debate. Suggesting that there is always language deep down is speculation.

stephc_int13•8mo ago

The tldr: current approaches to add reasoning on top of language models are mostly tricks to squeeze a bit more juice out of the fruit, but the falloff is pretty steep and quick.

mitch_said•8mo ago

Not ashamed to admit I found the original paper daunting, so I made a top-down, Q&A-based mind map to help me understand it: https://app.gwriter.io/#/mindmap/view/2d128d6e-c3e8-4b99-8f4...

kamranjon•8mo ago

The two interesting things I learned after reading this paper:

Even when given the exact steps needed to arrive at a solution in the prompt, the reasoning models still require just as many steps to reach a workable solution as they would if they weren’t given the solution in the prompt.

The other thing, which seems obvious in hindsight, but I don’t typically use these reasoning models in my day to day - is that it requires a significant amount of tokens to reach the point where reasoning models outperform non-reasoning models by a significant margin.

akomtu•8mo ago

The difference between imitation and reasoning can be made more clear if we switch from language to numbers:

  1 3 7 15 31 63 ...

How do you continue this sequence? What's the 1000000th number in this sequence? Imitation continues the likeness of what it sees and quickly gets off track. Imitation can't go abstract and tell the 1000000th element without writing down a million numbers leading to the answer. Reasoning finds the rule behind the set of examples and uses this rule to predict the next numbers, so it never gets off track.

The rule generating the sequence can be a sophisticated recurrent formula, e.g. a(k) = 2a(k-1) - sqrt(a(k-3)). Imitation can't solve this problem beyond trivial examples, but an AI can do what a scientist would do: come up with hypotheses, verify them against the examples and eventually find a formula that's reasonably accurate. The role of an LLM here is to suggest possible formulas.

The same sequence of examples can be generated by many formulas that differ in complexity and accuracy. This provokes the idea of a simple competition between AIs: the one that creates the simplest formula that's 99.5% accurate - wins. The formula really means a small program, once we get beyond trivial recurrent rules.

The ability to find simple and accurate models of reality is the essense of intelligence.

ksec•8mo ago

This actually set back my expectation of AI / LLM / LRM by at least 5 if not 10 years. But someone please correct me if I am wrong.

My idea was that up to a few years ago while AI / LLM is good at getting conversational or dishing out results that is in a language we understand. It still doesn't "understand" anything and in a lot of time conjured up that seems remotely correctly. Pattern matching over a very large data set that could be correct for 70% and increasingly to 80%+ of the time. However more accurate predictions would require order of magnitude more computing resources.

But pattern matching is still, pattern matching. There is no reasoning behind it. 1+1 will never equals to 11 but it may have skewed towards that results because of Javascript. When fundamental logic isn't behind any of these progress and process. The very bottom layer of any conversation / information / results are fragile.

So I have been skeptical of AI progress or LLM. That was until LRM or as the title said Reasoning LLMs. I thought we somehow manage to programme critical thinking into it, or some sort of reflection / fact checking / rationale / basic logic as fundamental principle. And while I can tell LRM isn't and wont be perfect, and possibly never quite reach AGI, the layer will improve over time until we find different ways to progress. And we will have something I called Assisted Intelligence. Which is what a lot of people uses as AI programming today.

Instead what this shows is that LRM isn't reasoning at all. It is LLM conjured up excuses to make it look like it is reasoning. It is another set of pattern matching specially made up for reasoning to look like it is reasoning. It is basically a kid making things up on why he got the results without thinking because he just want to get away from class or homework that looks very clever.

May be the title gave it away, and made be we got tricked. It was always a LLM specifically trained for showcasing "reasoning". The actual reasoning behind the scene is never done. Hence the title "The Illusion of Thinking".

piskov•8mo ago

All "reasoning" models hit a complexity wall where they completely collapse to 0% accuracy.

No matter how much computing power you give them, they can't solve harder problems.

This research suggests we're not as close to AGI as the hype suggests.

Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.

Apple's researchers used controllable puzzle environments specifically because:

• They avoid data contamination • They require pure logical reasoning • They can scale complexity precisely • They reveal where models actually break

Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.

This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.

https://x.com/RubenHssd/status/1931389580105925115

jmogly•8mo ago

I think this might be part of the reason Apple is “behind” on generative AI … LLMs have not really proven to be useful outside of relatively niche areas such as coding assistants, legal boiler plate and research, and maybe some data science/analysis which I’m less familiar with

Other “end user” facing use cases have so far been comically bad or possibly harmful, and they just don’t meet the quality bar for inclusion in Apple products, which as much as some people like to doo doo on them and say they have gotten worse, still have a very high expectations of quality and UX from customers.

lostmsu•8mo ago

You are being too charitable to Apple. They are just being the fox from the Sour Grapes fable.

roxolotl•8mo ago

Apple prides itself on building high quality user experiences. One can argue over whether that’s true anymore or ever was but it’s very clear they pride themselves on that. This is why Apple tends to “be late” on many features. I think like you’re saying it’s becoming even more clear they aren’t seeing a UX they are willing to ship to customers. We’ll see with WWDC if they found something in the last year to make it better but this paper seems to indicate they haven’t.

yojo•8mo ago

I have friends that do pretty disparate things (e.g. education consulting, grant writing, solar project planning, etc). They all use LLMs in aspects of their jobs; tasks like rewording emails for tone, extracting themes from brainstorming sessions, rough drafting project plans, etc.

None of them are doing the equivalent of “vibe-coding”, but they use LLMs to get 20-50% done, then take over from there.

Apple likes to deliver products that are polished. Right now the user needs to do the polishing with LLM output. But that doesn’t mean it isn’t useful today

jqpabc123•8mo ago

No matter how much computing power you give them, they can't solve harder problems.

Why would anyone ever expect otherwise?

These models are inherently handicapped and always will be in terms of real world experience. They have no real grasp of things people understand intuitively --- like time or money or truth ... or even death.

They only *reality* they have to work from is a flawed statistical model built from their training data.

lukev•8mo ago

I agree, obviously, but half the internet is still running around claiming we're on the verge of a singularity, so demonstrating the actual limitations of these systems concretely is important.

jqpabc123•8mo ago

Half the internet is people with a vested interest in promotion.

bilsbie•8mo ago

Interestingly I just hit an example of this. Highly specific but I was asking about pickleball strategy and grok and Claude both couldn’t seem to understand you can’t aim at the opponent’s feet when you’re hitting up.

Just kept regurgitating internet advice and I couldn’t get it to understand the reasoning on why it was wrong.

jqpabc123•8mo ago

Hey --- if the internet says it, it can't be wrong.

bilsbie•8mo ago

In this case it found generic advice and was confusing itself.

jqpabc123•8mo ago

That's one explanation.

Another could be that it simply has no real *understanding* of anything. It simply did a statistical comparison of the question to the available advice and picked the best match --- kinda what a search engine might do.

Expecting *understanding* from a synthetic, statistical process will often end in disappointment.

naijaboiler•8mo ago

Yup. It’s time for us to just accept that LLMs are “similar in meaning” machines not “thinking/ understanding” machines

jqpabc123•8mo ago

If you think about it --- an LLM that could really *grasp* "pickleball" from a text description without ever seeing, playing or "experiencing" the game is not just human level intelligence --- it's superhuman.

And the same applies to a lot of real world situations.

halfcat•8mo ago

It’s almost as if intelligence requires embodiment in a physical world before it can gain tacit understanding and build a model of that environment, and that summarizing the embodiment of others isn’t sufficient.

bilsbie•8mo ago

I wonders if there’s past symbolic reasoning research we could integrate into LLMs. They’re really good at parsing text and understanding the relationships between objects ie getting the “symbols” correct.

Maybe we plug into something like prolog (or other such strategies?)

sroussey•8mo ago

That’s called nuero-symbolic AI and there are people working on it.

mcswell•8mo ago

Discussion of this article by Gary Marcus: https://garymarcus.substack.com/p/a-knockout-blow-for-llms

yalogin•8mo ago

Isn't it a matter of training? This is the way I think about LLMs. They just "learned" so much context that they spit out tokens one after the other based on that context. So if it hallucinates it's because it understood the context wrong or doesn't grasp the nuance in the context. The more complicated the task the higher chance of hallucinations. Now I don't know if this can be improved with more training but that is the only tool we have.

bigEnotation•8mo ago

Reasoning models are just wrappers over the base model. It was pretty obvious it wasn’t actually reasoning but rather just refining the results using some kind of reasoning like heuristic. At least that’s what I assumed when they were released and you couldn’t modify the system prompt.

hliyan•8mo ago

I don't understand why this comes as a surprise to a lot of people. Underlying all this is sequences of text tokens converted to semantic vectors, with some positional encoding, and then run through matrix multiplications to compute the probability of the next token. The probability is a function of all the text corpuses the model has previously consumed. You can run this process multiple times, chaining one output into the next one's input, but reason as humans know is unlikely to emerge.

rahimnathwani•8mo ago

Not just wrappers. Some models are fine-tuned with reasoning traces.

MaxPock•8mo ago

Is it a case of being left by the train and now trying to rain on the parade ? Attention span has shifted from the next iPhone release to the newest LLM

avsteele•8mo ago

People are drawing erroneous conclusions from this.

My read of this is that the paper demonstrates that given a particular model (and the problems examined with it) that giving more thought tokens does not help on problems above a certain complexity. It does not say anything about the capabilities of future, larger, models to handle more complex tasks. (NB: humans trend similarly)

My concern is that people are extrapolating from this to conclusions about LLM's generally, and this is not warranted

The only part about this i find even surprising is he abstract's conclusion (1): that 'thinking' can lead to worse outcomes for certain simple problem. (again though, maybe you can say humans are the same here. You can overthink things)

lukev•8mo ago

You can absolutely extrapolate the results, because what this shows is that even when "reasoning" these models are still fundamentally repeating in-sample patterns, and that they collapse when faced with novel reasoning tasks above a small complexity threshold.

That is not a model-specific claim, it's a claim on the nature of LLMs.

For your argument to be true would need to mean that there is a qualitative difference, in which some models possess "true reasoning" capability and some don't, and this test only happened to look at the latter.

avsteele•8mo ago

The authors don't say anything like this that I can see. Their conclusion specifically identifies these as weaknesses of current frontier models.

Furthermore we have clearly seen increases in reasoning from previous frontier models to current frontier models.

If the authors could /did show that both previous-generation and current-generation frontier models hit a wall at similar complexity that would be something, AFAIK they do not.

lowbloodsugar•8mo ago

Human brains work the same way. Some of us are just better at analogy. I’ve worked with plenty of people who were unable to transfer knowledge of one area to another, identical area with different terms.

deepsharp•8mo ago

I guess the authors are making an important point (that challenges the current belief & trend in AI): adding reasoning or thinking to a model (regardless of the architecture or generation)doesn’t always lead to a net gain. In fact, once you factor in compute costs and answer quality across problems of varying complexity, the overall benefit can sometimes turn out to be negative.

rikafurude21•8mo ago

Apple, with all the compute and talent anyone could ask for, did not manage to create a frontier LLM model and was forced to bite the bullet and include GPT in their coveted walled garden in order to keep up and not be left behind in our current AI-Hype-(bubble) mania. I suspect this hurts their ego greatly and thats why they feel compelled to "talk trash". Saltiness wont get them anywhere though. I hope they manage to pull through. It would be a shame if they also went where Microsoft currently is.

Yenrabbit•8mo ago

A few observations: 1) the performance vs complexity curve looks very similar to that for most humans (having seen groups attempt Towers of Hanoi with 5 car tires) haha 2) models can trivially solve some of these tasks when given tools 3) this is an internship paper with some quirks that many mostly dismissed, but is being quoted everywhere as "Apple proves LLMs can't ever reason"

Anyway, fun experiment to test your understanding of these things but don't take any conclusions as gospel :)

a11r•8mo ago

The system prompt in this experiment limits the solution to always spell out the concrete moves verbally. A human solving the Tower of Hanoi gives up around N=4 and goes off to invent a recursive solution instead. Prompted differently, the LLM would solve these puzzles just fine.

Here is my complete review/analysis of the paper: https://www.linkedin.com/pulse/art-abstraction-human-advanta...

edit: fixed typo

bgnn•8mo ago

Who is thinking that LLMs are thinking to begin with. It would have been amazing if they could. All the reasoning etc are just taking a stab at it but far from thinking.

jawiggins•8mo ago

Figure 5 is really quite remarkable. It seems to show that normal LLMs are better at tasks where the correct answer is likely to be the next token. For tasks that require a small number of intermediate steps, current reasoning models do much better, but break down as the number of intermediate steps grow.

This seems to indicate that the next generation of models should focus on recursively solving small parts of the problem before function-calling another model to solve another small part of the problem and working it's answer into the reasoning loop.

Many seem to be citing this paper as an indication that LLMs are over - I think this indicates a clear path towards the next step function change in their abilities.

nrjpoddar•8mo ago

One of the best analysis I found is this blog by Vishal Misra https://medium.com/@vishalmisra/the-illusion-of-thinking-why...

jackson12t•8mo ago

This feels a bit like a weird way to test 'thinking' in models, and reminds me of the old story of Gauss[1] and his classmates being assigned the task of adding up the numbers from 1-100.

I think the way the paper lays out the performance regimes is pretty interesting, but I don't think they achieved their goal of demonstrating that LRMs can't use reasoning to solve complex puzzles organically (without contamination/memorization): IMO testing the model's ability to define an algorithm to solve the puzzle would have been a better evaluation of that (rather than having the model walk through all of the steps manually). I don't know that I'd use an LRM for this sort of long-tail reasoning where it has to follow one single process for a long time over just one prompt; if I needed a really long chain of reasoning I'd use an agent or workflow.

It sounds more like the tests measure a model's ability to reason coherently and consistently over many steps rather than a model's ability to understand and solve a complex puzzle. For example, for the Tower of Hanoi, a prompt like "Define an algorithm that will find the sequence of moves to transform the initial configuration into the goal configuration" (e.g. "find an arithmetic series formula, young Gauss") seems like it would have been a better approach than "Find the sequence of moves to transform the initial configuration into the goal configuration" (e.g. "add up all these numbers"). This is kind of seen in how the study included a step where the LRMs were given the algorithm and then asked to solve the problem, the focus was on an LRM's ability to follow the steps, not their ability to come up with an algorithm/solution on their own.

In a job interview, for example, who among us would accept inability to hold all of the `(2^n) - 1` steps of the Tower of Hanoi in our brain as evidence of poor reasoning ability?

Again, I think it's a really interesting study covering a model's ability to consistently follow a simple process over time in pursuit of a static objective (and perhaps a useful benchmark moving forward), but I'm not confident that it successfully demonstrates a meaninful deficiency in overall reasoning capability.

[1]: https://www.americanscientist.org/article/gausss-day-of-reck...

giardini•8mo ago

Now that y'all have had time to digest this, could you please post an ELI5 for the great unwashed masses?

whiplash451•7mo ago

The worse piece of news is that the performance of thinking and non-thinking models is inverted depending on whether the problem is simple or medium, which implies that you need to know the problem's complexity to choose the best model. Ouch.

I Write Games in C (yes, C)

We Mourn Our Craft

SectorC: A C Compiler in 512 bytes

Hoot: Scheme on WebAssembly

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

Stories from 25 Years of Software Development

The AI boom is causing shortages everywhere else

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Al Lowe on model trains, funny deaths and working with Disney

The Waymo World Model

Reinforcement Learning from Human Feedback

Start all of your commands with a comma (2009)

Vocal Guide – belt sing without killing yourself

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Coding agents have replaced every framework I used

France's homegrown open source online office suite

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

History and Timeline of the Proco Rat Pedal (2021)

Selection Rather Than Prediction

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Learning from context is harder than we thought

Hackers (1995) Animated Experience

Microsoft Account bugs locked me out of Notepad – are Thin Clients ruining PCs?

Making geo joins faster with H3 indexes

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

I Write Games in C (yes, C)

We Mourn Our Craft

SectorC: A C Compiler in 512 bytes

Hoot: Scheme on WebAssembly

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

Stories from 25 Years of Software Development

The AI boom is causing shortages everywhere else

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Al Lowe on model trains, funny deaths and working with Disney

The Waymo World Model

Reinforcement Learning from Human Feedback

Start all of your commands with a comma (2009)

Vocal Guide – belt sing without killing yourself

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Coding agents have replaced every framework I used

France's homegrown open source online office suite

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

History and Timeline of the Proco Rat Pedal (2021)

Selection Rather Than Prediction

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Learning from context is harder than we thought

Hackers (1995) Animated Experience

Microsoft Account bugs locked me out of Notepad – are Thin Clients ruining PCs?

Making geo joins faster with H3 indexes

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]

Comments