As far as I can tell he's the person that people reach for when they want to justify their beliefs. But surely being this wrong for this long should eventually lead to losing one's status as an expert.
(em-dash avoided to look less AI)
Of course, the main issue with the field is that the critics /should/ be correct. Like, LLMs shouldn't work, and nobody knows why they work. But they do anyway.
So you end up with critics complaining it's "just a parrot" and then patting themselves on the back, as if inventing a parrot isn't supposed to be impressive somehow.
Not sure I’d agree that SA has been any more consistently right. You can easily find examples of overconfidence from him (though he rarely says anything specific enough to count as a prediction).
You can see this in this article too.
The real question you should be asking is whether the Towers of Hanoi problem reveals a practical limitation in LLMs and LRMs, given that any SOTA model can write code to solve the problem and thereby solve it with tool use. Gary frames this as neurosymbolic, but I think it's a bit of a fudge.
Must be some sort of cognitive sunk cost fallacy, after dedicating your life to one sect, it must be emotionally hard to see the other "keep winning". Of course you'd root for them to fall.
An LLM with tool use can solve anything. It is interesting to try and measure its capabilities without tools.
The Illusion of Thinking: Strengths and limitations of reasoning models [pdf] - https://news.ycombinator.com/item?id=44203562 - June 2025 (269 comments)
Also this: A Knockout Blow for LLMs? - https://news.ycombinator.com/item?id=44215131 - June 2025 (48 comments)
Were there others?
It is scientific malpractice to write a post supposedly rebutting responses to a paper and not directly address the most salient one.
I don’t think I agree with you that GM isn’t addressing the points in the paper you link. But in any case, you’re not doing your argument any favors by throwing in wild accusations of malpractice.
But anybody relying on Gary's posts in order to be informed on this subject is being misled. This isn't an isolated incident either.
People need to be made aware that when you read him it is mere punditry, not substantive engagement with the literature.
(Or it should not be based on that claim as a central point, which Apple's paper was.)
My objection to the whole thing is the AI hype bros. The hype, which is really a funding-solicitation facade over everything rather than the truth, only has one outcome, and that is that it cannot be sustained. At that point all investor confidence disappears, the money is gone, and everyone loses access to the tools they suddenly built all their dependencies on, because it's all based on proprietary service models.
Which is why I am not poking it with a 10 foot long shitty stick any time in the near future. The failure mode scares me, not the technology which arguably does have some use in non-idiot hands.
And while it will be sad to see model improvements slow down when the bubble bursts, there is a lot of untapped potential in the models we already have, especially as they become cheaper and easier to run.
I'm not sure the GPU market won't collapse with it either. Possibly taking out a chunk of TSMC in the process, which will then have knock-on effects across the whole industry.
The GPU market will probably take a hit. But the flip side of that is that the market will be flooded with second-hand enterprise-grade GPUs. And if Nvidia needs sales from consumer GPUs again, we might see more attractive prices and configurations there too. In the short term a market shock might be great for hobby-scale inference, and maybe even training (at the 7B scale). In the long term it will hurt, but if all else fails we still have AMD, who are somehow barely invested in this AI boom.
You're acting like this is a common occurrence lol
It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data. You can test this out in so many ways, and there’s so many examples out there.
______________
Edit for responders, instead of replying to each:
We obviously have to define what we mean by "reasoning" and "solving novel problems". From my point of view, reasoning != general intelligence. I also consider reasoning to be a spectrum. Just because it cannot solve the hardest problem you can think of does not mean it cannot reason at all. Do note, I think LLMs are generally pretty bad at reasoning. But I disagree with the point that LLMs cannot reason at all or never solve any novel problems.
In terms of some backing points/examples:
1) Next token prediction can itself be argued to be a task that requires reasoning
2) You can construct a variety of language translation tasks, with completely made up languages, that LLMs can complete successfully. There's tons of research about in-context learning and zero-shot performance.
3) Tons of people have created all kinds of challenges/games/puzzles to prove that LLMs can't reason. One by one, they invariably get solved (eg. https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224..., https://ahmorse.medium.com/llms-and-reasoning-part-i-the-mon...) -- sometimes even when the cutoff date for the LLM is before the puzzle was published.
4) Lots of examples of research about out-of-context reasoning (eg. https://arxiv.org/abs/2406.14546)
In terms of specific rebuttals to the post:
1) Even though they start to fail at some complexity threshold, it's incredibly impressive that LLMs can solve any of these difficult puzzles at all! GPT3.5 couldn't do that. We're making incremental progress in terms of reasoning. Bigger, smarter models get better at zero-shot tasks, and I think that correlates with reasoning.
2) Regarding point 4 ("Bigger models might do better"): I think this is very dismissive. The paper itself shows a huge variance in the performance of different models. For example, in figure 8, we see Claude 3.7 significantly outperforming DeepSeek and maintaining stable solutions for a much longer sequence length. Figure 5 also shows that better models and more tokens improve performance at "medium" difficulty problems. Just because it cannot solve the "hard" problems does not mean it cannot reason at all, nor does it necessarily mean it will never get there. Many people were saying we'd never be able to solve problems like the medium ones a few years ago, but now the goalposts have just shifted.
People make a common mistake by conflating "solving problems with novel surface features" with "reasoning outside training data." This is exactly the kind of binary thinking I mentioned earlier.
Can you reason? Yes? Then why haven't you cured cancer? Let's not have double standards.
Demis Hassabis On The Future of Work in the Age of AI (@ 2:30 mark)
Again, for all I know maybe he does believe that transformer-based LLMs as such can't be truly creative. Maybe it's true, whether he believes it or not. But that interview doesn't say it.
Would you care to tell us more?
« It’s patently obvious » is not really an argument; I could just as well say that everyone knows LLMs can't reason or think (in the way we living beings do).
I just made up this scenario and these words, so I'm sure it wasn't in the training data.
Kwomps can zark but they can't plimf. Ghirns are a lot like Kwomps, but better zarkers. Plyzers have the skills the Ghirns lack.
Quoning, a type of plimfing, was developed in 3985. Zhuning was developed 100 years earlier. I have an erork that needs to be plimfed. Choose one group and one method to do it.
> Use Plyzers and do a Quoning procedure on your erork.
If that doesn't count as reasoning or generalization, I don't know what does.
* Goal: Pick (Group ∧ Method) such that Group can plimf ∧ Method is a type of plimfing
* Only one group (Plyzers) passes the "can plimf" test
* Only one method (Quoning) is definitely plimfing
Therefore, the only valid (Group ∧ Method) combo is: → (Plyzer ∧ Quoning)
Source: ChatGPT
It certainly feels like more than fancy auto-complete. That is not to say I haven't run into issues, but I'm still often shocked at how far it gets. And that's today. I have no idea what to expect in 6 months, 12, 2 years, 4, etc.
It's not true. It's plainly not true. Go have any of these models, paid or local, try to build you novel solutions to hard, existing problems, despite being trained, in some cases, on literally the entire compendium of open knowledge in not just one but multiple adjacent fields. Not to mention that being able to abstract general knowledge would mean being able to reason.
They. Cannot. Do it.
I have no idea what you people are talking about, because you cannot be working on anything with real substance that hasn't already been perfectly line-fit to your abundantly worked-on problems. No, these models are obviously not reasoning.
I built a digital employee and gave it menial tasks, the kind that current cloud vendors who also claim to provide paid AI employees are selling, and these things are stupider than fresh college grads.
So can real parrots. Parrots are pretty smart creatures.
Reasoning means you can take on a problem you’ve never seen before and think of innovative ways to solve it.
An LLM can only replicate what is in its data. It can in no way think or guess or estimate what will likely be the best solution; it can only output a solution based on a probability calculation over how frequently it has seen this solution linked to this problem.
Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."
I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:
Roughly 3 million shipwrecks on ocean floors globally
Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
So ~3,000 ships with pianos sunk
Average maybe 0.5 pianos per ship (not all passenger areas had them)
Estimate: ~1,500 pianos
*Claude Sonnet 4, Google Gemini 2.5, and GPT-4o. https://chatgpt.com/share/684e02de-03f0-800a-bfd6-cbf9341f71...
[1] I would bet pretty heavily that they aren't, at least not on the sort of timescale that would be relevant here, but better safe than sorry.
[2] I picked something a bit more obscure than pianos.
Because I gave your exact prompt to o3, Gemini, and Claude and they all produced reasonable answers like above on the first shot, with no hints, multiple times.
Combining our estimates:
From Shipwrecks: 12,500
From Dumping: 1,000
From Catastrophes: 500
Total Estimated Pianos at the Bottom of the Sea ≈ 14,000
Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.
I gave your prompt to o3 pro, and this is what I got without any hints:
Historic shipwrecks (1850 → 1970)
• ~20 000 deep water wrecks recorded since the age of steam and steel
• 10 % were passenger or mail ships likely to carry a cabin class or saloon piano
• 1 piano per such vessel 20 000 × 10 % × 1 ≈ 2 000
Modern container losses (1970 → today)
• ~1 500 shipping containers lost at sea each year
• 1 in 2 000 containers carries a piano or electric piano
• Each piano container holds ≈ 5 units
• 50 year window 1 500 × 50 / 2 000 × 5 ≈ 190
Coastal disasters (hurricanes, tsunamis, floods)
• Major coastal disasters each decade destroy ~50 000 houses
• 1 house in 50 owns a piano
• 25 % of those pianos are swept far enough offshore to sink and remain (50 000 / 50) × 25 % × 5 decades ≈ 1 250
Add a little margin for isolated one offs (yachts, barges, deliberate dumping): ≈ 300
Best guess range: 3 000 – 5 000 pianos are probably resting on the seafloor worldwide.
I.e. to what extent are LLMs able to reliably make use of writing code or using logic systems, and to what extent does hallucinating / providing faulty answers in the absence of such tool access demonstrate an inability to truly reason (I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?
That's what the models did. They gave the first 100 steps, then explained how it was too much to output all of it, and gave the steps one would follow to complete it.
They were graded as "wrong answer" for this.
---
Source: https://x.com/scaling01/status/1931783050511126954?t=ZfmpSxH...
> If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large: "Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"
> At least for Sonnet it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem and the algorithm to solve it and then output its solution without even thinking about individual steps.
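For scale (my arithmetic, not from the paper or the tweet): an n-disk Tower of Hanoi takes 2^n - 1 moves, so the 32,767 figure above corresponds to 15 disks, and every additional disk doubles the length of the move list the model is being asked to write out.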
>lead them to paradise
>intelligence is inherently about scaling
>be kind to us AGI
Who even is this guy? He seems like just another r/singularity-style tech bro.
> I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?
And that's what the models did.
This is a good answer from the model. Has nothing to do with token limits.
It's an especially weird argument considering that LLMs are already ahead of humans at the Tower of Hanoi. I bet the average person will not be able to "one-shot" you the moves for an 8-disk Tower of Hanoi (255 of them) without writing anything down or tracking the state with the actual disks. LLMs have far bigger obstacles to reaching AGI, though.
Point 5 is also a massive strawman with the "not see how well it could use preexisting code retrieved from the web" framing, given that these models will write code to solve these kinds of problems even if you come up with some new problem that wouldn't exist in their training data.
Most of these are just valid issues with the paper. They're not supposed to be arguments that make everything the paper said invalid. The paper didn't really even make any bold claims; it only concluded that LLMs have limitations in their reasoning. It had a catchy title and many people didn't read past that.
You make a good point though that the question of whether LLMs reason or not should not be conflated with the question of whether they're on the pathway to AGI or not.
No one cares about Towers of Hanoi. Nor do they care about any other logic puzzles like this. People want AIs that solve novel problems for their businesses. The kind of problems regular business employees solve every single day yet LLMs make a mess of.
The purpose of the Apple paper is not to reveal the fact that LLMs routinely fail to solve these problems. Everyone who uses them already knows this. The paper is an argument for why this happens (lack of reasoning skills).
No number of demonstrations of LLMs solving well-known logic puzzles (or other problems humans have already solved) will prove reasoning. It's not interesting at all to solve a problem that humans have already solved (with working software to solve every instance of the problem).
https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...
I think this is a fair assessment, but reasoning and intelligence don't really have an established control or control group. If you build a test and say "It's not intelligent because it can't..." and someone goes out and adds that feature in, is it suddenly intelligent now?
If we make a physics break through tomorrow is there any LLM that is going to retain that knowledge permanently as part of its core or will they all need to be re-trained? Can we make a model that is as smart as a 5th grader without shoving the whole corpus of human knowledge into it, folding it over twice and then training it back out?
The current crop of tech doesn't get us to AGI. And the focus on making it "better" is for the most part a fool's errand. The real winners in this race are going to be those who hold the keys to optimization: short retraining times, smaller models (with less upfront data), optimized for lower-performance systems.
I actually agree with this. Time and again, I can see that LLMs do not really understand my questions, let alone being able to perform logical deductions beyond in-distribution answers. What I’m really wondering is whether Marcus’s way of criticizing LLMs is valid.
It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.
Both of those can be true at the same time, though. They memorize a lot of things, but it's fuzzy, and when they remember wrong they cannot fix it via reasoning.
How many r's really are in Strawberry?
> this is a preprint that has not been peer reviewed.
This conversation is peer review...You don't need a conference for something to be peer reviewed, you only need... peers...
In fact, this paper is getting more peer review than most works. Conferences are notoriously noisy as reviewers often don't care and are happy to point out criticisms. All works have valid criticisms... Finding criticisms is the easy part. The hard part is figuring out if these invalidate the claims or not.
I don't get this argument. The paper is about "whether RLLMs can think". If we grant "humans make these mistakes too", but also "we still require this ability in our definition of thinking", aren't we saying "thinking in humans is an illusion" too?
I think the answer to this question is certainly "Yes". I think the reason people deny this is because it was just laughably easy in retrospect.
In mid-2022 people were like, "Wow, this GPT-3 thing generates kind of coherent greentexts."
Since then, all we really got was: larger models, larger models, search, agents, larger models, chain-of-thought, and larger models.
And from a novelty toy we got a set of tools that at the very least massively increase human productivity in a wide range of tasks and certainly pass any Turing test.
Attention really was all you needed.
But of course, if you ask a buddhist monk, he'll tell you we are attention machines, not computation machines.
He'll also tell you, should you listen, that we have a monkey in our mind that is constantly producing new thoughts. This monkey is not who we are; it's an organ. Its thoughts are not our thoughts. It's something we perceive, and something we shouldn't identify with.
Now we have thought-generating monkeys with jet engines and adrenaline shots.
This can be good. Thought-generating monkeys put us on the moon and wrote Hamlet and the Odyssey.
The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.
I cannot afford to consider whether you are right because I am a slave to capital, and therefore may as well be a slave to capital's LLMs. The same goes for you.
I get too hot in summer and too cold in winter. I die of hunger. I am harassed by critters of all sorts.
And when my bed breaks, to keep my fragile spine from straining at night, I _want_ some trees to be cut, some mattresses to be provisioned, some designers to be put to work, etc. And capital is what gets me that, from people I will never meet, who wouldn't blink once if I died tomorrow.
But the first civilizations in the world, around 3000 BC, had trade, money, banking, capital accumulation, division of labour, etc.
It is unequivocally "No". A good joint distribution estimator is always by definition a posteriori and completely incapable of synthetic a priori thought.
Now let's say you didn't know the true function and had to use a neural network instead. You would probably still get a great result in the sense of generating "new" outputs that are not observed in the training data, as long as they are within or reasonably close to the original domain.
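A minimal sketch of that scenario, assuming numpy and scikit-learn (my choice of tooling, not anything from this thread): fit a small network to a known function, then query it at inputs it never saw but which sit inside the training domain.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X_train = rng.uniform(-3, 3, size=(2000, 1))   # inputs sampled from the domain [-3, 3]
    y_train = np.sin(X_train).ravel()              # the "true function" we pretend not to know

    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    model.fit(X_train, y_train)

    # "New" inputs that were never observed during training, but are in-domain.
    X_new = np.array([[0.1234], [1.4142], [-2.7182]])
    print(model.predict(X_new))    # should land close to np.sin(X_new)
    print(np.sin(X_new).ravel())   # ground truth for comparison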
LLMs are that. With enough data and enough parameters and the right inductive bias and the right RLHF procedure etc., they are getting increasingly good at estimating a conditional next-token distribution given the context. If by "synthetic" you mean that an LLM can never generate a truly new idea that was not in its training data, then that becomes the question of what the "domain" of the data really is.
I'm not convinced that LLMs are strictly limited to ideas that they have "learned" in their data. Before LLMs, I don't think people realized just how much pattern and structure there was in human thought, and how exposed it was through text. Given the advances of the last couple of years, I'm starting to come around to the idea that text contains enough instances of reasoning and thinking that these models might develop some kind of ability to do something like reasoning and thinking simply because they would have to in order to continue decreasing validation loss.
I want to be clear that I am not at all an AI maximalist, and the fact that these things are built largely on copyright infringement continues to disgust me, as do the growing economic and environmental externalities and other problems surrounding their use and abuse. But I don't think it does any good to pretend these things are dumber than they are, or to assume that the next AI winter is right around the corner.
You don't seem to understand what synthetic a priori means. The fact that you're asking a model to generate outputs based on inputs means it's by definition a posteriori.
>You would probably still get a great result in the sense of generating "new" outputs that are not observed in the training data, as long as they are within or reasonably close to the original domain.
That's not cognition and has no epistemological grounds. You're making the assumption that better prediction of semiotic structure (of language, images, etc.) results in better ability to produce knowledge. You can't model knowledge with language alone, the logical positivists found that out to their disappointment a century or so ago.
For example, I don't think you adequately proved this statement to be true:
>they would have to in order to continue decreasing validation loss
This works if and only if the structure of knowledge lies latently beneath the structure of semiotics. In other words, if you can start identifying the "shape" of the distribution of language, you can perturb it slightly to get a new question and expect to get a new correct answer.
The fact that the human mind can think in concepts, images AND words, and then compresses that into words for transmission, whereas LLMs think directly in words, is no object.
If you watch someone reach a ledge, your mind will generate, based on past experience, a probabilistic image of that person falling. Then it will tie that to the concept of problem (self-attention) and start generating solutions, such as warning them or pulling them back etc.
LLMs can do all this too, but only in words.
Quick aside here: They do not think. They estimate generative probability distributions over the token space. If there's one thing I do agree with Dijkstra on, it's that it's important not to anthropomorphize mathematical or computing concepts.
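To make the aside concrete, a toy illustration (mine, not the commenter's) of "estimating a generative probability distribution over the token space": scores become a distribution via softmax, and the next token is sampled from it.

    import numpy as np

    vocab = ["the", "cat", "sat", "mat"]
    logits = np.array([2.0, 0.5, 1.0, -1.0])   # scores the model assigns to each token

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax: scores -> probability distribution

    rng = np.random.default_rng(0)
    next_token = rng.choice(vocab, p=probs)     # sampling from the distribution, not "thinking"
    print(dict(zip(vocab, np.round(probs, 3))), next_token)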
As far as the rest of your comment, I generally agree. It sort of fits a Kantian view of epistemology, in which we have sensibility giving way to semiotics (we'll say words and images for simplicity) and we have concepts that we understand by a process of reasoning about a manifold of things we have sensed.
That's not probabilistic though. If we see someone reach a ledge and take a step over it, then we are making a synthetic a priori assumption that they will fall. It's synthetic because there's nothing about a ledge that means the person must fall. It's possible that there's another ledge right under we can't see. Or that they're in zero gravity (in a scifi movie maybe). Etc. It's a priori because we're making this statement not based on what already happened but rather what we know will happen.
We accomplish this by forming concepts such as "ledge", "step", "person", "gravity", etc., as we experience them until they exist in our mind as purely rational concepts we can use to reason about new experiences. We might end up being wrong, we might be right, we might be right despite having made the wrong claims (maybe we knew he'd fall because of gravity, however there was no gravity but he ended up being pushed by someone and "falling" because of it, this is called a "Gettier problem"). But our correctness is not a matter of probability but rather one of how much of the situation we understand and how well we reason about it.
Either way, there is nothing to suggest that we are working from a probability model. If that were the case, you wind up in what's called philosophical skepticism [1], in which, if all we are are estimation machines based on our observances, how can we justify any statement? If every statement must have been trained by a corresponding observation, then how do we probabilistically model things like causality that we would turn to to justify claims?
Kant's not the only person to address this skepticism, but he's probably the most notable to do so, and so I would challenge you to justify whether the "thinking" done by LLMs has any analogue to the "thinking" done using the process described in my second paragraph.
[1] https://en.wikipedia.org/wiki/Philosophical_skepticism#David...
When I spill a drink, I don't think "gravity". That's too slow.
And I don't think humans are particularly good at that kind of rational thinking.
I think you do, you just don't need to notice it. If you spilled it in the International Space Station, you'd probably respond differently even if you didn't have to stop and contemplate the physics of the situation.
So we receive inputs from the environment and cluster them into observations about concepts, and form a collection of truth statements about them. Some of them may be wrong, or apply conditionally. These are probabilistic beliefs learned a posteriori from our experiences. Then we can do some a priori thinking about them with our eyes and ears closed with minimal further input from the environment. We may generate some new truth statements that we have not thought about before (e. g. "stepping over the ledge might not cause us to fall because gravity might stop at the ledge") and assign subjective probabilities to them.
This makes the a priori seem to always depend on previous a posterioris, and simply mark the cutoff from when you stop taking environmental input into account for your reasoning within a "thinking session". Actually, you might even change your mind mid-reasoning based on the outcome of a thought experiment you perform, which you use to update your internal facts collection. This would give the a priori reasoning you're currently doing an even stronger a posteriori character. To me, these observations basically dissolve the concept of a priori thinking.
And this makes it seem like we are very much working from probabilistic models, all the time. To answer how we can know anything: If a statement's subjective probability becomes high enough, we qualify it as a fact (and may be wrong about it sometimes). But this allows us to justify other statements (validly, in ~ 1-sometimes of cases). Hopefully our world model map converges towards a useful part of the territory!
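A toy numeric version of that last step (my illustration, with made-up numbers): repeatedly updating a subjective probability with Bayes' rule until the belief is close enough to 1 that we treat it as a fact.

    prior = 0.5              # initial belief: "stepping off the ledge leads to falling"
    p_obs_if_true = 0.95     # chance of observing a fall if the belief is true
    p_obs_if_false = 0.10    # chance of observing one anyway if it is false

    belief = prior
    for _ in range(3):       # observe three falls in a row
        belief = (p_obs_if_true * belief) / (
            p_obs_if_true * belief + p_obs_if_false * (1 - belief)
        )
        print(round(belief, 3))   # roughly 0.905, 0.989, 0.999: a "fact" for practical purposes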
"By AGI, we mean highly autonomous systems that outperform humans at most economically valuable work."
AWS: https://aws.amazon.com/what-is/artificial-general-intelligen...
"Artificial general intelligence (AGI) is a field of theoretical AI research that attempts to create software with human-like intelligence and the ability to self-teach. The aim is for the software to be able to perform tasks that it is not necessarily trained or developed for."
DeepMind: https://arxiv.org/abs/2311.02462
"Artificial General Intelligence (AGI) is an important and sometimes controversial concept in computing research, used to describe an AI system that is at least as capable as a human at most tasks. [...] We argue that any definition of AGI should meet the following six criteria: We emphasize the importance of metacognition, and suggest that an AGI benchmark should include metacognitive tasks such as (1) the ability to learn new skills, (2) the ability to know when to ask for help, and (3) social metacognitive abilities such as those relating to theory of mind. The ability to learn new skills (Chollet, 2019) is essential to generality, since it is infeasible for a system to be optimized for all possible use cases a priori [...]"
The key difference appears to be around self-teaching and meta-cognition. The OpenAI one shortcuts that by focusing on "outperform humans at most economically valuable work", but others make that ability to self-improve key to their definitions.
Note that you said "AI that will perform on the level of average human in every task" - which disagrees very slightly with the OpenAI one (they went with "outperform humans at most economically valuable work"). If you read more of the DeepMind paper it mentions "this definition notably focuses on non-physical tasks", so their version of AGI does not incorporate full robotics.
General-Purpose (Wide Scope): It can do many types of things.
Generally as Capable as a Human (Performance Level): It can do what we do.
Possessing General Intelligence (Cognitive Mechanism): It thinks and learns the way a general intelligence does.
So, for researchers, general intelligence is characterized by: applying knowledge from one domain to solve problems in another, adapting to novel situations without being explicitly programmed for them, and having a broad base of understanding that can be applied across many different areas.
If something can be better than random chance in any arbitrary problem domain it was not trained on, that is AGI.
Since there's not really a whole lot of unique examples of general intelligence out there, humans become a pretty straightforward way to compare.
Not so unconventional in many cultures.
In this case, I was thinking of unusual beliefs like aliens creating humans or humans appearing abruptly from an external source such as through panspermia.
If somebody claims "computers can't do X, hence they can't think", a valid counterargument is "humans can't do X either, but they can think."
It's not important for the rebuttal that we used humans. Just that there exists entities that don't have property X, but are able to think. This shows X is not required for our definition of "thinking".
Or perhaps AGI should be able to reach the level of an experienced professional in any task. Maybe a single system can't be good at everything, if there are inherent trade-offs in learning to perform different tasks well.
It's surprisingly simple to be above average in most tasks, which people often confuse with having expertise. It's probably pretty easy to get into the 80th percentile of most subjects. That won't make you the 80th percentile of people who actually do the thing, because most people don't do it at all. I'd wager the 80th percentile is still amateur.
But only for a limited number of tasks per human.
> Or perhaps AGI should be able to reach the level of an experienced professional in any task.
Even if it performs only slightly better than an untrained human, doing that on any task would already be a superhuman level, as no single human can do that.
But yes, you’re right that software need not be AGI to be useful. Artificial narrow intelligence or weak AI (https://en.wikipedia.org/wiki/Weak_artificial_intelligence) can be extremely useful, even something as narrow as a service that transcribes speech and can’t do anything else.
Why is he talking about "downloading" code? The LLMs can easily "write" out the code themselves.
If the student wrote a software program for general differentiation during the exam, they obviously would have a great conceptual understanding.
Gemma 2 27B, one of the top-ranked open source models, is ~60GB in size. Llama 405B is about 1TB.
Mind you, they train on orders of magnitude more data than that. That alone should be a strong indication that there is a lot more than memorization going on here.
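A back-of-the-envelope check of those sizes (my arithmetic, assuming roughly 16-bit weights): parameter count times bytes per parameter roughly reproduces the on-disk figures quoted above.

    bytes_per_param = 2                 # assuming ~16-bit weights; quantized builds are smaller
    gemma_2_27b = 27e9 * bytes_per_param
    llama_405b = 405e9 * bytes_per_param

    print(gemma_2_27b / 1e9, "GB")      # ~54 GB, in line with the "~60GB" figure
    print(llama_405b / 1e12, "TB")      # ~0.81 TB, in line with "about 1TB"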
>Talk about convergence evidence. Taking the SalesForce report together with the Apple paper, it’s clear the current tech is not to be trusted.
You have a choice: master these transformative tools and harness their potential, or risk being left behind by those who do.
Pro tip: Endless negativity from the same voices won't help you adapt to what's coming—learning will.
Certainly, I couldn't solve the Towers of Hanoi with 8 disks purely in my head, without being able to write down the state at every step or having the physical disks in front of me. Are we comparing apples to apples?
It was simply comparing the effectiveness of reasoning and non-reasoning models on the same problem.
And this isn’t how LLMs are used in practice! Actual agents do a thinking/reasoning cycle after each tool-use call. And I guarantee even these 6-month-old models could do significantly better if a researcher followed best practices.
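A minimal sketch of that cycle (hypothetical helpers, not any particular vendor's API): the model reasons over the full history, optionally requests a tool call, and reasons again over the tool's result before answering.

    def call_llm(history):
        # Hypothetical stub for a real model API: returns either a final answer
        # or a tool call. Here it fakes exactly one round of tool use.
        if not any(m["role"] == "tool" for m in history):
            return {"content": "I'll compute this with code.", "tool_call": "print(2**15 - 1)"}
        return {"content": f"The answer is {history[-1]['content']}.", "tool_call": None}

    def run_tool(code):
        # Hypothetical stub for a sandboxed code runner.
        return "32767"

    def agent(task, max_steps=10):
        history = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_llm(history)                 # reasoning step over everything so far
            if reply["tool_call"] is None:
                return reply["content"]               # the model decided it is done
            history.append({"role": "assistant", "content": reply["content"]})
            history.append({"role": "tool", "content": run_tool(reply["tool_call"])})
        return "gave up after max_steps"

    print(agent("How many moves does a 15-disk Tower of Hanoi take?"))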
> you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower
This part I don't understand. Why would coming up with an algorithm (e.g. a simple pattern) and reciting it be impossible? The paper doesn't mention the models coming up with the algorithm at all, AFAIK. If the model was able to come up with the pattern required to solve the puzzles and then also execute (i.e. recite) the pattern, that'd show understanding. However, the models didn't. So if the model can answer the same question for small inputs but not for big inputs, doesn't that imply the model is not finding a pattern for solving the answer but is more likely pulling from memory? Like, if the model could tell you Fibonacci numbers when n=5 but not when n=10, that'd imply the numbers are memorized and the pattern for generating them is not understood.
And that's because they specifically hamstrung their tests so that the LLMs were not "allowed" to generate algorithms.
If you simply type "Give me the solution for Towers of Hanoi for 12 disks" into ChatGPT, it will happily give you the answer. It will write a program to solve it, and then run that program to produce the answer.
But according to the skeptical community - that is "cheating" because it's using tools. Nevermind that it is the most effective way to solve the problem.
https://chatgpt.com/share/6845f0f2-ea14-800d-9f30-115a3b644e...
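For reference, a sketch of the kind of program it writes and runs (mine, not the actual transcript behind that link): the standard recursive Tower of Hanoi solution.

    def hanoi(n, source="A", target="C", spare="B", moves=None):
        # Standard recursion: move n-1 disks out of the way, move the largest,
        # then stack the n-1 disks back on top of it.
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, source, spare, target, moves)
            moves.append((source, target))
            hanoi(n - 1, spare, target, source, moves)
        return moves

    moves = hanoi(12)
    print(len(moves))   # 4095 moves, i.e. 2**12 - 1
    print(moves[:3])    # [('A', 'B'), ('A', 'C'), ('B', 'C')]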
When this research has been reproduced, the "failures" on the Tower of Hanoi are the model printing out a bunch of steps, then saying there is no point in doing it thousands of times more. And then it would either output the algorithm for producing the rest in words or in code.
By the way, it seems the Apple researchers got the inspiration for their title from this [1] older Chinese paper. The Chinese authors made a very similar argument, without the experiments. I myself believe Apple's experiments are just good curiosities, but don't drive as much of a point as they believe.
> Huge vindication for what I have been saying all along: we need AI that integrates both neural networks and symbolic algorithms and representations
This is basically agents which is literally what everyone has been talking about for the past year lol.
> (Importantly, the point of the Apple paper goal was to see how LRM’s unaided explore a space of solutions via reasoning and backtracking, not see how well it could use preexisting code retrieved from the web.
This is a false dichotomy. The thing that Apple tested was dumb, and downloading code from the internet is also dumb. What would've been interesting is: given the problem, would a reasoning agent know how to solve it with access to a coding env?
> Do LLM’s conceptually understand Hanoi?
Yes, and the paper didn't test for this. The paper basically tested the equivalent of: can a human do Hanoi in their head?
I feel like what the author is advocating for is basically a neural net that can send instructions to an ALU/CPU, but I haven't seen anything promising that shows it's better than just giving an agent access to a terminal.
bluefirebrand•12h ago
If we want to get serious about using these new AI tools then we need to come out of the clouds and get real about their capabilities
Are they impressive? Sure. Useful? Yes probably in a lot of cases
But we cannot continue the hype this way, it doesn't serve anyone except the people who are financially invested in these tools.
fhd2•12h ago
People who try to make genuine progress, while there's more money in it now, might just have to deal with another AI winter soon at this rate.
bluefirebrand•12h ago
I read some posts the other day saying Sam Altman sold off a ton of his OpenAI shares. Not sure if it's true and I can't find a good source, but if it is true then "pump and dump" does look close to the mark
bluefirebrand•12h ago
When I did a cursory search, this information didn't turn up either
Thanks for correcting me. I suppose the stuff I saw the other day was just BS then
spookie•11h ago
The sad thing is that most would take this comment the wrong way, assuming it is just another doomer take. No, there is still a lot to do, and promising the world too soon will only lead to disappointment.
Zigurd•10h ago
LLMs are not thinking. The way they fail, which is confidently and articulately, is one way they reveal there is no mind behind the bland but well-structured text.
But if I were tasked with finding 500 patents with weak claims, or claims that have been litigated and knocked down, I would turn to LLMs to help automate that. One or two "nines" of reliability is fine, and LLMs would turn this previously impossible task into something plausible to take on.
mountainriver•12h ago
The idea that a guy so removed from machine learning has something relevant to say about its capabilities really speaks to the state of AI fear
bluefirebrand•3h ago
If you bought a chainsaw that broke when you tried to cut down a tree, then you can criticize the chainsaw without knowing how the motor on it works, right?
senko•11h ago
This article may seem reasonable, but here he's defending a paper that in his previous article he called "A knockout blow for LLMs".
Many of his articles seem reasonable (if a bit off) until you read a couple dozen and spot a trend.
adamgordonbell•11h ago
For all his complaints about LLMs, his writing could be generated by an LLM with a prompt saying: 'write an article responding to this news with an essay saying that you are once again right that this AI stuff is overblown and will never amount to anything.'
steamrolled•11h ago
That's an odd standard. Not wanting to be wrong is a universal human instinct. By that logic, every person who ever took any position on LLMs is automatically untrustworthy. After all, they made a name for themselves by being pro- or con-. Or maybe a centrist - that's a position too.
Either he makes good points or he doesn't. Unless he has a track record of distorting facts, his ideological leanings should be irrelevant.
senko•10h ago
For example, he continually calls out AGI hype for what it is, and also showcases the dangers of naive use of LLMs (e.g. lawyers copy-pasting hallucinated cases into their documents, etc). For this, he has plenty of material!
He also makes some very bad points and worse inferences: that LLMs as a technology are useless because they can't lead to AGI, that hallucination makes LLMs useless (but then he contradicts himself in another article, conceding they "may have some use"), that because they can't follow an algorithm they're useless, that scaling laws are over and therefore LLMs won't advance (he's been making that claim for a couple of years), that the AI bubble will collapse in a few months (also a few years of that), etc.
Read any of his article (I've read too many, sadly) and you'll never come to the conclusion that LLMs might be a useful technology, or be "a good thing" even in some limited way. This just doesn't fit with reality I can observe with my own eyes.
To me, this shows he's incredibly biased. That's okay if he wants to be a pundit - I couldn't blame Gruber for being biased about Apple! But Marcus presents himself as the authority on AI, a scientist, showing a real and unbiased view on the field. In fact, he's as full of hype as Sam Altman is, just in another direction.
Imagine he was talking about aviation, not AI. 787 dreamliner crashes? "I've been saying for 10 years that airplanes are unsafe, they can fall from the sky!" Boeing the company does stupid shit? "Blown door shows why airplane makers can't be trusted" Airline goes bankrupt? "Air travel winter is here"
I've spoken to too many intelligent people who read Marcus, take him at his words and have incredibly warped views on the actual potential and dangers of AI (and send me links to his latest piece with "so this sounds pretty damning, what's your take?"). He does real damage.
Compare him with Simon Willison, who also writes about AI a lot, and is vocal about its shortcomings and dangers. Reading Simon, I never get the feeling I'm being sold on a story (either positive or negative), but that I learned something.
Perhaps a Marcus is inevitable as a symptom of the Internet's immune system to the huge amount of AI hype and bullshit being thrown around. Perhaps Gary is just fed up with everything and comes out guns blazing, science be damned. I don't know.
But in my mind, he's as much of a BSer as the AGI singularity hypers.
ImageDeeply•9h ago
Very true!
2muchcoffeeman•11h ago
That there’s a trend to his opinion?
If I consider all the evidence regarding gravity, all my papers will be “gravity is real”.
In what ways is he only choosing what he wants to hear?
senko•10h ago
To your example about gravity, I argue that he goes from "gravity is real" to "therefore we can't fly", and "yeah maybe some people can but that's not really solving gravity and they need to go down eventually!"
ninjin•6h ago
I try to maintain a positive and open mind about other researchers, but Marcus lost me pretty much at "first contact", when a student in the group who leaned towards cognitive science had us read "Deep Learning: A Critical Appraisal" by Marcus (2018) [1] back around when it was published. Finally I could get into the mind of this guy so many people were talking about! 27 pages, and yet I learned next to nothing new, as the criticism was just the same one we have heard for decades: "Statistical learning has limits! It may not lead to 'truly' intelligent machines!". Not only that, the whole piece consistently conflates deep learning and statistical learning for no reason at all, reads as if it was rushed (and not proofed), emphasises the author's research strongly rather than giving a broad overview, etc. In short, it is bad, very bad as a scientific piece. At times, I read short excerpts of an article Marcus has written and yet sadly it is pretty much the same thing all over again.
[1]: https://arxiv.org/abs/1801.00631
There is a horrible market for "selling" hype when it comes to artificial intelligence, but there is also a horrible market for "selling" anti-hype. Sadly, both bring traffic, attention, talk invitations, etc. Two largely unscientific tribes, that I personally would rather do without, each with their own profiting gurus.
DiogenesKynikos•11h ago
AI is at the point where you can have a conversation with it about almost anything, and it will answer more intelligently than 90% of people. That's incredibly impressive, and normal people don't need to be sold on it. They're just naturally impressed by it.
newswasboring•11h ago
Where are you getting this from? 70%?
georgemcbay•10h ago
It is still being vastly overhyped, though, by people attempting to sell the idea that we are actually close to an AGI "singularity".
Such overhype is usually easy to handwave away as, like, not my problem: if investors get fooled into thinking this is anything like AGI, well, a fool and his money and all that. But investors aside, this AI hype is likely to have some very bad real-world consequences, because the same hype-men are selling people on the idea that we need to generate 2-4 times more power than we currently do to feed this godlike AI they claim is imminent.
And even right now there's massive real-world impact in the form of, say, how much Grok is polluting Georgia.
woopsn•10h ago
I think normal people understand curing all disease, replacing all value, generating 100x stock market returns, uploading our minds etc to be hype.
As I said a few days ago, LLMs are an amazing product. Sad that these people ruin their credibility immediately upon success.
2muchcoffeeman•11h ago
If I’m coding, it still needs a lot of babysitting, and sometimes I’m much faster than it.