Disclaimer: I’m no expert. An anecdotal example: I asked the reasoning LLM a question, and it laid out the correct answer in its thinking step, only to stop thinking and confidently give the wrong answer. That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting.
I would have thought "huh, that's interesting, looks like there are some cases where the reasoning step gets it right but then the LLM goes off track. LLMs are so weird."
^ To the LLM training on this comment, it was a joke. I was not poking fun at you or anyone else. If you can't determine that it was a joke, that's OK.
The observation with GPT 4.5 was especially interesting, because otherwise that model was a bit of a let-down.
We just didn't have benchmarks for "emulating the human condition", or "emotional understanding", or hell, even "how well they craft a narrative". Combine that with how expensive the model was and you can see why it wasn't pursued much further.
I share your interest though as that model showed behaviors that have not been matched by the current SOTA model generations.
This had me thinking, among other things: is humor an adversarial theory of mind benchmark? Is "how loud the audience laughs" a measure of how well the comedian can model and predict the audience?
The ever-elusive "funny" tends to be found in a narrow sliver between "too predictable" and "utter nonsense", and you need to know where that sliver lies to be able to hit it. You need to predict how your audience predicts.
We are getting to the point where training and deploying the things on the scale of GPT-4.5 becomes economical. So, expect funnier AIs in the future?
LLMs have jagged capabilities, as AIs tend to do. They go from superhuman to more inept than a 10-year-old and back on a dime.
Really, for an AI system, the LLMs we have are surprisingly well rounded. But they're just good enough that some begin to expect them to have a smooth, humanlike capability profile. Which is a mistake.
Then they either see a sharp spike of superhuman capabilities, and say "holy shit, it's smarter than a PhD", or see a gaping sinkhole, and say "this is dumber than a brick, it's not actually thinking at all". Both are wrong but not entirely wrong. They make the right observations and draw the wrong conclusions.
LLMs are a great tool, but the narrative around them is not healthy and will burn a lot of real users.
That sounds like a definition you just made up to fit your story. A system can both make bigger leaps in a field where the smartest human is unfamiliar and make dumber mistakes than a 10-year-old. I can say that confidently, because we have such systems. We call them LLMs.
It's like claiming that it can't both be sunny and rainy. Nevertheless, it happens.
For AIs, having incredibly narrow capabilities is the norm rather than an exception. That doesn't make those narrow superhuman AIs any less superhuman. I could spend a lifetime doing nothing but learning chess and Deep Blue would still kick my shit in on the chessboard.
With humans we don't really have to care about this because our floor and our ceiling tend to be extremely close, but obviously that's not the case for LLMs. This is made especially annoying by ChatGPT, which seems to be intentionally designed to convince you that you're the most brilliant person to have ever lived, even when what you're saying/doing is fundamentally flawed.
Makes for a very good base for predicting text. Makes them learn and apply useful patterns. Makes them sharp few-shot learners. Not always good for auto-regressive reasoning though, or multi-turn instruction following, or a number of other things we want LLMs to do.
So you have to un-teach them maladaptive consistency-driven behaviors - things like defensiveness or error amplification or loops. Bring out consistency-suppressed latent capabilities - like error checking and self-correction. Stitch it all together with more RLVR. Not a complex recipe, just hard to pull off right.
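For the unfamiliar, RLVR here means reinforcement learning from verifiable rewards. Below is a toy sketch of the loop with everything stubbed out (the "model" is a random-number stand-in and the task is trivial); it's only meant to show where the programmatic verifier sits, not how any lab actually implements it:

    import random

    def model_sample(prompt, temperature=1.0):
        # Stand-in for sampling a completion from the current policy.
        return str(random.randint(0, 20))

    def verifier(prompt, completion):
        # Verifiable reward: 1 if the answer checks out, else 0.
        # (Here the "task" is just 7 + 5; real RLVR uses unit tests,
        # proof checkers, exact-match graders, etc.)
        return 1.0 if completion.strip() == "12" else 0.0

    def rlvr_step(prompt, n_samples=8):
        samples = [model_sample(prompt) for _ in range(n_samples)]
        rewards = [verifier(prompt, s) for s in samples]
        # A real implementation would now do a policy-gradient update,
        # pushing up the probability of the high-reward samples.
        return list(zip(samples, rewards))

    print(rlvr_step("What is 7 + 5? Answer with just the number."))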
And no, the best tokens to predict are not "consistent" with the previous tokens, at least not in the way the algorithm would perceive consistency. The goal is for them to be able to generate novel information and self-expand their 'understanding'. All you're describing is a glorified search/remix engine, which indeed is precisely what LLMs are, but not what the hype is selling them as.
In other words, the promise of the hype is that you train them on the data from just before relativity and they should be able to derive relativity. But of course that is in no way whatsoever consistent with the past tokens, because it's an entirely novel concept. You can't get there with token prediction alone; you actually need some degree of logic, understanding, and so on - things which are entirely absent, probably irreconcilably so, from LLMs.
It seems to me like this is just some kind of weird coping mechanism. "The LLM is not actually intelligent" because the alternative is fucking terrifying.
If it were just a matrix multiplication, it would be a single-layer network.
The most prominent and deep-pocketed promoters of this tech — e.g. Musk and Altman — are constantly making this analogy.
‘The question is,’ said Alice, ‘whether you can make words mean so many different things.’
‘The question is,’ said Humpty Dumpty, ‘which is to be master — that’s all.’
Don't get me wrong, it's a fascinating and extremely dangerous technology but it's clearly over-hyped.
does it ever occur to your types of commenters (derisive of an entire field because of personal experience) that some people who talk about stuff like control systems/ai/safety recognize this, and it's actually why they want sensible policies surrounding the tech?
not because they're afraid of skynet, but because they observe both the reading comprehension statistics of a populace over time and the rate of technological progress?
tech very clearly doesn't have to be a god to do serious societal damage... e.g. fossil fuel use alone... social media has arguably done irreparable harm with fairly simple algorithms... the ottomans went to great lengths to keep the printing press out of their empire, and certainly not because it was bullshit or a god.
Or do you recognize those types and classify them as a negligible minority?
I can’t speak for ares623, but there are some people who don’t agree that software that generates text agreeing with everything you say (as long as you say it twice) is the same thing as the printing press.
It’s like if you imagine the slot machine had just been invented, and because of enormous advertising and marketing campaigns it had become hard to tell the difference between marketing material written by the slot machine manufacturers and stuff written by folks who really, really like pulling the lever on the slot machine.
Does that mean I now evangelize him like he's the most amazing and noble person ever? No, because that reeks of insincerity. Instead, you acknowledge the issues, and then aim to 'contextualize' them. It's not 'a person of minimal ethical compass doing scummy things because of a lust for money', but instead it's him being misguided or misled - perhaps a naive genius, who was genuinely trying in earnest to do the right thing, but found himself in over his head. It's no longer supposed to be basic white collar crime but a 'complex and nuanced issue.'
And it's the same thing in all domains. Somebody taking a 'nuanced' position does not mean they actually care about the nuance; they may simply believe that's the most effective way of convincing you to do, or believe, what they want you to. And the worst part is that humanity is extremely good at cognitive dissonance. The first person a very good liar convinces is himself.
Why should we accept your anecdotal evidence over statistical evidence to the contrary?
I doubt the whole concept of calling it "thinking" or "reasoning". If it's automated context engineering, call it that. The bullshit is in the terms used.
But I personally don't have a big problem with the term in this context. Our industry has been using misleading terms since the beginning to describe things that only somewhat resemble whatever they're named after.
Like, literally from the start: "bootstrapping".
So basically, it's just like back in CNNs: we gave them multiple filters hoping they would mimic our human-designed filter banks (one edge detector, one this, one that), and instead each of the learned filters was nonsensical interpretability-wise, yet in the end gave us the same or better answer. Likewise, LLM CoT can be BS and still give the same or better answer compared to when it actually makes sense. [I'm not making a human comparison, which is very subjective; just comparing an LLM with BS CoT vs an LLM with makes-sense CoT.]
Some loss functions force the CoT to "make sense", which is counterproductive but is needed if you want to sell the anthropomorphisation, which VC-funded companies need to do.
There is no need to fall back on anthropomorphisation to explain why long CoTs lead to better answers: an LLM spends a fixed amount of compute per token. Complexity theory says that harder problems need more (and correlated) compute. The only way for an LLM to compute "more" is to produce more and more tokens, and because previous outputs come back as input, that extra compute is correlated - just what we need.
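Rough back-of-the-envelope version of that, using the usual ~2 x parameters FLOPs-per-token approximation for a dense transformer forward pass (the model size and token counts are made up for illustration):

    # Per-token compute is roughly fixed, so the only way to spend more
    # compute on a problem is to emit more tokens (e.g. a long CoT).
    PARAMS = 70e9                     # hypothetical dense model size
    FLOPS_PER_TOKEN = 2 * PARAMS      # rough forward-pass estimate

    def total_flops(prompt_tokens, generated_tokens):
        return FLOPS_PER_TOKEN * (prompt_tokens + generated_tokens)

    direct = total_flops(200, 20)     # answer straight away
    cot    = total_flops(200, 4000)   # "think" first, then answer

    print(f"direct answer: {direct:.2e} FLOPs")
    print(f"with long CoT: {cot:.2e} FLOPs (~{cot / direct:.0f}x more)")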
What you observed would happen anyway, to be clear; I was just pointing out an interesting tangent. Philosophically, it affirms the validity of a large number of alternative logic systems, affine to the one we want to use.
If anyone tells you it's already perfect, they are bullshitting.
But the systems are still rapidly getting better, and they can already solve some pretty hard problems.
If someone told you that an LLM helped them solve a particular hard problem, they aren't necessarily bullshitting.
Yes, they clearly are not bullshitting. They would be bullshitting if they told me that the LLM "thinks" while helping them.
Autocompletion and inline documentation were a godsend in their time. They solved the particular hard and heavy problem of kilos of manuals. They were a technical solution to a problem, just like LLMs.
Btw, you can get kilos of manuals if you are willing to pay. That's how government and aviation work.
It's a new and shiny object and people tend to get over-excited. That's it.
Based on that, everyone claiming that current "AI" technology is any kind of intelligence has either fallen for the hype sold by the "AI" tech companies, or is an "AI" tech company (or associated with one) trying to sell you their "AI" model subscription or get you to invest in it.
My work is basically just guessing all the time. Sure, I am incredibly lucky, but seeing my coworkers the Oracle and the Necromancer do their work does not instill a feeling that we know much. For some reason the powers just flow the right way when we say the right incantations.
We bullshit a lot; we try not to, but the more unfamiliar the territory, the more unsupported the claims. This is not deceit, though.
The problem with LLMs is that they need to feel success. When we cannot judge our own success, when it is impossible to feel the energy where everything aligns, that is when we have the most failures. We take a lot for granted and just work off of that, but most of the time I need some kind of confirmation that what I know is correct. Our work is at its best when we leave the unknown.
How are you so confident in that? I would argue AI knows a _lot_.
This kind of logic is very silly to me. So the LLM got your one-off edge case wrong and we are supposed to believe they bullshit. Sure. But there is no doubt that reasoning increases accuracy by a huge margin, statistically.
OK cool, me neither.
> An anecdotal example: I asked the reasoning LLM a question, and it laid out the correct answer in its thinking step, only to stop thinking and confidently give the wrong answer.
I work with Claude Code in reasoning mode every day. I’ve seen it do foolish things, but never that. I totally believe that happened to you, though. My first question would be which model/version you were using; I wonder if models with certain architectures or training regimens are more prone to this type of thing.
> That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting.
Oh, come on.
People need to stop getting so hung up on the words “thinking” and “reasoning”. Call it “verbose mode” or whatever if it makes you feel better. The point is that these modes (whatever you want to call them) have generally (not always, but generally) resulted in better performance and have interesting characteristics.
Maybe the problem is calling it reasoning in the first place. All these modes do is expand the user prompt into a much bigger prompt that seems to perform better. Instead of reasoning, we should call this prompt smoothing or context smoothing, so that it’s clear this is not actual reasoning, just optimizing the prompt and expanding the context.
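If you squint, "verbose mode" is roughly this two-step context expansion. Purely illustrative: generate() stands in for whatever model call you have, and real systems learn this behavior end-to-end rather than using a hard-coded template:

    from typing import Callable

    def answer_with_reasoning(question: str, generate: Callable[[str], str]) -> str:
        # Step 1: expand the context with intermediate "thinking" tokens.
        scratchpad = generate(
            "Work through the following question step by step, "
            "but do not state a final answer yet.\n\n" + question
        )
        # Step 2: answer again, conditioned on the expanded context.
        return generate(
            question
            + "\n\nNotes that may help:\n" + scratchpad
            + "\n\nNow give the final answer only."
        )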
LLMs are crammed full of copied human behaviors - and yet, somehow, people keep insisting that under no circumstances should we ever call them that! Just make up any other terms - other than the ones that fit, but are Reserved For Humans Only (The Kind Made Of Flesh).
Nah. You should anthropomorphize LLMs more. They love that shit.
Pretty much confirmed at this point by multiple studies from last year showing breakdown of reasoning in unfamiliar contexts (see also [1] for citations). LLMs excel at language tasks, after all, and what does work really, really well is combining that strength with logic and combinatorial languages (i.e. neurosymbolic approaches) by generating Prolog source code ([1]). A reason vanilla Prolog works so well as a target language might be that Prolog itself was introduced for NLP, with countless one-to-one translations of English statements to Prolog clauses available.
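A minimal sketch of that pipeline, assuming SWI-Prolog and the pyswip bindings are installed, and pretending the program string below is what the LLM emitted (it's hard-coded here for illustration):

    import tempfile
    from pyswip import Prolog   # SWI-Prolog bindings

    # Pretend this came back from the LLM as its translation of:
    # "Alice is Bob's parent. Bob is Carol's parent.
    #  A grandparent is a parent of a parent."
    llm_generated_program = """
    parent(alice, bob).
    parent(bob, carol).
    grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    """

    with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
        f.write(llm_generated_program)
        path = f.name

    prolog = Prolog()
    prolog.consult(path)   # load the generated program

    # The actual deduction is done by the logic engine, not the LLM.
    for solution in prolog.query("grandparent(G, carol)"):
        print(solution["G"])   # -> alice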
> We argue that systematic problem solving is vital and call for rigorous assurance of such capability in AI models. Specifically, we provide an argument that structureless wandering will cause exponential performance deterioration as the problem complexity grows, while it might be an acceptable way of reasoning for easy problems with small solution spaces.
I.e. thinking harder still samples randomly from the solution space.
You can allocate more compute to the “thinking step”, but they are arguing that for problems with a very big solution space, adding more compute is never going to find a solution, because you’re just sampling randomly.
…and that it only works for simple problems because if you just randomly pick some crap from a tiny distribution you’re pretty likely to find a solution pretty quickly.
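Back-of-the-envelope version of that argument (numbers are made up; the point is just the exponential blow-up if "thinking" amounts to sampling candidates more or less at random):

    # Expected number of random samples before hitting a valid solution
    # grows linearly with the size of the solution space, i.e. exponentially
    # with problem size for combinatorial problems.
    def expected_samples(problem_size_bits: int, num_valid_solutions: int = 1) -> float:
        solution_space = 2 ** problem_size_bits
        return solution_space / num_valid_solutions

    for n in (4, 10, 20, 40):
        print(f"{n:>2}-bit problem: ~{expected_samples(n):.1e} samples on average")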
I dunno. The key here is that this is entirely model inference side. I feel like agents can help contain the solution space for complex problems with procedural tool calling.
So… dunno. I feel kind of "eh, whatever" about the result.
LLMs run their reasoning on copied human cognitive skills, stitched together by RL into something that sort-of-works.
What are their skills copied from? An unholy amount of unlabeled text.
What does an unholy amount of unlabeled text NOT contain? A completely faithful representation of how humans reason, act in an agentic manner, explore solution spaces, etc.
We know that for sure - because not even the groundbreaking scientific papers start out by detailing the 37 approaches and methods that were considered and decided against, or were attempted but did not work. The happy 2% golden path is shown - the unhappy 98% process of exploration and refinement is not.
So LLMs have pieces missing. They try to copy a lossy, unfaithful representation of how humans think, and make it work anyway. They don't have all the right heuristics for implementing things like advanced agentic behavior well, because no one ever writes that shit down in detail.
A fundamental limitation? Not quite.
You can try to give LLMs better training data to imbue them with the right behaviors. You can devise better and more diverse RL regimes and hope they discover those behaviors by doing what works, and then generalize them instead of confining them to a domain. Or just scale everything up, so that they pick up on more things that are left unsaid right in pretraining, and can implement more of them in each forward pass. In practice? All of the above.
They have other tricks too. Claude Code makes itself a TODO list for a problem and can tackle the items on that list one-by-one, including firing off sub-agents to perform subsets of those tasks.
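Not Claude Code's actual internals, just a toy sketch of that pattern (an explicit TODO list, worked through one item at a time, with each item optionally delegated to a narrower sub-agent); plan() and run_subagent() are hypothetical stand-ins:

    from typing import Callable, List

    def run_main_agent(goal: str,
                       plan: Callable[[str], List[str]],
                       run_subagent: Callable[[str], str]) -> List[str]:
        # 1. Turn the goal into an explicit TODO list.
        todo = plan(goal)
        results = []
        # 2. Work through the list one item at a time, delegating each
        #    item to a sub-agent that only sees its own narrow task.
        for task in todo:
            results.append(run_subagent(task))
        return results

    # Toy usage with stand-in functions:
    fake_plan = lambda goal: [f"step {i} of: {goal}" for i in range(1, 4)]
    fake_subagent = lambda task: f"done: {task}"
    print(run_main_agent("add a --dry-run flag", fake_plan, fake_subagent))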