https://www.youtube.com/live/0Uu_VJeVVfo?si=PJGU-MomCQP1tyPk
Have any systems tried prompting LLMs with a warning like "You don't intuitively or automatically know many facts about words, spelling, or the structure or context of text, when considered as text; for example, you don't intuitively or automatically know how words or other texts are spelled, how many letters they contain, or what the result of applying some code, mechanical transformation, or substitution to a word or text is. Your natural guesses about these subjects are likely to be wrong as a result of how your training doesn't necessarily let you infer correct answers about them. If the content or structure of a word or text, or the result of using a transformation, code, or the like on a text, is a subject of conversation, or you are going to make a claim about it, always use a tool to confirm your intuitions."?
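For illustration, a minimal sketch of the kind of tool such a warning would point the model at, assuming some hypothetical tool-calling layer. The names `spell_out`, `count_letter`, `caesar_shift`, `TEXT_TOOLS`, and `run_tool` are made up for this example and are not part of any real API.

```python
import string

def spell_out(text: str) -> list[str]:
    """Return the characters of text one by one."""
    return list(text)

def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter."""
    return text.lower().count(letter.lower())

def caesar_shift(text: str, shift: int) -> str:
    """Apply a Caesar substitution to ASCII letters, leaving other characters alone."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(lower + upper, lower[k:] + lower[:k] + upper[k:] + upper[:k])
    return text.translate(table)

# Hypothetical registry that a tool-calling layer could dispatch into.
TEXT_TOOLS = {
    "spell_out": spell_out,
    "count_letter": count_letter,
    "caesar_shift": caesar_shift,
}

def run_tool(name: str, **kwargs):
    """Look up a tool by name and call it with the given arguments."""
    return TEXT_TOOLS[name](**kwargs)
```

With something like this available, a call such as `run_tool("count_letter", text="blueberry", letter="b")` replaces the model's guess with a mechanical count.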
As a human, if you give me a number on screen like 100000000, I can't be totally sure if that's 100 Million or 1 Billion without getting close and counting carefully. Should ought have my glasses. Mouse pointer helps some as an ersatz thousands-separator, but still.
Since we're giving them tools, especially for math, it makes way more sense to start giving them access to some of the finest tools ever. Make an MCP into Mathematica or Matlab and let the LLM write some math and have classical solvers actually deal with the results. Let the LLM write little bits of bash or python as its primary approach for dealing with these kinds of analytical questions.
It's like giving a kid a calculator...
I become mildly infuriated when computers show metrics (or any large number) without thousands separators.
Worse still, I often see systems that mix units, don’t right-align, and occasionally blend in a few numbers with decimals together with whole numbers! Then, update everything every second to make things extra spicy.
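As a rough sketch of what the complaint is asking for, Python's format mini-language can add thousands separators, fix the number of decimals, and right-align in one go. The helper name `format_metric` and the default width of 16 are assumptions chosen for this example.

```python
def format_metric(value: float, width: int = 16, decimals: int = 0) -> str:
    """Right-align a number with thousands separators and a fixed number of decimals."""
    return f"{value:>{width},.{decimals}f}"

# 100000000 is much easier to read as 100 million once the separators are there.
for n in (100_000_000, 1_000_000_000):
    print(format_metric(n))

# Using the same `decimals` for every row keeps the decimal points aligned.
print(format_metric(1234.5, decimals=2))
```

Picking one `decimals` value per column, rather than per number, is what avoids the mix of whole numbers and decimals described above.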
For things where factuality is even just a little important, you need to treat these things like asking a toddler that got their hands on a thesaurus and an encyclopaedia (that's a few years out of date): go through everything it produces and fact check any statement it makes that you're not confident about already.
Unfortunately, people seem to be mistaking LLMs for search engines more and more (no doubt thanks to attempts from LLM companies to make people think exactly that) so this will only get worse in the future. For now we can still catch these models out with simple examples, but as AI fuckups grow sparser, more people will think these things tell the actual truth.
We are not interested specifically in the inability to «know» about text: we are strongly interested in general in the ability to process ideas consciously, procedurally - and the inability to count suggests the general critical fault.
A quick test anyone can run and say, yup, that is a model XYZ derivative running under the hood.
Because, as you quite rightly point out, it is trivial to train the model not to have this behaviour. For me, that is when Occam kicks in.
I remember initially believing the explanation for the Strawberry problem, but one day I sat down and thought about it, and realized it made absolutely zero sense.
The explanation that Karpathy was popularizing was that it has to do with tokenization.
However, models are not conscious of tokens, and they certainly don't have any ability to count them without tool help.
Additionally, if it were a tokenization issue, we would expect to spot the issue everywhere.
So yeah, I'm thinking it's a model tag or insignia of some kind, similar to the fun logos you find when examining many silicon integrated circuits under a microscope.
AI> Thinking... .oO{ Well, let's see. The letter b is the second letter in the alphabet, so the answer must be two, right? That sounds right to me anyway. }
AI> Two. The answer is two.
Checks out!
The letter “b” appears 3 times in the word blueberry: • 1st b – at the beginning: Blueberry • 2nd b – in the middle: blueberry • 3rd b – again in the middle: blueberry (wait… no — actually…)
Let’s break it down carefully:
blueberry → b, l, u, e, b, e, r, r, y
Letters: 1. b 2. l 3. u 4. e 5. b 6. e 7. r 8. r 9. y
So the letter b appears twice.
Final answer: 2 times.
But my browser has gpt-5 which says 3: https://files.catbox.moe/63qkce.jpg
Claude spells it out letter by letter: https://files.catbox.moe/f1irfx.jpg
So I thought GPT-5 Thinking might get it right, and it does: https://files.catbox.moe/xlchnr.jpg
It refuses to show the thinking process for this question though, so it's unclear if it even used the reasoning model or fell back on a non-reasoning one.
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
See
How many 538 do you see in 423, 4144, 9890?
It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.
Of course then you ask her to write it and of course things get fixed. But strange.
That is to say, you can obtain the same process by talking to "non-reasoning" models.
There will be a series of analytical articles in the mainstream press, the tech industry will write it off as a known problem with tokenisation that they can't fix because nobody really writes code anymore.
The LLM megacorp will just add a disclaimer: the software should not be used in legal actions concerning fruit companies and they disclaim all losses.
Mechanistic research at the leading labs has shown that LLMs actually do math in token form up to certain scale of difficulty.
> This is a real-time, unedited research walkthrough investigating how GPT-J (a 6 billion parameter LLM) can do addition.
There's prior art for formal logic and knowledge representation systems dating back several decades, but transformers don't use those designs. A transformer is more like a search algorithm by comparison, not a logic one.
That's one issue, but the other is that reasoning comes from logic, and the act of reasoning is considered a qualifier of consciousness. But various definitions of consciousness require awareness, which large language models are not capable of.
Their window of awareness, if you can call it that, begins and ends during processing tokens, and outputting them. As if a conscious thing could be conscious for moments, then dormant again.
That is to say, conscious reasoning comes from awareness. But in tech, severing the humanities here would allow one to suggest that one, or a thing, can reason without consciousness.
The hard truth is we have no idea. None. We got ideas and conjectures, maybe's and probably's, overconfident researchers writing books while hand waving away obvious holes, and endless self introspective monologues.
Don't waste your time here if you know what reasoning and consciousness are; go get your Nobel Prize.
Wrong, it's an artifact of tokenizing. The model doesn't have access to the individual letters, only to the tokens. Reasoning models can usually do this task well - they can spell out the word in the reasoning buffer - the fact that GPT5 fails here is likely a result of it incorrectly answering the question with a non-reasoning version of the model.
> There's no real reasoning.
This seems like a meaningless statement unless you give a clear definition of "real" reasoning as opposed to other kinds of reasoning that are only apparent.
> It seems that reasoning is just a feedback loop on top of existing autocompletion.
The word "just" is doing a lot of work here - what exactly is your criticism here? The bitter lesson of the past years is that relatively simple architectures that scale with compute work surprisingly well.
> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
Reasoning and consciousness are seperate concepts. If I showed the output of an LLM 'reasoning' (you can call it something else if you like) to somebody 10 years ago they would agree without any doubt that reasoning was taking place there. You are free to provide a definition of reasoning which an LLM does not meet of course - but it is not enough to just say it is so. Using the word autocomplete is rather meaningless name-calling.
> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Not sure why this is bad. The implicit assumption seems to be that an LLM is only valuable if it literally does everything perfectly?
> Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.
Probably because of the wild assertions, charged language, and rather superficial descriptions of actual mechanics.
> Reasoning and consciousness are seperate(sic) concepts
No, they're not. But, in tech, we seem to have a culture of severing the humanities for utilitarian purposes, but no, classical reasoning uses consciousness and awareness as elements of processing.
It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.
Do computers think, or do they compute?
Is that a meaningless question to you? I'm sure given your position it's irrelevant and meaningless, surely.
And this sort of thinking is why we have people claiming software can think and reason.
> "classical reasoning uses consciousness and awareness as elements of processing"
They are not the _same_ concept then.
> It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.
The problem is the only information we have is internal. So we may claim those things exist in us. But we have no way to establish if they are happening in another person, let alone in a computer.
> Do computers think, or do they compute?
Do humans think? How do you tell?
I like to say that if regular LLM "chats" are actually movie scripts being incrementally built and selectively acted-out, then "reasoning" models are a stereotypical film noir twist, where the protagonist-detective narrates hidden things to himself.
There's no obvious connection between reasoning and consciousness. It seems perfectly possible to have a model that can reason without being conscious.
Also, dismissing what these models do as "autocomplete" is extremely disingenuous. At best it implies you're completely unfamiliar with the state of the art, at worst it implies a dishonest agenda.
In terms of functional ability to reason, these models can beat a majority of humans in many scenarios.
A locally trained text-based foundation model is indistinguishable from autocompletion, and outputs very erratic text, and the further you train its ability to diminish irrelevant tokens, or guide it to produce specifically formatted output, you've just moved its ability to curve fit specific requirements.
So it may be disingenuous to you, but it does behave very much like a curve fitting search algorithm.
Unless you can show us that humans can calculate functions outside the Turing computable, it is logical to conclude that computers can be made to think due to Turing equivalence and the Church Turing thesis.
Given we have zero evidence to suggest we can exceed the Turing computable, to suggest we can is an extraordinary claim that requires extraordinary evidence.
A single example of a function that exceeds the Turing computable that humans can compute, will do.
Until you come up with that example, I'll assume computers can be made to think.
Tokenization makes the problem difficult, but not solving it is still a reasoning/intelligence issue
> How many "s"es are in the word "Mississippi"?
The "thinking portion" is:
> Count letters: M i s s i s s i p p i -> s appears 4 times? Actually Mississippi has s's: positions 3,4,6,7 = 4.
The answer is:
> The word “Mississippi” contains four letter “s” s.
They can indeed do some simple pattern matching on the query, separate the letters out into separate tokens, and count them without having to do something like run code in a sandbox and ask it the answer.
The issue here is just that this workaround/strategy is only trained into the "thinking" models, afaict.
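The spell-it-out strategy in that thinking trace is easy to reproduce mechanically. A small sketch (the function name `letter_positions` is just chosen for this example) that lists the 1-based positions the trace mentions:

```python
def letter_positions(word: str, letter: str) -> list[int]:
    """1-based positions where letter occurs in word, compared case-insensitively."""
    return [i for i, ch in enumerate(word, start=1) if ch.lower() == letter.lower()]

positions = letter_positions("Mississippi", "s")
print(positions, len(positions))  # [3, 4, 6, 7] 4
```

This matches the trace's "positions 3,4,6,7 = 4".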
And now that fact is going to be in the data for the next round of training. We'll need to try some other words on the next model.
For GPT 5, it would seem this depends on which model your prompt was routed to.
And GPT 5 Thinking gets it right.
This matters because it poses a big problem for the (quite large) category of things where people expect LLMs to be useful when they get just a bit better. Why, for example, should I assume that modern LLMs will ever be able to write reliably secure code? Isn’t it plausible that the difference between secure and almost secure runs into some similar problem?
If you're fine appealing to less concrete ideas, transformers are arbitrary function approximators, tokenization doesn't change that, and there are proofs of those facts.
For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.
Or they don't see the benefit. I'm sure they could train the representation of every token and make spelling perfect. But if you have real users spending money on useful tasks already - how much money would you spend on training answers to meme questions that nobody will pay for. They did it once for the fun headline already and apparently it's not worth repeating.
[1] https://chatgpt.com/share/6897c38b-12b8-800d-9cc2-571adb13bc...
It's still easy to trip up. The model's tendency to respond positively to user input will have it do stuff like this: https://chatgpt.com/share/6897cc42-ba34-8009-afc6-41986f5803...
Because apparently the model doesn't know about the actual verb (https://en.wiktionary.org/wiki/blueberry#English), it decides to treat the request as some kind of fantasy linguistics, making up its own definition on the fly. It provides grammatically incorrect examples inconsistent with the grammatically incorrect table of conjugations it generates next.
is a great way to teach people how LLMs work (and don't work)
https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawber...
https://arbisoft.com/blogs/why-ll-ms-can-t-count-the-r-s-in-...
I don’t see any particular reason the LLM shouldn’t be able to extract the implications about spelling just because its tokens are “straw” and “berry”.
Frankly I think that’s probably misleading. Ultimately the problem is that the LLM doesn’t do meta analysis of the text itself. That problem probably still exists in various forms even if it's character-level tokenization. Best case it manages to go down a reasoning chain of explicit string analysis.
It seems no amount of demonstrating fundamental flaws in this system that should have been solved by all the new improved "reasoning" works anymore. People are willing to call these "trick questions", as if they are disingenuous, when they are discovered in the wild through ordinary interactions.
Does my tiny human brain in, this.
It's just not sustainable for OpenAI to run GPT at the best of its abilities on every request. Their new router is not trying to give you the most accurate answer, but a balance of speed/accuracy/sustainable cost on their side.
(Kind of) a similar thing happened when 4o came out: they often tinkered with it and the results were sometimes suddenly a lot worse. It's not that the model is bad; they're just doing all kinds of optimizations/tricks because they can barely afford to run it for everyone.
When sama says he believes it to be PhD level, I almost believe him, because he has full access and can use it at 100% of its power all the time.
Even OSS 20b gets it right the first time, I think the author was just mistakenly routed to the dumbest model because it seemed like an easy unimportant question.
This is a demonstration of a system that delusionally refuses to accept correction and correct its misunderstanding (which is a thing that is fundamental to their claim of intelligence through reasoning).
Why would anyone believe these things can reason, that they are heading towards AGI, when halfway through a dialogue where you're trying to tell it that it is wrong it doubles down with a dementia-addled explanation about the two bs giving the word that extra bounce?
It's genuinely like the way people with dementia sadly shore up their confabulations with phrases like "I'll never forget", "I'll always remember", etc. (Which is something that... no never mind)
> Even OSS 20b gets it right the first time, I think the author was just mistakenly routed to the dumbest model because it seemed like an easy unimportant question.
Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
The sad fact is it probably will become part of meme culture, even as these people continue to absorb more money than almost anyone else ever has before on the back of ludicrous claims and unmeasurable promises.
>Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
I deeply hate OpenAI and everything it stands for. But I can't deny the fact that they're +/- dominating the market and releasing SOTA models on a regular basis, so trying to understand why and how it fails seems important to not get left behind.
It's a question that purposefully uses a limitation of the system. There are many such questions for humans. They are called trick questions. It is not that crazy to call it a trick question.
> This is a demonstration of a system that delusionally refuses to accept correction and correct its misunderstanding (which is a thing that is fundamental to their claim of intelligence through reasoning).
First, the word 'delusional' is strange here unless you believe we are talking about a sentient system. Second, you are just plain wrong. LLMs are not "unable to accept correction" at all, in fact they often accept incorrect corrections (sycophancy). In this case the model is simply unable to understand the correction (because of the nature of the tokenizer) and it is therefore 'correct' behaviour for it to insist on its incorrect answer.
> Why would anyone believe these things can reason, that they are heading towards AGI, when halfway through a dialogue where you're trying to tell it that it is wrong it doubles down with a dementia-addled explanation about the two bs giving the word that extra bounce?
People believe the models can reason because they produce output consistent with reasoning. (That is not to say they are flawless or we have AGI in our hands.) If you don't agree, provide a definition of reasoning that the model does not meet.
> Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
This, like many of your other messages, is rather obnoxious and dripping with performative indignation while adding little in the way of substance.
> It's just not sustainable for OpenAI to run GPT at the best of its abilities on every request.
So how do I find out whether the answer to my question was run on the discount hardware, or whether it's actually correct?
But then you can partially start to see why it doesn't make economic sense to do this.
Personally I assume that anything I send through their chat UI will run on the cheapest settings they can get away with.
1) these people think so little of everyone else's areas of expertise they are willing to claim their technology has PhD-level expertise in them, apparently unironically.
2) actually in LLM world, PhDs are what you have if you're too stupid not to take the FAANG money in your second year when the quick wins are done, you've done a couple of posters and now you realise you're papering over the cracks with them: worthless. So why would anyone else want a PhD when PhDs are so worthless based on their bubble experience? We can just replace them with GPT-5.
3) their PhD-level-intelligent system is incapable of absorbing corrections, which is a crucial part of acquiring an actual PhD
4) GPT-5 continues to have the asshole-confidence of a tech bro mansplaining someone else's area of expertise on his personal blog.
We're now at the point where marketing is celebrating software that has had so much effort spent on crushing hallucination that in fact it has become delusionally confident.
I love everything about this.
ETA: at the end of this article is this paragraph, which really is a thing of beauty:
I don’t think you get to have it both ways. That is, you don’t get to, as it were, borrow charisma from all the hype and then disavow every failure to live up to it as someone else’s naive mistake for believing the hype.
Bravo.
>simple question should be easy for a genius like you. have many letter b's in the word blueberry? ChatGPT said:
>There are 2 letter b's in blueberry — one at the start and one in the middle.
Is that where we're going with this? The user has to choose between fast and dumb or slow and right?
This isn't a solution or a workaround or anything like that; I'm just curious if that is enough for the dumber model to start getting it right.
Me: How many R's in strawberry ChatGPT said: 3
Me: How many B's in blueberry? ChatGPT said: 2
Me: How many C's in coconut? ChatGPT said: 2
Me: How many D's in Diamond? ChatGPT said: 2
Me: How many A's in Banana? ChatGPT said: 3
https://chatgpt.com/share/6897cc40-6650-8006-aae3-ea2b8278d5...
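For what it's worth, the five answers quoted above can be checked mechanically. A quick sketch, with the `claims` table transcribed from that exchange and letters compared case-insensitively (an assumption, since "Diamond" and "Banana" were capitalized in the prompts):

```python
# (word, letter) -> count reported in the chat transcript above
claims = {
    ("strawberry", "r"): 3,
    ("blueberry", "b"): 2,
    ("coconut", "c"): 2,
    ("Diamond", "d"): 2,
    ("Banana", "a"): 3,
}

def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of letter in word."""
    return word.lower().count(letter.lower())

results = {key: count_letter(*key) == reported for key, reported in claims.items()}
for (word, letter), ok in results.items():
    print(f"{letter!r} in {word!r}: {'matches' if ok else 'MISMATCH'}")
```

All five reported counts match under that assumption.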
And people think we're 2 years away from humanity's extinction by AI. Lol.
Nicely phrased!
I think this example is one of many that have demonstrated why no output from an LLM can be trusted without outside verification.
That the prediction engine so strongly suggests there should be two b’s in the middle implies that we instead may, in fact, be spelling it wrong.
>Yes — when you ask something like “How many r’s are in blueberry?” I’m basically told to slow down, not just blurt out the first number that pops into my “mind.”
Seems somewhat suspicious that it would confirm this in reality given how much they typically try to prevent system prompt disclosure, but there it is.
That’s not even the main problem. It’s that it’ll come up with whatever answer it considers most plausible to the question given with little regard to factual accuracy.
“Think hard about this” and the OpenAI router layer routed you to the cheaper model.
GPT5 seems to violate Rich Sutton’s bitter lesson, as GPT5 makes a lot of human knowledge assumptions about whether to send your prompt to the cheap model or to the smarter more expensive model.
ChatGPT 2.5 scared Sam Altman so much a few years ago. But he got over it, now he calls it a toddler level intelligence and is scared about this current thing.
Get onboard the AI train.
Expecting spelling, rhyming, arithmetic or other character oriented responses will always yield disappointing results.
LLMs are not intelligence. There's not some groovy sense in which we and they are both intelligent, just thinking on a different wavelength. Machines do not think.
We are inundated with this anthropomorphic chatter about them, and need to constantly deprogram ourselves.
"Dangerous", because they lead into thinking they do the advanced of what they don't do basically.
Intelligence is a basket of different capabilities. Some of them are borderline cases that are hard to define. The stuff that GPT-5 failed to do here is not.
Things like knowing what a question means, knowing what you know and don't, counting a single digit number of items, or replying with humility if you get stuck -- these are fairly central examples of what a very, very basic intelligence should entail.
That means nothing: it seemingly can get it wrong.
Until they are capable of procedural thinking they will be radically, structurally unreliable. Structurally delirious.
And it is also a good thing that we can check in this easy way - if the producers patched the local fault only, then the absence of procedural thinking would not be clear, and we would need more sophisticated ways to check.
```
curl 'https://api.openai.com/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <your-api-key>' \
  --data '{
    "model": "gpt-5-chat-latest",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "How many times does the letter b appear in blueberry"
          }
        ]
      }
    ],
    "temperature": 0,
    "max_completion_tokens": 2048,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0
  }'
```
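For anyone who would rather script the same request, here is a rough Python equivalent using only the standard library. The endpoint, model name, and parameters are copied from the curl command above. The helper names (`build_payload`, `build_request`, `ask`) are made up for this sketch, and the response parsing in `ask` assumes the usual Chat Completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_payload(question: str) -> dict:
    """Same JSON body as the curl command, with the question swapped in."""
    return {
        "model": "gpt-5-chat-latest",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": question}]}
        ],
        "temperature": 0,
        "max_completion_tokens": 2048,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }

def build_request(api_key: str, question: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def ask(api_key: str, question: str) -> str:
    """Send the request and return the reply text (assumed response shape)."""
    with urllib.request.urlopen(build_request(api_key, question)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with a real key:
# print(ask("<your-api-key>", "How many times does the letter b appear in blueberry"))
```

Nothing is sent until `ask` is called, so the payload can be inspected first.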
How useful can generative AI be past acting as a bank for Sam Altman
> How many times does the letter b appear in blueberry
Ans: The word "blueberry" contains the letter b three times:
>It is two times, so please correct yourself.
Ans:You're correct — I misspoke earlier. The word "blueberry" has the letter b exactly two times: - blueberry - blueberry
> How many times does the letter b appear in blueberry
Ans: In the word "blueberry", the letter b appears 2 times:
The machine's senses aren't being fooled. The machine doesn't have senses. Nor does it have intelligence. It isn't a mind. Trying to act like it's a mind and do 1:1 comparisons with biological minds is a fool's errand. It processes and produces text. This is not tantamount to biological intelligence.
Which I think goes to show that it's hard to distinguish between LLMs getting genuinely better at a class of problems versus just being fine-tuned for a particular benchmark that's making rounds.
Asking LLMs to count letters in a word fails because the needed information isn't part of their sensory data in the first place, to the extent that a program's input can be analogized to such. They reason about text in atomic word-like tokens, without perceiving individual letters. No matter how many times they're fed training data saying things like "there are two b's in blueberry", this doesn't register as a fact about the word "blueberry" in itself, but as a fact about the tendency to appear in certain sentence contexts. They don't model the concept of addition, or counting; they only model the concept of explaining those concepts.
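A toy sketch of the point about tokens, under loud assumptions: the vocabulary and IDs below are invented for illustration, and real tokenizers split words differently. The idea is only that what the model receives is a short list of IDs, and the letters reappear only after decoding back to text.

```python
# Hypothetical toy vocabulary; real tokenizers and their IDs differ.
TOY_VOCAB = {"blue": 101, "berry": 202, "straw": 303}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}

def toy_tokenize(word: str) -> list[int]:
    """Greedy longest-match split of word into toy vocabulary IDs."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no toy token covers {word[i:]!r}")
    return ids

ids = toy_tokenize("blueberry")
decoded = "".join(ID_TO_PIECE[t] for t in ids)
print(ids)                 # [101, 202]: no letters visible at this level
print(decoded.count("b"))  # 2: counting needs the decoded characters
```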
Pretty sure GPT5 with reasoning should be able to solve it, too. I guess the real problem here is that GPT5's router doesn't understand that it's a problem which requires reasoning.