GPT-5: "How many times does the letter b appear in blueberry?"

https://bsky.app/profile/kjhealy.co/post/3lvtxbtexg226

204•minimaxir•1d ago

https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...

Comments

HsuWL•1d ago

I love this test. Demonstrates the "understanding" process of the language model.

axdsk•1d ago

“It’s like talking to a PhD level expert” -Sam Altman

https://www.youtube.com/live/0Uu_VJeVVfo?si=PJGU-MomCQP1tyPk

jcgrillo•1h ago

There must be smart people at openai who believe in what they're doing and absolutely cringe whenever this clown opens his mouth... like, I hope?

smsm42•1h ago

A lot of people confuse access to information with being smart. Because for humans it correlates well - usually the smart people are those that know a lot of facts and can easily manipulate them on demand, and the dumb people are those that can not. LLMs have the unique capability of being both very knowledgeable (as in, able to easily access vast quantities of information, way beyond the capabilities of any human, PhD or not) and very dumb, the way a kindergarten kid wouldn't be. It totally confuses all our heuristics.

schoen•1d ago

These are always amazing when juxtaposed with apparently impressive LLM reasoning, knowledge, and creativity. You can trivially get them to make the most basic mistakes about words and numbers, and double down on those mistakes, repeatedly explaining that they're totally correct.

Have any systems tried prompting LLMs with a warning like "You don't intuitively or automatically know many facts about words, spelling, or the structure or context of text, when considered as text; for example, you don't intuitively or automatically know how words or other texts are spelled, how many letters they contain, or what the result of applying some code, mechanical transformation, or substitution to a word or text is. Your natural guesses about these subjects are likely to be wrong as a result of how your training doesn't necessarily let you infer correct answers about them. If the content or structure of a word or text, or the result of using a transformation, code, or the like on a text, is a subject of conversation, or you are going to make a claim about it, always use a tool to confirm your intuitions."?

mikestorrent•1d ago

This is a great idea. Like, if someone asked me to count the number of B's in your paragraph, I'd yeet it through `grep -o 'B' file.txt | wc -l` or similar, why would I sit there counting it by hand?

As a human, if you give me a number on screen like 100000000, I can't be totally sure if that's 100 Million or 1 Billion without getting close and counting carefully. Should ought have my glasses. Mouse pointer helps some as an ersatz thousands-separator, but still.

Since we're giving them tools, especially for math, it makes way more sense to start giving them access to some of the finest tools ever. Make an MCP into Mathematica or Matlab and let the LLM write some math and have classical solvers actually deal with the results. Let the LLM write little bits of bash or python as its primary approach for dealing with these kinds of analytical questions.

It's like giving a kid a calculator...
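For illustration, a minimal Python sketch of the same idea as the `grep -o 'B' file.txt | wc -l` one-liner above: a tiny deterministic helper that a model could call instead of guessing. The function name `count_letter` and its `ignore_case` parameter are made up for this example and are not part of any real tool API.

```python
def count_letter(text, letter, ignore_case=False):
    """Count non-overlapping occurrences of `letter` in `text`."""
    if ignore_case:
        # Normalize both sides so 'B' and 'b' are treated as the same letter.
        text, letter = text.lower(), letter.lower()
    return text.count(letter)


print(count_letter("blueberry", "b"))                    # 2
print(count_letter("Blueberry", "b"))                    # 1 (case-sensitive, like grep without -i)
print(count_letter("Blueberry", "b", ignore_case=True))  # 2
```

The case-sensitive default mirrors the grep command, which would also miss a lowercase "b" when asked for 'B'.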

philipwhiuk•1d ago

If you have to build an MCP for every system you aren’t building intelligence in the first place.

viraptor•1d ago

You don't need specialised MCPs for this. In the past you could add "use python" to the ChatGPT prompt and it would do the right thing. This is exactly the intelligent "use the right tool for the right thing" idea. ChatGPT just wasn't trained to apply it in the right circumstances automatically.

dsadfjasdf•1d ago

Why? just cause? analogize it to the human brain.

jcgrillo•1h ago

Why does it matter? I don't care whether it's intelligent, I just need it to be useful. In order to be useful it needs to start fucking up less, stat. In current form it's borderline useless.

tschwimmer•1h ago

Fair criticism, but also this arguably would be preferable. For many use cases it would be strictly better, as you've built some sort of automated drone that can do lots of work but without preferences and personality.

Someone•1h ago

I think a piece of software that can correctly decide what oracle to consult to get answers to questions you give it can be called intelligent, even if it itself doesn’t know any facts.

yahoozoo•41m ago

What if MCP servers were really the neurons we were looking for all along? /s

jiggawatts•1h ago

> As a human, if you give me a number on screen like 100000000, I can't be totally sure if that's 100 Million or 1 Billion without getting close and counting carefully.

I become mildly infuriated when computers show metrics (or any large number) without thousands separators.

Worse still, I often see systems that mix units, don’t right-align, and occasionally blend in a few numbers with decimals together with whole numbers! Then, update everything every second to make things extra spicy.
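As a small sketch of the formatting being asked for here (thousands separators plus right alignment), assuming Python's built-in format mini-language; the helper name `format_metric` and the column width of 15 are arbitrary choices for this example.

```python
def format_metric(value, width=15):
    """Right-align an integer in `width` columns with thousands separators."""
    return f"{value:>{width},}"


for n in (100000000, 1000000000):
    print(format_metric(n))
```

With separators and a shared right edge, 100,000,000 and 1,000,000,000 are distinguishable at a glance because the longer number visibly starts further left.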

ehnto•1d ago

I often tell LLMs to ask questions if required, and that it is a skilled developer who is working alongside me. That seems to help them be more collaborative rather than prescriptive.

philipwhiuk•1d ago

You can’t just prompt your way out of a systemic flaw

skeledrew•1d ago

What's the systematic flaw?

lottin•2h ago

The fact that it can't count.

minimaxir•1h ago

If an LLM can get IMO Gold but can’t count, that’s an issue.

utopcell•1h ago

This particular LLM did not get an IMO Gold.

jeroenhd•1h ago

You don't need to as long as you only use LLMs like these in cases where incorrect output isn't of any consequence. If you're using LLMs to generate some placeholder bullshit to fill out a proof of concept website, you don't care if it claims strawberries have tails, you just need it to generate some vaguely coherent crap.

For things where factuality is even just a little important, you need to treat these things like asking a toddler that got their hands on a thesaurus and an encyclopaedia (that's a few years out of date): go through everything it produces and fact check any statement it makes that you're not confident about already.

Unfortunately, people seem to be mistaking LLMs for search engines more and more (no doubt thanks to attempts from LLM companies to make people think exactly that) so this will only get worse in the future. For now we can still catch these models out with simple examples, but as AI fuckups grow sparser, more people will think these things tell the actual truth.

mdp2021•6m ago

> prompting LLMs with a warning like "You don't intuitively or automatically know many facts about...

We are not interested specifically in the inability to «know» about text: we are strongly interested in general in the ability to process ideas consciously, procedurally - and the inability to count suggests the general critical fault.

Erem•1d ago

With data starvation driving ai companies towards synthetic data I’m surprised that an easily synthesized problem like this hasn’t been trained out of relevance. Yet here we are with proof that it hasn’t

quatonion•1d ago

Are we a hundred percent sure it isn't a watermark that is by design?

A quick test anyone can run and say, yup, that is a model XYZ derivative running under the hood.

Because, as you quite rightly point out, it is trivial to train the model not to have this behaviour. For me, that is when Occam kicks in.

I remember initially believing the explanation for the Strawberry problem, but one day I sat down and thought about it, and realized it made absolutely zero sense.

The explanation that Karpathy was popularizing was that it has to do with tokenization.

However, models are not conscious of tokens, and they certainly don't have any ability to count them without tool help.

Additionally, if it were a tokenization issue, we would expect to spot the issue everywhere.

So yeah, I'm thinking it's a model tag or insignia of some kind, similar to the fun logos you find when examining many silicon integrated circuits under a microscope.

simianwords•1d ago

If you choose the thinking model it doesn’t make this mistake. It means the auto router should be tuned to call the thinking model on edge cases like these.

fathermarz•1d ago

If you are going to release a new set of models and eliminate the old ones, your new smallest model should have capabilities equal to or greater than your old models of similar size. This is not a routing problem. This is a driving the price down of running and scaling the business problem.

simianwords•1d ago

do you not think 4o has the same capabilities as GPT-5 without thinking?

lucas_membrane•1d ago

A couple of weeks ago, I asked google, ordinary google search, how many times the letter r is found in preferred, and it told me 2. This century has taken quite a bitter turn against those of us who think that the 'enough' in 'good enough' ought to exclude products indistinguishable from the most grievously disgraceful products of sloth. But I have also lately realized that human beings, brains, society, culture, education, technology, computers, etc, are all extremely complicated emergent properties of a universe that is far beyond our understanding. And we ought not to complain too seriously, because this, too, shall pass.

jcgrillo•1h ago

The generation leading this world has all the weapons the previous generation built at their disposal and none of the discipline or education to wield them responsibly. This too shall pass, but how it passes will be interesting to see.

zozbot234•1d ago

> How many times does the letter b appear in blueberry?

AI> Thinking... .oO{ Well, let's see. The letter b is the second letter in the alphabet, so the answer must be two, right? That sounds right to me anyway. }

AI> Two. The answer is two.

Checks out!

coldtea•1d ago

Q: How many times does the letter b appear in blueberry?

The letter “b” appears 3 times in the word blueberry: • 1st b – at the beginning: Blueberry • 2nd b – in the middle: blueberry • 3rd b – again in the middle: blueberry (wait… no — actually…)

Let’s break it down carefully:

blueberry → b, l, u, e, b, e, r, r, y

Letters: 1. b 2. l 3. u 4. e 5. b 6. e 7. r 8. r 9. y

So the letter b appears twice.

Final answer: 2 times.

skybrian•1h ago

How did you generate this?

jokoon•1d ago

Maybe it's joking

alkyon•50m ago

Just feigning stupidity

andai•1d ago

My phone still has gpt-4o which gets it right: https://files.catbox.moe/0yg6cu.jpg

But my browser has gpt-5 which says 3: https://files.catbox.moe/63qkce.jpg

Claude spells it out letter by letter: https://files.catbox.moe/f1irfx.jpg

So I thought GPT-5 Thinking might get it right, and it does: https://files.catbox.moe/xlchnr.jpg

It refuses to show the thinking process for this question though, so it's unclear if it even used the reasoning model or fell back on a non-reasoning one.

> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.

https://openai.com/index/introducing-gpt-5-for-developers/

amai•1d ago

Isn't that just an artifact caused by the tokenization of the training and input data?

See

https://platform.openai.com/tokenizer

https://github.com/openai/tiktoken

falcor84•2h ago

Where in the tokenization does the 3rd b come from?

IanCal•1h ago

The tokenisation means they don’t see the letters at all. They see something like this - to convert just some tokens to words

How many 538 do you see in 423, 4144, 9890?
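To make the point above concrete, here is a toy sketch of what "seeing ids instead of letters" looks like. The vocabulary below is entirely hypothetical (the pieces and ids are invented, reusing the illustrative numbers from the comment) and does not come from tiktoken or any real tokenizer; `toy_encode` is a simple greedy longest-match encoder, not the BPE algorithm real models use.

```python
# Hypothetical vocabulary: pieces and ids are made up for illustration only.
TOY_VOCAB = {"blue": 423, "berry": 4144, "straw": 9890, "b": 538}


def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match encoding of `text` into ids from `vocab`."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


print(toy_encode("blueberry"))  # [423, 4144]
print(toy_encode("b"))          # [538]
```

In this toy setup the id for the standalone letter "b" (538) never appears in the encoding of "blueberry", which is the situation the question "How many 538 do you see in 423, 4144?" is describing.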

minimaxir•34m ago

LLMs don’t see token ids, they see token embeddings that map to those ids, and those embeddings are correlated. The hypothetical embeddings of 538, 423, 4144, and 9890 are likely strongly correlated in the process of training the LLM and the downstream LLM should be able to leverage those patterns to solve the question correctly. Even more so since the training process likely has many examples of similar highly correlated embeddings to identify the next similar token.

andrewmcwatters•2h ago

No, it's the entire architecture of the model. There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.

It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.

Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.

Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.

spwa4•2h ago

I had a fun experience recently. I asked one of my daughters how many r's there are in strawberry. Her answer? Two ...

Of course then you ask her to write it and of course things get fixed. But strange.

andrewmcwatters•2h ago

I think that's supposed to be the idea of reasoning functionality, but in practice, it just seems to allow responses to continue longer than they would have otherwise by bisecting the output into warming an output and then using maybe what we would consider cached tokens to assist with further contextual lookups.

That is to say, you can obtain the same process by talking to "non-reasoning" models.

exasperaited•1h ago

In ten years' time an LLM lawyer will lose a legal case for someone who can no longer afford a real lawyer because there are so few left. And it'll be because the layers of bodges in the model caused it to go crazy, insult the judge and threaten to burn down the courthouse.

There will be a series of analytical articles in the mainstream press, the tech industry will write it off as a known problem with tokenisation that they can't fix because nobody really writes code anymore.

The LLM megacorp will just add a disclaimer: the software should not be used in legal actions concerning fruit companies and they disclaim all losses.

Terr_•53m ago

I glumly predict LLMs will end up a bit like asbestos: Powerful in some circumstances, but over/mis-used...

kiratp•1h ago

> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.

Mechanistic research at the leading labs has shown that LLMs actually do math in token form up to certain scale of difficulty.

> This is a real-time, unedited research walkthrough investigating how GPT-J (a 6 billion parameter LLM) can do addition.

https://youtu.be/OI1we2bUseI

dwaltrip•1h ago

Please define “real reasoning”? Where is the distinction coming from?

andrewmcwatters•1h ago

Can we not downvote this, please? It's a good question.

There's prior art for formal logic and knowledge representation systems dating back several decades, but transformers don't use those designs. A transformer is more like a search algorithm by comparison, not a logic one.

That's one issue, but the other is that reasoning comes from logic, and the act of reasoning is considered a qualifier of consciousness. But various definitions of consciousness require awareness, which large language models are not capable of.

Their window of awareness, if you can call it that, begins and ends during processing tokens, and outputting them. As if a conscious thing could be conscious for moments, then dormant again.

That is to say, conscious reasoning comes from awareness. But in tech, severing the humanities here would allow one to suggest that one, or a thing, can reason without consciousness.

Workaccount2•38m ago

There is no model of consciousness or reasoning.

The hard truth is we have no idea. None. We got ideas and conjectures, maybe's and probably's, overconfident researchers writing books while hand waving away obvious holes, and endless self introspective monologues.

Don't waste your time here if you know what reasoning and consciousness are, go get your nobel prize.

nurettin•48m ago

Athenian wisdom suggests that fallacious thought is "unreasonable". So reason is the opposite of that.

tmnvdb•1h ago

> No, it's the entire architecture of the model.

Wrong, it's an artifact of tokenizing. The model doesn't have access to the individual letters, only to the tokens. Reasoning models can usually do this task well - they can spell out the word in the reasoning buffer - the fact that GPT5 fails here is likely a result of it incorrectly answering the question with a non-reasoning version of the model.

> There's no real reasoning.

This seems like a meaningless statement unless you give a clear definition of "real" reasoning as opposed to other kinds of reasoning that are only apparent.

> It seems that reasoning is just a feedback loop on top of existing autocompletion.

The word "just" is doing a lot of work here - what exactly is your criticism here? The bitter lesson of the past years is that relatively simple architectures that scale with compute work surprisingly well.

> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.

Reasoning and consciousness are seperate concepts. If I showed the output of an LLM 'reasoning' (you can call it something else if you like) to somebody 10 years ago they would agree without any doubt that reasoning was taking place there. You are free to provide a definition of reasoning which an LLM does not meet of course - but it is not enough to just say it is so. Using the word autocomplete is rather meaningless name-calling.

Not sure why this is bad. The implicit assumption seems to be that an LLM is only valuable if it literally does everything perfectly?

> Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.

Probably because of the wild assertions, charged language, and rather superficial descriptions of actual mechanics.

andrewmcwatters•1h ago

These aren't wild assertions. I'm not using charged language.

> Reasoning and consciousness are seperate(sic) concepts

No, they're not. But, in tech, we seem to have a culture of severing the humanities for utilitarian purposes, but no, classical reasoning uses consciousness and awareness as elements of processing.

It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.

Do computers think, or do they compute?

Is that a meaningless question to you? I'm sure given your position it's irrelevant and meaningless, surely.

And this sort of thinking is why we have people claiming software can think and reason.

tmnvdb•32m ago

You have again answered with your customary condescension. Is that really necessary? Everything you write is just dripping with patronizing superiority and combative sarcasm.

> "classical reasoning uses consciousness and awareness as elements of processing"

They are not the _same_ concept then.

> It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.

The problem is the only information we have is internal. So we may claim those things exist in us. But we have no way to establish if they are happening in another person, let alone in a computer.

> Do computers think, or do they compute?

Do humans think? How do you tell?

Terr_•57m ago

> There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.

I like to say that if regular LLM "chats" are actually movie scripts being incrementally built and selectively acted-out, then "reasoning" models are a stereotypical film noir twist, where the protagonist-detective narrates hidden things to himself.

antonvs•56m ago

> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.

There's no obvious connection between reasoning and consciousness. It seems perfectly possible to have a model that can reason without being conscious.

Also, dismissing what these models do as "autocomplete" is extremely disingenuous. At best it implies you're completely unfamiliar with the state of the art, at worst it implies a dishonest agenda.

In terms of functional ability to reason, these models can beat a majority of humans in many scenarios.

andrewmcwatters•52m ago

It would require you to change the definition of reasoning, or it would require you to believe computers can think.

A locally trained text-based foundation model is indistinguishable from autocompletion, and outputs very erratic text, and the further you train its ability to diminish irrelevant tokens, or guide it to produce specifically formatted output, you've just moved its ability to curve fit specific requirements.

So it may be disingenuous to you, but it does behave very much like a curve fitting search algorithm.

vidarh•14m ago

> or it would require you to believe computers can think.

Unless you can show us that humans can calculate functions outside the Turing computable, it is logical to conclude that computers can be made to think due to Turing equivalence and the Church Turing thesis.

Given we have zero evidence to suggest we can exceed the Turing computable, to suggest we can is an extraordinary claim that requires extraordinary evidence.

A single example of a function that exceeds the Turing computable that humans can compute, will do.

Until you come up with that example, I'll assume computers can be made to think.

visarga•49m ago

Understanding is always functional, we don't study medicine before going to the doctor, we trust the expert. Like that we do with almost every topic or system. How do you "understand" a company or a complex technological or biological system? Probably nobody does end to end. We can only approximate it with abstractions and reasoning. Not even a piece of code can be understood - without execution we can't tell if it will halt or not.

wongarsu•1h ago

It can spell the word (writing each letter in uppercase followed by a whitespace, which should turn each letter with its whitespace into a separate token). It also has reasoning tokens to use as scratch space, and previous models have demonstrated knowledge of the fact that spelling words is a useful step to counting letters.

Tokenization makes the problem difficult, but not solving it is still a reasoning/intelligence issue

pxc•27m ago

Here's an example of what gpt-oss-20b (at the default mxfp4 precision) does with this question:

> How many "s"es are in the word "Mississippi"?

The "thinking portion" is:

> Count letters: M i s s i s s i p p i -> s appears 4 times? Actually Mississippi has s's: positions 3,4,6,7 = 4.

The answer is:

> The word “Mississippi” contains four letter “s” s.

They can indeed do some simple pattern matching on the query, separate the letters out into separate tokens, and count them without having to do something like run code in a sandbox and ask it the answer.

The issue here is just that this workaround/strategy is only trained into the "thinking" models, afaict.
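The "thinking portion" quoted above is easy to check mechanically. A minimal sketch that reproduces the same two steps (spell the word out letter by letter, then list the 1-based positions of the target letter); the helper name `letter_positions` is made up for this example.

```python
def letter_positions(word, letter):
    """Return the 1-based positions at which `letter` occurs in `word`."""
    return [i for i, ch in enumerate(word, start=1) if ch == letter]


word = "Mississippi"
print(" ".join(word))              # M i s s i s s i p p i
positions = letter_positions(word, "s")
print(positions, len(positions))   # [3, 4, 6, 7] 4
```

This agrees with the model's trace: positions 3, 4, 6 and 7, so four occurrences.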

johnecheck•8m ago

That proves nothing. The fact that Mississippi has 4 "s" is far more likely to be in the training data than the fact that blueberry has 2 "b"s.

And now that fact is going to be in the data for the next round of training. We'll need to try some other words on the next model.

vidarh•23m ago

> It also has reasoning tokens to use as scratch space

For GPT 5, it would seem this depends on which model your prompt was routed to.

And GPT 5 Thinking gets it right.

SpicyLemonZest•1h ago

It clearly is an artifact of tokenization, but I don’t think it’s a “just”. The point is precisely that the GPT system architecture cannot reliably close the gap here; it’s almost able to count the number of Bs in a string, there’s no fundamental reason you could not build a correct number-of-Bs mapping for tokens, and indeed it often gets the right answer. But when it doesn’t you can’t always correct it with things like chain of thought reasoning.

This matters because it poses a big problem for the (quite large) category of things where people expect LLMs to be useful when they get just a bit better. Why, for example, should I assume that modern LLMs will ever be able to write reliably secure code? Isn’t it plausible that the difference between secure and almost secure runs into some similar problem?

hansvm•47m ago

Common misconception. That just means the algorithm for counting letters can't be as simple as adding 1 for every token. The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.

If you're fine appealing to less concrete ideas, transformers are arbitrary function approximators, tokenization doesn't change that, and there are proofs of those facts.

For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.
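A rough sketch of the "mapping from token type to character count" idea, written as an explicit lookup table rather than anything learned in weights. The token ids and pieces are hypothetical (invented for this example, not taken from a real tokenizer), and the function names are likewise made up; the point is only that counting letters over tokens is a sum of per-token counts rather than "add 1 for every token".

```python
# Hypothetical token ids and the text pieces they stand for (illustration only).
TOY_PIECES = {423: "blue", 4144: "berry", 9890: "straw"}


def build_count_table(pieces, letter):
    """Precompute, for each token id, how many times `letter` occurs in its piece."""
    return {token_id: piece.count(letter) for token_id, piece in pieces.items()}


def count_letter_from_ids(token_ids, table):
    """Sum the per-token counts; no access to the original characters is needed."""
    return sum(table[token_id] for token_id in token_ids)


b_table = build_count_table(TOY_PIECES, "b")
r_table = build_count_table(TOY_PIECES, "r")
print(count_letter_from_ids([423, 4144], b_table))   # "blue" + "berry"  -> 2
print(count_letter_from_ids([9890, 4144], r_table))  # "straw" + "berry" -> 3
```

Whether a trained transformer actually stores and uses such a table reliably is exactly what the thread is arguing about; the sketch only shows that the function itself is small and well defined.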

viraptor•4m ago

> They just haven't bothered.

Or they don't see the benefit. I'm sure they could train the representation of every token and make spelling perfect. But if you have real users spending money on useful tasks already - how much money would you spend on training answers to meme questions that nobody will pay for. They did it once for the fun headline already and apparently it's not worth repeating.

dang•2h ago

Url changed from https://bsky.app/profile/kjhealy.co/post/3lvtxbtexg226, which points to this.

minimaxir•2h ago

The reason I submitted the Bluesky post is because the discussion there is more informative (and also multiple instances of confirmation that it’s not a fluke), but the link to both the post and blog is a good compromise.

dang•26m ago

Ok, I'll swap the two - thanks!

wslh•2h ago

Seems like they just fixed it: [1]. A "thinking longer for a better answer" message appeared before giving the answer.

[1] https://chatgpt.com/share/6897c38b-12b8-800d-9cc2-571adb13bc...

jeroenhd•1h ago

Having to activate their more complex "thinking" model every time they need to count letters is pretty silly, but I suppose it does hide the symptoms.

It's still easy to trip up. The model's tendency to respond positively to user input will have it do stuff like this: https://chatgpt.com/share/6897cc42-ba34-8009-afc6-41986f5803...

Because apparently the model doesn't know about the actual verb (https://en.wiktionary.org/wiki/blueberry#English), it decides to treat the request as some kind of fantasy linguistics, making up its own definition on the fly. It provides grammatically incorrect examples inconsistent with the grammatically incorrect table of conjugations it generates next.

ck2•2h ago

The technical explanations of why this happens with strawberry, blueberry and similar are a great way to teach people how LLMs work (and don't work)

https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawber...

https://arbisoft.com/blogs/why-ll-ms-can-t-count-the-r-s-in-...

https://www.runpod.io/blog/llm-tokenization-limitations

minimaxir•1h ago

In this case, tokenization is less effective of a counterargument. If it was one-shot, maybe, but the OP asked GPT-5 several times, with different formatting of blueberry (and therefore different tokens, including single-character tokens), and it still asserted there are 3 b’s.

jncfhnb•1h ago

I don’t find the explanation about tokenization to be very compelling.

I don’t see any particular reason the LLM shouldn’t be able to extract the implications about spelling just because its tokens are “straw” and “berry”.

Frankly I think that’s probably misleading. Ultimately the problem is that the LLM doesn’t do meta analysis of the text itself. That problem probably still exists in various forms even if it’s character-level tokenization. Best case it manages to go down a reasoning chain of explicit string analysis.

exasperaited•1h ago

When Minsky and Papert showed that the perceptron couldn't learn XOR, it contributed to wiping the neural network off the map for decades.

It seems no amount of demonstrating fundamental flaws in this system that should have been solved by all the new improved "reasoning" works anymore. People are willing to call these "trick questions", as if they are disingenuous, when they are discovered in the wild through ordinary interactions.

Does my tiny human brain in, this.

vidarh•9m ago

It doesn't work this time because there are plenty of models, including GPT5 Thinking that can handle this correctly, and so it is clear this isn't a systemic issue that can't be trained out of them.

gok•2h ago

This is like asking a human how many pixels appear in the word "blueberry".

likeclockwork•1h ago

Except a human would say "I don't know" instead of making up some nonsense.

djeastm•30m ago

It's ironic that saying "I don't know" could be the ultimate sign of superior intelligence just like Socrates told us millennia ago.

gnabgib•22m ago

Two and a half millennia ago (he died 2424 years ago)

dcre•1h ago

I think the concrete issue this points to is the thing that dynamically decides when to use reasoning failed to choose it in this instance. Sam Altman said it was broken on release day.

minimaxir•1h ago

Even if it’s pointing to a weaker GPT-5 like gpt-5-nano, it should still be able to answer this question correctly.

CharlesW•1h ago

If you know how GPT architectures work, why would you think this?

minimaxir•1h ago

https://news.ycombinator.com/item?id=44850753

CharlesW•1h ago

Now I'm even more confused why you believe GPTs should be able to math. Even in a contrived example where each "b" gets its own token, there are several reasons why GPTs might not be able to correctly count the number of occurrences of a letter (without invoking a tool, obv).

shadowjones•1h ago

I think a lot of those trick questions outputting stupid stuff can be explained by simple economics.

It's just not sustainable for OpenAI to run GPT at the best of its abilities on every request. Their new router is not trying to give you the most accurate answer, but a balance of speed/accuracy/sustainable cost on their side.

(kind of) a similar thing happened when 4o came out, they often tinkered with it and the results were sometimes suddenly a lot worse, it's not that the model is bad, they're just doing all kinds of optimizations/tricks because they can barely afford to run it for everyone.

When sama says he believes it to have a PhD level, I almost believe him, because he has full access and can use it at 100% of its power all the time.

Even OSS 20b gets it right the first time, I think the author was just mistakenly routed to the dumbest model because it seemed like an easy unimportant question.

exasperaited•1h ago

This is not a demonstration of a trick question.

This is a demonstration of a system that delusionally refuses to accept correction and correct its misunderstanding (which is a thing that is fundamental to their claim of intelligence through reasoning).

Why would anyone believe these things can reason, that they are heading towards AGI, when halfway through a dialogue where you're trying to tell it that it is wrong it doubles down with a dementia-addled explanation about the two bs giving the word that extra bounce?

It's genuinely like the way people with dementia sadly shore up their confabulations with phrases like "I'll never forget", "I'll always remember", etc. (Which is something that... no never mind)

> Even OSS 20b gets it right the first time, I think the author was just mistakenly routed to the dumbest model because it seemed like an easy unimportant question.

Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?

jonnycomputer•1h ago

the extra bounce was my favorite part!

exasperaited•1h ago

I mean if it was a Black Mirror satire moment it would rapidly become part of meme culture.

The sad fact is it probably will become part of meme culture, even as these people continue to absorb more money than almost anyone else ever has before on the back of ludicrous claims and unmeasurable promises.

shadowjones•1h ago

It's a trick question for an artificial intelligence that tokenizes words. Humans have plenty of different weaknesses.

>Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?

I deeply hate OpenAI and everything it stands for. But I can't deny the fact that they're +/- dominating the market and releasing SOTA models on a regular basis. Trying to understand why and how it fails seems important to not get left behind.

minimaxir•1h ago

It’s a more difficult question for LLMs due to tokenization, but far from a trick one. There’s no word play or ambiguity involved.

tmnvdb•1h ago

> This is not a demonstration of a trick question.

It's a question that purposefully uses a limitation of the system. There are many such questions for humans. They are called trick questions. It is not that crazy to call it a trick question.

> This is a demonstration of a system that delusionally refuses to accept correction and correct its misunderstanding (which is a thing that is fundamental to their claim of intelligence through reasoning).

First, the word 'delusional' is strange here unless you believe we are talking about a sentient system. Second, you are just plain wrong. LLMs are not "unable to accept correction" at all; in fact they often accept incorrect corrections (sycophancy). In this case the model is simply unable to understand the correction (because of the nature of the tokenizer) and it is therefore 'correct' behaviour for it to insist on its incorrect answer.

> Why would anyone believe these things can reason, that they are heading towards AGI, when halfway through a dialogue where you're trying to tell it that it is wrong it doubles down with a dementia-addled explanation about the two bs giving the word that extra bounce?

People believe the models can reason because they produce output consistent with reasoning. (That is not to say they are flawless or we have AGI in our hands.) If you don't agree, provide a definition of reasoning that the model does not meet.

> Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?

This, like many of your other messages, is rather obnoxious and dripping with performative indignation while adding little in the way of substance.

pavel_lishin•1h ago

> I think a lot of those trick questions outputting stupid stuff can be explained by simple economics.

> It's just not sustainable for OpenAI to run GPT at the best of its abilities on every request.

So how do I find out whether the answer to my question was run on the discount hardware, or whether it's actually correct?

shadowjones•1h ago

I'd say use the API, search and high reasoning if you want accuracy.

But then you can partially start to see why it doesn't make economic sense to do this.

Personally I assume that anything I send through their chat UI will run on the cheapest settings they can get away with.

exasperaited•1h ago

The extraordinary, beautiful, perfect thing about this is the way it poetically underscores several things about the LLM world:

1) these people think so little of everyone else's areas of expertise they are willing to claim their technology has PhD-level expertise in them, apparently unironically.

2) actually in LLM world, PhDs are what you have if you're too stupid not to take the FAANG money in your second year when the quick wins are done, you've done a couple of posters and now you realise you're papering over the cracks with them: worthless. So why would anyone else want a PhD when PhDs are so worthless based on their bubble experience? We can just replace them with GPT-5.

3) their PhD-level-intelligent system is incapable of absorbing corrections, which is a crucial part of acquiring an actual PhD

4) GPT-5 continues to have the asshole-confidence of a tech bro mansplaining someone else's area of expertise on his personal blog.

We're now at the point where marketing is celebrating software that has had so much effort spent on crushing hallucination that in fact it has become delusionally confident.

I love everything about this.

ETA: at the end of this article is this paragraph, which really is a thing of beauty:

I don’t think you get to have it both ways. That is, you don’t get to, as it were, borrow charisma from all the hype and then disavow every failure to live up to it as someone else’s naive mistake for believing the hype.

Bravo.

jonnycomputer•1h ago

For what it's worth, it got it right when I tried it.

>simple question should be easy for a genius like you. have many letter b's in the word blueberry? ChatGPT said:

>There are 2 letter b's in blueberry — one at the start and one in the middle.

JasonBee•1h ago

To me that makes it worse. Why would two people get wildly different answers to a simple factual observation query.

nostrebored•1h ago

Because of the interplay of how tokenizers work, temperature, and adaptive reasoning? These models aren't fact generators.
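
A rough sketch of the temperature part of this, assuming the model's next-token choice comes down to two candidate answers with close scores. The logits below are hypothetical numbers made up for illustration, not measurements from any real model:

```python
import math
import random

# Hypothetical candidate answers and scores, for illustration only.
CANDIDATES = ["2", "3"]
HYPOTHETICAL_LOGITS = [1.2, 1.0]


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_answer(candidates, logits, temperature, rng):
    """Pick one candidate: greedy at temperature 0, random otherwise."""
    if temperature == 0:
        return candidates[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


rng = random.Random()
answers = [sample_answer(CANDIDATES, HYPOTHETICAL_LOGITS, 1.0, rng) for _ in range(10)]
print(answers)  # typically a mix of "2" and "3", varying from run to run
```

With scores this close, two people sending the same prompt can easily land on different answers. Greedy decoding (temperature 0) removes that randomness but does not make the higher-scoring answer the correct one.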

vaenaes•1h ago

~stochasticity~

nipponese•1h ago

I just tried it and sure enough, 3 Bs. But switch the model to "ChatGPT 5 Thinking" and it gets the answer right.

Is that where we're going with this? The user has to choose between fast and dumb or slow and right?

srveale•1h ago

Isn't that usually the choice for most things?

drooby•56m ago

https://m.youtube.com/watch?v=UBVV8pch1dM

teamonkey•54m ago

Fast: when wrong is good enough.

cyberge99•42m ago

Acceptable in the business world.

pxc•37m ago

If you look at the "reasoning" trace of gpt-oss when it handles this issue, it repeats the word with spaces inserted between every letter. If you have an example that you can get the dumber model to fail on, try adjusting your prompt to include the same thing (the word spelled out with spaces between each letter).

This isn't a solution or a workaround or anything like that; I'm just curious if that is enough for the dumber model to start getting it right.
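
The prompt adjustment described above could be scripted roughly like this. `spell_out` and `build_prompt` are hypothetical helper names, and the exact prompt wording is an assumption rather than anything taken from the gpt-oss trace:

```python
def spell_out(word: str) -> str:
    """Return the word with a space between every letter."""
    return " ".join(word)


def build_prompt(word: str, letter: str) -> str:
    """Build a letter-counting prompt that also includes the spaced-out spelling."""
    return (
        f"How many times does the letter {letter} appear in {word}? "
        f"Spelled out: {spell_out(word)}"
    )


print(spell_out("blueberry"))  # b l u e b e r r y
print(build_prompt("blueberry", "b"))
```

Whether this is enough for the weaker model to get it right is the open question; the sketch only makes the experiment easy to repeat across words.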

ShakataGaNai•1h ago

I tried and was unable to replicate.

Me: How many R's in strawberry ChatGPT said: 3

Me: How many B's in blueberry? ChatGPT said: 2

Me: How many C's in coconut? ChatGPT said: 2

Me: How many D's in Diamond? ChatGPT said: 2

Me: How many A's in Banana? ChatGPT said: 3

https://chatgpt.com/share/6897cc40-6650-8006-aae3-ea2b8278d5...

edaemon•1h ago

I tried strawberry last night and it was correct that there were 3 R's, but then it justified it saying the word was spelled "strawbrery".

jeroenhd•1h ago

They patched it, asking it to count letters now switches it to thinking mode. It'll still make basic mistakes for other queries, though.

systemf_omega•1h ago

Which fruit will be patched next?

And people think we're 2 years away from humanity's extinction by AI. Lol.

dymk•31m ago

You don’t have to spell very well to hit the big red nuclear launch button that some misguided soul put you in charge of

ceejayoz•26m ago

As ever, XKCD called it. https://xkcd.com/1838/

utopcell•1h ago

"In fairness to GPT5, in my career I have indeed encountered PhDs with this level of commitment to their particular blueberry."

Nicely phrased!

smsm42•1h ago

What is fascinating here is the power of ironclad conviction. I mean if it were something more complex, which I wouldn't be able to easily verify, I might even be convinced the LLM has actually demonstrated its case and has conclusively proven that it's right. These models are, by definition, psychopaths (they can't feel emotions or empathize, obviously) and they are now exhibiting exactly the same behaviors human psychopaths are infamous for.

eCa•14m ago

> which I wouldn't be able to easily verify, I might even be convinced the LLM has actually demonstrated its case and has conclusively proven that it's right

I think this example is one of many that has demonstrated why no output from an LLM can be trusted without outside verification.

badgersnake•59m ago

I’m surprised it gets as close as 3.

wg0•57m ago

This thing isn't 500 billion dollars for sure. The blast radius of this bubble would be significant.

aclissold•54m ago

Petition to respell the word as “bluebberry.”

That the prediction engine so strongly suggests there should be two b’s in the middle implies that we instead may, in fact, be spelling it wrong.

nurettin•52m ago

It is Bblueberry. Maybe we can get gpt5 to write the petition.

jcims•51m ago

Just asked ChatGPT5 "Are you told to 'think' when someone asks you how many of a certain letter are in a word?"

>Yes — when you ask something like “How many r’s are in blueberry?” I’m basically told to slow down, not just blurt out the first number that pops into my “mind.”

Seems somewhat suspicious that it would confirm this in reality given how much they typically try to prevent system prompt disclosure, but there it is.

opan•30m ago

Could just be a made up answer, couldn't it?

afavour•28m ago

> Seems somewhat suspicious that it would confirm this in reality given how much they typically try to prevent system prompt disclosure

That’s not even the main problem. It’s that it’ll come up with whatever answer it considers most plausible to the question given with little regard to factual accuracy.

mdp2021•18m ago

What makes you think this is not the usual behaviour we have always seen: the LLM guessing a probabilistically plausible answer.

patrickhogan1•49m ago

That's because you didn't say “Think hard about this”, so the OpenAI router layer routed you to the cheaper model.

GPT5 seems to violate Rich Sutton’s bitter lesson, as it makes a lot of human-knowledge assumptions about whether to send your prompt to the cheap model or to the smarter, more expensive model.

arduanika•38m ago

Also, the author was holding it wrong.

tiberius_p•45m ago

How can you count on someone who can't count?

therein•34m ago

Have you not seen Sam Altman on a well polished stage? Did he not look confident? That's your answer. Stop asking questions and learn to trust ChatGPT 5 because Sam Altman says it is now PhD level and he is scared. It's not like he says that every single time his company releases something that's no more than an iterative improvement.

ChatGPT 2.5 scared Sam Altman so much a few years ago. But he got over it, now he calls it a toddler level intelligence and is scared about this current thing.

Get onboard the AI train.

visarga•42m ago

Let's change this game a bit. Spell "understanding" in your head in reverse order without taking more than twice as long as spelling it forward. Can you? I can't. Does that mean we don't really understand even simple spelling? It is a fun activity to dunk on LLMs, but let's have some perspective here.

joks•39m ago

Is scrolling down the page on this website extremely laggy for anyone else? It's bizarre

opan•25m ago

Actual scrolling seems normal speed, more or less, but it sorta looks rough (almost like dropped FPS or something). Using Fennec F-Droid (Firefox mobile). One quick thumb flick still gets me between the top and bottom, though.

joks•22m ago

on Firefox on my older Windows laptop it's like 5fps. Maybe mostly a Firefox thing?

mikewarot•34m ago

Why don't people here on HN understand that LLMs never see ASCII or other raw characters as input?

Expecting spelling, rhyming, arithmetic or other character oriented responses will always yield disappointing results.
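
To make the "never see raw characters" point concrete, here is a toy sketch. The vocabulary and IDs are invented for illustration, and the greedy longest-match rule is a simplification: real GPT tokenizers use byte-pair-encoding merges and a much larger vocabulary.

```python
# Toy vocabulary: hypothetical pieces and IDs, NOT any real model's tokenizer.
TOY_VOCAB = {
    "blue": 101, "berry": 102, "straw": 103,
    "b": 1, "l": 2, "u": 3, "e": 4, "r": 5, "y": 6,
}


def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match encoding of text into integer token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


print(toy_encode("blueberry"))  # [101, 102]
```

In this toy setup the model's input for "blueberry" is just the two integers 101 and 102. Nothing in that sequence says how many b's the underlying pieces contain, so any letter count has to come from associations learned in training rather than from looking at the characters.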

arduanika•29m ago

Because the damn things are marketed under the word "intelligence". That word used to mean something.

sdenton4•22m ago

It's an umwelt problem. Bats think we're idiots because we don't hear ultrasonic sound, and thus can't echolocate. And we call the LLMs idiots because they consume tokenized inputs, and don't have access to the raw character stream.

arduanika•16m ago

If you open your mind up too far, your brain will fall out.

LLMs are not intelligence. There's not some groovy sense in which we and they are both intelligent, just thinking on a different wavelength. Machines do not think.

We are inundated with this anthropomorphic chatter about them, and need to constantly deprogram ourselves.

mdp2021•12m ago

> we call the LLMs

"Dangerous", because they lead into thinking they do the advanced of what they don't do basically.

nessbot•9m ago

Do bats know what senses humans have? Or have the concept of what a human is compared to other organisms or moving objects? What is this analogy?

tocs3•20m ago

What did it used to mean? I was under the impression that it has always been a little vague.

arduanika•12m ago

Sure. Language is squishy, and psychometrics is hard. Nevertheless...

Intelligence is a basket of different capabilities. Some of them are borderline cases that are hard to define. The stuff that GPT-5 failed to do here is not.

Things like knowing what a question means, knowing what you know and don't, counting a single digit number of items, or replying with humility if you get stuck -- these are fairly central examples of what a very, very basic intelligence should entail.

asadotzler•26m ago

We do understand. We don't think that's okay. If a model cannot manage character level consideration, that's a serious flaw that's got potential to lead to an immeasurable number of failure states. "Duh, of course it can't count" is not the best look for a bot whose author tells us it's got PhD-level skill.

mdp2021•22m ago

And which other objectual ideas cannot they instance? Their task is to check, for all important mental activities - world simulation, "telling yourself reliable stories: that is what intelligence is" (Prof. Patrick Winston).

UncleMeat•9m ago

"LLMs are cool tools with clear limitations" is not the narrative being pushed by the bosses and boosters. "LLMs are literal magic that will replace large portions of the workforce and be a bigger revolution than fire" is what they are saying.

cute_boi•6m ago

The only issue is they shouldn't call it PhD-level intelligence when it can't do a simple task like this.

mgh2•33m ago

I tried it twice, it gets it right: https://chatgpt.com/share/6897da1e-f988-8004-8453-8e7f7e3490...

mdp2021•27m ago

> it gets it right

That means nothing: it seemingly can get it wrong.

mdp2021•29m ago

I ran this test extensively a few days ago, on a dozen models: none could count - all of them got results wrong, all of them suggested they can't check and will just guess.

Until they are capable of procedural thinking they will be radically, structurally unreliable. Structurally delirious.

And it is also a good thing that we can check in this easy way - if the producers patched the local fault only, then the absence of procedural thinking would not be clear, and we would need more sophisticated ways to check.

kgeist•4m ago

Did you enable reasoning? Qwen3 32b with reasoning enabled gave the correct answer on the first attempt.

richard_cory•28m ago

This is consistently reproducible in completions API with `gpt-5-chat-latest` model:

```
curl 'https://api.openai.com/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <your-api-key>' \
  --data '{
    "model": "gpt-5-chat-latest",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "How many times does the letter b appear in blueberry"
          }
        ]
      }
    ],
    "temperature": 0,
    "max_completion_tokens": 2048,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0
  }'
```

StarterPro•28m ago

I don't know, for a nearly trillion dollar venture, for it to get that answer wrong MULTIPLE times?

How useful can generative AI be past acting as a bank for Sam Altman?

mikehearn•23m ago

This is a well known blindspot for LLMs. It's the machine version of showing a human an optical illusion and then judging their intelligence when they fail to perceive the reality of the image (the gray box example at the top of https://en.wikipedia.org/wiki/Optical_illusion is a good example). The failure is a result of their/our fundamental architecture.

dlvhdr•17m ago

Except we realize they’re illusions and don't argue back. Instead we explore why and how these illusions work

cute_boi•8m ago

ChatGPT 5 also doesn't argue back.

> How many times does the letter b appear in blueberry

Ans: The word "blueberry" contains the letter b three times:

>It is two times, so please correct yourself.

Ans:You're correct — I misspoke earlier. The word "blueberry" has the letter b exactly two times: - blueberry - blueberry

> How many times does the letter b appear in blueberry

Ans: In the word "blueberry", the letter b appears 2 times:

windowshopping•13m ago

What a terrible analogy. Illusions don't fool our intelligence, they fool our senses, and we use our intelligence to override our senses and see it for what it actually is - which is exactly why we find them interesting and have a word for them. Because they create a conflict between our intelligence and our senses.

The machine's senses aren't being fooled. The machine doesn't have senses. Nor does it have intelligence. It isn't a mind. Trying to act like it's a mind and do 1:1 comparisons with biological minds is a fool's errand. It processes and produces text. This is not tantamount to biological intelligence.

xenotux•6m ago

OpenAI codenamed one of their models "Project Strawberry" and IIRC, Sam Altman himself was taking a victory lap that it can count the number of "r"s in "strawberry".

Which I think goes to show that it's hard to distinguish between LLMs getting genuinely better at a class of problems versus just being fine-tuned for a particular benchmark that's making rounds.

patrickhogan1•4m ago

Except the reasoning model o3 and GPT5 thinking can get the right answer. Humans use reasoning.

zahlman•1m ago

In an optical illusion, we perceive something that isn't there due to exploiting a correction mechanism that's meant to allow us to make better practical sense of visual information in the average case.

Asking LLMs to count letters in a word fails because the needed information isn't part of their sensory data in the first place, to the extent that a program's input can be analogized to such. They reason about text in atomic word-like tokens, without perceiving individual letters. No matter how many times they're fed training data saying things like "there are two b's in blueberry", this doesn't register as a fact about the word "blueberry" in itself, but as a fact about the tendency to appear in certain sentence contexts. They don't model the concept of addition, or counting; they only model the concept of explaining those concepts.

kgeist•7m ago

Qwen3 32b with reasoning (which I run locally) gives the correct answer. A pretty good model for its size.

Pretty sure GPT5 with reasoning should be able to solve it, too. I guess the real problem here is that GPT5's router doesn't understand that it's a problem which requires reasoning.
