LLMs aren't world models

https://yosefk.com/blog/llms-arent-world-models.html

371•ingve•6mo ago

Comments

t0md4n•6mo ago

yosefk•6mo ago

This is interesting. The "professional level" rating of <1800 isn't, but still.

However:

"A significant Elo rating jump occurs when the model’s Legal Move accuracy reaches 99.8%. This increase is due to the reduction in errors after the model learns to generate legal moves, reinforcing that continuous error correction and learning the correct moves significantly improve ELO"

You should be able to reach the move legality of around 100% with few resources spent on it. Failing to do so means that it has not learned a model of what chess is, at some basic level. There is virtually no challenge in making legal moves.

lostmsu•5mo ago

> r4rk1 pp6 8 4p2Q 3n4 4N3 qP5P 2KRB3 w — — 3 27

Can you say 100% you can generate a good next move (example from the paper) without using tools, and will never accidentally make a mistake and give an illegal move?
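To make concrete what reading that position involves, here is a minimal sketch that expands only the piece-placement part of the quoted string into an 8x8 grid. The helper name `expand_rank` is made up for illustration, and this does not check move legality; that would need a full move generator.

```python
def expand_rank(rank: str) -> str:
    """Expand one FEN rank: digits become runs of empty squares ('.')."""
    squares = []
    for ch in rank:
        if ch.isdigit():
            squares.append("." * int(ch))
        else:
            squares.append(ch)  # piece letter: uppercase = White, lowercase = Black
    return "".join(squares)


# The eight rank fields from the quoted position, rank 8 first.
RANKS = "r4rk1 pp6 8 4p2Q 3n4 4N3 qP5P 2KRB3".split()
board = [expand_rank(rank) for rank in RANKS]

# Every rank must describe exactly eight squares.
assert all(len(row) == 8 for row in board)

for row in board:
    print(row)
```

Even this first step, before any legality rule is applied, requires tracking every square exactly.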

rpdillon•5mo ago

> Failing to do so means that it has not learned a model of what chess is, at some basic level.

I'm not sure about this. Among a standard amateur set of chess players, how often when they lack any kind of guidance from a computer do they attempt to make a move that is illegal? I played chess for years throughout elementary, middle and high school, and I would easily say that even after hundreds of hours of playing, I might make two mistakes out of a thousand moves where the move was actually illegal, often because I had missed that moving that piece would continue to leave me in check due to a discovered check that I had missed.

It's hard to conclude from that experience that players that are amateurs lack even a basic model of chess.

libraryofbabel•6mo ago

This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure being smart and reasoning from first principles after asking the LLM a lot of questions and cherry picking what it gets wrong gets to any novel insights at this point. And it already feels a little out of date, with LLMs getting gold on the mathematical Olympiad they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.

LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed since they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.

armchairhacker•6mo ago

Any suggestions from this literature?

libraryofbabel•6mo ago

The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.

yosefk•6mo ago

Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.

My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.

Let's see how my predictions hold up; I have made enough to look very wrong if they don't.

Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And, I have much better examples than the 2+2=4, which I am citing as something that sorta works these days.

libraryofbabel•6mo ago

I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.

I think my biggest complaint is that the essay points out flaws in LLMs’ world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.

A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.

I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:

* LLMs have imperfect models of the world, which are conditioned by how they’re trained on next token prediction.

* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they’ve been “flogged” to be good at math.

* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)

* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.

I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.

Anyway thanks for writing this and responding!

yosefk•6mo ago

I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."

I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming where I am more qualified to do it. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math)

calf•5mo ago

But this is parallel to saying LLMs are not "compelled" by the training algorithms to learn symbolic logic.

Which says to me there are two camps on this and the verdict is still out on this and all related questions.

teleforce•5mo ago

>LLMs are not "compelled" by the training algorithms to learn symbolic logic.

I think being "compelled" is such a uniquely human trait that machines will never replicate it to a T.

The article specifically mentions this very issue:

"And of course people can be like that, too - eg much better at the big O notation and complexity analysis in interviews than on the job. But I guarantee you that if you put a gun to their head or offer them a million dollar bonus for getting it right, they will do well enough on the job, too. And with 200 billion thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform."

In case it's not already evident: an LLM in itself is by definition a limited stochastic AI tool, and its distant cousins are deterministic logic, optimization and constraint programming [1],[2],[3],[4]. Perhaps one of the two breakthroughs that the author was predicting will come from this deterministic domain, to assist LLMs, and it will be a hybrid approach rather than a purely LLM one.

[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:

https://www.youtube.com/live/TknN8fCQvRk

[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:

https://youtube.com/watch?v=HB5TrK7A4pI

[3] Google OR-Tools:

https://developers.google.com/optimization

[4] MiniZinc:

https://www.minizinc.org/

calf•5mo ago

And yet there are two camps on the matter. Experts like Hinton disagree, others agree.

2muchcoffeeman•5mo ago

It’s not just on the job learning though. I’m no AI expert, but the fact that you have “prompt engineers” and AI doesn’t know what it doesn’t know, gives me pause.

If you ask an expert, they know the bounds of their knowledge and can understand questions asked to them in multiple ways. If they don’t know the answer, they could point to someone who does or just say “we don’t know”.

LLMs just lie to you and we call it “hallucinating“ as though they will eventually get it right when the drugs wear off.

eru•5mo ago

> I’m no AI expert, but the fact that you have “prompt engineers” [...] gives me pause.

Why? A bunch of human workers can get a lot more done with a capable leader who helps prompt them in the right direction and corrects oversights etc.

And overall, prompt engineering seems like exactly the kind of skill AI will be able to develop by itself. You already have a bit like this happening: when you ask Gemini to create a picture for you, then the language part of Gemini will take your request and engineer a prompt for the picture part of Gemini.

intended•5mo ago

This is the goalpost flip which happens in AI conversations. If goalpost is even the right term, conversation switch?

There are 2 AI conversations on HN occurring simultaneously.

Convo A: Is it actually reasoning? does it have a world model? etc..

Convo B: Is it good enough right now? (for X, Y, or Z workflow)

eru•5mo ago

Maybe, yes. It's good to acknowledge that both of these conversations are worthwhile to have.

Mikhail_Edoshin•5mo ago

LLM comprehends, but does not understand. It is interesting to see these two qualities separated; so far they were synonyms.

eru•5mo ago

> A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.

It's closer to AlphaGo, which first trained on expert human games and then 'fine tuned' with self-play.

AlphaZero specifically did not use human training data at all.

I am waiting for an AlphaZero style general AI. ('General' not in the GAI sense but in the ChatGPT sense of something you can throw general problems at and it will give it a good go, but not necessarily at human level, yet.) I just don't want to call it an LLM, because it wouldn't necessarily be trained on language.

What I have in mind is something that first solves lots and lots of problems, eg logic problems, formally posed programming problems, computer games, predicting of next frames in a web cam video, economic time series, whatever, as a sort-of pre-training step and then later perhaps you feed it a relatively small amount of human readable text and speech so you can talk to it.

Just to be clear: this is not meant as a suggestion for how to successfully train an AI. I'm just curious whether it would work at all and how well / how badly.

Presumably there's a reason why all SOTA models go 'predict human produced text first, then learn problem solving afterwards'.

> I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. [...]

Yes, I agree. But 'on-the-job' training is also such an obvious idea that plenty of people are working on making it work.

WillPostForFood•5mo ago

Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT, and got a reasonable answer, nothing like what you paraphrased in your post.

https://imgur.com/a/O9CjiJY

marcellus23•5mo ago

I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.

p1esk•5mo ago

Yes, I’d be curious about his experience with GPT-5 Thinking model. So far I haven’t seen any blunders from it.

eru•5mo ago

I've seen plenty of blunders, but in general it's better than their previous models.

Well, it depends a bit on what you mean by blunders. But eg I've seen it confidently assert mathematically wrong statements with nonsense proofs, instead of admitting that it doesn't know.

grey-area•5mo ago

In a very real sense it doesn’t even know that it doesn’t know.

eru•5mo ago

Maybe. But in math you can either produce the proof (with each step checkable) or you can't.
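As a tiny illustration of what "each step checkable" looks like in practice, here is the 2 + 2 = 4 example from upthread written for a proof assistant. This is only a sketch in Lean 4; it says nothing about how any LLM represents arithmetic internally.

```lean
-- Lean 4: numerals default to Nat, and the kernel accepts `rfl`
-- because both sides evaluate to the same numeral.
example : 2 + 2 = 4 := rfl
```

If the statement were false (say `2 + 2 = 5`), the same proof would simply be rejected, so there is no room for a confident but wrong argument.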

ehnto•5mo ago

When talking about the capabilities of a class of tools long term, it makes sense to be general. I think deriving conclusions at all is pretty difficult given how fast everything is moving, but there are some realities we do actually know about how LLMs work and we can talk about that.

Knowing that ChatGPT output good tokens last tuesday but Sonnet didn't does not help us know much about the future of the tools in general.

dpoloncsak•5mo ago

> Knowing that ChatGPT output good tokens last tuesday but Sonnet didn't does not help us know much about the future of the tools in general.

Isn't that exactly what is going to help us understand the value these tools bring to end-users, and how to optimize these tools for better future use? None of these models are copy+pastes, they tend to be doing things slightly differently under the hood. How those differences affect results seems like the exact data we would want here

ehnto•5mo ago

I guess I disagree that the main concern is the differences per each model, rather than the overall technology of LLMs in general. Given how fast it's all changing, I would rather focus on the broader conversation personally. I don't really care if GPT5 is better at benchmarks, I care that LLMs are actually capable of the type of reasoning and productive output that the world currently thinks they are.

marcellus23•5mo ago

Sure, but if you're making a point about LLMs in general, you need to use examples from best-in-class models. Otherwise your examples of how these models fail are meaningless. It would be like complaining about how smartphone cameras are inherently terrible, but all your examples of bad photos aren't labeled with what phone was used to capture. How can anyone infer anything meaningful from that?

typpilol•5mo ago

This seems like a common theme with these types of articles

eru•5mo ago

Perhaps the people who get decent answers don't write articles about them?

ehnto•5mo ago

I imagine people give up silently more often than they write a well syndicated article about it. The actual adoption and efficiencies we see in enterprises will be the most verifiable data on if LLMs are generally useful in practice. Everything so far is just academic pontificating or anecdata from strangers online.

libraryofbabel•5mo ago

This. I think we’ve about reached the limit of the usefulness of anecdata “hey I asked an LLM this this and this” blog posts. We really need more systematic large scale data and studies on the latest models and tools - the recent one on cursor (which had mixed results) was a good start but it was carried out before Claude Code was even released, i.e. prehistoric times in terms of AI coding progress.

For my part I don’t really have a lot of doubts that coding agents can be a useful productivity boost on real-world tasks. Setting aside personal experience, I’ve talked to enough developers at my company using them for a range of tickets on a large codebase to know that they are. The question is more, how much: are we talking a 20% boost, or something larger, and also, what are the specific tasks they’re most useful on. I do hope in the next few years we can get some systematic answers to that as an industry, that go beyond people asking LLMs random things and trying to reason about AI capabilities from first principles.

eru•5mo ago

I am inclined to agree.

However, I'm not completely sure. Eg object oriented programming was basically a useless fad full of empty, never-delivered-on promises, but software companies still lapped it up. (If you happen to like OOP, you can probably substitute your own favourite software or wider management fad.)

Another objection: even an LLM with limited capabilities and glaring flaws can still be useful for some commercial use-cases. Eg the job of first line call centre agents that aren't allowed to deviate from a fixed script can be reasonably automated with even a fairly bad LLM.

Will it suck occasionally? Of course! But so does interacting with the humans placed into these positions without authority to get anything done for you. So if the bad LLM is cheaper, it might be worthwhile.

yosefk•5mo ago

The examples are from the latest versions of ChatGPT, Claude, Grok, and Google AI Overview. I did not bother to list the full conversations because (A) LLMs are very verbose and (B) nothing ever reproduces, so in any case any failure is "abnormally bad." I guess dismissing failures and focusing on successes is a natural continuation of our industry's trend to ship software with bugs which allegedly don't matter because they're rare, except with "AI" the MTBF is orders of magnitude shorter

AyyEye•6mo ago

With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.

yosefk•6mo ago

Actually I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think

astrange•5mo ago

Tokenization makes things harder, but it doesn't make them impossible. Just takes a bit more memorization.

Other writing systems come with "tokenization" built in making it still a live issue. Think of answering:

1. How many n's are in 日本?

2. How many ん's are in 日本?

(Answers are 2 and 1.)
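The two answers can be checked mechanically once the word is spelled out in each script. A small sketch, assuming the readings nihon/にほん and nippon/にっぽん for 日本 (the `READINGS` table is written by hand for illustration, not taken from any library):

```python
# Hand-written readings of 日本 as (romaji, hiragana) pairs.
READINGS = [("nihon", "にほん"), ("nippon", "にっぽん")]


def count_n(romaji: str, hiragana: str) -> tuple[int, int]:
    """Return (number of Latin n's, number of ん characters)."""
    return romaji.count("n"), hiragana.count("ん")


for romaji, hiragana in READINGS:
    print(romaji, hiragana, count_n(romaji, hiragana))
```

Either reading gives 2 and 1, but neither spelling is visible in the two kanji themselves, which is the point of the comparison.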

andyjohnson0•6mo ago

> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Is this a real defect, or some historical thing?

I just asked GPT-5:

    How many "B"s in "blueberry"?

and it replied:

    There are 2 — the letter b appears twice in "blueberry".

I also asked it how many Rs in Carrot, and how many Ps in Pineapple, and it answered both questions correctly too.
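For reference, the ground truth for those three questions is easy to compute. A minimal sketch (the function name and the `ignore_case` flag are just for illustration):

```python
def count_letter(word: str, letter: str, ignore_case: bool = True) -> int:
    """Count how often a single letter occurs in a word."""
    if ignore_case:
        word, letter = word.casefold(), letter.casefold()
    return word.count(letter)


for word, letter in [("blueberry", "B"), ("Carrot", "R"), ("Pineapple", "P")]:
    print(f"{letter} in {word}: {count_letter(word, letter)}")

# A strictly case-sensitive reading of the first question gives a different answer.
print(count_letter("blueberry", "B", ignore_case=False))
```

The case-insensitive counts are 2, 2 and 3; the case-sensitive count of "B" in "blueberry" is 0.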

libraryofbabel•6mo ago

It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.

pydry•6mo ago

Depends how you define historical. If by historical you mean more than two days ago then, yeah, it's ancient history.

jijijijij•5mo ago

The question is, did these LLMs figure it out by themselves or has someone programmed a specific subroutine to address this „issue“, to make it look smarter than it is?

On a trillion dollar budget, you could just crawl the web for AI tests people came up with and solve them manually. We know it‘s a massively curated game. With that kind of money you can do a lot of things. You could feed every human on earth countless blueberries for starters.

Calling an algorithm to count letters in a word isn’t exactly worth the hype tho is it?

The point is, we tend to find new ways these LLMs can’t figure out the most basic shit about the world. Horses can count. Counting is in everything. If you read every text ever written and still can’t grasp counting you simply are not that smart.

pxc•5mo ago

Some LLMs do better than others, but this still sometimes trips up even "frontier" non-reasoning models. People were showing this on this very forum with GPT-5 in the past couple days.

ThrowawayR2•6mo ago

It was discussed and reproduced on GPT-5 on HN a couple of days ago: https://news.ycombinator.com/item?id=44832908

Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.

bgwalter•6mo ago

It is not historical:

https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...

Perhaps they have a hot fix that special cases HN complaints?

AyyEye•6mo ago

They clearly RLHF out the embarrassing cases and make cheating on benchmarks into a sport.

Terr_•5mo ago

I wouldn't be surprised if some models get set up to identify that type of question and run the word through a string processing function.
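Purely as a hypothetical sketch of what such a setup could look like (nothing here is known to match how any vendor actually does it; the pattern and the `route` function are invented for illustration):

```python
import re

# Hypothetical pattern for questions like: How many "B"s in "blueberry"?
LETTER_QUESTION = re.compile(
    r"""how many\s+["']?([a-z])["']?s?\s+(?:are\s+)?in\s+["']?([a-z]+)["']?""",
    re.IGNORECASE,
)


def route(question: str):
    """Answer letter-counting questions with string code; return None otherwise."""
    match = LETTER_QUESTION.search(question)
    if match is None:
        return None  # not a letter-counting question: hand it to the model
    letter, word = match.group(1).lower(), match.group(2).lower()
    return word.count(letter)


print(route('How many "B"s in "blueberry"?'))
print(route("What is the capital of France?"))
```

A router like this only covers phrasings its pattern anticipates, so a slightly reworded question would fall through to the model again.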

jijijijij•5mo ago

Of course they do stuff like that, otherwise it would look like they are stagnating. Fake it till you make it. Tho, at this point, the world is in deep shit, if they don’t make it…

pmg101•5mo ago

What deep shit do you foresee?

My prediction is that this will be like the 2000 dot com bubble. Both dot com and AI are real and really useful technologies, but hype and share prices have got way ahead of them, so they will need to readjust.

jijijijij•5mo ago

A major economic crisis, yes. I think the web is already kinda broken because of AI, gonna get a lot worse. I also question its usefulness… Is it useful solving any real problems, and if so how long before we run out of these problems? Because we conflated a lot of bullshit with innovation right before AI. Right now people may be getting a slight edge, but it’s like getting a dishwasher, once expectations adjusted things will feel like a grind again, and I really don’t think people will like that new reality in regard to experience of self-efficacy (which is important for mental health). I presume the struggle to get information, figuring it out yourself, may be a really important part of putting pressure towards process optimization and for learning, cognitive development. We may collectively regress there. With so many major crises, a potential economic crisis on top, I am not sure we can afford losing problem solving capabilities to any extent. And I really, really don’t think AI is worth the fantastical energy expenditure, waste of resources and human exploitation, so far.

nosioptar•6mo ago

Shouldn't the correct answer be that there is not a "B" in "blueberry"?

eru•5mo ago

No, why?

It depends on context. English is often not very precise and relies on implied context clues. And that's good. It makes communication more efficient in general.

To spell it out: in this case I suspect you are talking about English letter case? Most people don't care about case when they ask these questions, especially in an informal question.

BobbyJo•6mo ago

The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.

It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.

In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
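A toy sketch of that idea, using a made-up vocabulary and a greedy longest-match tokenizer (real tokenizers learn their vocabularies from data and are far larger; `VOCAB` and `tokenize` are assumptions for illustration only):

```python
# Hypothetical toy vocabulary: a few multi-letter pieces, single letters, and a space.
VOCAB = {"blue": 0, "berry": 1, "straw": 2, " ": 3}
for ch in "abelrstuwy":
    VOCAB.setdefault(ch, len(VOCAB))


def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization against VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids


print(tokenize("blueberry"))          # two IDs, no letters visible
print(tokenize("b l u e b e r r y"))  # one ID per letter, plus spaces
```

In the first case the input is just two integers, so any letter count has to come from something memorized about those two pieces. In the spelled-out case each letter gets its own ID, and counting becomes a matter of counting repeated IDs.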

I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/

LLMs are able to encode geospatial relationships because they can be represented by token relationships well. Two countries that are close together will be talked about together much more often than two countries far from each other.

vrighter•6mo ago

That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.

Your argument is based on a flawed assumption, that they can't see letters. If they didn't they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.

xigoi•5mo ago

> It's like asking a blind person to count the number of colors on a car.

I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.

libraryofbabel•6mo ago

> they clearly don't have any world model whatsoever

Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.

simiones•5mo ago

> where it certainly hadn’t seen the questions before?

What are you basing this certainty on?

And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.

It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.

eru•5mo ago

> What are you basing this certainty on?

People make up new questions for each IMO.

fxtentacle•5mo ago

Didn’t OpenAI get caught bribing their way to pre-tournament access of the questions?

eru•5mo ago

This is the first time I hear about this. (It's certainly possible, but I'd need to see some evidence or at least a write-up.)

OpenAI got flamed over announcing their results before the embargo was up:

IMO had asked companies to wait at least a week or so after the human winners were announced to announce the AI results. OpenAI did not wait.

libraryofbabel•5mo ago

Like the other reply said, each exam has entirely new questions which are of course secret until the test is taken.

Sure, the questions were probably in a similar genre as existing questions or required similar techniques that could be found in solutions that are out there. So what? You still need some kind of world model of mathematics in which to understand the new problem and apply the different techniques to solve it.

Are you really claiming that SOTA LLMs don’t have any world model of mathematics at all? If so, can you tell us what sort of example would convince you otherwise? (Note that the ability to do novel mathematics research is setting the bar too high, because many capable mathematics majors never get to that point, and they clearly have a reasonable model of mathematics in their heads.)

williamcotton•5mo ago

I don’t solve math problems with my poetry writing skills:

https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27...

eru•5mo ago

> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Train your model on characters instead of on tokens, and this problem goes away. But I don't think this teaches us anything about world models more generally.

derdi•5mo ago

Ask a kid that doesn't know how to read and write how many Bs there are in blueberry.

Ygg2•5mo ago

For a kid that doesn't know how to read or write, ChatGPT writes way too much.

Nevermark•5mo ago

That was always a specious test.

LLMs don't ingest text a character at a time. The difficulty with analyzing individual letters just reflected that they don't directly "see" letters in their tokenized input.

A direct comparison would be asking someone how many convex Bézier curves are in the spoken word "monopoly".

Or how many red pixels are in a visible icon.

We could work out answers to both. But they won't come to us one-shot or accurately, without specific practice.

lossolo•6mo ago

https://arxiv.org/abs/2508.01191

joe_the_user•5mo ago

I think both the literature on interpretability and explorations on internal representations actually reinforce the author's conclusion. I think internal representation research tends to show that nets that deal with a single "model" don't necessarily have the same representation and don't necessarily have a single representation.

And doing well on XYZ isn't evidence of a world model in particular. The point that these things aren't always using a world model is reinforced by systems being easily confused by extraneous information, even systems as sophisticated as those that can solve Math Olympiad questions. The literature has said "ad-hoc predictors" for a long time and I don't think much has changed - except things do better on benchmarks.

And, humans too can act without a consistent world model.

rishi_devan•6mo ago

Haha. I enjoyed that Soviet-era joke at the end.

svantana•6mo ago

Yes, I hadn't heard that before. It's similar in spirit to this Norwegian folk tale about a deaf man guessing what someone is saying to him:

https://en.wikipedia.org/wiki/%22Good_day,_fellow!%22_%22Axe...

kgwgk•6mo ago

Another similar story:

King Frederick, the great of Prussia had a very fine army, and none of the soldiers in it were finer than Giant Guards, who were all extremely tall men. It was difficult to find enough soldiers for these Guards, as there were not many men who were tall enough.

Frederick had made it a rule that no soldiers who did not speak German could be admitted to the Giant Guards, and this made the work of the officers who had to find men for them even more difficult. When they had to choose between accepting or refusing a really tall man who knew no German, the officers used to accept him, and then teach him enough. German to be able to answer if the King questioned him.

Frederick, sometimes, used to visit the men who were on guard around his castle at night to see that they were doing their job properly, and it was his habit to ask each new one that he saw three questions: “How old are you?” “How long have you been in my army?” and “Are you satisfied with your food and your conditions?”

The officers of the Giant Guards therefore used to teach new soldiers who did not know German the answers to these three questions.

One day, however, the King asked a new soldier the questions in a different order; he began with, “How long have you been in my army?” The young soldier immediately answered, “Twenty-two years, Your Majesty”. Frederick was very surprised. “How old are you then?”, he asked the soldier. “Six months, Your Majesty”, came the answer. At this Frederick became angry, “Am I a fool, or are you one?” he asked. “Both, Your Majesty”, the soldier answered politely.

https://archive.org/details/advancedstoriesf0000hill

deadbabe•6mo ago

Don’t: use LLMs to play chess against you

Do: use LLMs to talk shit to you while a real chess AI plays chess against you.

The above applies to a lot of things besides chess, and illustrates a proper application of LLMs.

Seb-C•5mo ago

Are you suggesting that we use an LLM as an interface between the AI and the player?

Why would anyone choose to awkwardly play using natural language rather than a reliable, fast and intuitive UI?

crabmusket•5mo ago

No, I think they're suggesting the LLM should literally be "talking shit", e.g. in a chat window alongside the game UI, as if you're in a live chat with another player. As in, use the LLM for processing language, and the chess engine for playing chess.

I think this is quite an amusing idea, as the LLM would see the moves the chess engine made and comment along the lines of "wow, I didn't see that one coming!" very Roger Sperry.

imenani•6mo ago

As far as I can tell they don’t say which LLM they used which is kind of a shame as there is a huge range of capabilities even in newly released LLMs (e.g. reasoning vs not).

yosefk•6mo ago

ChatGPT, Claude, Grok and Google AI Overviews, whatever powers the latter, were all used in one or more of these examples, in various configurations. I think they can perform differently, and I often try more than one when the 1st try doesn't work great. I don't think there's any fundamental difference in the principle of their operation, and I think there never will be - there will be another major breakthrough

red75prime•6mo ago

My hypothesis is that a model fails to switch into a deep thinking mode (if it has it) and blurts whatever it got from all the internet data during autoregressive training. I tested it with alpha-blending example. Gemini 2.5 flash - fails, Gemini 2.5 pro - succeeds.

How does the presence/absence of a world model, er, blend into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have it because we have mechanisms to notice errors, correct errors, remember the results, and use the results when similar situations arise, while slowly updating intuitions about the world to incorporate changes.

The current models lack "remember/use/update" parts.

imenani•6mo ago

Each of these models has a thinking/reasoning variant and a default non-thinking variant. I would expect the reasoning variants (o3 or “GPT5 Thinking”, Gemini DeepThink, Claude with Extended Thinking, etc) to do better at this. I think there is also some chance that in their reasoning traces they may display something you might see as closer to world modelling. In particular, you might find them explicitly tracking positions of pieces and checking validity.

red75prime•6mo ago

> I don't think there's any fundamental difference in the principle of their operation

Yeah, they seem to be subject to the universal approximation theorem (it needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully-connected multilayered network).

That is, at a certain size they can do anything a human can do at a certain point in their life (that is, with no additional training) regardless of whether humans have world models and what those models are on the neuronal level.

But there are additional nuances that are related to their architectures and training regimes. And practical questions of the required size.

lowsong•6mo ago

It doesn't matter. These limitations are fundamental to LLMs, so all of them that will ever be made suffer from these problems.

famouswaffles•6mo ago

Yes LLMs can play chess and yes they can model it fine

https://arxiv.org/pdf/2403.15498v2

GaggiX•6mo ago

https://www.youtube.com/watch?v=LtG0ACIbmHw

SOTA LLMs do play legal moves in chess, I don't know why the article seems to say otherwise.

tickettotranai•6mo ago

Technically yes, but... it's moderately tricky to get an LLM to play good chess even though it can.

https://dynomight.net/more-chess/

This is significant in general because I personally would love to get these things to code-switch into "hackernews poster" or "writer for the Economist" or "academic philosopher", but I think the "chat" format makes it impossible. The inaccessibility of this makes me want to host my own LLM...

lordnacho•6mo ago

Here's what LLMs remind me of.

When I went to uni, we had tutorials several times a week. Two students, one professor, going over whatever was being studied that week. The professor would ask insightful questions, and the students would try to answer.

Sometimes, I would answer a question correctly without actually understanding what I was saying. I would be spewing out something that I had read somewhere in the huge pile of books, and it would be a sentence, with certain special words in it, that the professor would accept as an answer.

But I would sometimes have this weird feeling of "hmm I actually don't get it" regardless. This is kinda what the tutorial is for, though. With a bit more prodding, the prof will ask something that you genuinely cannot produce a suitable word salad for, and you would be found out.

In math-type tutorials it would be things like realizing some equation was useful for finding an answer without having a clue about what the equation actually represented.

In economics tutorials it would be spewing out words about inflation or growth or some particular author but then having nothing to back up the intuition.

This is what I suspect LLMs do. They can often be very useful to someone who actually has the models in their minds, but not the data to hand. You may have forgotten the supporting evidence for some position, or you might have missed some piece of the argument due to imperfect memory. In these cases, an LLM is fantastic as it just glues together plausible related words for you to examine.

The wheels come off when you're not an expert. Everything it says will sound plausible. When you challenge it, it just apologizes and pretends to correct itself.

roywiggins•5mo ago

> When you challenge it, it just apologizes and pretends to correct itself.

Even when it was right the first time!

nhaehnle•5mo ago

Good on you for having the meta-cognition to recognize it.

I've graded many exams in my university days (and set some myself), and it's exceedingly obvious that that's what many students are doing. I do wonder though how often they manage to fly under the radar. I'm sure it happens, as you described.

(This is also the reason why I strongly believe that in exams where students write free-form answers, points should be subtracted for incorrect statements even if a correct solution is somewhere in the word salad.)

ej88•6mo ago

This article is interesting but pretty shallow.

0(?): there’s no provided definition of what a ‘world model’ is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model right?

1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to ask a SOTA model with an image of a chess board or notation and ask it about the position. It might not give you GM level analysis but it definitely has a model of what’s going on.

2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.

You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!

https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358

yosefk•6mo ago

A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.
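For reference, the alpha blending operation itself is tiny. Below is a minimal sketch, assuming straight (non-premultiplied) alpha, 8-bit RGB tuples, and no gamma correction; the name `alpha_blend` is only illustrative and not from the post.

```python
def alpha_blend(fg, bg, alpha):
    """Blend one RGB pixel over another.

    alpha is the foreground opacity: 1.0 hides the background completely,
    0.0 leaves only the background visible.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Per channel: out = alpha * foreground + (1 - alpha) * background
    return tuple(round(alpha * f + (1.0 - alpha) * b) for f, b in zip(fg, bg))


red = (255, 0, 0)
white = (255, 255, 255)
print(alpha_blend(red, white, 1.0))   # opaque foreground: background not visible
print(alpha_blend(red, white, 0.0))   # fully transparent foreground: only background
print(alpha_blend(red, white, 0.25))  # mostly background showing through
```

The point about visibility falls straight out of the formula: the background's contribution is weighted by `1 - alpha`, so it only disappears when the foreground is fully opaque.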

The examples are from all the major commercial American LLMs as listed in a sister comment.

You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.

tmnvdb•5mo ago

If you train an LLM on chess, it will learn that too. You don't need to explain the rules, just feed it chess games, and it will stop making illegal moves at some point. It is a clear example of an inferred world model from training.

https://arxiv.org/abs/2501.17186

PS "Major commercial American LLM" is not very meaningful, you could be using GPT4o with that description.

codebastard•5mo ago

In my opinion the author refers to an LLM's inability to create an inner world, a world model.

That means it does not build a mirror of a system based on its interactions.

It just outputs fragments of the world models it was built on and tries to give you a string of fragments that should match the fragment of your world model that you provided through some input method.

It cannot abstract the code base fragments you share, and it cannot extend them with details using the model of the whole project.

jonplackett•6mo ago

I just tried a few things that are simple and a world model would probably get right. Eg

Question to GPT5: I am looking straight on to some objects. Looking parallel to the ground.

In front of me I have a milk bottle, to the right of that is a Coca-Cola bottle. To the right of that is a glass of water. And to the right of that there’s a cherry. Behind the cherry there’s a cactus and to the left of that there’s a peanut. Everything is spaced evenly. Can I see the peanut?

Answer (after choosing thinking mode)

No. The cactus is directly behind the cherry (front row order: milk, Coke, water, cherry). “To the left of that” puts the peanut behind the glass of water. Since you’re looking straight on, the glass sits in front and occludes the peanut.

It doesn’t consider transparency until you mention it, then apologises and says it didn’t think of transparency

RugnirViking•6mo ago

this seems like a strange riddle. In my mind I was thinking that regardless of the glass, all of the objects can be seen (due to perspective, and also the fact you mentioned the locations, meaning you're aware of them).

It seems to me it would only actually work in an orthographic perspective, which is not how our reality works

jonplackett•6mo ago

You can tell from the response it does understand the riddle just fine, it just gets it wrong.

rpdillon•5mo ago

Have you asked five adults this riddle? I suspect at least two of them would get it wrong or have some uncertainty about whether or not the peanut was visible.

xg15•5mo ago

This. Was also thinking "yes" first because of the glass of water, transparency, etc, but then got unsure: The objects might be spaced so widely that the milk or coke bottle would obscure the view due to perspective - or the peanut would simply end up outside the viewer's field of vision.

Shows that even if you have a world model, it might not be the right one.

rpdillon•5mo ago

I'm not sure it does. I did ask 5 adults this question, with zero context about what we're discussing with AI, just posing it as a riddle. They were split, with lots of uncertainty about the optics of the glass straight on, and where the viewer is vertically with respect to the glass's height. The best response, from my wife, brought up the trig aspect to the problem, pointing out that nothing in the question talks about distance to or between the objects. Her assumption was that the peanut could easily be offset behind the glass from her perspective, resulting in the peanut not being visible.

We tried the experiment, and she was right that there are definitely setups of distance and spacing that cause the peanut to not be visible. Try it!
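The trig can be sketched in a few lines. This is a rough top-down model under loud assumptions that are not in the riddle: the eye is a point directly in front of the milk bottle, every front-row object has the same half-width, heights are ignored, and the function names and numbers are made up for illustration.

```python
def crossing_x(viewer_dist, row_gap, target_x):
    """x position where the sight line to a back-row object crosses the front row.

    The viewer sits at x=0, viewer_dist in front of the front row; the target
    sits at target_x, row_gap behind the front row (similar triangles).
    """
    return target_x * viewer_dist / (viewer_dist + row_gap)


def blockers(viewer_dist, row_gap, spacing, half_width):
    """Names of front-row objects standing on the sight line to the peanut."""
    front = {"milk": 0.0, "coke": spacing, "water": 2 * spacing, "cherry": 3 * spacing}
    peanut_x = 2 * spacing  # left of the cactus, which is behind the cherry
    x = crossing_x(viewer_dist, row_gap, peanut_x)
    return [name for name, fx in front.items() if abs(x - fx) <= half_width]


print(blockers(1.0, 1.0, 1.0, 0.2))    # ['coke']: up close, the Coke bottle is in the way
print(blockers(100.0, 1.0, 1.0, 0.2))  # ['water']: far away, only the (transparent) glass
print(blockers(3.0, 1.0, 1.0, 0.2))    # []: the sight line passes between the objects
```

So under these assumptions the answer really does depend on distances the riddle never states.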

optimalsolver•5mo ago

Gemini 2.5 Pro gets this correct on the first attempt, and specifically points out the transparency of the glass of water.

https://g.co/gemini/share/362506056ddb

Time to get the ol' goalpost-moving gloves out.

wilg•5mo ago

Worked for me: https://chatgpt.com/share/689bc3ef-fa1c-800f-9275-93c2dbc11b...

Razengan•6mo ago

A slight tangent: I think/wonder if the one place where AIs could be really useful, might be in translating alien languages :)

As in, an alien could teach one of our AIs their language faster than an alien could teach a human, and vice versa..

..though the potential for catastrophic disasters is also great there lol

keeda•6mo ago

That whole bit about color blending and transparency and LLMs "not knowing colors" is hard to believe. I am literally using LLMs every day to write image-processing and computer vision code using OpenCV. It seamlessly reasons across a range of concepts like color spaces, resolution, compression artifacts, filtering, segmentation and human perception. I mean, removing the alpha from a PNG image was a preprocessing step it wrote by itself as part of a larger task I had given it, so it certainly understands transparency.

I even often describe the results e.g. "this fails in X manner when the image has grainy regions" and it figures out what is going on, and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)

And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.

geraneum•5mo ago

> it wrote by itself as part of a larger task I had given it, so it certainly understands transparency

Or it’s a common step or a known pattern or combination of steps that is prevalent in its training data for certain input. I’m guessing you don’t know what’s exactly in the training sets. I don’t know either. They don’t tell ;)

> but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers. It doesn’t mean that what you are doing specifically has been done in this exact way before; rather, the patterns being adapted and the approach may not be one of a kind.

> is a purely philosophical question

It is indeed. A question we need to ask ourselves.

Uehreka•5mo ago

> We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers.

If LLMs are stochastic parrots, but also we’re just stochastic parrots, then what does it matter? That would mean that LLMs are in fact useful for many things (which is what I care about far more than any abstract discussion of free will).

samrus•5mo ago

We're not just stochastic parrots though, we can parrot things stochastically when that has utility, but we can also be original. The first time that work was done, it was done by a person, autonomously. Current LLMs couldn't have done it the first time.

Nevermark•5mo ago

They are much more than stochastic parrots.

I have never understood the stochastic parrot interpretation. LLMs (and general deep learning models) are not statistical/stochastic based models. Statistics trivially apply, as they apply to all measurements of judge-able behavior. But the models do not perform statistical operations, nor do their architectures form tunable statistically driven systems.

They learn topological representations of relationships. Entirely different from statistics/stochastics.

Within their "style" of cognition, LLMs are very creative. They readily propose solutions to problems involving uncommon or unique combinations of disparate topics.

Coming up with artificial examples is easy (and they come up naturally for me all the time).

I think the best characterization of LLM knowledge, reasoning and creativity is: extremely wide (in ability to weave topics and communication constraints - one shot), but somewhat shallow (not being able to reason too deep.)

Within those bounds, they far far exceed human capabilities.

geraneum•5mo ago

> LLMs (and general deep learning models) are not statistical/stochastic based models. Statistics trivially apply, as they apply to all measurements of judge-able behavior. But the models do not perform statistical operations, nor do their architectures form tunable statistically driven systems.

And just like an LLM, confidently wrong.

skeledrew•6mo ago

Agree in general with most of the points, except

> but because I know you and I get by with less.

Actually we got far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and are only really considered fully intelligent in our late teens to mid-20s.

helloplanets•5mo ago

Don't forget the millions of years of pre-training! ;)

o_nate•6mo ago

What with this and your previous post about why sometimes incompetent management leads to better outcomes, you are quickly becoming one of my favorite tech bloggers. Perhaps I enjoyed the piece so much because your conclusions basically track mine. (I'm a software developer who has dabbled with LLMs, and has some hand-wavey background on how they work, but otherwise can claim no special knowledge.) Also your writing style really pops. No one would accuse your post of having been generated by an LLM.

yosefk•6mo ago

thank you for your kind words!

neuroelectron•5mo ago

Not yet

ameliaquining•5mo ago

One thing I appreciated about this post, unlike a lot of AI-skeptic posts, is that it actually makes a concrete falsifiable prediction; specifically, "LLMs will never manage to deal with large code bases 'autonomously'". So in the future we can look back and see whether it was right.

For my part, I'd give 80% confidence that LLMs will be able to do this within two years, without fundamental architectural changes.

moduspol•5mo ago

"Deal with" and "autonomously" are doing a lot of heavy lifting there. Cursor already does a pretty good job indexing all the files in a code base in a way that lets it ask questions and get answers pretty quickly. It's just a matter of where you set the goalposts.

ameliaquining•5mo ago

True, there'd be a need to operationalize these things a bit more than is done in the post to have a good advance prediction.

jononor•5mo ago

"LLM" as well, because coding agents are already more than just an LLM. There is very useful context management around it, and tool calling, and ability to run tests/programs, etc. Though they are LLM-based systems, they are not LLMs.

smnrchrds•5mo ago

Indeed. If the LLM calls a chess engine tool behind the scenes, it would be able to play excellent chess as well.

cavisne•5mo ago

The author would still be wrong in the tool-calling scenario. There are already perfect (or at least superhuman) chess engines. There is no perfect "coding engine". LLMs + tools being able to reliably work on large codebases would be a new thing.

yosefk•5mo ago

Correct - as long as the tools the LLM uses are non-ML-based algorithms existing today, and it operates on a large code base with no programmers in the loop, I would be wrong. If the LLM uses a chess engine, then it does nothing on top of the engine; similarly if an LLM will use another system adding no value on top, I would not be wrong. If the LLM uses something based on a novel ML approach, I would not be wrong - it would be my "ML breakthrough" scenario. If the LLM uses classical algorithms or an ML algo known today and adds value on top of them and operates autonomously on a large code base - no programmer needed on the team - then I am wrong

interstice•5mo ago

This rapidly gets philosophical. If I use tools am I not handling the codebase? Are we classing LLM as tool or user in this scenario?

yosefk•5mo ago

Cursor fails miserably for me even just trying to replace function calls with method calls consistently, like I said in the post. This I would hope is fixable. By dealing autonomously I mean "you don't need a programmer - a PM talks to an LLM and that's how the code base is maintained, and this happens a lot (rather than on one or two famous cases where it's pretty well known how they are special and different from most work)"

By "large" I mean 300K lines (strong prediction), or 10 times the context window (weaker prediction)

I don't shy away from looking stupid in the future, you've got to give me this much

adastra22•5mo ago

I'm pretty sure you can do that right now in Claude Code with the right subagent definitions.

(For what it's worth, I respect and greatly appreciate your willingness to put out a prediction based on real evidence and your own reasoning. But I think you must be lacking experience with the latest tools & best practices.)

yosefk•5mo ago

If you're right, there will soon be a flood of software teams with no programmers on them - either across all domains, or in some domains where this works well. We shall see.

Indeed I have no experience with Claude Code, but I use Claude via chat, and it fails all the time on things not remotely as hard as orientation in a large code base. Claude Code is the same thing with the ability to run tools. Of course tools help to ground its iterations in reality, but I don't think it's a panacea absent a consistent ability to model the reality you observe thru your use of tools. Let's see...

adastra22•5mo ago

> Indeed I have no experience with Claude Code, but I use Claude via chat...

These are not even remotely similar, despite the name. Things are moving very fast, and the sort of chat-based interface that you describe in your article is already obsolete.

Claude is the LLM model. Claude Code is a combination of internal tools for the agent to track its goals, current state, priorities, etc., and a looped mechanism for keeping it on track, focused, and debugging its own actions. With the proper subagents it can keep its context from being poisoned from false starts, and its built-in todo system keeps it on task.

Really, try it out and see for yourself. It doesn't work magic out of the box, and absolutely needs some hand-holding to get it to work well, but that's only because it is so new. The next generation of tooling will have these subagent definitions auto selected and included in context so you can hit the ground running.

We are already starting to see a flood of software coming out with very few active coders on the team, as you can see on the HN front page. I say "very few active coders" not "no programmers" because using Claude Code effectively still requires domain expertise as we work out the bugs in agent orchestration. But once that is done, there aren't any obvious remaining stumbling blocks to a PM running a no-coder, all-AI product team.

TheOtherHobbes•5mo ago

Claude Code isn't an LLM. It's a hybrid architecture where an LLM provides the interface and some of the reasoning, embedded inside a broader set of more or less deterministic tools.

It's obvious LLMs can't do the job without these external tools, so the claim above - that LLMs can't do this job - is on firm ground.

But it's also obvious these hybrid systems will become more and more complex and capable over time, and there's a possibility they will be able to replace humans at every level of the stack, from junior to CEO.

If that happens, it's inevitable these domain-specific systems will be networked into a kind of interhybrid AGI, where you can ask for specific outputs, and if the domain has been automated you'll be guided to what you want.

It's still a hybrid architecture though. LLMs on their own aren't going to make this work.

It's also short of AGI, never mind ASI, because AGI requires a system that would create high quality domain-specific systems from scratch given a domain to automate.

adastra22•5mo ago

If you want to be pedantic about word definitions, it absolutely is AGI: artificial general intelligence.

Whether you draw the system boundary of an LLM to include the tools it calls or not is a rather arbitrary distinction, and not very interesting.

nomel•5mo ago

Nearly every definition I’ve seen that involves AGI (there are many) includes the ability to self learn and create “novel ideas”. The LLM behind it isn’t capable of this, and I don’t think the addition of the current set of tools enables this either.

adastra22•5mo ago

Artificial general intelligence was a phrase invented to draw distinction from “narrow intelligence” which are algorithms that can only be applied to specific problem domains. E.g. Deep Blue was amazing at playing chess, but couldn’t play Go much less prioritize a grocery list. Any artificial program that could be applied to arbitrary tasks not pre-trained on is AGI. ChatGPT and especially more recent agentic models are absolutely and unquestionably AGI in the original definition of the term.

Goalposts are moving though. Through the efforts of various people in the rationalist-connected space, the word has since morphed to be implicitly synonymous with the notion of superintelligence and self-improvement, hence the vague and conflicting definitions people now ascribe to it.

Also, fwiw the training process behind the generation of an LLM is absolutely able to discover new and novel ideas, in the same sense that Kepler’s laws of planetary motion were new and novel if all you had were Tycho Brahe’s astronomical observations. Inference can tease out these novel discoveries, if nothing else. But I suspect also that your definition of creative and novel would also exclude human creativity if it were rigorously applied; our brains after all are merely remixing our own experiences too.

Vegenoid•5mo ago

> If you want to be pedantic about word definitions, it absolutely is AGI: artificial general intelligence.

This isn't being pedantic, it's deliberately misinterpreting a commonly used term by taking every word literally for effect. Terms, like words, can take on a meaning that is distinct from looking at each constituent part and coming up with your interpretation of a literal definition based on those parts.

adastra22•5mo ago

I didn't invent this interpretation. It's how the word was originally defined, and used for many, many decades, by the founders of the field. See for example:

https://www-formal.stanford.edu/jmc/generality.pdf

Or look at the old / early AGI conference series:

https://agi-conference.org

Or read any old, pre-2009 (ImageNet) AI textbook. It will talk about "narrow intelligence" vs "general intelligence," a dichotomy that exists more in GOFAI than the deep learning approaches.

Maybe I'm a curmudgeon and this is entering get-off-my-lawn territory, but I find it immensely annoying when existing clear terminology (AGI vs ASI, strong vs weak, narrow vs. general) is superseded by a confused mix of popular meanings that lack any clear definition.

Vegenoid•5mo ago

The McCarthy paper doesn't use the term "artificial general intelligence" anywhere. It does use the word "general" a lot in relation to artificial intelligence.

I looked at the AGI conference page for 2009: https://agi-conference.org/2009/

When it uses the term "artificial general intelligence", it hyperlinks to this page: http://www.agiri.org/wiki/index.php?title=Artificial_General...

Which seems unavailable, so here is an archive from 2007: https://web.archive.org/web/20070106033535/http://www.agiri....

And that page says "In Nov. 1997, the term Artificial General Intelligence was first coined by Mark Avrum Gubrud in the abstract for his paper Nanotechnology and International Security". And here is that paper: https://web.archive.org/web/20070205153112/http://www.foresi...

That paper says: "By advanced artificial general intelligence, I mean AI systems that rival or surpass the human brain in complexity and speed, that can acquire, manipulate and reason with general knowledge, and that are usable in essentially any phase of industrial or military operations where a human intelligence would otherwise be needed."

I think that your insisting that AGI means something different than what everyone else means when they say it is not useful, and will only lead to people getting confused and disagreeing with you. I agree that it's not a great term.

scoopdewoop•5mo ago

I'm a week late, but I do appreciate you pointing out this real phenomenon of moving the goalpost. Language is really general, multimodal models even more-so. The idea that AGI should be way more anthropomorphic and omnipotent is really recent. New definitions almost disregard the possibility of stupid general intelligence, despite proof-by-existence living all around us.

boxed•5mo ago

I was very skeptical of Claude Code but was finally convinced to try it and it does feel very different to use. I made three hobby projects in a weekend that I had put off for years due to "it's too much hassle to get started" inertia. Two of the projects it did very well with, the third I had to fight with it and it still is subtly wrong (SwiftUI animations and Claude Code seemingly is not a good mix!)

That being said, I think your analysis is 100% correct. LLMs are fundamentally stupid beyond belief :P

Terretta•5mo ago

> SwiftUI animations and Claude Code seemingly is not a good mix

Where is the corpus of SwiftUI animations to train Claude what probable soup you probably want regurgitated?

Hypothesis: iOS devs don't share their work openly for reasons associated with how the App Store ecosystem (mis)behaves.

Relatedly, the models don't know about Swift 6 except from maybe mid-2024 WWDC announcements. It's worth feeding them your own context. If you are 5.10, great. If you want to ship iOS 26 changes, wait till 2026 or again, roll your own context.

boxed•5mo ago

In my case the big issue seems to be that if you hide a component in SwiftUI, it's by default animated with a fade. This is not shown in the API surface area at all.

Vegenoid•5mo ago

I am more skeptical of the rate of AI progress than many here, but Claude Code is a huge step. There were a few "holy shit" moments when I started using it. Since then, after much more experimentation, I see its limits and faults, and use it less now. But I think it's worth giving it a try if you want to be informed about the current state of LLM-assisted programming.

bootsmann•5mo ago

I feel like refutations like this (you aren't using the tool right | you should try this other tool) pop up often but are fundamentally worthless because as long as you're not showing code you might as well be making it up. The blog post gives examples of clear failures that anyone can reproduce for themselves; I think it's time vibe code defenders are held to the same standard.

adastra22•5mo ago

The very first example is that LLMs lose their mental model of chess when playing a game. Ok, so instead ask Claude Code to design an MCP for tracking chess moves, and vibe code it. That’s the very first thing that comes to mind, and I expect it would work well enough.
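To make the suggestion concrete, here is a minimal sketch of the state-tracking core such a tool might expose. It is an illustration only: `ChessTracker` and its methods are hypothetical names, moves are assumed to arrive in coordinate notation (e.g. `e2e4`), there is no legality checking, castling, en passant or promotion, and none of the actual MCP wiring is shown.

```python
# Starting position, one string per rank, rank 8 first (FEN order).
START = [
    "rnbqkbnr",
    "pppppppp",
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",
    "RNBQKBNR",
]


class ChessTracker:
    """Hypothetical helper: holds the board so the model doesn't have to."""

    def __init__(self):
        # rows[0] is rank 8 and rows[7] is rank 1.
        self.rows = [list(r) for r in START]
        self.white_to_move = True

    @staticmethod
    def _index(square):
        # "e2" -> (row 6, column 4)
        return 8 - int(square[1]), ord(square[0]) - ord("a")

    def apply(self, move):
        """Apply a coordinate move such as 'e2e4'. Assumption: no legality checks."""
        r1, c1 = self._index(move[:2])
        r2, c2 = self._index(move[2:4])
        piece = self.rows[r1][c1]
        if piece == ".":
            raise ValueError(f"no piece on {move[:2]}")
        if piece.isupper() != self.white_to_move:
            raise ValueError("wrong side to move")
        self.rows[r1][c1] = "."
        self.rows[r2][c2] = piece
        self.white_to_move = not self.white_to_move

    def placement(self):
        """Return the piece-placement field of a FEN string."""
        ranks = []
        for row in self.rows:
            text, empty = "", 0
            for sq in row:
                if sq == ".":
                    empty += 1
                    continue
                if empty:
                    text += str(empty)
                    empty = 0
                text += sq
            if empty:
                text += str(empty)
            ranks.append(text)
        return "/".join(ranks)
```

The model would call `apply` after every move and read `placement()` back instead of reconstructing the position from the move list. Whether that counts as the LLM "having" a model of the board is, of course, the thing being argued about.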

alfalfasprout•5mo ago

FWIW I do work with the latest tools/practices and completely agree with OP. It's also important to contextualize what "large" and "complex" codebases really mean.

Monorepos are large but the projects inside may, individually, not be that complex. So there are ways of making LLMs work with monorepos well (e.g., providing a top level index of what's inside, how to find projects, and explaining how the repo is set up). Complexity within an individual project is something current-gen SOTA LLMs (I'm counting Sonnet 4, Opus 4.1, Gemini 2.5 Pro, and GPT-5 here) really suck at handling.
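As a rough illustration of the "top level index" idea, something like the following could generate a starting point to paste into the model's context. This is a sketch under assumptions: `build_index` and the `MANIFESTS` list are made up for the example, and a real index would also need the hand-written part (how the repo is set up and why).

```python
from pathlib import Path

# Assumed manifest names; extend for whatever build systems the repo uses.
MANIFESTS = ("package.json", "pyproject.toml", "Cargo.toml", "go.mod", "pom.xml")


def build_index(repo_root):
    """Return a Markdown index of the top-level project directories."""
    root = Path(repo_root)
    lines = ["# Repo index", ""]
    for entry in sorted(root.iterdir(), key=lambda p: p.name):
        # Skip plain files and hidden directories such as .git.
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        found = [m for m in MANIFESTS if (entry / m).is_file()]
        kind = ", ".join(found) if found else "no known manifest"
        readme = "README.md" if (entry / "README.md").is_file() else "no README"
        lines.append(f"- `{entry.name}/`: {kind}; {readme}")
    return "\n".join(lines)
```

Calling `print(build_index("."))` at the repo root gives one line per project. That helps with "where is everything", which is the easy part; it does nothing for the complexity inside a single project.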

Sure, you can assign discrete little tasks here and there. But bigger efforts that require not only understanding how the codebase is designed but also why it's designed that way fall short. Even more so if you need them to make good architectural decisions on something that's not "cookie cutter".

Fundamentally, I've noticed the chasm between those that are hyper-confident LLMs will "get there soon" and those that are experienced but doubtful depends on the type of development you do. "ticket pulling" type work generally has the work scoped well enough that an LLM might seem near-autonomous. More abstract/complex backend/infra/research work not so much. Still value there, sure. But hardly autonomous.

adastra22•5mo ago

Could, e.g., a custom-made 100k-token summary of the architecture and relevant parts of the giant repo, plus a base index of where to find more info, be sufficient to allow Opus to take a large task and split it into small enough subprojects that are farmed out to Sonnet instances with sufficient context?

This seems quite doable with even a small amount of tooling around Claude Code, even though I agree it doesn't have this capability out of the box. I think a large part of this gulf is "it doesn't work out of the box" vs "it can be made to work with a little customization."

exe34•5mo ago

How large? What does "deal" mean here? Autonomously - is that on its own whim, or at the behest of a user?

shinycode•5mo ago

« autonomously »: what happens with subtle updates that are not bugs but that change the meaning of some features and might break the workflow on some other external parts of a client’s system? It happens all the time and, because it’s really hard to have the whole meaning and business rules written and kept up to date, an LLM might never be able to grasp some of that meaning. Maybe if, instead of developing code and infrastructure, the whole industry shifts toward only writing impossibly precise spec sheets that make meaning and intent crystal clear, then maybe « autonomously » might be possible to pull off.

wizzwizz4•5mo ago

Those spec sheets exist: they're called software.

shinycode•5mo ago

Not exactly. It depends on how the software is written and whether there are ADRs in the project. I had to work on projects where there were bugs because someone coded business rules in a very bad and unclear way. You move an if somewhere and something breaks somewhere else. You ask « is this condition the way it’s supposed to work or is it a bug »; when the software is not clear enough (and often it isn’t, because we have to go fast) we ask people to confirm the rule. My point is this: amazingly written software surely works best with LLMs. That’s not how most software is written for now, because businesses sometimes value speed over engineering (or it’s a lack of skills).

wizzwizz4•5mo ago

Right: software is not necessarily a sufficiently-clear specification, but a sufficiently-clear specification would be software – and you've correctly identified that a good part of your job is ensuring the software provides a sufficiently-clear specification.

Amazingly-written software is necessary for LLMs to work well, but it isn't sufficient: LLMs tend to make nonsensical changes that, while technically implementing what they're asked to do (much of the time), reduce the quality of the software. As this repeats, the LLMs become less and less able to modify the program. This is because they can't program: they can translate, plagiarise, and interpolate, but they're missing several key programming skills, and probably cannot learn them.

slt2021•5mo ago

>LLMs will never manage to deal

time to prove hypothesis: infinity years

eru•5mo ago

Eh, if the hypothesis remains unfalsified for longer and longer, we can have increased confidence.

Similarly, Newton's laws say that bodies always stay at rest unless acted upon by a force. Strictly speaking, if a billiard ball jumps up without cause tomorrow that would disprove Newton. So we'd have to wait an infinite amount of time to prove Newton right.

However no one has to wait so long, and we found ways to express how Newton's ideas are _better_ than those of Aristotle without waiting an eternity.

bootsmann•5mo ago

The whole of modern science is based on the idea that we can never prove a theory about the world to be true, but that we can provide experiments which allow us to show that some theories are closer to the truth than others.

whoknowsidont•5mo ago

I don't think that statement is falsifiable until you define "deal with" and "large code bases."

p1necone•5mo ago

That feels like a statement that's far too loosely defined to be meaningful to me.

I work on codebases that you could describe as 'large', and you could describe some of the LLM driven work being done on them as 'autonomous' today.

Mars008•5mo ago

In two years there will probably be no new 'autonomous' LLMs. They will most likely be integrated into 'products', trained and designed for this. We see the beginning of it today as agents and tools.

otabdeveloper4•5mo ago

> LLMs will never manage to deal with large code bases 'autonomously'

Absolutely nothing about that statement is concrete or falsifiable.

Hell, you can already deal with large code bases 'autonomously' without LLMs - grep and find and sed go a long way!

drdeca•5mo ago

Seems falsifiable to me? If an LLM (+harness) is fully maintaining a project, updating things when dependencies update, handling bug reports, etc., in a way that is considered decent quality by consumers of the project, then that seems like it would falsify it.

Now, that’s a very high bar, and I don’t anticipate it being cleared any time soon.

But I do think if it happened, it would pretty clearly falsify the hypothesis.

bithive123•5mo ago

Language models aren't world models for the same reason languages aren't world models.

Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".

They only have meaning to sentient beings, and that meaning is heavily subjective and contextual.

But there appear to be some who think that we can grasp truth through mechanical symbol manipulation. Perhaps we just need to add a few million more symbols, they think.

If we accept the incompleteness theorem, then there are true propositions that even a super-intelligent AGI would not be able to express, because all it can do is output a series of placeholders. Not to mention the obvious fallacy of knowing super-intelligence when we see it. Can you write a test suite for it?

habitue•5mo ago

> Symbols, by definition, only represent a thing.

This is missing the lesson of the Yoneda Lemma: symbols are uniquely identified by their relationships with other symbols. If those relationships are represented in text, then in principle they can be inferred and navigated by an LLM.

Some relationships are not represented well in text: tacit knowledge like how hard to twist a bottle cap to get it to come off, etc. We aren't capturing those relationships between all your individual muscles and your brain well in language, so an LLM will miss them or have very approximate versions of them, but... that's always been the problem with tacit knowledge: it's the exact kind of knowledge that's hard to communicate!

nomel•5mo ago

I don’t think it’s a communication problem so much as that there is no possible relation between a word and a (literal) physical experience. They’re, quite literally, on different planes of existence.

drdeca•5mo ago

When I have a physical experience, sometimes it results in me saying a word.

Now, maybe there are other possible experiences that would result in me behaving identically, such that from my behavior (including what words I say) it is impossible to distinguish between different potential experiences I could have had.

But, “caused me to say” is a relation, is it not?

Unless you want to say that it wasn’t the experience that caused me to do something, but some physical thing that went along with the experience, either causing or co-occurring with the experience, and also causing me to say the word I said. But, that would still be a relation, I think.

nomel•5mo ago

Yes, but it's a unidirectional relation: it was the result of the experience. The word cannot represent the context (the experience), in a meaningful way.

It's like trying to describe a color to a blind person: poetic subjective nonsense.

drdeca•5mo ago

I don’t know what you mean by “unidirectional relation”. I get that you gave an explanation after the colon, but I still don’t quite get what you mean. Do you just mean that the words I use don’t pick out a unique possible experience? That’s true of course, but I don’t know why you call that “unidirectional”.

I don’t think describing colors to a blind person is nonsense. One can speak of how the different colors relate to one-another. A blind person can understand that a stop sign is typically “red”, and that something can be “borderline between red and orange”, but that things will not be “borderline between green and purple”. A person who has never had any color perception won’t know the experience of seeing something red or blue, but they can still have a mental model of the world that includes facts about the colors of things, and what effects these are likely to have, even though they themselves cannot imagine what it is like to see the colors.

akomtu•5mo ago

IMO, the GP's idea is that you can't explain sounds to a deaf man, or emotions to someone who doesn't feel them. All that needs direct experience and words only point to our shared experience.

drdeca•5mo ago

Ok, but you can explain properties of sounds to deaf men, and properties of colors to blind men. You can’t give them a full understanding of what it is like to experience these things, but that doesn’t preclude deaf or blind men from having mental models of the world that take into account those senses. A blind man can still reason about what things a sighted person would be able to conclude based on what they see, likewise a deaf man can reason about what a person who can hear could conclude based on what they could hear.

semiquaver•5mo ago

Well shit, I better stop reading books then.

nomel•5mo ago

I think you've missed the concept here.

You exist in the full experience. That lossy projection to words is still meaningful to you, in your reading, because you know the experience it's referencing. What do I mean by "lossy projection"? It's the experience of seeing the color blue to the word "blue". The word "blue" is meaningless without already having experienced it, because the word is not a description of the experience, it's a label. The experience itself can't be sufficiently described, as you'll find if you try to explain a "blue" to a blind person, because it exists outside of words.

The concept here is that something like an LLM, trained on human text, can't have meaningful comprehension of some concepts, because some words are labels of things that exist entirely outside of text.

You might say "but multimodal models use tokens for color!", or even extend that to "you could replace the tokens used in multimodal models with color names!" and I would agree. But the understanding wouldn't come from the relation of words in human text, it would come from the positional relation of colors across a space, which is not much different than our experience of the color, on our retina.

tldr: to get AI to meaningfully understand something, you have to give it a meaningful relation. Meaningful relations sometimes aren't present in human writing.

exe34•5mo ago

> Language models aren't world models for the same reason languages aren't world models.

> Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".

There are a lot of negatives in there, but I feel like it boils down to "a model of a thing is not the thing". Well duh. It's a model. A map is a model.

bithive123•5mo ago

Right. It's a dead thing that has no independent meaning. It doesn't even exist as a thing except conceptually. The referent is not even another dead thing, but a reality that appears nowhere in the map itself. It may have certain limited usefulness in the practical realm, but expecting it to lead to new insights ignores the fact that it's fundamentally an abstraction of the real, not in relationship to it.

exe34•5mo ago

> but expecting it to lead to new insights ignores the fact that it's fundamentally an abstraction of the real, not in relationship to it.

Where do humans get new insights from?

bithive123•5mo ago

Generally the experience of insight is prior to any discursive expression. We put our insights in terms of words, they do not arise as such.

exe34•5mo ago

Like VLMs then.

auggierose•5mo ago

First: true propositions (that are not provable) can definitely be expressed, if they couldn't, the incompleteness theorem would not be true ;-)

It would be interesting to know what the percentage of people is, who invoke the incompleteness theorem, and have no clue what it actually says.

Most people don't even know what a proof is, so that cannot be a hindrance on the path to AGI ...

Second: ANY world model that can be digitally represented would be subject to the same argument (if stated correctly), not only LLMs.

bithive123•5mo ago

I knew someone would call me out on that. I used the wrong word; what I meant was "expressed in a way that would satisfy" which implies proof within the symbolic order being used. I don't claim to be a mathematician or philosopher.

auggierose•5mo ago

Well, you don't get it. The LLM definitely can state propositions "that satisfy", let's just call them true propositions, and that this is not the same as having a proof for it is what the incompleteness theorem says.

Why would you require an LLM to have proof for the things it says? I mean, that would be nice, and I am actually working on that, but it is not anything we would require of humans and/or HN commenters, would we?

bithive123•5mo ago

I clearly do not meet the requirements to use the analogy.

I am hearing the term super intelligence a lot and it seems to me the only form that would take is the machine spitting out a bunch of symbols which either delight or dismay the humans. Which implies they already know what it looks like.

If this technology will advance science or even be useful for everyday life, then surely the propositions it generates will need to hold up to reality, either via axiomatic rigor or empirically. I look forward to finding out if that will happen.

But it's still just a movement from the known to the known, a very limited affair no matter how many new symbols you add in whatever permutation.

chamomeal•5mo ago

I’m not a math guy but the incompleteness theorem applies to formal systems, right? I’ve never thought about LLMs as formal systems, but I guess they are?

bithive123•5mo ago

Nor am I. I'm not claiming an LLM is a formal system, but it is mechanical and operates on symbols. It can't deal in anything else. That should temper some of the enthusiasm going around.

pron•5mo ago

Anything that runs on a computer is a formal system. "Formal" (the manipulation of forms) is an old term for what, after Turing, we call "mechanical".

scarmig•5mo ago

> If we accept the incompleteness theorem

And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision. So if the incompleteness theorem means that neural nets can never find truth, it also means that the human brain can never find truth.

Human neuron firing patterns, after all, only represent a thing; they are not the same as the thing. Your experience of seeing something isn't recreating the physical universe in your head.

bevr1337•5mo ago

> And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision.

Wouldn't it become harder to simulate a human brain the larger a machine is? I don't know nothing, but I think that pesky speed of light thing might pose a challenge.

drdeca•5mo ago

simulate ≠ simulate-in-real-time

zeroonetwothree•5mo ago

All simulation is realtime to the brain being simulated.

drdeca•5mo ago

Sure, but that’s not the clock that’s relevant to the question of the light speed communication limits in a large computer?

overgard•5mo ago

I don't think you can apply the incompleteness theorem like that, LLMs aren't constrained to formal systems

pron•5mo ago

> Symbols, by definition, only represent a thing. They are not the same as the thing

First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.

But to your philosophical point, assuming there are only a finite number of things and places in the universe - or at least the part of which we care about - why wouldn't they be representable with a finite set of symbols?

What you're rejecting is the Church-Turing thesis [1] (essentially, that all mechanical processes, including that of nature, can be simulated with symbolic computation, although there are weaker and stronger variants). It's okay to reject it, but you should know that not many people do (even some non-orthodox thoughts by Penrose about the brain not being simulatable by an ordinary digital computer still accept that some physical machine - the brain - is able to represent what we're interested in).

> If we accept the incompleteness theorem

There is no if there. It's a theorem. But it's completely irrelevant. It means that there are mathematical propositions that can't be proven or disproven by some system of logic, i.e. by some mechanical means. But if something is in the universe, then it's already been proven by some mechanical process: the mechanics of nature. That means that if some finite set of symbols could represent the laws of nature, then anything in nature can be proven in that logical system. Which brings us back to the first point: the only way the mechanics of nature cannot be represented by symbols is if they are somehow infinite, i.e. they don't follow some finite set of laws. In other words - there is no physics. Now, that may be true, but if that's the case, then AI is the least of our worries.

Of course, if physics does exist - i.e. the universe is governed by a finite set of laws - that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.

[1]: https://plato.stanford.edu/entries/church-turing/

astrange•5mo ago

> First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.

It should be capable of something similar (fsvo similar), but the largest difference is that humans have to be power-efficient and LLMs do not.

That is, people don't actually have world models, because modeling something is a waste of time and energy insofar as it's not needed for anything. People are capable of taking out the trash without knowing what's in the garbage bag.

Terr_•5mo ago

> Of course, if physics does exist - i.e. the universe is governed by a finite set of laws

Wouldn't physics still "exist" even if there were an infinite set of laws?

pron•5mo ago

Well, the physical universe will still exist, but physics - the scientific study of said universe - will become sort of meaningless, I would think?

Terr_•5mo ago

Why meaningless? Imperfect knowledge can still be useful, and ultimately that's the only kind we can ever have about anything.

"We could learn to sail the oceans and discover new lands and transport cargo cheaply... But in a few centuries we'll discover we were wrong and the Earth isn't really a sphere and tides are extra-complex so I guess there's no point."

pron•5mo ago

Because if there's an infinite number of laws, are they laws at all? You can't predict anything because you don't even know if some of the laws you don't know yet (which is pretty much all of them) make an exception to the 0% of laws you do know. I'm not saying it's not interesting, but it's more history - today the apple fell down rather than up or sideways - than physics.

pixl97•5mo ago

In the infinite set of all laws is there an infinite set of laws that do not conflict with each other?

.000000000000001% of infinity is still infinite.

goatlover•5mo ago

> Of course, if physics does exist - i.e. the universe is governed by a finite set of laws

That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.

The Humean way of looking at physics is that we notice relationships and model those with various symbols. The symbols form incomplete models because we can't get to the bottom of why the relationships exist.

> that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.

The indeterminism of Quantum Mechanics limits how precise measurement can be and how predictable the future is.

pron•5mo ago

> That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.

What I meant was that since physics is the scientific search for the laws of nature, then if there's an infinite number of them, then the pursuit becomes somewhat meaningless, as an infinite number of laws aren't really laws at all.

> The symbols form incomplete models because we can't get to the bottom of why the relationships exist.

Why would a model be incomplete if we don't know why the laws are what they are? A model pretty much is a set of laws; it doesn't require an explanation (we may want such an explanation, but it doesn't improve the model).

drdeca•5mo ago

Gödel’s incompleteness theorems aren’t particularly relevant here. Given how often people attempt to apply them to situations where they don’t say anything of note, I think the default should generally be to not publicly appeal to them unless one either has worked out semi-carefully how to derive the thing one wants to show from them, or at least has a sketch that one is confident, from prior experience working with it, that one could make into a rigorous argument. Absent these, the most one should say, I think, is “Perhaps one can use Gödel’s incompleteness theorems to show [thing one wants to show].”

Now, given a program that is supposed to output text that encodes true statements (in some language), one can probably define some sort of inference system that corresponds to the program such that the inference system is considered to “prove” any sentence that the program outputs (and maybe also some others based on some logical principles, to ensure that the inference system satisfies some good properties), and upon defining this, one could (assuming the language allows making the right kinds of statements about arithmetic) show that this inference system is, by Gödel’s theorems, either inconsistent or incomplete.

This wouldn’t mean that the language was unable to express those statements. It would mean that the program either wouldn’t output those statements, or that the system constructed from the program was inconsistent (and, depending on how the inference system is obtained from the program, the inference system being inconsistent would likely imply that the program sometimes outputs false or contradictory statements).

But, this has basically nothing to do with the “placeholders” thing you said. Gödel’s theorem doesn’t say that some propositions are inexpressible in a given language, but that some propositions can’t be proven in certain axiom+inference systems.

Rather than the incompleteness theorems, the “undefinability of truth” result seems more relevant to the kind of point I think you are trying to make.

Still, I don’t think it will show what you want it to, even if the thing you are trying to show is true. Like, perhaps it is impossible to capture qualia with language, sure, makes sense. But logic cannot show that there are things which language cannot in any way (even collectively) refer to, because to show that there is a thing it has to refer to it.

————

“Can you write a test suite for it?”

Hm, might depend on what you count as a “suite”, but a test protocol, sure. The one I have in mind would probably be a bit expensive to run if it fails the test though (because it involves offering prize money).

energy123•5mo ago

Everything is just a low resolution representation of a thing. The so-called reality we supposedly have access to is at best a small number of sound waves and photons hitting our face. So I don't buy this argument that symbols are categorically different. It's a gradient, and symbols are a sparser and less rich data source, yes. But who are we to say where that hypothetical line exists, beyond which further compression of concepts into smaller numbers of buckets becomes a non-starter for intelligence and world modelling? And then there are multimodal LLMs, which have access to data of a similar richness to what humans have access to.

bithive123•5mo ago

There are no "things" in the universe. You say this wave and that photon exist and represent this or that, but all of that is conceptual overlay. Objects are parts of speech, reality is undifferentiated quanta. Can you point to a particular place where the ocean becomes a particular wave? Your comment already implies an understanding that our mind is behind all the hypothetical lines; we impose them, they aren't actually there.

cognitif•5mo ago

> Language models aren't world models for the same reason languages aren't world models. Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".

Symbols, maps, descriptions, and words are useful precisely because they are NOT what they represent. Representation is not identity. What else could a “world model” be other than a representation? Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?

mrbungie•5mo ago

> Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?

I was following the string of questions, but I think there is a logical leap between those two questions.

Another question: is language the only way to define models? An imagined sound or an imagined picture of an apple in my mind's eye are models to me, but they don't use language.

jandrewrogers•5mo ago

There is an important implication of learning and indexing being equivalent problems. A number of important data models and data domains exist for which we do not know how to build scalable indexing algorithms and data structures.

It has been noted for several years in US national labs and elsewhere that there is an almost perfect overlap between data models LLMs are poor at learning and data models that we struggle to index at scale. If LLMs were actually good at these things then there would be a straightforward path to addressing these longstanding non-AI computer science problems.

The incompleteness is that the LLM tech literally can't represent elementary things that are important enough that we spend a lot of money trying to represent them on computers for non-AI purposes. A super-intelligent AGI being right around the corner implies that we've solved these problems that we clearly haven't solved.

Perhaps more interesting, it also implies that AGI tech may look significantly different than the current LLM tech stack.

copypaper•5mo ago

Reminds me of this [1] article. If we humans, after all these years we've been around, can't relay our thoughts exactly as we perceive them in our heads, what makes us think that we can make a model that does it better than us?

[1]: https://www.experimental-history.com/p/you-cant-reach-the-br...

frankfrank13•5mo ago

Great quote at the end that I think I resonate a lot with:

> Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.

1970-01-01•5mo ago

I'm surprised the models haven't been enshittified by capitalism. I think in a few years we're going to see lightning-fast LLMs generating better output compared to what we're seeing today. But it won't be 1000x better, it will be 10x better, 10x faster, and completely enshittified with ads and clickbait links. Enjoy ChatGPT while it lasts.

UltraSane•5mo ago

I wonder how the nature of the language used to train an LLM affects its model of the world. Would a language designed for the maximum possible information content and clarity, like Ithkuil, make an LLM's world model more accurate?

DennisP•5mo ago

Maybe pure language models aren't world models, but Genie 3 for example seems to be a pretty good world model:

https://deepmind.google/discover/blog/genie-3-a-new-frontier...

We also have multimodal AIs that can do both language and video. Genie 3 made multimodal with language might be pretty impressive.

Focusing only on what pure language models can do is a bit of a straw man at this point.

ordu•5mo ago

> LLMs are not by themselves sufficient as a path to general machine intelligence; in some sense they are a distraction because of how far you can take them despite the approach being fundamentally incorrect.

I don't believe that it is a fundamentally incorrect approach. I believe that the human mind does something like that all the time; the difference is our minds have some additional processes that can, for example, filter hallucinations.

Kids in a specific age range are afraid of their imagination. Their imagination can place a monster into any dark place where nothing can be seen. An adult mind can do the same easily, but the difference is kids have difficulty distinguishing imagination from perception, while adults generally manage.

I believe, the ability of human mind to see difference between imagination/hallucinations from one hand and perception and memory from the other is not a fundamental thing stemming from the architecture of brains but a learned skill. Moreover people can be tricked to acquire false memory[1]. If LLM fell to tricks of Elizabet Loftus, we'd say LLM hallucinated.

What LLMs need is to learn some tricks to detect hallucinations. Probably they will not get 100% reliable detector, but to get to the level of humans they don't need 100% reliability.

shkkmo•5mo ago

> If an LLM fell for the tricks of Elizabeth Loftus, we'd say the LLM hallucinated.

She's strongly oversold how and when false memories can be created. She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.

She's spent a career over-generalizing data about implanting false minor memories to make money discrediting victims' traumatic memories and defend abusers.

You conflate "hallucination" with "imagination", but the former has much more in common with lying than it does with imagining.

taneq•5mo ago

> She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.

Did she have financial incentives? Was this a live demonstration? :P

mac-mc•5mo ago

Do you have many memories of that time, around 3 to 5, and remember what your cognitive processes were?

When the child is afraid of the monster in the dark, they are not literally visually hallucinating a beast in the dark; they are worried that there could be a beast in the dark, and they are not sure that there is due to a lack of sensory information confirming a lack of the monster. They are not being hyper precise because they are 3, so they say "there is a monster under my bed"! Children have instincts to be afraid of the dark.

Similarly with imaginary friends and play, it's an instinct to practice through smaller stakes simulations. When they are emotionally attached to their imaginary friends, it's much like they are emotionally attached to their security blanket. They know that the "friend" is not perceptible.

It's much like the projected anxieties of adults or teenagers, who are worried that everyone thinks they are super lame and thus act like people do, because on the balance of no information, they choose the "safer path".

That is pretty different than the hallucinations of LLMs IMO.

TazeTSchnitzel•5mo ago

I have recently lived through something called a psychotic break, which was an unimaginably horrible thing, but it did let me see from the inside what insanity does to your thinking. And what's fascinating, coming out the other side of this, is how similar LLMs are to someone in psychosis. Someone in psychosis can have all the ability LLMs have to recognise patterns and sound like they know what they're talking about, but their brain is not working well enough to have proper self-insight, to be able to check their thoughts actually fully make sense. (And “making sense” turns out to be a sliding scale — it is not as if you just wake up one day suddenly fully rational again, there's a sliding scale of irrational thinking and you have to gradually re-process your older thoughts into more and more coherent shapes as your brain starts to work more correctly again.) I believe this isn't actually a novel insight either, many have worried about this for years! Psychosis might be an interesting topic to read about if you want to get another angle to understand the AI models from. I won't claim that it's exactly the same thing, but I will say that most people probably have a very undeveloped idea of what mental illness actually is or how it works, and that leaves them badly prepared for interacting with a machine that has a strong resemblance to a mentally ill person who's learned to pretend to be normal.

johnisgood•5mo ago

If we take something as simple as a panic attack, many people have no clue what it feels like, which is unfortunate, because they lack empathy for those who do experience it. My psychiatrists definitely need to experience it to understand.

rauljara•5mo ago

Thank you for sharing, and sorry you had to go through that. I had a good friend go through a psychotic break and I spent a long time trying to understand what was going on in his brain. The only solid conclusion I could come to was that I could not relate to what he was going through, but that didn’t change that he was obviously suffering and needed whatever support I could offer. Thanks for giving me a little bit of insight into his brain. Hope you were/are able to find support out there.

otabdeveloper4•5mo ago

> I believe that the human mind does something like that all the time

Absolutely not. Human brains have online one-shot training. LLM weights are fixed and fine-tuning them is a huge multi-year enterprise.

Fundamentally it's two completely different architectures.

ordu•5mo ago

I really don't like how you're rejecting the idea completely. People have online one-shot training, but have you tried to learn how to play the piano? To learn it you need a lot of repetitions. Really a lot. You need a lot of repetitions to learn how to walk, or how to do arithmetic, or how to read English. This is very similar to LLMs, isn't it? So they are not completely different architectures, are they? It is more like human brains have something on top of the "LLM" that allows them to do tricks that LLMs can't do.

otabdeveloper4•5mo ago

> This is very similar to LLMs, isn't it?

No, it isn't at all. The effort humans spend on rote learning is to optimize mechanical precision in performance, not to internalize the concepts.

The concepts of playing the piano you can learn in a couple days. All the rest of the effort is about getting synchronization and timing right.

bayindirh•5mo ago

From my perspective, the fundamental problem arises from the assumption that all of the brain's functions are self-contained. However, there are feedback loops in the body which support the functions of the brain.

The simplest one is fight/flight/freeze. The brain starts the process by being afraid, and hormones get released, but the next step is triggered by the nerve feedback coming from the body. If you are using beta-blockers and can't get panicked, the initial trigger fizzles and you return to your pre-panic state.

An LLM doesn't model a complete body. It just models the language. That is just a very small part of what the brain handles, so assuming that modelling the language, or even the whole brain, is going to answer all the questions we have is a flawed approach.

The latest research shows the body is a much more complicated and interconnected system than we learnt in school 30 years ago.

mft_•5mo ago

Sure, your points about the body aren’t wrong, but (as you say) LLMs are only modelling a small subset of a brain’s functions at the moment: applied knowledge, language/communication, and recently interpretation of visual data. There’s no need or opportunity for an LLM (as they currently exist) to do anything further. Further, just because additional inputs exist in the human body (gut-brain axis, for example) it doesn’t mean that they are especially (or at all) relevant for knowledge/language work.

TheOtherHobbes•5mo ago

The point is that knowledge/language work can't work reliably unless it's grounded in something outside of itself. Without it you don't get an oracle, you get a superficially convincing but fundamentally unreliable idiot savant who lacks a stable sense of self, other, or real world.

The fundamental foundation of science and engineering is reliability.

If you start saying reliability doesn't matter, you're not doing science and engineering any more.

mft_•5mo ago

I'm really struggling to understand what you're trying to communicate here; I'm even wondering if you're an LLM set up to troll, due to the weird language and confusing non-sequiturs.

> The point is that knowledge/language can't work reliably unless it's grounded in something outside of itself.

Just, what? Knowledge is facts, somehow held within a system allowing recall and usage of those facts. Knowledge doesn't have a 'self', and I'm totally not understanding how pure knowledge as a concept or medium needs "grounding"?

Being charitable, it sounds more like you're trying to describe "wisdom" - which might be considered as a combination of knowledge, lived experience, and good judgement? Yes, this is valuable in applying knowledge more usefully, but has nothing to do with the other bodily systems which interact with the brain, which is where you started?

> The fundamental foundation of science and engineering is reliability.

> If you start saying reliability doesn't matter, you're not doing science and engineering any more.

No-one mentioned reliability - not you in your original post, or me in my reply. We were discussing whether the various (unconscious) systems which link to the brain in the human body (like the gut:brain axis) might influence its knowledge/language/interpretation abilities.

Mikhail_Edoshin•5mo ago

You probably know the Law of Archimedes. Many people do. But do you know it in the same way Archimedes did? No. You were told the law, then taught how to apply it. But Archimedes discovered it without any of that.

Can we repeat the feat of Archimedes? Yes, we can, but first we'd have to forget what we were told and taught.

The way we actually discover things is very different from amassing lots of hearsay. Indeed, we do have an internal part that behaves the same way an LLM does. But to get to the real understanding we actually shut down that part, forget what we "know", start from a clean slate. That part does not help us think; it helps us to avoid thinking. The reason it exists is that it is useful: thinking is hard and slow, but recalling is easy and fast. But it is not thinking; it is the opposite.

ordu•5mo ago

> But to get to the real understanding we actually shut down that part, forget what we "know", start from a clean slate.

Close, but not exactly. To start from a clean slate is not very difficult, the trick is to reject some chosen parts of existing knowledge, or more specifically the difficulty is to choose what to reject. Starting from a clean slate you'll end up spending millennia to get the knowledge you've just rejected.

So the overall process of generating knowledge is to look under the streetlight till finding something new becomes impossible or too hard, and then you start experimenting with rejecting some bits of your knowledge to rethink them. I was taught to read works of Great Masters of the past critically, trying to reproduce their path while looking for forks where you can try to go the other way. It is a little bit like starting from a clean slate, but not exactly.

globular-toast•5mo ago

I agree with the article. I will be very surprised if LLMs end up being "it". I say this as a language geek who has always been amazed how language drives our thinking. However, I think language exists between brains, not inside them. There's something else in us and LLMs aren't it.

revskill•5mo ago

LLM is not AI, it's dumbass, too stupid to NOT assume and hallucinate.

mft_•5mo ago

Lots of people assume, confabulate, misremember, and lie every day.

revskill•5mo ago

They are not intelligent.

mft_•5mo ago

LOL, you really think that intelligence (however you want to define or measure the concept) is a guarantee that people won't make mistakes, misremember, make stuff up, or lie?

antirez•5mo ago

The post is based on a misconception. If you read the blog post linked at the end of this message, you'll see how a very small GPT-2-like transformer (Karpathy's nanoGPT trained at a very small size), after seeing just PGN games and nothing more, develops an 8x8 internal representation of which chess piece is where. This representation can be extracted by linear probing (and can even be altered by using the probe in reverse). LLMs are decent but not very good chess players for other reasons, not because they don't have a world model of the chess board.

https://www.lesswrong.com/posts/yzGDwpRBx6TEcdeA5/a-chess-gp...
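For readers unfamiliar with the term, here is a minimal sketch of what "linear probing" means, under loud assumptions: the vectors below are synthetic stand-ins for transformer activations, the binary label is a stand-in for something like "is this square occupied", and all function names are hypothetical. It is not the linked post's code. The idea is only that if a property is encoded along some direction in activation space, a plain linear classifier trained on (activation, label) pairs can read it out.

```python
import math
import random

def make_activations(rng, n, direction, strength=3.0):
    """Return n (vector, label) pairs of synthetic 'activations'.
    Each vector is Gaussian noise plus a shift along `direction`
    whose sign encodes the hidden binary label (an assumption of this toy)."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = strength if label == 1 else -strength
        vec = [rng.gauss(0.0, 1.0) + shift * d for d in direction]
        samples.append((vec, label))
    return samples

def train_linear_probe(samples, epochs=20, lr=0.05):
    """Fit a logistic-regression probe (weights and bias) with plain SGD."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vec, label in samples:
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label
            w = [wi - lr * err * xi for wi, xi in zip(w, vec)]
            b -= lr * err
    return w, b

def probe_accuracy(w, b, samples):
    """Fraction of samples whose label the probe predicts correctly."""
    hits = 0
    for vec, label in samples:
        z = sum(wi * xi for wi, xi in zip(w, vec)) + b
        hits += int((z > 0) == (label == 1))
    return hits / len(samples)

rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(16)]
norm = math.sqrt(sum(r * r for r in raw))
direction = [r / norm for r in raw]  # hidden unit direction encoding the label

train = make_activations(rng, 400, direction)
held_out = make_activations(rng, 200, direction)
w, b = train_linear_probe(train)
accuracy = probe_accuracy(w, b, held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy here only shows that the label is linearly decodable from these synthetic vectors; whether that counts as a "world model" in a real network is exactly what the thread is arguing about.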

yosefk•5mo ago

The post or rather the part you refer to is based on a simple experiment which I encourage you to repeat. (It is way likelier to reproduce in the short to medium run than the others.)

From your link: "...The first was gpt-3.5-turbo-instruct's ability to play chess at 1800 Elo"

These things don't play at 1800 ELO, though maybe someone measured this ELO without cheating but rather relying on some artifacts of how an engine told to play at a low rating does against an LLM (engines are weird when you ask them to play badly, as a rule); a good start to a decent measurement would be to try it on chess 960. These things do lose track of the pieces in 10 moves. (As do I absent a board to look at, but I understand enough to say "I can't play blindfold chess, let's set things up so I can look at the current position somehow")
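As a rough illustration of the Chess960 suggestion, here is a minimal sketch of generating a random Chess960 starting back rank, which could be used to set up positions unlikely to appear verbatim in training data. The function name is hypothetical; the only rules assumed are the standard Chess960 constraints (bishops on opposite-coloured squares, king somewhere between the two rooks).

```python
import random

def chess960_back_rank(rng):
    """Return a random Chess960 back rank as an 8-character string (files a-h)."""
    rank = [None] * 8
    # Bishops go on opposite-coloured squares: one even index, one odd index.
    rank[rng.choice([0, 2, 4, 6])] = "B"
    rank[rng.choice([1, 3, 5, 7])] = "B"
    # Queen and both knights take random free squares.
    for piece in ("Q", "N", "N"):
        free = [i for i, p in enumerate(rank) if p is None]
        rank[rng.choice(free)] = piece
    # Three squares remain; filling them left to right with rook, king, rook
    # guarantees the king sits between the rooks.
    free = [i for i, p in enumerate(rank) if p is None]
    for i, piece in zip(free, ("R", "K", "R")):
        rank[i] = piece
    return "".join(rank)

print(chess960_back_rank(random.Random()))
```

Passing a seeded `random.Random` makes the chosen position reproducible, which helps when comparing several models on the same starting setup.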

famouswaffles•5mo ago

>These things don't play at 1800 ELO

Why are you saying 'these things'? That statement is about a specific model which did play at that level and did not lose track of the pieces. There's no cheating or weirdness.

https://github.com/adamkarvonen/chess_gpt_eval

thecupisblue•5mo ago

Ironically, that lesswrong article is more wrong than right.

First, chess is perfect for such modeling. The game is basically a tree of legal moves. The "world model" representation is already encoded in the dataset itself, and at a certain scale the chance of making an illegal move is minimal, as the dataset itself includes an insane number of legal moves compared to illegal moves, let alone when you are training it on a chess dataset like a PGN one.

Second, the probing is quite... a subjective thing.

We are cherry-picking activations across an arbitrary number of dimensions, on a model specifically trained for chess, taking these arbitrary representations and displaying them on a 2D graph.

Well yeah, with enough dimensions and cherry-picking, we can also show how "all zebras are elephants, because all elephants are horses and look their weights overlap in so many dimensions - large four-legged animals you see on safari!" - especially if we cherry-pick it. Especially if we tune a dataset on it.

This shows nothing other than "training an LLM on a constrained move dataset makes the LLM great at predicting the next move in that dataset".

flender•5mo ago

And if it knew every possible board configuration and optimal move, it could potentially do as well as it could, but instead if it were to just recognize “this looks like a chess game” and use an optimized tool to determine the next move, that would be a better use of training, it would seem.

thecupisblue•5mo ago

Way better use. At this point that engine is more like the world's most expensive Monte Carlo search.

FailMore•5mo ago

It seems clear to me that LLMs are a useful sort of dumb smart activity. They can take some pretty useful stabs in the dark, and therefore do much better in an environment which can give them feedback (coding) or where there is no objectively correct answer (write a poem). It opens the door for some novel types of computational tasks, and the more feedback you can provide within the architecture of your application, the more useful the LLM will probably be. I think the hype about their genuine intelligence is overblown, but that doesn’t mean they are not useful.

Peteragain•5mo ago

This article appeared on HN a while ago. https://dynomight.net/more-chess/ It basically is in agreement with this article and provides a few more trials and explanations.

devxpy•5mo ago

I don’t think you even need such complex tests as chess to see it doesn’t have a world model - just ask it to do any 5+ digit multiplication
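To make that test concrete, here is a minimal sketch of the exact procedure a model would need to carry out: schoolbook long multiplication on digit strings, plus a checker for a claimed product. The function names are hypothetical, and the checker simply compares against Python's exact integer arithmetic.

```python
def long_multiply(a: str, b: str) -> str:
    """Multiply two non-negative decimal digit strings the schoolbook way."""
    result = [0] * (len(a) + len(b))  # least significant digit first
    for i, da in enumerate(reversed(a)):
        carry = 0
        for j, db in enumerate(reversed(b)):
            total = result[i + j] + int(da) * int(db) + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b)] += carry  # this slot is still untouched for row i
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"

def check_claimed_product(a: int, b: int, claimed: str) -> bool:
    """Return True if `claimed` (e.g. a model's answer) equals a * b exactly."""
    return claimed.strip().replace(",", "") == str(a * b)

print(long_multiply("12345", "67890"))
```

The procedure is a few lines of carrying and adding, which is what makes a wrong answer on 5+ digit inputs a notable failure rather than a hard problem.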

machiaweliczny•5mo ago

I think they know this but don’t have causality built in, in the sense that they aren’t incentivised to understand holistically. Kids around 4 years old spam “why? why? why?” questions, and I think this is some process we have yet to reproduce. (BTW I suspect they ask this as a manifestation of what is going on in their brain and not out of real curiosity, as they ask the same question multiple times.)

machiaweliczny•5mo ago

I think it’s mostly because they are incentivised to answer verbatim, like medical students, and not with their own understanding. RL methods change that.

glenstein•5mo ago

This is the best and clearest explanation I have yet seen that describes a tricky thing, namely that LLMs, which are synonymous with "AI" for so many people, are just one variation of many possible types of machine intelligence.

Which I find important because, well, hallucinating facts is what you would expect from an LLM, but isn't necessarily an inherent issue with machine intelligence writ large if it's trained from the ground up on different principles, or modelling something else. We use LLMs as a stand-in for tutors because being really good at language incidentally makes them able to explain math or history as a side effect.

Importantly it doesn't show that hallucinating is a baked in problem for AI writ large. Presumably different models will have different kinds of systemic errors based on their respective designs.

m4nu3l•5mo ago

Developing a model of the real world, or even just learning only a subset of self-consistent information, could be detrimental to the task of predicting the next token in the average text, given that most of the written information on many subjects could be contradictory and somehow wrong. I don't know how they are doing RL on top of that, how they are using synthetic data or filtering them. But it's clear that even with GPT-5 they haven't solved the problem, as the presentation demonstrated with the very first prompt (I'm talking about the wrong explanation for lift produced by a wing).

maebert•5mo ago

It might be worth noting that humans also struggle with keeping up a coherent world model over time.

Luckily, we don’t have to; we externalize a lot of our representations. When shopping together with a friend we might put our stuff on one side of the shopping cart and our friends’ on the other. There’s a reason we don’t just play chess in our heads but use a chess board. We use notebooks to write things down, etc.

Some reasoning models can do similar things (keep a persistent notebook that gets fed back into the context window on every pass), but I expect that we need a few more dirty representationalist tricks to get there.

In other words, I don’t think it’s an LLM's job to have a world model, but an LLM is just one part of an AI system.
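The persistent-notebook idea above can be sketched roughly as follows. This is a toy under stated assumptions: `model` is any callable from prompt text to output text, `toy_model` is a hypothetical stand-in rather than a real LLM API, and the prompt layout is invented for illustration.

```python
from typing import Callable, List

def run_with_notebook(model: Callable[[str], str], task: str, steps: int) -> List[str]:
    """Call `model` repeatedly, feeding the accumulated notebook back in each pass."""
    notebook: List[str] = []
    for _ in range(steps):
        # The externalized state lives in the notebook, not in the model:
        # every pass sees the task plus everything written so far.
        prompt = task + "\n\nNotebook so far:\n" + "\n".join(notebook)
        notebook.append(model(prompt))
    return notebook

def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: counts the notes it can see."""
    seen = sum(1 for line in prompt.splitlines() if line.startswith("note "))
    return f"note {seen + 1}"

print(run_with_notebook(toy_model, "Track the position.", 3))
```

The design point is that the model itself stays stateless between calls; the surrounding system carries the representation, much like the chess board or shopping cart in the examples above.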

3willows•5mo ago

I've been looking for a way to play chess/go with a generalist LLM. It wouldn't matter if the moves are bad (I like winning), but being able to chat on unrelated topics while playing the game would take the experience to the next level.

quantumcotton•5mo ago

I suppose teaching mine gsis and point clouds and connecting it to the world gsis map is a bad idea?
