LLMs aren't world models

https://yosefk.com/blog/llms-arent-world-models.html
223•ingve•2d ago

Comments

t0md4n•2d ago
https://arxiv.org/abs/2501.17186
yosefk•2d ago
This is interesting. The "professional level" rating of <1800 isn't, but still.

However:

"A significant Elo rating jump occurs when the model’s Legal Move accuracy reaches 99.8%. This increase is due to the reduction in errors after the model learns to generate legal moves, reinforcing that continuous error correction and learning the correct moves significantly improve ELO"

You should be able to reach the move legality of around 100% with few resources spent on it. Failing to do so means that it has not learned a model of what chess is, at some basic level. There is virtually no challenge in making legal moves.

lostmsu•20h ago
> r4rk1 pp6 8 4p2Q 3n4 4N3 qP5P 2KRB3 w — — 3 27

Can you say 100% you can generate a good next move (example from the paper) without using tools, and will never accidentally make a mistake and give an illegal move?
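For readers unfamiliar with the notation, the position quoted above can be unpacked into a board. This is only a minimal sketch: it assumes the first eight space-separated fields are the ranks from 8 down to 1 (standard FEN separates them with slashes), and the helper names are made up for illustration.

```python
# Piece-placement fields as quoted in the comment above: one string per rank,
# from rank 8 down to rank 1. Digits mean that many empty squares.
RANKS = "r4rk1 pp6 8 4p2Q 3n4 4N3 qP5P 2KRB3".split()


def expand_rank(rank: str) -> str:
    """Expand one rank string into exactly 8 characters ('.' = empty square)."""
    squares = []
    for ch in rank:
        if ch.isdigit():
            squares.append("." * int(ch))
        else:
            squares.append(ch)
    row = "".join(squares)
    if len(row) != 8:
        raise ValueError(f"rank {rank!r} does not describe 8 squares")
    return row


def expand_board(ranks: list[str]) -> list[str]:
    """Return 8 rows of 8 characters, rank 8 first (hypothetical helper)."""
    if len(ranks) != 8:
        raise ValueError("expected 8 ranks")
    return [expand_rank(r) for r in ranks]


board = expand_board(RANKS)
for rank_number, row in zip(range(8, 0, -1), board):
    print(rank_number, row)
```

Uppercase letters are White's pieces and lowercase are Black's. Working out which moves are legal from here is a separate and harder step that this sketch does not attempt.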

rpdillon•5h ago
> Failing to do so means that it has not learned a model of what chess is, at some basic level.

I'm not sure about this. Among a standard amateur set of chess players, how often when they lack any kind of guidance from a computer do they attempt to make a move that is illegal? I played chess for years throughout elementary, middle and high school, and I would easily say that even after hundreds of hours of playing, I might make two mistakes out of a thousand moves where the move was actually illegal, often because I had missed that moving that piece would continue to leave me in check due to a discovered check that I had missed.

It's hard to conclude from that experience that players that are amateurs lack even a basic model of chess.

libraryofbabel•2d ago
This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure being smart and reasoning from first principles after asking the LLM a lot of questions and cherry-picking what it gets wrong gets to any novel insights at this point. And it already feels a little out of date: with LLMs getting gold on the Mathematical Olympiad, they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.

LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed, since they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.

armchairhacker•2d ago
Any suggestions from this literature?
libraryofbabel•2d ago
The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.
yosefk•2d ago
Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.

My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.

Let's see how my predictions hold up; I have made enough to look very wrong if they don't.

Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And I have much better examples than the 2+2=4, which I am citing as something that sorta works these days.

libraryofbabel•2d ago
I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.

I think my biggest complaint is that the essay points out flaws in LLMs’ world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.

A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.

I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:

* LLMs have imperfect models of the world, which are conditioned by how they’re trained on next token prediction.

* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they’ve been “flogged” to be good at math.

* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)

* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.

I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.

Anyway thanks for writing this and responding!

yosefk•2d ago
I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."

I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming where I am more qualified to do it. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math)
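To make the associativity point concrete, here is a small numerical check. It is a sketch under assumptions the thread does not state: it uses the Porter-Duff "over" operator on premultiplied-alpha pixels, and the layer values are arbitrary.

```python
from typing import Tuple

# A pixel as (red, green, blue, alpha) with premultiplied color, all in [0, 1].
Pixel = Tuple[float, float, float, float]


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over' for premultiplied alpha: top + bottom * (1 - top_alpha)."""
    keep = 1.0 - top[3]
    return tuple(t + b * keep for t, b in zip(top, bottom))


def close(p: Pixel, q: Pixel, tol: float = 1e-12) -> bool:
    """Compare two pixels channel by channel within a small tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(p, q))


# Three arbitrary semi-transparent layers (illustrative values only).
a = (0.30, 0.00, 0.00, 0.50)
b = (0.00, 0.20, 0.00, 0.25)
c = (0.00, 0.00, 0.60, 0.75)

left = over(over(a, b), c)    # blend a over b first, then the result over c
right = over(a, over(b, c))   # blend b over c first, then a on top

print(close(left, right))                 # grouping does not matter
print(close(over(a, b), over(b, a)))      # order does matter
```

Because the grouping does not change the result, a composite of several adjacent layers can in principle be computed once and reused, which is presumably the link to caching that the question was probing.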

calf•3h ago
But this is parallel to saying LLMs are not "compelled" by the training algorithms to learn symbolic logic.

Which says to me there are two camps on this and the verdict is still out on this and all related questions.

WillPostForFood•5h ago
Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT, and got a reasonable answer, nothing like what you paraphrased in your post.

https://imgur.com/a/O9CjiJY

marcellus23•4h ago
I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.
p1esk•4h ago
Yes, I’d be curious about his experience with the GPT-5 Thinking model. So far I haven’t seen any blunders from it.
typpilol•4h ago
This seems like a common theme with these types of articles
AyyEye•2d ago
With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.
yosefk•2d ago
Actually I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think
astrange•3h ago
Tokenization makes things harder, but it doesn't make them impossible. Just takes a bit more memorization.

Other writing systems come with "tokenization" built in making it still a live issue. Think of answering:

1. How many n's are in 日本?

2. How many ん's are in 日本?

(Answers are 2 and 1.)

andyjohnson0•2d ago
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Is this a real defect, or some historical thing?

I just asked GPT-5:

    How many "B"s in "blueberry"?
and it replied:

    There are 2 — the letter b appears twice in "blueberry".
I also asked it how many Rs in Carrot, and how many Ps in Pineapple, and it answered both questions correctly too.
libraryofbabel•2d ago
It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.
pydry•2d ago
Depends how you define historical. If by historical you mean more than two days ago then, yeah, it's ancient history.
ThrowawayR2•2d ago
It was discussed and reproduced on GPT-5 on HN a couple of days ago: https://news.ycombinator.com/item?id=44832908

Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.

bgwalter•2d ago
It is not historical:

https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...

Perhaps they have a hot fix that special cases HN complaints?

AyyEye•2d ago
They clearly RLHF out the embarrassing cases and make cheating on benchmarks into a sport.
Terr_•2h ago
I wouldn't be surprised if some models get set up to identify that type of question and run the word through a string-processing function.
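Such a function would be trivial. A minimal sketch of what it might look like follows; the name `count_letter` is hypothetical and nothing here reflects how any vendor actually handles these questions.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


# The three questions asked earlier in this thread.
for word, letter in [("blueberry", "B"), ("Carrot", "R"), ("Pineapple", "P")]:
    print(word, letter, count_letter(word, letter))
```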
nosioptar•2d ago
Shouldn't the correct answer be that there is not a "B" in "blueberry"?
BobbyJo•2d ago
The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.

It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.

In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
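A toy illustration of that point, as a sketch only: the vocabulary and IDs below are invented, and real tokenizers learn much larger vocabularies from data. Greedy longest-match is used here purely for simplicity.

```python
# Toy vocabulary, invented for illustration. Multi-letter pieces get their own
# IDs, and single letters are available as a fallback.
TOY_VOCAB = {
    "blue": 101, "berry": 102, "straw": 103,
    "b": 1, "l": 2, "u": 3, "e": 4, "r": 5, "y": 6,
    "s": 7, "t": 8, "a": 9, "w": 10,
}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization over TOY_VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


def spell_out(text: str) -> list[int]:
    """Tokenize one character at a time, as if the word were spelled out."""
    return [TOY_VOCAB[ch] for ch in text]


word_ids = tokenize("blueberry")
letter_ids = spell_out("blueberry")
print(word_ids)      # two opaque IDs; the letters are not directly visible
print(letter_ids)    # one ID per letter; counting the ID for "b" is now easy
print(letter_ids.count(TOY_VOCAB["b"]))
```

In the first form, the number of b's has to be known about the tokens rather than read off them; in the second it can be counted directly.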

I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/

LLMs are able to encode geospatial relationships because they can be represented by token relationships well. Two countries that are close together will be talked about together much more often than two countries far from each other.

vrighter•1d ago
That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.

Your argument is based on a flawed assumption, that they can't see letters. If they couldn't, they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.

xigoi•12h ago
> It's like asking a blind person to count the number of colors on a car.

I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.

libraryofbabel•2d ago
> they clearly don't have any world model whatsoever

Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.

simiones•15h ago
> where it certainly hadn’t seen the questions before?

What are you basing this certainty on?

And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.

It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.

williamcotton•4h ago
I don’t solve math problems with my poetry writing skills:

https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27...

lossolo•2d ago
https://arxiv.org/abs/2508.01191
rishi_devan•2d ago
Haha. I enjoyed that Soviet-era joke at the end.
svantana•2d ago
Yes, I hadn't heard that before. It's similar in spirit to this Norwegian folk tale about a deaf man guessing what someone is saying to him:

https://en.wikipedia.org/wiki/%22Good_day,_fellow!%22_%22Axe...

kgwgk•2d ago
Another similar story:

King Frederick the Great of Prussia had a very fine army, and none of the soldiers in it were finer than the Giant Guards, who were all extremely tall men. It was difficult to find enough soldiers for these Guards, as there were not many men who were tall enough.

Frederick had made it a rule that no soldiers who did not speak German could be admitted to the Giant Guards, and this made the work of the officers who had to find men for them even more difficult. When they had to choose between accepting or refusing a really tall man who knew no German, the officers used to accept him, and then teach him enough German to be able to answer if the King questioned him.

Frederick, sometimes, used to visit the men who were on guard around his castle at night to see that they were doing their job properly, and it was his habit to ask each new one that he saw three questions: “How old are you?” “How long have you been in my army?” and “Are you satisfied with your food and your conditions?”

The officers of the Giant Guards therefore used to teach new soldiers who did not know German the answers to these three questions.

One day, however, the King asked a new soldier the questions in a different order; he began with, “How long have you been in my army?” The young soldier immediately answered, “Twenty-two years, Your Majesty”. Frederick was very surprised. “How old are you then?”, he asked the soldier. “Six months, Your Majesty”, came the answer. At this Frederick became angry. “Am I a fool, or are you one?” he asked. “Both, Your Majesty”, the soldier answered politely.

https://archive.org/details/advancedstoriesf0000hill

deadbabe•2d ago
Don’t: use LLMs to play chess against you

Do: use LLMs to talk shit to you while a real chess AI plays chess against you.

The above applies to a lot of things besides chess, and illustrates a proper application of LLMs.

Seb-C•7m ago
Are you suggesting that we use an LLM as an interface between the AI and the player?

Why would anyone choose to awkwardly play using natural language rather than a reliable, fast and intuitive UI?

imenani•2d ago
As far as I can tell they don’t say which LLM they used which is kind of a shame as there is a huge range of capabilities even in newly released LLMs (e.g. reasoning vs not).
yosefk•2d ago
ChatGPT, Claude, Grok and Google AI Overviews, whatever powers the latter, were all used in one or more of these examples, in various configurations. I think they can perform differently, and I often try more than one when the 1st try doesn't work great. I don't think there's any fundamental difference in the principle of their operation, and I think there never will be - there will be another major breakthrough
red75prime•2d ago
My hypothesis is that a model fails to switch into a deep thinking mode (if it has it) and blurts whatever it got from all the internet data during autoregressive training. I tested it with the alpha-blending example. Gemini 2.5 Flash - fails, Gemini 2.5 Pro - succeeds.

How does the presence/absence of a world model, er, blend into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have it because we have mechanisms to notice errors, correct errors, remember the results, and use the results when similar situations arise, while slowly updating intuitions about the world to incorporate changes.

The current models lack "remember/use/update" parts.

imenani•2d ago
Each of these models has a thinking/reasoning variant and a default non-thinking variant. I would expect the reasoning variants (o3 or “GPT5 Thinking”, Gemini DeepThink, Claude with Extended Thinking, etc) to do better at this. I think there is also some chance that in their reasoning traces they may display something you might see as closer to world modelling. In particular, you might find them explicitly tracking positions of pieces and checking validity.
red75prime•2d ago
> I don't think there's any fundamental difference in the principle of their operation

Yeah, they seem to be subject to the universal approximation theorem (it needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully-connected multilayered network).

That is, at a certain size they can do anything a human can do at a certain point in their life (that is, with no additional training) regardless of whether humans have world models and what those models are on the neuronal level.

But there are additional nuances that are related to their architectures and training regimes. And practical questions of the required size.

lowsong•2d ago
It doesn't matter. These limitations are fundamental to LLMs, so all of them that will ever be made suffer from these problems.
og_kalu•2d ago
Yes LLMs can play chess and yes they can model it fine

https://arxiv.org/pdf/2403.15498v2

GaggiX•2d ago
https://www.youtube.com/watch?v=LtG0ACIbmHw

SOTA LLMs do play legal moves in chess, I don't know why the article seems to say otherwise.

tickettotranai•2d ago
Technically yes, but... it's moderately tricky to get an LLM to play good chess even though it can.

https://dynomight.net/more-chess/

This is significant in general because I personally would love to get these things to code-switch into "hackernews poster" or "writer for the Economist" or "academic philosopher", but I think the "chat" format makes it impossible. The inaccessibility of this makes me want to host my own LLM...

lordnacho•2d ago
Here's what LLMs remind me of.

When I went to uni, we had tutorials several times a week. Two students, one professor, going over whatever was being studied that week. The professor would ask insightful questions, and the students would try to answer.

Sometimes, I would answer a question correctly without actually understanding what I was saying. I would be spewing out something that I had read somewhere in the huge pile of books, and it would be a sentence, with certain special words in it, that the professor would accept as an answer.

But I would sometimes have this weird feeling of "hmm I actually don't get it" regardless. This is kinda what the tutorial is for, though. With a bit more prodding, the prof will ask something that you genuinely cannot produce a suitable word salad for, and you would be found out.

In math-type tutorials it would be things like realizing some equation was useful for finding an answer without having a clue about what the equation actually represented.

In economics tutorials it would be spewing out words about inflation or growth or some particular author but then having nothing to back up the intuition.

This is what I suspect LLMs do. They can often be very useful to someone who actually has the models in their minds, but not the data to hand. You may have forgotten the supporting evidence for some position, or you might have missed some piece of the argument due to imperfect memory. In these cases, LLM is fantastic as it just glues together plausible related words for you to examine.

The wheels come off when you're not an expert. Everything it says will sound plausible. When you challenge it, it just apologizes and pretends to correct itself.

roywiggins•2h ago
> When you challenge it, it just apologizes and pretends to correct itself.

Even when it was right the first time!

ej88•2d ago
This article is interesting but pretty shallow.

0(?): There’s no provided definition of what a ‘world model’ is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model, right?

1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to give a SOTA model an image of a chess board or notation and ask it about the position. It might not give you GM-level analysis but it definitely has a model of what’s going on.

2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.

You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!

https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358

yosefk•2d ago
A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.

The examples are from all the major commercial American LLMs as listed in a sister comment.

You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.

jonplackett•2d ago
I just tried a few things that are simple and a world model would probably get right. Eg

Question to GPT5: I am looking straight on to some objects. Looking parallel to the ground.

In front of me I have a milk bottle, to the right of that is a Coca-Cola bottle. To the right of that is a glass of water. And to the right of that there’s a cherry. Behind the cherry there’s a cactus and to the left of that there’s a peanut. Everything is spaced evenly. Can I see the peanut?

Answer (after choosing thinking mode)

No. The cactus is directly behind the cherry (front row order: milk, Coke, water, cherry). “To the left of that” puts the peanut behind the glass of water. Since you’re looking straight on, the glass sits in front and occludes the peanut.

It doesn’t consider transparency until you mention it, then apologises and says it didn’t think of transparency
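For reference, the riddle can be written down as a tiny scene model. This is a sketch under strong assumptions that are mine rather than the riddle's: a straight-on view with no perspective, object sizes ignored, and only the glass of water treated as transparent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    column: int                # 0 = leftmost; items are evenly spaced
    row: int                   # 0 = front row, 1 = the row behind it
    transparent: bool = False


# Layout as described in the question above.
SCENE = [
    Item("milk bottle", 0, 0),
    Item("Coca-Cola bottle", 1, 0),
    Item("glass of water", 2, 0, transparent=True),
    Item("cherry", 3, 0),
    Item("cactus", 3, 1),      # behind the cherry
    Item("peanut", 2, 1),      # to the left of the cactus
]


def visible(name: str, scene: list[Item]) -> bool:
    """True if nothing opaque sits in front of the item in the same column."""
    target = next(item for item in scene if item.name == name)
    blockers = [
        item for item in scene
        if item.column == target.column
        and item.row < target.row
        and not item.transparent
    ]
    return not blockers


print(visible("peanut", SCENE))
```

Under these assumptions the only thing in front of the peanut is the glass of water, so the answer hinges entirely on whether the glass is treated as see-through.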

RugnirViking•2d ago
this seems like a strange riddle. In my mind I was thinking that regardless of the glass, all of the objects can be seen (due to perspective, and also the fact you mentioned the locations, meaning you're aware of them).

It seems to me it would only actually work in an orthographic perspective, which is not how our reality works

jonplackett•1d ago
You can tell from the response it does understand the riddle just fine, it just gets it wrong.
rpdillon•5h ago
Have you asked five adults this riddle? I suspect at least two of them would get it wrong or have some uncertainty about whether or not the peanut was visible.
xg15•4h ago
This. Was also thinking "yes" first because of the glass of water, transparency, etc, but then got unsure: The objects might be spaced so widely that the milk or coke bottle would obscure the view due to perspective - or the peanut would simply end up outside the viewer's field of vision.

Shows that even if you have a world model, it might not be the right one.

optimalsolver•11h ago
Gemini 2.5 Pro gets this correct on the first attempt, and specifically points out the transparency of the glass of water.

https://g.co/gemini/share/362506056ddb

Time to get the ol' goalpost-moving gloves out.

wilg•2h ago
Worked for me: https://chatgpt.com/share/689bc3ef-fa1c-800f-9275-93c2dbc11b...
Razengan•2d ago
A slight tangent: I think/wonder if the one place where AIs could be really useful, might be in translating alien languages :)

As in, an alien could teach one of our AIs their language faster than an alien could teach a human, and vice versa..

..though the potential for catastrophic disasters is also great there lol

keeda•2d ago
That whole bit about color blending and transparency and LLMs "not knowing colors" is hard to believe. I am literally using LLMs every day to write image-processing and computer vision code using OpenCV. It seamlessly reasons across a range of concepts like color spaces, resolution, compression artifacts, filtering, segmentation and human perception. I mean, removing the alpha from a PNG image was a preprocessing step it wrote by itself as part of a larger task I had given it, so it certainly understands transparency.

I even often describe the results e.g. "this fails in X manner when the image has grainy regions" and it figures out what is going on, and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)

And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.
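
For readers unfamiliar with the preprocessing step mentioned above, here is a minimal sketch of what "removing the alpha" from an image can look like, assuming the goal is to composite each RGBA pixel over a solid background. This is not the code the LLM produced; `flatten_alpha` and the nested-list pixel layout are hypothetical, chosen so the example runs without any libraries (real OpenCV code would operate on arrays instead).

```python
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]


def flatten_alpha(
    pixels: List[List[RGBA]],
    background: RGB = (255, 255, 255),
) -> List[List[RGB]]:
    """Composite 8-bit RGBA pixels over a solid background, dropping alpha.

    Hypothetical helper for illustration only. Uses integer arithmetic
    with rounding: out = (a * c + (255 - a) * bg) / 255.
    """
    flattened: List[List[RGB]] = []
    for row in pixels:
        out_row: List[RGB] = []
        for r, g, b, a in row:
            blended = tuple(
                (a * channel + (255 - a) * bg + 127) // 255
                for channel, bg in zip((r, g, b), background)
            )
            out_row.append(blended)
        flattened.append(out_row)
    return flattened


# One opaque red pixel and one fully transparent pixel on a white background.
image = [[(255, 0, 0, 255), (0, 0, 0, 0)]]
print(flatten_alpha(image))
```

The opaque pixel keeps its color, while the fully transparent one takes the background color; partially transparent pixels land in between.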

geraneum•4h ago
> it wrote by itself as part of a larger task I had given it, so it certainly understands transparency

Or it’s a common step or a known pattern or combination of steps that is prevalent in its training data for certain input. I’m guessing you don’t know what’s exactly in the training sets. I don’t know either. They don’t tell ;)

> but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers. It doesn’t mean that exactly what you are doing has been done in this exact way before; rather, the patterns adapted and the approach may not be one of a kind.

> is a purely philosophical question

It is indeed. A question we need to ask ourselves.

Uehreka•47m ago
> We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers.

If LLMs are stochastic parrots, but also we’re just stochastic parrots, then what does it matter? That would mean that LLMs are in fact useful for many things (which is what I care about far more than any abstract discussion of free will).

skeledrew•2d ago
Agree in general with most of the points, except

> but because I know you and I get by with less.

Actually we got far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and are only really considered fully intelligent in our late teens to mid-20s.

helloplanets•3h ago
Don't forget the millions of years of pre-training! ;)
o_nate•1d ago
What with this and your previous post about why sometimes incompetent management leads to better outcomes, you are quickly becoming one of my favorite tech bloggers. Perhaps I enjoyed the piece so much because your conclusions basically track mine. (I'm a software developer who has dabbled with LLMs, and has some hand-wavey background on how they work, but otherwise can claim no special knowledge.) Also your writing style really pops. No one would accuse your post of having been generated by an LLM.
yosefk•1d ago
thank you for your kind words!
neuroelectron•4h ago
Not yet
ameliaquining•4h ago
One thing I appreciated about this post, unlike a lot of AI-skeptic posts, is that it actually makes a concrete falsifiable prediction; specifically, "LLMs will never manage to deal with large code bases 'autonomously'". So in the future we can look back and see whether it was right.

For my part, I'd give 80% confidence that LLMs will be able to do this within two years, without fundamental architectural changes.

moduspol•4h ago
"Deal with" and "autonomously" are doing a lot of heavy lifting there. Cursor already does a pretty good job indexing all the files in a code base in a way that lets it ask questions and get answers pretty quickly. It's just a matter of where you set the goalposts.
ameliaquining•4h ago
True, there'd be a need to operationalize these things a bit more than is done in the post to have a good advance prediction.
jononor•2h ago
"LLM" as well, because coding agents are already more than just an LLM. There is very useful context management around it, and tool calling, and ability to run tests/programs, etc. Though they are LLM-based systems, they are not LLMs.
smnrchrds•2h ago
Indeed. If the LLM calls a chess engine tool behind the scenes, it would be able to play excellent chess as well.
exe34•4h ago
How large? What does "deal" mean here? Autonomously - is that on its own whim, or at the behest of a user?
shinycode•3h ago
« autonomously »: what happens with subtle updates that are not bugs but change the meaning of some features, and that might break the workflow in some other external parts of a client’s system? It happens all the time and, because it’s really hard to keep the whole meaning and business rules written down and up to date, an LLM might never be able to grasp some of that meaning. Maybe if, instead of developing code and infrastructure, the whole industry shifts toward only writing impossibly precise spec sheets that make meaning and intent crystal clear, then maybe « autonomously » might be possible to pull off.
wizzwizz4•3h ago
Those spec sheets exist: they're called software.
shinycode•2h ago
Not exactly. It depends on how the software is written and whether there are ADRs in the project. I had to work on projects where there were bugs because someone coded business rules in a very bad and unclear way. You move an if somewhere and something breaks somewhere else. You ask « is this condition the way it’s supposed to work or is it a bug? » When software is not clear enough (and often it isn’t, because we have to go fast), we ask people to confirm the rule. My point is this: amazingly written software surely works best with LLMs. That’s not most of the software written so far, because businesses sometimes value speed over engineering (or it’s a lack of skills).
slt2021•3h ago
>LLMs will never manage to deal

time to prove hypothesis: infinity years

bithive123•4h ago
Language models aren't world models for the same reason languages aren't world models.

Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".

They only have meaning to sentient beings, and that meaning is heavily subjective and contextual.

But there appear to be some who think that we can grasp truth through mechanical symbol manipulation. Perhaps we just need to add a few million more symbols, they think.

If we accept the incompleteness theorem, then there are true propositions that even a super-intelligent AGI would not be able to express, because all it can do is output a series of placeholders. Not to mention the obvious fallacy of knowing super-intelligence when we see it. Can you write a test suite for it?

habitue•4h ago
> Symbols, by definition, only represent a thing.

This is missing the lesson of the Yoneda Lemma: symbols are uniquely identified by their relationships with other symbols. If those relationships are represented in text, then in principle they can be inferred and navigated by an LLM.

Some relationships are not represented well in text: tacit knowledge like how hard to twist a bottle cap to get it to come off, etc. We aren't capturing those relationships between all your individual muscles and your brain well in language, so an LLM will miss them or have very approximate versions of them, but... that's always been the problem with tacit knowledge: it's the exact kind of knowledge that's hard to communicate!

nomel•2h ago
I don’t think it’s a communication problem as much as there is no possible relation between a word and a (literal) physical experience. They’re, quite literally, on different planes of existence.
drdeca•2h ago
When I have a physical experience, sometimes it results in me saying a word.

Now, maybe there are other possible experiences that would result in me behaving identically, such that from my behavior (including what words I say) it is impossible to distinguish between different potential experiences I could have had.

But, “caused me to say” is a relation, is it not?

Unless you want to say that it wasn’t the experience that caused me to do something, but some physical thing that went along with the experience, either causing or co-occurring with the experience, and also causing me to say the word I said. But, that would still be a relation, I think.

nomel•1h ago
Yes, but it's a unidirectional relation: it was the result of the experience. The word cannot represent the context (the experience), in a meaningful way.

It's like trying to describe a color to a blind person: poetic subjective nonsense.

semiquaver•53m ago
Well shit, I better stop reading books then.
exe34•4h ago
> Language models aren't world models for the same reason languages aren't world models.

> Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".

There are a lot of negatives in there, but I feel like it boils down to: a model of a thing is not the thing. Well duh. It's a model. A map is a model.

bithive123•3h ago
Right. It's a dead thing that has no independent meaning. It doesn't even exist as a thing except conceptually. The referent is not even another dead thing, but a reality that appears nowhere in the map itself. It may have certain limited usefulness in the practical realm, but expecting it to lead to new insights ignores the fact that it's fundamentally an abstraction of the real, not in relationship to it.
auggierose•3h ago
First: true propositions (that are not provable) can definitely be expressed, if they couldn't, the incompleteness theorem would not be true ;-)

It would be interesting to know what percentage of the people who invoke the incompleteness theorem have no clue what it actually says.

Most people don't even know what a proof is, so that cannot be a hindrance on the path to AGI ...

Second: ANY world model that can be digitally represented would be subject to the same argument (if stated correctly), not only LLMs.

bithive123•3h ago
I knew someone would call me out on that. I used the wrong word; what I meant was "expressed in a way that would satisfy" which implies proof within the symbolic order being used. I don't claim to be a mathematician or philosopher.
auggierose•3h ago
Well, you don't get it. The LLM definitely can state propositions "that satisfy", let's just call them true propositions, and that this is not the same as having a proof for it is what the incompleteness theorem says.

Why would you require an LLM to have proof for the things it says? I mean, that would be nice, and I am actually working on that, but it is not anything we would require of humans and/or HN commenters, would we?

bithive123•3h ago
I clearly do not meet the requirements to use the analogy.

I am hearing the term super intelligence a lot and it seems to me the only form that would take is the machine spitting out a bunch of symbols which either delight or dismay the humans. Which implies they already know what it looks like.

If this technology will advance science or even be useful for everyday life, then surely the propositions it generates will need to hold up to reality, either via axiomatic rigor or empirically. I look forward to finding out if that will happen.

But it's still just a movement from the known to the known, a very limited affair no matter how many new symbols you add in whatever permutation.

chamomeal•3h ago
I’m not a math guy but the incompleteness theorem applies to formal systems, right? I’ve never thought about LLMs as formal systems, but I guess they are?
bithive123•3h ago
Nor am I. I'm not claiming an LLM is a formal system, but it is mechanical and operates on symbols. It can't deal in anything else. That should temper some of the enthusiasm going around.
pron•3h ago
Anything that runs on a computer is a formal system. "Formal" (the manipulation of forms) is an old term for what, after Turing, we call "mechanical".
scarmig•3h ago
> If we accept the incompleteness theorem

And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision. So if the incompleteness theorem means that neural nets can never find truth, it also means that the human brain can never find truth.

Human neuron firing patterns, after all, only represent a thing; they are not the same as the thing. Your experience of seeing something isn't recreating the physical universe in your head.

bevr1337•3h ago
> And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision.

Wouldn't it become harder to simulate a human brain the larger a machine is? I don't know nothing, but I think that pesky speed of light thing might pose a challenge.

drdeca•3h ago
simulate ≠ simulate-in-real-time
zeroonetwothree•46m ago
All simulation is realtime to the brain being simulated.
overgard•3h ago
I don't think you can apply the incompleteness theorem like that, LLMs aren't constrained to formal systems
pron•3h ago
> Symbols, by definition, only represent a thing. They are not the same as the thing

First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.

But to your philosophical point, assuming there are only a finite number of things and places in the universe - or at least in the part of it that we care about - why wouldn't they be representable with a finite set of symbols?

What you're rejecting is the Church-Turing thesis [1] (essentially, that all mechanical processes, including those of nature, can be simulated with symbolic computation, although there are weaker and stronger variants). It's okay to reject it, but you should know that not many people do (even some non-orthodox thoughts by Penrose about the brain not being simulatable by an ordinary digital computer still accept that some physical machine - the brain - is able to represent what we're interested in).

> If we accept the incompleteness theorem

There is no if there. It's a theorem. But it's completely irrelevant. It means that there are mathematical propositions that can't be proven or disproven by some system of logic, i.e. by some mechanical means. But if something is in the universe, then it's already been proven by some mechanical process: the mechanics of nature. That means that if some finite set of symbols could represent the laws of nature, then anything in nature can be proven in that logical system. Which brings us back to the first point: the only way the mechanics of nature cannot be represented by symbols is if they are somehow infinite, i.e. they don't follow some finite set of laws. In other words - there is no physics. Now, that may be true, but if that's the case, then AI is the least of our worries.

Of course, if physics does exist - i.e. the universe is governed by a finite set of laws - that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.

[1]: https://plato.stanford.edu/entries/church-turing/

astrange•3h ago
> First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.

It should be capable of something similar (fsvo similar), but the largest difference is that humans have to be power-efficient and LLMs do not.

That is, people don't actually have world models, because modeling something is a waste of time and energy insofar as it's not needed for anything. People are capable of taking out the trash without knowing what's in the garbage bag.

Terr_•2h ago
> Of course, if physics does exist - i.e. the universe is governed by a finite set of laws

Wouldn't physics still "exist" even if there were an infinite set of laws?

pron•56m ago
Well, the physical universe will still exist, but physics - the scientific study of said universe - will become sort of meaningless, I would think?
Terr_•45m ago
Why meaningless? Imperfect knowledge can still be useful, and ultimately that's the only kind we can ever have about anything.

"We could learn to sail the oceans and discover new lands and transport cargo cheaply... But in a few centuries we'll discover we were wrong and the Earth isn't really a sphere and tides are extra-complex so I guess there's no point."

pron•35m ago
Because if there's an infinite number of laws, are they laws at all? You can't predict anything because you don't even know if some of the laws you don't know yet (which is pretty much all of them) make an exception to the 0% of laws you do know. I'm not saying it's not interesting, but it's more history - today the apple fell down rather than up or sideways - than physics.
goatlover•1h ago
> Of course, if physics does exist - i.e. the universe is governed by a finite set of laws

That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.

The Humean way of looking at physics is that we notice relationships and model those with various symbols. The symbols form incomplete models because we can't get to the bottom of why the relationships exist.

> that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.

The indeterminism of Quantum Mechanics limits how precise measurement can be and how predictable the future is.

pron•48m ago
> That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.

What I meant was that since physics is the scientific search for the laws of nature, then if there's an infinite number of them, then the pursuit becomes somewhat meaningless, as an infinite number of laws aren't really laws at all.

> The symbols form incomplete models because we can't get to the bottom of why the relationships exist.

Why would a model be incomplete if we don't know why the laws are what they are? A model pretty much is a set of laws; it doesn't require an explanation (we may want such an explanation, but it doesn't improve the model).

drdeca•2h ago
Gödel’s incompleteness theorems aren’t particularly relevant here. Given how often people attempt to apply them to situations where they don’t say anything of note, I think the default should generally be to not publicly appeal to them unless one either has worked out semi-carefully how to derive the thing one wants to show from them, or at least has a sketch that one is confident, from prior experience working with it, that one could make into a rigorous argument. Absent these, the most one should say, I think, is “Perhaps one can use Gödel’s incompleteness theorems to show [thing one wants to show].”

Now, given a program that is supposed to output text that encodes true statements (in some language), one can probably define some sort of inference system that corresponds to the program such that the inference system is considered to “prove” any sentence that the program outputs (and maybe also some others based on some logical principles, to ensure that the inference system satisfies some good properties), and upon defining this, one could (assuming the language allows making the right kinds of statements about arithmetic) show that this inference system is, by Gödel’s theorems, either inconsistent or incomplete.

This wouldn’t mean that the language was unable to express those statements. It would mean that the program either wouldn’t output those statements, or that the system constructed from the program was inconsistent (and, depending on how the inference system is obtained from the program, the inference system being inconsistent would likely imply that the program sometimes outputs false or contradictory statements).

But, this has basically nothing to do with the “placeholders” thing you said. Gödel’s theorem doesn’t say that some propositions are inexpressible in a given language, but that some propositions can’t be proven in certain axiom+inference systems.

Rather than the incompleteness theorems, the “undefinability of truth” result seems more relevant to the kind of point I think you are trying to make.

Still, I don’t think it will show what you want it to, even if the thing you are trying to show is true. Like, perhaps it is impossible to capture qualia with language, sure, makes sense. But logic cannot show that there are things which language cannot in any way (even collectively) refer to, because to show that there is a thing it has to refer to it.

————

“Can you write a test suite for it?”

Hm, might depend on what you count as a “suite”, but a test protocol, sure. The one I have in mind would probably be a bit expensive to run if it fails the test though (because it involves offering prize money).

energy123•42m ago
Everything is just a low resolution representation of a thing. The so-called reality we supposedly have access to is at best a small number of sound waves and photons hitting our face. So I don't buy this argument that symbols are categorically different. It's a gradient and symbols are more sparse and less rich of a data source, yes. But who are we to say where that hypothetical line exists, beyond which further compression of concepts into smaller numbers of buckets becomes a non-starter for intelligence and world modelling? And then there are multimodal LLMs, which have access to data of a similar richness to what humans have access to.
cognitif•20m ago
> Language models aren't world models for the same reason languages aren't world models. Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".

Symbols, maps, descriptions, and words are useful precisely because they are NOT what they represent. Representation is not identity. What else could a “world model” be other than a representation? Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?

mrbungie•14m ago
> Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?

I was following the string of questions, but I think there is a logical leap between those two questions.

Another question: is language the only way to define models? An imagined sound or an imagined picture of an apple in my mind's eye are models to me, but they don't use language.

frankfrank13•4h ago
Great quote at the end that I think I resonate a lot with:

> Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.

1970-01-01•4h ago
I'm surprised the models haven't been enshittified by capitalism. I think in a few years we're going to see lightning-fast LLMs generating better output compared to what we're seeing today. But it won't be 1000x better, it will be 10x better, 10x faster, and completely enshittified with ads and clickbait links. Enjoy ChatGPT while it lasts.
UltraSane•1h ago
I wonder how the nature of the language used to train an LLM affects its model of the world. Would a language designed for the maximum possible information content and clarity, like Ithkuil, make an LLM's world model more accurate?
DennisP•1h ago
Maybe pure language models aren't world models, but Genie 3 for example seems to be a pretty good world model:

https://deepmind.google/discover/blog/genie-3-a-new-frontier...

We also have multimodal AIs that can do both language and video. Genie 3 made multimodal with language might be pretty impressive.

Focusing only on what pure language models can do is a bit of a straw man at this point.