
Generative AI's crippling failure to induce robust models of the world

https://garymarcus.substack.com/p/generative-ais-crippling-and-widespread
40•pmcjones•4h ago

Comments

energy123•2h ago
Why was Anthropic's interpretability work not discussed? Inconvenient for the conclusion?

https://www.anthropic.com/news/tracing-thoughts-language-mod...

sdenton4•2h ago
"A wandering ant, for example, tracks where it is through the process of dead reckoning. An ant uses variables (in the algebraic/computer science sense) to maintain a readout of its location, even as as it wanders, constantly updated, so that it can directly return to its home."

Hm.

Dead reckoning is a terrible way to navigate, and famously led to lots of ships wrecking on the shores of France before good clocks made it possible to track longitude accurately.

Ants lay down pheromone trails and use smell to find their way home... There's likely some additional tracking going on, but I would be surprised if it looked anything like symbolic GOFAI.

deadbabe•2h ago
Even if you find a pheromone trail, it doesn't tell you which direction home is, or which branch to take where the trail forks. You need dead reckoning. The trail just helps you reduce the complexity of what you have to remember.
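To make that concrete, here's a minimal sketch of path integration in Python (a toy model, not a claim about actual ant neurology): keep a running sum of displacement vectors, and the negated sum is the vector pointing home.

    # Toy path integration ("dead reckoning"): each step is a (dx, dy)
    # displacement; the home vector is just the negated running sum.
    import math

    def home_vector(steps):
        x = sum(dx for dx, _ in steps)
        y = sum(dy for _, dy in steps)
        return (-x, -y)

    steps = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]  # wander east, north, east
    hx, hy = home_vector(steps)
    print((hx, hy))             # (-4.0, -2.0): direction pointing home
    print(math.hypot(hx, hy))   # ~4.47: straight-line distance home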
cma•2h ago
The trail also leads the other ants to the food; it's hard for them to use your own dead reckoning.
viraptor•15m ago
The lack of information in ant trails (beyond "it exists here") leads to death spirals: https://en.m.wikipedia.org/wiki/Ant_mill
vunderba•2h ago
Speaking of chess, a fun experiment is setting up a few positions on a site like Lichess, taking a screenshot, and asking a state-of-the-art VLM to count the number of pieces on the board. In my experience, it had a much higher error rate in unlikely or impossible board situations (three kings on the board, etc.).
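If you want ground truth for that test, here's a rough sketch using the python-chess package (assuming it's installed); the FEN below is a hypothetical illegal position with two white kings:

    # Count pieces in a deliberately illegal position to get a
    # ground-truth answer for the VLM screenshot experiment.
    import chess

    fen = "k7/8/8/8/8/8/3K4/4K3 w - - 0 1"  # hypothetical: two white kings
    board = chess.Board(fen)

    print(len(board.piece_map()))  # ground-truth piece count: 3
    print(board.is_valid())        # False: python-chess flags the position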
extr•1h ago
I find Gary's arguments increasingly semantic and unconvincing. He lists several examples of how LLMs "fail to build a world model", but his definition of "world model" is an informal hand-wave ("a computational framework that a system (a machine, or a person or other animal) uses to track what is happening in the world"). His examples are lifted from a variety of unclear or obsolete models - what is his opinion of o3? Why doesn't he create or propose a benchmark that researchers could use to measure progress on "world model creation"?

What's more, his actual point is unclear. Even if you simply grant, "okay, even SOTA LLMs don't have world models", why do I as a user of these models care? Because the models could be wrong? Yes, I'm aware. Nevertheless, I'm still deriving substantial personal and professional value from the models as they stand today.

voidhorse•1h ago
I think the point is that category errors (misinterpreting what a tool does) can be dangerous.

Both statistical data generators and actual reasoning are useful in many circumstances, but there are also circumstances in which thinking that you are doing the latter when you are only doing the former can have severe consequences (example: building a bridge).

If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.

As for benchmarks, if you fundamentally don't believe that stochastic data generation leads to reason as an emergent property, developing a benchmark is pointless. Also, not everyone has to be on the same side. It's clear that Marcus is not a fan of the current wave. Asking him to produce a substantive contribution that would help them continue to achieve their goals is preposterous. This game is highly political too. If you think the people pushing this stuff are less than estimable or morally sound, you wouldn't really want to empower them or give them more ideas.

NitpickLawyer•15m ago
> If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.

In other words: overhyped in the short term, underhyped in the long term, where "short" and "long" are themselves extremely volatile.

Take programming as an example. 2.5 years ago, GPT-3.5 was seen as "cute" in the programming world: oh look, it does poems and emails, and the code looks like Python but it's wrong 9 times out of 10. But now a 24B model can handle end-to-end SWE tasks 0-shot much of the time.

squirrel•44m ago
He cites o3 and o4-mini as examples of LLMs that play illegal chess moves.
Lerc•18m ago
I don't understand the reasoning behind concluding that, because something fails a task that requires reasoning, it cannot reason.

To use chess as an example: humans sometimes play illegal moves. That does not mean humans cannot reason. It is an instance of failing to demonstrate reasoning, not proof of an inability to reason.

SubiculumCode•1h ago
I definitely would be okay if we hit an AI winter; our culture and world cannot adapt fast enough for the change we are experiencing. In the meantime, the current level of AI is just good enough to make us more productive, but not so good as to make us irrelevant.
voidhorse•1h ago
The whole thing is silly. Look, we know that LLMs are just really good word predictors. Any argument that they are thinking is essentially predicated on marketing materials that embrace anthropomorphic metaphors to an extreme degree.

Is it possible that reason could emerge as a byproduct of being really good at predicting words? Maybe, but this depends on the antecedent claim that much if not all of reason is strictly representational and strictly linguistic. It's not obvious to me that this is the case. Many people think in images as direct sense data, and it's not clear that a digital representation of this is equivalent to the thing in itself.

To use an example another HN'er suggested: we don't claim that submarines are swimming. Why are we so quick to claim that LLMs are "reasoning"?

Velorivox•50m ago
> Is it possible that reason could emerge as the byproduct of being really good at predicting words?

Imagine we had such marketing behind wheels: they move, so they must be like legs on the inside. Then we run around imagining what the blood vessels and bones must look like inside the wheel. Never mind that neither the structure nor the procedure has anything to do with legs whatsoever.

Sadly, whoever named it artificial intelligence and neural networks likely knew exactly what they were doing.

SubiculumCode•42m ago
I was having a discussion with Gemini. It claimed that because Gemini, as a large language model, cannot experience emotion, the output of Gemini is less likely to be emotionally motivated. I countered that the experience of emotion is irrelevant: Gemini was trained on data written by humans who do experience emotion, who often wrote to express that emotion, and thus Gemini's output can be emotionally motivated by proxy.
etaioinshrdlu•23m ago
I don't think it's accurate anymore to say LLMs are just really good word predictors. Especially in the last year, they are trained with reinforcement learning to solve specific problems. They are functions that predict next tokens, but the function they are trained to approximate doesn't have to be just plain internet text.
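As a toy illustration of the difference between those two training signals (a sketch with made-up numbers, not anyone's actual training code): pretraining pushes up the likelihood of observed text, while RL-style fine-tuning reweights the model's own samples by a task reward.

    # Toy contrast between the two objectives on a two-token "language".
    import math

    p_next = {"ok": 0.9, "bug": 0.1}  # toy next-token distribution

    # Pretraining signal: imitate fixed data (the data happened to say "bug").
    pretrain_loss = -math.log(p_next["bug"])

    # RL signal (REINFORCE-style): reweight the model's own sample by a
    # task reward, e.g. 1.0 if generated code passed its tests.
    reward = 1.0
    rl_loss = -reward * math.log(p_next["ok"])

    print(pretrain_loss, rl_loss)  # ~2.30, ~0.105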
voidhorse•13m ago
Yeah, that's fair. It's probably more accurate to call them sequence predictors or general data predictors than to limit it to words (unless we mean words in the broad, mathematical sense); they are free monoid emulators.
rented_mule•20m ago
> this depends on the antecedent claim that much if not all of reason is strictly representational and strictly linguistic. It's not obvious to me that this is the case

I'm with you on this. Software engineers talk about being in the flow when they are at their most productive. For me, the telltale sign of being in the flow is that I'm no longer thinking in English, but somehow navigating the problem / solution space more intuitively. The same thing happens in many other domains. We learn to walk long before we have the language for all the cognitive processes required. I don't think we deeply understand what's going on in these situations, so how are we going to build something to emulate it? I certainly don't consciously predict the next token, especially when I'm in the flow.

And why would we try to emulate how we do it? I'd much rather have technology that complements. I want different failure modes and different abilities so that we can achieve more with these tools than we could by just adding subservient humans. The good news is that everything we've built so far is succeeding at this!

We'll know that society is finally starting to understand these technologies and how to apply them when we are able to get away from using science fiction tropes to talk about them. The people I know who develop LLMs for a living, and the others I know that are creating the most interesting applications of them, already talk about them as tools without any need to anthropomorphize. It's sad to watch their frustration as they are slowed down every time a person in power shows up with a vision based on assumptions of human-like qualities rather than a vision informed by the actual qualities of the technology.

Maybe I'm being too harsh or impatient? I suppose we had to slowly come to understand the unique qualities of a "car" before we could stop limiting our thinking by referring to it as a "horseless carriage".

voidhorse•7m ago
Couldn't agree more. I look forward to the other side of this current craze where we actually have reasonable language around what these machines are best for.

On a more general level, I also never understood this urge to build machines that are "just like us". Like you, I want machines that, arguably, are best characterized by the ways in which they are not like us: more reliable, more precise, serving a specific function. It's telling that critiques of the failures of LLMs are often met with "humans have the same problems" - why are humans the bar? We have plenty of humans. We don't need more humans. If we're investing so much time and energy, shouldn't the bar be better than humans? And if it isn't, why isn't it? Oh, right, it's because human error is actually good enough, and the actual benefit of these tools is that they are humans that can work without breaks, don't have autonomy, and that you don't need to listen to or pay. The main beneficiaries of this path are capital owners who just want free labor. That's literally all this is. People who actually want to build stuff want precision machines tailored for the task at hand, not some grab bag of sometimes-working stochastic doohickeys.

UltraSane•14m ago
This paper argues the opposite

https://arxiv.org/abs/2506.01622

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent's policy, and that increasing the agent's performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.

Claude Code logs partial keystrokes/plaintext email address in ~/.claude.json

https://github.com/anthropics/claude-code/issues/2713
1•phrinj•1m ago•1 comments

Sinaloa cartel hacked security cameras to track and kill FBI informants, US says

https://www.theguardian.com/world/2025/jun/27/sinaloa-cartel-fbi-hackers
1•sans_souse•2m ago•0 comments

Show HN: Ape – Minimalistic modal text editor written in F#

https://github.com/gabrosh/ape
1•gabrosh•3m ago•0 comments

The Modified Purdue Subcritical Pile for Nuclear Research Applications

https://www.mdpi.com/2410-390X/9/2/13
2•PaulHoule•9m ago•0 comments

From Zero to Monetized iOS App in 10 Hours with Bolt.new, Expo, and RevenueCat

https://www.aiengineering.report/p/from-zero-to-monetized-ios-app-in
1•waprin•11m ago•0 comments

LLM-Assisted Risk-of-Bias Assessment in RCTs Using the Revised Risk-of-Bias Tool

https://www.jmir.org/2025/1/e70450
1•XzetaU8•15m ago•0 comments

Europe Got Tough on Migration

https://www.nytimes.com/2025/06/29/world/europe/europe-migration-crackdown-trump.html
1•RestlessMind•15m ago•0 comments

Email Privacy Tester

https://www.emailprivacytester.com/
1•DavideNL•18m ago•0 comments

Off with Their Heads: Illustrations of Blemmyes (ca. 1175–1724)

https://publicdomainreview.org/collection/blemmyes/
1•Thevet•19m ago•0 comments

Pwntool – Discontinued Hacker Toolkit Looking for Devs

1•hejhdiss•24m ago•0 comments

Fruit Flies in Space

https://en.wikipedia.org/wiki/Fruit_flies_in_space
1•nadermx•28m ago•0 comments

GRCon 2023 CTF Challenge (NRSC5)

https://fomitchev.net/2023/09/14/grcon-2023-ctf-challenge-nrsc5/
1•geerlingguy•38m ago•0 comments

How to Make a Planet, by Jim Blinn

https://archive.org/details/how-to-make-a-planet-jim-blinn
1•gdubs•40m ago•1 comments

DuckDB's AsOf Joins: Fuzzy Temporal Lookups

https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-lookups.html
1•robertclaus•42m ago•0 comments

Show HN: See the economic cost of social harm in real-time

https://www.suffering.social/
1•avi21218•49m ago•0 comments

Show HN: Domain-check – Rust tool for checking domain name availability

https://github.com/saidutt46/domain-check
1•gvs46•52m ago•0 comments

Next Generation Small-Body Sample Return [pdf]

https://www.hou.usra.edu/meetings/lpsc2025/pdf/2280.pdf
1•andsoitis•1h ago•0 comments

OpenAI's o4-mini Makes Geolocation Feel Like Magic

https://medium.com/@jdmsec/how-openais-o4-mini-makes-geolocation-feel-like-magic-f39dc2eb9ea2
1•walterbell•1h ago•0 comments

San Francisco employers are hiring etiquette coaches for Gen Z

https://sfstandard.com/2025/06/28/san-francisco-employers-are-hiring-etiquette-coaches-for-gen-z/
3•gpi•1h ago•1 comments

Show HN: Readeck – Mobile client for organizing bookmarks (Android, open source)

1•potetotown•1h ago•0 comments

Against AI: An Open Letter from Writers to Publishers

https://lithub.com/against-ai-an-open-letter-from-writers-to-publishers/
4•neom•1h ago•1 comments

Delphi Raises $16M Series A from Sequoia Capital to Pioneer "Digital Minds"

https://delphi.framer.website/blog/delphi-raises-16m-series-a-from-sequoia
2•wslh•1h ago•0 comments

'Quantum AI' algorithms outpace the fastest supercomputers, study says

https://www.livescience.com/technology/computing/quantum-ai-algorithms-already-outpace-the-fastest-supercomputers-study-says
2•Bluestein•1h ago•0 comments

Techie went home rather than fix mistake that caused a meltdown

https://www.theregister.com/2025/06/23/who_me/
15•docmechanic•1h ago•7 comments

GPTuner: A manual-reading database tuning system leveraging domain knowledge

https://github.com/SolidLao/GPTuner
3•todsacerdoti•1h ago•0 comments

Astronomers solve mystery of bright burst in space

https://www.independent.co.uk/space/nasa-space-bright-burst-relay-2-b2778135.html
1•Bluestein•1h ago•0 comments

Harvest Move – A game that requires careful movement

https://jslegend.itch.io/harvest-move
1•JSLegendDev•1h ago•0 comments

Systemic Misalignment: Key Failures of AI Alignment Methods

https://www.systemicmisalignment.com/
1•brandonb•1h ago•0 comments

Canada orders China's Hikvision to close Canadian operations

https://www.reuters.com/markets/emerging/ottawa-orders-chinese-manufacturer-hikvision-shutter-canadian-operations-2025-06-28/
3•xnhbx•1h ago•0 comments

First thoughts on Rust vs. OCaml (2020)

https://blog.darklang.com/first-thoughts-on-rust-vs-ocaml/
2•danboarder•1h ago•0 comments