I suspect the calculus is more favorable for robotics
- Reinforcement Learning (2026)
- General Intelligence (2027)
- Continual Learning (2028)
EDIT: lol, funny how the idiots downvote
> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly
1. Robust to adversarial attacks (e.g. in classification models or LLM steering).
2. Solving ARC-AGI.
Current models are optimized to solve the specific problem they're presented with, not to find the most general problem-solving techniques.
Edit: I'm trying arc-agi tests now and it's looking bad for me: https://arcprize.org/play?task=e3721c99
One man's modus ponens is another man's modus tollens.
"I'm trying arc-agi tests now and it's looking bad for me. I am not robust to adversarial attacks. I think I'm not generally intelligent."
For people coming after me, or for anyone who took discrete math a decade ago and needs a quick refresher:
Modus ponens (affirming): if P, then Q. P is true, therefore Q.
If it is raining, the grass is wet. It is raining. Therefore the grass is wet.
Modus tollens (denying): if P, then Q. Q is false. Therefore P is false.
If it is raining, then the grass is wet. The grass is not wet. Therefore, it is not raining.
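In inference-rule notation, the two rules side by side:

```latex
% Modus ponens (left) and modus tollens (right)
\[
  \frac{P \to Q \qquad P}{Q}
  \qquad\qquad
  \frac{P \to Q \qquad \neg Q}{\neg P}
\]
```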
LeCun: Energy Based Self-Supervised Learning
Chollet: Program Synthesis
Fei-Fei: ???
Are there any others with hot takes on the future architectures and techniques needed for A-not-quite-G-I?
Underrated and unsung. Fei Fei Li first launched ImageNet way back in 2007, a hugely influential move that sparked much of the deep learning work in computer vision that followed. I remember jph00 saying in a lecture about 7 years ago that "text is just waiting for its imagenet moment" -> then came the GPT explosion. Fei Fei was massively instrumental in where we are today.
Their success is due to datasets and the tooling that allowed models to be trained on large amounts of data, sufficiently fast using GPU clusters.
> I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs).
Datasets + NNs + GPUs. Three "vastly different" advances that came together. ImageNet was THE dataset.
There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience. Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information. I wrote this before the huge LLM explosion and I still personally believe it is the path forward.
Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to animals understanding their position in the world.
https://www.nobelprize.org/prizes/medicine/2014/press-releas...
There's a whole giant gap between grid cells and intelligence.
Please check this recent article on the state machine in the hippocampus based on learning [1]. The findings support the long-standing proposal that sparse orthogonal representations are a powerful mechanism for memory and intelligence.
[1] Learning produces an orthogonalized state machine in the hippocampus:
This approach simply follows suit with the blundering reverse engineering of the brain in cog sci, where material properties are seen in isolation and processes are deduced piecemeal. The brain can only be understood as a whole first. See Rhythms of the Brain or Unlocking the Brain.
There’s a terrifying lack of curiosity in the paper you posted, a kind of smug synthetic rush to import code into a part of the brain that’s a directory among directories that has redundancies as a warning: we get along without this.
Your view and theirs (the OSM one) is too narrow. E.g. categorization is baked into the whole brain. How? This is one of thousands of processes that generalize materially across the entire brain. Isolating "learning" to the allocortex is incredibly misleading.
https://www.cell.com/current-biology/fulltext/S0960-9822(25)...
Deep Mind also did a paper with grid cells a while ago: https://deepmind.google/blog/navigating-with-grid-like-repre...
I mean she launched her whole career with imagenet so you can hardly blame her for thinking that way. But on the other hand, there's something bitter lesson-pilled about letting a model "figure out" spatial relationships just by looking at tons of data. And tbh the recent progress [1] of worldlabs.ai (Dr Fei Fei Li's startup) looks quite promising for a model that understands stuff including reflections and stuff.
> looks quite promising for a model that understands stuff including reflections and stuff.
I got the opposite impression when trying their demo...[0]. Even in their examples some of these issues exist, like how objects stay a constant size despite moving, as if the parallax or depth information were missing. Not to mention that they show it walking on water lol. As for reflections, I don't get that impression either. They seem extremely brittle to movement.
Right. I was thinking about this back in the 1990s. That resulted in a years-long detour through collision detection, physically based animation, solving stiff systems of nonlinear equations, and a way to do legged running over rough terrain. But nothing like "AI". More of a precursor to the analytical solutions of the early Boston Dynamics era.
Work today seems to throw vast amounts of compute at the problem and hope a learning system will come up with a useful internal representation of the spatial world. It's the "bitter lesson" approach. Maybe it will work. Robotic legged locomotion is pretty good now. Manipulation in unstructured situations still sucks. It's amazing how bad it is. There are videos of unstructured robot manipulation from McCarthy's lab at Stanford in the 1960s. They're not that much worse than videos today.
I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.
There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?
A machine can infer the right (or expected) answer based on data; I'm not sure the same is true for how living things navigate the physical world. The "right" answer, insofar as one exists for your squirrel, is arguably Darwinian: "whatever keeps the little guy alive today".
"AI" is not based on physical real world data and models like our brain. Instead, we chose to analyze human formal (written) communication. ("formal": actual face to face communication has tons of dimensions adding to the text representation of what is said, from tone, speed to whole body and facial expressions)
Bio-brains have a model based on physical sensor data first and go from there, that's completely missing from "AI".
In hindsight, it's not surprising, we skipped that hard part (for now?). Working with symbols is what we've been doing with IT for a long time.
I'm not sure going all out on trying to base something on human intelligence, i.e. human neural networks, is a winning move. I see it as if we had been trying to create airplanes that flap their wings. For one, human intelligence already exists, and when you step back and look at how we do on small and large problems from an outside perspective, it has plenty of blind spots and disadvantages.
I'm afraid that if we were to manage hundred-percent human-level intelligence in an AI, we would be disappointed. Sure, it would be able to do a lot, but in the end, nothing we don't already have.
Right now that would also just be the abstract parts, I think the "moving the body" physical parts in relation to abstract commands would be the far more interesting part, but since current AI is not about using physical sensor data at all, never mind combining it with the abstract stuff...
You may find this talk enlightening: https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023...
As soon as you start a response like that you should just stop. After all, this is written communication, and what I wrote is plain to see right there.
When you need to start a response that way you should become self-aware that you are not responding to what the person you respond to wrote, but to your own ideas.
There is no need to "interpret" what other people wrote.
Relevant: https://i.imgur.com/Izrqp7d.jpeg
https://www.youtube.com/watch?v=udPY5rQVoW0
This has been a thing for a while. It's actually a funny way to demonstrate model based control by replacing the controller with a human.
That's more like the demos where someone trains on a scene and the neural net can make plausible extensions to the scene as you move the viewpoint. It's more spatial imagination, like the tool in Photoshop that fills in plausible but imaginary backgrounds.
It does handle collisions with the edge of the road. Collisions with other cars don't really work; they mostly disappear. One car splits in half in confusion. The spatial part is making progress, but the temporal part, not so much.
I'll add to the discussion a 2018 Nature letter: "Vector-based navigation using grid-like representations in artificial agents" https://www.nature.com/articles/s41586-018-0102-6
and a 2024 Nature article "Modeling hippocampal spatial cells in rodents navigating in 3D environments" https://www.nature.com/articles/s41598-024-66755-x
And a simulation in Github from 2018 https://github.com/google-deepmind/grid-cells
People have been looking at spatial awareness in neurology for quite a while. (In terms of the timeframe of recent developments in LLMs.)
Trying to copy biological systems 1:1 rarely works, and copying biological systems doesn't seem to be required either. CNNs are somewhat brain-inspired, but only somewhat, and LLMs have very little architectural similarity to human brain - other than being an artificial neural network.
This functional similarity of LLMs to the human brain doesn't come from reverse engineered details of how the human brain works - it comes from the training process.
Research findings: LLMs have and use world models. They use some type of abstract thinking, with internal representations that often correspond to human abstract concepts. Which adds up to a capability profile that's amusingly humanlike.
Humans, however, don't like that. They really don't. AI effect is too strong, and it demands that humans must be Special. So some humans, when faced with the possibility that an AI might be doing the same thing their own brains do, resort to coping and seething.
Brains don't perform natural language, CSR, etc.; those are cultural extensions separate from mental states. There are no functional equivalencies here.
There are many, many empirical disputes about function, e.g.
Aru et al “The feasibility of artificial consciousness through the lens of neuroscience” December 2023
>3. Interactive: World models can output the next states based on input actions
>Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the next state of the world, represented either implicitly or explicitly. When given only an action with or without a goal state as the input, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and its semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that in the case of a given goal, the world models themselves would be able to predict not only the next state of the world, but also the next actions based on the new state.
That's literally just an RNN (not a transformer). An RNN takes a previous state and an input and produces a new state. If you add a controller on top, it is called model predictive control. The most extreme form I have seen is temporal difference model predictive control (TD-MPC). [0]
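For concreteness, here is a minimal sketch of that shape (toy numpy code of my own, not TD-MPC itself): an RNN-style world model maps (state, action) to the next state, and a naive random-shooting controller on top of it is already a crude form of model predictive control.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy recurrent world model: next state from previous state and action.
W_s = rng.normal(size=(4, 4)) * 0.1
W_a = rng.normal(size=(4, 2)) * 0.1

def world_model(state, action):
    return np.tanh(W_s @ state + W_a @ action)

def mpc_action(state, goal, horizon=5, n_candidates=64):
    """Random-shooting MPC: sample action sequences, roll the model forward,
    and return the first action of the best-scoring sequence."""
    best_first, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = rng.normal(size=(horizon, 2))
        s = state
        for a in actions:
            s = world_model(s, a)
        cost = float(np.sum((s - goal) ** 2))
        if cost < best_cost:
            best_first, best_cost = actions[0], cost
    return best_first

state, goal = np.zeros(4), np.full(4, 0.5)
print(mpc_action(state, goal))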
I believe the null hypothesis would be that a model natively understanding both would work best / come closest to human intelligence (and possibly other modalities are also needed).
Also, as a complete layman, our language having several interconnections with spatial concepts would also point towards a multi-modal intelligence (topic: place; subject: lying under or near; respect/prospect: look back/ahead, etc.). In my understanding these connections only make their way into LLMs' representations secondarily.
While virtual world systems and physical world systems look similar based on description, a bit like chemistry and chemical engineering, they are largely unrelated problems with limited theory overlap. A virtual world model is essentially a special trivial case that becomes tractable because it defines away most of the hard computer science problems in physical world models.
A good argument could be made that spatial intelligence is a critical frontier for AI, many open problems are reducible to this. I don't see any evidence that this company is positioned to make material progress on it.
Key distinction: Constant and continuous updating. I.e. feedback loops with observation, prediction, action (agency), and once more, observation.
It should have survival and preservation as a fundamental architectural feature.
Since you can't change reality itself, and you can only take actions to reduce variational free energy, doesn't this make everything into a self-fulfilling prophecy?
I guess there must be some base level of instinct that overrides this; in the case of "I think that sabertooth tiger is going to eat me" you want to make sure the "don't get eaten" instinct counters "minimizing prediction errors".
My understanding is that at the moment you train something like ChatGPT on the web, setting weights with backpropagation until it works well, but if you give it more info and do more backprop it can forget other stuff it's learned, which is called "catastrophic forgetting". The nested learning approach is to split things into a number of smaller models so you can retrain one without mucking up the other ones.
>We introduce Nested Learning, a new approach to machine learning that views models as a set of smaller, nested optimization problems, each with its own internal workflow, in order to mitigate or even completely avoid the issue of “catastrophic forgetting”, where learning new tasks sacrifices proficiency on old tasks. [0].
It feels funny to be vindicated by rambling something random a week before someone makes an announcement that they did something incredibly similar with great success:
>Here is my stupid and simple unproven idea: Nest the reinforcement learning algorithm. Each critic will add one more level of delay, thereby acting as a low pass filter on the supervised reward function. Since you have two critics now, you can essentially implement a hybrid pre-training + continual learning architecture. The most interesting aspect here is that you can continue training the inner-most critic without changing the outer critic, which now acts as a learned loss function. [1]
[0] https://research.google/blog/introducing-nested-learning-a-n... [1] https://news.ycombinator.com/item?id=45745402
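A minimal sketch of what the "frozen outer critic as a learned loss" idea could look like (toy code under my own assumptions, not Google's Nested Learning implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Outer level: a "critic" parameter learned during pre-training, then frozen.
alpha = 2.0

# Inner level: a small linear model that keeps learning against the frozen critic.
w_inner = np.zeros(3)

def critic_loss(pred, target):
    # The frozen outer critic acts as a learned loss for the inner level.
    return alpha * (pred - target) ** 2

def inner_update(x, target, lr=0.05):
    global w_inner
    pred = w_inner @ x
    grad = 2 * alpha * (pred - target) * x   # gradient of critic_loss w.r.t. w_inner
    w_inner -= lr * grad

# A stream of new data: only the inner level changes, so updating it
# cannot disturb whatever the outer level has already learned.
for _ in range(200):
    x = rng.normal(size=3)
    inner_update(x, target=x.sum())
```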
And some number of billions of years of evolutionary progress.
Whatever spatial understanding we have could be thought of as a simulation at a quantum level, the size of the universe, for billions of years.
And what can we simulate completely at a quantum level today? Atoms or single cells?
However, she does not mention Google's models at all. This omission makes the blog feel very much like an ad for her company rather than a good-faith guide for the field.
https://deepmind.google/models/gemini-robotics/gemini-roboti...
If code-based CAD tools were more common, and we had a bigger corpus to pull from, these tools would probably be pretty usable. Without this, however, it seems like we'll need to train against simulations of the physical world.
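For readers who haven't seen one, this is roughly what "code-based CAD" looks like; a small sketch assuming CadQuery's fluent API, with arbitrary dimensions:

```python
import cadquery as cq

# A plate with a centred hole, defined entirely in code: the kind of
# artifact a model could learn from if such corpora were larger.
plate = (
    cq.Workplane("XY")
    .box(80, 60, 10)   # length, width, thickness in mm
    .faces(">Z")       # select the top face
    .workplane()
    .hole(22)          # drill a 22 mm through-hole
)

cq.exporters.export(plate, "plate.step")
```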
"No put the thingy over there. Not that thingy!"
https://quantblog.wordpress.com/2025/10/29/digital-twins-the...
I have no doubt FeiFei and her well funded team will make rapid progress.
Once few enough dishes are broken, we will find robots useful in our homes! :)
But the reason people are really bad at evaluating them is that the details dominate. What matters here is consistency. We need invariance to some things and equivariance to others. As evaluators we tend to be hopeful, so the subtle changes from frame to frame get overlooked, though that's kind of the most important part. It can't just be similar to the last frame; it needs to be exactly the same. You need equivariance to translation, yet that's still not happening in any of these models (and it's not a limitation of attention or transformers). You're just going to have a really hard time getting all this data, even though by doing that you'll look like you're progressing because you're better fitting it. But in the end the models will need to create some compact formulation representing concepts such as motion. Or in other words, a physics. And it's not like physicists aren't known for being detail-oriented and nitpicky over nuances. That is bred into them for good reason.
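To make the translation-equivariance point concrete, here is a toy check (my own illustration, unrelated to any particular video model): a circular 1-D convolution commutes with a circular shift, i.e. shifting the input then filtering gives the same result as filtering then shifting, which is exactly the kind of consistency these generators keep violating frame to frame.

```python
import numpy as np

def shift(x, s):
    # circular translation by s samples
    return np.roll(x, s)

def circ_conv(x, k):
    # naive circular convolution of signal x with kernel k
    n, m = len(x), len(k)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(m)) for i in range(n)])

x = np.random.default_rng(0).normal(size=32)
k = np.array([0.25, 0.5, 0.25])

# shifting then filtering == filtering then shifting
assert np.allclose(circ_conv(shift(x, 3), k), shift(circ_conv(x, k), 3))
```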
We wouldn't expect an LLM to review a paper and tell us the truth the way Fermi did. That is super-intelligence.
Thanks for sharing.
Hard things take time and deep thought. But in our age we seem to not want to think deep. The environment discourages it because it takes longer. It's incredibly difficult to have both speed and quality. They are always in contention.
Mind you, there are a number of Nobel laureates who have claimed they wouldn't have succeeded in today's environment because of this [0]. I'm confident we're so concerned with finding the best that we hinder our own ability to do so.
[0] https://www.sciencealert.com/peter-higgs-says-he-wouldn-t-ha...
I was trained as a Physicist but went to software engineering. These days I would describe my job as a digital plumber. There are two kinds of holes we are dealing with: rabbit holes and potholes. We tend to get into the rabbit holes because of the love of tools and fall into potholes because of not paying attention.
It turned out the industry might do the same. But the cost would be billions of dollars and decades of work by young, talented people.
This link has a transcript.
On the other hand, it's amazing that a 512x512 image can be represented by 85 tokens (as in OAI's API), or 263 tokens per second for video (with Gemini). It's as if the memory vs compute tradeoff has morphed into a memory vs embedding question.
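Back-of-the-envelope on that compression, using the numbers quoted above:

```python
pixels = 512 * 512      # 262,144 pixels
tokens = 85             # tokens per image in that API
print(pixels / tokens)  # ~3,084 pixels per token, roughly a 55x55 patch each
```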
This dichotomy reminds me of the "Apple Rotators - can you rotate the Apple in your head" question. The spatial embeddings will likely solve dynamics questions a lot more intuitively (ie, without extended thinking).
We're also working on this space at FlyShirley - training pilots to fly then training Shirley to fly - where we benefit from established simulation tools. Looking forward to trying Fei Fei's models!
I keep going back and forth on whether I think “super-intelligence” is achievable in any form other than speed-super-intelligence, but I definitely think that being able to think 3-dimensionally capably will be a major building block to AI outthinking man, and outthinking nature.
Sort of a shitpost.
In that sense has nature just replicated its ability to organise but at the species level on a planetary (interplanetary soon) scale?
Why are humans above nature...?
Humans are maybe separate from nature primarily on the basis of our attempts (of varying success) to steer, structure, and optimize the organization of nature around us, and knowing how to do so is not an explicit aspect of reality, or at least did not make itself known to early humans, so it's reasonable to believe it's not explicit. By that, I mean you're not born with any inherent knowledge of the inner workings of quantum gravity, or of the Navier-Stokes equations, or any of the tooling that supports it, but clearly these models exist and evolve tangibly around us in every moment. We found something nature hid from the DNA-based biological tree of life, and exploited it to great effect.
Again, this is a colossal shitpost.
I bet it's a dead end.
It could be a dead end for sure. I just hope that someone figures out the `spatial` part for AIs and brings us closer to better ones.
The main problem that I still see is: we are not able to fully understand how far we can scale the current models. How much data do we need? Do we have the data for this kind of training? Can the current models generalize the world?
Probably before seeing something really interesting we need another AI winter, where researchers can be researchers and not soldiers of companies.
I think that they want to follow the same route as LLMs: no understanding of the real world, but finding a brute-force approach that's good enough in the most useful scenarios. Same as airplanes: they can't fly in a bird-like way and they can't do bird things (land on a branch), but they are crazily useful for getting to the other side of the world in a day. They need a lot of brute force to do that.
And yes, maybe an AI winter is what is needed to have the time to stop and have some new ideas.
We build a map-based productivity app for workers. We map out the workplace and use e.g. asset tracking to visualize where things are and help people find these things and navigate around. There's a lot more to this of course, but we typically geo-reference whatever building maps we can get our hands on on top of openstreetmaps. This allows people to zoom in and out and switch between indoor and outdoor.
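The geo-referencing step itself is mostly a small linear-algebra exercise; a hedged sketch under my own assumptions (three control points clicked on the plan with known lon/lat, fitted with an affine transform; all names and values are illustrative):

```python
import numpy as np

# Hypothetical control points: pixel positions on the floor plan and the
# lon/lat they correspond to on OpenStreetMap (values are made up).
px = np.array([[0, 0], [800, 0], [800, 600]], dtype=float)
geo = np.array([[13.400, 52.520], [13.410, 52.520], [13.410, 52.515]])

# Fit an affine transform [x, y, 1] -> [lon, lat] by least squares.
A = np.hstack([px, np.ones((3, 1))])
coeffs, *_ = np.linalg.lstsq(A, geo, rcond=None)

def pixel_to_lonlat(x, y):
    return np.array([x, y, 1.0]) @ coeffs

print(pixel_to_lonlat(400, 300))  # centre of the plan, in lon/lat
```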
The hard part for us: sourcing decent building maps. There usually is some building map information available in the form of cad drawings, fire escape plans, etc. But they aren't really well suited for use as a user friendly map. Also typically getting vector graphics for this is hard. In short, we usually have to spend quite a bit of effort on designing or sourcing maps. And of course these maps aren't static. People extend buildings, move equipment and machines around, and re-purpose the spaces they have. A map of a typical factory is an empty rectangle. You can see where the walls, windows, and doors are and any supporting columns. All the interesting stuff happens in the negative spaces (the blank space between the walls).
Mapping all this is a manual process because it requires people to interpret raw spatial data in context. We build our own world model. A great analogy is text based adventure games where the only map you had was what you built in your head from querying the game. It's a surprisingly hard problem. We're used to decent quality public maps outdoors; but indoors there isn't much. Making outdoor maps is something that is quite expensive but lucrative enough that companies have been investing in that for years. Also openstreetmap has tapped into a huge community of people that manually edit things and/or integrate third party data sets (a lot of stuff is imported as well).
Recently, with Google's nano banana model, creating building maps got a lot easier. It has some notion of proportions and dimensions. I was able to take a smartphone photo of the fire escape plan mounted on the wall and then let nano banana clean it up and transform it, without destroying dimensions, hallucinating new walls, doors, or windows, or changing the dimensions of rooms. We've also been experimenting with turning bitmaps into vector graphics, which can work with promising results, but this still needs work. Even just a cleaned-up fire escape plan, minus all the escape routes and other map clutter, is already a massive improvement for us. Fire escape plans are everywhere and are kind of the baseline map we can get for pretty much any building, provided they are to scale. Which, at least in Germany, they are (standards for this are pretty strict).
AI-based map content creation from photos, reference cad diagrams, textual descriptions, etc. is what we want to target next. Given some basic cad map and a photo in the building, can we deduce the vantage point from which the photo was taken and then identify things in the photo and put them on the map in the correct position. People are actually able to do this with enough context. That's what openstreetmap editors do when they add detail to the map. AI models so far don't quite do all of this yet. Essentially this is about creating an accurate world model and using that to populate maps with content. It's not just about things like lidar and stereo vision but about understanding what is what in a photo.
In any case, that's just one example of where I see a lot of potential for smarter models. Nano banana was the first model to not make a mess of our maps.
Proven correct yet again.
Human cognition isn’t built on abstract reasoning alone. It’s embodied, grounded in sensation.
Evolution didn’t achieve generalization across domains by making brains more symbolic. It did so by making them more integrated by fusing chemical gradients, touch, proprioception, light, sound, temperature, and pressure into one continuous internal narrative.
Intelligence does not seem to be an algorithmic property; it’s a felt coherence across senses. Our reasoning emerges from a complex interaction of sensory information, memory, emotions, and cognitive processing. Sensory completeness is the way forward.
Hint ... we've got a long way to go.
The notion that it is reducible to spatial "intelligence" is just another case of flavor-of-the-month thinking in AI's contraction and desperate search for success and funding.
That a world model as inconsistent and inaccurate as storytelling is given here as a premise to hoist cash for the flavor of the month, "spatial", details how desperate the search has become in Frontierland.