I suspect the calculus is more favorable for robotics
- Reinforcement Learning (2026)
- General Intelligence (2027)
- Continual Learning (2028)
EDIT: lol, funny how the idiots downvote
1. Robust to adversarial attacks (e.g. in classification models or LLM steering).
2. Solving ARC-AGI.
Current models are optimized to solve the specific problem they're presented with, not to find the most general problem-solving techniques.
Edit: I'm trying the ARC-AGI tests now and it's looking bad for me: https://arcprize.org/play?task=e3721c99
LeCun: Energy Based Self-Supervised Learning
Chollet: Program Synthesis
Fei-Fei: ???
Are there any others with hot takes on the future architectures and techniques needed for A-not-quite-G-I?
Underrated and unsung. Fei-Fei Li launched ImageNet way back in 2007, a hugely influential move that sparked much of the computer-vision deep learning that followed. I remember jph00 saying in a lecture about 7 years ago that "text is just waiting for its ImageNet moment" -> then came the GPT explosion. Fei-Fei was massively instrumental in where we are today.
Their success is due to datasets and to the tooling that allowed models to be trained on large amounts of data sufficiently fast, using GPU clusters.
> I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs).
Datasets + NNs + GPUs. Three "vastly different" advances that came together. ImageNet was THE dataset.
There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience. Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068

All animals are able to transform coordinates in real time to navigate their world, and humans have the most coordinate representations of any known living animal. I believe human-level intelligence is knowing when and how to transform these coordinate systems to extract useful information. I wrote this before the huge LLM explosion and I still personally believe it is the path forward.
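To make "coordinate transformation" concrete, here is a minimal, hypothetical sketch (mine, not from the paper) of mapping an egocentric observation -- a landmark seen relative to the animal's body -- into allocentric world coordinates; all names here are illustrative:

    import numpy as np

    def egocentric_to_allocentric(obs_xy, agent_pos, agent_heading):
        # obs_xy: landmark position in the body frame (forward, left), metres
        # agent_pos: agent position in the world frame
        # agent_heading: heading in radians (0 = world +x axis)
        c, s = np.cos(agent_heading), np.sin(agent_heading)
        rot = np.array([[c, -s],
                        [s,  c]])        # rotate the body frame into the world frame
        return np.asarray(agent_pos) + rot @ np.asarray(obs_xy)

    # A landmark 2 m straight ahead of an agent at (5, 3) facing +y:
    print(egocentric_to_allocentric((2.0, 0.0), (5.0, 3.0), np.pi / 2))  # ~[5., 5.]

The sketch is just the geometry; the interesting question the comment raises is how the brain (or a model) learns when to apply which transformation.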
Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to animals understanding their position in the world.
https://www.nobelprize.org/prizes/medicine/2014/press-releas...
There's a whole giant gap between grid cells and intelligence.
I believe the null hypothesis would be that a model natively understanding both would work best / come closest to human intelligence (and possibly other modalities are also needed).
Also, speaking as a complete layman: the fact that our language has so many interconnections with spatial concepts (topic: place; subject: lying under or near; respect/prospect: looking back/ahead; etc.) also points towards a multi-modal intelligence. In my understanding these connections only make their way into LLMs' representations secondarily.
While virtual world systems and physical world systems look similar on paper, a bit like chemistry and chemical engineering, they are largely unrelated problems with limited theoretical overlap. A virtual world model is essentially a special, trivial case that becomes tractable because it defines away most of the hard computer-science problems in physical world models.
A good argument could be made that spatial intelligence is a critical frontier for AI; many open problems are reducible to it. I don't see any evidence that this company is positioned to make material progress on it.
Key distinction: constant and continuous updating, i.e. feedback loops of observation, prediction, action (agency), and then observation again.
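As a rough illustration (my own sketch, not anyone's actual architecture), such a loop might look like the following, with `Environment` and the update rule as placeholders:

    import random

    class Environment:                       # placeholder world, not a real API
        def observe(self):
            return random.random()           # stand-in for a sensor reading
        def step(self, action):
            pass                             # apply the action to the world

    def run_agent(env, steps=10, lr=0.1):
        prediction = 0.0
        for _ in range(steps):
            obs = env.observe()              # observation
            error = obs - prediction         # compare prediction with reality
            prediction += lr * error         # continuous updating from feedback
            action = "explore" if abs(error) > 0.5 else "exploit"
            env.step(action)                 # act (agency), then observe again

    run_agent(Environment())

The point is only the shape of the loop: the model is never "done" training; every action changes what it observes next.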
It should have survival and preservation as a fundamental architectural feature.