Fei-Fei Li: Spatial intelligence is the next frontier in AI [video]

https://www.youtube.com/watch?v=_PioN-CpOP0

117•sandslash•1d ago

Comments

skwb•5h ago

It's hard to describe, but it's felt like LLMs have completely sucked the entire energy out of computer vision. Like... I know CVPR still happens and there's great research that comes out of it, but almost every single job posting in ML is about LLMs to do this and that to the detriment of computer vision.

friendzis•5h ago

What's the equivalent of methadone therapy, but for reckless VC?

What's the equivalent of destroying everything around you while chasing another high, but for reckless VC?

baxtr•4h ago

Jeopardy?!

jgord•4h ago

yeah, see my other comment.

To me it's totally obvious that we will have a plethora of very valuable startups that use RL techniques to solve real-world problems in practical areas of engineering .. and I just get blank stares when I talk about this :]

I've stopped saying AI when I mean ML or RL .. because people equate LLMs with AI.

We need better ML / RL algos for CV tasks :

  - detecting lines from pixels
  - detecting geometry in pointclouds
  - constructing 3D from stereo images, photogrammetry, 360 panoramas

These might be used by LLMs but are likely built using RL or 'classical' ML techniques, tapping into the vast parallel matmul compute we now have in GPUs, multicore CPUs, and NPUs.

pzo•3h ago

I thought there had been a lot of progress in the last 2 years: (Video) Depth Anything, SAM2, Grounding DINO, DFINE, VLMs, Gaussian splats, NeRF. Sure, less than the progress in LLMs, but I would still say progress accelerated with LLM research.

porphyra•4h ago

I feel like 3D reconstruction/bundle adjustment is one of those things where LLMs and new AI stuff haven't managed to get a significant foothold. Recently VGGT won best paper, which is good for them, but for the most part, stuff like NeRF and Gaussian Splatting still relies on good old COLMAP for bundle adjustment using SIFT features.

Also, LLMs really suck at some basic tasks like counting the sides of a polygon.

KaiserPro•2h ago

> LLMs really suck at some basic tasks like counting the sides of a polygon.

Oh indeed, but that's not using tokens correctly. If you want to do that, then tokenise the number of polygons....

weinzierl•5h ago

I tried LLMs for geolocation recently, and it is amazing both how good they are at recognizing patterns and how terrible they are at recognizing and utilizing basic spatial relationships.

Pamar•5h ago

I would like to read a complete example if you want to share (I am not disputing your point, I'd just like to understand better because this is not my field, so I cannot immediately map your comment to my own experience).

weinzierl•4h ago

Happy to share a complete example privately; contact data is in my profile.

Will add condensed version here in half an hour.

Pamar•3h ago

Condensed version will be more than adequate, thanks!

weinzierl•1h ago

Condensing it for HN was harder than I thought because most of it only makes sense when you also see the images, so here is more of a summary of parts of the dialogue.

Prompted by this comment

https://news.ycombinator.com/item?id=44366753

I tried to geolocate the camera.

I uploaded a screenshot from

https://walzr.com/weather-watching

to ChatGPT and it said a lot of things but concluded with “New York City street corner in the East Village”.[1]

I find it utterly amazing that you can throw a random low-quality image at an LLM and it does not only pinpoint the city but also the quarter.

Good, but how to proceed from there? ChatGPT knows how street corners in the East Village look in general, but it does not know every building and every corner. Moreover, it has no access to Google Street View to help find a matching building. So this is kind of a dead end when we want a precise location.

How would a human approach this? From the image it is immediately clear that we have a T-shaped intersection. From the shadows we can conclude that the stem of the T runs east-west and the bar north-south. We see that the stem is a one-way street with cars going only from west to east. This immediately excludes all odd-numbered streets. There are buildings all around and no parks or water.

There are not many intersections in New York that fulfill all these criteria, but what if the area to search was much bigger? Could an LLM do that for us?
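The human-style search described above amounts to filtering candidate intersections by a handful of constraints. A minimal sketch, assuming a hypothetical `Intersection` record and made-up candidates (none of these are real locations, and the even/odd rule is taken from the comment as stated, not checked against a map):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intersection:
    name: str                 # hypothetical label, not a real address
    street_number: int        # number of the cross street forming the stem of the T
    shape: str                # "T" or "X"
    near_park_or_water: bool


def matches_scene(c: Intersection) -> bool:
    """Apply the constraints read off the image, as listed in the comment."""
    # Assumption from the comment: an eastbound stem rules out odd-numbered streets.
    eastbound_stem = c.street_number % 2 == 0
    return c.shape == "T" and eastbound_stem and not c.near_park_or_water


# Made-up candidates, purely for illustration.
candidates = [
    Intersection("A", 4, "T", False),
    Intersection("B", 5, "T", False),   # odd-numbered stem: excluded
    Intersection("C", 6, "X", False),   # four-way crossing: excluded
    Intersection("D", 8, "T", True),    # next to a park: excluded
]
survivors = [c.name for c in candidates if matches_scene(c)]
print(survivors)
```

The filter itself is trivial; the hard part is producing a reliable candidate table from imagery and maps, which is where the LLM struggled.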

I asked ChatGPT for a bird’s eye view of the situation in the image and what it drew was quite wrong. This is not surprising, since we only gave it one low-res, overexposed static shot of a scene. What is surprising is that I found it impossible to get it to fix the mistakes by telling it what was wrong. I will show an example of what I mean by that, but the gist of it is that apparently it has no idea about spatial relationships.

Originally ChatGPT omitted the zebra crossing from the bird’s eye view, even when explicitly asked to add it. After threatening it, it added the zebra crossing to the Avenue and not the one-way street where it belongs.

I: “No, the zebra is in the wrong direction. It is across the horizontal street.”

ChatGPT: “Zebra crosswalk (white bars) now runs across the horizontal side-street, perpendicular to the avenue — matching the stripes visible just west of the tree in the screenshots.” [2]

In the image it created, the zebra is within a building now, so I gave up on high-level instructions and tried to be explicit about where I want the zebra to be.

I: “The zebra needs to be moved to the east right to the edge of the building.”

ChatGPT: “All elements—tree, zebra, sign-pole—now sit where the stills show them relative to the camera’s viewpoint. Let me know if any final tweak is needed!”

The image it generated has the zebra removed.

I: “I do not see a zebra. Maybe make it light grey or something.”

ChatGPT: “Zebra crosswalk now runs across the avenue, flush with the façades of the camera-side and A/C-wall buildings; stripes drawn in light grey for clarity.”

Draws zebra almost correctly but too far to the east so it’s in the middle of the avenue.

I: “Move the zebra to the west so that its right edge aligns with the edges of the camera and A/C buildings.”

ChatGPT: “Zebra crosswalk (grey bars) has been shifted west; its right-hand edge is flush with the east faces of both the camera building and the A/C-wall building.”

Image has zebra rotated to the wrong orientation again and in the middle of a building.

I: “The zebra runs north to south like it was previously. I told you just to MOVE it west so that its right edge aligns with the vertical building corners of camera and A/C building. I never told you to rotate it.”

ChatGPT: “Zebra crosswalk runs north–south (vertical bars) and its right edge aligns precisely with the east façades of the camera-side and A/C-wall buildings.”

Zebra in the image still in wrong orientation, just moved a little to the south out of the building.

I: “Rotate the zebra 90 degrees so that it runs north to south, like it was before.”

ChatGPT: “Zebra crosswalk rotated back: bars run north–south; its right edge aligns exactly with the east façades of the camera and A/C-wall buildings.”

I: “Now we have a zebra within the building.”

Zebra moved out of building but again rotated in the wrong direction.

ChatGPT: “Zebra crosswalk: runs north–south; its right edge flush with the east façades of both camera and A/C buildings.”

I: “Rotate the zebra crosswalk so the stripes run west to east.”

This gave an acceptable result and I left it there.

Part of the problem is certainly that I should have communicated more clearly, but the whole thing went on the same way for fixing the building positions, turning the crossroads into a T-intersection, adding the tree and the car. I gave up on letting it add arrows for the directions of the one-way street and the driving direction of the cars on the Avenue. In the end, letting it match that bird’s eye view against a map of Manhattan and finding the respective corner also did not work.

[1] Riley Walz deliberately did not share the exact position of the camera, so I will not do so either. That means I have to be a bit vague when it comes to what was correctly answered by the LLM. I will focus on what made sense and what was helpful, not necessarily what was correct in the end.

[2] All ChatGPT output verbatim but abbreviated to the relevant parts.

ansgri•3h ago

I've tried to use various OpenAI models for OpenSCAD code generation, and while the code was valid, it absolutely couldn't get spatial relationships right. Even in simple cases, like a U-tube assembled from 3 straight and 2 curved segments. So this is definitely an area for improvement.

jandrewrogers•5h ago

I appreciate the video and generally agree with Fei-Fei but I think it almost understates how different the problem of reasoning about the physical world actually is.

Most dynamics of the physical world are sparse, non-linear systems at every level of resolution. Most ways of constructing accurate models mathematically don’t actually work. LLMs, for better or worse, are pretty classic (in an algorithmic information theory sense) sequential induction problems. We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch.

There are a bunch of fundamental computer science problems that stand in the way, which I was schooled on in 2006 from the brightest minds in the field. For example, how do you represent arbitrary spatial relationships on computers in a general and scalable way? There are no solutions in the public data structures and algorithms literature. We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

I vigorously agree that the ability to reason about spatiotemporal dynamics is critical to general AI. But the computer science required is so different from classical AI research that I don’t expect any pure AI researcher to bridge that gap. The other aspect is that this area of research became highly developed over two decades but is not in the public literature.

One of the big questions I have had since they announced the company, is who on their team is an expert in the dark state-of-the-art computer science with respect to working around these particular problems? They risk running straight into the same deep, layered theory walls that almost everyone else has run into. I can’t identify anyone on the team that is an expert in a relevant area of computer science theory, which makes me skeptical to some extent. It is a nice idea but I don’t get the sense they understand the true nature of the problem.

Nonetheless, I agree that it is important!

machinelearning•4h ago

"Most ways of constructing accurate models mathematically don’t actually work" > This is true for almost anything at the limit, we are already able to model spatiotemporal dynamics to some useful degree (see: progress in VLAs, video diffusion, 4D Gaussians)

"We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch" > What's the source that this is a physically impossible problem? Not sure what you mean by impedance mismatch but do you mean that it is unsolvable even with better techniques?

Your whole third paragraph could have been said about LLMs and isn't specific enough, so we'll skip that.

I don't really understand the other 2 paragraphs. What's this "dark state-of-the-art computer science" you speak of? What is this "area of research became highly developed over two decades but is not in the public literature"? And how is "the computer science required is so different from classical AI research"?

calf•28m ago

Above commenter also asserts "highly developed research but no public literature" shrug ...

ccozan•4h ago

I agree that the problem is hard. However, the biological brain is able to handle it quite "easily" (it is not really easy: billions of iterations were needed). Current brains solve this 3D physical world _only_ via perception.

So this is the place where we must look. It starts with the sensing and the integration of that sensing. I have been working on this problem for more than 10 years and have arrived at some results. I am not a real scientist but a true engineer, and I am looking at it quite intensely from that perspective. The question that one must ask is: how do you define the outside physical world from the perspective of a biological sensing "device"? What exactly are we "seeing" or "hearing"? So yes, working on that has taken me further in defining the physical world.

adamnemecek•4h ago

All (ALL!!) AI/optimization problems boil down to energy minimization or dually entropy maximization.

lsy•4h ago

To make this more concrete: ImageNet enabled computer "vision" by providing images + labels, enabling the computer to take an image and spit out a label. LLM training sets enable text completion by providing text + completions, enabling the computer to take a piece of text and spit out its completion. Learning how the physical world works (not just kind of works a la videogames, actually works) is not only about a jillion times more complicated, there is really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed.

"Spatial awareness" itself is kind of a simplification: the idea that you can be aware of space or 3d objects' behavior without the social context of what an "object" is or how it relates to your own physical existence. Like you could have two essentially identical objects but they are not interchangeable (original Declaration of Independence vs a copy, etc). And many many other borderline-philosophical questions about when an object becomes two, etc.

coldtea•2h ago

>there is really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed.

Why wouldn't it be? If the world is ingested via video and lidar sensors, what's the hangup in recording such input and then replaying it faster?

psb217•1h ago

I think there's an implicit assumption here that interaction with the world is critical for effective learning. In that case, you're bottlenecked by the speed of the world... when learning with a single agent. One neat thing about artificial computational agents, in contrast to natural biological agents, is that they can share the same brain and share lived experience, so the "speed of reality" bottleneck is much less of an issue.

mindcrime•4h ago

> became highly developed over two decades but is not in the public literature.

Developed by who? And for what purpose? Are we talking about overlap with stuff like missile guidance systems or targeting control systems or something, and kept confidential by the military-industrial complex? I'm having a hard time seeing many other scenarios that would explain a large body of people doing research in this area and then not publishing anything.

> I can’t identify anyone on the team that is an expert in a relevant area of computer science theory

Who is an expert on this theory then?

gyomu•3h ago

> There are a bunch of fundamental computer science problems that stand in the way, which I was schooled on in 2006 from the brightest minds in the field. For example, how do you represent arbitrary spatial relationships on computers in a general and scalable way? There are no solutions in the public data structures and algorithms literature. We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

Where can I read more about this space? (particularly on the "we know that universal solutions can't exist" front)

niemandhier•2h ago

Regarding sparse, nonlinear systems and our ability to learn them:

There is hope. The experimental observation is that in most cases the coupled high-dimensional dynamics almost collapse to low-dimensional attractors.

The interesting thing about these is: If we apply a measurement function to their state and afterwards reconstruct a representation of their dynamics from the measurement by embedding, we get a faithful representation of the dynamics with respect to certain invariants.

Even better, suitable measurement functions are dense in function space so we can pick one at random and get a suitable one with probability one.

What can be gleaned about the dynamics in terms of these invariants can be learned for certain; experience shows that we can usually also predict quite well.

There is a chain of embedding theorems by Takens and Sauer gradually broadening the scope of applicability from deterministic chaos towards stochastically driven deterministic chaos.

Note embedding here is not what current computer science means by the word.
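To make "reconstructing the dynamics from the measurement by embedding" concrete, here is a minimal delay-coordinate embedding sketch in plain Python. The function name `delay_embed`, the test signal, and the choices `dim=3` and `lag=16` are illustrative assumptions, not anything prescribed by the theorems; in practice the embedding dimension and lag have to be chosen for the system at hand.

```python
import math


def delay_embed(series, dim, lag):
    """Return delay vectors (x[t], x[t+lag], ..., x[t+(dim-1)*lag]) for each valid t."""
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive")
    n_vectors = len(series) - (dim - 1) * lag
    if n_vectors <= 0:
        raise ValueError("series too short for this dim and lag")
    return [
        tuple(series[t + k * lag] for k in range(dim))
        for t in range(n_vectors)
    ]


# Hidden system: a point rotating on a circle. The "measurement function"
# only reveals its x-coordinate, a single scalar per time step.
observed = [math.cos(0.1 * t) for t in range(200)]

# Each delay vector stands in for the unobserved full state at time t.
vectors = delay_embed(observed, dim=3, lag=16)
print(len(vectors), vectors[0])
```

Plotting the first two coordinates of `vectors` traces a closed loop, recovering the circular structure of the hidden dynamics from the scalar measurement alone.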

I spent most of my early adulthood doing these things; it would be cool to see them used once more.

teemur•1h ago

> We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

This made me a bit curious. Would you have any pointers to books/articles/search terms if one wanted to have a bit deeper look on this problem space and where we are?

idiotsecant•1h ago

If there's one thing that control theory has taught us in the last 100 years, it's that anything is linear if you zoom in far enough. Nonlinearity is practically solvable by adjusting your controls to different linear models depending on your position in the system space.
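In control engineering, switching between linear models by operating point is usually called gain scheduling. A rough sketch of the model-selection half of that idea, assuming a pendulum-style nonlinearity `sin(theta)` and an arbitrary grid of operating points (both are illustrative choices, and no actual controller is designed here):

```python
import math


def linearize(f, x0, h=1e-6):
    """Local linear model f(x) ~ f(x0) + slope * (x - x0), slope by central difference."""
    slope = (f(x0 + h) - f(x0 - h)) / (2 * h)
    return x0, f(x0), slope


def scheduled_eval(models, x):
    """Evaluate the local model whose operating point is nearest to x."""
    x0, y0, slope = min(models, key=lambda m: abs(m[0] - x))
    return y0 + slope * (x - x0)


# One linear model every 0.25 rad between -1.5 and 1.5 rad.
operating_points = [i * 0.25 for i in range(-6, 7)]
scheduled = [linearize(math.sin, p) for p in operating_points]

# A single linear model around the origin, for comparison.
single = [linearize(math.sin, 0.0)]

grid = [i / 100 for i in range(-150, 151)]
worst_scheduled = max(abs(scheduled_eval(scheduled, x) - math.sin(x)) for x in grid)
worst_single = max(abs(scheduled_eval(single, x) - math.sin(x)) for x in grid)
print(worst_scheduled, worst_single)
```

The scheduled set of models stays close to the true curve over the whole range, while the single model around zero drifts badly near the edges. The trade-off is that the number of local models grows quickly with the dimension of the state space.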

amelius•16m ago

Some types of deep learning model can handle 3d data quite well:

https://en.wikipedia.org/wiki/Neural_radiance_field

jgord•4h ago

makes sense - humans have evolved a lot of wetware dedicated to 3D processing from stereo 2D.

I've made some progress on a PoC in 3D reconstruction - detecting planes, edges, pipes from pointclouds from lidar scans, eg : https://youtu.be/-o58qe8egS4 .. and am bootstrapping with in-house gigs as I build out the product.

Essentially it breaks down to a ton of matmuls, and I use a lot of tricks from pre-LLM ML .. this is a domain that perfectly fits RL.

The investors I've talked to seem to understand that scan-to-cad is a real problem with a viable market - automating 5Bn / yr of manual click-labor. But they want to see traction in the form of early sales of the MVP, which is understandable, especially in the current regime of high interest rates.

I've not been able to get across to potential investors the vast implications for robotics, AI, AR, VR, VFX that having better / faster / realtime 3D reconstruction will bring. It's great that someone of the caliber of Fei-Fei Li is talking about it.

Robots that interact in the real world will need to make a 3D model in realtime and likely share it efficiently with comrades.

While a gaussian splat model is more efficient than a pointcloud, a model which recognizes a wall as a quad plane is much more efficient still, and needed for realtime communication. There is the old idea that compression is equivalent to AI.
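As an illustration of why "wall as a plane" compresses so well, here is a small sketch that recovers a single plane from a noisy synthetic pointcloud using classical RANSAC. This is a textbook baseline, not the RL approach argued for in this comment, and the scene, noise level, threshold, and iteration count are all made-up assumptions.

```python
import random


def plane_from_points(p, q, r):
    """Plane through three points as (unit normal n, offset d) with n . x = d, or None if collinear."""
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (nx * nx + ny * ny + nz * nz) ** 0.5
    if norm < 1e-12:
        return None
    n = (nx / norm, ny / norm, nz / norm)
    return n, n[0] * p[0] + n[1] * p[1] + n[2] * p[2]


def ransac_plane(points, threshold=0.02, iterations=200, rng=None):
    """Return the plane supported by the most points within `threshold` of it."""
    rng = rng or random.Random(0)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [
            pt for pt in points
            if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] - d) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic "wall": 500 noisy points near the plane x = 2, plus 100 clutter points.
rng = random.Random(42)
wall = [(2.0 + rng.gauss(0, 0.005), rng.uniform(0, 4), rng.uniform(0, 2.5)) for _ in range(500)]
clutter = [(rng.uniform(0, 4), rng.uniform(0, 4), rng.uniform(0, 2.5)) for _ in range(100)]

plane, inliers = ransac_plane(wall + clutter, rng=rng)
print(plane, len(inliers))
```

Once the plane is found, the hundreds of inlier points can be replaced by four numbers (normal and offset) plus the wall's outline, which is the kind of saving that matters for realtime sharing between robots.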

What is stopping us from having a google street-view v3.0 in which I can zoom right into and walk around a shopping mall, or train station or public building ? Our browsers can do this now, essentially rendering quake like 3D environments - the problem is with turning a scan into a lightweight 3D model.

Photogrammetry, where you have hundreds of photos and reconstruct the 3D scene, uses a lot of compute, and the COLMAP / Structure-from-Motion algorithm predates newer ML approaches and is ripe for a better RL algorithm imo. I've done experiments where you can manually model a 3D scene from well-positioned 360 panorama photos of a building, picking corners, following the outline of walls to make a floorplan etc ... this should be amenable to an RL algorithm. Most 360 panorama photo tours have enough overlap to reconstruct the scene reasonably well.

I have no doubt that we are on the brink of a massive improvement in 3D processing. It's clearly solvable with the ML/RL approaches we currently have .. we don't need AGI. My problem is getting funding to work on it fulltime, equivalently talking an investor into taking that bet :)

jgord•4h ago

PS: it's handy to compare the relative data sizes of [ models of ] the same scene : typically for something like a house, the data will be ballpark :

  -  15GB of pointcloud data ( 100Mn xyzRGB points from a lidar laser scanner )
  -  3 GB of 360 panorama photos
  -  50MB obj 3D textured model
  -  2MB CAD model

I'm guessing gaussian-splat would be something like 20x to 40x more efficient than the pointcloud. I achieved similar compression for building scans, using flat textured mini-planes.
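Taking the ballpark figures above at face value, the ratios work out as follows. This is plain arithmetic, assuming 1 GB = 1000 MB; the splat range simply applies the guessed 20x to 40x factor to the pointcloud size.

```python
# Ballpark sizes from the list above, in MB (assuming 1 GB = 1000 MB).
sizes_mb = {
    "pointcloud": 15_000,
    "360 panoramas": 3_000,
    "textured OBJ": 50,
    "CAD": 2,
}

# How many times smaller each representation is than the raw pointcloud.
ratios = {name: sizes_mb["pointcloud"] / mb for name, mb in sizes_mb.items()}

# Guessed gaussian-splat size if it is 20x to 40x smaller than the pointcloud.
splat_mb_range = (sizes_mb["pointcloud"] / 40, sizes_mb["pointcloud"] / 20)

for name, r in ratios.items():
    print(f"{name}: {r:g}x smaller than the pointcloud")
print(f"gaussian splat (guess): {splat_mb_range[0]:g} to {splat_mb_range[1]:g} MB")
```

On these numbers a splat would sit between the panoramas and the textured OBJ model, still a couple of orders of magnitude above the CAD model.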

rallyforthesun•2h ago

SplaTAM is an interesting new way to generate 3D Gaussians in real-time. It relies on RGB+D data and doesn’t need COLMAP at all. I am not related to it but am using it for a project with a robot, as its main purpose is to do SLAM. As far as I understand, it uses the point cloud for the alignment of the images.

edit:typo

KaiserPro•2h ago

I have worked around spatial AI for a number of years.

Most of the stuff I have been working with has been aimed at low power consumption. One of the things that really helped is not bothering with dense reconstruction at all.

Things like SceneScript and SpaRP take this route: instead of trying to capture all the geometry (like photogrammetry), the essential dimensions are captured and either output as a text description (SceneScript) or as a simple model with decent normals (SpaRP).

Humans don't really keep complex dense reconstructions in our heads. It's all about spatial relationships of landmarks.

signa11•4h ago

Yann LeCun's JEPA paper is quite instructive.

IdealeZahlen•4h ago

I've always wondered how spatial reasoning appears to be operating quite differently from other cognitive abilities, with significant individual variations. Some people effortlessly parallel park while others struggle with these tasks despite excelling at other forms of pattern recognition. What was particularly intriguing for me is that some people with aphantasia have no difficulty with spatial reasoning tasks, so spatial reasoning may be distinct from reasoning based on internal visualization.

polytely•7m ago

My theory is that aphantasia is purely about conscious access to visualizing, not the existence of the ability to visualise.

I have aphantasia, but I would say that spatial reasoning is one of the things my brain is best at.

myspeed•4h ago

Most of our spatial intelligence is innate, developed through evolution. We're born with a basic sense of gravity and the ability to track objects. When we learn to drive a car, we simply reassign these built-in skills to a new context.

pzo•2h ago

Is there any research about it? This would mean we are amassing some knowledge in our genes, and that offspring are born with some knowledge of their ancestors. This would mean the weights are stored in DNA?

cma•2h ago

Horses can be blindfolded at birth and when removed do basic navigation with no time for any training. Other non-visually precocious animals like cats, if they miss a critical development period without getting natural vision data, will never develop a functioning visual system.

Baby chicks can do bipedal balance pretty much as soon as they dry off.

Wood ducks can visually imprint very soon after hatching and drying off, a couple hours after birth with very limited visual data up until then and no interspersed sleep cycles.

We as humans have natural reactions to snake-like shapes etc., even before encountering the danger of them or learning about it from social cues. Babies

wolframhempel•4h ago

we're actually working on a practical implementation of aspects of what Fei-Fei describes - although with a more narrow focus on optimizing operations in the physical space (mining, energy, defense etc) https://hivekit.io/about/our-vision/

ldenoue•3h ago

Full playable transcript https://www.appblit.com/scribe?v=_PioN-CpOP0

sabman•3h ago

We've been working on this challenge in the satellite domain with https://earthgpt.app. It’s a subset of what Fei-Fei is describing, but comes with its own unique issues like handling multi-resolution sensors and imagery with hundreds of spectral bands. Think of it as computer vision, but in n-dimensions.

Happy to answer questions if you're curious. PS. still in early beta, so please be gentle!

fnands•2h ago

Hey, cool project!

Do you actually pass the images to the model, or just the metadata/stats?

sabman•3m ago

Thanks! This live demo uses metadata and stats only. Right now we are testing ViTs and Foundation Models as well. But the quality of results from EO FMs hasn't been worth the inference cost so far. Early days though. Also starting to fine-tune models for specific downstream tasks ourselves.

byteab•1h ago

Really interesting space

District5524•2h ago

An immaterial side note: funny how obsessed she seems to be with her age. She said once that people in the audience could be half or even a third of her age. Given that she's 49, is it really typical that 16-year-olds attend these fireside YC chats?

defrost•2h ago

Possible, yes .. which validates her statement.

Typical? Probably not, but hardly relevant to the truthiness of the claim.

