I suspect the calculus is more favorable for robotics
- Reinforcement Learning (2026)
- General Intelligence (2027)
- Continual Learning (2028)
EDIT: lol, funny how the idiots downvote
> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly
1. Robust to adversarial attacks (e.g. in classification models or LLM steering).
2. Solving ARC-AGI.
Current models are optimized to solve the specific problem they're presented with, not to find the most general problem-solving techniques.
Edit: I'm trying arc-agi tests now and it's looking bad for me: https://arcprize.org/play?task=e3721c99
One man's modus ponens is another man's modus tollens.
"I'm trying arc-agi tests now and it's looking bad for me. I am not robust to adversarial attacks. I think I'm not generally intelligent."
For people coming after me, or for anyone who took discrete math a decade ago and needs a quick refresher:
Modus ponens (affirming): if P, then Q. P is true, therefore Q.
If it is raining, the grass is wet. It is raining. Therefore the grass is wet.
Modus tollens (denying): if P, then Q. Q is false. Therefore P is false.
If it is raining, then the grass is wet. The grass is not wet. Therefore, it is not raining.
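In inference-rule notation, the two rules side by side:

```latex
% Modus ponens (left) and modus tollens (right)
\[
  \frac{P \to Q \qquad P}{Q}
  \qquad\qquad
  \frac{P \to Q \qquad \neg Q}{\neg P}
\]
```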
LeCun: Energy Based Self-Supervised Learning
Chollet: Program Synthesis
Fei-Fei: ???
Are there any others with hot takes on the future architectures and techniques needed for A-not-quite-G-I?
Underrated and unsung. Fei Fei Li first launched ImageNet way back in 2007, a hugely influential move that sparked much of the deep learning work in computer vision that followed. I remember jph00 saying in a lecture about 7 years ago that "text is just waiting for its imagenet moment" -> then came the GPT explosion. Fei Fei was massively instrumental in where we are today.
Their success is due to datasets and the tooling that allowed models to be trained on large amounts of data, sufficiently fast using GPU clusters.
> I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs).
Datasets + NNs + GPUs. Three "vastly different" advances that came together. ImageNet was THE dataset.
There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience. Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information. I wrote this before the huge LLM explosion and I still personally believe it is the path forward.
Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to animals understanding their position in the world.
https://www.nobelprize.org/prizes/medicine/2014/press-releas...
There's a whole giant gap between grid cells and intelligence.
Please check this recent article on the state machine in the hippocampus based on learning [1]. The findings support the long-standing proposal that sparse orthogonal representations are a powerful mechanism for memory and intelligence.
[1] Learning produces an orthogonalized state machine in the hippocampus:
This approach simply follows suit with the blundering reverse engineering of the brain in cog sci, where material properties are seen in isolation and processes are deduced piecemeal. The brain can only be understood as a whole first. See Rhythms of the Brain or Unlocking the Brain.
There’s a terrifying lack of curiosity in the paper you posted, a kind of smug synthetic rush to import code into a part of the brain that’s a directory among directories that has redundancies as a warning: we get along without this.
Your view and theirs (the OSM one) is too narrow. E.g. categorization is baked into the whole brain. How? This is one of thousands of processes that generalize materially across the entire brain. Isolating "learning" to the allocortex is incredibly misleading.
https://www.cell.com/current-biology/fulltext/S0960-9822(25)...
Deep Mind also did a paper with grid cells a while ago: https://deepmind.google/blog/navigating-with-grid-like-repre...
I mean she launched her whole career with imagenet so you can hardly blame her for thinking that way. But on the other hand, there's something bitter lesson-pilled about letting a model "figure out" spatial relationships just by looking at tons of data. And tbh the recent progress [1] of worldlabs.ai (Dr Fei Fei Li's startup) looks quite promising for a model that understands stuff including reflections and stuff.
> looks quite promising for a model that understands stuff including reflections and stuff.
I got the opposite impression when trying their demo...[0]. Even in their examples some of these issues exist, like how objects stay a constant size despite moving, as if the parallax or depth information were missing. Not to mention that they show it walking on water lol. As for reflections, I don't get that impression either. They seem extremely brittle to movement.
Right. I was thinking about this back in the 1990s. That resulted in a years-long detour through collision detection, physically based animation, solving stiff systems of nonlinear equations, and a way to do legged running over rough terrain. But nothing like "AI". More of a precursor to the analytical solutions of the early Boston Dynamics era.
Work today seems to throw vast amounts of compute at the problem and hope a learning system will come up with a useful internal representation of the spatial world. It's the "bitter lesson" approach. Maybe it will work. Robotic legged locomotion is pretty good now. Manipulation in unstructured situations still sucks. It's amazing how bad it is. There are videos of unstructured robot manipulation from McCarthy's lab at Stanford in the 1960s. They're not that much worse than videos today.
I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.
There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?
A machine can infer the right (or expected) answer based on data; I'm not sure the same is true for how living things navigate the physical world. The "right" answer, insofar as one exists for your squirrel, is arguably Darwinian: "whatever keeps the little guy alive today".
"AI" is not based on physical real world data and models like our brain. Instead, we chose to analyze human formal (written) communication. ("formal": actual face to face communication has tons of dimensions adding to the text representation of what is said, from tone, speed to whole body and facial expressions)
Bio-brains have a model based on physical sensor data first and go from there, that's completely missing from "AI".
In hindsight, it's not surprising, we skipped that hard part (for now?). Working with symbols is what we've been doing with IT for a long time.
I'm not sure going all out on trying to base something on human intelligence, i.e. human neural networks, is a winning move. I see it as if we had been trying to create airplanes that flap their wings. For one, human intelligence already exists, and when you step back and look at how we do on small and large problems from an outside perspective, it has plenty of blind spots and disadvantages.
I'm afraid that if we were to manage hundred-percent human-level intelligence in an AI, we would be disappointed. Sure, it would be able to do a lot, but in the end, nothing we don't already have.
Right now that would also just be the abstract parts, I think the "moving the body" physical parts in relation to abstract commands would be the far more interesting part, but since current AI is not about using physical sensor data at all, never mind combining it with the abstract stuff...
You may find this talk enlightening: https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023...
As soon as you start a response like that you should just stop. After all, this is written communication, and what I wrote is plain to see right there.
When you need to start a response that way you should become self-aware that you are not responding to what the person you respond to wrote, but to your own ideas.
There is no need to "interpret" what other people wrote.
Relevant: https://i.imgur.com/Izrqp7d.jpeg
https://www.youtube.com/watch?v=udPY5rQVoW0
This has been a thing for a while. It's actually a funny way to demonstrate model based control by replacing the controller with a human.
That's more like the demos where someone trains on a scene and the neural net can make plausible extensions to the scene as you move the viewpoint. It's more spatial imagination, like the tool in Photoshop that fills in plausible but imaginary backgrounds.
It does handle collisions with the edge of the road. Collisions with other cars don't really work; they mostly disappear. One car splits in half in confusion. The spatial part is making progress, but the temporal part, not so much.
I'll add to the discussion a 2018 Nature letter: "Vector-based navigation using grid-like representations in artificial agents" https://www.nature.com/articles/s41586-018-0102-6
and a 2024 Nature article "Modeling hippocampal spatial cells in rodents navigating in 3D environments" https://www.nature.com/articles/s41598-024-66755-x
And a simulation in Github from 2018 https://github.com/google-deepmind/grid-cells
People have been looking at spatial awareness in neurology for quite a while. (In terms of the timeframe of recent developments in LLMs.)
Trying to copy biological systems 1:1 rarely works, and copying biological systems doesn't seem to be required either. CNNs are somewhat brain-inspired, but only somewhat, and LLMs have very little architectural similarity to human brain - other than being an artificial neural network.
This functional similarity of LLMs to the human brain doesn't come from reverse engineered details of how the human brain works - it comes from the training process.
Research findings: LLMs have and use world models. They use some type of abstract thinking, with internal representations that often correspond to human abstract concepts. Which adds up to a capability profile that's amusingly humanlike.
Humans, however, don't like that. They really don't. AI effect is too strong, and it demands that humans must be Special. So some humans, when faced with the possibility that an AI might be doing the same thing their own brains do, resort to coping and seething.
Brains don't perform natural language, CSR, etc.; those are cultural extensions separate from mental states. There are no functional equivalencies here.
There are many, many empirical disputes about function, e.g.
Aru et al “The feasibility of artificial consciousness through the lens of neuroscience” December 2023
>3. Interactive: World models can output the next states based on input actions
>Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the next state of the world, represented either implicitly or explicitly. When given only an action with or without a goal state as the input, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and its semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that in the case of a given goal, the world models themselves would be able to predict not only the next state of the world, but also the next actions based on the new state.
That's literally just an RNN (not a transformer). An RNN takes a previous state and an input and produces a new state. If you add a controller on top, it is called model predictive control. The most extreme form I have seen is temporal difference model predictive control (TD-MPC). [0]
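For concreteness, here is a minimal sketch of that shape (toy numpy code of my own, not TD-MPC itself): an RNN-style world model maps (state, action) to the next state, and a naive random-shooting controller on top of it is already a crude form of model predictive control.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy recurrent world model: next state from previous state and action.
W_s = rng.normal(size=(4, 4)) * 0.1
W_a = rng.normal(size=(4, 2)) * 0.1

def world_model(state, action):
    return np.tanh(W_s @ state + W_a @ action)

def mpc_action(state, goal, horizon=5, n_candidates=64):
    """Random-shooting MPC: sample action sequences, roll the model forward,
    and return the first action of the best-scoring sequence."""
    best_first, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = rng.normal(size=(horizon, 2))
        s = state
        for a in actions:
            s = world_model(s, a)
        cost = float(np.sum((s - goal) ** 2))
        if cost < best_cost:
            best_first, best_cost = actions[0], cost
    return best_first

state, goal = np.zeros(4), np.full(4, 0.5)
print(mpc_action(state, goal))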
I believe the null hypothesis would be that a model natively understanding both would work best / come closest to human intelligence (and possibly other modalities are also needed).
Also, as a complete layman, our language having several interconnections with spatial concepts would also point towards a multi-modal intelligence (topic: place; subject: lying under or near; respect/prospect: look back/ahead, etc.). In my understanding these connections only make their way into LLMs' representations secondarily.
While virtual world systems and physical world systems look similar based on description, a bit like chemistry and chemical engineering, they are largely unrelated problems with limited theory overlap. A virtual world model is essentially a special trivial case that becomes tractable because it defines away most of the hard computer science problems in physical world models.
A good argument could be made that spatial intelligence is a critical frontier for AI, many open problems are reducible to this. I don't see any evidence that this company is positioned to make material progress on it.
Key distinction: Constant and continuous updating. I.e. feedback loops with observation, prediction, action (agency), and once more, observation.
It should have survival and preservation as a fundamental architectural feature.
Since you can't change reality itself, and you can only take actions to reduce variational free energy, doesn't this make everything into a self-fulfilling prophecy?
I guess there must be some base level of instinct that overrides this; in the case of "I think that sabertooth tiger is going to eat me" you want to make sure the "don't get eaten" instinct counters "minimizing prediction errors".
My understanding is that at the moment you train something like ChatGPT on the web, setting weights with backpropagation until it works well, but if you give it more info and do more backprop it can forget other stuff it's learned, which is called "catastrophic forgetting". The nested learning approach is to split things into a number of smaller models so you can retrain one without mucking up the other ones.
>We introduce Nested Learning, a new approach to machine learning that views models as a set of smaller, nested optimization problems, each with its own internal workflow, in order to mitigate or even completely avoid the issue of “catastrophic forgetting”, where learning new tasks sacrifices proficiency on old tasks. [0].
It feels funny to be vindicated by rambling something random a week before someone makes an announcement that they did something incredibly similar with great success:
>Here is my stupid and simple unproven idea: Nest the reinforcement learning algorithm. Each critic will add one more level of delay, thereby acting as a low pass filter on the supervised reward function. Since you have two critics now, you can essentially implement a hybrid pre-training + continual learning architecture. The most interesting aspect here is that you can continue training the inner-most critic without changing the outer critic, which now acts as a learned loss function. [1]
[0] https://research.google/blog/introducing-nested-learning-a-n... [1] https://news.ycombinator.com/item?id=45745402
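A minimal sketch of what the "frozen outer critic as a learned loss" idea could look like (toy code under my own assumptions, not Google's Nested Learning implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Outer level: a "critic" parameter learned during pre-training, then frozen.
alpha = 2.0

# Inner level: a small linear model that keeps learning against the frozen critic.
w_inner = np.zeros(3)

def critic_loss(pred, target):
    # The frozen outer critic acts as a learned loss for the inner level.
    return alpha * (pred - target) ** 2

def inner_update(x, target, lr=0.05):
    global w_inner
    pred = w_inner @ x
    grad = 2 * alpha * (pred - target) * x   # gradient of critic_loss w.r.t. w_inner
    w_inner -= lr * grad

# A stream of new data: only the inner level changes, so updating it
# cannot disturb whatever the outer level has already learned.
for _ in range(200):
    x = rng.normal(size=3)
    inner_update(x, target=x.sum())
```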
And some number of billions of years of evolutionary progress.
Whatever spatial understanding we have could be thought of as a simulation at a quantum level, the size of the universe, for billions of years.
And what can we simulate completely at a quantum level today? Atoms or single cells?
However, she does not mention Google's models at all. This omission makes the blog feel very much like an ad for her company rather than a good-faith guide for the field.
https://deepmind.google/models/gemini-robotics/gemini-roboti...
If code-based CAD tools were more common, and we had a bigger corpus to pull from, these tools would probably be pretty usable. Without this, however, it seems like we'll need to train against simulations of the physical world.
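For readers who haven't seen one, this is roughly what "code-based CAD" looks like; a small sketch assuming CadQuery's fluent API, with arbitrary dimensions:

```python
import cadquery as cq

# A plate with a centred hole, defined entirely in code: the kind of
# artifact a model could learn from if such corpora were larger.
plate = (
    cq.Workplane("XY")
    .box(80, 60, 10)   # length, width, thickness in mm
    .faces(">Z")       # select the top face
    .workplane()
    .hole(22)          # drill a 22 mm through-hole
)

cq.exporters.export(plate, "plate.step")
```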
"No put the thingy over there. Not that thingy!"
https://quantblog.wordpress.com/2025/10/29/digital-twins-the...
I have no doubt FeiFei and her well funded team will make rapid progress.
Once few enough dishes are broken, we will find robots useful in our homes! :)
But the reason people are really bad at evaluating them is that the details dominate. What matters here is consistency. We need invariance to some things and equivariance to others. As evaluators we tend to be hopeful, so the subtle changes from frame to frame get overlooked, though that's kind of the most important part. It can't just be similar to the last frame; it needs to be exactly the same. You need equivariance to translation, yet that's still not happening in any of these models (and it's not a limitation of attention or transformers). You're just going to have a really hard time getting all this data, even though by doing that you'll look like you're progressing because you're better fitting it. But in the end the models will need to create some compact formulation representing concepts such as motion. Or in other words, a physics. And it's not like physicists aren't known for being detail-oriented and nitpicky over nuances. That is bred into them for good reason.
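To make the translation-equivariance point concrete, here is a toy check (my own illustration, unrelated to any particular video model): a circular 1-D convolution commutes with a circular shift, i.e. shifting the input then filtering gives the same result as filtering then shifting, which is exactly the kind of consistency these generators keep violating frame to frame.

```python
import numpy as np

def shift(x, s):
    # circular translation by s samples
    return np.roll(x, s)

def circ_conv(x, k):
    # naive circular convolution of signal x with kernel k
    n, m = len(x), len(k)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(m)) for i in range(n)])

x = np.random.default_rng(0).normal(size=32)
k = np.array([0.25, 0.5, 0.25])

# shifting then filtering == filtering then shifting
assert np.allclose(circ_conv(shift(x, 3), k), shift(circ_conv(x, k), 3))
```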
We wouldn't expect an LLM to review a paper and tell us the truth the way Fermi did. That is super-intelligence.
Thanks for sharing.
Hard things take time and deep thought. But in our age we seem to not want to think deep. The environment discourages it because it takes longer. It's incredibly difficult to have both speed and quality. They are always in contention.
Mind you, there are a number of Nobel laureates who have claimed they wouldn't have succeeded in today's environment because of this [0]. I'm confident we're so concerned with finding the best that we hinder our own ability to do so.
[0] https://www.sciencealert.com/peter-higgs-says-he-wouldn-t-ha...
I was trained as a Physicist but went to software engineering. These days I would describe my job as a digital plumber. There are two kinds of holes we are dealing with: rabbit holes and potholes. We tend to get into the rabbit holes because of the love of tools and fall into potholes because of not paying attention.
It turned out the industry might do the same. But the cost would be billions of dollars and decades of work by young, talented people.
This link has a transcript.
On the other hand, it's amazing that a 512x512 image can be represented by 85 tokens (as in OAI's API), or 263 tokens per second for video (with Gemini). It's as if the memory vs compute tradeoff has morphed into a memory vs embedding question.
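Back-of-the-envelope on that compression, using the numbers quoted above:

```python
pixels = 512 * 512      # 262,144 pixels
tokens = 85             # tokens per image in that API
print(pixels / tokens)  # ~3,084 pixels per token, roughly a 55x55 patch each
```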
This dichotomy reminds me of the "Apple Rotators - can you rotate the Apple in your head" question. The spatial embeddings will likely solve dynamics questions a lot more intuitively (ie, without extended thinking).
We're also working on this space at FlyShirley - training pilots to fly then training Shirley to fly - where we benefit from established simulation tools. Looking forward to trying Fei Fei's models!
I keep going back and forth on whether I think “super-intelligence” is achievable in any form other than speed-super-intelligence, but I definitely think that being able to think 3-dimensionally capably will be a major building block to AI outthinking man, and outthinking nature.
Sort of a shitpost.
In that sense has nature just replicated its ability to organise but at the species level on a planetary (interplanetary soon) scale?
Why are humans above nature...?
Humans are maybe separate from nature primarily on the basis of our attempts (of varying success) to steer, structure, and optimize the organization of nature around us, and knowing how to do so is not an explicit aspect of reality, or at least did not make itself known to early humans, so it's reasonable to believe it's not explicit. By that, I mean you're not born with any inherent knowledge of the inner workings of quantum gravity, or of the Navier-Stokes equations, or any of the tooling that supports it, but clearly these models exist and evolve tangibly around us in every moment. We found something nature hid from the DNA-based biological tree of life, and exploited it to great effect.
Again, this is a colossal shitpost.
I bet it's a dead end.
It could be a dead end for sure. I just hope that someone figures out the `spatial` part for AIs and brings us closer to better ones.
The main problem that I still see is: we are not able to fully understand how far we can scale the current models. How much data do we need? Do we have the data for this kind of training? Can the current models generalize the world?
Probably before seeing something really interesting we need another AI winter, where researchers can be researchers and not soldiers of companies.
I think that they want to follow the same route as LLMs: no understanding of the real world, but finding a brute-force approach that's good enough in the most useful scenarios. Same as airplanes: they can't fly in a bird-like way and they can't do bird things (land on a branch), but they are crazily useful for getting to the other side of the world in a day. They need a lot of brute force to do that.
And yes, maybe an AI winter is what is needed to have the time to stop and have some new ideas.
We build a map-based productivity app for workers. We map out the workplace and use e.g. asset tracking to visualize where things are and help people find these things and navigate around. There's a lot more to this of course, but we typically geo-reference whatever building maps we can get our hands on on top of openstreetmaps. This allows people to zoom in and out and switch between indoor and outdoor.
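The geo-referencing step itself is mostly a small linear-algebra exercise; a hedged sketch under my own assumptions (three control points clicked on the plan with known lon/lat, fitted with an affine transform; all names and values are illustrative):

```python
import numpy as np

# Hypothetical control points: pixel positions on the floor plan and the
# lon/lat they correspond to on OpenStreetMap (values are made up).
px = np.array([[0, 0], [800, 0], [800, 600]], dtype=float)
geo = np.array([[13.400, 52.520], [13.410, 52.520], [13.410, 52.515]])

# Fit an affine transform [x, y, 1] -> [lon, lat] by least squares.
A = np.hstack([px, np.ones((3, 1))])
coeffs, *_ = np.linalg.lstsq(A, geo, rcond=None)

def pixel_to_lonlat(x, y):
    return np.array([x, y, 1.0]) @ coeffs

print(pixel_to_lonlat(400, 300))  # centre of the plan, in lon/lat
```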
The hard part for us: sourcing decent building maps. There usually is some building map information available in the form of cad drawings, fire escape plans, etc. But they aren't really well suited for use as a user friendly map. Also typically getting vector graphics for this is hard. In short, we usually have to spend quite a bit of effort on designing or sourcing maps. And of course these maps aren't static. People extend buildings, move equipment and machines around, and re-purpose the spaces they have. A map of a typical factory is an empty rectangle. You can see where the walls, windows, and doors are and any supporting columns. All the interesting stuff happens in the negative spaces (the blank space between the walls).
Mapping all this is a manual process because it requires people to interpret raw spatial data in context. We build our own world model. A great analogy is text based adventure games where the only map you had was what you built in your head from querying the game. It's a surprisingly hard problem. We're used to decent quality public maps outdoors; but indoors there isn't much. Making outdoor maps is something that is quite expensive but lucrative enough that companies have been investing in that for years. Also openstreetmap has tapped into a huge community of people that manually edit things and/or integrate third party data sets (a lot of stuff is imported as well).
Recently, with Google's nano banana model, creating building maps got a lot easier. It has some notion of proportions and dimensions. I was able to take a smartphone photo of the fire escape plan mounted on the wall and then let nano banana clean it up and transform it, without destroying dimensions, hallucinating new walls, doors, or windows, or changing the dimensions of rooms. We've also been experimenting with turning bitmaps into vector graphics, which can work with promising results, but this still needs work. Even just a cleaned-up fire escape plan, minus all the escape routes and other map clutter, is already a massive improvement for us. Fire escape plans are everywhere and are kind of the baseline map we can get for pretty much any building, provided they are to scale. Which, at least in Germany, they are (standards for this are pretty strict).
AI-based map content creation from photos, reference cad diagrams, textual descriptions, etc. is what we want to target next. Given some basic cad map and a photo in the building, can we deduce the vantage point from which the photo was taken and then identify things in the photo and put them on the map in the correct position. People are actually able to do this with enough context. That's what openstreetmap editors do when they add detail to the map. AI models so far don't quite do all of this yet. Essentially this is about creating an accurate world model and using that to populate maps with content. It's not just about things like lidar and stereo vision but about understanding what is what in a photo.
In any case, that's just one example of where I see a lot of potential for smarter models. Nano banana was the first model to not make a mess of our maps.
Proven correct yet again.
Human cognition isn’t built on abstract reasoning alone. It’s embodied, grounded in sensation.
Evolution didn’t achieve generalization across domains by making brains more symbolic. It did so by making them more integrated by fusing chemical gradients, touch, proprioception, light, sound, temperature, and pressure into one continuous internal narrative.
Intelligence does not seem to be an algorithmic property; it’s a felt coherence across senses. Our reasoning emerges from a complex interaction of sensory information, memory, emotions, and cognitive processing. Sensory completeness is the way forward.
Hint ... we've got a long way to go.
The notion that it is reducible to spatial "intelligence" is just another case of flavor-of-the-month thinking in AI's contraction and desperate search for success and funding.
That a world model as inconsistent and inaccurate as storytelling is given here as a premise to hoist cash for the flavor of the month, "spatial", details how desperate the search has become in Frontierland.