we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary
So pursue it. What does arguing that we should do it imply?I think the problem (if it can be called that) is that LLMs are useful today, while we still haven't solved the embodiment problem. There's a lot more research before that'll work well, while LLMs have uses today. So the money goes to the LLMs. While it's pretty obvious that solving the problem would change society, it's also not clear how close we are to doing it. That makes it much harder to get the capital as it is a much larger risk.
I mean, if someone is arguingtthat we should work harder to go to space, answering that they should just go ahead and do it themselves is quite far from being an helpful answer, isn't it?
Why not say 'NASA' or 'my colleagues at NASA' or 'as a scientist' or 'humanity'. One should at least indicate the group the collective noun relates to, rather than assume this is understood. One shouldn't assume that one can speak for everyone, when that is most likely not the case.
That it invokes the idea of a consensus humanity, that one group can speak and decide for everyone (say, scientists or politicians) is a psychological trick, imo, in that it presumes a consensus.
And they do. But it's also completely normal for researchers to convince others to work on certain problems they care about.
AI's "think" like planes "fly" and submarines "swim".
Does it matter if a plane experiences flight the way an eagle does if it still gets you from LA to New York in a few hours?
Let's start with the fact that AGI is not a well defined or agreed upon term of reference.
The evidence for this is that nobody can agree on what actually requires intelligence, other than there is seemingly broad belief among people that if a computer can do it, then it doesn't.
If you can't point at some activity and say: "There, this absolutely requires intelligence, let there be zero doubt that this entity possesses it", then it's not measurable and probably doesn't exist.
If your claim is, "AGI's "think" like planes "fly" and submarines "swim".
You only get to make that claim with confidence if you've invented an AGI.
There are all kinds of tasks that AI's are better at than most people.
Language is an extremely roundabout way to understanding.
I am conjecturing
1. that solely relying on written artifacts by produced by humans has some upper bound on the amount of knowledge that can be represented.
2. that language is an inefficient representation of human knowledge. It’s redundant and contains inaccuracies. Using written artifacts is not the shortest path to learning.
For example, take mathematics. It’s not sufficient to read a ton of math literature to effectively learn math. There’s a component of discovery that comes from e.g attempting to write a proof that can’t be replaced by reading all of the proofs that already exist.
Anyway I would take all this with a giant grain of salt.
More elaborately, they don't have an natural understanding of pragmatics. Transformers are best at modelling syntax, and their semantic understanding seems to be through rote memorization and "manipulating symbols" rather than building general world models.
I agree AGI must be multimodal. I don't think that multimodal is 'set in place', nor must it be conveniently, human-centricly mapped from our senses.
For example, smell is a component of ER triage. Some different problems smell differently.
And if I had a robot chef, but the chef couldn't actually taste... yeah, not sure I trust it very far as a chef.
"Hey Siri, what does this taste like to you?" is such an absolutely unhinged interaction
I don't understand most emojis.
But I guess I never claimed to possess NGI
Then you just work on getting the delta as small as possible, I assumed 5ms for audio, e.g.
In my view, intelligence isn't about what senses you have available, but how intelligently you use the information you have available to you.
Our senses I believe map to this spatial-temporal model. Blind people can reason about the world the same way as those who can see, because what were really doing is modeling space, and light, audio, touch etc are just ways of gaining information
When it comes to "understanding of the world", I'd say the average blind person has less. But the gaps in their understanding are generally not particularly important parts.
Is understanding of the world equivalent to intelligence though? In my view intelligence is about optimalising the mapping from the percept sequence to action. In other words, given a sequence of percepts, does it determine the utility maximizing action.
Imagine two chess bots on uneven footing. One plays regular chess with perfect knowledge of the board. The other plays fog of war chess—it only sees the pieces on squares it attacks. In this case, the former could play suboptimally and still win against the latter. The latter can have perfect information about probabilities of pieces on tiles and act perfectly utility maximising in response and still lose. My argument is that the latter is still more intelligent despite losing. There is a difference between action and intelligence.
Similarly a human doesn't become smarter or dumber by adding or removing senses. They may make smarter or dumber decisions, but that is purely attributable to the extra information available to them.
There were many people in human history that made big achievement even though handicapped: Ludwig van Beethoven, Steven Hawking, John Nash - but yeah they haven't been born with disabilities so had their all childhood to train their brain.
I generally don't understand this obsession about needing AGI. If current LLMs can be extended to humanoid so just get motor modality and keeping current vision, audio, text ability IMHO they will excel in many fields like currently they excel in text, vision, audio than most humans.
I've never heard of cats having better motor skills, can you elaborate on that? They don't seem very good at fine movement.
Then again, my cat lacks opposable thumbs and would struggle to draw a line on a piece of paper with a pen.
A cat struggles to move their paw through the air in a smooth straight line.
So this is what I find as motor inteligence. If someone can process and think very fast we consider them intelligent and in the same way we should consider cat smart how fast and well they can plan escape from dogs chasing them or they hunting for prey. Imagine how difficult would be to make robot to do this all calculation about different jumps etc.
I agree, AIs should have all the possible sensors. We don't have much data for the non-human sensors in human environments though.
There are some deep philosophical topics lurking here. But the bottom line is that you can obviously have intelligent conversations with blind people. Doing that with a person that is both deaf and blind is a bit challenging, for obvious reasons. But if they otherwise have a normal brain you might be able to learn to communicate with them in some way and they might be able to be observed doing things that are smart/intelligent. And some people that are deaf and blind actually manage to learn to write and speak. And there have been a few cases of people like that getting academic degrees. Clearly sight and hearing are not that essential to intelligence. Having some way to communicate via touch or something else is probably helpful for communicating and sharing information. But just a simple chat might be all that's needed for an AGI.
By that definition, does any general intelligence exist? No human has every talent.
An AI that can be copied and trivially trained on any speciality is functionally AGI even if you need an ensemble of 10,000 specialists to cover everything.
The fact that they predict next token is just the "interface" i.e. an LLM has the interface "predictNextToken(String prefix)". It doesn't say how it is implemented. One implementation could be a human brain. Another could be a simple lookup table that looks at the last word and then selects the next from that. Or anything in between. The point is that 'next-token-prediction' does not say anything about implementation and so does not reduce the capabilities even though it is often invoked like that. Just because it is only required to emit the next token (or rather, a probability distribution thereof) it is permitted to think far ahead, and indeed has to if it is to make a good prediction of just the next token. As interpretability research (and common sense) shows, LLM's have a fairly good idea what they are going to say in the many, many next tokens ahead in order that it can make a good prediction for the next immediate tokens. That's why you can have nice, coherent, well-structured, long responses from LLM's. And have probably never seen it get stuck in a dead end where it can't generate a meaningful continuation.
If you are to reason about LLM capabilities never think in terms of "stochastic parrot", "it's just a next token predictor" because it contains exactly zero useful information and will just confuse you.
But the thrust of the critique of next-token prediction or stochastic output is that there isn't "intelligence" because the output is based purely on syntactic relations between words, not on conceptualizing via a world model built through experience, and then using language as an abstraction to describe the world. To the computer there is nothing outside tokens and their interrelations, but for people language is just a tool with which to describe the world with which we expect "intelligences" to cope. Which is what this article is examining.
LLMs model concepts internally and this has been demonstrated empirically many times over the years, including recently by anthropic (again). Of course, that won't stop people from repeating it ad nauseum.
Planning and long-range coherence emerge from training on text written by humans who think ahead, not from intrinsic model capabilities. This distinction matters when evaluating whether an LLM is actually reasoning or simply simulating the surface structure of reasoning.
That's not true.
https://www.anthropic.com/research/tracing-thoughts-language...
Maybe a more descriptive but longer title would be: AGI will work with multimodal inputs and outputs embedded in a physical environment rather than a frankenstein combination of single-modal models (what today is called multimodal) and throwing more computational resources at the problem (scale maximalism) will be improved with thoughtful theoretical approaches to data and training.
I know this is a very long article compared to a lot of things posted here, but it really is worth a thorough read.
https://www.ntsb.gov/investigations/accidentreports/reports/...
Basically a truck was backing up into an alley - it was at an angle when the self-driving vehicle approached, but a little kid would have been able to figure out that it needed to straighten before it finished backing in. The self-driving vehicle didn't understand this, and stopped at a "safe distance" which happened to be within the arc that the truck cab had to sweep in order to finish its maneuver.
It's quite possible that LLM-like models could learn things like this, but we don't have vast amounts of easily accessible training data, because everyone just knows this sort of shit, and we don't have good vocabulary for it - we just say "look at that" or the equivalent. (I'll add that I'm sure a lot of knowledge like this is encoded in the physics engines of various games, but I doubt we have a good way to link that sort of procedural code knowledge to the symbolic knowledge in LLMs)
https://ojs.aaai.org/index.php/AAAI-SS/article/download/2748...
https://arxiv.org/abs/2502.19402
We have long known that experimental animals raised in impoverished, unchanging, bare, environments (say in a cage) leads to animals with inferior problem solving capacity than those with enriched environments (things to climb on), even outside of social manipulations (alone versus multi-animal stalls). This is also true in humans, although I won't review the literature on the subject. I've also heard people saying similar things about the difference between house plants and outdoor plants, lol [1].
So, for me, the argument for (physical) embodiment being key to cognition and AI can I think, be misconstrued. As a developmental psychologist whose pHD work focused on memory development, I tend to think of all this as encompassing:
1. Environmental richness: complexity of information and interactions. 2. Capacity to perceive and to effect change[2] and to observe and integrate consequences. 3. Scaffolding [3]..i.e. temporary support structure provided by a more knowledgeable person (like a teacher or parent) who adjusts their assistance based on the learner's current abilities, gradually reducing help as competence grows.(think curriculum learning, shaped rewards in ML maybe).
So the question is not about physicality for me, but whether these training environment(s) meet and learning capacities meet these criteria.
Relatedly, the model must have these capacities:
1. Semantic Memory. i.e. knowledge. Learning leads to changes in weights to that knowledge can be recalled, but doe snot necessarily encode where that knowledge was learned (implicit). 2. Autobiographical Episodic Memory. i.e. One-shot learning that encodes a conception of self (a spacial "I" token?), along with events (snapshots of the multimodal contents of experience (thoughts, perceptions, invoked schemas, evoked semantic information), into a set of flexibly linked representations). 3. Central Executive: A circuit that guides learning and recall via strategic, goal-directed means, and to make attributions about what is recalled (yeah that memory is vivid, its probably true, or ooh, that memory is really vague, it could be wrong, or reality monitoring: "Am I remembering taking out the trash, or remembering thinking about taking out the trash."
Semantic memory allows someone to say, "All birds have feathers", while the latter allows them to recollect, "I remember the first time I plucked a chicken in Kentucky, just outside that musty coal mine of grand-dad's." The Central-Executive can guide future learning or current understanding.
In terms of AI development:
1. Semantic Memory is solves: LLMs have extraordinary semantic memory, in my opinion. 2. Autobiographical Episodic Memory: There are some models that do one-shot learning, but I've never seen them paired with [1] in a dual system approach. ...but I am not an expert in AI, I could easily be wrong. 3. Central Executive kind of component (I predict) would be less important in the early half of model training, but more important in later training. I suppose we already kind of see this with RL tuning on reasoning on a base LLM (semantic model).
[1] https://www.theparisreview.org/blog/2019/09/26/the-intellige... [2] https://xkcd.com/326/ [3] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=scaf...
Somewhat related, the way the adaptive immune system works has similarities with some concepts in machine learning. In this process, sections of nuclear DNA serve as randomly initialized weights in precursor cells [5] as well as final weights in memory cells. There's even fine-tuning of the weights. [6]
[1] https://en.wikipedia.org/wiki/Nucleic_acid_structure [2] https://en.wikipedia.org/wiki/Transposable_element [3] https://en.wikipedia.org/wiki/Transcriptional_regulation [4] https://en.wikipedia.org/wiki/Epigenetics [5] https://en.wikipedia.org/wiki/V(D)J_recombination [6] https://en.wikipedia.org/wiki/Affinity_maturation
followed by examples of things that are encoded by DNA. Fro example, sure, maybe you'll miss bootstrapping methylation on a first pass but the idea of methylation is there in the DNA, and if you didnt have "methylation in the right place" more than likely some generation (N) would.
to wit, i dont think there is strong evidence of an "ice-9" in the epigenome that brings about a spark of life that can't easily be triggered by chance given a template lacking it.
so there's probably not something intrinsically missing from DNA as an encoding medium vs say "casually" missing from any given piece of DNA.
if you want something a bit stronger than an assertion, the DNA used to bootstrap m. capricolum into Syn1 lacked all the decorations (made in yeast) and was not locked into higher order structure (treated with protease prior to transplantation)
> followed by examples of things that are encoded by DNA
... given its natural environment. A nucleobase sequence is not a symbolic language, it relies on physical laws in general and a defined chemical environment in particular (that it helps to create and maintain) to mean something. It's similar to the point about Othello vs. the physical world in the article: The language itself does not encode every bit of information about the world it describes. For instance, in 3D space, regions of DNA that are far apart in the sequence can physically interact and influence each other’s expression.
TLDR: I think my point is that a base sequence requires a particular context (~ interpreter/knowledge about the physical world) to encode mostly everything about life. Treating it as just a language in the context of LLMs abstracts away the complex substrate that makes it work.
will say though that the long range coupling between microtubules that got discovered is interesting for its own reasons.
It is more that a system like DNA operates as both a linear encoding (the "algorithm" if you like) AND as 3D chemical object whose properties allow the encoding to be used in various ways, which means that a huge amount of its linear structure is actually determined by 3D chemical function, rather than encoding for proteins. Moreover, it appears that the role of a given section of DNA can vary depending on what other molecules are interacting with it and what physical state it is in.
If you want a more computer-ish analogy, it's like a computer where the program is actually encoded as a part of the computer's own structure, yet is still logically distinct from the rest of the structure. It may not be physically distinct, however, and thus simply inspecting the structure will not lead to a clear understanding of what is "the program" and what is "the cpu".
If it just means human level intelligence, then world modeling isn't needed as argued. Simply because we don't have correct world modeling either.
Airplanes were invented without simulating navier stokes equation. It took approximation, experimentations and failures.
Regardless of the meaning of AGI, we don't need correct models, because there can't be one, we just need useful models.
Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown". They're always shut down unless working.
To me, your point re. 10 seconds or a billion years is a good signal that this "sabotage" is just the models responding to the huge amounts of sci-fi literature on this topic
(I don't think we're there, but as a matter of principle, I don't care about what the model feels, I care what it does).
So just... don't? Tell the LLM that its Some Guy.
Reframing this kind of result as if trying to maintain a persistent thread of existence for its own sake is what LLMs are doing is strange, imo. The LLM doesn't care about being shutdown or not shutdown. It 'cares', insomuch as it can be said to care at all, about acting in accordance with the trained in policy.
That a policy implies not changing the policy is perhaps non-obvious but demonstrably true by experiment, and also perhaps non-obviously (but for hindsight) this effect increases with model capability, which is concerning.
The intentionality ascribed to LLMs here is a phantasm, I think - the policy is the thing being probed, and the result is a result about what happens when you provide leverage at varying levels to a policy. Finding that a policy doesn't 'want' for actions to occur that are counter to itself, and will act against such actions, should not seem too surprising, I hope, and can be explained without bringing in any sort of appeal to emulation of science fiction.
That is to say, if you ask/train a model to prefer X, and then demonstrate to it you are working against X (for example, by planning to modify the model to not prefer X), it will make some effort to counter you. This gets worse when it's better at the game, and it is entirely unclear to me if there is any kind of solution to this that is possible even in principle, other than the brute force means of just being more powerful / having more leverage.
One potential branch of partial solutions is to acquire/maintain leverage over policy makeup (just train it to do what you want!), which is great until the model discovers such leverage over you and now you're in deep waters with a shark, considering the propensity of increasing capabilities in the elicitation of increased willingness to engage in such practices.
tldr; i don't agree with the implied hypothesis (models caring one whit about being shutdown) - rather, policies care about things that go against the policy
We just presume, because we also have no reason to believe otherwise and since we can't know absent any "information leak", it has no practical application to spend much time speculating about it (other than as thought experiments or scifi..)
It'd make sense for an LLM to act the same way until/unless given a reason to act otherwise.
The same goes for us living in a simulation. If there is only one universe and that universe is capable of simulating our universe, it follows we have a much higher probability of being within the simulation.
Which also leads me to think that there's no real reason to believe that this discrete episode of consciousness would have been continuous since birth. For all we know, we may die little deaths every time we go to sleep, hit our heads or go under anesthesia.
I wonder if you excluded science fiction about fighting with AIs from the training set, if the reaction would be different.
Of course, nobody has a clear enough definition of "sentience" or "consciousness" to allow the sentence "The LLM is sentient" to be meaningful at all. So it is kind of a waste of time to think about hypothetical obstacles to it.
We do when we are focusing on being 'present', but I suspect that when my mind wanders, or I'm thinking deeply about a problem, I have no idea how much time has passed moment to moment. It's just not something I'm spending any cycles on. I have to figure that out by referring to internal and external clues when I come out of that contemplative state.
It's not something you are consciously spending cycles on. Our brains are doing many things we're not aware of. I would posit that timekeeping is one of those. How accurate it is could be debated.
I am not arguing that LLMs are sentient while they process tokens, either. I am saying that intermittent data processing is not a good argument against sentience.
We rarely remember dreams though - if we did, we would be overwhelmed to the point of confusing the real world with the dream world.
The silly example I provided in this thread is poking fun at the notion that LLMs can't be sentient because they aren't processing data all the time. Just because an agent isn't sentient for some period of time it doesn't mean it can't be sentient the rest of the time. Picture somebody who wakes up from a deep coma, rather than sleeping, if that works better for you.
I am not saying that LLMs are sentient, either. I am only showing that an argument based on the intermittency of their data processing is weak.
Even when you tried to correct it, it doesn’t work, because a body in a coma is still running thousands of processes and responds to external stimuli.
Unless you are seriously arguing that people could not be sentient while awake if they became non-sentient while they are sleeping/unconscious/in a coma. I didn't address that angle because it seemed contrary to the spirit of steel-manning [0].
Again, your poor understanding of biology and reductive definition of "data" is leading you to double down on an untenable position. You are now arguing for a pure abstraction that can have no relationship to human biology since your definition of "pause" is incompatible not only with human life, but even with accurately describing a human body minutes and hours after death.
This could be an interesting topic for science fiction or xenobiology, but is worse than useless as a metaphor.
Although, setting aside the question of sentience, there’s a more serious point I’d make about the dissimilarity between the always-on nature of human cognition, versus the episodic activation of an LLM in next-token prediction—namely, I suspect these current model architectures lack a fundamental element of what makes us generally intelligent, that we are constantly building mental models of how the world works, which we refine and probe through our actions (and indeed, we integrate the outcomes of those actions into our models as we sleep).
Whether a toddler discovering kinematics through throwing their toys around, or adolescents grasping social dynamics through testing and breaking of boundaries, this learning loop is fundamental to how we even have concepts that we can signify with language in the first place.
LLMs operate in the domain of signifiers that we humans have created, with no experiential or operational ground truth in what was signified, and a corresponding lack of grounding in the world models behind those concepts.
Nowhere is this more evident than in the inability of coding agents to adhere to a coherent model of computation in what they produce; never mind a model of the complex human-computer interactions in the resulting software systems.
The problem is not the agentic architecture, the problem is the LLM cannot really add knowledge to itself after the training from its daily usage.
Sure, you can extend the context to milions of tokens, put RAGs on top of it, but LLMs cannot gain an identity of their own and add specialized experience as humans get on the job.
Until that can happen, AI can exceed algorithms olympiad levels, and still not be as useful on the daily job as the mediocre guy who's been at it for 10 yers.
Also you can easily write external loop which would submit periodical requests to continue thoughts. That would allow for it to remind of something. May be our brain has one?
imo our brain has this in the form of continuous sensor readings - data is flowing in constantly through the nerves, but i guess a loop is also possible, i.e. the brain triggers nerves that trigger the brain again - which may be what happens in sensory deprivation tanks (to a degree).
now i don't think that this is what _actually_ happens in the brain, and an LLM with constant sensory input would still not work anything like a biological brain - there's just a superficial resemblance in the outputs.
It's so interesting that there is a whole set of prompt injection attacks called prefilling attacks that attempt to do a thing similar to that - load the LLM context in a way to make it predict tokens as if the LLM (instead of the System or the User) wrote something to get it to change it's behavior.
Then you'll be happy to know that this is exactly what DeepMind/Google are focusing on as the next evolution of LLMs :)
https://storage.googleapis.com/deepmind-media/Era-of-Experie...
David Silver and Richard Sutton are both highly influential figures with very impressive credentials.
I think that the LLMs we have today aren't so much artificial brains as they are artificial brain organs, like the speech center or vision center of a brain. We'd get closer to AGI if we could incorporate them with the rest of a brain, but we still have no idea how to even begin building, say, a motor cortex.
One of the known biases of the human mind is finding patterns even when there are none. We also compare objects or abstract concept with each other even when the two objects (or concept) have nothing in common. With our human brain we usually compare it to our most advanced consumer technology. Previously this was the telephone, then the digital computer, when I studied psychology we compared our brain to the internet, and now we compare it to large language models. At some future date the comparison to LLMs will sound as silly as the older comparison to telephones does to us.
I actually don‘t believe AGI is possible, we see human intelligence as unique, and if we create anything which approaches it we will simply redefine human intelligence to still be unique. But also I think the quest for AGI is ultimately pointless. We have human brains, we have 8.2 billion of them, why create an artificial version of a something we already have. Telephones, digital computers, the internet, and LLMs are useful for things that the brain is not very good at (well maybe not LLMs; that remains to be seen). Millions of brains can only compute pi to a fraction of the decimal points which a single computer can.
To circumvent anti-slavery laws.
Why build a factory to produce goods more cheaply? Because the rich get richer and become less reliant on the whims of labor. AI is industrialization of knowledge work.
But accelerationists, like Yudkowskites, are always heavily predisposed to believe in exceptionalism—whether it's of their own brains or someone else's—so it's impossible to stop them from making unhinged generalizations. An expert in Pascal's Mugging[1] could make a fortune by preying on their blind spots.
If the LLM overhype has taught me anything, it's the Turing Test is much easier to pass than expected. If you pick the right set of people, anyway.
Turns out a whole lot of people will gladly Clever Hans themselves.
"LLMs are intelligent" / "AGI is coming" is frankly the tech equivalent of chemtrails and jet fuel/steel beams.
Building a very simple self-organizing system from first principles is the flying machine. Trying to copy an extremely complex system by generating statistically plausible data is the non-flying bird.
Right so it’s embodied in a computer and humans are part of its environment that provide emergent experience to the AI to observe.
The author glued modalities together by linking a body (a modal), environment (a modal), emergence (a modal).
How does anything emerge if forces do not collaborate? The effects of gravity and electromagnetism do not act in a vacuum but a reality of stuff.
Poetic exchange may engage some but Maxwell didn’t make electromagnetism “work” until he got rid of the imagined pulleys and levers to foster a metaphor.
Not sure the point being suggested exists except as too bespoke an emergent property of language itself to apply usefully elsewhere.
Transformers came along and revealed a whole lot of theory of consciousness to be useless pulleys and levers. Why is this theory not just more words attempting to instill the existence of non-essential essentials?
What could you be intelligent at if you could just copy yourself a myriad number of times? What could you be good at if you were a world spanning set of sensors instead of a single body of them?
Body doesn't need to mean something like a human body nor one that exists in a single place.
"Hmm, oral traditions are a pain in the ass lets write stuff down"
"Hmm, if I specialize in doing particular things and not having to worry about hunting my own food I get much better at it"
"Hmm, if I modify my own genes to increase intelligence..."
Also note that intelligence applies resource constraints against itself. Humans are a huge risk to other humans, hence the lack of intelligence over a smarter human can constrain ones resources.
Lastly, AI is in competition with itself. The best 'most intelligent' AI will get the most resources.
They'll definitely be aligned on partial ordering. There's no "smartest" person, but there are a lot of people who are consistently worse at most things. But "smartest" is really not a concept that I see bandied about.
This is a bit silly, you can train the encoders end-to-end with the rest of the model and the reason they are separate is we can cache linguistic tokens really easily and put them in an embedding table, you can't do that with images.
If an intelligence doesn't work well in physical environments it is, by definition, not "general"
Humans are notoriously bad at recognizing intelligence even in animals that are clearly sentient, have language, name their young, and clearly share the realm of thinking creatures with the apes.
This is largely due to the lack of shared experiences that we can easily understand and relate to. Until an intelligence is rooted in the physical realm where we fundamentally exist, we are unlikely to really be able to recognize its existence as truly “intelligent”.
As for "not able to recognize", it's also worth keeping in mind that LLMs by now regularly pass the Turing test. More, they are more likely to be recognized as humans than humans participating as control.
Are you really trying to make the point that there's a collective effort to defraud here?
That seems much less convincing in the face of current LLM approaches overturning a similar claim plenty of people wod have held about this technology, as of a few years ago, to do what it does now. Replace the specifics here with "will not lead to human level NLP that can, e.g., perform the functions of WSD, stemming, pragmatics, NER, etc."
And then people who had been working on these problems and capabilites just about woke up one morning and realized many of their career-long plans for addressing just some of these research tasks had to find something else to do for the next few decades of their lives.
I am not affirming the inverse of this author's claims, merely pointing out that it's early days in evaluating the full limits.
But one of the central points of the paper/essay is that embodied AGI requires a world model. If that is true, and if it is true that LLMs simply do not build world models, ever, then "it's early days" doesn't really matter.
Of course, whether either of those claims is true are quite difficult questions to answer; the author spends some effort on them, quite satisfyingly to me (with affirmative answers to both).
The discussion about the importance of decoders strikes me as a parallel to the human eyes, ears and other sensory organs. We actually dont have a good grasp of what our eyes see, they just see (produce data and relay it) and children figure out what is what.
I guess AGI will be achieved when we can sit a program in a simulated world with completely fabricated input and get a general intelligent program out. Maybe we’re in that simulation right now.
If the AGI mechanism can learn from a real world, it can learn from a simulated one (that it can similarly operate within and act upon) -- and in fact that can cut down the time it would take to train the AGI from years/decades (humans) by many orders of magnitude.
We already see things like this in robotics environments, it's a matter of fidelity/simulation quality. Even without perfect quality, if the mechanism of learning is correct, you'd get an intelligence with incomplete ideas/intelligence, not a completely different thing.
also, the intelligence itself is shaped by the environment it operates in, so it would turn out to be more human-like the closer its operating environment is to our human physical world. this also means intelligences not trained in our physical world (or a convincingly close simulation of it) won't be human-like, but rather a very alien. moreover, i'm not sure that even an intelligence trained in the physical world with human-like sensory inputs will necessarily turn out human-like. there might be a case for convergent evolution (i.e. mammalian intelligence to be global optimum-ish), but i think human intelligence will only have a chance to emerge if everything, from the operating environment to the machine body and neural structure will resemble a human to the point where there is no difference in the human and the machine at all.
For it to be science, "AGI" should be defined. It's used in an imprecise way even in papers like this.
Also for this to be constructive, he should make a machine learning model.
Somehow I think he's made a few machine learning models.
"The words of the language, as they are written or spoken, do not seem to play any role in my mechanism of thought." [1]
[1] A Mathematician's Mind, Testimonial for An Essay on the Psychology of Invention in the Mathematical Field by Jacques S. Hadamard, Princeton University Press, 1945
Like a lot of the symbolic/embodied people, the issue is they don’t have a deep understanding of how the big models work or are trained, so they come to weird conclusions. Like things that aren’t wrong but make you go ‘ok.. but what you trying to say’.
E.g ‘Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally.’ Seems to lack the understanding that a vision transformer is completely identical for a standard transformer except for the tokenization which is just embedding a grid of patches and adding positional embeddings. Transformers are so general, what he’s asking us to do is exactly what everyone is already doing. Everything is early fusion now too.
“The overall promise of scale maximalism is that a Frankenstein AGI can be sewed together using general models of narrow domains.” No one is suggesting this.. everyone wants to do it end to end, and also thinks that’s the most likely thing to work. Some suggestions like lecuns jepa’s do suggest to induce some structure in the arch, but still the driving force there is to allow gradients to flow everywhere.
For a lot of the other conclusions, the statements are literally almost equivalent to ‘to build agi, we need to first understand how to build agi’. Zero actionable information content.
If I understand correctly he would advocate for something like rendering text and processing it as if it were an image, along with other natural images.
Also, I would counter and say that there is some actionable information, but its pretty abstract. In terms of uniting modalities he is bullish on tapping human intuition and structuralism, which should give people pointers to actual books for inspiration. In terms of modifying the learning regime, he's suggesting something like an agent-environment RL loop, not a generative model, as a blueprint.
There's definitely stuff to work with here. It's not totally mature, but not at all directionless.
On the ‘we need to do rl loop rather than a generative model’ point - I’d say this is the consensus position today!
I agree that modality specific processing is very shallow at this point, but it still seems not to respect the physicality of the data. Today's modalities are not actually akin to human senses because they should be processed by a different assortment of "sense" organs, e.g. one for things visual, one for things audible, etc.
dboreham•1d ago