There is a field of study for this called statistical mechanics.
And I've always understood talking about emergence as if it were some sort of quasi-magical and unprecedented new feature of LLMs to mean, "I don't have a deep understanding of how machine learning works." Emergent behavior is the entire point of artificial neural networks, from the latest SOTA foundation model all the way back to the very first tiny little multilayer perceptron.
I always understood this to be the initial framing, e.g. in the "Language Models are Few-Shot Learners" paper, but then it got flipped around.
If you want to understand how birds fly, the fact that planes also fly is near useless. While a few common aerodynamic principles apply, both types of flight are so different from each other that you do not learn very much about one from the other.
On the other hand, if your goal is just "humans moving through the air for extended distances", it doesn't matter at all that airplanes do not fly the way birds do.
And then, on the generated third hand, if you need the kind of tight quarters maneuverability that birds can do in forests and other tangled spaces, then the way our current airplanes fly is of little to no use at all, and you're going to need a very different sort of technology than the one used in current aircraft.
And on the accidentally generated fourth hand, if your goal is "moving very large mass over very long distance", then the mechanisms of bird flight are likely to be of little utility.
The fact that two different systems can be described in a similar way (e.g. "flying") doesn't by itself tell you that they are working in remotely the same way or capable of the same sorts of things.
A system is the product of the interaction of its parts. It is not the sum of the behaviour of its parts. If a system does not exhibit some form of emergent behaviour, it is not a system, but something else. Maybe an assembly.
If putting together a bunch of X's in a jar always makes the jar go Y, then is Y an emergent property?
Or do we just need to better understand why a bunch of X's in a jar do that, at which point the property isn't emergent anymore, but rather the natural outcome of well-understood X's in a well-understood jar?
As in your example: if a bunch of X's in a jar leads to the jar tipping over, that is not emergent. That's just cause and effect. The problem to start with is that the jar containing X's is not even a system in the first place, so emergence as a concept is not applicable here.
There may be a misunderstanding on your side of the term emergence. Emergence does not equal non-understanding or some spooky-hooky force coming from the unknown. We understand the functions of the elements of a car quite well. The emergent behaviour of a car was intentionally brought about by massive engineering.
Reductionism does not lead to an explaining-away of emergence.
turned the car into a motorcycle.
here's an article with a photo for anyone who's interested: https://archive.is/y96xb
Emergent properties are unavoidable for any complex system and probably exponentially scale with complexity or something (I'm sure there's an entire literature about this somewhere).
One good instance is spandrels in evolutionary biology. The Wikipedia article is a good explanation of the subject: https://en.m.wikipedia.org/wiki/Spandrel_(biology)
LLMs approximate addition. For a long time they would produce hot garbage. Then, after a lot of training, they could sum 2-digit numbers correctly.
At this point we’d say “they can do addition”, and the property has emerged. They have passed a binary skill threshold.
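To make that threshold concrete, here's a minimal sketch (my numbers, not the commenter's; it assumes digit errors are independent): if per-digit accuracy improves smoothly during training, exact-match accuracy on multi-digit sums is roughly that accuracy raised to the number of digits, which hugs zero for a long time and then shoots up, so the apparent jump can live entirely in the metric.

    # Toy illustration: a smoothly improving per-digit accuracy p makes
    # exact-match accuracy on 10-digit sums (~p**10) look like a sudden
    # "emergent" jump, even though nothing discontinuous happened.
    def exact_match(p: float, n_digits: int = 10) -> float:
        """Probability of getting every digit right, assuming independence."""
        return p ** n_digits

    for p in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
        print(f"per-digit accuracy {p:.2f} -> exact 10-digit sums {exact_match(p):.3f}")

This is also the gist of the "choice of metric" argument quoted further down the thread.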
The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks.
I'm pretty sure that LLMs, like all big neural networks, are massively under-specified, as in there are way more parameters than data to fit (understanding that the training data set is bigger than the size of the model, but the point is that the same loss can be achieved with many different combinations of parameters). And I think of this under-specification as the reason neural networks extrapolate cleanly and thus generalize.
There needed to be someone willing to try going big at an organization with sufficient idle compute/data just sitting there, not a surprise it first happened at Google.
Every bet makes perfect sense after you consider how promising the previous one looked, and how much cheaper the compute was getting. Imagine being tasked to train an LLM in 1995: All the architectural knowledge we have today and a state-level mandate would not have gotten all that far. Just the amount of fast memory that we put to bear wouldn't have been viable until relatively recently.
I remember back in the 90s how scientists/data analysts were saying that we'd need exaflop-scale systems to tackle certain problems. I remember thinking how foreign that number was when small systems were running maybe tens of megaFLOPS. Now we have systems starting to reach zettaFLOPS (FP8, so not an exact comparison).
In other words, no one expected GPT-3 to suddenly start solving tasks without training as it did, but it was expected to be useful as an incremental improvement to what GPT-2 did. At the time, GPT-2 was seeing practical use, mainly in text generation from some initial words - at that point the big scare was about massive generation of fake news - and also as a model that one could fine-tune for specific tasks. It made sense to train a larger model that would do all that better. The rest is history.
It Is Difficult to Get a Man to Understand Something When His Salary Depends Upon His Not Understanding It
For LLMs (as with other models), many local optima appear to support roughly the same behavior. This is the idea of the problem being under-specified, i.e. many more unknowns than equations, so there are many ways to get the same result.
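A toy sketch of that underdetermination (my example, not the commenter's): fit 2 data points with 5 parameters, and the minimum-norm solution plus any null-space direction gives very different parameters with exactly the same predictions and loss.

    # Underdetermined linear "model": more parameters than data points means
    # many different weight vectors reach exactly the same loss.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2, 5))   # 2 data points, 5 unknown parameters
    y = rng.normal(size=2)

    w_min, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimum-norm exact fit

    null_basis = np.linalg.svd(X)[2][2:]            # directions X cannot "see"
    w_alt = w_min + 10.0 * null_basis[0]            # very different parameters

    print(np.allclose(X @ w_min, y), np.allclose(X @ w_alt, y))  # True True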
It's hard to explain, but higher-dimensional spaces have weird topological properties; not all of them behave the same way, and some things are perfectly doable in one set of dimensions while for others it just plain doesn't work (e.g. applying surgery to turn one shape into another).
A simple process produces the Mandelbrot set. A simple process (loss minimization through gradient descent) produces LLMs. So what plays the role of the 2D plane or dense point grid in the case of LLMs? It is the embeddings (or ordered combinations of embeddings) which are generated after pre-training. In the case of the 2D plane, the closeness between two points is determined by our numerical representation scheme. But in the case of embeddings, we learn the grid of words (playing the role of points) by looking at how the words are used in the corpus.
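For what it's worth, the "simple process" half of that analogy really is tiny; a standard sketch (nothing here is specific to embeddings) just iterates z -> z**2 + c over a grid of complex numbers and marks which points stay bounded:

    # Standard Mandelbrot sketch: a trivially simple rule, applied over a
    # dense grid of complex numbers c, yields endlessly intricate structure.
    def in_mandelbrot(c: complex, max_iter: int = 50) -> bool:
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:
                return False
        return True

    for im in range(12, -13, -2):           # imaginary axis, 1.2 .. -1.2
        print("".join("#" if in_mandelbrot(complex(re / 20, im / 10)) else " "
                      for re in range(-40, 21)))   # real axis, -2.0 .. 1.0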
The following is a quote from Yuri Manin, an eminent Mathematician.
https://www.youtube.com/watch?v=BNzZt0QHj9U
"Of the properties of mathematics, as a language, the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge. The basic examples are furnished by scientific or technological calculations: general laws plus initial conditions produce predictions, often only after time-consuming and computer-aided work. One can say that the input contains an implicit knowledge which is thereby made explicit."
I have a related idea which I picked up from somewhere which mirrors the above observation.
When we see beautiful fractals generated by simple equations and iterative processes, we give importance only to the equations, not to the Cartesian grid on which that process operates.
it doesn't seem that surprising to me.
Consider the story of Charles Darwin, who knew evolution existed, but who was so afraid of public criticism that he delayed publishing his findings so long that he nearly lost his priority to Wallace.
For contrast, consider the story of Alfred Wegener, who aggressively promoted his idea of continental drift (the forerunner of plate tectonics), but who was roundly criticized for his radical idea. By the time plate tectonics was tested and proven, Wegener was long gone.
These examples suggest that, in science, it's not the claims you make, it's the claims you prove with evidence.
Maybe it’s a variation of the “assume a frictionless spherical horse” problem but it’s very confusing.
What do you mean by this? I don’t think the understanding of LLMs is sufficient to make this claim
For instance, if you have a massive tome of English text, a rather high percentage of it will be grammatically correct (or close), syntactically well-formed and understandable, because humans who speak good English took the time to write it and wrote it the way other humans would expect to read or hear it. This, by its very nature, embeds "English language" knowledge through sequence, word choice, normally-hard-to-quantify expressions (colloquial or otherwise), etc.
When you consider source data from many modes, there's all kinds of implicit stuff that gets incidentally written. For instance, real photographs of outer space or the deep sea would only show humans in protective gear, not swimming next to the Titanic. Conversely, you won't see polar bears eating at Chipotle, or giant humans standing on top of mountains.
There's a statistical "this showed up enough in the training data to loosely confirm its existence" vs. "can't say I ever saw that, so let's just synthesize it" aspect to the embeddings that one person could interpret as "emergent intelligence", while another could just as convincingly say it's probabilistic output that is mostly in line with what we expect to receive. Train the LLM on absolute nonsense instead and you'll receive exactly that back.
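A crude way to see that "knowledge embedded in sequences" point (a toy bigram counter of my own, nowhere near an LLM, but the principle is the same): trained on coherent text it emits roughly plausible word order; trained on shuffled words it gives the nonsense straight back.

    # Toy bigram "language model": count which word follows which, then sample.
    # The statistics of the corpus are the only "knowledge" it has.
    import random
    from collections import defaultdict

    def train(words):
        follows = defaultdict(list)
        for a, b in zip(words, words[1:]):
            follows[a].append(b)
        return follows

    def generate(follows, start, n=8):
        out = [start]
        for _ in range(n):
            options = follows.get(out[-1])
            if not options:
                break
            out.append(random.choice(options))
        return " ".join(out)

    random.seed(0)
    text = ("the cat sat on the mat and the dog sat on the rug "
            "while the cat watched the dog").split()
    print(generate(train(text), "the"))                             # plausible word order
    print(generate(train(random.sample(text, len(text))), "the"))   # nonsense in, nonsense out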
> “The real question is how can we predict when a new LLM will achieve some new capability X. For example, X = ‘Write a short story that resonates with the social mood of the present time and is a runaway hit’”
Framing a capability as something that is objectively measurable (“able to perform math on the 12th grade level”, “able to write a coherent, novel text without spelling/grammar mistakes”) makes sense within the context of what the author is trying to demonstrate.
But the social proof aspect (“is a runaway hit”) feels orthogonal to it? Things can be runaway hits for social factors independently of the capability they actually represent.
AI is very good at gaming metrics, so it's difficult to list criteria where achieving them is meaningful. A hypothetical coherent novel without spelling/grammar mistakes could in effect be a copy of some existing work that shows up in its corpus; a hit, however, requires more than a reskinned story.
Empirically we observe that an LLM trained purely to predict the next token can do things like solve complex logic puzzles that it has never seen before. Skeptics claim that actually the network has seen at least analogous puzzles before and all it is doing is translating between them. However, the novelty of what can be solved is very surprising.
Intuitively it makes sense that at some level, that intelligence itself becomes a compression algorithm. For example, you can learn separately how to solve every puzzle ever presented to mankind, but that would take a lot of space. At some point it's more efficient to just learn "intelligence" itself and then apply that to the problem of predicting the next token. Once you do that you can stop trying to store an infinite database of parallel heuristics and just focus the parameter space on learning "common heuristics" that apply broadly across the problem space, and then apply that to every problem.
The question is, at what parameter count and volume of training data does the situation flip to favoring "learning intelligence" rather than storing redundant domain-specialised heuristics? And is it really happening? I would have thought just looking at the activation patterns could tell you a lot, because if common activations happen for entirely different problem spaces then you can argue that the network has to be learning common abstractions. If not, maybe it's just doing really large-scale redundant storage of heuristics.
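As a sketch of what "looking at the activation patterns" could mean (everything here is hypothetical: act_math and act_chess stand in for hidden states you would have to record from a real model, e.g. via forward hooks), you could measure how much the most active units overlap across unrelated problem domains:

    # Hypothetical sketch of the shared-abstraction check suggested above.
    # The activation vectors are random placeholders for real hidden states.
    import numpy as np

    rng = np.random.default_rng(0)
    hidden_dim = 4096
    act_math = rng.normal(size=hidden_dim)    # placeholder: math-puzzle prompt
    act_chess = rng.normal(size=hidden_dim)   # placeholder: chess-puzzle prompt

    def top_unit_overlap(a, b, k=200):
        """Fraction of the k most strongly activated units shared by both prompts."""
        top_a = set(np.argsort(-np.abs(a))[:k])
        top_b = set(np.argsort(-np.abs(b))[:k])
        return len(top_a & top_b) / k

    print(top_unit_overlap(act_math, act_chess))

High overlap across unrelated domains would hint at common abstractions; near-zero overlap would fit the redundant-heuristics picture.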
"Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."
An aircraft can approach powered flight without achieving it. With a given amount of thrust or aerodynamic characteristics, the aircraft will weigh dynamic_weight = (static_weight - x), where x is a combination of the aerodynamic characteristics and the amount of thrust applied.
In no case where dynamic_weight > 0 will the aircraft fly, even though it exhibits characteristics of flight, i.e. the transfer of aerodynamic forces to counteract gravity.
So while it progressively exhibits characteristics of flight, it is not capable of any kind of flight at all until the critical point of dynamic_weight <= 0. So the enabling characteristics are not “emergent”, but the behavior is.
I think this boils down to a matter of semantics.
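Put in the commenter's own terms (variable names theirs, the numbers made up for illustration): the underlying quantity varies smoothly with thrust, but "flies" is a hard threshold on it.

    # dynamic_weight falls smoothly as the thrust/aerodynamic contribution x
    # grows, but "flies" only flips at the threshold dynamic_weight <= 0.
    static_weight = 1000.0
    for x in range(0, 1400, 200):
        dynamic_weight = static_weight - x
        print(f"x={x:4d}  dynamic_weight={dynamic_weight:7.1f}  flies={dynamic_weight <= 0}")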
I think the word you’re looking for is “analogy”.
To me the paper is overhyped. Knowing how neural networks work, it's clear that there are going to be underlying properties that vary smoothly. This doesn't preclude the existence of emergent abilities.