But I think the paper fails to answer the most important question. It alleges that this isn't a statistical model: "it is not a statistical model that predicts the most likely next state based on all the examples it has been trained on.
We observe that it learns to use its attention mechanism to compute 3x3 convolutions — 3x3 convolutions are a common way to implement the Game of Life, since it can be used to count the neighbours of a cell, which is used to decide whether the cell lives or dies."
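For context on the quoted claim: counting a cell's live neighbours is exactly a 3x3 convolution, and the Life rule follows from that count. A minimal NumPy sketch of the idea (wrap-around boundary is my assumption here; the post may handle edges differently):

```python
import numpy as np

def life_step(grid):
    # Count each cell's 8 neighbours by summing shifted copies of the grid.
    # This is exactly a 3x3 convolution with a kernel of ones and a zero
    # in the centre (toroidal / wrap-around boundary assumed).
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Conway's rule: alive next step with exactly 3 neighbours,
    # or currently alive with exactly 2.
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(grid.dtype)

grid = (np.random.rand(32, 32) < 0.5).astype(np.int8)
next_grid = life_step(grid)
```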
But it is never actually shown that this is the case. Later on, the claim isn't even restated; the metric used is simply that the net gives the correct answers often enough, as a test for convergence, not that it has converged to weights that implement the correct algorithm.
So there is no guarantee that it has actually learned the game. There are still learned parameters, and the paper doesn't investigate whether those parameters have converged to something where the net is genuinely just a computation of the algorithm. The most interesting question is left unanswered.
(This could be shown by comparing the learned attention matrix to a manually constructed "neighbour attention" matrix, which is known to be equivalent to a 3x3 convolution.)
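Concretely, something like the sketch below is what I mean. The matrix construction and the overlap metric are my own, purely illustrative; they are not taken from the paper:

```python
import numpy as np

def neighbour_attention_matrix(h, w):
    """N x N matrix (N = h*w) with a 1 wherever cell j is one of the 8
    neighbours of cell i on the flattened grid. Applying it to a flattened
    board is the same linear map as a neighbour-counting 3x3 convolution
    (non-wrapping boundary assumed here)."""
    n = h * w
    A = np.zeros((n, n), dtype=np.float32)
    for r in range(h):
        for c in range(w):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) != (0, 0) and 0 <= rr < h and 0 <= cc < w:
                        A[r * w + c, rr * w + cc] = 1.0
    return A

def neighbour_mass(attn, h, w):
    """Fraction of learned attention mass that lands on true neighbour
    positions; `attn` is a hypothetical (N, N) attention matrix from the model."""
    mask = neighbour_attention_matrix(h, w) > 0
    attn_n = attn / (attn.sum(axis=1, keepdims=True) + 1e-9)
    return float((attn_n * mask).sum() / attn_n.sum())
```

If the learned attention really computes a 3x3 convolution, essentially all of its mass should sit on those banded off-diagonal positions, and one could go further and check that the downstream weights turn the neighbour count into the birth/survival rule.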
"We detected that the model had converged by looking for 1024 training batches with perfect predictions, and that it could perfectly run 100 Life games for 100 steps." This would be superfluous (and even a pretty bizarre methodology) if the shape of the attention matrix was proof that the Network performed the actual game of life algorithm.
Just to be clear, I am not saying the NN isn't converging to some computation that would also appear in other algorithms. I am saying the paper does not investigate whether the resulting NN actually performs the Game of Life algorithm. The convolution part is certainly evidence, but it would have been worthwhile to look at the resulting net and work out whether the trained weights, taken together, actually form the algorithm. That is also the only way to establish the truth of the initial claim: that this isn't just a statistical model, but an actual algorithm.
It's similar to comparing a hardware radio and a software-defined radio: yes, we already know how to build a radio in hardware, but a software-defined one offers greater flexibility.
Likewise, for learning the English language, we don't fully understand the way LLMs work; we can't fully characterize it. So we have debates about whether the LLM actually understands English, or understands what it's talking about. We simply don't know.
The results here show that the transformer understands the Game of Life. Or at least, whatever the transformer does with the rules of the Game of Life, it's safe to say that it fits a definition of understanding as mankind knows it.
As in much of machine learning, where we use the abstraction of curve fitting to understand higher-dimensional learning, we can make the same kind of extrapolation here.
If the transformer understands the game of life then that understanding must translate over to the LLM. The LLM understands English and understands the contents of what it is talking about.
There was a clear gradient of understanding before its grasp of the Game of Life hit saturation: the transformer lived in a state where it didn't get everything right, but it understood the Game of Life to a degree.
We can extrapolate that gradient to LLMs as well. LLMs are likely on that gradient, not yet at saturation. Either way, I think it's safe to say that LLMs understand what they are talking about; they just haven't hit saturation yet. There are clearly things that we as humans understand better than the LLM.
But let’s extrapolate this concept to an even higher level:
Have we as humans hit saturation yet?
Only 512 training examples are needed for that, and it would be a lot more interesting if a learning algorithm were able to fit that 3x3 convolution layer from those 512 examples. IIRC, and don't quote me on this, that has not been done.
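To spell out where the 512 comes from: a cell's next state depends only on its 3x3 neighbourhood, and there are 2^9 = 512 possible neighbourhoods, so the complete truth table can be enumerated directly. A minimal sketch:

```python
import numpy as np
from itertools import product

# Every possible 3x3 neighbourhood, paired with the centre cell's next state.
# 2**9 = 512 entries -- the "512 training examples" above.
examples = []
for bits in product((0, 1), repeat=9):
    patch = np.array(bits).reshape(3, 3)
    centre = patch[1, 1]
    neighbours = patch.sum() - centre
    alive_next = int(neighbours == 3 or (centre == 1 and neighbours == 2))
    examples.append((patch, alive_next))

assert len(examples) == 512
```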
I firmly believe that differentiable logic CA is the winner, in particular because it extracts the logic directly, and thus leads to generalizable programs rather than staying stuck in matrix-multiplication land.
montebicyclelo•2h ago
If by small grid you are referring to the attention matrix plot shown, then that is not a correct interpretation. The diagonal-like pattern it learns is a 3x3 convolution, so it can compare the neighbours of a given cell.
Edit: and note that every grid it is trained on / runs inference on is randomly generated and completely unique, so it cannot just memorise examples
yorwba•1h ago
Using 2D RoPE instead would in principle allow scaling up as well, and maybe even period detection if you train it across a range of grids, but would eventually hit the same issues that plague long-context scaling in LLMs.
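For anyone unfamiliar, here is a minimal NumPy sketch of one common 2D RoPE construction (split the head dimension, rotate one half by the row index and the other by the column index). This is my own illustrative assumption about the construction, not a claim about what any particular implementation does:

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding over the last axis of x (even dim),
    using positions `pos` broadcastable to x's leading shape."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    ang = pos[..., None] * theta                # (..., d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """Rotate half the head dimension by the row coordinate and the other
    half by the column coordinate, so relative (dr, dc) offsets are encoded
    independently of the grid size."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], axis=-1
    )

# Usage on a flattened h x w grid of tokens with head dim 64:
h, w, d = 16, 16, 64
tokens = np.random.randn(h * w, d)
rows, cols = np.divmod(np.arange(h * w), w)
q = rope_2d(tokens, rows, cols)
```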
bernb•1h ago
Would it be possible to train an LLM on the rules the way we would teach them to a human?