It might be fun, for example, to simulate the evolution of multiple coupled discretized fields.
What's interesting is that the action tokens are learned from video alone. In other words, the training dataset does not include labeled actions like "go left" or "go right"; instead, the actions are inferred from how the pixels move. This means the learned actions may not map exactly to the game actions available to the player, so we (humans) cannot necessarily use this world model to play the game.
I suspect the inferred actions directly correspond to human-understandable ones: after experimenting with the action tokens, a reasonable person could probably guess what, say, the third token in the dictionary corresponds to ("jump"). This is likely because game actions are sparse (in both time and action space) and often independent/orthogonal of one another in action space.
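To make the idea above concrete, here is a toy sketch of recovering discrete "action tokens" from video with no action labels. This is not the actual method being discussed (which learns a latent action model end to end from raw frames); here a hand-crafted motion feature plus plain k-means stands in for the learned encoder and quantizer, and the 1-D "frames", width `W`, and vocabulary size `K` are all invented for illustration.

```python
import random

W = 16  # frame width; a "frame" is a 1-D image with one bright pixel
K = 3   # size of the learned action vocabulary (assumed, for the toy)

def render(pos):
    return [1.0 if i == pos else 0.0 for i in range(W)]

# Roll out a trajectory driven by hidden controls the learner never sees.
random.seed(0)
controls, frames, pos = [], [render(1)], 1
for _ in range(400):
    a = random.choice([-1, 0, 1])        # hidden control: left / idle / right
    pos = min(max(pos + a, 0), W - 1)    # clipped at the screen edges
    controls.append(a)
    frames.append(render(pos))

def argmax(f):
    return max(range(len(f)), key=lambda i: f[i])

# Position-invariant motion feature from consecutive frames
# ("which pixels moved") -- a stand-in for a learned encoder.
feats = [[float(argmax(f1) - argmax(f0))] for f0, f1 in zip(frames, frames[1:])]

# Vanilla k-means over the motion features -> K discrete action tokens.
centers = []
for d in feats:                          # initialize with K distinct features
    if d not in centers:
        centers.append(d)
    if len(centers) == K:
        break

def nearest(d):
    return min(range(K), key=lambda k: (d[0] - centers[k][0]) ** 2)

for _ in range(10):
    buckets = [[] for _ in range(K)]
    for d in feats:
        buckets[nearest(d)].append(d)
    centers = [[sum(x[0] for x in b) / len(b)] if b else c
               for b, c in zip(buckets, centers)]

tokens = [nearest(d) for d in feats]

# Each token is dominated by one hidden control, so a human could name it
# ("left", "idle", "right") -- but edge clipping means the mapping is not
# exact, echoing the caveat that learned actions need not match game inputs.
for k in range(K):
    hist = {a: 0 for a in (-1, 0, 1)}
    for c, t in zip(controls, tokens):
        if t == k:
            hist[c] += 1
    print(f"token {k}: center={centers[k][0]:+.1f}, hidden controls={hist}")
```

The printed histograms show why a human can often name a token after playing with it (each token's cluster is dominated by one control), and also why the mapping is imperfect: pressing "left" at the screen edge moves no pixels, so it gets bucketed with "idle".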
mattnewton•4mo ago
Do we know if the 3B model shown in the twitter thread is saturated and we need to train a bigger one, or if it is still converging? 3B parameters seems light for this but I don’t have a good intuition!
(Nit: the clip labeled “Zelda Ocarina of Time” is definitely showing Zelda: A Link to the Past sprites, which would make more sense, as that is a top-down 2D SNES game while Ocarina was a 3D N64 game)