This seems really similar to the motivations around masked language modeling. By providing increasingly-masked targets over time, a smooth difficulty curve can be established. Randomly masking X% of the tokens/bytes is trivial to implement. MLM can take a small corpus and turn it into an astronomically large one.
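A minimal sketch of that schedule idea, assuming you already have integer token IDs and a [MASK] id (the constant, the ramp endpoints, and the function names below are made up for illustration):

    import random

    MASK_ID = 103  # placeholder mask-token id, just for illustration

    def mask_tokens(token_ids, mask_ratio):
        """Randomly replace mask_ratio of the tokens with MASK_ID.
        Returns (corrupted, targets); targets is -100 except at masked
        positions, the usual "ignore index" convention for cross-entropy."""
        corrupted, targets = list(token_ids), [-100] * len(token_ids)
        n_mask = max(1, int(len(token_ids) * mask_ratio))
        for i in random.sample(range(len(token_ids)), n_mask):
            targets[i] = corrupted[i]
            corrupted[i] = MASK_ID
        return corrupted, targets

    def mask_schedule(step, total_steps, start=0.05, end=0.40):
        """Linearly ramp the mask ratio so early batches are easier."""
        frac = min(step / max(total_steps, 1), 1.0)
        return start + (end - start) * frac

    # Each pass re-masks the same text differently, which is where the
    # "small corpus becomes astronomically large" effect comes from:
    # corrupted, targets = mask_tokens(ids, mask_schedule(step, total_steps))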
If you sat down to solve a problem you've never seen before, you wouldn't even know what a valid "later state" looks like.
The happy Tetris bug is also a neat example of how “bad” inputs can act like curriculum or data augmentation. Corrupted observations forced the policy to be robust to chaos early, which then paid off when the game actually got hard. That feels very similar to tricks in other domains where we deliberately randomize or mask parts of the input. It makes me wonder how many surprisingly strong RL systems in the wild are really powered by accidental curricula that nobody has fully noticed or formalized yet.
omneity•1h ago
What you get is an iterator over the dataset that samples based on how far along you are in training.
0: https://github.com/omarkamali/curriculus
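Haven't checked curriculus's actual API, so the sketch below is just the general shape of "an iterator whose sampling tracks training progress"; the class and parameter names are hypothetical:

    import random

    class CurriculumSampler:
        """Toy sketch: prefer 'easy' examples early, then widen toward the
        full dataset as training progresses. Difficulty is a user-supplied
        score per example."""

        def __init__(self, dataset, difficulty, total_steps):
            # Sort indices once, from easiest to hardest.
            self.order = sorted(range(len(dataset)), key=lambda i: difficulty[i])
            self.dataset = dataset
            self.total_steps = total_steps

        def __iter__(self):
            for step in range(self.total_steps):
                progress = step / self.total_steps  # 0.0 -> 1.0 over training
                # Early on only the easiest slice is visible; later, everything.
                visible = max(1, int(len(self.order) * (0.1 + 0.9 * progress)))
                yield self.dataset[random.choice(self.order[:visible])]

    # usage: for example in CurriculumSampler(data, scores, total_steps=10_000): ...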