Outwardly, it seems to be limited by unmasking too few tokens per round, even when the heatmap shows many more high-confidence guesses available. On some of the larger puzzles it looks like it's wasting many rounds filling in the 'obvious' shapes, and then gets the interesting bit in the last round. It also doesn't seem to have learned the idea of "the background is blue with shapes drawn on top," where background is often 50% of the solution in these puzzles.
gen3•4h ago