(Shameless plug: I am one of the developers of Thinky.gg (https://thinky.gg), a puzzle game site featuring a 'shortest path' style game [Pathology] and a Sokoban variant [Sokoath].)
These games are typically NP-hard, so the techniques solvers have employed for Sokoban (or Pathology) are brute-force search with various heuristics (like BFS, deadlock detection, and Zobrist hashing). However, once levels get beyond a certain size with enough movable blocks, you end up exhausting memory pretty quickly.
These types of games are still "AI-proof" so far, in that LLMs are absolutely awful at solving them while humans are very good (so it seems reasonable to consider them for ARC-AGI benchmarks). Whenever a new reasoning model gets released, I typically try it on some basic Pathology levels (like 'One at a Time' https://pathology.thinky.gg/level/ybbun/one-at-a-time) and they fail miserably.
Simple level code for the above level (0 is empty, 1 is a wall, 2 is a movable block, 3 is the exit, 4 is the starting block):
000
020
023
041
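
For a concrete sense of the brute-force approach, here's a minimal BFS sketch over that level, assuming Sokoban-style push rules (stepping into a block shoves it one cell) and using Zobrist hashing for the visited set. It's illustrative only: the names and rules are my assumptions, not Thinky.gg's actual solver (e.g. this version allows pushing a block onto the exit).

    import random
    from collections import deque

    LEVEL = ["000",
             "020",
             "023",
             "041"]

    H, W = len(LEVEL), len(LEVEL[0])
    walls  = {(r, c) for r in range(H) for c in range(W) if LEVEL[r][c] == "1"}
    blocks = frozenset((r, c) for r in range(H) for c in range(W) if LEVEL[r][c] == "2")
    start  = next((r, c) for r in range(H) for c in range(W) if LEVEL[r][c] == "4")
    exit_  = next((r, c) for r in range(H) for c in range(W) if LEVEL[r][c] == "3")

    # Zobrist hashing: one random 64-bit key per (cell, role); a state's
    # hash is the XOR of the keys of its occupied cells. Cheap to store
    # in the visited set (collisions are possible but astronomically rare).
    rng = random.Random(0)
    zkeys = {(r, c, role): rng.getrandbits(64)
             for r in range(H) for c in range(W) for role in ("player", "block")}

    def zhash(player, blks):
        h = zkeys[player + ("player",)]
        for b in blks:
            h ^= zkeys[b + ("block",)]
        return h

    def solve():
        queue = deque([(start, blocks, [])])
        seen = {zhash(start, blocks)}
        while queue:
            player, blks, path = queue.popleft()
            if player == exit_:
                return path
            for dr, dc, name in ((-1, 0, "U"), (1, 0, "D"), (0, -1, "L"), (0, 1, "R")):
                nr, nc = player[0] + dr, player[1] + dc
                if not (0 <= nr < H and 0 <= nc < W) or (nr, nc) in walls:
                    continue
                nblks = blks
                if (nr, nc) in blks:  # stepping into a block pushes it
                    br, bc = nr + dr, nc + dc
                    if not (0 <= br < H and 0 <= bc < W) or (br, bc) in walls or (br, bc) in blks:
                        continue  # push target occupied: move is illegal
                    nblks = (blks - {(nr, nc)}) | {(br, bc)}
                h = zhash((nr, nc), nblks)
                if h not in seen:
                    seen.add(h)
                    queue.append(((nr, nc), nblks, path + [name]))
        return None  # exhausted the reachable state space: unsolvable

    print(solve())  # e.g. ['L', 'U', 'U', 'R', 'D', 'R']

The memory blow-up mentioned above falls out directly: the frontier and visited set grow with the number of (player, block-configuration) states, which is combinatorial in the number of movable blocks.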
Similar to OP, I've found Claude can't manage rule dynamics, blocked paths, or game objectives well, and spits out random results.
But I am fairly sure solutions to all of Baba Is You's levels are present in the training data for modern LLMs, so it won't make for a good eval.
Claude 4 cannot solve any Baba Is You level (except level 0, which is solved by 8 'right' inputs), so for now it's at least a nice low bar to shoot for...
In some ways, this reminds me of the history of AI for Go (the board game). But the resolution there was MCTS, which wasn't at all what we wanted (insofar as MCTS is not generalizable to most things).
I think generally the whole thing with puzzle games is that you have to determine the “right” intermediate goals. In fact, the naive intermediate goals are often entirely wrong!
A canonical Sokoban-like inversion might be where you have to push two blocks into goal areas. You might think "ok, push one block into its goal area and then push the other into its."
But many of these games have mechanisms that mean you would first want to push one block into its goal, then undo that for some reason (it might activate some extra functionality), push the other block, and then finally go back and redo the first.
There are always weird tricks that mean you're going to walk backwards before walking forwards. I don't think it's impossible for these things to stumble into it, though. They just might spin a lot of cycles to get there (humans do too, I guess).
To me, those discoveries are the fun part of most puzzle games. When you unlock the "trick" for each level and the dopamine flies, heh.
MCTS wasn't _really_ the solution to Go. MCTS-based AIs existed for years and they weren't _that_ good. They weren't superhuman for sure, and the moves/games they played were kind of boring.
The key to playing Go well was something that vaguely looks like MCTS, but where the real guts are a network that can answer "who's winning?" and "what are good moves to try here?", used to guide the search. Additionally essential was realizing that computation (running search for a while) with a bad model could be used effectively and efficiently to generate better training data to train a better model.
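
A minimal sketch of that network-guided selection step (the PUCT rule used by AlphaGo-style engines): priors from the policy head and backed-up values decide which branch to search next. The Node fields and the C_PUCT constant here are illustrative assumptions, not any particular engine's API.

    import math
    from dataclasses import dataclass, field

    C_PUCT = 1.5  # exploration constant; real engines tune this

    @dataclass
    class Node:
        prior: float            # P(move): the policy net's prior for this move
        visits: int = 0         # N: times search has tried this move
        value_sum: float = 0.0  # sum of value-net results backed up through here
        children: dict = field(default_factory=dict)

    def select_child(node):
        """Pick the child maximizing Q + U: exploit moves that searched
        well (Q) while still exploring moves the policy net likes (U)."""
        total = sum(ch.visits for ch in node.children.values())
        def puct(ch):
            q = ch.value_sum / ch.visits if ch.visits else 0.0
            u = C_PUCT * ch.prior * math.sqrt(total + 1) / (1 + ch.visits)
            return q + u
        return max(node.children.items(), key=lambda kv: puct(kv[1]))

    # Tiny demo: with no visits yet, the move with the stronger prior wins.
    root = Node(prior=1.0)
    root.children = {"a": Node(prior=0.7), "b": Node(prior=0.3)}
    print(select_child(root)[0])  # -> 'a'

The self-play loop then closes the circle: search produces moves stronger than the raw network, those games become training data, and the improved network makes the next round of search stronger still.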
It has the same problem with playing chess. But I'm not sure if there is a data type it could work with for this kind of game. Currently it seems more like LLMs can't really work on spatial problems. But this should actually be something that can be fixed (pretty sure I saw an article about it on HN recently).
I've seen AI struggle with ASCII, but when the same level is presented in other data structures, it performs better.
edit:
e.g. JSON with structured coordinates, graph-based JSON, or a semantic representation with coordinates
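
For example, the tiny Pathology level above could be re-encoded with explicit entities and coordinates instead of an ASCII grid (the field names here are made up for illustration):

    import json

    level = {
        "width": 3, "height": 4,
        "player": {"x": 1, "y": 3},   # the '4' cell
        "exit":   {"x": 2, "y": 2},   # the '3' cell
        "walls":  [{"x": 2, "y": 3}],
        "blocks": [{"x": 1, "y": 1}, {"x": 1, "y": 2}],
    }
    print(json.dumps(level))

The idea is that naming entities and giving explicit coordinates spares the model from having to infer spatial relationships from character positions in a grid.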
To the extent that the current generation of AI isn't general, yeah, papering over some of its weaknesses may allow you to expose other parts of it, both strengths and other weaknesses.