Train PPO or DQN on one layout and it solves that layout. Shift the key, add or move a wall passage, or change the distractor key setup, and performance collapses. The usual story: the agent memorises geometry instead of learning the rules.
Instead, I train a small set of skills: find the correct key, go to the passage, open the correct door, reach the goal. Each skill is trained once, then frozen. When the layout changes, nothing updates; the system retrieves the right skills from long-term memory and composes them.
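A minimal sketch of that structure, assuming a simple sequential controller over frozen skills. All names (`Skill`, `Controller`, the `done` predicates) are illustrative, not the actual implementation, and the frozen policies are replaced by stubs that just mark their subgoal reached so the sketch runs:

```python
class Skill:
    """One frozen skill: a fixed policy plus a subgoal predicate."""
    def __init__(self, name, done_predicate):
        self.name = name
        self.done = done_predicate  # true when the subgoal is satisfied

    def act(self, state):
        # In the real system this would be a trained, frozen policy;
        # here the stub simply marks the subgoal reached.
        state[self.name] = True
        return state


class Controller:
    """Sequences frozen skills; nothing is updated at composition time."""
    def __init__(self, skills):
        self.skills = skills

    def run(self, state):
        for skill in self.skills:
            while not skill.done(state):
                state = skill.act(state)
        return state


skills = [
    Skill("have_key",   lambda s: s.get("have_key", False)),
    Skill("at_passage", lambda s: s.get("at_passage", False)),
    Skill("door_open",  lambda s: s.get("door_open", False)),
    Skill("at_goal",    lambda s: s.get("at_goal", False)),
]
final = Controller(skills).run({})
print(final["at_goal"])  # True
```

The point of the shape: when the layout changes, only the `done` predicates see different states; the skill policies themselves never update.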
The state space is already large if treated symbolically: roughly 50 reachable cells for the agent, 50 for the key, 4 door configurations, multiple passage layouts, 3 inventory values, and 4 headings. A conservative count gives around 360,000 distinct logical states.
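The arithmetic, with the one assumption that "multiple passage layouts" means 3 (the value that makes the stated total come out):

```python
# Conservative product over the symbolic state factors.
agent_cells     = 50
key_cells       = 50
door_configs    = 4
passage_layouts = 3   # assumed; not stated explicitly above
inventory_vals  = 3
headings        = 4

total = (agent_cells * key_cells * door_configs
         * passage_layouts * inventory_vals * headings)
print(total)  # 360000
```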
At composition time, the system only reuses states it actually encountered during skill training. No gradients. No online policy adaptation.
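One way to read "only reuses states it actually encountered": long-term memory is a set of abstract states filled during skill training, and composition is pure lookup. The abstraction function and state tuple below are my guesses at the shape, not the actual code:

```python
def abstract(obs):
    # Project a raw observation down to the symbolic state used as a key.
    return (obs["agent"], obs["key"], obs["door"], obs["inv"])

# Filled once, during skill training; never written to afterwards.
visited = {((2, 3), (5, 1), "closed", 0)}

def known(obs):
    # Pure set membership: no gradients, no online adaptation.
    return abstract(obs) in visited

print(known({"agent": (2, 3), "key": (5, 1), "door": "closed", "inv": 0}))  # True
print(known({"agent": (0, 0), "key": (5, 1), "door": "closed", "inv": 0}))  # False
```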
Benchmark: 2,500 zero-shot episodes with randomised key and passage placements. No retraining. Solve rate: about 94%.
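The shape of that evaluation protocol, as a sketch: N randomised layouts, no weight updates anywhere in the loop, solve rate is just a count. `solve` here is a placeholder for running the composed frozen skills, and the layout generator is invented for illustration:

```python
import random

def random_layout(rng):
    # Hypothetical randomisation: key position and passage variant.
    return {"key": (rng.randrange(10), rng.randrange(10)),
            "passage": rng.randrange(3)}

def solve(layout):
    # Placeholder for the frozen-skill agent; always succeeds here.
    return True

rng = random.Random(0)
episodes = 2500
solved = sum(solve(random_layout(rng)) for _ in range(episodes))
print(solved / episodes)
```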
Frozen skills. New layouts. Still works.
So here's the real question: if hierarchical RL is supposed to solve this, why does it still struggle with such a tiny, structured world unless it's trained across every variation? Or am I wrong?
And what’s actually being learned when a system generalises to layouts it has never seen?
I'm interested in that discussion. The gap between "this looks trivial" and "most agents don't generalise" feels like the interesting thing here.