I tested 51 models on a 300-puzzle subset in two modes: single-shot (output the full solution in one go) and agentic (iterate with verifier feedback).
Some results:
- Best model (GPT 5.2@xhigh) solves 56%; roughly half the puzzles are unsolved by any model.
- Agentic solves take 29 turns on average; the longest attempt ran ~1,200 turns over 14 hours.
- Cost per success varies wildly: cheapest $0.00033 (Grok 4.1 Fast Reasoning), most expensive $238.16 (Claude Sonnet 4.6, 1M context).
- Reasoning depth (e.g. @medium, @high, @xhigh) dramatically improves capability, though @xhigh runs repeatedly hit infrastructure failures.
- Stark gap between US closed models (three above 33%) and Chinese open models (best: 6%).
I made the website to show off the dataset, let you play every puzzle, and even replay every AI agent solve step-by-step (it's fun to watch how they reach solutions).
Also here's the paper: https://arxiv.org/abs/2603.02119
I didn't test human performance, but these puzzles seem pretty difficult. I'd be curious how the HN audience fares on them.