I built this benchmark to see how LLM performance scales with puzzle complexity on a task that’s hard to solve via pattern matching alone.
It evaluates 34 models on nonograms across increasing grid sizes. One thing that stood out is how differently models approach the same task: some generate code to brute-force solutions, others reason step-by-step like a human.
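For anyone curious what the code-generation strategy looks like in practice, here's a rough sketch of the kind of brute-force solver some models produce. This is illustrative only, not the benchmark's reference implementation; the clue format and function names are my own:

```python
# Illustrative brute-force nonogram solver (not the benchmark's reference solver).
# Cells are 1 (filled) or 0 (empty); clues are run-length lists per row/column.
from itertools import product

def run_lengths(line):
    """Return the lengths of consecutive runs of 1s in a row or column."""
    runs, count = [], 0
    for cell in line:
        if cell:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def row_candidates(clue, width):
    """Enumerate every width-length row whose runs match the clue."""
    return [row for row in product((0, 1), repeat=width)
            if run_lengths(row) == clue]

def solve(row_clues, col_clues):
    """Try every combination of per-row candidates; return a grid whose columns also match."""
    width = len(col_clues)
    candidates = [row_candidates(clue, width) for clue in row_clues]
    for grid in product(*candidates):
        if all(run_lengths(col) == col_clues[i]
               for i, col in enumerate(zip(*grid))):
            return grid
    return None

# Example: a 3x3 "plus" pattern.
print(solve(row_clues=[[1], [3], [1]], col_clues=[[1], [3], [1]]))
```

This scales badly with grid size, of course, which is part of why the benchmark gets harder as the puzzles grow.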
Community feedback surfaced a few methodological issues (notably around reasoning configuration), which led to a re-run and the addition of raw prompt/output storage so failures can be inspected.
Everything is open source and rerunnable — happy to answer questions or clarify methodology.