182 test cases, 4 tunings, 14 models via OpenRouter. Two open-weight Qwen models from Alibaba crushed everything else (83.5%), while most "flagship" models scored below 50%. MiniMax M2.5 scored worse than random guessing.
Everything is open source: https://github.com/jmcapra/FretBench
I'm curious whether the performance gap is related to tokenisation of ASCII art. If anyone has insights on how different tokenisers handle grid-structured text, I'd love to hear them.
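
To make the tokenisation question concrete, here is a rough sketch of the effect I have in mind. Everything in it is an assumption for illustration: the two-row grid is made up (it is not taken from the benchmark), and `toy_tokenise` is a hypothetical stand-in that merges runs of dashes into chunks of up to four characters, not any real model's tokeniser. It only shows how the same character column can end up at a different token position on each row.

```python
import re


def toy_tokenise(line, max_run=4):
    """Toy stand-in for a subword tokeniser (an assumption, not a real one).

    Runs of '-' are split greedily into chunks of up to max_run characters;
    every other character becomes its own token.
    """
    tokens = []
    for match in re.finditer(r"-+|[^-]", line):
        piece = match.group()
        if piece[0] == "-":
            tokens.extend(piece[i:i + max_run] for i in range(0, len(piece), max_run))
        else:
            tokens.append(piece)
    return tokens


def token_index_of_column(tokens, column):
    """Return the index of the token that covers the given character column."""
    start = 0
    for index, token in enumerate(tokens):
        if start <= column < start + len(token):
            return index
        start += len(token)
    raise IndexError(f"column {column} is past the end of the line")


# Hypothetical grid: both rows have a '3' in character column 8.
grid = [
    "e|------3---|",
    "B|--1---3---|",
]
column = 8

for row in grid:
    tokens = toy_tokenise(row)
    print(tokens, "-> column", column, "is token", token_index_of_column(tokens, column))
```

With this toy scheme the `3` in column 8 is token 4 on the first row and token 5 on the second, so vertical alignment that is obvious to the eye is not preserved in token positions. Running the same comparison with each model's actual tokeniser would be the real test.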