To bootstrap this process, we needed successful solution traces from an open-weight reasoning model for cold-start supervised fine-tuning. That requirement led us to investigate GPT-OSS-120B. While doing so, we noticed something unexpected: simply placing the model into the interleaved thinking regime produced large and consistent score improvements on ARC AGI 2 tasks. We were seeing scores that we didn't think were possible for a medium-sized open-weight model.
This observation ultimately shifted the focus of our work: we wanted to find out how universally it holds, while staying within our resource constraints. We concluded that it applies quite generally, with double-digit gains in frontier models as well.
I have previously read debates about whether ARC AGI 2 is primarily a reasoning benchmark or a visual benchmark. I guess we can now add agentic benchmark to the mix as well!