We are a team of independent researchers from Germany who have been working on ARC AGI 2 since last summer. The general opinion on open-weight models is that they are too weak for this fairly difficult benchmark and score at near-noise levels. We found that GPT OSS 120B is actually much more capable than previously thought, once the interleaved thinking regime is stabilized. We let the model use a stateful IPython-based REPL via function calling and patched vLLM so that the model can reliably do interleaved thinking. The score jumped more than 4x.
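To make the "stateful REPL as a tool" idea concrete, here is a minimal sketch. It uses the stdlib `code` module as a stand-in for the IPython kernel our setup actually uses, and the `StatefulRepl` class and its method names are illustrative, not our repo's API. The key property is the same: one namespace persists across the model's tool calls, so a variable defined in one code-execution turn is still there in the next.

```python
# Minimal sketch of a stateful Python REPL exposed as a tool.
# stdlib `code` stands in for the IPython kernel used in the real
# setup; class and method names here are illustrative only.
import code
import contextlib
import io


class StatefulRepl:
    """Keeps one interpreter namespace alive across tool calls, so
    state (variables, functions, loaded grids) persists between the
    model's code-execution turns."""

    def __init__(self):
        self.interp = code.InteractiveInterpreter()

    def run(self, src: str) -> str:
        """Execute a code snippet and return captured stdout/stderr,
        which is what gets fed back to the model as the tool result."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
            # symbol="exec" lets a single call contain multiple statements
            self.interp.runsource(src, symbol="exec")
        return buf.getvalue()


repl = StatefulRepl()
repl.run("grid = [[0, 1], [1, 0]]")           # one tool call defines state
out = repl.run("print(sum(map(sum, grid)))")  # a later call reuses it
```

Interleaved thinking then amounts to letting the model alternate freely between reasoning tokens and calls to a tool like this, with each tool result appended back into the context.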
Technical write-up: https://pivotools.github.io/posts/agentic_coding_arc_agi/
Code: https://github.com/gutfeeling/arc-agi-2-submission
Data: https://huggingface.co/datasets/arcagi2/arcagi2-agentic-codi...
For safety, we support sandboxed execution using IPyBox (local Docker) and Daytona (cloud), so others can reproduce this without running untrusted, model-generated code directly on their machines.
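The local-sandbox pattern can be sketched roughly as follows: run each snippet of model-generated code in a throwaway Docker container with networking disabled. This is only an illustration of the idea, assuming a working Docker install; it is not the IPyBox or Daytona API, and `build_sandbox_cmd` / `run_sandboxed` are hypothetical names.

```python
# Sketch of the local-sandbox pattern: untrusted, model-generated
# code runs in a disposable Docker container with no network.
# Illustrative only; not the IPyBox or Daytona API.
import subprocess


def build_sandbox_cmd(src: str, image: str = "python:3.12-slim") -> list[str]:
    return [
        "docker", "run", "--rm",   # container is discarded after the run
        "--network", "none",       # untrusted code gets no network access
        image, "python", "-c", src,
    ]


def run_sandboxed(src: str, timeout: int = 60) -> str:
    """Run one snippet in a fresh container and return its stdout."""
    result = subprocess.run(
        build_sandbox_cmd(src),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout
```

Note the trade-off versus the stateful in-process REPL: a fresh container per call is simpler to isolate but loses state between calls, which is part of why dedicated sandboxes that keep a live kernel inside the container are attractive here.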
It gets more interesting: the effect seems to be general and translates seamlessly to other models without even changing prompts. We are not sure why agentic coding is so powerful on ARC AGI 2, which isn't traditionally thought of as an agentic benchmark. Perhaps code execution provides a stronger form of verification than chain-of-thought alone, or perhaps it encourages a qualitatively different style of reasoning.
We will be around for a while and would be happy to hear ideas / feedback and discuss infra issues / interleaved thinking / GPT OSS / ARC AGI 2.