My friends and I started this project in the summer of 2025 with the initial goal of participating in the ARC Prize Kaggle competition. Early on, we were exploring agentic coding with frontier reasoning models and found that models like o3 and o4-mini could generate high-quality synthetic ARC-style puzzles. Our plan was to use these synthetic puzzles to train a smaller model via agentic reinforcement learning (RLVR with interleaved thinking).

To bootstrap this process, we needed successful solution traces from an open-weight reasoning model for cold-start supervised fine-tuning. That requirement led us to investigate GPT-OSS-120B. While doing so, we noticed something unexpected: simply placing the model into the interleaved thinking regime produced large and consistent score improvements on ARC AGI 2 tasks. We were seeing scores that we didn’t think were possible for a medium-sized OSS model.

This observation ultimately shifted the focus of our work: we wanted to find out how broadly it applies while staying within our resource constraints. We concluded that it applies quite generally, with double-digit gains in frontier models too.
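For readers who have not seen the setup, here is a toy sketch of what interleaved thinking with a stateful REPL roughly means: the model alternates between a reasoning step and a code step, and variables defined in one code step remain available in the next. This is not the actual harness. `StatefulRepl`, `solve`, and the scripted steps are hypothetical names made up for illustration, plain `exec` stands in for IPython, and a hard-coded list stands in for the model.

```python
import contextlib
import io
import traceback


class StatefulRepl:
    """Toy stand-in for a stateful REPL: variables persist across calls."""

    def __init__(self):
        # One shared namespace for every code step (assumption: a single session).
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute code and return whatever it printed, or the traceback text."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(code, self.namespace)
            except Exception:
                # Hand the error back as an observation instead of crashing.
                buffer.write(traceback.format_exc())
        return buffer.getvalue()


def solve(scripted_steps, repl):
    """Alternate (thought, code) steps, feeding each observation into a transcript."""
    transcript = []
    for thought, code in scripted_steps:
        observation = repl.run(code)
        transcript.append((thought, code, observation))
    return transcript


# Hypothetical scripted steps; in a real run the model would produce these
# one at a time after reading the previous observation.
steps = [
    ("Load the grid and check its height.",
     "grid = [[0, 1], [1, 0]]\nprint(len(grid))"),
    ("Try mirroring each row, reusing `grid` from the previous step.",
     "flipped = [row[::-1] for row in grid]\nprint(flipped)"),
]

transcript = solve(steps, StatefulRepl())
for thought, _code, observation in transcript:
    print(thought, "->", observation.strip())
```

The second step only works because `grid` survives from the first one, which is the property that separates a stateful session from a series of one-shot code executions.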

I have previously read debates about whether ARC AGI 2 is primarily a reasoning benchmark or a visual benchmark. I guess we can now add "agentic benchmark" to the mix as well!

Show HN: Solving ARC AGI 2 with interleaved thinking and stateful IPython REPL