We are a team of independent researchers from Germany who have been working on ARC AGI 2 since last summer. The general opinion on open-weight models is that they are too weak for this fairly difficult benchmark and score at near-noise levels. We found that GPT OSS 120B is actually much more capable than previously thought, once the interleaved thinking regime is stabilized. We let the model use a stateful IPython-based REPL via function calling and patched vLLM so that the model can reliably do interleaved thinking. The score jumped more than 4x.
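To make the "stateful REPL as a tool" idea concrete, here is a minimal sketch. It uses the stdlib `code` module as a stand-in for the IPython kernel our setup actually uses, and the `StatefulRepl` class and its method names are illustrative, not our repo's API. The key property is the same: one namespace persists across the model's tool calls, so a variable defined in one code-execution turn is still there in the next.

```python
# Minimal sketch of a stateful Python REPL exposed as a tool.
# stdlib `code` stands in for the IPython kernel used in the real
# setup; class and method names here are illustrative only.
import code
import contextlib
import io


class StatefulRepl:
    """Keeps one interpreter namespace alive across tool calls, so
    state (variables, functions, loaded grids) persists between the
    model's code-execution turns."""

    def __init__(self):
        self.interp = code.InteractiveInterpreter()

    def run(self, src: str) -> str:
        """Execute a code snippet and return captured stdout/stderr,
        which is what gets fed back to the model as the tool result."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
            # symbol="exec" lets a single call contain multiple statements
            self.interp.runsource(src, symbol="exec")
        return buf.getvalue()


repl = StatefulRepl()
repl.run("grid = [[0, 1], [1, 0]]")           # one tool call defines state
out = repl.run("print(sum(map(sum, grid)))")  # a later call reuses it
```

Interleaved thinking then amounts to letting the model alternate freely between reasoning tokens and calls to a tool like this, with each tool result appended back into the context.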
Technical write-up: https://pivotools.github.io/posts/agentic_coding_arc_agi/
Code: https://github.com/gutfeeling/arc-agi-2-submission
Data: https://huggingface.co/datasets/arcagi2/arcagi2-agentic-codi...
For safety, we support sandboxed execution using IPyBox (local Docker) and Daytona (cloud), so others can reproduce this without running untrusted, model-generated code directly on their machines.
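The local-sandbox pattern can be sketched roughly as follows: run each snippet of model-generated code in a throwaway Docker container with networking disabled. This is only an illustration of the idea, assuming a working Docker install; it is not the IPyBox or Daytona API, and `build_sandbox_cmd` / `run_sandboxed` are hypothetical names.

```python
# Sketch of the local-sandbox pattern: untrusted, model-generated
# code runs in a disposable Docker container with no network.
# Illustrative only; not the IPyBox or Daytona API.
import subprocess


def build_sandbox_cmd(src: str, image: str = "python:3.12-slim") -> list[str]:
    return [
        "docker", "run", "--rm",   # container is discarded after the run
        "--network", "none",       # untrusted code gets no network access
        image, "python", "-c", src,
    ]


def run_sandboxed(src: str, timeout: int = 60) -> str:
    """Run one snippet in a fresh container and return its stdout."""
    result = subprocess.run(
        build_sandbox_cmd(src),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout
```

Note the trade-off versus the stateful in-process REPL: a fresh container per call is simpler to isolate but loses state between calls, which is part of why dedicated sandboxes that keep a live kernel inside the container are attractive here.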
It gets more interesting: the effect seems to be general and translates seamlessly to other models without even changing prompts. We are not sure why agentic coding is so powerful on ARC AGI 2, which isn't traditionally thought of as an agentic benchmark. Perhaps code execution provides a stronger form of verification than chain-of-thought alone, or perhaps it encourages a qualitatively different style of reasoning.
We will be around for a while and would be happy to hear ideas / feedback and discuss infra issues / interleaved thinking / GPT OSS / ARC AGI 2.