I tested 51 models on a 300-puzzle subset in two modes: single-shot (output the full solution in one go) and agentic (iterate with verifier feedback).
Some results:
- Best model (GPT 5.2@xhigh) solves 56%; roughly half the puzzles are unsolved by any model.
- Agentic solves take 29 turns on average; the longest attempt ran ~1,200 turns over 14 hours.
- Cost per success varies wildly: cheapest $0.00033 (Grok 4.1 Fast Reasoning), most expensive $238.16 (Claude Sonnet 4.6, 1M context).
- Reasoning depth (e.g. @medium, @high, @xhigh) dramatically improves capability, though @xhigh runs repeatedly hit infrastructure failures.
- Stark gap between US closed models (three above 33%) and Chinese open models (best: 6%).
I made the website to show off the dataset, let you play every puzzle, and even replay every AI agent solve step-by-step (it's fun to watch how they reach solutions).
Also here's the paper: https://arxiv.org/abs/2603.02119
I didn't test human performance, but these puzzles seem pretty difficult. I'd be curious how the HN audience fares on them.