Show HN: Pencil Puzzle Bench – LLM Benchmark for Multi-Step Verifiable Reasoning

https://ppbench.com/
4•bluecoconut•2h ago
I've been working on applying LLMs to long-context, verifiable problems over the past year, and today I'm releasing a benchmark of 62,000 pencil puzzles across 94 types (sudoku, nonori, slitherlink, etc.). For every variety, the benchmark also supports intermediate checks for rule breaks at any step.
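To give a feel for what an intermediate rule-break check means, here is a minimal sketch for sudoku only. It is not the benchmark's checker: the function name `sudoku_violations`, the grid encoding (0 for an empty cell), and the message format are all assumptions made for illustration.

```python
from collections import Counter

def sudoku_violations(grid):
    """Return rule breaks in a partially filled 9x9 grid (0 = empty cell).

    Hypothetical helper: it only reports duplicate digits in a row,
    column, or 3x3 box, so it works on unfinished grids at any step.
    """
    units = []
    for i in range(9):
        units.append((f"row {i}", [grid[i][c] for c in range(9)]))
        units.append((f"column {i}", [grid[r][i] for r in range(9)]))
    for br in range(0, 9, 3):
        for bc in range(0, 9, 3):
            cells = [grid[r][c]
                     for r in range(br, br + 3)
                     for c in range(bc, bc + 3)]
            units.append((f"box ({br // 3}, {bc // 3})", cells))

    violations = []
    for name, cells in units:
        # Count only filled cells; empty cells never break a rule.
        counts = Counter(v for v in cells if v != 0)
        for digit, n in sorted(counts.items()):
            if n > 1:
                violations.append(f"{name}: digit {digit} appears {n} times")
    return violations

# Two 5s in the first row, in different columns and different boxes.
partial = [[0] * 9 for _ in range(9)]
partial[0][0] = 5
partial[0][3] = 5
print(sudoku_violations(partial))
```

An empty list means no rule is broken so far, which is different from the puzzle being solved.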

I tested 51 models against a subset (300 puzzles) in two modes: single-shot (output the full solution) and agentic (iterate with verifier feedback).
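The two modes can be sketched roughly as below. This is an illustration under assumptions, not the benchmark harness: `single_shot`, `agentic`, `toy_model`, `toy_verify`, and the `max_turns` parameter are hypothetical names, and the "model" is a stand-in function rather than a real LLM call.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces, chosen for this sketch only.
Verifier = Callable[[str], Tuple[bool, List[str]]]  # attempt -> (solved, rule breaks)
Model = Callable[[str, List[str]], str]             # (puzzle, feedback so far) -> attempt

def single_shot(model: Model, verify: Verifier, puzzle: str) -> bool:
    """One call, no feedback: the full solution is either right or wrong."""
    solved, _ = verify(model(puzzle, []))
    return solved

def agentic(model: Model, verify: Verifier, puzzle: str,
            max_turns: int = 50) -> Optional[int]:
    """Iterate with verifier feedback; return the winning turn, or None."""
    feedback: List[str] = []
    for turn in range(1, max_turns + 1):
        attempt = model(puzzle, feedback)
        solved, breaks = verify(attempt)
        if solved:
            return turn
        feedback.append(f"attempt {attempt!r} rejected: {'; '.join(breaks)}")
    return None

# Toy stand-ins so the sketch runs end to end.
def toy_verify(attempt: str) -> Tuple[bool, List[str]]:
    return (attempt == "7", [] if attempt == "7" else ["wrong value"])

def toy_model(puzzle: str, feedback: List[str]) -> str:
    # Guesses 5, then 6, then 7, ... one step per piece of feedback.
    return str(5 + len(feedback))

print(single_shot(toy_model, toy_verify, "toy puzzle"))  # False
print(agentic(toy_model, toy_verify, "toy puzzle"))      # 3
```

The toy model fails single-shot but succeeds on its third agentic turn, which is the kind of gap the two modes are meant to expose.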

Some results:

- Best model (GPT 5.2@xhigh) solves 56%. About half the puzzles are unsolved by any model.

- Agentic solves average 29 turns. The longest attempt took ~1,200 turns over 14 hours.

- Cost per success varies wildly: the cheapest is $0.00033 (Grok 4.1 Fast Reasoning) and the most expensive is $238.16 (Claude Sonnet 4.6, 1M context)

- Reasoning depth (e.g. @medium, @high, @xhigh) dramatically improves capability (up to the point of repeated infrastructure failures at @xhigh)

- Stark difference between US closed models (3 at >33%) and Chinese open models (top: 6%)
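For the cost-per-success figure in the list above, one plausible definition is total spend across all attempts divided by the number of solved puzzles. The post does not spell out the exact formula, so treat this as an assumption; `cost_per_success` is a hypothetical helper and the numbers below are made up, not benchmark data.

```python
from typing import List, Optional

def cost_per_success(costs_usd: List[float],
                     solved: List[bool]) -> Optional[float]:
    """Total spend over all attempts divided by the number of solves.

    Assumed definition: failed attempts still count toward the cost.
    Returns None when nothing was solved, since the ratio is undefined.
    """
    if len(costs_usd) != len(solved):
        raise ValueError("costs_usd and solved must have the same length")
    successes = sum(solved)
    if successes == 0:
        return None
    return sum(costs_usd) / successes

# Made-up example: four attempts, two solves, $2.00 spent in total.
example_costs = [0.5, 0.25, 0.25, 1.0]
example_solved = [True, False, False, True]
print(cost_per_success(example_costs, example_solved))  # 1.0
```

Under this definition a model that is cheap per call can still be expensive per success if it rarely solves anything.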

I made the website to show off the dataset: you can play every puzzle and even replay every AI agent's solve step by step (it's fun to watch how it gets to solutions).

Also here's the paper: https://arxiv.org/abs/2603.02119

I didn't test human ability to solve them, but these puzzles seem pretty difficult. I'd be curious how the HN audience fares on them.

The gap between ICP documents and buyer understanding in B2B sales

https://artemisgtm.ai/blog/why-most-b2b-companies-get-icp-wrong
1•thegtmauditguy•1m ago•1 comment

Academics Need to Wake Up on AI

https://alexanderkustov.substack.com/p/academics-need-to-wake-up-on-ai
1•verdverm•1m ago•0 comments

Qwen Tech Lead Steps Down

https://twitter.com/JustinLin610/status/2028865835373359513
1•informal007•1m ago•0 comments

Fire the CEO, Introducing the AxO's

https://boringops.sh/articles/fire_the_ceo/
1•boringops-dan•2m ago•0 comments

Mpv Is the MVP of Video and Image Viewing

https://nickjanetakis.com/blog/mpv-is-the-mvp-of-video-and-image-viewing
1•nickjj•2m ago•0 comments

Deprecate confusing APIs like "os.path.commonprefix()"

https://sethmlarson.dev/deprecate-confusing-apis-like-os-path-commonprefix
1•todsacerdoti•2m ago•0 comments

Ask HN: Using AI at work is stupidity, or a good tool if used properly?

1•MrLey•7m ago•0 comments

Show HN: DocAPI – HTTP 402 as designed: agents register, pay USDC, run forever

https://www.docapi.co
1•siwandev•9m ago•1 comment

Why exe.dev VMs are persistent

https://blog.exe.dev/persistent
2•tosh•9m ago•0 comments

Gram 1.0 Released

https://gram.liten.app/posts/first-release/
1•birdculture•11m ago•0 comments

OpenAI releases GPT-5.3 Instant update to make ChatGPT less 'cringe'

https://9to5mac.com/2026/03/03/openai-releases-gpt-5-3-instant-update-to-make-chatgpt-less-cringe/
1•HiroProtagonist•13m ago•0 comments

Beatport and Beatsource to Unite into One Premium DJ Platform

https://www.beatportal.com/articles/1291036-beatport-and-beatsource-to-unite-into-one-premium-dj-...
1•DocFeind•13m ago•0 comments

Identity Formation and the Politics of Belonging: Bengali Migrants in Kerala [pdf]

https://www.aijfr.com/papers/2025/5/1400.pdf
1•thunderbong•13m ago•0 comments

Ask HN: What are your go to sources for relatively unbiased global news?

1•Jimmc414•13m ago•0 comments

Show HN: Voquill, an open source and cross-platform alternative to wisprflow

https://github.com/josiahsrc/voquill
1•josiahsrc•14m ago•0 comments

The unfortunate need for an "age verification" API for legal compliance

https://lists.ubuntu.com/archives/ubuntu-devel/2026-March/043510.html
2•turrini•14m ago•0 comments

OpenClaw Partners with VirusTotal for Skill Security

https://openclaw.ai/blog/virustotal-partnership
1•breitkreutz•15m ago•0 comments

Blocking a brain receptor may calm blood pressure signals

https://medicalxpress.com/news/2026-02-clue-hypertension-blocking-brain-receptor.html
2•PaulHoule•17m ago•0 comments

Show HN: Mozilla.ai introduces Clawbolt, an AI Assistant for the trades

https://github.com/mozilla-ai/clawbolt
6•river_otter•17m ago•0 comments

Claude and Pentagon whole fight timeline

https://www.youtube.com/watch?v=Ph8CrTNlWbM
2•ashutosh0707•18m ago•0 comments

New tool for designing software architecture diagrams and presentations

https://savnet.co/networks/designer
1•oscarricardosan•18m ago•0 comments

Section 230 is the best protection we have from Trump's censorship

https://www.ms.now/opinion/section-230-trump-free-speech
1•01-_-•18m ago•0 comments

Cofounder search: An internet-native way to do ML and bio research

https://labless.bio
1•jeremykalfus•19m ago•1 comment

The Making of the Atomic Bomb book predicted the AI crisis before it happened

https://blog.adafruit.com/2026/03/03/the-making-of-the-atomic-bomb-1986-by-richard-rhodes/
1•ptorrone•19m ago•0 comments

Show HN: SmartRuler Pro – ESP32-powered motorized ruler with 0.5mm precision

https://smart-ruler.bunnytech.io/
1•iosifnicolae2•19m ago•0 comments

Show HN: HackerNews.pink – A PWA HN reader with personalized recommendations

https://hackernews.pink/
1•gurkenkoenig•20m ago•0 comments

Show HN: SOTA long memory eval with open source models

https://ensue.dev/blog/beating-memory-benchmarks/
3•austinbaggio•20m ago•0 comments

Wormhole Vectors with Trey Grainger

1•CShorten•20m ago•0 comments

Why payment fees matter more than you think

https://cuencahighlife.com/why-payment-fees-matter-more-than-you-think/
2•dxs•20m ago•0 comments

GitLab Active Incident

https://status.gitlab.com
1•ustad•21m ago•0 comments