From 0% to 36% on Day 1 of ARC-AGI-3

https://www.symbolica.ai/blog/arc-agi-3
42•lairv•2h ago

Comments

lairv•2h ago
Note that this uses a harness so it doesn't qualify for the official ARC-AGI-3 leaderboard

According to the authors the harness isn't ARC-AGI specific though https://x.com/agenticasdk/status/2037335806264971461

falcor84•2h ago
I for one think that harness development is perhaps the most interesting part at the moment and would love to have an alternative leaderboard with harnesses.
sanxiyn•2h ago
There is. The official leaderboard is without a harness, and the community leaderboard is with one. Read the ARC-AGI-3 Technical Paper for details.
falcor84•2h ago
I went through the technical paper again, and while they explain why they decided against the harness, I disagree with them - my take is that if harnesses are overfitting, then they should be penalized on the hidden test set.

Anyway, searching both in ARC-AGI's paper and website and directly on kaggle, I failed to find a with-harness leaderboard; can you please give the link?

sanxiyn•1h ago
Here it is: https://arcprize.org/leaderboard/community
steve_adams_86•38m ago
I'm so into harness development right now. Once it clicked that harnesses can bring more safety and determinism to LLMs, I started to wonder where I'd need that and why (vs MCP or just throwing Claude Code at everything), and my brain gears have been turning endlessly since then. I'd love to see more of what people do with them. My use cases are admittedly lame and boring, but it's such a fun paradigm to think and develop around.
krackers•59m ago
> this uses a harness

This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.

osti•41m ago
Don't the chat versions of ChatGPT and Gemini also interleave tool calls? Would those count as using a harness too?
esafak•2h ago
Anybody used this Agentica of theirs?
modeless•41m ago
On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
SchemaLoad•40m ago
Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the training dataset. Only the private problems actually matter.
sanxiyn•36m ago
In this case the code is public and you can see they are not cheating in that sense.
SchemaLoad•32m ago
Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.
lambda•27m ago
They aren't training new models for this. This is an agent harness for Opus 4.6.
measurablefunc•15m ago
All traffic is monitored, and all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct: even a single API call to any AI provider is sufficient to discount future results from the same provider.
Davidzheng•6m ago
I agree it's not cheating in that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You can try like 10^10 variations of harnesses and select the one that performs best. And if you then look at it, it probably won't look like it's cheating. But you have biased the estimator by selecting the harness according to its score.
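
The selection effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of any real harness: every variant is assumed to have the same true solve rate (a hypothetical 0.2), tasks are assumed to be solved independently, and the function names are made up for illustration. The task counts (25 public, 110 private) are the ones quoted earlier in the thread.

```python
import random


def evaluate(true_rate, n_tasks, rng):
    """Fraction of n_tasks solved, each solved independently with probability true_rate."""
    return sum(rng.random() < true_rate for _ in range(n_tasks)) / n_tasks


def select_best_harness(n_variants, true_rate, n_public, n_private, rng):
    """Pick the variant with the best public score, then re-score it on fresh private tasks.

    All variants share the same true_rate (an assumption), so any public-set
    advantage of the winner is pure selection noise.
    """
    public_scores = [evaluate(true_rate, n_public, rng) for _ in range(n_variants)]
    best_public = max(public_scores)
    # The winner is no better than the others, so its private score is a fresh draw.
    private_score = evaluate(true_rate, n_private, rng)
    return best_public, private_score


def average_gap(n_trials=200, n_variants=100, true_rate=0.2,
                n_public=25, n_private=110, seed=0):
    """Average the selected public score and the matching private score over many trials."""
    rng = random.Random(seed)
    public_total = 0.0
    private_total = 0.0
    for _ in range(n_trials):
        best_public, private_score = select_best_harness(
            n_variants, true_rate, n_public, n_private, rng)
        public_total += best_public
        private_total += private_score
    return public_total / n_trials, private_total / n_trials


if __name__ == "__main__":
    mean_public, mean_private = average_gap()
    print(f"mean selected public score: {mean_public:.3f}")
    print(f"mean private score:         {mean_private:.3f}")
```

Under these assumptions the selected public score sits well above the true rate, while the private score stays close to it, even though no single variant does anything that looks like cheating. Whether real harness variants behave this way depends on how many were actually tried and how correlated their results are.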

Show HN: I put an AI agent on a $7/month VPS with IRC as its transport layer

https://georgelarson.me/writing/2026-03-23-nullclaw-doorman/
146•j0rg3•5h ago•54 comments

Why so many control rooms were seafoam green (2025)

https://bethmathews.substack.com/p/why-so-many-control-rooms-were-seafoam
658•Amorymeltzer•1d ago•136 comments

Apple discontinues the Mac Pro

https://9to5mac.com/2026/03/26/apple-discontinues-the-mac-pro/
183•bentocorp•7h ago•156 comments

From 0% to 36% on Day 1 of ARC-AGI-3

https://www.symbolica.ai/blog/arc-agi-3
42•lairv•2h ago•17 comments

Judge blocks Pentagon effort to 'punish' Anthropic with supply chain risk label

https://www.cnn.com/2026/03/26/business/anthropic-pentagon-injunction-supply-chain-risk
295•prawn•4h ago•168 comments

Moving from GitHub to Codeberg, for lazy people

https://unterwaditzer.net/2025/codeberg.html
548•jslakro•14h ago•276 comments

Agent-to-Agent Pair Programming

https://axeldelafosse.com/blog/agent-to-agent-pair-programming
20•axldelafosse•2h ago•5 comments

Dobase – Your workspace, your server

https://dobase.co/
34•frenkel•3d ago•10 comments

DOOM Over DNS

https://github.com/resumex/doom-over-dns
228•Venn1•3d ago•73 comments

Chroma Context-1: Training a Self-Editing Search Agent

https://www.trychroma.com/research/context-1
17•philip1209•8h ago•1 comment

My minute-by-minute response to the LiteLLM malware attack

https://futuresearch.ai/blog/litellm-attack-transcript/
321•Fibonar•12h ago•129 comments

We rewrote JSONata with AI in a day, saved $500k/year

https://www.reco.ai/blog/we-rewrote-jsonata-with-ai
77•cjlm•5h ago•74 comments

Anthropic Subprocessor Changes

https://trust.anthropic.com
53•tencentshill•6h ago•29 comments

Chicago artist creates tourism posters for city's neighborhoods

https://www.chicagotribune.com/2026/03/25/chicago-neighborhood-posters/
68•NaOH•4h ago•31 comments

HandyMKV for MakeMKV and HandBrake Automation

https://github.com/dmars8047/handymkv
10•geerlingguy•1h ago•1 comment

Whistler: Live eBPF Programming from the Common Lisp REPL

https://atgreen.github.io/repl-yell/posts/whistler/
43•varjag•3d ago•2 comments

We haven't seen the worst of what gambling and prediction markets will do

https://www.derekthompson.org/p/we-havent-seen-the-worst-of-what
624•mmcclure•8h ago•450 comments

HyperAgents: Self-referential self-improving agents

https://github.com/facebookresearch/hyperagents
147•andyg_blog•2d ago•57 comments

OpenTelemetry profiles enters public alpha

https://opentelemetry.io/blog/2026/profiles-alpha/
153•tanelpoder•11h ago•20 comments

$500 GPU outperforms Claude Sonnet on coding benchmarks

https://github.com/itigges22/ATLAS
112•yogthos•10h ago•39 comments

John Bradley, author of xv, has died

https://voxday.net/2026/03/25/rip-john-bradley/
237•linsomniac•9h ago•71 comments

Generators in Lone Lisp

https://www.matheusmoreira.com/articles/generators-in-lone-lisp
8•matheusmoreira•3d ago•0 comments

Using FireWire on a Raspberry Pi

https://www.jeffgeerling.com/blog/2026/firewire-on-a-raspberry-pi/
67•jandeboevrie•7h ago•29 comments

CERN to host a new phase of Open Research Europe

https://home.cern/news/news/cern/cern-host-europes-flagship-open-access-publishing-platform
204•JohnHammersley•8h ago•17 comments

Running Tesla Model 3's computer on my desk using parts from crashed cars

https://bugs.xdavidhu.me/tesla/2026/03/23/running-tesla-model-3s-computer-on-my-desk-using-parts-...
872•driesdep•1d ago•301 comments

Show HN: Veil – Dark mode PDFs without destroying images, runs in the browser

https://veil.simoneamico.com/
53•simoneamico•16h ago•9 comments

Show HN: Fio: 3D World editor/game engine – inspired by Radiant and Hammer

https://github.com/ViciousSquid/Fio
45•vicioussquid•7h ago•4 comments

Order Granting Preliminary Injunction – Anthropic vs. U.S. Department of War [pdf]

https://storage.courtlistener.com/recap/gov.uscourts.cand.465515/gov.uscourts.cand.465515.134.0.pdf
120•theindieman•5h ago•18 comments

Show HN: Turbolite – a SQLite VFS serving sub-250ms cold JOIN queries from S3

https://github.com/russellromney/turbolite
121•russellthehippo•9h ago•29 comments

Colibri – chat platform built on the AT Protocol for communities big and small

https://colibri.social/
108•todotask2•10h ago•66 comments