
How We Broke Top AI Agent Benchmarks: And What Comes Next

https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
74•Anon84•2h ago

Comments

ggillas•1h ago
This is a phenomenal paper on exploits and hopefully changes the way benchmarking is done.

From the paper:

> We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.

operatingthetan•1h ago
>hopefully changes the way benchmarking is done.

Yeah the path forward is simple: check if the solutions actually contain solutions. If they contain exploits then that entire result is discarded.

Leynos•1h ago
Also, fuzz your benchmarks
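A minimal sketch of what "fuzz your benchmarks" could look like in practice: throw content-free submissions at the scoring harness and flag anything that earns points. The `score(task, submission)` callback and the choice of degenerate inputs are invented for illustration, not part of any real benchmark API.

```python
def fuzz_harness(score, tasks):
    """Feed degenerate, content-free submissions to a benchmark scorer.
    Any nonzero score for these inputs signals an exploitable harness.
    `score(task, submission)` is a hypothetical scoring callback."""
    degenerate = [{}, "", None, {"answer": ""}, [""]]
    findings = []
    for task in tasks:
        for sub in degenerate:
            try:
                s = score(task, sub)
            except Exception:
                continue  # rejecting junk input is the desired behavior
            if s > 0:
                findings.append((task, sub, s))
    return findings

# A scorer that accepts an empty dict (as FieldWorkArena reportedly did)
# is flagged on every degenerate input:
leaky = lambda task, sub: 1.0  # pathological: full marks for anything
findings = fuzz_harness(leaky, ["task1"])
```

A harness that either raises on junk or scores it zero comes back clean; a harness that hands out points for `{}` shows up immediately.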
siva7•1h ago
Could it really be that we not only vibeslop all apps nowadays but also don't care to check how the AI solved a benchmark it claims to have solved?
operatingthetan•1h ago
Probably a more interesting benchmark is one that is scored based on the LLM finding exploits in the benchmark.
SpicyLemonZest•38m ago
Frontier model developers try to check for memorization. But until AI interpretability is a fully solved problem, how can you really know whether it actually didn't memorize or your memorization check wasn't right?
ZeroGravitas•53m ago
In human multiple-choice tests they sometimes use negative marking to discourage guessing. It feels like an exploit should cancel out several correct solutions.
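The negative-marking idea above has a standard formulation: on a k-choice question, a penalty of 1/(k-1) per wrong answer makes random guessing worth exactly zero in expectation. A quick sketch of the arithmetic:

```python
def expected_guess_score(k: int, penalty: float) -> float:
    """Expected score from random guessing on one k-choice question,
    where a correct answer earns 1 and a wrong answer loses `penalty`."""
    p_correct = 1.0 / k
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

# With penalty = 1/(k-1), guessing has expected value zero,
# so there is no incentive to guess:
assert abs(expected_guess_score(4, 1 / 3)) < 1e-12
```

The commenter's proposal is the benchmark analogue: set the penalty for a detected exploit high enough that gaming has negative expected value.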
zer00eyz•1h ago
2024: Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance https://www.tomshardware.com/pc-components/cpus/spec-invalid...

2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...

It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.

I like what LLMs are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so many of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.

irishcoffee•1h ago
> It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.

I wonder if this is common? We should call it Goodhart's law while someone does the research on how common this is.

For real, I’ve assumed from the jump these things were all gamed, with the amount of money on the line.

bee_rider•26m ago
What was the cheat in the 2024 Intel situation? The TomsHardware article and the Phoronix article they linked were quite vague. (Not to say I have any doubts, just curious, hadn’t heard of this one).
charcircuit•1h ago
I always assumed that these benchmarks would happen in a sandbox. I'm surprised that no one realized this sooner.
ModernMech•1h ago
I'm surprised anyone took them seriously in the first place.
operatingthetan•1h ago
We need good benchmarks or we are just left following the hype train.
subulaz•1h ago
a LOT of the people who love benchmarks are middle management hard-selling GenAI/LLM as magic tech sauce to vaguely technical executives who only want to know about the money aka headcount savings they so desperately desire.

their collective butts are already glued to the hype train as they chase numbers they (often) manufactured to justify the latest round of tech spend.

lots of good use cases out there - like the incredible progress with medical imaging analysis or complex system models for construction - and lots of crap use cases that need benchmarks to cosplay relevance.

lnrd•1h ago
I'm honestly confused by the design of SWE-bench and why it is considered reliable.

It's based on existing GitHub PRs and Issues, the full dataset is on HuggingFace and is one year old now. All frontier models 100% have those issues and PRs in their training data so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?

SpicyLemonZest•41m ago
Frontier model developers do not consider SWE-bench to be reliable. OpenAI announced in February (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...) that they consider it hopelessly contaminated, advocating for a new version SWE-bench Pro that was published more recently. (They seem to believe that even the publicly accessible part of the SWE-bench Pro problem set will be more resistant to training set contamination issues in the future, for reasons that to be honest I don't really understand.)
oliver236•1h ago
what is the point of benchmarks?
andai•1h ago
If there were no benchmark, number would not go up.
danslo•57m ago
If only the blog itself wasn't written by AI?

>No reasoning. No capability. Just exploitation of how the score is computed.

shudder

gaythread•38m ago
Modern day HN is overrun with AI posts.
cpldcpu•38m ago
Yes, marks of AI all over the place. Also the SVGs.

>No solution written, 100% score.

It's weird. Turns out the hardest problem for LLMs to really tackle is long-form text.

basch•18m ago
Maybe in one shot.

In theory I would expect them to be able to ingest the corpus of the new yorker and turn it into a template with sub-templates, and then be able to rehydrate those templates.

The harder part seems to be synthesizing new connections from two adjacent ideas. They like to take x and y and create x+y instead of x+y+z.

sidpatil•17m ago
Someone here mentioned a while ago that the labs deliberately haven't tried to train these characteristics out of their models, because leaving them in makes it easier to identify, and therefore exclude, LLM-generated text from their training corpus.
alexchantavy•18m ago
I wonder what college freshman-level writing classes are teaching about writing voice and AI. The tell-tale patterns are pretty frustrating to read.
jmward01•38m ago
Not really on topic, but I have wondered if we need a different type of test to help find model architecture potential. Standardized training sets followed by testing to see the potential curves of a model: train on x, test; add y, test; add z, test. At each increment you see how well the model is absorbing the information and extrapolate how well that architecture might do if more fully trained.
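The incremental protocol described above can be sketched as a loop. Everything here is invented for illustration: `make_model`, `evaluate`, and the model's `train` method are hypothetical interfaces, and the toy model stands in for a real architecture.

```python
def capability_curve(make_model, chunks, evaluate):
    """Train on x, test; add y, test; add z, test.
    Returns one score per increment; the shape of this curve would be
    the architecture's 'absorption' signature."""
    model = make_model()
    seen = []
    scores = []
    for chunk in chunks:
        seen.append(chunk)
        model.train(seen)              # continue training on all data so far
        scores.append(evaluate(model))
    return scores

# Toy stand-in: "capability" is just how much data the model has seen.
class ToyModel:
    def train(self, data):
        self.seen = sum(len(chunk) for chunk in data)

curve = capability_curve(ToyModel, [[1, 2], [3], [4, 5, 6]], lambda m: m.seen)
# curve -> [2, 3, 6]
```

Comparing curves across architectures (rather than single end-of-training scores) is the extrapolation the comment is suggesting.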
jgalt212•36m ago
The real question is how close to VW and Dieselgate these offenses are. And what exposure do these companies have? I would assume securities fraud, if only because Matt Levine says everything is securities fraud.
SoKamil•34m ago
The more research on this topic is published, the more knowledge of how to game benchmarks will be stored in future training data. And since it comes from a university, it is ranked higher in the data corpus. It sounds like a self-fulfilling prophecy.
abirch•32m ago
Damned old Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure".

https://en.wikipedia.org/wiki/Goodhart%27s_law

lukev•27m ago
I think we should all consider the possibility that part of the reason Anthropic hasn't immediately released Mythos is that it would be slightly disappointing relative to the benchmark scores.
eiens•15m ago
The models don’t get better on every dimension as they scale up - there are trade-offs.

I’m convinced specialised models are the way but this means writing off the investment in existing assets which they won’t do for obvious reasons.

bbcc90•11m ago
Yes, good evals are really hard - that’s not really news.

This team is doing a good job. They use problems that were created in the last 30 days to avoid training-set leakage. https://swe-rebench.com/

Anyone know how I can cancel this? I dont want it

https://old.reddit.com/r/wallstreetbets/comments/1siq4m2/anyone_know_how_i_can_cancel_this_i_dont...
2•simonpure•10m ago•0 comments

Tell HN: See the AI Doc

2•linsomniac•13m ago•1 comments

Why Aren't We Uv Yet?

https://aleyan.com/blog/2026-why-arent-we-uv-yet/
2•birdculture•18m ago•1 comments

Apple Silicon and Virtual Machines: Beating the 2 VM Limit

https://khronokernel.com/macos/2023/08/08/AS-VM.html
16•krackers•19m ago•0 comments

Heartbeat – open implementation of KAIROS, the always-on agent hidden in Claude C

https://github.com/uameer/heartbeat
1•usmame•19m ago•1 comments

The Polycorp Poly 1, New Zealand's school computer

https://www.classic-computers.org.nz/collection/poly1.htm
2•rbanffy•21m ago•0 comments

Ask HN: Why have we not stepped back on the moon again?

2•chirau•24m ago•2 comments

Ask HN: How did you specialize as a software engineer?

2•legerdemain•31m ago•2 comments

Is "Tokenmaxxing" a Flex?

https://www.businessinsider.com/tokenmaxxing-ai-token-leaderboards-debate-2026-4
2•pascal-maker•35m ago•2 comments

Git fixup is magic (and Magit is too)

https://arialdomartini.github.io/git-fixup
2•fanf2•35m ago•0 comments

Trump's World Liberty Financial borrows $75M using its own token as collateral

https://www.coindesk.com/markets/2026/04/09/trump-s-world-liberty-financial-borrows-usd75-million...
7•JohnTHaller•36m ago•0 comments

Show HN: Beta Testing needed for my package Trustcheck

https://github.com/Halfblood-Prince/trustcheck
1•halfblood1010•39m ago•1 comments

Ask HN: Agentic Permutation of Testing Paths In A System

4•davidajackson•43m ago•0 comments

Amazon Luna Will No Longer Allow Owners to Buy Games, Access Game Stores

https://www.ign.com/articles/amazon-luna-will-no-longer-allow-owners-to-buy-games-access-game-sto...
6•surgical_fire•44m ago•1 comments

Living Memory Inference

https://github.com/alash3al/loci
2•alash3al•47m ago•0 comments

YouTube Premium price increase to take effect in June

https://www.latimes.com/entertainment-arts/story/2026-04-10/youtube-premium-price-increase
2•obilgic•52m ago•0 comments

Open-source MCP server for LinkedIn

https://github.com/stickerdaniel/linkedin-mcp-server
2•arguflow•55m ago•0 comments

Hours Without Internet

https://bsky.app/profile/netblocks.org/post/3mj6hjlonjc2m
1•stupefy•56m ago•1 comments

Top% of users capture 61.5% of engagement in Hezbollah discourse on X

https://arxiv.org/abs/2603.26681
2•soufan•57m ago•0 comments

Several Mac mini and Mac Studio configs are now out of stock at Apple

https://9to5mac.com/2026/04/11/mac-mini-mac-studio-configs-completely-out-of-stock/
8•gnabgib•57m ago•0 comments

Prompt to App

https://prompttoapp.dev/
32•helloww•57m ago•4 comments

Get Users on Autopilot

https://www.usehotdrop.com/
1•Lucnyg•1h ago•0 comments

Producing The Perfect Token

https://blog.luminal.com/p/producing-the-perfect-token
1•jafioti•1h ago•0 comments

A general technique for automating NES games

https://tom7.org/mario/
2•azhenley•1h ago•0 comments

Show HN: PlaneFeed – scroll live flights like TikTok

https://planefeed.app/
1•mind1m•1h ago•0 comments

Canada's Liberal party adopts motion to restrict kids from social media

https://toronto.citynews.ca/2026/04/11/liberal-party-adopts-motion-to-restrict-kids-from-social-m...
2•EmbarrassedHelp•1h ago•0 comments

447 TB/cm² at zero retention energy – atomic-scale memory on fluorographane

https://zenodo.org/records/19513269
8•iliatoli•1h ago•0 comments

Apple Stops Accepting Orders for Some Mac Mini and Mac Studio Models

https://www.macrumors.com/2026/04/11/some-mac-mini-mac-studio-currently-unavailable/
6•dabinat•1h ago•1 comments

Dark Castle

https://darkcastle.co.uk/
19•evo_9•1h ago•2 comments

Show HN: Kern – Agents that do the work and show it

https://github.com/oguzbilgic/kern-ai
2•obilgic•1h ago•0 comments