
AI benchmarks are a bad joke – and LLM makers are the ones laughing

https://www.theregister.com/2025/11/07/measuring_ai_models_hampered_by/
95•pseudolus•2h ago

Comments

Marshferm•2h ago
Don’t get high on your own supply.
calpaterson•1h ago
Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.

I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.

And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.

I dunno what to do about it and am tending to just pick Gemini as a result.

bubblelicious•1h ago
I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are also in a huge rush and don’t want to end up making their whole job benchmarking. Even if you could, and even if you had the right background, you could do benchmarks full time and they would still be a mess.

Product testing (with traditional A/B tests) is kind of the best bet since you can measure what you care about _directly_ and at scale.
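
A minimal sketch of what that direct measurement could look like, with made-up next-day-retention counts and a plain two-proportion z-test (statsmodels):

    # Hypothetical A/B readout (numbers are made up): did users who got model B
    # come back the next day more often than users who got model A?
    from statsmodels.stats.proportion import proportions_ztest

    retained = [4120, 4298]   # users retained next day under A and B
    exposed = [10000, 10000]  # users assigned to each arm

    stat, p_value = proportions_ztest(count=retained, nobs=exposed)
    lift = retained[1] / exposed[1] - retained[0] / exposed[0]
    print(f"absolute lift: {lift:+.2%}, p-value: {p_value:.4f}")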

I would say there is of course “benchmarketing”, but generally people do sincerely want to make good benchmarks; it’s just hard or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use.

ACCount37•1h ago
A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.

Human raters are exploitable, and you never know whether the B has a genuine performance advantage over A, or just found a meat exploit by accident.

It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.

bubblelicious•1h ago
Are you talking about just preferences, or A/B tests on like retention and engagement? The latter I think is pretty reliable and powerful, though I have never personally done them. Preferences are just as big a mess: WHO the annotators are matters, and if you are using preferences as a proxy for, like, correctness, you’re not really measuring correctness, you’re measuring e.g. persuasion. A lot of construct validity challenges (which themselves are hard to even measure in domain).
bjackman•1h ago
For what it's worth, I work on platforms infra at a hyperscaler and benchmarks are a complete fucking joke in my field too lol.

Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:

- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit"). A sketch of what even a basic version could look like is at the end of this comment.

- the benchmarks are almost never predictive of the performance of real world workloads anyway

- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.

AND this is a field where the economic incentives for accurate predictions are enormous.

In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.

Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!
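
On the first bullet, a rough sketch of what even a basic comparison with an interval could look like (made-up latency samples; Welch's t-test plus a bootstrap CI, not claiming this is the right analysis for any particular workload):

    # Rough sketch (made-up samples): instead of just reporting the delta
    # of the means, attach a significance test and a confidence interval to it.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    baseline = rng.normal(100.0, 15.0, size=200)   # e.g. request latencies, ms
    candidate = rng.normal(97.0, 15.0, size=200)

    delta = candidate.mean() - baseline.mean()
    t_stat, p_value = stats.ttest_ind(candidate, baseline, equal_var=False)  # Welch's t-test

    # Bootstrap CI on the difference of means, no distributional assumptions.
    boot = [
        rng.choice(candidate, candidate.size).mean() - rng.choice(baseline, baseline.size).mean()
        for _ in range(10_000)
    ]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"delta={delta:.2f}ms, 95% CI=({lo:.2f}, {hi:.2f}), p={p_value:.3f}")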

bofadeez•1h ago
Has your lab tried using any of the newer causal inference–style evaluation methods? Things like interventional or counterfactual benchmarking, or causal graphs to tease apart real reasoning gains from data or scale effects. Wondering if that’s something you’ve looked into yet, or if it’s still too experimental for practical benchmarking work.
ACCount37•1h ago
Ratings on LMArena are too easily gamed.

Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.

A lot of the sycophancy mess that seeps from this generation of LLMs stems from reckless tuning based on human feedback. Tuning for good LMArena performance has similar effects - and not at all by coincidence.

HPsquared•1h ago
Psychometric testing of humans has a lot of difficulties, too. It's hard to measure some things.
jennyholzer•1h ago
I've been getting flagged by high-on-their-own-supply AI boosters for identifying that LLM benchmarks have been obvious bullshit for at least the last year and a half.

What changed to make "the inevitable AI bubble" the dominant narrative in the last week or so?

conception•1h ago
Companies talking about needing trillions of dollars is why.
HPsquared•1h ago
Benchmarks in general have this problem, across pretty much all industries. "When a measure becomes a target" and all that.
icameron•1h ago
The market was down, especially for AI-related stocks. While it’s only down a bit over 3%, it’s the worst week since April, and there’s no single event to blame; it just looks like market sentiment has shifted away from the previous unchecked exuberance.
Kiro•15m ago
Link those comments, please, because I checked your history and the flagged ones were pure nonsense with zero insights. Also, calling out LLM benchmarks has never been a radical take and is basically the default on this site.
shanev•1h ago
This is solvable at the level of an individual developer. Write your own benchmark from code problems that you've already solved. Verify that tests pass and that it satisfies your metrics like tok/s and TTFT (time to first token). Create a harness that works with API keys or local models (if you're going that route).
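
A rough sketch of the kind of harness this could be, assuming the OpenAI Python SDK; the model name, prompt, and pass/fail check below are placeholders:

    # Rough harness sketch, assuming the OpenAI Python SDK (pip install openai).
    # Point base_url at any OpenAI-compatible local server if you're running locally.
    import time
    from openai import OpenAI

    client = OpenAI()  # or OpenAI(base_url="http://localhost:8000/v1", api_key="none")

    def run_case(model: str, prompt: str, passes) -> dict:
        """Stream one completion, recording TTFT, rough tok/s, and a pass/fail check."""
        start = time.perf_counter()
        first_token_at = None
        text, n_chunks = "", 0
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            if delta:
                n_chunks += 1
                if first_token_at is None:
                    first_token_at = time.perf_counter()
            text += delta
        done = time.perf_counter()
        gen_time = done - (first_token_at or done)
        return {
            "ttft_s": (first_token_at or done) - start,
            # chunk count is only a rough proxy for tokens; use tiktoken for real counts
            "tok_per_s": n_chunks / gen_time if gen_time > 0 else 0.0,
            "passed": passes(text),
        }

    # Hypothetical case drawn from a problem you've already solved yourself.
    print(run_case(
        model="gpt-4o-mini",  # placeholder model name
        prompt="Write a Python function is_palindrome(s) that ignores case and spaces.",
        passes=lambda out: "def is_palindrome" in out,
    ))
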
motoboi•40m ago
Well, OpenAI's GitHub is open for writing evaluations. Just add yours there, and it's guaranteed that the next model will perform better on them.
cactusplant7374•21m ago
I think that's what this site is doing: https://aistupidlevel.info/
bee_rider•41m ago
> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."

When models figure out how to exploit a quirk of the test the way every clever college student does, that should count as a win. That’s a much more human-like reasoning ability than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.

layer8•35m ago
I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.

However:

> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.

Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.

zamadatix•13m ago
Pencil and paper is just testing with tools enabled.
layer8•9m ago
You seem to be addressing an argument that wasn’t made.

Personally, I’d say that LLM tools are more akin to a human using a calculator.

zamadatix•5m ago
I'm not addressing an argument, just stating that's already a form of LLM testing today.
LadyCailin•8m ago
I’d say it’s fair for LLMs to be able to use any tool in benchmarks, so long as they are the ones to decide to use them.
gardnr•35m ago
A discussion on models "figuring out" things: https://www.youtube.com/watch?v=Xx4Tpsk_fnM (Forbidden Technique)
BolexNOLA•20m ago
>the point of these LLMs is to do things that computers were bad at.

The way they’re being deployed, it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.

I’m a bit out on a limb here because this is not really my area of technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLMs perform in the ways most people actually use them. Maybe I’m off base here, though.

nradov•14m ago
LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.
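
Most chat APIs already expose this as tool/function calling. A minimal sketch, assuming the OpenAI Python SDK (the model name and the multiply tool are placeholders):

    # Minimal function-calling sketch: the model decides whether to call the
    # calculator instead of doing the arithmetic in-weights.
    import json
    from openai import OpenAI

    client = OpenAI()
    tools = [{
        "type": "function",
        "function": {
            "name": "multiply",  # hypothetical tool
            "description": "Multiply two integers exactly.",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
                "required": ["a", "b"],
            },
        },
    }]

    msg = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "What is 483920193 * 778271?"}],
        tools=tools,
    ).choices[0].message

    if msg.tool_calls:  # the model chose to use the tool
        args = json.loads(msg.tool_calls[0].function.arguments)
        print(args["a"] * args["b"])  # run the calculation outside the model
    else:
        print(msg.content)
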
moritzwarhier•31m ago
When people claim that there is such a thing as "X% accuracy in reasoning", it's really hard to take anything else seriously, no matter how impressive.

AI (and humans!) aside, the question of whether there could be an oracle that "answers all questions" is a solved problem. Such a thing cannot exist.

But this is going already too deep IMO.

When people start talking about percentages or benchmark scores, there has to be some denominator.

And there can be no such bias-free denominator for

- trivia questions

- mathematical questions (oh, maybe I'm wrong here, intuitively I'd say it's impossible for various reasons: varying "hardness", undecidable problems etc)

- historical or political questions

I wanted to include "software development tasks", but it would be a distraction. Maybe there will be a good benchmark for this, I'm aware there are plenty already. Maybe AI will be capable of being a better software developer than me in some capacity, so I don't want to include this part here. That also maps pretty well to "the better the problem description, the better the output", which doesn't seem to work so neatly with the other categories of tasks and questions.

Even if the whole body of questions/tasks/prompts were very constrained and covered only a single domain, it seems impossible to guarantee that such a benchmark is "bias-free" (I know AGI folks love this word).

Maybe in some interesting special cases? For example, very constrained and clearly defined classes of questions, at which point, the "language" part of LLMs seems to become less important and more of a distraction. Sure, AI is not just LLMs, and LLMs are not just assistants, and Neural Networks are not just LLMs...

That's where the problem begins, to be honest: I don't even know how to align the "benchmark" claims with the kinds of AI they are examining and the ones I know exist.

Sure it's possible to benchmark how well an AI decides whether, for example, a picture shows a rabbit. Even then: for some pictures, it's gotta be undecidable, no matter how good the training data is?

I'm just a complete layman and commenting about this; I'm not even fluent in the absolute basics of artificial neural networks like perceptrons, gradient descent, backpropagation and typical non-LLM CNNs that are used today, GANs etc.

I am and was impressed by AI and deep learning, but to this day I am thoroughly disappointed by the hubris of snakeoil salespeople who think it's valuable and meaningful to "benchmark" machines on "general reasoning".

I mean, it's already a thing in humans. There are IQ tests for the non-trivia parts. And even these have plenty of discussion revolving around them, for good reason.

Is there some "AI benchmark" that exclusively focuses on doing recent IQ tests on models, preferably editions that were published after the particular knowledge cutoff of the respective models? I found (for example) this study [1], but to be honest, I'm not the kind of person who is able to get the core insights presented in such a paper by skimming through it.

Because I think there are impressive results, it's just becoming very hard to see through the bullshit as an average person.

I would also love to understand more about the current state of the research on the "LLMs as compression" topic [2][3].

[1] https://arxiv.org/pdf/2507.20208

[2] https://www.mattmahoney.net/dc/text.html

[3] https://arxiv.org/abs/2410.21352

SurceBeats•16m ago
Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.
RA_Fisher•7m ago
For statistical AI models, we can use out-of-sample prediction error as an objective measure to compare models. What makes evaluating LLMs difficult is that comparisons are inextricable from utility (whereas statistical AI models have a pre-utility step wherein it can be shown that out-of-sample prediction error is minimized).
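
A tiny sketch of that pre-utility step on synthetic data (scikit-learn): fit on a training split, compare held-out prediction error, and the lower error wins with no human judgment involved.

    # Out-of-sample comparison sketch: synthetic data, two off-the-shelf models.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 5))
    y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=2000)  # made-up target

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
        model.fit(X_tr, y_tr)
        err = mean_squared_error(y_te, model.predict(X_te))
        print(type(model).__name__, f"out-of-sample MSE: {err:.3f}")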
