AI benchmarks are a bad joke – and LLM makers are the ones laughing

https://www.theregister.com/2025/11/07/measuring_ai_models_hampered_by/

103•pseudolus•2h ago

Comments

Marshferm•2h ago

Don’t get high on your own supply.

calpaterson•1h ago

Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.

I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.

And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.

I dunno what to do about it and am tending to just pick Gemini as a result.

bubblelicious•1h ago

I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are also in a huge rush and don’t want to end up making their whole job benchmarking. Even if you have the right background, you can do benchmarks full time and they would still be a mess.

Product testing (with traditional A/B tests) is kind of the best bet since you can measure what you care about _directly_ and at scale.

I would say there is of course “benchmarketing”, but generally people do sincerely want to make good benchmarks; it’s just hard or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use.

ACCount37•1h ago

A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.

Human raters are exploitable, and you never know whether B has a genuine performance advantage over A, or just found a meat exploit by accident.

It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.

bubblelicious•1h ago

Are you talking about just preferences or A/B tests on like retention and engagement? The latter I think is pretty reliable and powerful though I have never personally done them. Preferences are just as big a mess: WHO the annotators are matters, and if you are using preferences as a proxy for like correctness, you’re not really measuring correctness you’re measuring e.g. persuasion. A lot of construct validity challenges (which themselves are hard to even measure in domain).

bjackman•1h ago

For what it's worth, I work on platforms infra at a hyperscaler and benchmarks are a complete fucking joke in my field too lol.

Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:

- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")

- the benchmarks are almost never predictive of the performance of real world workloads anyway

- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.

AND this is a field where the economic incentives for accurate predictions are enormous.

In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.

Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!
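
To make the statistics complaint above concrete, here is a minimal sketch of going one step past "the delta in the mean of these two samples" by putting a percentile bootstrap interval around that delta. The latency numbers and the function name `bootstrap_mean_diff_ci` are made up for illustration; this is one simple technique among several, not a description of what any particular team does.

```python
import random
import statistics


def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resampled_b) - statistics.fmean(resampled_a))
    diffs.sort()
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = int((1 - alpha / 2) * n_resamples) - 1
    return diffs[lo_index], diffs[hi_index]


# Made-up latency samples (ms) for a baseline and a candidate build.
baseline = [101.2, 99.8, 100.5, 102.1, 98.9, 100.0, 101.7, 99.5]
candidate = [100.9, 99.1, 100.2, 101.5, 98.7, 99.9, 101.0, 99.6]

point_estimate = statistics.fmean(candidate) - statistics.fmean(baseline)
low, high = bootstrap_mean_diff_ci(baseline, candidate)
print(f"delta in means: {point_estimate:.2f} ms, 95% CI [{low:.2f}, {high:.2f}]")
```

If the interval straddles zero, the delta on its own does not show that one build is faster than the other.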

bofadeez•6m ago

Even a p-value is insufficient. Maybe can use some of this stuff https://web.stanford.edu/~swager/causal_inf_book.pdf

bofadeez•1h ago

Has your lab tried using any of the newer causal inference–style evaluation methods? Things like interventional or counterfactual benchmarking, or causal graphs to tease apart real reasoning gains from data or scale effects. Wondering if that’s something you’ve looked into yet, or if it’s still too experimental for practical benchmarking work.

ACCount37•1h ago

Ratings on LMArena are too easily gamed.

Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.

A lot of the sycophancy mess that seeps from this generation of LLMs stems from reckless tuning based on human feedback. Tuning for good LMArena performance has similar effects - and not at all by coincidence.

HPsquared•1h ago

Psychometric testing of humans has a lot of difficulties, too. It's hard to measure some things.

jennyholzer•1h ago

I've been getting flagged by high-on-their-own-supply AI boosters for identifying that LLM benchmarks have been obvious bullshit for at least the last year and a half.

What changed to make "the inevitable AI bubble" the dominant narrative in the last week or so?

conception•1h ago

Companies are talking about needing trillions of dollars is why.

HPsquared•1h ago

Benchmarks in general have this problem, across pretty much all industries. "When a measure becomes a target" and all that.

icameron•1h ago

The market was down, for AI-related stocks especially. While it was down only a bit over 3%, it’s the worst week since April, and there’s no single event to blame; it just looks like market sentiment has shifted away from the previous unchecked exuberance.

Kiro•16m ago

Link those comments please because I checked your history and the flagged ones were pure nonsense with zero insights. Also, calling out LLM benchmarks has never been a radical take and basically the default on this site.

shanev•1h ago

This is solvable at the level of an individual developer. Write your own benchmark for code problems that you've solved. Verify tests pass and that it satisfies your metrics like tok/s and TTFT. Create a harness that works with API keys or local models (if you're going that route).
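
A minimal sketch of what such a harness could look like, assuming the model is reachable as a function that streams tokens. `fake_stream` is a hypothetical stand-in for a real API or local-model client, and `measure` and `passes` are illustrative names, not part of any library.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model client."""
    for token in ["def", " add", "(a", ",", " b", "):", " return", " a", " +", " b"]:
        time.sleep(0.001)  # simulate per-token latency
        yield token


def measure(stream_fn: Callable[[str], Iterable[str]], prompt: str) -> dict:
    """Collect the streamed output plus time-to-first-token and tokens per second."""
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    for token in stream_fn(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens.append(token)
    elapsed = time.perf_counter() - start
    return {
        "output": "".join(tokens),
        "ttft_s": (first_token_at - start) if first_token_at is not None else None,
        "tok_per_s": len(tokens) / elapsed if elapsed > 0 else 0.0,
    }


def passes(output: str) -> bool:
    """Run the generated code against a problem whose answer is already known."""
    namespace = {}
    try:
        exec(output, namespace)  # only run model output in a sandbox you trust
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


result = measure(fake_stream, "Write add(a, b) in Python.")
print(result["ttft_s"], result["tok_per_s"], passes(result["output"]))
```

Swapping `fake_stream` for a real client keeps the rest unchanged, which is the point of having a harness: the same problems and the same checks run against every model.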

motoboi•42m ago

Well, OpenAI's GitHub is open to contributed evaluations. Just add yours there and it's guaranteed that the next model will perform better on them.

cactusplant7374•23m ago

I think that's what this site is doing: https://aistupidlevel.info/

bee_rider•42m ago

> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."

When models figure out how to exploit an effect that every clever college student does, that should count as a win. That’s a much more human-like reasoning ability than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.

layer8•36m ago

I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.

However:

> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.

Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.
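
For anyone who does want to know how accuracy moves with operand size, a rough sketch of a probe that generates problems by digit count, so the ground truth is exact and cannot have been lifted from an exam. `toy_model` is a hypothetical stand-in that deliberately fails on large operands; a real run would call an actual model there.

```python
import random


def make_problems(num_digits, count, seed=0):
    """Generate multiplication questions whose operands have exactly num_digits digits."""
    rng = random.Random(seed)
    low, high = 10 ** (num_digits - 1), 10 ** num_digits - 1
    problems = []
    for _ in range(count):
        a, b = rng.randint(low, high), rng.randint(low, high)
        problems.append((f"What is {a} * {b}?", a * b))
    return problems


def accuracy(answer_fn, problems):
    """Fraction of problems where the answer matches the exact product."""
    correct = sum(1 for question, expected in problems if answer_fn(question) == expected)
    return correct / len(problems)


def toy_model(question):
    """Hypothetical model: exact only when both operands are below 1000."""
    a, b = (int(part) for part in question[len("What is "):-1].split(" * "))
    return a * b if max(a, b) < 1000 else a * b + 1


for digits in (2, 3, 6):
    print(digits, accuracy(toy_model, make_problems(digits, count=50)))
```

A score reported per digit count separates "can do exam-style arithmetic" from "can do arithmetic at any size", which is the distinction the quoted passage is about.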

zamadatix•15m ago

Pencil and paper is just testing with tools enabled.

layer8•11m ago

You seem to be addressing an argument that wasn’t made.

Personally, I’d say that such tool use is more akin to a human using a calculator.

zamadatix•7m ago

I'm not addressing an argument, just stating that's already a form of LLM testing done today for people wanting to look at the difference in results the same as the human analogy.

layer8•4m ago

Okay, but then I don’t understand why you replied to my comment for that; there is no direct connection to what I wrote, nor to what bee_rider wrote.

zamadatix•3m ago

> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.

People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).

LadyCailin•10m ago

I’d say it’s fair for LLMs to be able to use any tool in benchmarks, so long as they are the ones to decide to use them.

luke0016•5m ago

> reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.

Or given a calculator. Which it's running on. Which it in some sense is. There's something deeply ironic about the fact that we have an "AI" running on the most technologically advanced calculator in the history of mankind and...it can't do basic math.

ambicapter•2m ago

> Since performance on large numbers is not what these exams are intended to test for,

How so? Isn't the point of these exams to test arithmetic skills? I would hope we'd like arithmetic skills to be at a constant level regardless of the size of the number?

gardnr•36m ago

A discussion on models "figuring out" things: https://www.youtube.com/watch?v=Xx4Tpsk_fnM (Forbidden Technique)

BolexNOLA•22m ago

>the point of these LLMs is to do things that computers were bad at.

The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.

I’m a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLMs perform in the ways most people actually use them. Maybe I’m off base here though.

nradov•16m ago

LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.

moritzwarhier•33m ago

When people claim that there is such a thing as "X% accuracy in reasoning", it's really hard to take anything else seriously, no matter how impressive.

AI (and humans!) aside, whether there can be an oracle that could "answer all questions" is a solved problem: such a thing cannot exist.

But this is already going too deep IMO.

When people start talking about percentages or benchmark scores, there has to be some denominator.

And there can be no bias-free such denominator for

- trivia questions

- mathematical questions (oh, maybe I'm wrong here, intuitively I'd say it's impossible for various reasons: varying "hardness", undecidable problems etc)

- historical or political questions

I wanted to include "software development tasks", but it would be a distraction. Maybe there will be a good benchmark for this, I'm aware there are plenty already. Maybe AI will be capable of being a better software developer than me in some capacity, so I don't want to include this part here. That also maps pretty well to "the better the problem description, the better the output", which doesn't seem to work so neatly with the other categories of tasks and questions.

Even if the whole body of questions/tasks/prompts were very constrained and covered only a single domain, it seems impossible to guarantee that such a benchmark is "bias-free" (I know AGI folks love this word).

Maybe in some interesting special cases? For example, very constrained and clearly defined classes of questions, at which point, the "language" part of LLMs seems to become less important and more of a distraction. Sure, AI is not just LLMs, and LLMs are not just assistants, and Neural Networks are not just LLMs...

There the problem begins, to be honest: I don't even know how to align the "benchmark" claims with the kind of AI they are examining and the ones I know exist.

Sure it's possible to benchmark how well an AI decides whether, for example, a picture shows a rabbit. Even then: for some pictures, it's gotta be undecidable, no matter how good the training data is?

I'm just a complete layman and commenting about this; I'm not even fluent in the absolute basics of artificial neural networks like perceptrons, gradient descent, backpropagation and typical non-LLM CNNs that are used today, GANs etc.

I am and was impressed by AI and deep learning, but to this day I am thoroughly disappointed by the hubris of snakeoil salespeople who think it's valuable and meaningful to "benchmark" machines on "general reasoning".

I mean, it's already a thing in humans. There are IQ tests for the non-trivia parts. And even these have plenty of discussion revolving around them, for good reason.

Is there some "AI benchmark" that exclusively focuses on doing recent IQ tests on models, preferably editions that were published after the particular knowledge cutoff of the respective models? I found (for example) this study [1], but to be honest, I'm not the kind of person who is able to get the core insights presented in such a paper by skimming through it.

Because I think there are impressive results, it's just becoming very hard to see through the bullshit as an average person.

I would also love to understand more about the current state of the research on the "LLMs as compression" topic [2][3].

[1] https://arxiv.org/pdf/2507.20208

[2] https://www.mattmahoney.net/dc/text.html

[3] https://arxiv.org/abs/2410.21352

SurceBeats•17m ago

Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.

RA_Fisher•8m ago

For statistical AI models, we can use out of sample prediction error as an objective measure to compare models. What makes evaluating LLMs difficult is that comparisons are inextricable from utility (whereas statistical AI models do have a pre-utility step wherein it can be shown out of sample prediction epsilon is minimized).
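
A small sketch of what "out of sample prediction error as an objective measure" means in practice, on synthetic data invented for the example: fit two candidate models on a training split and compare their mean squared error on rows neither has seen. The helper names are illustrative only.

```python
import random
import statistics

# Synthetic data: y = 2x + 1 plus Gaussian noise (made up for illustration).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1)) for x in xs]
train, test = data[:150], data[150:]


def fit_mean(rows):
    """Baseline model: always predict the training mean of y."""
    mean_y = statistics.fmean(y for _, y in rows)
    return lambda x: mean_y


def fit_line(rows):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(x for x, _ in rows)
    mean_y = statistics.fmean(y for _, y in rows)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
        (x - mean_x) ** 2 for x, _ in rows
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, rows):
    """Mean squared prediction error of a model on the given rows."""
    return statistics.fmean((model(x) - y) ** 2 for x, y in rows)


mean_model, line_model = fit_mean(train), fit_line(train)
print("held-out MSE, mean model:", mse(mean_model, test))
print("held-out MSE, line model:", mse(line_model, test))
```

The comparison needs no judgment about usefulness: the lower held-out error wins. That is the step that has no clean equivalent once "better" depends on what users wanted from an LLM's answer.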
