A. model improvement tests, suites, and benchmarks
B. data on competitors' evals
C. test answer keys
D. alpha to VC firms
E. all of the above
???
It'd be nice if it were actually open and we could inspect all the statistics.
1. E.g. https://mppbench.com/
This is hard to swallow.
I don't believe a single word this article says. Apparently the "real author" (the human being who wrote the original prompt to generate this article) only intended to use it to generate clicks and engagement, and doesn't care at all about what's in it.
Instead, finance bros are convinced by the argument that number goes up.
    def is_it_true(question):
        return profit_if_true(question) > profit_if_false(question)
AI will make it cheaper, faster, better, no problem. You can eat the cake now and save it for later. I'm being (mostly) serious: suppose you're a stuffed shirt trying to boost your valuation, how can you work out who's smart enough to train your LLM? (Never mind how to get them to work for you!)
Then for LMArena there is a host of other biases / construct-validity problems: people are easily fooled, even PhD experts; in many cases it's easier for a model to learn how to persuade than to actually learn the right answers.
But there are a lot of dismissive comments here, as if the frontier labs don't know this; they have some of the best talent in the world. They aren't perfect, but they largely know what they're doing and what the tradeoffs of the various approaches are.
Human annotations are an absolute nightmare for quality, which is why coding agents are so nice: they're verifiable, so you can train them in a way closer to, e.g., AlphaGo, without the ceiling of human performance.
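A minimal sketch of what "verifiable" means here, under the assumption that an agent rollout ships with unit tests you can actually run (pytest is just one illustrative choice, not something the article or the labs prescribe):

    import subprocess

    def verifiable_reward(test_file: str) -> float:
        """Binary reward for one coding-agent rollout: did the tests pass?
        Unlike a human preference label this signal is objective and repeatable,
        so training can push past the ceiling of human judgement."""
        result = subprocess.run(
            ["pytest", test_file, "-q"],   # assumes pytest is installed
            capture_output=True, text=True, timeout=120,
        )
        return 1.0 if result.returncode == 0 else 0.0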
So we should expect the models to eventually tend toward the same behaviors that politicians exhibit?
Isn’t it fascinating how it comes down to quality of judgement (and the descriptions thereof)?
We need an LMArena rated by experts.
But at least the two examples of judging AI given in the article can be solved by any moron willing to expend enough effort. Any moron can tell you what Dorothy says to Toto when entering Oz just by watching the first thirty minutes of the movie. And while validating answer B in the pan question takes some ninth-grade math (or a short trip to Wikipedia), figuring out that a nine-inch-diameter circle does not have the same area as a 9x13-inch rectangle is not rocket science (quick check below). With a bit of craft paper you could evaluate both answers even without the math.
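For the curious, the arithmetic behind the pan question (assuming the usual 9-inch round vs. 9x13 rectangular framing) takes a few lines of Python:

    import math

    round_pan = math.pi * (9 / 2) ** 2   # 9-inch-diameter round pan
    rect_pan = 9 * 13                    # 9x13-inch rectangular pan

    print(f"round: {round_pan:.1f} sq in, rectangular: {rect_pan} sq in")
    # round: 63.6 sq in, rectangular: 117 sq in -- not remotely the same area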
So the short answer is: with effort. You spend a lot of effort finding a good evaluator, so the evaluator can judge the LLM for you. Or you take "average humans" and force them to spend more effort evaluating each answer.
Secondly, it doesn't fix stupidity. A participant who earnestly takes the quality goals of the system to heart instead of focusing on maximizing their take (and is therefore, by that logic, obviously stupid) will still make bad classifications for exactly that reason.
1. I would expect any paid arrangement to include a quality-control mechanism, unless it was designed from scratch by complete ignoramuses.
2. Do you have a proposal for a better incentive?
2. Criticism of a method does not require that there be a viable alternative. Perhaps the better idea is simply not to incentivize people to do tasks they are not qualified for.
Agreed, and would add that it doesn’t fix other things like lack of skill, focus, time, etc.
An example is the output of the Amazon Turk “Sheep Market” experiment:
https://docubase.mit.edu/project/the-sheep-market/
Some of those sheep were really ba-aaa-ad.
We give them WAY too much credit by watching mostly the things they have been trained specifically to do and pretending this indicates a general mental competence that just doesn't exist.
People hold falsehoods to be true, and cannot calculate a 10% tip.
By being closed, they'll never be optimal.
That's not how I use it, and I suspect that's true of many other people. I specifically ask questions about niche subjects that I know very well, where it's very easy for me to spot mistakes.
The first time I used it, that’s what came naturally to my mind. I believe it’s the same for others.
Of course people visiting a website specifically designed for evaluating LLMs do try all kinds of specific things to specifically test for weaknesses. There may be users who just click on the response with more emojis, but I strongly doubt they are the majority on that particular site.
I missed that one; is 5 any better? I switched to Claude before it launched.
The thing was huge. They were training it to be GPT-5, before they figured out their userbase was too large to be served something that big.
When it comes to conversation, Gemini 3 Pro right now is the closest.
When I asked it to make a nightmare Sauron would show me in the Palantir, ChatGPT 5.2 Thinking tried to make it "playful" (directly against my instructions) and went with a shallow but safe option. Gemini 3 Pro prepared something much deeper and more profound.
I don't know nearly as much about talking with Opus 4.5 - while I use it for coding daily, I don't use it as a go-to chat. As a side note, Opus 3 has a similar vibe to GPT 4.5.
Meta "cheated" on lmarena not by using a smarter model but by using one that was more verbose and friendly with excessive emojis.
LLMs are fallible. Humans are fallible. LLMs improve (and improve fast). Humans do not (overall, i.e. as "a group of N experts in X" or "N random internet people").
All those "Turing tests" will start flipping.
Today it's "N random internet humans" who score too low on those benchmarks; tomorrow it'll be "a group of N expert humans in X".
Shouldn't the model effectively 1. learn to complete the incorrect thing and 2. learn the contexts in which it's correct and incorrect? In this case the context being lazy LMArena users. And presumably, in the future, poorly filtered training data.
We seem to be able to read incorrect things and not be corrupted (well, theoretically). It's not ideal, but it seems an important component to intellectual resilience.
It seems like the model knowing that the data is from LMArena, or some other untrusted source, would be sufficient to shift the prior to a reasonable place.
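A rough sketch of that kind of conditioning, assuming you control the training pipeline and can prepend a provenance tag to each document (the tag format here is invented purely for illustration):

    def tag_document(text: str, source: str, trusted: bool) -> str:
        """Prepend provenance metadata so the model can condition on it.
        The point is only that "this came from LMArena votes" is visible at
        training time, so the learned prior for that context can differ from
        the prior for trusted sources."""
        label = "trusted" if trusted else "untrusted"
        return f"<source name={source} trust={label}>\n{text}"

    example = tag_document("A 9x13 pan is the same as a 9-inch round pan.",
                           source="lmarena_votes", trusted=False)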
This is pure gold. I've always found this approach of evals on a moving-target via consensus broken.
> Why is LMArena so easy to game? The answer is structural.
> The system is fully open to the Internet. LMArena is built on unpaid labor from uncontrolled volunteers.
Also, all users' votes count equally, but not all users have equal knowledge.
They've raised about $250 million, so I don't see that happening anytime soon.
Beyond that there is coding up a web page, which as we all know can be vibe coded in a few hours...
What else is there to spend money on?
Maybe if they started rating the answers on a 1-10 scale, allowing people to specify gradations of correctness/wrongness, then the crowd would work?
Yes, the system desperately needs this. Many doctors commit malpractice for DECADES.
I would absolutely seek to, damn, even pay good money to, be able to talk with a doctor's previous patients, particularly if they're going to perform a life-changing procedure on me.
Which is exactly that. I've actually found great specialists there, looking at their ratings.
> They're not reading carefully. They're not fact-checking, or even trying.
Uhhh, how was that established?
I know we can solve this for ordinary tasks just by using the prompt, but that's really annoying. Sometimes I just want a yes-or-no answer and instead I get a PhD thesis on the matter.
A voting system open to the public is completely screwed even if somehow its incentives are optimized toward strongly encouraging ideal behavior.
> In battle mode, you'll be served 2 anonymous models. Dig into the responses and decide which answer best fits your needs.
It's not a given that someone's needs are "factual accuracy". Maybe they're after entertainment, or winning an argument.
The idea is simple*: instead of users rating content, an AI does the rating based on fact-checking (a rough sketch follows the footnote below).
None. Zero products or roadmaps on that.
Worse than that, people don't want this. It might tell them that they are wrong, with no chance to get their buddies to upvote them or game the system socially. It would probably flop.
Both AI companies and users want control, they want to game stuff. LMArena is ideal for that.
---
* I know it's a simple idea but hard to achieve, and I'm not underestimating the difficulty. It doesn't matter though: no one is even signaling the intention of solving it. Harder problems have been signaled (protein research, math).
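To make the footnote concrete, here is a hedged sketch of what "AI rates it based on a fact check" could look like; judge_model is a stand-in for whatever LLM-as-judge you would actually call, and none of this is a real product's API:

    def fact_check_score(answer: str, sources: list[str], judge_model) -> float:
        """Score an answer by how many of its claims the sources support,
        instead of counting anonymous user votes."""
        claims = judge_model(f"List the factual claims in:\n{answer}").splitlines()
        supported = 0
        for claim in claims:
            verdict = judge_model(
                f"Claim: {claim}\nSources: {sources}\n"
                "Reply 1 if the sources support the claim, 0 otherwise."
            ).strip()
            supported += verdict.startswith("1")
        # Fraction of claims grounded in the sources; 1.0 means fully supported.
        return supported / len(claims) if claims else 0.0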
However, LMArena, despite its flaws (reCAPTCHA in 2026?), is the only "testing ground" where you can examine the entire breadth of internet users. Everything else is an incredibly selective, hamstrung, bureaucratic benchmark of pre-approved QA sessions that doesn't handle edge cases or out-of-distribution content. LMArena is the source of the "out-of-distribution" questions that trigger the corner cases and expose weak parts in processing (like tokenization/parsing bugs) or inference inefficiency (infinite loops, stalling, and various suboptimal paths); it's "idiot-proofing" any future interactions beyond sterile test sets.
> Raw intelligence meets battle-tested experience
> A global community of the smartest people in every field who've shipped products, won cases, published breakthroughs, and made decisions under pressure.
Is there an established name for this LLMism?
I don't need a "Reality Check" or a "Hard Truth". The thought can be concluded without this performative honesty nonsense or the emotive hyperbole.
This probably grates me more than any other.
The article makes strong points, includes real data and quotes, shows proof of work (sampling 100 Q&A), so does that even matter at this point? This doesn't feel like "slop" to me at all.
I don't know if this proves it's an LLM text or whether that style is simply spilling out everywhere.