Like any LLM benchmark, LMArena is highly flawed, but I do think it has a right to exist. Anecdotally, it has been indicative of which LLM's style I like best, not necessarily of factual accuracy. It hasn't, however, been a very useful tool for finding the best LLM for a given job.
To the article's point, though, it's treated as the gold standard, which it isn't. We should have learned that from sycophancy-gate.
I'm not sure the methodology here is really sound for the question at hand. It's a bit like saying prediction markets don't work because 40% of the people who bet were wrong.
You can't get around running your own benchmarks on the job at hand if you really want 95th-percentile performance on a task.
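A minimal sketch of what "run your own benchmark" could look like: score candidate models on prompt/expected pairs drawn from your actual workload. The `ask_model` function here is a stub with canned answers so the sketch is self-contained; in practice it would wrap each provider's real API.

```python
def ask_model(model: str, prompt: str) -> str:
    # Stub standing in for a real LLM API call (hypothetical models/answers).
    canned = {
        ("model-a", "What is 2+2?"): "4",
        ("model-b", "What is 2+2?"): "5",
        ("model-a", "Capital of France?"): "Paris",
        ("model-b", "Capital of France?"): "Paris",
    }
    return canned[(model, prompt)]

def benchmark(models, cases):
    """cases: (prompt, expected) pairs taken from the job at hand."""
    scores = {}
    for m in models:
        hits = sum(ask_model(m, p).strip() == want for p, want in cases)
        scores[m] = hits / len(cases)
    return scores

cases = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
print(benchmark(["model-a", "model-b"], cases))  # model-a: 1.0, model-b: 0.5
```

Exact-match scoring only works for tasks with a single right answer; for open-ended tasks you'd swap in a rubric or an LLM-as-judge comparison, but the shape of the harness stays the same.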