This is key.
Public benchmarks are essentially trust-based and the trust just isn't there.
With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.
And there are massive incentives for the labs to cheat, to get the hype going around a launch and justify their enormous investments in training. It doesn't have to be the CEO directing it. It can even be one or a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.
There's a long history of that sort of behaviour: ISPs gaming bandwidth tests when they detect one is being run, software recognizing that it's running in a VM or on a particular configuration. I don't think it's a stretch to assume some of the money at OpenAI and others has gone into spotting likely benchmark queries and throwing a little more compute at them, or tagging them for future training.
I would be outright shocked if most of these benchmarks are even attempting serious countermeasures.
So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.
There is a possibility that machine-based PR reviews are better: for instance, because they aren't prejudiced by who initiated the PR and don't take other environmental factors into account. You'd expect a machine to be more neutral, so on that front the machine should, and possibly could, score better. But until the models consistently outperform humans on impartially scored quality against a baseline of human results, it's the humans who should call this, not the machines.
It's just that it is fairly trivial to present a PR to a machine in such a way that it can only comment on the differences in the code. I would find it surprising if that somehow led to a bias about the author. Can you give an example of how you think that would creep into such an interaction?
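To illustrate the "only the diff" part, here's a rough sketch, assuming a local git checkout and the OpenAI Python client; the model name and prompt wording are placeholders I made up. The reviewer only ever sees the patch text, never who wrote it.

```python
# Rough sketch: ask a model to review a PR given nothing but the diff.
# Assumptions: a local git checkout, the official `openai` Python package,
# and OPENAI_API_KEY in the environment. Model name and prompts are placeholders.
import subprocess
from openai import OpenAI

client = OpenAI()

def review_diff_only(base: str, head: str) -> str:
    # The raw diff contains code changes only -- no author name, account,
    # or history that could bias the reviewer for or against the submitter.
    diff = subprocess.run(
        ["git", "diff", f"{base}...{head}"],
        capture_output=True, text=True, check=True,
    ).stdout

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[
            {"role": "system",
             "content": "You are a code reviewer. Comment only on the diff you are given."},
            {"role": "user", "content": diff},
        ],
    )
    return response.choices[0].message.content

# Example: print(review_diff_only("main", "feature-branch"))
```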
This might be more like asking amateur painters to each paint a picture of a different one of the pumpkins, then judging each other's paintings without seeing the actual pumpkin that painting was based on.
It's supposed to leverage a "generate vs. critique" gap in skill level as a form of self-improvement. It's easier to judge how good food is than to make it.
But here's the thing. When it comes to code review, you need to be effectively as skilled as the person who wrote it. There isn't really a gap.
And then the real clincher is this: LLMs already have a gap between their judgement and generation skills as it is, and it runs in the wrong direction. They have superhuman pattern matching and memorization, and they can use those memorized patterns as a massive crutch for their actual reasoning when generating code, but they can't lean on them the same way for the judgement calls in code review.
Or in other words, I don't need to be a chef myself to decide if a meal is good or not.
I haven't found it to be really useful so far, but it's also very little added work, so for now I keep on using it. If it saves my ass even just once, it will probably be worth it overall.
That's a common fallacy of safety by the way :)
It could very well "save your ass" just once (whatever that means) while costing you so much in time, opportunity, effort, or even a false sense of safety that it generates more harm than it ultimately prevents.
And it's not even safety-critical code.
The image shows it with a score of 62.7, not 58.5.
Which is right? Mistakes like this undermine the legitimacy of a closed benchmark, especially one judged by an LLM.
Resulted in the following answers:
- Gemini 2.5 Flash: Gemini 2.5 Flash
- Claude Sonnet 4: Claude Sonnet 4
- ChatGPT: GPT-5
To me it's conceivable that GPT-4o would be biased toward output generated by other OpenAI models.
The self-preference is almost certainly coming from post-processing, or, more likely, from the model name being inserted into the system prompt.
FTFY
Thanks, but no thanks; I don't buy such marketing propaganda.
It'd be harder to juice benchmarks if a random sample of ~100 top models were drawn in this manner to evaluate the target model's output.
On second thought, I’m slapping AGPL on this idea. Please hire me and give me one single family house in a California metro as a bonus. Thanks.
I've only seen it go above 5000 for very difficult style transfer problems where it has to wrangle with the micro-placement of lots of text. Or difficult math problems.
1) Have Gemini 2.5 Pro rank only non-Google models. 2) Have Claude 4.1 Opus rank only non-Anthropic models. 3) Have GPT-5 Thinking rank only non-OpenAI models. 4) Sum up the rankings and sort by the sum.
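Roughly, the aggregation step would look like the sketch below; the per-judge rankings here are made-up placeholders, and actually obtaining them would still mean prompting each judge with the other vendors' outputs.

```python
# Sketch of the aggregation step only, with made-up per-judge rankings.
# Getting the real rankings would still mean prompting each judge model
# with the other vendors' outputs; none of that is shown here.
from collections import defaultdict

VENDOR = {
    "gemini-2.5-pro": "Google",
    "claude-4.1-opus": "Anthropic",
    "gpt-5-thinking": "OpenAI",
}

# judge -> {candidate: rank}, where 1 = best. Placeholder values.
partial_rankings = {
    "gemini-2.5-pro": {"claude-4.1-opus": 1, "gpt-5-thinking": 2},
    "claude-4.1-opus": {"gpt-5-thinking": 1, "gemini-2.5-pro": 2},
    "gpt-5-thinking": {"gemini-2.5-pro": 1, "claude-4.1-opus": 2},
}

totals = defaultdict(int)
for judge, ranks in partial_rankings.items():
    for candidate, rank in ranks.items():
        # Enforce the rule: no judge ever ranks its own vendor's model.
        assert VENDOR[candidate] != VENDOR[judge]
        totals[candidate] += rank

# Lower summed rank = better overall placement.
for model, total in sorted(totals.items(), key=lambda kv: kv[1]):
    print(model, total)
```

Since no judge ever scores its own vendor's model, self-preference can't directly inflate any total.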