This is key.
Public benchmarks are essentially trust-based and the trust just isn't there.
With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.
And there are massive incentives for the labs to cheat, to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO directing it. It can even be one or a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.
There's a long history of that sort of behaviour. ISPs gaming bandwidth tests when they detect one is being run. Software recognizing being run in a VM or on a particular configuration. I don't think it's a stretch to assume some of the money at OpenAI and others has gone into spotting likely benchmark queries and throwing on a little more compute or tagging them for future training.
I would be outright shocked if most of these benchmarks are even attempting serious countermeasures.
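To make that concrete, here's a purely hypothetical toy sketch of what "spotting likely benchmark queries" could look like; it's an assumption for illustration, not a claim about what any provider actually does.

# Hypothetical: flag prompts that closely match known public benchmark items,
# so they could be routed to more compute or tagged for later training.
from difflib import SequenceMatcher

KNOWN_BENCHMARK_PROMPTS = [
    "Fix the failing test in the repository snapshot below.",
    "Given a long travel journal, how many cities does the author mention?",
]

def looks_like_benchmark(prompt: str, threshold: float = 0.8) -> bool:
    return any(
        SequenceMatcher(None, prompt.lower(), known.lower()).ratio() >= threshold
        for known in KNOWN_BENCHMARK_PROMPTS
    )

if looks_like_benchmark("Given a long travel journal, how many cities does the author mention?"):
    print("route to a bigger model / tag for future training")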
So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.
There is a possibility that machine-based PR reviews are better: for instance because they are not prejudiced based on who initiated the PR, and because they don't take other environmental factors into account. You'd expect a machine to be more neutral, so on that front the machine should, and possibly could, score better. But until the models consistently outperform the humans in impartially scored quality against a baseline of human results, it is the humans that should call this, not the machines.
It's just that it is fairly trivial to present a PR to a machine in such a way that it can only comment on the differences in the code. I would find it surprising if that somehow led to a bias about the author. Can you give an example of how you think that would creep into such an interaction?
I had some fun trying to answer it, setting aside whether or not the premise is true, for argument's sake.
My answer is:
I would think "attempting to assess reality that is not grounded in reality" is hard to ignore due to a combination of it being what's available, being easy to understand, and seeming useful (decoupled from whether it really is). As a result, it's hard to ignore because it's most of what's available to us for consumption and is easy to make "consumable."
I think there is a LARGE overlap in this topic with my pet peeve and hatred of mock tests in development. They are not completely useless, but their obvious flaws and vulnerabilities seem to me to be in the same area: "Not grounded in reality."
Said another way: because it's what's easy to make, there is a lot of it, creating a positive feedback loop via the mere-exposure effect. Then it becomes hard to ignore because it's what's shoved in our faces.
> So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.
The comment I replied to was:
> That's how 99% of 'LLM benchmark numbers' circulating on the internet work.
And that's just false. SWE-bench Verified isn't like this. Aider Polyglot isn't like this. SWE-Lancer Diamond isn't like this. The new internal benchmarks used by OpenAI in GPT-5's model card aren't like this.
Maybe this benchmark is a special snowflake and needs LLM-as-a-judge, but that doesn't invalidate the original concern: setting up a benchmark this way runs into a series of problems and is prone to show performance differences that might not be there with a different setup. Benchmarks are already hard to trust; I'm not sure how this one is any more indicative than the rest.
[1] https://openai.com/index/introducing-swe-bench-verified/
This might be more like asking amateur painters to each paint a picture of a different one of the pumpkins, then judging each other's paintings without seeing the actual pumpkin that painting was based on.
It's supposed to leverage a "generate vs. critique" gap in skill level as a form of self-improvement. It's easier to judge how good food is than to make it.
But here's the thing. When it comes to code review, you need to be effectively as skilled as the person who wrote it. There isn't really a gap.
And then the real clincher is this. LLMs naturally have a skill gap between their judgement and generation skills as is. The reason is that they have superhuman pattern matching and memorization ability. They can use their memorized patterns as a massive crutch for their actual reasoning skills, but they can't do the same for judgement calls in code review.
Or in other words, I don't need to be a chef myself to decide if a meal is good or not.
As you point out, there are many problems in higher complexity classes than NP.
But it does hold for this problem.
Can you give me a very large semiprime?
And Claude Opus answered:
Here's a very large semiprime:
N = 29927402397991286489627837734179186385188296382227646249397073654051914085318503794952624411151858464246403027505634195232053330357484129331920822220662818816547063469215394303721576869467659309978113411955550111870966028627418736664
This is an over-200-digit semiprime. Factoring semiprimes of this size is computationally intensive, which is why they form the basis of RSA encryption security.
---
Verifying whether this answer is correct is very hard, much harder than generating it.
Problems of this form come up very often, and not just in formal mathematics. Some magic number in the code that you need to reverse engineer to tell whether it's correct. Some library which you don't have the documentation for but which was available when the code was written. Hidden intentions, or even requirements, that are not clear from the code itself. If a weaker LLM is validating a stronger LLM, the weaker LLM will simply not grasp the subtleties the stronger LLM created in its answer. In fact it's a pretty common statement that writing code is easier than reading it, which is precisely about generation vs. validation.
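As a toy illustration of that "magic number" point (my own example, not one from this thread): the Quake-style fast inverse square root constant takes one line to write down, but verifying by reading that it's correct means reverse engineering the float bit tricks.

import struct

def fast_inv_sqrt(x: float) -> float:
    # Approximates 1/sqrt(x); easy to generate, hard to validate by inspection.
    i = struct.unpack('>l', struct.pack('>f', x))[0]  # reinterpret the float's bits as a 32-bit int
    i = 0x5f3759df - (i >> 1)                         # the magic constant
    y = struct.unpack('>f', struct.pack('>l', i))[0]  # reinterpret back as a float
    return y * (1.5 - 0.5 * x * y * y)                # one Newton-Raphson refinement step

print(fast_inv_sqrt(4.0))  # ~0.499, vs. the exact 0.5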
Not if it's divisible by 2.
from sympy import isprime
num = 29927402397991286489627837734179186385188296382227646249397073654051914085318503794952624411151858464246403027505634195232053330357484129331920822220662818816547063469215394303721576869467659309978113411955550111870966028627418736664
print(num//2) # 14963701198995643244813918867089593192594148191113823124698536827025957042659251897476312205575929232123201513752817097616026665178742064665960411110331409408273531734607697151860788434733829654989056705977775055935483014313709368332
print(isprime(num//2)) # False, so num//2 is composite and num has at least three prime factors: not a semiprime

This can correct for generation errors, but cannot correct for quality measurement errors, so the question is valid.
It's not hard. You are visiting a website with an .ai domain. You already know what the conclusions will be.
I haven't found it to be really useful so far, but it's also very little added work, so for now I keep on using it. If it saves my ass even just once, it will probably be worth it overall.
That's a common fallacy of safety by the way :)
It could very well "save your ass" just once (whatever that means) while costing you more in time, opportunity, effort, or even a false sense of safety, and end up generating more harm than it ultimately saves you.
And it's not even safety critical code.
IME the highest value (at the moment) is having an LLM integrated into the PR page that reads your code + CI log and effectively operates as a sanity check / semantic linter.
A common workflow for us is: Draft PR -> Passes CI (inclusive of an LLM 'review') -> Published -> Passes human review -> Scheduled to merge
The goal is to get a higher margin of confidence that your code (1) will not blow up in production (2) faithfully does what it's trying to do.
The value of the LLM reviewer is maybe 80% in the first bucket and 20% in the second bucket, IME. It often catches bugs like "off by one" and "you meant this to be `if not x`, based on the flag name and behavior, not `if x`".
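A minimal sketch of that second kind of catch, with made-up names (save_record, skip_validation, validate are hypothetical): the flag name says one thing and the branch does the opposite, which is exactly the mismatch a reviewer reading only the diff can flag.

def validate(record: dict) -> None:
    # Placeholder validation: reject records without an id.
    if "id" not in record:
        raise ValueError("record is missing an id")

def save_record(record: dict, skip_validation: bool = False) -> dict:
    if skip_validation:        # bug: validation runs only when the caller asked to skip it
        validate(record)       # intended: if not skip_validation
    return record              # stand-in for the actual database insert

save_record({"name": "no id here"})  # silently accepted because the flag check is inverted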
the image shows it with a score of 62.7, not 58.5
which is right? mistakes like this undermine the legitimacy of a closed benchmark, especially one judged by an LLM
Resulted in the following answers:
- Gemini 2.5 flash: Gemini 2.5 Flash
- Claude Sonnet 4: Claude Sonnet 4
- ChatGPT: GPT-5
To me it's conceivable that GPT-4o would be biased toward output generated by other OpenAI models.
The self-preference is almost certainly coming from post-processing, or more likely because the model name is inserted into the system prompt.
FTFY
thanks, but no thanks, I don't buy such marketing propaganda.
It’d be harder to juice benchmarks if ~100 top models were randomly sampled in this manner for output tokens while evaluating the target model’s output.
On second thought, I’m slapping AGPL on this idea. Please hire me and give me one single family house in a California metro as a bonus. Thanks.
I've only seen it go above 5000 for very difficult style transfer problems where it has to wrangle with the micro-placement of lots of text. Or difficult math problems.
1) Gemini 2.5 Pro ranks only non-Google models
2) Claude 4.1 Opus ranks only non-Anthropic models
3) GPT-5-thinking ranks only non-OpenAI models
4) Then sum up the rankings and sort by the sum.
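A rough sketch of how that aggregation could be computed (the rankings below are placeholders, not real results):

rankings = {
    "Gemini 2.5 Pro":  {"Claude 4.1 Opus": 1, "GPT-5-thinking": 2},  # judges only non-Google models
    "Claude 4.1 Opus": {"GPT-5-thinking": 1, "Gemini 2.5 Pro": 2},   # judges only non-Anthropic models
    "GPT-5-thinking":  {"Gemini 2.5 Pro": 1, "Claude 4.1 Opus": 2},  # judges only non-OpenAI models
}

totals = {}
for judge, ranks in rankings.items():
    for model, rank in ranks.items():
        totals[model] = totals.get(model, 0) + rank

# Lower summed rank = better overall placement across non-affiliated judges.
for model, total in sorted(totals.items(), key=lambda kv: kv[1]):
    print(model, total)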
The sentence is too obviously LLM generated, but whatever.
> Weaknesses:
>
> False positives: A few reviews include incorrect or harmful fixes.
> Inconsistent labeling: Occasionally misclassifies the severity of findings or touches forbidden lines.
> Redundancy: Some repetition or trivial suggestions that dilute review utility.
wtf are "forbidden lines"?
Example: Given a long travel journal, how many cities does the author mention?
GPT-5: 12
Expected: 17