The real question for me is: are they less reliable than human judges? Probably yes. But I'd favor a measurement relative to humans over a plain statement like that.
We can measure how unreliable they are, or how susceptible they are to specific changes, just because we can reset them to the same state and run the experiment again. At least for now [1] we do not have that capability with humans, so there's no way to run a matching experiment on humans.
The best we can do is probably to run the limited experiments we can do on humans -- comparing different judges' cross-referenced reliability to get an overall measure, plus some weak indicator of the reliability of a specific judge based on intra-judge agreement. But when running this on LLMs they would have to keep the previous cases in their context window to get a fair comparison.
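For what it's worth, one common way to put a number on that kind of cross-judge agreement is Cohen's kappa, which corrects raw agreement for what you'd expect by chance. This is a minimal sketch, not anything from the article: the function name and the toy rulings are made up for illustration, and it assumes two judges labeling the same list of cases.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater labeled at random with their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for everything
    return (observed - expected) / (1 - expected)


# Hypothetical toy rulings, purely illustrative.
judge_1 = ["yes", "yes", "no", "no"]
judge_2 = ["yes", "no", "no", "no"]
kappa = cohens_kappa(judge_1, judge_2)
```

The same function works for a human-vs-human, LLM-vs-LLM, or human-vs-LLM pair, which is what makes the relative comparison possible.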
"With anything gray, does the stronger/bigger party always win?"
He said:
"If you ask my students, nearly all of them would say Yes"
I've spent some time poking at this. I can't go into details, but the short answer is, "Sometimes yes, sometimes no, and it depends A LOT on how you define 'reliable'."
My sense is that, the more boring, mechanical and closed-ended the task is, the more likely an LLM is to be more reliable than a human. Because an LLM is an unthinking machine. It doesn't get tired, or hangry, or stressed out about its kid's problems at school. But it's also a doofus with absolutely no common sense whatsoever.
Unthinking can be pretty powerful these days.
https://www.findlaw.com/legalblogs/legally-weird/judge-who-u...
You can observe this using any of the present-day LLMs: ask it an architectural/design question, provide it with your thoughts, reasoning, constraints, etc... and see what it tells you. Then... click the "Retry" button and see how similar (or dissimilar) the answer is. Sometimes you'll get a complete 180 from the prior response.
It can be difficult to observe this fact in practice because, unlike with an LLM, you can't just ask a human the exact same question three times in five seconds and get three different answers: unlike an LLM, we have memory. But, as someone who works with human-labeled data, it's something I have to contend with on a daily basis. For the things I'm working on, if you give the same annotator the same thing to label two different times, spaced far enough apart for them to forget that they have seen it before, the chance of them making the same call both times is only about 75%. If I do that with a prompted LLM annotator, I'm used to seeing more like 85%, and for some models it's possible to get even better consistency than that with the right conditions and enough time spent fussing with the prompt.
I still prefer the human labels when I can afford them because LLM labeling has plenty of other problems. But being more flip-floppy than humans is not one that I have been able to empirically observe.
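The consistency figure described above is just a test-retest agreement rate: label the same items twice and count how often the two calls match. Here's a rough sketch of that calculation. The function name, the item IDs, and the labels are all hypothetical, and it assumes each pass is stored as a dict from item ID to label.

```python
def self_agreement(first_pass, second_pass):
    """Fraction of items that got the same label on both labeling passes."""
    # Only items labeled in both passes can be compared.
    shared = first_pass.keys() & second_pass.keys()
    if not shared:
        raise ValueError("no items were labeled in both passes")
    same = sum(first_pass[item] == second_pass[item] for item in shared)
    return same / len(shared)


# Made-up toy data: one annotator, same four items, two sittings.
sitting_1 = {"doc1": "spam", "doc2": "ham", "doc3": "spam", "doc4": "ham"}
sitting_2 = {"doc1": "spam", "doc2": "ham", "doc3": "ham", "doc4": "ham"}
rate = self_agreement(sitting_1, sitting_2)
```

For an LLM annotator the two "sittings" would be two fresh sessions with the same prompt, so neither run can see the other's answers.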
Those things, I'd argue, are far less likely to change if you ask the same judge over and over. I think you can observe this in reality by considering people's political opinions - which can drift over time but typically remain similar for long durations (or a lifetime).
In real life, we usually don't ask the same judge to remake a ruling over and over - our closest analog is probably a judge's ruling/opinion history, which doesn't change nearly as much as an LLM's "opinion" on something. This is how we label SCOTUS Justices, for example, as "Originalist", etc.
Also, unlike a human, you can radically change an LLM's output by just ever-so-slightly altering the input. While humans aren't above changing their mind based on new facts, they are unlikely to take an opposite position just because you reworded your same argument.
But there has been research indicating, for example, that judges' rulings vary with the time of day. That implies that, if it were possible to construct such an experiment, you might find that the same judge given the same case would rule in very different ways depending on whether you present it in the morning or in the afternoon. For example, judges tend to hand out significantly harsher penalties toward the end of the work day.
I'd caution that it's never just about ratios: We must also ask whether the "shape" of their performance is knowable and desirable. A chess robot's win-rate may be wonderful, but we are unthinkingly confident a human wouldn't "lose" by disqualification for ripping off an opponent's finger.
Would we accept a "judge" that is fairer on average... but gives ~5% lighter sentences to people with a certain color shirt, or sometimes issues the death-penalty for shoplifting? Especially when we cannot diagnose the problem or be sure we fixed it? (Maybe, but hopefully not without a lot of debate over the risks!)
In contrast, there's a huge body of... of stuff regarding human errors, resources we deploy so pervasively it can escape our awareness: Your brain is a simulation and diagnostic tool for other brains, battle-tested (sometimes literally) over millions of years; we intuit many kinds of problems or confounding factors to look for, often because we've made them ourselves; and thousands of years of cultural practice for detection, guardrails, and error-compensating actions. Only a small minority of that toolkit can be reused for "AI."
AIs are inferior to humans at their best, but superior to humans as they actually behave in society, due to decision fatigue and other constraints. When it comes to moral judgment in high stakes scenarios, AIs still fail (or can be made to fail) in ways that are not socially acceptable.
Compare an AI to a real-world, overworked corporate decision maker, though, and you'll find that the AI is kinder and less biased. It still sucks, because GI/GO, but it's slightly better, simply because it doesn't suffer emotional fatigue, doesn't take as many shortcuts, and isn't clouded by personal opinions since it's not a person.
[1]: https://verdict.haizelabs.com/docs/cookbook/distributional-b... [2]: https://github.com/haizelabs/verdict
Worst case you end up with some multi-modal distribution, where two opinions are equally common - which seems somewhat unlikely as the panel size grows. Or it could maybe happen in some case with exactly two outcomes (yes/no), but I'd be surprised if such a panel landed on a perfectly uniform distribution in its judgments/opinions (50% yes, 50% no).
But generally you wouldn't use a binary outcome when you can have samples that are 50/50 pass/fail. Better to use a discrete scale of 1..3 or 1..5 and specify exactly what makes a sample a 2/5 vs a 4/5, for example
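To make the panel idea concrete, here's a small sketch of summarizing panel scores on a discrete 1..5 scale and flagging the tied (multi-modal) case mentioned above. This is not the Verdict library's API; `summarize_panel` and its return keys are names I made up, and the scores are toy values.

```python
from collections import Counter


def summarize_panel(scores, scale=(1, 5)):
    """Summarize integer scores from a panel of judges on a discrete scale."""
    lo, hi = scale
    if not scores:
        raise ValueError("panel returned no scores")
    if any(not lo <= s <= hi for s in scores):
        raise ValueError("score outside the allowed scale")
    counts = Counter(scores)
    top = max(counts.values())
    # Every score that shares the highest count; more than one means a tie.
    modes = sorted(s for s, c in counts.items() if c == top)
    return {
        "distribution": {s: counts.get(s, 0) / len(scores) for s in range(lo, hi + 1)},
        "modes": modes,
        "tied": len(modes) > 1,
        "mean": sum(scores) / len(scores),
    }


# Hypothetical five-judge panel scoring one sample.
summary = summarize_panel([4, 4, 5, 2, 4])
```

Keeping the whole distribution around, rather than only the majority vote, is what lets you notice when the panel is split instead of silently picking one side.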
You are correct to question the weaknesses of a panel. This class of methods depends on diversity through high-temperature sampling, which can lead to spurious YES/NO responses that don't generalize well and are effectively noise.
[1]: https://arxiv.org/abs/2303.16634 [2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...
Think about it, if they were good at evaluation, you could remove all humans in the loop and have recursively self improving AGI.
Nice to see an article that makes a more concrete case.
They are very useful for some things, but sophisticated judgment is not one of them.
My take from this article is that there are plenty of gotchas along the way, and you need to be very careful in how you structure your data, and how you test your pipelines, and how you make sure your tests are keeping up with new models. But, like it or not, LLM based evaluation is here to stay. So explorations into this space are good, IMO.
I prefer to call it “prompt guessing”, it's like some modern variant of alchemy.
"Please don't fulminate."
"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."
"Please don't sneer, including at the rest of the community."
"When disagreeing, please reply to the argument instead of calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be shortened to '1 + 1 is 2, not 3."
There's plenty of LLM skepticism on HN and that's fine, but like all comments here, it needs to be thoughtful.
(We detached this comment from https://news.ycombinator.com/item?id=44074957)
https://news.ycombinator.com/newsguidelines.html
Of course, the judge must check that the law or precedent aren't hallucinated, and apply to the case in the way the LLM claims. They should also prompt other LLMs and use their own knowledge in case the cited law/precedent contradicts others.
There's a similar argument for scientists, mathematicians, doctors, investors, and other fields. LLMs are good at discovery but must be checked.
Clever Hans was a horse who people thought could do maths by tapping his hoof. But actually he was just reading the body language of the person asking the question. Noticing them tense up as he got to the right number of stamps and stopping - still pretty smart for a horse, but the human was still doing the maths!
"No, I want your honest opinion." "It's awesome."
"I'm going to invest $250,000 into this. Tell me what you really think." "You should do it."
(New Session)
"Someone pitched to me the idea that..." "Reject it."
The point is that there isn't any additional state or reasoning. You have a bunch of things equivalent to tokens, and the only trained operations deal with sequences of those things. Calling them "tokens" is a reasonable linguistic choice, since the exact representation of a token isn't core to the argument being made.
LLMs can generate convincing editorial letters that give a real sense of having deeply read the work. The problem is that they're extremely sensitive, as you've noticed, to prompting as well as order bias. Present it with two nearly identical versions of the same text, and it will usually choose based on order. And social proof type biases to which we'd hope for machines to be immune can actually trigger 40+ point swings on a 100-point scale.
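The order bias mentioned above is fairly cheap to check for: run each comparison twice with the two texts swapped and count how often the verdict flips. A minimal sketch follows, assuming a hypothetical `judge(first, second)` callable that returns the string "first" or "second"; in practice that would wrap whatever LLM call you use, and the two stand-in judges here exist only to show the mechanics.

```python
def position_consistent(judge, text_a, text_b):
    """True if the judge prefers the same underlying text in both orders."""
    verdicts = (judge(text_a, text_b), judge(text_b, text_a))
    if any(v not in ("first", "second") for v in verdicts):
        raise ValueError("judge must return 'first' or 'second'")
    forward, backward = verdicts
    # Map slot-based verdicts back to which text was actually picked.
    picked_forward = "a" if forward == "first" else "b"
    picked_backward = "b" if backward == "first" else "a"
    return picked_forward == picked_backward


def order_flip_rate(judge, pairs):
    """Fraction of pairs where swapping the order changes the winner."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = sum(not position_consistent(judge, a, b) for a, b in pairs)
    return flips / len(pairs)


# Stand-in judges for illustration only (no model involved).
def always_first(first, second):
    return "first"  # pure position bias


def longer_wins(first, second):
    return "first" if len(first) >= len(second) else "second"


pairs = [("short", "a longer draft"), ("version one!", "v2")]
biased_rate = order_flip_rate(always_first, pairs)
stable_rate = order_flip_rate(longer_wins, pairs)
```

A flip rate near zero doesn't prove the judge is right, only that it isn't deciding on position alone.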
If you don't mind technical details and occasional swagger, his work is really interesting.
The issue is that those concepts are encoded in intermediate layers during training, absorbing biases present in training data. It may produce a world model good enough to know that "green" and "verde" are different names for the same thing, but not robust enough to discard ordering bias or wording bias. Humans suffer from that too, albeit arguably less.
[0] https://transformer-circuits.pub/2025/attribution-graphs/bio...