Five frontier LLMs disagree on 67% of 1k real-world fact-check claims

https://lenz.io/research/llm-disagreement

91•kostaj•52m ago

Comments

kostaj•50m ago

Author here. 67% (95% CI 64–70%) of 1,000 recent real user claims to a fact-checking platform had at least one of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro+Search, and Sonar Pro dissent from the panel majority — or no majority formed at all. Panel-level Krippendorff's α (ordinal) = 0.639, i.e. nontrivial but limited agreement.

Quick context on what's in the writeup and what isn't:

- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict — every number is direct label equality.

- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."

- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh — recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.

- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.

- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.

Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.
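
For anyone who wants to sanity-check the headline number, here is a minimal sketch of the disagreement flag and a Wilson interval. It is not the paper's code: the function names are hypothetical, and the count of 670 is an assumption read off "67% of 1,000". The flag follows the definition above (a dissent from the majority, or no majority at all), which for a fixed panel works out to "the labels are not unanimous".

```python
from collections import Counter
from math import sqrt


def has_disagreement(labels):
    """True if any model dissents from a strict majority, or no majority forms."""
    top_count = Counter(labels).most_common(1)[0][1]
    has_majority = top_count > len(labels) / 2
    someone_dissents = top_count < len(labels)
    # Both branches reduce to "not unanimous"; they are kept separate
    # only to mirror the wording of the definition.
    return (not has_majority) or someone_dissents


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumes iid trials)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 670 of 1,000 claims flagged as disagreements.
low, high = wilson_interval(670, 1000)
print(f"{low:.2f} to {high:.2f}")  # roughly 0.64 to 0.70
```

The interval treats the 1,000 claims as independent draws, which is exactly the assumption critique (a) questions; clustered claims would widen it.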

Permanent archive: https://doi.org/10.5281/zenodo.20344847

airstrike•26m ago

Nice work. Sonar who?

kostaj•9m ago

sonar-pro for the retrieval capabilities

simonw•6m ago

It's one of Perplexity's search-tools-using models.

https://docs.perplexity.ai/docs/agent-api/models

jiggawatts•18m ago

Many of the rows in that spreadsheet reference "current events", which models aren't expected to handle much better than a human making an educated guess! They all have cutoff dates either last year or early this year and know nothing about what happened in "April 2026".

This is doubly problematic because you evaluated earlier models like Gemini Pro 3 instead of 3.1, GPT 5.4 instead of 5.5, etc...

Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?

Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.
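
A minimal sketch of what that repeated-sampling stat could look like, assuming you already have the parsed labels from 'n' runs of one model on one claim. The function name and the sample labels are made up for illustration; nothing here comes from the paper.

```python
from collections import Counter


def self_consistency(run_labels):
    """Return (modal label, share of runs that produced it) for one model on one claim."""
    label, count = Counter(run_labels).most_common(1)[0]
    return label, count / len(run_labels)


# Hypothetical labels from four runs of the same model on the same claim.
runs = ["False", "False", "Misleading", "False"]
print(self_consistency(runs))  # ('False', 0.75)
```

A model whose share is well below 1.0 on a claim is disagreeing with itself, which would put a noise floor under any between-model disagreement figure.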

kostaj•10m ago

Two of the models used have retrieval capabilities and have access to newer information through search. The other three are parametric.

christophilus•29m ago

They get more human by the day.

kilroy123•26m ago

This made me chuckle.

This brings up a very valid point, though. So many _humans_ can't agree on what the facts are these days. It seems to be getting worse. Not sure of the solution.

embedding-shape•25m ago

> So many _humans_ can't agree on what the facts are these days.

Ask ten people what "knowledge" is, and they'll come up with ten different answers. Go back 10, 50 or 100 years and humanity was struggling with exactly the same issue, as it has for a very long time. There is even an entire field of study literally just for trying to figure out what "knowledge" is: https://en.wikipedia.org/wiki/Epistemology

ipunchghosts•29m ago

I think ppl only care about how Claude or codex does.

airstrike•27m ago

I agree but the market is pricing way beyond that

spprashant•23m ago

Looks like they land at the average number of 67% disagreement.

spacebacon•27m ago

And they could all see exactly why if they chose to. https://huggingface.co/spaces/RiverRider/srt-introspect

embedding-shape•26m ago

> These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform.

Cool.

I wonder if any of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There is even an "11. Ethics & data use" section, and the research is about LLMs being fallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.

kostaj•20m ago

Data collection and processing was done manually. LLMs helped with the report drafting. Everything was human reviewed before publishing.

embedding-shape•18m ago

So it's not a secret; why don't you add this upfront to the report? The report itself is even about LLMs, so it makes a lot of sense to disclose your usage of them for writing the report, especially when you're presenting evidence that boils down to LLMs being fallible.

kostaj•12m ago

It's an omission on my side. Will add in the next version.

ars2424•25m ago

Would love to learn more

simonw•25m ago

Here's the prompt they used:

  Classify this claim as of <date>: "<atomic claim>"

  Output exactly one label: True,
  Mostly True, Misleading, or False.
  No explanations, no qualifiers.
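
As a rough sketch of the harness this implies: the template text below is copied from the prompt above, but the helper names, the date format, and the exact-match parsing rule are assumptions, not the study's actual code.

```python
LABELS = ("True", "Mostly True", "Misleading", "False")

PROMPT_TEMPLATE = (
    'Classify this claim as of {date}: "{claim}"\n\n'
    "Output exactly one label: True,\n"
    "Mostly True, Misleading, or False.\n"
    "No explanations, no qualifiers."
)


def build_prompt(date, claim):
    """Fill the template with a date string and one atomic claim."""
    return PROMPT_TEMPLATE.format(date=date, claim=claim)


def parse_label(raw):
    """Map a raw reply to one of LABELS, or None if it is not an exact match.

    Assumed rule: ignore case, surrounding whitespace, and trailing
    punctuation or quotes; anything else counts as unparseable.
    """
    cleaned = raw.strip().strip(".\"'").casefold()
    for label in LABELS:
        if cleaned == label.casefold():
            return label
    return None


print(build_prompt("2026-05-18", "All almonds are grown in the U.S. state of California."))
print(parse_label(" mostly true.\n"))  # Mostly True
```

How strict the parser is matters: a looser rule (say, matching the first label that appears anywhere in the reply) would change which replies count at all.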

The claims look like this: https://lenz.io/research/llm-disagreement/data.csv

I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".

I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.

The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

The prompt lacks any kind of rubric to clarify how those terms should be applied.

As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.

Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."

The models were split between "true" and "mostly true". Given the "among the most" language, either of those answers means effectively the same thing.

harpastum•17m ago

Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.

Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?

This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
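
One way to separate the two effects, sketched under a loud assumption: that "True" and "Mostly True" sit on one side and "Misleading" and "False" on the other. That grouping is mine, not the paper's, and the point above about "misleading" being compatible with either side is exactly why it is debatable. The names are hypothetical.

```python
# Assumed grouping of the four buckets into two sides.
SIDE = {
    "True": "true-ish",
    "Mostly True": "true-ish",
    "Misleading": "false-ish",
    "False": "false-ish",
}


def disagreement_kind(labels):
    """Classify one claim's panel labels as 'unanimous', 'same-side', or 'polarity'."""
    if len(set(labels)) == 1:
        return "unanimous"
    if len({SIDE[label] for label in labels}) == 1:
        # Models differ only in how strongly they word the same verdict.
        return "same-side"
    return "polarity"


# The almond example from upthread: four "False", one "Misleading".
print(disagreement_kind(["False"] * 4 + ["Misleading"]))  # same-side
```

Reporting the share of "polarity" claims next to the 67% would show how much of the disagreement survives once the "mostly"/"misleading" wording is factored out.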

apples_oranges•25m ago

That's better than all agreeing on the wrong answer, however.

f_devd•24m ago

Inject some adversarial priming as is in actual usage, and you can probably get that number to >=95%

kostaj•21m ago

Our experience with Lenz is that forcing a multi-step process, incl. adversarial debates, helps improve the verdicts.

andai•22m ago

This is an odd one. The paper is real, but was written by Claude? I am assuming OP is human, but also appears to be using Claude to post.

proofofcontempt•13m ago

Let's be real, we all asked Claude to summarise this because it was written by Claude

bobosmrad•20m ago

looking at the claims i would say 5 humans would disagree even more than the llms

some of the claims where llms disagree:

"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."

"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."

"Neptune Deep will start delivering natural gas in 2027."

"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."

"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."

pjc50•9m ago

> "Neptune Deep will start delivering natural gas in 2027."

This is a "forward-looking statement", and presents special problems because you cannot really evaluate it until that date. You can only assign "likely or unlikely".

simonw•8m ago

If you are an LLM with a knowledge cutoff in the past and no access to a search tool the only correct answer to "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia" is "this claim is impossible for me to verify". And that wasn't an option.

ecshafer•6m ago

These "Facts" are interesting. "Neptune Deep will start delivering natural gas in 2027." for example is not a fact, it's a prediction. "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." is less of a fact and more of a litmus test for which sources of information you trust.

proofofcontempt•15m ago

What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where data is not included in their training sets. This doesn't appear to have much substance

101008•8m ago

Unfortunately, most people are not aware of this and treat LLMs as a superpowered brain that knows everything and can do everything.

throw310822•13m ago

Not sure I'm understanding this. The models are asked to evaluate the truth of random claims out of their own head (except for Gemini with search grounding)? Isn't it exactly the same as asking people to play any quiz game and then rating them as "they disagree n% of the time"?

The output buckets are also pretty questionable- the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?

bayarearefugee•13m ago

(Brought to you by) Lenz...? a crummy commercial...?

...son of a bitch

Razengan•12m ago

Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights to a certain city that has had a new airport since around December 2025.

It said the airport code didn't exist

I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.

rastrojero2000•10m ago

Given that models are fundamentally incapable of comprehending what truths or falsehoods are beyond their location in their self made representational space, it's actually pretty impressive that they managed to make it not a cointoss. That 17% right there is thousands of man-hours poured over making the word vomiting process slightly closer to whatever their little ports say is happening in reality.

utopiah•6m ago

Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks if not days. It will get integrated back into the fold and it will look "solved", but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be hailed as great "progress" that will "change everything".

PS: yes, I might or might not have a degree in corporate strategy & PR.
