Five frontier LLMs disagree on 67% of 1k real-world fact-check claims

https://lenz.io/research/llm-disagreement
91•kostaj•52m ago

Comments

kostaj•50m ago
Author here. 67% (95% CI 64–70%) of 1,000 recent real user claims to a fact-checking platform had at least one of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro+Search, and Sonar Pro dissent from the panel majority — or no majority formed at all. Panel-level Krippendorff's α (ordinal) = 0.639, i.e. nontrivial but limited agreement.

Quick context on what's in the writeup and what isn't:

- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict — every number is direct label equality.

- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."

- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh — recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.

- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.

- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.
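
The headline metric described in the bullets above can be sketched in a few lines. This is a minimal illustration, assuming each claim is a list of five already-parsed labels and that "majority" means a strict majority; the function names are hypothetical and not taken from the writeup.

```python
from collections import Counter

def has_disagreement(labels):
    """Flag a claim when no strict majority forms or any model dissents from it."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    has_majority = top_count > len(labels) / 2  # assumption: strict majority
    any_dissent = top_count < len(labels)
    # Under this definition the flag reduces to "the panel is not unanimous".
    return (not has_majority) or any_dissent

def disagreement_rate(panel_rows):
    """Share of claims flagged by has_disagreement."""
    return sum(has_disagreement(row) for row in panel_rows) / len(panel_rows)

# Made-up rows for illustration only, not taken from the dataset.
rows = [
    ["False", "False", "False", "False", "False"],
    ["False", "False", "False", "False", "Misleading"],
    ["True", "True", "False", "False", "Misleading"],
]
rate = disagreement_rate(rows)
```

Because the comparison is plain label equality, a True vs Mostly True split counts exactly as much as a True vs False split in this number.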

Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.

Permanent archive: https://doi.org/10.5281/zenodo.20344847
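
For readers who want to check the interval, here is a rough sketch of a Wilson score interval. It assumes exactly 670 of 1,000 claims were flagged (the thread only gives the rounded 67%) and it treats claims as independent, which is the same iid assumption questioned in critique (a).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# 670 is an assumed count inferred from the rounded 67% figure.
low, high = wilson_interval(670, 1000)
print(f"{low:.3f} to {high:.3f}")
```

If claims cluster by topic or news event, the effective sample size is smaller than 1,000 and the true interval would be wider than this one.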

airstrike•26m ago
Nice work. Sonar who?
kostaj•9m ago
sonar-pro for the retrieval capabilities
simonw•6m ago
It's one of Perplexity's search-tools-using models.

https://docs.perplexity.ai/docs/agent-api/models

jiggawatts•18m ago
Many of the rows in that spreadsheet reference "current events", which models aren't expected to do much better at than a human making an educated guess! They all have cutoff dates either last year or early this year and know nothing about what happened in "April 2026".

This is doubly problematic because you evaluated earlier models like Gemini Pro 3 instead of 3.1, GPT 5.4 instead of 5.5, etc...

Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?

Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.

kostaj•10m ago
Two of the models used have retrieval capabilities and have access to newer information through search. The other three are parametric.
christophilus•29m ago
They get more human by the day.
kilroy123•26m ago
This made me chuckle.

This brings up a very valid point, though. So many _humans_ can't agree on what the facts are these days. It seems to be getting worse. Not sure of the solution.

embedding-shape•25m ago
> So many _humans_ can't agree on what the facts are these days.

Ask ten people what "knowledge" is, and they'll come up with ten different answers. Go back 10, 50 or 100 years and humanity was struggling with exactly the same issue, as it has for a very long time. There is even an entire field of study literally just for trying to figure out what "knowledge" is: https://en.wikipedia.org/wiki/Epistemology

ipunchghosts•29m ago
I think ppl only care about how Claude or codex does.
airstrike•27m ago
I agree but the market is pricing way beyond that
spprashant•23m ago
Looks like they land at the average number of 67% disagreement.
spacebacon•27m ago
And they could all see exactly why if they chose to. https://huggingface.co/spaces/RiverRider/srt-introspect
embedding-shape•26m ago
> These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform.

Cool.

I wonder if any of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There is even an "11. Ethics & data use" section, and the research is about LLMs being fallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.

kostaj•20m ago
Data collection and processing was done manually. LLMs helped with the report drafting. Everything was human reviewed before publishing.
embedding-shape•18m ago
So it's not a secret; why don't you add this upfront to the report? The report itself is about LLMs, so it makes a lot of sense to disclose your usage of them for writing it, especially when you're presenting evidence that boils down to LLMs being fallible.
kostaj•12m ago
It's an omission on my side. Will add in the next version.
ars2424•25m ago
Would love to learn more
simonw•25m ago
Here's the prompt they used:

  Classify this claim as of <date>: "<atomic claim>"

  Output exactly one label: True,
  Mostly True, Misleading, or False.
  No explanations, no qualifiers.

The claims look like this: https://lenz.io/research/llm-disagreement/data.csv

I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".

I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.

The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

The prompt lacks any kind of rubric to clarify how those terms should be applied.

As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.

Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."

The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.

harpastum•17m ago
Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.

Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?

This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.

apples_oranges•25m ago
That's better than all agreeing on the wrong answer, however.
f_devd•24m ago
Inject some adversarial priming as is in actual usage, and you can probably get that number to >=95%
kostaj•21m ago
Our experience with Lenz is that forcing a multi-step process, incl. adversarial debates, helps improve the verdicts.
andai•22m ago
This is an odd one. The paper is real, but was written by Claude? I am assuming OP is human, but also appears to be using Claude to post.
proofofcontempt•13m ago
Let's be real, we all asked Claude to summarise this because it was written by Claude
bobosmrad•20m ago
looking at the claims i would say 5 humans would disagree even more than the llms

some of the claims where llms disagree:

"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."

"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."

"Neptune Deep will start delivering natural gas in 2027."

"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."

"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."

pjc50•9m ago
> "Neptune Deep will start delivering natural gas in 2027."

This is a "forward-looking statement", and presents special problems because you cannot really evaluate it until that date. You can only assign "likely or unlikely".

simonw•8m ago
If you are an LLM with a knowledge cutoff in the past and no access to a search tool the only correct answer to "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia" is "this claim is impossible for me to verify". And that wasn't an option.
ecshafer•6m ago
These "Facts" are interesting. "Neptune Deep will start delivering natural gas in 2027." for example is not a fact, it's a prediction. "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." is less of a fact and more of a litmus test for which sources of information you trust.
proofofcontempt•15m ago
What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where data is not included in their training sets. This doesn't appear to have much substance
101008•8m ago
Unfortunately most people are not aware of this and treat LLMs as a superpowered brain that knows everything and can do everything.
throw310822•13m ago
Not sure I'm understanding this. The models are asked to evaluate the truth of random claims out of their own head (except for Gemini with search grounding)? Isn't it exactly the same as asking people to play any quiz game and then rating them as "they disagree n% of the time"?

The output buckets are also pretty questionable- the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?

bayarearefugee•13m ago
(Brought to you by) Lenz...? a crummy commercial...?

...son of a bitch

Razengan•12m ago
Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights to a certain city that has recently had a new airport since like December 2025

It said the airport code didn't exist

I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.

rastrojero2000•10m ago
Given that models are fundamentally incapable of comprehending what truths or falsehoods are beyond their location in their self made representational space, it's actually pretty impressive that they managed to make it not a cointoss. That 17% right there is thousands of man-hours poured over making the word vomiting process slightly closer to whatever their little ports say is happening in reality.
utopiah•6m ago
Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks if not days. It will get integrated back into the fold, it will look "solved" but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be proclaimed as great "progress" that will "change everything".

PS: yes, I might or might not have a degree in corporate strategy & PR.

simonw•6m ago
Comparing models with search tools to models without - when there's no option for "I am unable to answer this question without access to search" - doesn't make sense to me.
throw310822•5m ago
The title mentions "fact-checks", but "fact checking" is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game. So a more honest title for this research would be "Models answer differently to quiz questions".
LeifCarrotson•6m ago
I don't think that current LLMs really need an abstain option, they'll give an answer regardless of whether they're confident or not. I hope that future LLMs will, and will know when to use it.

I understand why you prompted them to output exactly one label, but I'd bet if you'd asked a parametric or parametric "thinking" model to answer eg "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." [1] many would say something to the effect of "May 18 is after my knowledge cutoff, so I don't know. But based on the state of the war, the distance from Moscow to Ukraine, and drone range the best option might be...[TRUE]"

[1]: https://lenz.io/c/130f1005

kriro•6m ago
I don't see it mentioned explicitly in the methods section but I assume you prompted each model only once for each question? Did you consider prompting n-times in blank states to see if the models even agree with themselves?

Would also be interesting to add a virtual model that is simply the majority of all models and see how much the individual models differ from the "consensus".

Do you plan to add some sources in the related work section with baseline numbers for human expert disagreement in fact-checking tasks (I'm assuming such studies exist)?
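
The two checks suggested above (repeat each question n times, and compare each model against a virtual majority model) could look roughly like this. It is only a sketch: the helper names are hypothetical, labels are assumed to be parsed already, and "consensus" is assumed to mean a strict panel majority.

```python
from collections import Counter

def self_agreement(repeats):
    """Share of repeated runs that match the modal label for one model on one claim."""
    return Counter(repeats).most_common(1)[0][1] / len(repeats)

def consensus_label(labels):
    """Strict-majority label across the panel, or None when no majority forms."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

def deviation_from_consensus(panel_rows, model_index):
    """Fraction of claims that have a consensus where one model differs from it."""
    scored = [(row[model_index], consensus_label(row)) for row in panel_rows]
    scored = [(own, cons) for own, cons in scored if cons is not None]
    if not scored:
        return float("nan")
    return sum(own != cons for own, cons in scored) / len(scored)

# Made-up rows for illustration only; the third has no strict majority.
rows = [
    ["False", "False", "False", "Misleading", "False"],
    ["True", "True", "True", "True", "True"],
    ["True", "True", "False", "False", "Misleading"],
]
```

Claims without a strict majority are skipped in the deviation score rather than counted against any model, which is one of several defensible choices.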

embedding-shape•15m ago
> I guess the goal is to test the models and not the harness

Less important than the harness, is the system/user prompts themselves (which of course, are put in the harness), which is effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more the same to each other, as the biggest/best models have more or less identical strong prompt-adherence in my experience.

pjc50•12m ago
If you can consistently construct "true but misleading" content, you may be qualified to work at a major newspaper.
wongarsu•7m ago
Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false.

I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified

tosh•17m ago
ty for digging this up, appreciate the time saving
singpolyma3•16m ago
False vs misleading doesn't seem like a disagreement?
wongarsu•11m ago
According to the benchmark it is. "Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False)"
kostaj•3m ago
Yes, they are much closer verdicts. True and Mostly True are also close. Used Krippendorff's α (ordinal) to not penalize much closer disagreements. 21% of the claims have models that are on the polar opposite sides - at least one True, and at least one False.
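
To make the ordinal weighting concrete, here is a sketch of Krippendorff's α with the ordinal distance metric, plus the "polar opposite" check. The ordering False < Misleading < Mostly True < True is an assumption about how the scale was ranked, and this is a textbook-style implementation, not the code used for the paper.

```python
from itertools import permutations

# Assumed ordinal ranking of the four buckets, lowest to highest.
ORDER = ["False", "Misleading", "Mostly True", "True"]
RANK = {label: i for i, label in enumerate(ORDER)}

def krippendorff_alpha_ordinal(panel_rows):
    """Krippendorff's alpha with ordinal distances; rows are lists of labels per claim."""
    k = len(ORDER)
    coincidence = [[0.0] * k for _ in range(k)]
    for row in panel_rows:
        ranks = [RANK[label] for label in row]
        m = len(ranks)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(ranks, 2):  # ordered pairs of distinct raters
            coincidence[a][b] += 1.0 / (m - 1)
    totals = [sum(r) for r in coincidence]
    n = sum(totals)

    def delta_sq(c, d):
        # Ordinal metric: mass between the two categories, end points counted half.
        return (sum(totals[c:d + 1]) - (totals[c] + totals[d]) / 2) ** 2

    pairs = [(c, d) for c in range(k) for d in range(c + 1, k)]
    observed = sum(coincidence[c][d] * delta_sq(c, d) for c, d in pairs)
    expected = sum(totals[c] * totals[d] * delta_sq(c, d) for c, d in pairs)
    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1.0 - (n - 1) * observed / expected

def is_polar_split(row):
    """True when a claim has at least one True and at least one False verdict."""
    return "True" in row and "False" in row
```

With this metric a True vs Mostly True split lowers α far less than a True vs False split, while `is_polar_split` isolates the claims where models land on opposite ends of the scale.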
kostaj•15m ago
Used "No explanations, no qualifiers." to force the models to answer only with one of the four labels. It's worth running a separate test with more explanation in the prompt on how to classify between the four buckets.
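
A sketch of what that harness step might look like: the prompt wording follows the one quoted upthread, while the optional rubric argument, the function names, and the strict parsing rule are assumptions added for illustration.

```python
LABELS = ("True", "Mostly True", "Misleading", "False")

def build_prompt(claim, as_of, rubric=None):
    """Fill the quoted prompt template; rubric is a hypothetical optional addition."""
    lines = [f'Classify this claim as of {as_of}: "{claim}"', ""]
    if rubric:
        lines += [rubric, ""]
    lines += [
        "Output exactly one label: True,",
        "Mostly True, Misleading, or False.",
        "No explanations, no qualifiers.",
    ]
    return "\n".join(lines)

def parse_label(reply):
    """Return the canonical label only if the reply is exactly one label, else None."""
    cleaned = reply.strip().strip("\"'.").strip().casefold()
    for label in LABELS:
        if cleaned == label.casefold():
            return label
    return None  # anything with extra words is treated as unparseable

prompt = build_prompt("All almonds are grown in the U.S. state of California.", "2026-05-01")
```

Requiring an exact match avoids silently reading "True, but with caveats" as True; passing a rubric string would allow the separate test with bucket definitions mentioned above.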
malfist•14m ago
> All almonds are grown in the U.S. state of California

This isn't misleading, it's flat out false. Characterizing misleading as also acceptable isn't valid here. If you go and ask anyone on the street if this is true, false or misleading, I'm sure almost everyone would say it's false. After all, I can grow almonds myself.

jannyfer•14m ago
Thank you, my eyes glazed over when I saw the article was written with AI.
camillomiller•13m ago
>> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

I don’t understand your point. That claim is factually false and as such it’s easy to logically reply “false”. What’s the nuance here? I can’t see any

kordlessagain•12m ago
Give a model a crawler tool (like Grub.nuts.services) and your "problem" goes away.
Forgeties79•12m ago
I really don’t buy the almond explanation you’re giving. That requires the level of logic a kindergartener has. It’s a very simple all or nothing question.

If LLMs are really supposed to be as consistently useful as they’re made out to be they should all spit out “false.”

andai•9m ago
Thanks. The first link is a spreadsheet. Here's a web-readable version.

https://docs.google.com/spreadsheets/d/e/2PACX-1vSPLSv1P8Tqm...

jerf•7m ago
This seems like another case where the models are acting like humans. Assuming they were not allowed to search the web, I wouldn't expect the models to necessarily have detailed information about all of these things directly in their training set. As large as they are, they are only so large, and they only have so much room for "information storage" in them, and there's a lot more things they need to fit into their numbers.

This test is of only marginal utility in the real world compared to an AI with access to the web. While I wouldn't expect an AI with access to the web to result in Platonic Truth any more than it would in the hand of a human, it would probably get a lot closer to something humanlike.

I recall about a year ago how we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply directly querying AIs or turning them loose on the web. The former is what this is testing and is fairly transparently stupid, just by an information theoretic argument that the AIs simply can't contain all the answers to every query in them, they're just not large enough (and really can't be, practically). I've had good results with the latter, when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find are often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, but just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. I wouldn't expect a human to do better in a casual overview, not that the result is perfect.

johnbarron•2m ago
Your reply would have more credibility if, instead of commenting on this 25 min after it was posted just to nitpick some of the questions, you had tried to reproduce the research.

As a well known commentator on all things LLM...Will you publicly commit here, to try to reproduce the study, and make a post on how your percentages might differ or agree?
