This brings up a very valid point, though. So many _humans_ can't agree on what the facts are these days. It seems to be getting worse. Not sure of the solution.
Ask ten people what "knowledge" is, and they'll come up with ten different answers. Go back 10, 50 or 100 years and humanity struggled with exactly the same issue, as it has for a very long time. There is even an entire field of study literally just for trying to figure out what "knowledge" is: https://en.wikipedia.org/wiki/Epistemology
Cool.
I wonder if any of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There is even an "11. Ethics & data use" section, and the research is about LLMs being infallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.
Classify this claim as of <date>: "<atomic claim>"
Output exactly one label: True, Mostly True, Misleading, or False.
No explanations, no qualifiers.
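For anyone wanting to poke at this themselves, here is a minimal sketch of how the prompt above could be filled in and how a reply might be parsed into one of the four labels. The function names and the strictness of the parser are my assumptions, not the study's actual harness.

```python
import re
from typing import Optional

# The four labels named in the prompt above.
LABELS = ("True", "Mostly True", "Misleading", "False")

# Template reconstructed from the quoted prompt; {date} and {claim} stand in
# for the <date> and <atomic claim> placeholders.
PROMPT_TEMPLATE = (
    'Classify this claim as of {date}: "{claim}"\n'
    "Output exactly one label: True, Mostly True, Misleading, or False.\n"
    "No explanations, no qualifiers."
)


def build_prompt(date: str, claim: str) -> str:
    """Fill the template for a single claim."""
    return PROMPT_TEMPLATE.format(date=date, claim=claim)


def parse_label(reply: str) -> Optional[str]:
    """Return the canonical label, or None if the reply is not exactly one label.

    Hypothetical parsing rule: ignore case, whitespace, quotes, periods and
    asterisks, and reject anything that carries extra words.
    """
    cleaned = re.sub(r"[\s\"'.*]+", " ", reply).strip().lower()
    for label in LABELS:
        if cleaned == label.lower():
            return label
    return None
```

How strict the parser is matters: a reply like "True, although..." is rejected here rather than counted as "True", and a different choice would shift the agreement numbers.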
The claims look like this: https://lenz.io/research/llm-disagreement/data.csv
I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".
I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.
The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
The prompt lacks any kind of rubric to clarify how those terms should be applied.
As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.
Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."
The models were split between "true" and "mostly true". Given the "among the most" language, either of those answers means effectively the same thing.
Update 2: a much better example:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.
The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
Some of the claims where LLMs disagree:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."
"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."
"Neptune Deep will start delivering natural gas in 2027."
"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."
"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."
This is a "forward-looking statement", and presents special problems because you cannot really evaluate it until that date. You can only assign "likely or unlikely".
Questions like "is mouthwash effective" presumably have one solid data source: medical journals.
The output buckets are also pretty questionable. The difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?
...son of a bitch
It said the airport code didn't exist
I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.
PS: yes, I might or might not have a degree in corporate strategy & PR.
All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.
Take just one random example: `Hostels in Kota, Rajasthan commonly use caged ceiling fans as a preventive measure against student suicides`
While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may be a verifiable fact (though I doubt there are any statistics for verification, but let's say there are), `a preventive measure against student suicides` is a claim that no one can prove. It can be just a belief at most.
Argh. Did Biden steal Trump's 2nd term? Truth or fact or claim?
I understand why you prompted them to output exactly one label, but I'd bet if you'd asked a parametric or parametric "thinking" model to answer eg "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." [1] many would say something to the effect of "May 18 is after my knowledge cutoff, so I don't know. But based on the state of the war, the distance from Moscow to Ukraine, and drone range the best option might be...[TRUE]"
Would also be interesting to add a virtual model that is simply the majority of all models and see how much the individual models differ from the "consensus".
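A rough sketch of that "virtual majority model" idea, assuming the labels are available as a mapping from claim to per-model label. The data layout, the function names and the tie rule (claims with no unique majority are skipped) are all my assumptions.

```python
from collections import Counter


def consensus_label(labels):
    """Most common label, or None when there is no unique winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie for first place: no consensus
    return counts[0][0]


def deviation_from_consensus(votes):
    """votes: {claim_id: {model_name: label}}.

    Returns {model_name: share of consensus-bearing claims on which the model
    differs from the majority label}. Assumes every model labels every claim.
    """
    differs = Counter()
    models = set()
    scored = 0
    for per_model in votes.values():
        models.update(per_model)
        majority = consensus_label(list(per_model.values()))
        if majority is None:
            continue  # skip claims without a unique majority
        scored += 1
        for model, label in per_model.items():
            if label != majority:
                differs[model] += 1
    if scored == 0:
        return {}
    return {model: differs[model] / scored for model in sorted(models)}
```

Note that the majority is still not ground truth: a model that deviates from it may be the only one that is right.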
Do you plan to add some sources in the related work section for baseline numbers on human expert disagreement in fact-checking tasks? (I'm assuming such studies exist.)
Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?
This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
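One way to separate "disagreeing about wording" from "disagreeing about the fact" would be to look at how far apart the labels are. This is a minimal sketch that assumes the four labels form an ordered scale True > Mostly True > Misleading > False; that ordering and the "adjacent"/"distant" names are my assumptions, not something the prompt defines.

```python
# Assumed ordinal positions of the four labels.
ORDER = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def label_spread(labels):
    """Distance between the most and least favourable label given to a claim."""
    ranks = [ORDER[label] for label in labels]
    return max(ranks) - min(ranks)


def classify_disagreement(labels):
    """'agree' if all labels match, 'adjacent' if they span neighbouring
    buckets only, 'distant' otherwise."""
    spread = label_spread(labels)
    if spread == 0:
        return "agree"
    if spread == 1:
        return "adjacent"
    return "distant"
```

Under this scheme the Egypt visa claim (split between "true" and "mostly true") counts as adjacent, while the drone attack claim (split between true and false) counts as distant. Reporting the two rates separately would show how much of the headline figure is a bucket-boundary effect.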
Less important than the harness, is the system/user prompts themselves (which of course, are put in the harness), which is effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more the same to each other, as the biggest/best models have more or less identical strong prompt-adherence in my experience.
It seems to me that for many newspapers the bar is now significantly lower, at something like "not quite entirely untrue"
I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
A few examples:
> Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India.
> In the Libra clubs' contract with Grupo Globo for broadcast rights through 2029, the audience-revenue distribution equals 30% of the fixed amount the clubs receive.
This isn't misleading, it's flat out false. Characterizing misleading as also acceptable isn't valid here. If you go and ask anyone on the street if this is true, false or misleading, I'm sure almost everyone would say it's false. After all, I can grow almonds myself.
I don’t understand your point. That claim is factually false and as such it’s easy to logically reply “false”. What’s the nuance here? I can’t see any
If LLMs are really supposed to be as consistently useful as they’re made out to be they should all spit out “false.”
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPLSv1P8Tqm...
This test is of only marginal utility in the real world compared to an AI with access to the web. While I wouldn't expect an AI with access to the web to result in Platonic Truth any more than it would in the hands of a human, it would probably get a lot closer to something humanlike.
I recall about a year ago how we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply directly querying AIs or turning them loose on the web. The former is what this is testing and is fairly transparently stupid, just by an information theoretic argument that the AIs simply can't contain all the answers to every query in them, they're just not large enough (and really can't be, practically). I've had good results with the latter, when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find are often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, but just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. I wouldn't expect a human to do better in a casual overview, not that the result is perfect.
As a well known commentator on all things LLM...Will you publicly commit here, to try to reproduce the study, and make a post on how your percentages might differ or agree?
My comment here was meant to save people time in understanding the study. I was entirely open about what I did, and provided tools to help other people come to their own conclusions.
I don't think I need to spend more time on this than I have.
I expect the models are inferring quite a bit from the short prompt, and with structured outputs it would be quite easy to have them give the one word response in one field and explain why in another
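To make the structured-output idea concrete, here is a sketch of a two-field response: a JSON Schema that could be handed to whatever structured-output feature a provider offers, plus a small validator for the reply. The schema, field names and `parse_verdict` are hypothetical; how a schema is actually passed to a model is provider-specific and not shown.

```python
import json

LABELS = ["True", "Mostly True", "Misleading", "False"]

# Hypothetical schema: one field for the forced label, one for the reasoning.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": LABELS},
        "explanation": {"type": "string"},
    },
    "required": ["label", "explanation"],
    "additionalProperties": False,
}


def parse_verdict(raw: str):
    """Validate a JSON reply against the shape above; return (label, explanation)."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"label", "explanation"}:
        raise ValueError("reply must contain exactly 'label' and 'explanation'")
    if data["label"] not in LABELS:
        raise ValueError(f"unknown label: {data['label']!r}")
    if not isinstance(data["explanation"], str):
        raise ValueError("explanation must be a string")
    return data["label"], data["explanation"]
```

The label field can still be compared across models exactly as before, while the explanation field shows whether two models that picked different labels actually disagree about the underlying fact.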
I would answer “don’t know” on many, but that’s not an option.
The "majority" in this case meaning about 51%, according to Wikipedia[1]? How could 51% ever be considered to be close to "all", such that "misleading" would be a valid answer?
Am I missing something?
If you argue this, you would be arguing against reality and the English language so as to not upset AI. It's important to understand that AI is very much fallible.
between the bad methodology and bad 'fact' selection and ai-written report, this whole thing is frankly worthless as an assessment.
kostaj•1h ago
Quick context on what's in the writeup and what isn't:
- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict — every number is direct label equality.
- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."
- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh — recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.
- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.
- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.
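For readers who want to recompute the headline number from the CSV, "direct label equality" as described in the list above could be sketched like this. The row layout (one dict of model to parsed label per claim) and the function names are assumptions, not the paper's code.

```python
from itertools import combinations


def any_disagreement_rate(rows):
    """Share of claims on which the models do not all give the same label.

    rows: list of {model_name: label}, one dict per claim.
    """
    return sum(len(set(row.values())) > 1 for row in rows) / len(rows)


def pairwise_agreement(rows):
    """Share of claims on which each pair of models gives the same label.

    Assumes every row contains the same set of models.
    """
    models = sorted(rows[0])
    return {
        (a, b): sum(row[a] == row[b] for row in rows) / len(rows)
        for a, b in combinations(models, 2)
    }
```

The first function is the "at least one model differs" style of figure; the pairwise table shows whether the disagreement is spread evenly or concentrated in one model.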
Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.
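For context on critique (a), this is the standard Wilson score interval for a proportion, as a small standard-library sketch. It treats claims as independent draws, which is exactly the assumption flagged above as probably optimistic; the function name and the default z = 1.96 (roughly 95% coverage) are my choices.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of claims with the property (e.g. any disagreement)
    n: total number of claims, assumed independent
    z: normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

If claims cluster by topic or news event, the effective sample size is smaller than n, so the real interval is wider than what this returns. Resampling whole clusters rather than individual claims would be one way to check how much wider.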
Permanent archive: https://doi.org/10.5281/zenodo.20344847
airstrike•49m ago
kostaj•32m ago
simonw•29m ago
https://docs.perplexity.ai/docs/agent-api/models
jiggawatts•40m ago
This is doubly problematic because you evaluated earlier models like Gemini Pro 3 instead of 3.1, GPT 5.4 instead of 5.5, etc...
Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?
Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.
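A sketch of what that repeated-sampling check might look like. `ask` is a hypothetical callable that sends a prompt to one model and returns its parsed label; no real API is assumed.

```python
from collections import Counter


def run_repeated(ask, prompt, n):
    """Call the hypothetical `ask(prompt) -> label` function n times."""
    return [ask(prompt) for _ in range(n)]


def self_consistency(samples):
    """Share of repeated runs that match the model's own most common label."""
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)
```

If a model only agrees with itself on, say, three runs out of four for a given claim, then part of the cross-model disagreement on that claim is sampling noise rather than a difference between models.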
kostaj•33m ago