I'd also hesitate to attribute this difference in hallucination rates purely to model size. Yes, GLM-5.2 hallucinates much less frequently than DeepSeek-V4 Pro, which has twice as many parameters, but DeepSeek-V4 Flash is less than half the size of GLM-5.2 and tops the AA-Omniscience hallucination index. Opus 4.8, which is likely larger than DeepSeek-V4 Pro, has a 36% hallucination rate on the index, above GLM-5.2's 28% but well below the DeepSeek numbers. Opus also has a 47% accuracy rate vs GLM-5.2's 25%. If you use these numbers to calculate the absolute hallucination rate (i.e., the number of hallucinated responses divided by the total number of responses), you get 19% for Opus and 21% for GLM-5.2.
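For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the index's hallucination rate is the share of non-correct responses that were wrong answers rather than abstentions, so the absolute rate works out to (1 - accuracy) × hallucination rate. The function name is mine, not something from the benchmark.

```python
def absolute_hallucination_rate(accuracy: float, hallucination_rate: float) -> float:
    """Share of ALL responses that are hallucinated.

    Assumption: hallucination_rate is measured only over the responses
    that were not correct (wrong answers plus abstentions).
    """
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= hallucination_rate <= 1.0):
        raise ValueError("both rates must be fractions between 0 and 1")
    return (1.0 - accuracy) * hallucination_rate


# (accuracy, hallucination rate) as quoted in the comment above
models = {
    "Opus 4.8": (0.47, 0.36),
    "GLM-5.2": (0.25, 0.28),
}

for name, (accuracy, rate) in models.items():
    print(f"{name}: {absolute_hallucination_rate(accuracy, rate):.0%}")
# prints:
# Opus 4.8: 19%
# GLM-5.2: 21%
```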
So yes, all else equal, larger models may be more prone to hallucination in scenarios where they don't know the answer, but there are a lot of other factors that affect hallucination rates, and it's not totally clear that this is the main metric that's worth tracking.
If Opus gets all but the hardest questions right, it might have a higher hallucination rate because the questions it gets wrong are the ones where verification or hallucination detection is the most difficult.
Do you have a cite for this?
If a human makes up some bullshit lie, I wouldn't accuse them of making it up only if they actually knew the correct answer. If you don't know, the only correct answer is "I don't know." Any other answer is made-up bullshit. Why is it a hallucination only if the LLM contains the answer? If you make something up, it's still wrong. It shouldn't matter whether you could have given the correct answer. You didn't, and invented some bullshit instead.
Follow-up question: how can I apply this rule set to the next test I have to take? I'd love to be able to use "I didn't know" as the excuse for why I made something up.
edit:
> and it's not totally clear that this is the main metric that's worth tracking.
I don't know, the rate at which some model is willing to make something up feels useful. If the argument I see repeated so often on HN is that it's impossible to completely get rid of hallucinations, then being able to choose a model that's less likely to invent some lie seems like a positive trait, no?
Either way, I'm happy to agree that a restrictive definition, where a lie doesn't count as a hallucination if the model doesn't know the answer, feels strictly, infinitely less useful than an exact error rate. The percentage of emitted tokens that are misleading would be useful to me. Does anyone know of a group that's attempted to quantify the global error rate?
What about using two models, with a smaller model used for this kind of negative reasoning?
In addition, I think that during RLHF the labs have a bias toward interesting questions that admit a solution, and under-represent the "bad" questions that admit no good answer. They probably also put less effort into RLHF on questions where the model should admit it doesn't know.
As humans we have been trained all our lives, in the real world, by being confronted with questions we can't answer right away, and we have learned to assess very quickly that we don't know or that we are not sure about the answer.
Another thing we have and LLMs don't is fear. We have an amygdala in our brain, separate from the logical-thinking part, that can raise a fear signal so that we become much more careful about what we say. An LLM, on the other hand, has no fear organ like the amygdala and just learns to respond based on the patterns in its training corpus. It never "fears" looking bad or being fired because it gave a wrong answer, so it can merrily give perfectly wrong answers.
So we see that hallucination rates can be improved with training, but currently the labs are not optimizing for that because there is a high-stakes race to get the most intelligent and capable model.
Alternatively, I can imagine creating a separate amygdala-like organ for an LLM that asynchronously fires signals, based on the user prompt and the LLM's thinking trace, to inject a fear signal into the LLM's reasoning so that it can steer its answer toward something safer.
I'm already hallucinating about how this could work and it involves catapults
Hallucinations all the way down...
"they say u hallucinate 3x more than GLM 5.2, whats your comeback to this? do i need to dump u? $article"
GLM 5.2 tends to stray way more than 5.1. It also hallucinates subtly: it morphs requirements and draws unfounded conclusions. This is not something I've experienced in any model I've seen so far.
In coding it's especially annoying because it steers the whole request. E.g. I give the instruction "make me a Rust-WASM-Canvas app" and GLM 5.2 goes like "Oh, the user surely doesn't mean that. I'd better build a Dioxus app instead".
I've had more success with creating a plan first and then implementing it in (short-lived) sub-agents.
Ironically, good software architecture patterns (small functions, single responsibility) heavily impact the performance of these models as well. They do surprisingly well in well-architected codebases.
They do very poorly in anything that's a mess where Opus and GPT 5.5 still get reasonable performance.
From how they measure it, a model that simply answers "I don't know." to any prompt would be the one that hallucinates the least. So it's not surprising at all that a smaller model can perform better.
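A rough sketch of that degenerate case, assuming the metric counts a hallucination only when the model gives a wrong answer instead of abstaining (my reading of the index, not its official scoring code). The labels and the two hypothetical models below are made up for illustration.

```python
from collections import Counter


def score(labels):
    """Return (accuracy, hallucination_rate) for a list of graded responses.

    Each label is "correct", "incorrect", or "abstain".
    Assumed metric: hallucination_rate = incorrect / (incorrect + abstain).
    """
    if not labels:
        raise ValueError("need at least one graded response")
    counts = Counter(labels)
    not_correct = counts["incorrect"] + counts["abstain"]
    accuracy = counts["correct"] / len(labels)
    hallucination_rate = counts["incorrect"] / not_correct if not_correct else 0.0
    return accuracy, hallucination_rate


# Hypothetical model that answers "I don't know." to every prompt
always_abstain = ["abstain"] * 100
# Hypothetical model that actually tries to answer
tries_to_answer = ["correct"] * 50 + ["incorrect"] * 20 + ["abstain"] * 30

print(score(always_abstain))   # (0.0, 0.0)
print(score(tries_to_answer))  # (0.5, 0.4)
```

The always-abstaining model gets a perfect hallucination score while being useless, which is why the rate only means something when read next to accuracy.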
In other words, you shouldn't choose the model that hallucinates the least without detailed prompting, since a well-crafted agents.md clause should go a long way to improving output, and almost certainly the top scoring order will be different. To the point that I don't find this type of raw comparison useful beyond maybe 'make sure you test that one with more explicit prompts'.
"You're prompting it wrong" is quickly becoming the new "you're holding it wrong."
It's wild how willing software engineers are to blame the user when the actual problem is their own defective design.
Ideally we all, as an industry, will stop accepting this as a reasonable excuse for the demonstrated incompetence.
Something about the cost model of US near frontier has the cattle prod out whenever a model is uncertain but thrashes on whether to search. Search flinch is roughly all hallucination.
I don't even wait for the model's turn, if there's a man page or Hoogle hit, stuff the last prefix cache cut point. You come out ahead.
I’m not sure how to explain it, but the more I see LLM-written code the more I feel it’s bad code doing a good job of masquerading as good code. I think this take will become less hot in the next year or two when we see enterprise greenfield projects that were created entirely with LLM “assistance” go to prod. I think we’ll find that the code is difficult for humans to read, understand, debug, and extend, and I think the larger the codebase the harder it will be for LLMs to maintain. More opportunity for hallucination, larger context windows needed, more tokens bought and spent for smaller and smaller code changes. I think the more code an LLM writes for an app, the worse that codebase becomes.
They are clearly only assistants for the moment: you can use them to do work, but only if you could have done that work yourself in the first place.
But as soon as you do minimal reviews and high-level corrections, applications turn out just fine.
Can there be bugs? Sure. That's the price of not reading or understanding every line. How many of these you tolerate should depend on the criticality of your software (reviewing, understanding, and testing everything 100%, the way you would if you had written it yourself, will kill most if not all of the speed you gained).
But I never got the impression of unmaintainability or unfixable bugs.
Actually, it's the other way around: a really good cleanup pass, architectural change, or bugfix is seldom more than a few prompts and 2 hours away, provided your overall base is decent and you actually gave a fuck from the start.
However the fear has to arise in the first place, to raise the alert.
solid_fuel•9h ago
Wow! I already knew from previous research shared here that hallucinations are a fundamental problem for LLMs and likely to be unfixable, just like prompt injection, but I didn't realize the hallucination rates were so bad!
Everyone has been acting like the best models only hallucinate in edge cases, but even the best performing one mentioned here - GLM-5.2 - has a hallucination rate of 28% when it doesn't "know" the answer to something.
That said, I think the title on the blog, "Bigger models are not the way", is probably more fitting and touches on what should be even bigger news. If bigger models and bigger training sets have already stopped producing proportional returns, then it seems likely we are already near the top of the S-curve. That's huge news, considering the valuation of companies like OpenAI and xAI is largely based on the (absurd) idea of ever-increasing scaling from these models.
oshrimpton•4h ago