> Why are we talking about “graduate and PhD-level intelligence” in these systems if they can’t find and verify relevant links — even directly after a search?
This is one of my pet peeves, and recently OpenAI's models seem to have become very militant in how they stand by and push their obviously hallucinated sources. I'm talking about hallucinated answers; when pressed to cite sources, they also hallucinate URLs that never existed; when repeatedly prompted to verify, they stick to their clearly wrong output; and ultimately they fall back to claiming they were right but the URL somehow changed, even though it never existed in the first place.
In order to start talking about PhD-level intelligence, at the very least these LLMs must support PhD-level context-seeking and information verification. It is not enough to output a wall of text that reads quite fluently. You must stick to verifiable facts.
No. You only need to check for sources, and then verify these sources exist and they support the claims.
It's the very definition of "fact".
In some cases, all you need to do is check if a URL that was cited does exist.
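A minimal sketch of that check, using only the Python standard library (the cited URL below is a hypothetical placeholder); it only confirms the link resolves, not that it supports the claim:

```python
# Minimal check: does each cited URL actually resolve? This catches
# fabricated links, but not real links that fail to support the claim.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def url_exists(url, timeout=10.0):
    req = Request(url, method="HEAD", headers={"User-Agent": "cite-check/0.1"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except HTTPError as e:
        return e.code == 405   # some servers reject HEAD; inconclusive, not proof of a fake link
    except URLError:
        return False

# Hypothetical cited link, just to show the shape of the check.
for url in ["https://example.com/paper-that-may-not-exist"]:
    print(("OK       " if url_exists(url) else "MISSING  ") + url)
```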
I can't write a software program, give the source to the greengrocer and expect him to be able to say anything about its quality. Just like I can't really say much about vegetables.
No. If you prompt it for a response and then ask it to cite sources, and it outputs broken links that never existed, then it clearly failed to deliver correct output.
If the purpose is to accurately cite sources, how is it even possible to hallucinate them? Seems like folks are expecting way too much from these tools. They are not intelligent. Useful, perhaps.
It's a token producer based on trained weights; it doesn't use any sources.
Even if it were "fixed" so that it only generates URLs that exist, it's still incorrect because it did not use any sources so those URLs are not sources.
"correct" for you is "truth that corresponds to the real world"
They are two very different things. The llm's output is, very much, correct. Because it was never meant to mean anything other than similarity of probability distributions.
It's not what you wanted, but that doesn't make it incorrect. You're just under a wrong assumption about what you were asking for. You were asking for something that looks like it could be true. Even if you ask it to not hallucinate, you're just asking it to make it look like it is not hallucinating. Meanwhile you thought you were asking for the actual, real, answer to your question.
Person A: I believe X.
Person B: Do you have a source for that?
A: Yes, it was shown by blah blah in the paper yada yada.
B: I don't think that study exists. Share a link?
A: [posts a URL]
B: That's not a real paper. The URL doesn't even work!
A: Works on my machine.
---
I've seen that kind of exchange so many times online. Know what I haven't seen very often? Person A saying, "You're right, I made up that article. Let me look again for a real one, and I might change my opinion depending on what it says."
For exactly the same reason the author markets his tool as a research assistant:
> It also models an approach that is less chatbot, and more research assistant in a way that is appropriate for student researchers, who can use it to aid research while coming to their own conclusions.
The o3 finding matches my own experience: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#o3...
Both o3 and Claude 4 have a crucial new ability: they can run tools such as their search tool as part of their "reasoning" phase. I genuinely think this is one of the most exciting new advances in LLMs in the last six months.
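Schematically, the pattern is something like this (a rough sketch of the general loop, not any vendor's actual API; `model_step` and `run_search` are hypothetical stand-ins for the model and the search tool):

```python
# Rough sketch: the model can interleave search-tool calls with its own
# reasoning steps instead of searching once up front.
def reasoning_loop(question, model_step, run_search, max_steps=5):
    context = [question]
    for _ in range(max_steps):
        action = model_step(context)           # model reasons over everything gathered so far
        if action.get("type") == "search":
            results = run_search(action["query"])
            context.append(results)            # results feed the next reasoning step
        else:
            return action.get("answer")        # model decided it has enough to answer
    return None                                # gave up within the step budget
```

The point is only that search results land inside the loop, before the final answer, rather than being bolted on after the model has already committed to one.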
I've noticed that o3 is the one that lies with the most conviction (compared to Gemini Pro and Claude Sonnet). It is the hardest to convince that it is wrong, and it will invent excuses and complex explanations for its lies, almost to a Trump level of lying and deception.
But it is also the one that provides the most interesting insights, the one that looks at what others don't see.
There might be some kind of deep truth in this correlation. Or it might be me having a hallucination...
I note that Gemini 2.5 has one of the lowest confabulation/hallucination rates according to this benchmark [1], so I am surprised by the results in the blog.
Also, I have found that link hallucination and output quality improve when you restrict searches to, for example, only PubMed sources, and have it put the source link directly into the text (as opposed to Gemini Deep Research's usual method of citation).
One reason, I think, is that unrestricted search will fetch the paper, the related blog posts, and the press releases, and weigh them as equal (and independent!) sources of a fact, when we know that nuance is lost in the latter. Restricting the search may also mean it spends more test-time compute on the quality sources, not the press releases.
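The restriction can also be enforced outside the model; a rough sketch of that filtering step (the result shape and the PubMed-only allowlist are assumptions for illustration):

```python
# Sketch of the filtering step: drop any search result whose domain is not
# on an allowlist of primary sources before the model ever sees it.
# The result shape (dicts with a "url" key) and the PubMed-only allowlist
# are assumptions for illustration.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"pubmed.ncbi.nlm.nih.gov"}

def filter_sources(results):
    kept = []
    for r in results:
        host = urlparse(r["url"]).netloc.lower()
        if host in ALLOWED_DOMAINS:
            kept.append(r)
    return kept
```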