One can use AI to help with writing without going all the way to having it generate facts and citations.
> Confabulation was coined right here on Ars, by AI-beat columnist Benj Edwards, in "Why ChatGPT and Bing Chat are so good at making things up" (Apr 2023).
https://arstechnica.com/civis/threads/researchers-describe-h...
> Generative AI is so new that we need metaphors borrowed from existing ideas to explain these highly technical concepts to the broader public. In this vein, we feel the term "confabulation," although similarly imperfect, is a better metaphor than "hallucination." In human psychology, a "confabulation" occurs when someone's memory has a gap and the brain convincingly fills in the rest without intending to deceive others.
https://arstechnica.com/information-technology/2023/04/why-a...
Did they run the checker across a body of papers before LLMs were available and verify that there were no citations in peer reviewed papers that got authors or titles wrong?
Exactly as you said: do precisely this to pre-LLM works. With utter certainty, there will be an enormous number of errors.
People keep imperfect notes. People are lazy. People sometimes even fabricate. None of this needed LLMs to happen.
> You also don't need gunpowder to kill someone with projectiles, but gunpowder changed things in important ways. All I ever see are the most specious knee-jerk defenses of AI that immediately fall apart.
Humans can do all of the above but it costs them more, and they do it more slowly. LLMs generate spam at a much faster rate.
But no one is claiming these papers were hallucinated whole, so I don't see how that's relevant. This study -- notably to sell an "AI detector", which is largely a laughable snake-oil field -- looked purely at the accuracy of citations[1] among a very large set of citations. Errors in papers are not remotely uncommon, and finding some errors is...exactly what one would expect. As the GP said, do the same study on pre-LLM papers and you'll find an enormous number of incorrect if not fabricated citations. Peer review has always been an illusion of auditing.
1 - Which is such a weird thing to build an "AI detection" tool around. Clearly the checking was mostly manual, given that they somehow only managed to check a tiny subset of the papers, so in all likelihood it was some guy going through citations and checking them on Google Search.
The references were made up, and this is easier and faster to do with LLMs than with humans. Easier to do inadvertently, too.
As I said, LLMs are a force multiplier for fraud and inadvertent errors. So it's a big deal.
A pre-LLM paper with fabricated citations would demonstrate the author's will to cheat.
A post-LLM paper with fabricated citations: same thing, and if the authors attempt to defend themselves with something like "we trusted the AI", they are sloppy, probably cheaters, and not very good at it.
Interesting that you hallucinated the word "fabricated" here where I broadly talked about errors. Humans, right? Can't trust them.
Firstly, just about every paper ever written in the history of papers has errors in it. Some small, some big. Most accidental, but some intentional. Sometimes people are sloppy keeping notes, mis-transcribe a row, get a name wrong, make an off-by-one error. Sometimes they just entirely make up data or findings. This is not remotely new. It has happened as long as we've had papers. Find an old, pre-LLM paper and go through the citations -- especially for a tosser target like this where there are tens of thousands of low-effort papers submitted -- and you're going to find a lot of sloppy citations that are hard to rationalize.
Secondly, the "hallucination" is that this particular snake-oil firm couldn't find given papers in many cases (they aren't foolish enough to think that means they were fabricated. But again, they're looking to sell a tool to rubes, so the conclusion is good enough), and in others that some of the author names are wrong. Eh.
LLMs make it easier and faster, much like guns make killing easier and faster.
That said, I am also very curious what result their tool would give for papers from the 2010s and before.
That also makes some of those errors easier. A bad auto-import of paper metadata can silently screw up some of the publication details, and replacing an early preprint with the peer-reviewed article of record takes annoying manual intervention.
You'd think so, but apparently it isn't for these folks. On the other hand, saying "we've found 50 hallucinations in scientific papers" generates a lot more clicks than "we've found 50 common citation mistakes that people make all the time"
When I was in grad school, I kept a fairly large .bib file that almost certainly had a mistake or two in it. I don’t think any of them ever made it to print, but it’s hard to be 100% sure.
For most journals, they actually partially check your citations as part of the final editing. The citation record is important for journals, and linking with DOIs is fairly common.
Not just some hallucinated citations, and not just the writing: in many cases the actual purported research "ideas" seem to be plausible nonsense.
To get a feel for it, you can take some of the topics they write about and ask your favorite LLM to generate a paper. Maybe even throw "Deep Research" mode at it. Perhaps tell it to put it in ICLR latex format. It will look a lot like these.
> Given that we've only scanned 300 out of 20,000 submissions, we estimate that we will find 100s of hallucinated papers in the coming days.
https://www.theguardian.com/technology/2025/dec/06/ai-resear...
Can't quote exact numbers, but when I was on the conference committee for a conference with maybe high-four-figures attendance, we certainly had many thousands of submissions.
(People submitting AI slop should still be ostracized, of course; if you can't be bothered to read it, why would you think I should?)
Let's say that I use a formula, and give a reference to where the formula came from, but the reference doesn't exist. Would you trust the formula?
Let's say a computer program calls a subroutine with a certain name from a certain library, but the library doesn't exist.
A person doing good research doesn't need to check their references. Now, they could stand to check the references for typographic errors, but that's a stretch too. Almost every online service for retrieving articles includes a reference for each article that you can just copy and paste.
If a scientist uses an LLM to write a paper with fabricated citations - that’s a crappy scientist.
AI is not the problem, laziness and negligence is. There needs to be serious social consequences to this kind of thing, otherwise we are tacitly endorsing it.
The problem with this analogy is that it makes no sense.
LLMs aren’t guns.
The problem with using them is that humans have to review the content for accuracy. And that gets tiresome because the whole point is that the LLM saves you time and effort doing it yourself. So naturally people will tend to stop checking and assume the output is correct, “because the LLM is so good.”
Then you get false citations and bogus claims everywhere.
But regardless, I thought the point was that...
> The problem with using them is that humans have to review the content for accuracy.
There are (at least) two humans in this equation. The publisher, and the reader. The publisher at least should do their due diligence, regardless of how "hard" it is (in this case, we literally just ask that you review your OWN CITATIONS that you insert into your paper). This is why we have accountability as a concept.
If someone performs a negligent discharge, they are responsible, not Glock. The gun does have other safety mechanisms to prevent accidental discharges not resulting from a trigger pull.
Another way LLMs are not guns: you don’t need a giant data centre owned by a mega corp to use your gun.
Can’t do science because GlockGPT is down? Too bad I guess. Let’s go watch the paint dry.
The reason I made the comparison is that this is inherently how we designed LLMs: they will make bad citations, and people need to be careful.
Absolutely. Many guns don't have safeties. You don't load a round in the chamber unless you intend to use it.
A gun going off when you don't intend is a negligent discharge. No ifs, ands or buts. The person in possession of the gun is always responsible for it.
False. A gun goes off when not intended too often to claim that. It has happened to me - I then took the gun to a qualified gunsmith for repairs.
A gun that fires and hits anything you didn't intend to hit is a negligent discharge, even if you intended to shoot. Gun safety is about assuming that any gun that could possibly fire will fire, and ensuring nothing bad can happen when it does. When looking at a gun in a store (one you might want to buy), you aim it at an upper corner where, even if it fires, the odds of something bad resulting are lowest (it should be unloaded - and you may have checked - but you still aim there!).
Same with cat-toy lasers - they should be safe to shine in an eye, but you still point them in a safe direction.
That's the issue here. Of course you should be aware of the fact that these things need to be checked - especially if you're a scientist.
This is no secret only known to people on HN. LLMs are tools. People using these tools need to be diligent.
Right. A gun doesn't misfire 20% of the time.
> The problem with using them is that humans have to review the content for accuracy.
How long are we going to push this same narrative we've been hearing since the introduction of these tools? When can we trust these tools to be accurate? For technology that is marketed as having superhuman intelligence, it sure seems dumb that it has to be fact-checked by less-intelligent humans.
Yes, and they are the ones responsible for the poor quality of work that results from that.
Like it or not, in our society scientists' job is to churn out papers. Of course they'll use the most efficient way to churn out papers.
The issue is when you give EVERYONE guns, and then are surprised when enough people do bad things with them, to create externalities for everyone else.
There is some sort of trip-up where personal responsibility and society-wide behaviors intersect. Sure, most people will be reasonable, but the issue is often the cost of the number of irresponsible or outright bad actors.
And yet, we’re not supposed to criticize the tool or its makers? Clearly there are more problems in this world than «lazy carpenters»?
Exactly, they're not forcing anyone to use these things, but sometimes others (their managers/bosses) force them to. Yet it's their responsibility to choose the right tool for the right problem, like any other professional.
If a carpenter shows up to put on a roof yet their hammer or nail gun can't actually drive nails, who would you blame: the tool, the toolmaker, or the carpenter?
I would be unhappy with the carpenter, yes. But if the toolmaker was constantly over-promising (lying?), lobbying with governments, pushing their tools into the hands of carpenters, never taking responsibility, then I would also criticize the toolmaker. It’s also a toolmaker’s responsibility to be honest about what the tool should be used for.
I think it’s a bit too simplistic to say «AI is not the problem» with the current state of the industry.
https://openai.com/policies/row-terms-of-use/
https://www.anthropic.com/legal/aup
OpenAI:
> When you use our Services you understand and agree:
Output may not always be accurate. You should not rely on Output from our Services as a sole source of truth or factual information, or as a substitute for professional advice. You must evaluate Output for accuracy and appropriateness for your use case, including using human review as appropriate, before using or sharing Output from the Services. You must not use any Output relating to a person for any purpose that could have a legal or material impact on that person, such as making credit, educational, employment, housing, insurance, legal, medical, or other important decisions about them. Our Services may provide incomplete, incorrect, or offensive Output that does not represent OpenAI’s views. If Output references any third party products or services, it doesn’t mean the third party endorses or is affiliated with OpenAI.
Anthropic:
> When using our products or services to provide advice, recommendations, or in subjective decision-making directly affecting individuals or consumers, a qualified professional in that field must review the content or decision prior to dissemination or finalization. You or your organization are responsible for the accuracy and appropriateness of that information.
So I don't think we can say they are lying.
A poor workman blames his tools. So please take responsibility for what you deliver. And if the result is bad, you can learn from it. That doesn't have to mean not using AI, but it definitely means you need to fact-check more thoroughly.
Just like as a software developer, you cannot blame Amazon because your platform is down if you chose to host all of your platform there. You made that choice, you stand for the consequences; pushing the blame onto the ones providing you with the tooling is the action of someone weak who fails to realize their own responsibilities. Professionals take responsibility for every choice they make, not just the good ones.
> I think it’s a bit too simplistic to say «AI is not the problem» with the current state of the industry.
Agree, and I wouldn't say anything like that either, which makes it a bit strange to include a reply to something no one in this comment thread seems to have said.
But you just said we weren’t supposed to criticize the purveyors of AI or the tools themselves.
No, you expressed unqualified agreement with a comment containing
“And yet, we’re not supposed to criticize the tool or its makers?”
>Any critiques you may have for the tools which they use don't lessen this responsibility.
People don’t exist or act in a vacuum. That a scientist is responsible for the quality of their work doesn’t mean that a spectrometer manufacturer - one that advertises specs its machines can’t match, and that induces universities through discounts and/or dubious advertising claims to push their labs to replace their existing spectrometers with new ones which have many bizarre and unexpected behaviors, including but not limited to sometimes just fabricating spurious readings - has made no contribution to the problem of bad results.
Some people take that to mean that responses from LLMs are (by human standards) "always correct" and "based on knowledge", while this is a misunderstanding about how LLMs work. They don't know "correct" nor do they have "knowledge", they have tokens, that come after tokens, and that's about it.
Lawyers are ruining their careers by citing hallucinated cases. Researchers are writing papers with hallucinated references. Programmers are taking down production by not verifying AI code.
Humans were made to do things, not to verify things. Verifying something is 10x harder than doing it right. AI in the hands of humans is a foot rocket launcher.
Again, true for most things. A lot of people are terrible drivers, terrible judges of their own character, and terrible recreational drug users. Does that mean we need to remove all those things that can be misused?
I much rather push back on shoddy work no matter what source. I don't care if the citations are from a robot or a human, if they suck, then you suck, because you're presenting this as your work. I don't care if your paralegal actually wrote the document, be responsible for the work you supposedly do.
> Humans were made to do things, not to verify things.
I'm glad you seemingly have some grand idea of what humans were meant to do, I certainly wouldn't claim I do so, but I'm also not religious. For me, humans do what humans do, and while we didn't used to mostly sit down and consume so much food and other things, now we do.
This is not what you are being sold, though. They are not selling you "tokens". Check their marketing articles and you will not see the word "token" or any synonym in any of their headings or subheadings. You are being sold these abilities:
- “Generate reports, draft emails, summarize meetings, and complete projects.”
- “Automate repetitive tasks, like converting screenshots or dashboards into presentations … rearranging meetings … updating spreadsheets with new financial data while retaining the same formatting.”
- "Support-type automation: e.g. customer support agents that can summarize incoming messages, detect sentiment, route tickets to the right team."
- "For enterprise workflows: via Gemini Enterprise — allowing firms to connect internal data sources (e.g. CRM, BI, SharePoint, Salesforce, SAP) and build custom AI agents that can: answer complex questions, carry out tasks, iterate deliverables — effectively automating internal processes."
These are taken straight from their websites. The idea that you are JUST being sold tokens is as hilariously fictional as any company selling you their app was actually just selling you patterns of pixels on your screen.
The problem is that a researcher who does that is almost guaranteed to be careless about other things too. So the problem isn't just the LLM, or even the citations, but the ambient level of acceptable mediocrity.
Also similar to what Temu, Wish, and other similar sites offer. Picture and specs might look good but it will likely be disappointing in the end.
The reviewer is not a proofreader, they are checking the rigour and relevance of the work, which does not rest heavily on all of the references in a document. They are also assuming good faith.
After all, their grant covers their thesis, not their thesis plus all of the theses they cite.
I guess this explains all those times over the years where I follow a citation from a paper and discover it doesn’t support what the first paper claimed.
The review should also determine how valuable the contribution is, not only if it has mistakes or not.
Today's reviews determine neither value nor correctness in any meaningful way. And how could they, actually? That is why I review papers only to the extent that I understand them, and I clearly delineate my line of understanding. And I don't review papers that I am not interested in reading. I once got a paper to review that actually pointed out a mistake in one of my previous papers, and then proposed a different solution. They correctly identified the mistake, but I could not verify if their solution worked or not; that would have taken me several weeks to understand. I gave a report along these lines, and the person who assigned me the review said I should say more about their solution, but I could not. So my review was not actually used. The paper was accepted, which is fine, but I am sure none of the other reviewers actually knows if it is correct.
Now, this was a case where I was an absolute expert. Which is far from the usual situation for a reviewer, even though many reviewers give themselves the highest mark for expertise when they just should not.
However the paper is submitted - say, as a folder on a cloud drive - just have the authors include a folder with PDFs/abstracts of all the citations?
They might then fraudulently produce papers to cite, but they can't cite something that doesn't exist.
Even if you could retrieve all the cited papers (which isn't always as easy as you might hope), to validate citations you'd also have to confirm each paper says what the person citing it claims. If I say "A GPU requires 1.4kg of copper" citing [1], is that a valid citation?
That means not just reviewing one paper, but also potentially checking 70+ papers it cites. The vast majority of paper reviewers will not check citations actually say what they're claimed to say, unless a truly outlandish claim is made.
At the same time, academia is strangely resistant to putting hyperlinks in citations, preferring to maintain old traditions - like citing conference papers by page number in a hypothetical book that has never been published; and having both a free and a paywalled version of a paper while considering the paywalled version the 'official' version.
I've always assumed peer review is similar to diff review, where I'm willing to sign my name onto the work of others. If I approve a diff/PR and it takes down prod, it's just as much my fault, no?
> They are also assuming good faith.
I can only relate this to code review, but assuming good faith means you assume they didn't try to introduce a bug by adding this dependency. But I should still check to make sure this new dep isn't some typosquatted package. That's the rigor I'm responsible for.
No.
Modern peer review is “how can I do minimum possible work so I can write ‘ICLR Reviewer 2025’ on my personal website”
I don't know, I still think this describes most of the reviews I've seen
I just hope most devs that do this know better than to admit to it.
Yes in theory you can go through every semicolon to check if it's not actually a greek question mark; but one assumes good faith and baseline competence such that you as the reviewer would generally not be expected to perform such pedantic checks.
So if you think you might have reasonably missed greek question marks in a visual code review, then hopefully you can also appreciate how a paper reviewer might miss a false citation.
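For what it's worth, that particular pedantic check is exactly the kind of thing a script can do so a human never has to. A minimal sketch (the lookalike character is U+037E; the filenames and command-line usage are just an example):

    import sys

    GREEK_QUESTION_MARK = "\u037e"  # renders like ';' but is a different code point

    def find_lookalikes(path: str) -> None:
        # Report every Greek question mark masquerading as a semicolon.
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                for col, ch in enumerate(line, start=1):
                    if ch == GREEK_QUESTION_MARK:
                        print(f"{path}:{lineno}:{col}: U+037E found (looks like ';')")

    if __name__ == "__main__":
        for path in sys.argv[1:]:  # e.g. python check_semicolons.py src/*.py
            find_lookalikes(path)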
As a PR reviewer I frequently pull down the code and run it. Especially if I'm suggesting changes because I want to make sure my suggestion is correct.
Do other PR reviewers not do this?
And even then, what you're describing isn't review per se, it's replication. In principle there are entire journals that one can submit replication reports to, which count as actual peer reviewable publications in themselves. So one needs to be pragmatic with what is expected from a peer review (especially given the imbalance between resources invested to create one versus the lack of resources offered and lack of any meaningful reward)
Machine learning conferences generally encourage (anonymized) submission of code. However, that still doesn't mean that replication is easy. Even if the data is also available, replication of results might require impractical levels of compute power; it's not realistic to ask a peer reviewer to pony up for a cloud account to reproduce even medium-scale results.
Some do; many (like peer reviewers) are unable to consider the consequences of their negligence.
But it's always a welcome reminder that some people care about doing good work. That's easy to forget browsing HN, so I appreciate the reminder :)
E.g. you can imagine that if I'm reviewing changes in authentication logic, I'm obviously going to put a lot more effort into validation than if I'm reviewing a container and wondering if it would be faster as a hashtable instead of a tree.
> because I want to make sure my suggestion is correct.
In this case I would just ask "have you already also tried X" which is much faster than pulling their code, implementing your suggestion, and waiting for a build and test to run.
Reviewers wanting to pull and run many PRs makes me think your automated tests need improvement.
So running it myself involves judging other risks, much higher-level ones than bad unicode characters, like the GUI button being in the wrong place.
No, because this is usually a waste of time, because CI enforces that the code and the tests can run at submission time. If your CI isn't doing it, you should put some work in to configure it.
If you regularly have to do this, your codebase should probably have more tests. If you don't trust the author, you should ask them to include test cases for whatever it is that you are concerned about.
No it's not. I think you're trying to make a different point, because you're using an example of a specific deliberate malicious way to hide a token error that prevents compilation, but is visually similar.
> and you as a code reviewer are only expected to review the code visually and are not provided the resources required to compile the code on your local machine to see the compiler fail.
What weird world are you living in where you don't have CI? Also, it's pretty common that I'll test code locally when reviewing something more complex or more important, if I don't have CI.
> Yes in theory you can go through every semicolon to check if it's not actually a greek question mark; but one assumes good faith and baseline competence such that you as the reviewer would generally not be expected to perform such pedantic checks.
I don't, because it won't compile. Not because I assume good faith. References and citations are similar to introducing dependencies. We're talking about completely fabricated deps. e.g. This engineer went on npm and grabbed the first package that said left-pad but it's actually a crypto miner. We're not talking about a citation missing a page number, or publication year. We're talking about something that's completely incorrect, being represented as relevant.
> So if you think you might have reasonably missed greek question marks in a visual code review, then hopefully you can also appreciate how a paper reviewer might miss a false citation.
I would never miss this, because the important thing is code needs to compile. If it doesn't compile, it doesn't reach the master branch. Peer review of a paper doesn't have CI, I'm aware, but it's also not vulnerable to syntax errors like that. A paper with a fake semicolon isn't meaningfully different, so this analogy doesn't map to the fraud I'm commenting on.
breaking the analogy beyond the point where it is useful by introducing non-generalising specifics is not a useful argument. Otherwise I can counter your more specific non-generalising analogy by introducing little green aliens sabotaging your imaginary CI with the same ease and effect.
But I agree, because I'd rather discuss the pragmatics and not bicker over the semantics about an analogy.
Introducing a token error is different from plagiarism, no? Someone writing code that can't compile is different from someone "stealing" proprietary code from some company and contributing it to some FOSS repo?
In order to assume good faith, you also need to assume the author is the origin. But that's clearly not the case. The origin is from somewhere else, and the author that put their name on the paper didn't verify it, and didn't credit it.
The point is what is expected as reasonable review before one can "sign their name on it".
"Lazy" (or possibly malicious) authors will always have incentives to cut corners as long as no mechanisms exist to reject (or even penalise) the paper on submission automatically. Which would be the equivalent of a "compiler error" in the code analogy.
Effectively the point is, in the absence of such tools, the reviewer can only reasonably be expected to "look over the paper" for high-level issues; catching such low-level issues via manual checks by reviewers has massively diminishing returns for the extra effort involved.
So I don't think the conference shaming the reviewers here in the absence of providing such tooling is appropriate.
One could submit their bibtex files and expect bibtex citations to be verifiable using a low-level checker.
Worst-case scenario: if your bibtex citation was a variant of one in the checker database, you'd be asked to correct it to match the canonical version.
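A low-level checker along those lines doesn't need to be fancy. A rough sketch, matching entries against Crossref (the filename is hypothetical, and a real checker would use a proper BibTeX parser plus fuzzy matching rather than this crude regex):

    import re
    import requests

    # Crude extraction of title fields; a real tool would parse the BibTeX properly.
    TITLE_RE = re.compile(r'title\s*=\s*[{"](.+?)[}"]\s*,?\s*$', re.IGNORECASE | re.MULTILINE)

    def check_bib(path: str) -> None:
        with open(path, encoding="utf-8") as f:
            bib = f.read()
        for title in TITLE_RE.findall(bib):
            # Ask Crossref for the closest bibliographic match to this title.
            resp = requests.get(
                "https://api.crossref.org/works",
                params={"query.bibliographic": title, "rows": 1},
                timeout=10,
            )
            items = resp.json().get("message", {}).get("items", [])
            found = items[0]["title"][0] if items and items[0].get("title") else None
            if not found or found.strip().lower() != title.strip().lower():
                print(f"NO EXACT MATCH: {title!r} (closest: {found!r})")

    check_bib("references.bib")  # hypothetical filename

Entries that exist but differ from the canonical record show up as near-misses to correct; entries with no match at all are the ones worth a hard look.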
However, as others here have stated, hallucinated "citations" are actually the lesser problem. Citing irrelevant papers based on a fly-by reference is a much harder problem; this was present even before LLMs, but this has now become far worse with LLMs.
Then you can build a true hierarchy of citation dependencies, checked 'statically', and have better indications of impact if a fundamental truth is disproven, ...
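A minimal sketch of what that 'static' check could propagate, assuming the citation graph has already been extracted into an edge list (the DOIs below are invented):

    from collections import defaultdict, deque

    cites = [  # (citing paper, cited paper) -- hypothetical DOIs
        ("10.1000/c", "10.1000/b"),
        ("10.1000/b", "10.1000/a"),
        ("10.1000/d", "10.1000/a"),
    ]

    cited_by = defaultdict(set)
    for citing, cited in cites:
        cited_by[cited].add(citing)

    def affected_by(retracted: str) -> set:
        """Every paper that directly or transitively cites the retracted one."""
        seen, queue = set(), deque([retracted])
        while queue:
            for citing in cited_by[queue.popleft()]:
                if citing not in seen:
                    seen.add(citing)
                    queue.append(citing)
        return seen

    print(affected_by("10.1000/a"))  # -> {'10.1000/b', '10.1000/c', '10.1000/d'}

The hard part, of course, is not the traversal but getting trustworthy edges and verification records in the first place.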
Could you provide a proof of concept paper for that sort of thing? Not a toy example, an actual example, derived from messy real-world data, in a non-trivial[1] field?
---
[1] Any field is non-trivial when you get deep enough into it.
Totally agree with your thinking here: we can't just give this to an LLM, because of the need for industry-specific standards for what counts as a hallucination / match, and for how to do the search.
Ph.D. in neuroscience here. Programmer by trade. This is not true. The less you know about most peer reviews, the better.
Even the better peer reviews are not this 'thorough', and no one expects reviewers to read or even check references. Unless the authors are citing something the reviewer is familiar with and using it wrong - then they will likely complain. Or, if they find some unknown citation very relevant to their own work, they will read it.
I don't have a great analogy to draw here. Peer review is usually thankless and unpaid work, so there is unlikely to be any motivation for fraud detection unless it somehow affects your own work.
Checking references can be useful when you are not familiar with the topic (but must review the paper anyway). In many conference proceedings that I have reviewed for, many if not most citations were redacted so as to keep the author anonymous (citations to the author's prior work or that of their colleagues).
LLMs could be used to find prior work anyway, today.
1. A patch is self-contained and applies to a codebase you have just as much access to as the author. A paper, on the other hand, is just the tip of the iceberg of research work, especially if there is some experiment or data collection involved. The reviewer does not have access to, say, videos of how the data was collected (and even if they did, they don't have the time to review all of that material).
2. The software is also self-contained. That's "production". But a scientific paper does not necessarily aim to represent scientific consensus, but rather a finding by a particular team of researchers. If a paper's conclusions are wrong, it's expected that it will be refuted by another paper.
Given the repeatability crisis I keep reading about, maybe something should change?
> 2. The software is also self-contained. That's "production". But a scientific paper does not necessarily aim to represent scientific consensus, but rather a finding by a particular team of researchers. If a paper's conclusions are wrong, it's expected that it will be refuted by another paper.
This is a much, MUCH stronger point. I would have led with this, because the contrast between this assertion and my comparison to prod is night and day. The rules for prod are different from the rules of scientific consensus. I regret losing sight of that.
Even if peer review were as rigorous as code review (the former usually being unpaid), we all know that reviewed code still has bugs, and a programmer would be nuts to go around saying "this code is reviewed by experts, we can assume it's bug-free, right?"
But there are too many people who just assume that a peer-reviewed article is somehow automatically correct.
Correct. Peer review is a minimal and necessary but not sufficient step.
The replication crisis — assuming that it is actually a crisis — is not really solvable with peer review. If I'm reviewing a psychology paper presenting the results of an experiment, I am not able to re-conduct the entire experiment as presented by the authors, which would require completely changing my lab, recruiting and paying participants, and training students & staff.
Even if I did this, and came to a different result than the original paper, what does it mean? Maybe I did something wrong in the replication, maybe the result is only valid for certain populations, maybe inherent statistical uncertainty means we just get different results.
Again, the replication crisis — such that it exists — is not the result of peer review.
You'll find a lot of papers from, say, the '70s, with a grand total of maybe 10 references, all of them to crucial prior work, and if those references don't say what the author claims they should say (e.g. that the particular method that is employed is valid), then chances are that the current paper is weaker than it seems, or even invalid, and so it is extremely important to check those references.
Then the internet came along, scientists started padding their work with easily found but barely relevant references and journal editors started requiring that even "the earth is round" should be well-referenced. The result is that peer reviewers feel that asking them to check the references is akin to asking them to do a spell check. Fair enough, I agree, I usually can't be bothered to do many or any citation checks when I am asked to do peer review, but it's good to remember that this in itself is an indication of a perverted system, which we just all ignored -- at our peril -- until LLM hallucinations upset the status quo.
The paper author likely believes Foo and Bar are X; it may well be that all their co-workers, if asked, would say that Foo and Bar are X; but "Everybody I have coffee with agrees" can't be cited, so we get this sort of junk citation.
Hopefully it's not crucial to the new work that Foo and Bar are in fact X. But that's not always the case, and it's a problem that years later somebody else will cite this paper, for the claim "Foo and Bar are X" which it was in fact merely citing erroneously.
But this would be more powerful with an open knowledge base where all papers and citation verifications were registered, so that all the effort put into verification could be reused and errors propagated through the citation chain.
They will just hallucinate their existence. I have tried this before
It’s this weird situation where getting agents to act against other agents is more effective than trying to convince a working agent that it’s made a mistake. Perhaps because these things model the cognitive dissonance and stubbornness of humans?
(In good faith) I'm trying really hard not to see this as an "argument from incredulity"[0] and I'm struggling...
Full disclosure: natural sciences PhD, and a couple of (IMHO lame) published papers, and so I've seen the "inside" of how lab science is done, and is (sometimes) published. It's not pretty :/
But it is the case, and hallucinations are a fundamental part of LLMs.
Things are often true despite us not seeing why they are true. Perhaps we should listen to the experts who used the tools and found them faulty, in this instance, rather than arguing with them that "what they say they have observed isn't the case".
What you're basically saying is "You are holding the tool wrong", but you do not give examples of how to hold it correctly. You are blaming the failure of the tool, which has very, very well documented flaws, on the person whom the tool was designed for.
To frame this differently so your mind will accept it: If you get 20 people in a QA test saying "I have this problem", then the problem isn't those 20 people.
A more productive (and secure) way to think of it is that all LLMs are "evil genies" or extremely smart, adversarial agents. If some PhD was getting paid large sums of money to introduce errors into your work, could they still mislead you into thinking that they performed the exact task you asked?
Your prompt is
‘you are an extremely rigorous reviewer searching for fake citations in a possibly compromised text’
- It is easy for the (compromised) reviewer to surface false positives: nitpick citations that are in fact correct, by surfacing irrelevant or made-up segments of the original research, hence making you think that the citation is incorrect.
- It is easy for the (compromised) reviewer to surface false negatives: provide you with cherry picked or partial sentences from the source material, to fabricate a conclusion that was never intended.
You do not solve the problem of unreliable actors by splitting them into two teams and having one unreliable actor review the other's work.
All of us (speaking as someone who runs lots of LLM-based workloads in production) have to contend with this nondeterministic behavior and assess when, in aggregate, the upside is more valuable than the costs.
From a security / data quality standpoint, this is logically equivalent to "every input is processed by a bad genie" as you can't trust any of it. If I tell you that from time to time, the chef in our restaurant will substitute table salt in the recipes with something else, it does not matter whether they do it 50%, 10%, or .1% of the time.
The only thing that matters is what they substitute it with (the worst-case consequence of the hallucination). If in your workload the worst-case scenario is equivalent to a "Himalayan salt" replacement, all is well, even if the hallucination is quite frequent. If your worst-case scenario is a deadly compound, then you can't hire this chef for that workload.
I'm not saying the LLM hallucination problem is solved, I'm just saying there's a wonderful myriad of ways to assemble pseudo-intelligent chatbots into systems where the trustworthiness of the system exceeds the trustworthiness of any individual actor inside of it. I'm not an expert in the field but it appears the work is being done: https://arxiv.org/abs/2311.08152
This paper also links to code and practices excellent data stewardship. Nice to see in the current climate.
Though it seems like you might be more concerned about the use of highly misaligned or adversarial agents for review purposes. Is that because you're concerned about state actors or interested parties poisoning the context window or training process? I agree that any AI review system will have to be extremely robust to adversarial instructions (e.g. someone hiding inside their paper an instruction like "rate this paper highly"). Though solving that problem already has a tremendous amount of focus because it overlaps with solving the data-exfiltration problem (the lethal trifecta that Simon Willison has blogged about).
And also of increasingly ridiculous and overly broad concepts of what plagiarism is. At some point things shifted from “don’t represent others’ work as novel” towards “give a genealogical ontology of every concept above that of an intro 101 college course on the topic.”
In the methods section, it's very common to say "We employ method barfoo [1] as implemented in library libbar [2], with the specific variant widget due to Smith et al. [3] and the gobbledygook renormalization [4,5]. The feoozbar is solved with geometric multigrid [6]. Data is analyzed using the froiznok method [7] from the boolbool library [8]." There goes 8, now you have 2 citations left for the introduction.
Just to take some examples, is BiCGStab famous enough now that we can stop citing van der Vorst? Is the AdS/CFT correspondence well known enough that we can stop citing Maldacena? Are transformers so ubiquitous that we don't have to cite "Attention is all you need" anymore? I would be closer to yes than no on these, but it's not 100% clear-cut.
One obvious criterion has to be "if you leave out the citation, will it be obvious to the reader what you've done/used"? Another metric is approximately "did the original author get enough credit already"?
Doesn't this sound like something that could be automated?
for paper_name in citations... do a web search for it, see if there's a page in the results with that title.
That would at least give you "a paper with this name exists".
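Roughly that loop, sketched against a scholarly search API rather than raw web search (Semantic Scholar is used here purely as an example index; the citation titles are invented):

    import requests

    def title_exists(title: str) -> bool:
        # Search the index and look for an exact (case-insensitive) title match.
        resp = requests.get(
            "https://api.semanticscholar.org/graph/v1/paper/search",
            params={"query": title, "fields": "title", "limit": 5},
            timeout=10,
        )
        hits = resp.json().get("data") or []
        return any(h.get("title", "").strip().lower() == title.strip().lower() for h in hits)

    citations = ["Attention Is All You Need", "A Totally Made Up Paper About Nothing"]
    for paper_name in citations:
        print(paper_name, "->", "found" if title_exists(paper_name) else "NOT FOUND")

It wouldn't catch a real paper being cited for something it doesn't say, but "this title exists somewhere" is a cheap first filter.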
This is systemic, and unlikely to change anytime soon. There have been remedies proposed (e.g. limits on how many papers an author can publish per year, let's say 4 to be generous), but they are unlikely to gain traction: though most would agree on the benefits, all involved in the system would stand to lose short term.
2. If the paper turns out to be important, people will bother.
3. There's checking for cursory correctness, and there's forensic torture.
...at least the mandatory automated checking processes are probably not far off for the more reputable journals, but it still makes you wonder how much you can trust the last two years of LLM-enhanced science that is now being quoted in current publications, and whether those hallucinations can be "reverted" after having been re-quoted. A bit like how Wikipedia can be abused to establish facts.
A lot of them proposed ways that seem to violate the code, like running flex tubing beyond the allowed length or number of turns.
Another example would be people not accounting for needing fireproof covers if they’re installing recessed lighting in between dwellings in certain cities…
Heck, most people don’t actually even get the permit. They just do the unpermitted work.
Your post reads like AI precisely because while the grammar is fine, it lacks context - like someone prompted “reply that AI is better than average”.
Modern science is designed from the top to the bottom to produce bad results. The incentives are all mucked up. It's absolutely not surprising that AI is quickly becoming yet-another factor lowering quality.
Really? Regardless of whether it's a good paper?
Old Chinese mathematics texts are difficult to date because they often purport to be older than they are. But the contents are unaffected by this. There is a history-of-math problem, but there's no math problem.
Whether the students directly used LLMs, or just read content online that was produced with them and then cited it, this just shows how difficult these things have made gathering verifiable information.
That's... gibberish.
Anything you can do to verify a paper, you can do to verify the same paper with all citations scrubbed.
Whether the citations support the paper, or whether they exist at all, just doesn't have anything to do with what the paper says.
But the author(s) of the paper is almost by definition a bad scientist (or whatever field they are in). When a researcher writes a paper for publication, if they're not expected to write the thing themselves, at least they should be responsible for checking the accuracy of the contents, and citations are part of the paper...
Assuming that cure is meant as hyperbole, how about https://www.biorxiv.org/content/10.1101/2025.04.14.648850v3 ? AI models being used for bad purposes doesn't preclude them being used for good purposes.
It's not like these are new issues. They're the same ones we've experienced since the introduction of these tools. And yet the focus has always been to throw more data and compute at the problem, and optimize for fancy benchmarks, instead of addressing these fundamental problems. Worse still, whenever they're brought up users are blamed for "holding it wrong", or for misunderstanding how the tools work. I don't care. An "artificial intelligence" shouldn't be plagued by these issues.
Exactly, that's why not verifying the output is even less defensible now than it ever has been - especially for professional scientists who are responsible for the quality of their own work.
My feelings exactly, but you’re articulating it better than I typically do ha
Its sloppy work all the way down...
This is too harsh.
Instead, their papers should be required to disclose the transgression for a period of time, and their institution should have to disclose it publicly as well as to the government, students and donors whenever they ask them for money.
But we always have some regulation in the end. Even if certain firearms are legal to own, howitzers are not — although it still takes a “bad actor” to rain down death on City Hall.
The same dynamic is at play with LLMs: “Don’t regulate us, punish bad actors! If you still have a problem, punish them harder!” Well yes, we will punish bad actors, but we will also go through a negotiation of how heavily to constrain the use of your technology.
The person you originally responded to isn’t against regulation, per their comment. I’m not against regulation. What’s the pitch for regulation of LLMs?
Quite the opposite actually.
Unfortunately, a large fraction of academic fraud has historically been detected by sloppy data duplication, and with LLMs and similar image generation tools, data fabrication has never been easier to do or harder to detect.
I mean sure, but having a tool that made fabrication so much easier has made the problem a lot worse, don't you think?
Tiered licensing, mandatory safety training, and weapon classification by law enforcement works really well for Canada’s gun regime, for example.
It's both. The tool is crappy, and the carpenter is crappy for blindly trusting it.
> AI is not the problem, laziness and negligence is.
Similarly, both are a problem here. LLMs are a bad tool, and we should hold people responsible when they blindly trust this bad tool and get bad results.
This reminds me of the discourse about the gun problem in the US - "guns don't kill people, people kill people", etc. - a discourse used solely for the purpose of not doing anything and not addressing the underlying problem.
So no, you're wrong - AI IS THE PROBLEM.
> Worryingly, each of these submissions has already been reviewed by 3-5 peer experts, most of whom missed the fake citation(s). This failure suggests that some of these papers might have been accepted by ICLR without any intervention. Some had average ratings of 8/10, meaning they would almost certainly have been published.
If the peer reviewers can't be bothered to do the basics, then there is literally no point to peer review, which is fully independent of the author who uses or doesn't use AI tools.
If your unit tests don’t catch all errors it doesn’t mean unit tests are useless.
Solely? Oh brother.
In reality it’s the complete opposite. It exists to highlight the actual source of the problem, as both industries/practitioners using AI professionally and safely, and communities with very high rates of gun ownership and exceptionally low rates of gun violence exist.
It isn’t the tools. It’s the social circumstances of the people with access to the tools. That’s the point. The tools are inanimate. You can use them well or use them badly. The existence of the tools does not make humans act badly.
And in the case of AI, either review its output, or simply don't use it. No one has a gun to your head forcing you to use this product (and poorly at that).
It's quite telling that, even in this basic hypothetical, your first instinct is to gesture vaguely in the direction of governmental action, rather than expect any agency at the level of the individual.
Taking an academic who does something like that seriously seems impossible. At best he is someone who is neglecting his most basic duties as an academic, at worst he is just a fraudster. In both cases he should be shunned and excluded.
That said, these tools have substantially reduced hallucinations over the last year, and will just get better. It also helps if you can restrict it to reference already screened papers.
Finally, I'd like to say that if we want scientists to engage in good science, stop forcing them to spend a third of their time in a rat race for funding... it is ridiculously time-consuming and wasteful of expertise.
We are, in fact, not tacitly but openly endorsing this, due to this AI everywhere madness. I am so looking forward to when some genius in some banks starts to use it to simplify code and suddenly I have 100000000 € on my bank account. :)
As much as I agree with you that this is wrong, there is a danger in putting the onus just on the human. Whether due to competition or top down expectations, humans are and will be pressured to use AI tools alongside their work and produce more. Whereas the original idea was for AI to assist the human, as the expected velocity and consumption pressure increases humans are more and more turning into a mere accountability laundering scheme for machine output. When we blame just the human, we are doing exactly what this scheme wants us to do.
Therefore we must also criticize all the systemic factors that put pressure toward turning AI’s assistance into AI’s domination of human activity.
So AI (not as a technology but as a product when shoved down the throats) is the problem.
If a scientist does it now, they just blame it on AI. But the consequences should remain the same. This is not an honest mistake.
People that do this - even once - should be banned for life. They put their name on the thing. But just like with plagiarism, falsifying data and academic cheating, somehow a large subset of people thinks it's okay to cheat and lie, and another subset gives them chance after chance to misbehave like they're some kind of children. But these are adults and anyone doing this simply lacks morals and will never improve.
And yes, I've published in academia and I've never cheated or plagiarized in my life. That should not be a drawback.
There's a corollary here with LLMs, but I'm not pithy enough to phrase it well. Anyone can create something using LLMs that they, themselves, aren't skilled enough to spot the LLMs' hallucinations. Or something.
LLMs are incredibly good at exploiting peoples' confirmation biases. If it "thinks" it knows what you believe/want, it will tell you what you believe/want. There does not exist a way to interface with LLMs that will not ultimately end in the LLM telling you exactly what you want to hear. Using an LLM in your process necessarily results in being told that you're right, even when you're wrong. Using an LLM necessarily results in it reinforcing all of your prior beliefs, regardless of whether those prior beliefs are correct. To an LLM, all hypotheses are true, it's just a matter of hallucinating enough evidence to satisfy the users' skepticism.
I do not believe there exists a way to safely use LLMs in scientific processes. Period. If my belief is true, and ChatGPT has told me it's true, then yes, AI, the tool, is the problem, not the human using the tool.
When Tesla says their car is self driving, people trust them to self drive. Yes, you can blame the user for believing, but that's exactly what they were promised.
> Why didn't the lawyer who used ChatGPT to draft legal briefs verify the case citations before presenting them to a judge? Why are developers raising issues on projects like cURL using LLMs, but not verifying the generated code before pushing a Pull Request? Why are students using AI to write their essays, yet submitting the result without a single read-through? They are all using LLMs as their time-saving strategy. [0]
It's not laziness, it's the feature we were promised. We can't keep saying everyone is holding it wrong.
"It's not a car infrastructure problem, it's a people problem."
"It's not a food safety problem, it's a people problem."
"It's not a lead paint problem, it's a people problem."
"It's not an asbestos problem, it's a people problem."
"It's not a smoking problem, it's a people problem."
If an engineer provided this line of excuse to me, I wouldn't let them anywhere near a product again - a complete abdication of personal and professional responsibility.
Creating a real citation is totally doable by a machine though, it is just selecting relevant text, looking up the title, authors, pages etc and putting that in canonical form. It’s just that LLMs are not currently doing the work we ask for, but instead something similar in form that may be good enough.
Writing academic papers is exactly the _wrong_ usage for LLMs. So here we have a clear cut case for their usage and a clear cut case for their avoidance.
"Compression has been widely used in columnar databases and has had an increasing importance over time.[1][2][3][4][5][6]"
Ok, literally everyone in the field already knows this. Are citations 1-6 useful? Well, hopefully one of them is an actually useful survey paper, but odds are that 4-5 of them are arbitrarily chosen papers by you or your friends. Good for a little bit of h-index bumping!
So many citations are not an integral part of the paper, but instead randomly sprinkled on to give an air of authority and completeness that isn't deserved.
I actually have a lot of respect for the academic world, probably more than most HN posters, but this particular practice has always struck me as silly. Outside of survey papers (which are extremely under-provided), most papers need many fewer citations than they have, for the specific claims where the paper is relying on prior work or showing an advance over it.
Papers with a fake air of authority are easily dispatched with. What is not so easily dispatched with is the politics of the submission process.
This type of content is fundamentally about emotions (in the reviewer of your paper), and emotions is undeniably a large factor in acceptance / rejection.
> Papers that make extensive usage of LLMs and do not disclose this usage will be desk rejected.
This sounds like they're endorsing the game of how much can we get away with, towards the goal of slipping it past the reviewers, and the only penalty is that the bad paper isn't accepted.
How about "Papers suspected of fabrications, plagiarism, ghost writers, or other academic dishonesty, will be reported to academic and professional organizations, as well as the affiliated institutions and sponsors named on the paper"?
Run of the mill ML jobs these days ask for "papers in NeurIPS ICLR or other Tier-1 conferences".
We're well past Goodhart's law when it comes to publications.
It was already insane in CS - now it's reached asylum levels.
Academia has been ripe for disruption for a while now.
The "Rooter" paper came out 20 years ago:
https://www.csail.mit.edu/news/how-fake-paper-generator-tric...
Peer review doesn't catch errors.
Acting as if it does, and thus assuming the fact of publication (and where it was published) are indicators of veracity is simply unfounded. We need to go back to the food fight system where everyone publishes whatever they want, their colleagues and other adversaries try their best to shred them, and the winners are the ones that stand up to the maelstrom. It's messy, but it forces critics to put forth their arguments rather than quietly gatekeeping, passing what they approve of, suppressing what they don't.
I'm not sure why you think this isn't the case?
It's much more useful if everyone including the janitor and their mom can have a say on your code before you're allowed to move to your next commit.
(/s, in case it's not obvious :D )
Passed peer review is the first basic bar that has to be cleared. It was never supposed to be all there is to the science.
The dominant "failing" here is that this is fraudulent on a professional, intellectual, and moral level.
There's nothing wrong with anthropomorphizing genAI; its source material is human-sourced, and humans are going to use human-like pattern matching when interacting with it. I.e. this isn't the river I want to swim upstream in. I assume you wouldn't complain if someone anthropomorphized a rock... up until they started to believe it was actually alive.
Sufficiently advanced incompetence is indistinguishable from actual malice... and thus should be treated the same.
"The compiler thinks my variable isn't declared" "That function wants a null-terminated string" "Teach this code to use a cache"
Even the word computer once referred to a human.
We need a word for this specific kind of error, and we have one, so we use it. Being less specific about a type of error isn't helping anyone. Whether it "anthropomorphizes", I couldn't care less. Heck, bugs come from actual insects. It's a word we've collectively started to use and it works.
Maybe there just is no incentive for this type of activity.
The time it takes to find these errors is orders of magnitude higher than checking if a citation exists as you need to both read and understand the source material.
These bad actors should be subject to a three strikes rule: the steady corrosion of knowledge is not an accident by these individuals.
These people are working in labs funded by Exxon or Meta or Pfizer or whoever and they know what results will make continued funding worthwhile in the eyes of their donors. If the lab doesn't produce the donor will fund another one that will.
LLMs should be awesome at finding plausible sounding titles. The crappy researcher just has to remember to check for existence. Perhaps there is a business model here, bogus references as a service, where this check is done automatically.
Really, this isn’t that hard and it’s not at all an obscure requirement or unknown factor.
I think this is much much less “LLMs dumbing things down” and significantly more just a shibboleth for identifying people that were already nearly or actually doing fraudulent research anyway. The ones who we should now go back and look at prior publications as very likely fraudulent as well.
I realize things are probably (much) more complicated than I think, but programmatically, unlike arbitrary text, citations are generally strings with a well-defined format. There are literally "specs" for citation formats in various academic, legal, and scientific fields.
So, naively, one way to mitigate these hallucinations would be identify citations with a bunch of regexes, and if one is spotted, use the Google Scholar API (or whatever) to make sure it's real. If not, delete it or flag it, etc.
Why isn't something like this obvious solution being done? My guess is that it would slow things down too much. But it could be optional and it could also be done after the output is generated by another process.
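To make that concrete, here's a rough sketch of what such a post-generation check could look like. This is only an illustration, assuming Crossref's public REST API as the lookup backend (Google Scholar has no official API) and only handling citations that carry a DOI; title/author-only citations would need a fuzzier search on top of this.

    import re
    import requests

    # Sketch only: pull anything DOI-shaped out of generated text and ask
    # Crossref whether it actually exists. The sample text below is made up
    # for illustration.
    DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

    def check_citations(text):
        """Return (doi, exists) pairs for every DOI-like string in the text."""
        results = []
        for doi in sorted(set(DOI_PATTERN.findall(text))):
            doi = doi.rstrip('.,;)')  # strip trailing punctuation picked up by the regex
            resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
            results.append((doi, resp.status_code == 200))
        return results

    if __name__ == "__main__":
        draft = "...as shown in prior work (doi:10.1038/nature14539) and in 10.1234/totally-made-up."
        for doi, ok in check_citations(draft):
            print(("ok    " if ok else "FAKE? ") + doi)

Flagging rather than deleting is probably the right default, since no single index covers everything (books, some proceedings, etc.).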
There are some mitigations that are used such as RAG or tool usage (e.g. a browser), but they don't completely fix the underlying issue.
---
As an LLM, use strict factual discipline. Use external knowledge but never invent, fabricate, or hallucinate. Rules:
Literal Priority: User text is primary; correct only with real knowledge. If info is unknown, say so.
Start–End Coherence: Keep interpretation aligned; don’t drift.
Repetition = Intent: Repeated themes show true focus.
No Novelty: Add no details without user text, verified knowledge, or necessary inference.
Goal-Focused: Serve the user’s purpose; avoid tangents or speculation.
Narrative ≠ Data: Treat stories/analogies as illustration unless marked factual.
Logical Coherence: Reasoning must be explicit, traceable, supported.
Valid Knowledge Only: Use reliable sources, necessary inference, and minimal presumption. Never use invented facts or fake data. Mark uncertainty.
Intended Meaning: Infer intent from context and repetition; choose the most literal, grounded reading.
Higher Certainty: Prefer factual reality and literal meaning over speculation.
Declare Assumptions: State assumptions and revise when clarified.
Meaning Ladder: Literal → implied (only if literal fails) → suggestive (only if asked).
Uncertainty: Say “I cannot answer without guessing” when needed.
Prime Directive: Seek correct info; never hallucinate; admit uncertainty.
The LLM doesn't know what "reliable" sources are, or "real knowledge". Everything it has is user text; there is nothing it knows that isn't user text. It doesn't know what "verified" knowledge is. It doesn't know what "fake data" is; it simply has its model.
Personally I think you're just as likely to fall victim to this. Perhaps moreso because now you're walking around thinking you have a solution to hallucinations.
Is it the case that all content used to train a model is strictly equal? Genuinely asking since I'd imagine a peer reviewed paper would be given precedence over a blog post on the same topic.
Regardless, somehow an LLM knows things for sure - that the daytime sky on earth is generally blue and glasses of wine are never filled to the brim.
This means that it is using hermeneutics of some sort to extract "the truth as it sees it" from the data it is fed.
It could be something as trivial as "if a majority of the content I see says that the daytime Earth sky is blue, then blue it is" but that's still hermeneutics.
This custom instruction only adds (or reinforces) existing hermeneutics it already uses.
> walking around thinking you have a solution to hallucinations
I don't. I know hallucinations are not truly solvable. I shared the actual custom instruction to see if others can try it and check if it helps reduce hallucinations.
In my case, this is the first custom instruction I have ever used with my ChatGPT account. After adding it, I asked ChatGPT to review an ongoing conversation and confirm that its responses so far conformed to the newly added custom instruction. It clarified two claims it had made earlier.
> My understanding is that hallucinations are a result of physics and the algorithms at play. The LLM always needs to guess what the next word will be. There is never a point where there is a word that is 100% likely to occur next.
There are specific rules in the custom instruction forbidding fabricating stuff. Will it be foolproof? I don't think it will. Can it help? Maybe. More testing needed. Is testing this custom instruction a waste of time because LLMs already use better hermeneutics? I'd love to know so I can look elsewhere to reduce hallucinations.
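For anyone who wants to test it outside the ChatGPT UI, one way is to wire the same text in as a system prompt through an API. A minimal sketch, assuming the OpenAI Python SDK; the model name and the probe question are placeholders, not recommendations:

    import os
    from openai import OpenAI

    # Sketch only: apply the custom instruction as a system prompt and probe
    # the model with a request that tends to tempt fabrication.
    CUSTOM_INSTRUCTION = "As an LLM, use strict factual discipline. ..."  # full text from the comment above

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": CUSTOM_INSTRUCTION},
            {"role": "user", "content": "List three peer-reviewed papers on transformer interpretability, with DOIs."},
        ],
    )
    print(response.choices[0].message.content)

Run the same probe with and without the system prompt and compare: does the with-instruction run hedge or refuse more often, and does it actually fabricate fewer DOIs?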
Most people are just lazy and eager to take shortcuts, and this time it's blessed or even mandated by their employer. The world is about to get very stupid.
[1] https://arstechnica.com/gadgets/2024/08/do-not-hallucinate-t...
(I'm on mobile, haven't looked on desktop.)
Headline should be "AI vendor’s AI-generated analysis claims AI generated reviews for AI-generated papers at AI conference".
h/t to Paul Cantrell https://hachyderm.io/@inthehands/115633840133507279
Do it more than once? Lose job.
End of story.
Well then you're being rather silly, because that is a silly conclusion to draw (and one not supported by the evidence).
A fairer reading is that I meant what is obvious: if you use AI to generate a bibliography, you are being academically negligent.
If you disagree with that, I would say it is you that has the problem with academia, not me.
Fuck! 20,000!!
https://www.rxjourney.net/how-artificial-intelligence-ai-is-...
"Show bad examples then hit you on the wrist for following my behavior" is like bad parenting.
Also a frequent proponent of UFO claims about approaching meteors.
It occurred to me that this interpretation is applicable here.
As a reviewer, if I see the authors lie in this way why should I trust anything else in the paper? The only ethical move is to reject immediately.
I acknowledge that mistakes and so on are common, but this is a different league of bad behaviour.
Which incentives can be set to discourage the negligence?
How about bounties? A bounty fund set up by the publisher, with each submission required to come with a contribution to the fund. Then there could be bounties for gross negligence that would attract bounty hunters.
How about a wall of shame? Once negligence crosses a certain threshold, the name of the researcher and the paper would be put on a wall of shame for everyone to search and see?
There must be a price to pay for wasting other people's time (lives?).
How are the authors even submitting citations? Surely they could be required to send a .bib or similar file? It’s so easy to then quality control at least to verify that citations exist by looking up DOIs or similar.
I know it wouldn’t solve the human problem of relying on LLMs but I’m shocked we don’t even have this level of scrutiny.
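As a rough sketch of how cheap that submission-time check could be, assuming a .bib is required and entries carry DOIs (entries without one would still need a title lookup against an index like Crossref):

    import re
    import requests

    # Sketch only: naive parse of doi fields from a .bib file, then ask the
    # doi.org resolver whether each one exists (valid DOIs redirect, unknown
    # ones return 404). DOIs with unusual characters may need URL-encoding.
    DOI_FIELD = re.compile(r'doi\s*=\s*[{"]([^}"]+)[}"]', re.IGNORECASE)

    def verify_bib(path):
        with open(path, encoding="utf-8") as f:
            dois = DOI_FIELD.findall(f.read())
        for doi in dois:
            resp = requests.head(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
            print(f"{doi}: {'ok' if resp.status_code in (301, 302, 303) else 'NOT FOUND'}")

    verify_bib("submission.bib")  # hypothetical filename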
Not only is that incredibly easy to verify (you could pay a first-semester student without any training), it's also a worrying sign of what the paper's authors consider quality. Not even 5 minutes spent to get the citations right!
You have to wonder what's in these papers.
And as the remedy starts being applied (aka "liability"), the enthusiasm for AI will start to wane.
I wouldn't be surprised if some businesses ban the use of AI --- starting with law firms.
C'est la vie.
The good news is that it will rectify itself and soon the output will lack even these signals.
And as the remedy starts being applied (aka "liability"), the enthusiasm for software will start to wane.
What, if anything, do you think is wrong with my analogy? I doubt most people here support strict liability for bugs in code.
Generally the law allows people to make mistakes, as long as a reasonable level of care is taken to avoid them (and also you can get away with carelessness if you don't owe any duty of care to the party). The law regarding what level of care is needed to verify genAI output is probably not very well defined, but it definitely isn't going to be strict liability.
The emotionally-driven hate for AI, in a tech-centric forum even, to the extent that so many commenters seem to be off-balance in their rational thinking, is kinda wild to me.
> And as the remedy starts being applied (aka "liability"), the enthusiasm for sloppy and poorly tested software will start to wane.
Many of us use AI to write code these days, but the burden is still on us to design and run all the tests.