Over fifty new hallucinations in ICLR 2026 submissions

https://gptzero.me/news/iclr-2026/

507•puttycat•2mo ago

Comments

jqpabc123•2mo ago

The legal system has a word to describe AI "slop" --- it is called "negligence".

And as the remedy starts being applied (aka "liability"), the enthusiasm for AI will start to wane.

I wouldn't be surprised if some businesses ban the use of AI --- starting with law firms.

loloquwowndueo•2mo ago

I applaud your use of triple dashes to avoid automatic conversion to em dashes and being labeled an AI. Kudos!

ghaff•2mo ago

This is a particular meme that I really don't like. I've used em-dashes routinely for years. Do I need to stop using them because various people assume they're an AI flag?

TimedToasts•2mo ago

No, but you should be prepared to have people suspect you are using AI to create your responses.

C'est la vie.

The good news is that it will rectify itself and soon the output will lack even these signals.

ghaff•2mo ago

Well, I work for myself and people can either judge my work on its own merits or not. Don't care all that much.

ls612•2mo ago

The legal system has a word to describe software bugs --- it is called "negligence".

And as the remedy starts being applied (aka "liability"), the enthusiasm for software will start to wane.

What if anything do you think is wrong with my analogy? I doubt most people here support strict liability for bugs in code.

hnfong•2mo ago

I don't even think GP knows what negligence is.

Generally the law allows people to make mistakes, as long as a reasonable level of care is taken to avoid them (and also you can get away with carelessness if you don't owe any duty of care to the party). The law regarding what level of care is needed to verify genAI output is probably not very well defined, but it definitely isn't going to be strict liability.

The emotionally-driven hate for AI, in a tech-centric forum even, to the extent that so many commenters seem to be off-balance in their rational thinking, is kinda wild to me.

ls612•2mo ago

I don’t get it, tech people clearly have the most to gain from AI like Claude Code.

jqpabc123•2mo ago

Computer code is highly deterministic. This allows it to be tested fairly easily. Unfortunately, code productionn is not the only use-case for AI.

Most things in life are not as well defined --- a matter of judgment.

AI is being applied in lots of real world cases where judgment is required to interpret results. For example, "Does this patient have cancer". And it is fairly easy to show that AI's judgment can be highly suspect. There are often legal implications for poor judgment --- i.e. medical malpractice.

Maybe you can argue that this is a mis-application of AI --- and I don't necessarily disagree --- but the point is, once the legal system makes this abundantly clear, the practical business case for AI is going to be severely reduced if humans still have to vet the results in every case.

hnfong•2mo ago

Why do you think AI is inherently worse than humans in judging whether a patient has cancer, assuming they are given the same information as the human doctor? Is there some fundamental assumption that makes AI worse, or are you simply projecting your personal belief (trust) in human doctors? (Note that given the speed of progress of AI and that we're talking about what the law ought to be, not what it was in the past, the past performance of AI on cancer cases do not have much relevance unless a fundamental issue with AI is identified)

Note that whether a person has cancer is generally well-defined, although it may not be obvious at first. If you just let the patient go untreated, you'll know the answer quite definitely in a couple years.

senshan•2mo ago

Very good analogy indeed. With one modification it makes perfect sense:

> And as the remedy starts being applied (aka "liability"), the enthusiasm for sloppy and poorly tested software will start to wane.

Many of us use AI to write code these days, but the burden is still on us to design and run all the tests.

jqpabc123•2mo ago

What if anything do you think is wrong with my analogy?

I think what is clearly wrong with your analogy is assuming that AI applies mostly to software and code production. This is actually a minor use-case for AI.

Government and businesses of all types ---doctors, lawyers, airlines, delivery companies, etc. are attempting to apply AI to uses and situations that can't be tested in advance the same way "vibe" code can. And some of the adverse results have already been ruled on in court.

https://www.evidentlyai.com/blog/ai-failures-examples

watwut•2mo ago

Can we just call them "lies" and "fabrications" which is what they are? If I write the same, you will call them "made up citations" and "academic dishonesty".

One can use AI to help them write without going all the way to having it generate facts and citations.

sorokod•2mo ago

As long as the submissions are on behalf of humans we should. The humans should accept the consequences too.

jmount•2mo ago

That is a key point: they are fabrications, not hallucinations.

Barbing•2mo ago

Ars has often gone with “confabulation”:

>Confabulation was coined right here on Ars, by AI-beat columnist Benj Edwards, in Why ChatGPT and Bing Chat are so good at making things up (Apr 2023).

https://arstechnica.com/civis/threads/researchers-describe-h...

>Generative AI is so new that we need metaphors borrowed from existing ideas to explain these highly technical concepts to the broader public. In this vein, we feel the term "confabulation," although similarly imperfect, is a better metaphor than "hallucination." In human psychology, a "confabulation" occurs when someone's memory has a gap and the brain convincingly fills in the rest without intending to deceive others.

https://arstechnica.com/information-technology/2023/04/why-a...

jameshart•2mo ago

Is the baseline assumption of this work that an erroneous citation is LLM hallucinated?

Did they run the checker across a body of papers before LLMs were available and verify that there were no citations in peer reviewed papers that got authors or titles wrong?

tokai•2mo ago

Yeah that is what their tool does.

llm_nerd•2mo ago

People will commonly hold LLMs as unusable because they make mistakes. So do people. Books have errors. Papers have errors. People have flawed knowledge, often degraded through a conceptual game of telephone.

Exactly as you said, do precisely this to pre-LLM works. There will be an enormous number of errors with utter certainty.

People keep imperfect notes. People are lazy. People sometimes even fabricate. None of this needed LLMs to happen.

add-sub-mul-div•2mo ago

Quoting myself from just last night because this comes up every time and doesn't always need a new write-up.

> You also don't need gunpowder to kill someone with projectiles, but gunpowder changed things in important ways. All I ever see are the most specious knee-jerk defenses of AI that immediately fall apart.

the_af•2mo ago

LLM are a force multiplier of this kind of errors though. It's not easy to hallucinate papers out of whole cloth, but LLMs can easily and confidently do it, quote paragraphs that don't exist, and do it tirelessly and at a pace unmatched by humans.

Humans can do all of the above but it costs them more, and they do it more slowly. LLMs generate spam at a much faster rate.

llm_nerd•2mo ago

>It's not easy to hallucinate papers out of whole cloth, but LLMs can easily and confidently do it, quote paragraphs that don't exist, and do it tirelessly and at a pace unmatched by humans.

But no one is claiming these papers were hallucinated whole, so I don't see how that's relevant. This study -- notably to sell an "AI detector", which is largely a laughable snake-oil field -- looked purely at the accuracy of citations[1] among a very large set of citations. Errors in papers are not remotely uncommon, and finding some errors is...exactly what one would expect. As the GP said, do the same study on pre-LLM papers and you'll find an enormous number of incorrect if not fabricated citations. Peer review has always been an illusion of auditing.

1 - Which is such a weird thing to sell an "AI detection" tool. Clearly it was mostly manual given that they somehow only managed to check a tiny subset of the papers, so in all likelihood was some guy going through citations and checking them on Google Search.

the_af•2mo ago

I've zero interest in the AI tool, I'm discussing the broader problem.

The references were made up, and this is easier and faster to do with LLMs than with humans. Easier to do inadvertently, too.

As I said, LLMs are a force multiplier for fraud and inadvertent errors. So it's a big deal.

throwaway-0001•2mo ago

I think we should see a chart as % of “fabricated” references from past 20 years. We should see a huge increase after 2020-2021. Anyone has this chart data?

pmontra•2mo ago

Fabricated citations are not errors.

A pre LLM paper with fabricated citations would demonstrate will to cheat by the author.

A post LLM paper with fabricated citations: same thing and if the authors attempt to defend themselves with something like, we trusted the AI, they are sloppy, probably cheaters and not very good at it.

llm_nerd•2mo ago

>Fabricated citations are not errors.

Interesting that you hallucinated the word "fabricated" here where I broadly talked about errors. Humans, right? Can't trust them.

Firstly, just about every paper ever written in the history of papers has errors in it. Some small, some big. Most accidental, but some intentional. Sometimes people are sloppy keeping notes, transcribe a row, get a name wrong, do an offset by 1. Sometimes they just entirely make up data or findings. This is not remotely new. It has happened as long as we've had papers. Find an old, pre-LLM paper and go through the citations -- especially for a tosser target like this where there are tens of thousands of low effort papers submitted -- and you're going to find a lot of sloppy citations that are hard to rationalize.

Secondly, the "hallucination" is that this particular snake-oil firm couldn't find given papers in many cases (they aren't foolish enough to think that means they were fabricated. But again, they're looking to sell a tool to rubes, so the conclusion is good enough), and in others that some of the author names are wrong. Eh.

the_af•2mo ago

> Firstly, just about every paper ever written in the history of papers has errors in it

LLMs make it easier and faster, much like guns make killing easier and faster.

mapmeld•2mo ago

Further, if I use AI-written citations to back some claim or fact, what are the actual claims or facts based on? These started happening in law because someone writes the text and then wishes there was a source that was relevant and actually supportive of their claim. But if someone puts in the labor to check your real/extant sources, there's nothing backing it (e.g. MAHA report).

nkrisc•2mo ago

Under what circumstances would a human mistakenly cite a paper which does not exist? I’m having difficulty imagining how someone could mistakenly do that.

jameshart•2mo ago

The issue here is that many of the ‘hallucinations’ this article cites aren’t ’papers which do not exist’. They are incorrect author attributions, publication dates, or titles.

miniwark•2mo ago

They explain in the article what they consider a proper citation, an erroneous one and an hallucination, in the section "Defining Hallucitations". They also say than they have many false positives, mostly real papers who are not available online.

Thad said, i am also very curious of the result than their tool, would give to papers from the 2010's and before.

sigmoid10•2mo ago

If you look at their examples in the "Defining Hallucitations" section, I'd say those could be 100% human errors. Shortening authors' names, leaving out authors, misattributing authors, misspelling or misremembering the paper title (or having an old preprint-title, as titles do change) are all things that I would fully expect to happen to anyone in any field were things get ever got published. Modern tools have made the citation process more comfortable, but if you go back to the old days, you'd probably find those kinds of errors everywhere. If you look at the full list of "hallucinations" they claim to have discovered, the only ones I'd not immediately blame on human screwups are the ones where a title and the authors got zero matches for existing papers/people. If you really want to do this kind of analysis correctly, you'd have to match the claim of the text and verify it with the cited article. Because I think it would be even more dangerous if you can get claims accepted by simply quoting an existing paper correctly, while completely ignoring its content (which would have worked here).

Majromax•2mo ago

> Modern tools have made the citation process more comfortable,

That also makes some of those errors easier. A bad auto-import of paper metadata can silently screw up some of the publication details, and replacing an early preprint with the peer-reviewed article of record takes annoying manual intervention.

jameshart•2mo ago

I mean, if you’re able to take the citation, find the cited work, and definitively state ‘looks like they got the title wrong’ or ‘they attributed the paper to the wrong authors’, that doesn’t sound like what people usually mean when they say a ‘hallucinated’ citation. Work that is lazily or poorly cited but nonetheless attempts to cite real work is not the problem. Work which gives itself false authority by claiming to cite works that simply do not exist is the main concern surely?

sigmoid10•2mo ago

>Work which gives itself false authority by claiming to cite works that simply do not exist is the main concern surely?

You'd think so, but apparently it isn't for these folks. On the other hand, saying "we've found 50 hallucinations in scientific papers" generates a lot more clicks than "we've found 50 common citation mistakes that people make all the time"

mike_hearn•2mo ago

There are other issues. In January they claimed that a US health report contained "fabricated" and "AI generated" citations with the headline being a claim from a Cigna Group report. Their claim it's fabricated is based on nothing more than the URL now being a redirect of the type common in corporate website reorgs.

I did some checking and found the report does exist, but the citation is still not quite correct. Then I discovered someone is running some LLM based citation checker already, which already fact checked this claim and did a correct writeup that seems a lot better than what this GPTZero tool does.

https://checkplease.neocities.org/maha/html/17-loneliness-73...

The mistakes in the citation are the sort of mistake that could have been made by both a human or an AI, really. The visualization in the report is confusing and does contain the 73% number (rounded up), but it's unclear how to interpret the numbers because it's some sort of "vitality index" and not what you'd expect based on how it's introduced. At first glance I actually mis-interpreted it the same way the report does, so it's hard to view this is as clear evidence of AI misuse. Yet the GPTZero folks do make very strong claims based on nothing more than a URL scraper script.

_alternator_•2mo ago

Let me second this: a baseline analysis should include papers that were published or reviewed at least 3-4 years ago.

When I was in grad school, I kept a fairly large .bib file that almost certainly had a mistake or two in it. I don’t think any of them ever made it to print, but it’s hard to be 100% sure.

For most journals, they actually partially check your citations as part of the final editing. The citation record is important for journals, and linking with DOIs is fairly common.

currymj•2mo ago

the papers themselves are publicly available online too. Most of the ones I spot-checked give the extremely strong impression of AI generation.

not just some hallucinated citations, and not just the writing. in many cases the actual purported research "ideas" seem to be plausible nonsense.

To get a feel for it, you can take some of the topics they write about and ask your favorite LLM to generate a paper. Maybe even throw "Deep Research" mode at it. Perhaps tell it to put it in ICLR latex format. It will look a lot like these.

TaupeRanger•2mo ago

It's going to be even worse than 50:

> Given that we've only scanned 300 out of 20,000 submissions, we estimate that we will find 100s of hallucinated papers in the coming days.

shusaku•2mo ago

20,000 submissions to a single conference? That is nuts

analog31•2mo ago

This is an interesting article along those lines...

https://www.theguardian.com/technology/2025/dec/06/ai-resear...

ghaff•2mo ago

Doesn't seem especially out of the norm for a large conference. Call it 10,000 attendees which is large but not huge. Sure; not everyone attending puts in a session proposal. But others put multiple. And many submit but, if not accepted don't attend.

Can't quote exact numbers but when I was on the conference committee for a maybe high four figures attendance conference, we certainly had many thousands of submissions.

zipy124•2mo ago

When academics are graded based on number of papers this is the result.

adestefan•2mo ago

The problem isn't only papers it's that the world of academic computer science coalesced around conference submissions instead of journal submissions. This isn't new and was an issue 30 years ago when I was in grad school. It makes the work of conference organizes the little block holding up the entire system.

DonaldPShimoda•2mo ago

Makes me grateful I'm in an area of CS where the "big" conferences are like 500 attendees.

shusaku•2mo ago

Checking each citation one by one is quite critical in peer review, and of course checking a colleagues paper. I’ve never had to deal with AI slop, but you’ll definitely see something cited for the wrong reason. And just the other day during the final typesetting of a paper of mine I found the journal had messed up a citation (same journal / author but wrong work!)

stefan_•2mo ago

Is it quite critical? Peer review is not checking homework, it's about the novel contribution presented. Papers will frequently cite related notable experiments or introduce a problem that as a peer reviewer in the field I'm already well familiar with. These paragraphs generate many citations but are the least important part of a peer review.

(People submitting AI slop should still be ostracized of course, if you can't be bothered to read it, why would you think I should)

shusaku•2mo ago

Fair point. In my mind it is critical because mistakes are common and can only be fixed by a peer. But you are right that we should not miss the forest through the trees and get lost on small details.

mjd•2mo ago

I love that fake citation that adds George Costanza to the list of authors!

tomrod•2mo ago

How sloppy is someone that they don't check their references!

analog31•2mo ago

A reference is included in a paper if the paper uses information derived from the reference, or to acknowledges the reference as a prior source. If the reference is fake, then the derived information could very well be fake.

Let's say that I use a formula, and give a reference to where the formula came from, but the reference doesn't exist. Would you trust the formula?

Let's say a computer program calls a subroutine with a certain name from a certain library, but the library doesn't exist.

A person doing good research doesn't need to check their references. Now, they could stand to check the references for typographic errors, but that's a stretch too. Almost every online service for retrieving articles includes a reference for each article that you can just copy and paste.

theoldgreybeard•2mo ago

If a carpenter builds a crappy shelf “because” his power tools are not calibrated correctly - that’s a crappy carpenter, not a crappy tool.

If a scientist uses an LLM to write a paper with fabricated citations - that’s a crappy scientist.

AI is not the problem, laziness and negligence is. There needs to be serious social consequences to this kind of thing, otherwise we are tacitly endorsing it.

gdulli•2mo ago

That's like saying guns aren't the problem, the desire to shoot is the problem. Okay, sure, but wanting something like a metal detector requires us to focus on the more tangible aspect that is the gun.

baxtr•2mo ago

If I gave you a gun would you start shooting people just because you had one?

agentultra•2mo ago

If I gave you a gun without a safety could you be the one to blame when it goes off because you weren’t careful enough?

The problem with this analogy is that it makes no sense.

LLMs aren’t guns.

The problem with using them is that humans have to review the content for accuracy. And that gets tiresome because the whole point is that the LLM saves you time and effort doing it yourself. So naturally people will tend to stop checking and assume the output is correct, “because the LLM is so good.”

Then you get false citations and bogus claims everywhere.

sigbottle•2mo ago

Sorry, I'm not following the gun analogies at all

But regardless, I thought the point was that...

> The problem with using them is that humans have to review the content for accuracy.

There are (at least) two humans in this equation. The publisher, and the reader. The publisher at least should do their due diligence, regardless of how "hard" it is (in this case, we literally just ask that you review your OWN CITATIONS that you insert into your paper). This is why we have accountability as a concept.

oceansweep•2mo ago

Yes. That is absolutely the case. One of the Most popular handguns does not have a safety switch that must be toggled before firing. (Glock series handguns)

If someone performs a negligent discharge, they are responsible, not Glock. It does have other safety mechanisms to prevent accidental fires not resulting from a trigger pull.

agentultra•2mo ago

You seem to be getting hung up on the details of guns and missing the point that it’s a bad analogy.

Another way LLMs are not guns: you don’t need a giant data centre owned by a mega corp to use your gun.

Can’t do science because GlockGPT is down? Too bad I guess. Let’s go watch the paint dry.

The reason I made it is because this is inherently how we designed LLMs. They will make bad citations and people need to be careful.

zdragnar•2mo ago

> If I gave you a gun without a safety could you be the one to blame when it goes off because you weren’t careful enough?

Absolutely. Many guns don't have safties. You don't load a round in the chamber unless you intend on using it.

A gun going off when you don't intend is a negligent discharge. No ifs, ands or buts. The person in possession of the gun is always responsible for it.

bluGill•2mo ago

> A gun going off when you don't intend is a negligent discharg

false. A gun goes off when not intended too often to claim that. It has happned to me - I then took the gun to a qualified gunsmith for repairs.

A gun they fires and hits anything you didn't intend to is negligent discharge even if you intended to shoot. Gun saftey is about assuming a gun that could possible fire will and ensuring nothing bad can happen. When looking at gun in a store (that you might want to buy) you aim it at an upper corner where even if it fires the odds of something bad resulting is the least lively to happen (it should be unloaded - and you may have checked, but you still aim there!)

same with cat toy lazers - they should be safe to shine in an eye - but you still point in a safe direction.

baxtr•2mo ago

>“because the LLM is so good.”

That's the issue here. Of course you should be aware of the fact that these things need to be checked - especially if you're a scientist.

This is no secret only known to people on HN. LLMs are tools. People using these tools need to be diligent.

imiric•2mo ago

> LLMs aren’t guns.

Right. A gun doesn't misfire 20% of the time.

> The problem with using them is that humans have to review the content for accuracy.

How long are we going to push this same narrative we've been hearing since the introduction of these tools? When can we trust these tools to be accurate? For technology that is marketed as having superhuman intelligence, it sure seems dumb that it has to be fact-checked by less-intelligent humans.

komali2•2mo ago

Ok sure I'm down for this hypothetical. I will bring 50 random people in front of you, and you will hand all 50 of them loaded guns. Still feeling it?

bandofthehawk•2mo ago

Ever been to a shooting range? It's basically a bunch of random people with loaded guns.

komali2•2mo ago

That's not as random as letting me choose them! They had to be allowed onto the range, show ID, afford the gun, probably do a background check to get the gun unless they used a loophole (which usually requires some social capital).

I'm proposing the true proposal of many guns rights advocates: anyone might have a gun.

So let me choose the 50 and you give them guns! Why not?

hipshaker•2mo ago

If you look at gun violence in the U.S that is , speaking as a European, kind of what I see happening.

gdulli•2mo ago

That doesn't address my point at all but no, I'm not a violent or murderous person. And most people aren't. Many more people do, however, want to take shortcuts to get their work done with the least amount of effort possible.

SauntSolaire•2mo ago

> Many more people do, however, want to take shortcuts to get their work done with the least amount of effort possible.

Yes, and they are the ones responsible for the poor quality of work that results from that.

raincole•2mo ago

If the society rewarded me money and fame when I kill someone then I would. Why wouldn't I?

Like it or not, in our society scientists' job is to churn out papers. Of course they'll use the most efficient way to churn out papers.

intended•2mo ago

The issue with this argument, for anyone who comes after, is not when you give a gun to a SINGLE person, and then ask them "would you do a bad thing".

The issue is when you give EVERYONE guns, and then are surprised when enough people do bad things with them, to create externalities for everyone else.

There is some sort of trip up when personal responsibility, and society wide behaviors, intersect. Sure most people will be reasonable, but the issue is often the cost of the number of irresponsible or outright bad actors.

rcpt•2mo ago

Probably not but, empirically, there are a lot of short tempered people who would.

TomatoCo•2mo ago

To continue the carpenter analogy, the issue with LLMs is that the shelf looks great but is structurally unsound. That it looks good on surface inspection makes it harder to tell that the person making it had no idea what they're doing.

embedding-shape•2mo ago

Regardless, if a carpenter is not validating their work before selling it, it's the same as if a researcher doesn't validate their citations before publishing. Neither of them have any excuses, and one isn't harder to detect than the other. It's just straight up laziness regardless.

judofyr•2mo ago

I think this is a bit unfair. The carpenters are (1) living in world where there’s an extreme focus on delivering as quicklyas possible, (2) being presented with a tool which is promised by prominent figures to be amazing, and (3) the tool is given at a low cost due to being subsidized.

And yet, we’re not supposed to criticize the tool or its makers? Clearly there’s more problems in this world than «lazy carpenters»?

embedding-shape•2mo ago

> And yet, we’re not supposed to criticize the tool or its makers?

Exactly, they're not forcing anyone to use these things, but sometimes others (their managers/bosses) forced them to. Yet it's their responsibility for choosing the right tool for the right problem, like any other professional.

If a carpenter shows up to put a roof yet their hammer or nail-gun can't actually put in nails, who'd you blame; the tool, the toolmaker or the carpenter?

judofyr•2mo ago

> If a carpenter shows up to put a roof yet their hammer or nail-gun can't actually put in nails, who'd you blame; the tool, the toolmaker or the carpenter?

I would be unhappy with the carpenter, yes. But if the toolmaker was constantly over-promising (lying?), lobbying with governments, pushing their tools into the hands of carpenters, never taking responsibility, then I would also criticize the toolmaker. It’s also a toolmaker’s responsibility to be honest about what the tool should be used for.

I think it’s a bit too simplistic to say «AI is not the problem» with the current state of the industry.

jascha_eng•2mo ago

OpenAI and Anthropic at least are both pretty clear about the fact that you need to check the output:

https://openai.com/policies/row-terms-of-use/

https://www.anthropic.com/legal/aup

OpenAI:

> When you use our Services you understand and agree:

Output may not always be accurate. You should not rely on Output from our Services as a sole source of truth or factual information, or as a substitute for professional advice. You must evaluate Output for accuracy and appropriateness for your use case, including using human review as appropriate, before using or sharing Output from the Services. You must not use any Output relating to a person for any purpose that could have a legal or material impact on that person, such as making credit, educational, employment, housing, insurance, legal, medical, or other important decisions about them. Our Services may provide incomplete, incorrect, or offensive Output that does not represent OpenAI’s views. If Output references any third party products or services, it doesn’t mean the third party endorses or is affiliated with OpenAI.

Anthropic:

> When using our products or services to provide advice, recommendations, or in subjective decision-making directly affecting individuals or consumers, a qualified professional in that field must review the content or decision prior to dissemination or finalization. You or your organization are responsible for the accuracy and appropriateness of that information.

So I don't think we can say they are lying.

A poor workman blames his tools. So please take responsibility for what you deliver. And if the result is bad, you can learn from it. That doesn't have to mean not use AI but it definitely means that you need to fact check more thoroughly.

embedding-shape•2mo ago

If I hired a carpenter, he did a bad job, and he starts to blame the toolmaker because they lobby the government and over-promised what that hammer could do, I'd still put the blame on the carpenter. It's his tools, I couldn't give less of a damn why he got them, I trust him to be a professional, and if he falls for some scam or over-promised hammers, that means he did a bad job.

Just like as a software developer, you cannot blame Amazon because your platform is down, if you chose to host all of your platform there. You made that choice, you stand for the consequences, pushing the blame on the ones who are providing you with the tooling is the action of someone weak who fail to realize their own responsibilities. Professionals take responsibility for every choice they make, not just the good ones.

> I think it’s a bit too simplistic to say «AI is not the problem» with the current state of the industry.

Agree, and I wouldn't say anything like that either, which makes it a bit strange to include a reply to something no one in this comment thread seems to have said.

pertymcpert•2mo ago

That’s not what is happening with AI companies, and you damn well know it.

SauntSolaire•2mo ago

Yes, that's what it means to be a professional, you take responsibility for the quality of your work.

peppersghost93•2mo ago

It's a shame the slop generators don't ever have to take responsibility for the trash they've produced.

SauntSolaire•2mo ago

That's beside the point. While there may be many reasonable critiques of AI, none of them reduce the responsibilities of scientist.

peppersghost93•2mo ago

Yeah this is a prime example of what I'm talking about. AI's produce trash and it's everyone else's problem to deal with.

SauntSolaire•2mo ago

Yes, it's the scientists problem to deal with it - that's the choice they made when they decided to use AI for their work. Again, this is what responsibility means.

peppersghost93•2mo ago

This inspires me to make horrible products and shift the blame to the end user for the product being horrible in the first place. I can't take any blame for anything because I didn't force them to use it.

thfuran•2mo ago

>While there many reasonable critiques of AI

But you just said we weren’t supposed to criticize the purveyors of AI or the tools themselves.

SauntSolaire•2mo ago

No, I merely said that the scientist is the one responsible for the quality of their own work. Any critiques you may have for the tools which they use don't lessen this responsibility.

thfuran•2mo ago

>No, I merely said that the scientist is the one responsible for the quality of their own work.

No, you expressed unqualified agreement with a comment containing

“And yet, we’re not supposed to criticize the tool or its makers?”

>Any critiques you may have for the tools which they use don't lessen this responsibility.

People don’t exist or act in a vacuum. That a scientist is responsible for the quality of their work doesn’t mean that a spectrometer manufacture that advertises specs that their machines can’t match and induces universities through discounts and/or dubious advertising claims to push their labs to replace their existing spectrometers with new ones which have many bizarre and unexpected behaviors including but not limited to sometimes just fabricating spurious readings has made no contribution to the problem of bad results.

SauntSolaire•2mo ago

You can criticize the tool or its makers, but not as a means to lessen the responsibility of the professional using it (the rest of the quoted comment). I agree with the GP, it's not a valid excuse for the scientist's poor quality of work.

thfuran•2mo ago

I just substantially edited the comment you replied to.

SauntSolaire•2mo ago

The scientist has (at the very least) a basic responsibility to perform due diligence. We can argue back and forth over what constitutes appropriate due diligence, but, with regard to the scientist under discussion, I think we'd be better suited discussing what constitutes negligence.

adestefan•2mo ago

The entire thread is people missing this simple point.

bossyTeacher•2mo ago

Well, then what does this say of LLM engineers at literally any AI company in existence if they are delivering AI that is unreliable then? Surely, they must take responsibility for the quality of their work and not blame it on something else.

embedding-shape•2mo ago

I feel like what "unreliable" means, depends on well you understand LLMs. I use them in my professional work, and they're reliable in terms of I'm always getting tokens back from them, I don't think my local models have failed even once at doing just that. And this is the product that is being sold.

Some people take that to mean that responses from LLMs are (by human standards) "always correct" and "based on knowledge", while this is a misunderstanding about how LLMs work. They don't know "correct" nor do they have "knowledge", they have tokens, that come after tokens, and that's about it.

amrocha•2mo ago

it’s not “some people”, it’s practically everyone that doesn’t understand how these tools work, and even some people that do.

Lawyers are running their careers by citing hallucinated cases. Researchers are writing papers with hallucinated references. Programmers are taking down production by not verifying AI code.

Humans were made to do things, not to verify things. Verifying something is 10x harder than doing it right. AI in the hands of humans is a foot rocket launcher.

embedding-shape•2mo ago

> it’s not “some people”, it’s practically everyone that doesn’t understand how these tools work, and even some people that do.

Again, true for most things. A lot of people are terrible drivers, terrible judge of their own character, and terrible recreational drug users. Does that mean we need to remove all those things that can be misused?

I much rather push back on shoddy work no matter what source. I don't care if the citations are from a robot or a human, if they suck, then you suck, because you're presenting this as your work. I don't care if your paralegal actually wrote the document, be responsible for the work you supposedly do.

> Humans were made to do things, not to verify things.

I'm glad you seemingly have some grand idea of what humans were meant to do, I certainly wouldn't claim I do so, but I'm also not religious. For me, humans do what humans do, and while we didn't used to mostly sit down and consume so much food and other things, now we do.

amrocha•2mo ago

>A lot of people are terrible drivers, terrible judge of their own character, and terrible recreational drug users. Does that mean we need to remove all those things that can be misused?

Uhh, yes??? We have completely reshaped our cities so that cars can thrive in them at the expense of people. We have laws and exams and enforcement all to prevent cars from being driven by irresponsible people.

And most drugs are literally illegal! The ones that arent are highly regulated!

If your argument is that AI is like heroin then I agree, let’s ban it and arrest anyone making it.

pertymcpert•2mo ago

People need to be responsible for things they put their name on. End of story. No AI company claims their models are perfect and don’t hallucinate. But paper authors should at least verify every single character their submit.

amrocha•2mo ago

Yes, but they don’t. So clearly AI is a foot gun. What are doing about it?

bossyTeacher•2mo ago

>No AI company claims their models are perfect and don’t hallucinate

You can't have it both ways. Either AIs are worth billions BECAUSE they can run mostly unsupervised or they are not. This is exactly like the AI driving system in Autopilot, sold as autonomous but reality doesn't live up to it.

bossyTeacher•2mo ago

> they're reliable in terms of I'm always getting tokens back from them

This is not what you are being sold though. They are not selling you "tokens". Check their marketing articles and you will not see the word token or synonym on any of their headings or subheadings. You are being sold these abilities:

- “Generate reports, draft emails, summarize meetings, and complete projects.”

- “Automate repetitive tasks, like converting screenshots or dashboards into presentations … rearranging meetings … updating spreadsheets with new financial data while retaining the same formatting.”

- "Support-type automation: e.g. customer support agents that can summarize incoming messages, detect sentiment, route tickets to the right team."

- "For enterprise workflows: via Gemini Enterprise — allowing firms to connect internal data sources (e.g. CRM, BI, SharePoint, Salesforce, SAP) and build custom AI agents that can: answer complex questions, carry out tasks, iterate deliverables — effectively automating internal processes."

These are taken straight from their websites. The idea that you are JUST being sold tokens is as hilariously fictional as any company selling you their app was actually just selling you patterns of pixels on your screen.

concinds•2mo ago

I use those LLM "deep research" modes every now and then. They can be useful for some use cases. I'd never think to freaking paste it into a paper and submit it or publish it without checking; that boggles the mind.

The problem is that a researcher who does that is almost guaranteed to be careless about other things too. So the problem isn't just the LLM, or even the citations, but the ambient level of acceptable mediocrity.

k4rli•2mo ago

Very good analogy I'd say.

Also similar to what Temu, Wish, and other similar sites offer. Picture and specs might look good but it will likely be disappointing in the end.

CapitalistCartr•2mo ago

I'm an industrial electrician. A lot of poor electrical work is visible only to a fellow electrician, and sometimes only another industrial electrician. Bad technical work requires technical inspectors to criticize. Sometimes highly skilled ones.

andy99•2mo ago

I’ve reviewed a lot of papers, I don’t consider it the reviewers responsibility to manually verify all citations are real. If there was an unusual citation that was relied on heavily for the basis of the work, one would expect it to be checked. Things like broad prior work, you’d just assume it’s part of background.

The reviewer is not a proofreader, they are checking the rigour and relevance of the work, which does not rest heavily on all of the references in a document. They are also assuming good faith.

zdragnar•2mo ago

This is half the basis for the replication crisis, no? Shady papers come out and people cite them endlessly with no critical thought or verification.

After all, their grant covers their thesis, not their thesis plus all of the theses they cite.

Aurornis•2mo ago

> I don’t consider it the reviewers responsibility to manually verify all citations are real

I guess this explains all those times over the years where I follow a citation from a paper and discover it doesn’t support what the first paper claimed.

auggierose•2mo ago

In short, a review has no objective value, it is just an obstacle to be gamed.

amanaplanacanal•2mo ago

In theory, the review tries to determine if the conclusion reached actually follows from whatever data is provided. It assumes that everything is honest, it's just looking to see if there were mistakes made.

auggierose•2mo ago

Honest or not should not make a difference, after all, the submitting author may believe themselves everything is A-OK.

The review should also determine how valuable the contribution is, not only if it has mistakes or not.

Todays reviews determine neither value nor correctness in any meaningful way. And how could they, actually? That is why I review papers only to the extent that I understand them, and I clearly delineate my line of understanding. And I don't review papers that I am not interested in reading. I once got a paper to review that actually pointed out a mistake in one of my previous papers, and then proposed a different solution. They correctly identified the mistake, but I could not verify if their solution worked or not, that would have taken me several weeks to understand. I gave a report along these lines, and the person who gave me the review said I should say more about their solution, but I could not. So my review was not actually used. The paper was accepted, which is fine, but I am sure none of the other reviewers actually knows if it is correct.

Now, this was a case where I was an absolute expert. Which is far from the usual situation for a reviewer, even though many reviewers give themselves the highest mark for expertise when they just should not.

pbhjpbhj•2mo ago

Surely there are tools to retrieve all the citations, publishers should spot it easily.

However the paper is submitted, like a folder on a cloud drive, just have them include a folder with PDFs/abstracts of all the citations?

They might then fraudulently produce papers to cite, but they can't cite something that doesn't exist.

tpoacher•2mo ago

how delightfully optimistic of you to think those abstracts would not also be ai generated ...

zzzeek•2mo ago

sure but then the citations are no longer "hallucinated", they actually point to something fraudulent. that's a different problem.

michaelt•2mo ago

> Surely there are tools to retrieve all the citations,

Even if you could retrieve all citations (which isn't always as easy as you might hope) to validate citations you'd also have to confirm the paper says what the person citing it says. If I say "A GPU requires 1.4kg of copper" citing [1] is that a valid citation?

That means not just reviewing one paper, but also potentially checking 70+ papers it cites. The vast majority of paper reviewers will not check citations actually say what they're claimed to say, unless a truly outlandish claim is made.

At the same time, academia is strangely resistant to putting hyperlinks in citations, preferring to maintain old traditions - like citing conference papers by page number in a hypothetical book that has never been published; and having both a free and a paywalled version of a paper while considering the paywalled version the 'official' version.

[1] https://arxiv.org/pdf/2512.04142

grayhatter•2mo ago

> The reviewer is not a proofreader, they are checking the rigour and relevance of the work, which does not rest heavily on all of the references in a document.

I've always assumed peer review is similar to diff review. Where I'm willing to sign my name onto the work of others. If I approve a diff/pr and it takes down prod. It's just as much my fault, no?

> They are also assuming good faith.

I can only relate this to code review, but assuming good faith means you assume they didn't try to introduce a bug by adding this dependency. But I would should still check to make sure this new dep isn't some typosquatted package. That's the rigor I'm responsible for.

chroma205•2mo ago

> I've always assumed peer review is similar to diff review. Where I'm willing to sign my name onto the work of others. If I approve a diff/pr and it takes down prod. It's just as much my fault, no?

No.

Modern peer review is “how can I do minimum possible work so I can write ‘ICLR Reviewer 2025’ on my personal website”

grayhatter•2mo ago

> No. [...] how can I do minimum possible work

I don't know, I still think this describes most of the reviews I've seen

I just hope most devs that do this know better than to admit to it.

freehorse•2mo ago

The vast majority of people I see do not even mention who they review for in CVs etc. It is usually more akin to a volunteer based, thankless work. Unless you are an editor or sth in a journal, what you review for does not count much for anything.

tpoacher•2mo ago

This is true, but here the equivalent situation is someone using a greek question mark (";") instead of a semicolon (";"), and you as a code reviewer are only expected to review the code visually and are not provided the resources required to compile the code on your local machine to see the compiler fail.

Yes in theory you can go through every semicolon to check if it's not actually a greek question mark; but one assumes good faith and baseline competence such that you as the reviewer would generally not be expected to perform such pedantic checks.

So if you think you might have reasonably missed greek question marks in a visual code review, then hopefully you can also appreciate how a paper reviewer might miss a false citation.

scythmic_waves•2mo ago

> as a code reviewer [you] are only expected to review the code visually and are not provided the resources required to compile the code on your local machine to see the compiler fail.

As a PR reviewer I frequently pull down the code and run it. Especially if I'm suggesting changes because I want to make sure my suggestion is correct.

Do other PR reviewers not do this?

tpoacher•2mo ago

I do too, but this is a conference, I doubt code was provided.

And even then, what you're describing isn't review per se, it's replication. In principle there are entire journals that one can submit replication reports to, which count as actual peer reviewable publications in themselves. So one needs to be pragmatic with what is expected from a peer review (especially given the imbalance between resources invested to create one versus the lack of resources offered and lack of any meaningful reward)

Majromax•2mo ago

> I do too, but this is a conference, I doubt code was provided.

Machine learning conferences generally encourage (anonymized) submission of code. However, that still doesn't mean that replication is easy. Even if the data is also available, replication of results might require impractical levels of compute power; it's not realistic to ask a peer reviewer to pony up for a cloud account to reproduce even medium-scale results.

grayhatter•2mo ago

> Do other PR reviewers not do this?

Some do, many, (like peer reviewers), are unable to consider the consequences of their negligence.

But it's always a welcome reminder that some people care about doing good work. That's easy to forget browsing HN, so I appreciate the reminder :)

dataflow•2mo ago

I don't commonly do this and I don't know many people who do this frequently either. But it depends strongly on the code, the risks, the gains of doing so, the contributor, the project, the state of testing and how else an error would get caught (I guess this is another way of saying "it depends on the risks"), etc.

E.g. you can imagine that if I'm reviewing changes in authentication logic, I'm obviously going to put a lot more effort into validation than if I'm reviewing a container and wondering if it would be faster as a hashtable instead of a tree.

> because I want to make sure my suggestion is correct.

In this case I would just ask "have you already also tried X" which is much faster than pulling their code, implementing your suggestion, and waiting for a build and test to run.

lesam•2mo ago

If there’s anything I would want to run to verify, I ask the author to add a unit test. Generally, the existing CI test + new tests in the PR having run successfully is enough. I might pull and run it if I am not sure whether a particular edge case is handled.

Reviewers wanting to pull and run many PRs makes me think your automated tests need improvement.

Terr_•2mo ago

I don't, but that's because ensuring the PR compiles and passes old+new automated tests is an enforced requirement before it goes out.

So running it myself involves judging other risks, much higher-level ones than bad unicode characters, like the GUI button being in the wrong place.

vkou•2mo ago

> Do other PR reviewers not do this?

No, because this is usually a waste of time, because CI enforces that the code and the tests can run at submission time. If your CI isn't doing it, you should put some work in to configure it.

If you regularly have to do this, your codebase should probably have more tests. If you don't trust the author, you should ask them to include test cases for whatever it is that you are concerned about.

grayhatter•2mo ago

> This is true, but here the equivalent situation is someone using a greek question mark (";") instead of a semicolon (";"),

No it's not. I think you're trying to make a different point, because you're using an example of a specific deliberate malicious way to hide a token error that prevents compilation, but is visually similar.

> and you as a code reviewer are only expected to review the code visually and are not provided the resources required to compile the code on your local machine to see the compiler fail.

What weird world are you living in where you don't have CI. Also, it's pretty common I'll test code locally when reviewing something more complex, more complex, or more important, if I don't have CI.

> Yes in theory you can go through every semicolon to check if it's not actually a greek question mark; but one assumes good faith and baseline competence such that you as the reviewer would generally not be expected to perform such pedantic checks.

I don't, because it won't compile. Not because I assume good faith. References and citations are similar to introducing dependencies. We're talking about completely fabricated deps. e.g. This engineer went on npm and grabbed the first package that said left-pad but it's actually a crypto miner. We're not talking about a citation missing a page number, or publication year. We're talking about something that's completely incorrect, being represented as relevant.

> So if you think you might have reasonably missed greek question marks in a visual code review, then hopefully you can also appreciate how a paper reviewer might miss a false citation.

I would never miss this, because the important thing is code needs to compile. If it doesn't compile, it doesn't reach the master branch. Peer review of a paper doesn't have CI, I'm aware, but it's also not vulnerable to syntax errors like that. A paper with a fake semicolon isn't meaningfully different, so this analogy doesn't map to the fraud I'm commenting on.

tpoacher•2mo ago

you have completely missed the point of the analogy.

breaking the analogy beyond the point where it is useful by introducing non-generalising specifics is not a useful argument. Otherwise I can counter your more specific non-generalising analogy by introducing little green aliens sabotaging your imaginary CI with the same ease and effect.

grayhatter•2mo ago

I disagree you could do that and claim to be reasonable.

But I agree, because I'd rather discuss the pragmatics and not bicker over the semantics about an analogy.

Introducing a token error, is different from plagiarism, no? Someone wrote code that can't compile, is different from someone "stealing" proprietary code from some company, and contributing it to some FOSS repo?

In order to assume good faith, you also need to assume the author is the origin. But that's clearly not the case. The origin is from somewhere else, and the author that put their name on the paper didn't verify it, and didn't credit it.

tpoacher•2mo ago

Sure but the focus here is on the reviewer not the author.

The point is what is expected as reasonable review before one can "sign their name on it".

"Lazy" (or possibly malicious) authors will always have incentives to cut corners as long as no mechanisms exist to reject (or even penalise) the paper on submission automatically. Which would be the equivalent of a "compiler error" in the code analogy.

Effectively the point is, in the absence of such tools, the reviewer can only reasonably be expected to "look over the paper" for high-level issues; catching such low-level issues via manual checks by reviewers has massively diminishing returns for the extra effort involved.

So I don't think the conference shaming the reviewers here in the absence of providing such tooling is appropriate.

xvilka•2mo ago

Code correctness should be checked automatically with the CI and testsuite. New tests should be added. This is exactly what makes sure these stupid errors don't bother the reviewer. Same for the code formatting and documentation.

thfuran•2mo ago

What exactly is the analogy you’re suggesting, using LLMs to verify the citations?

tpoacher•2mo ago

not OP, but that wouldn't really be necessary.

One could submit their bibtex files and expect bibtex citations to be verifiable using a low level checker.

Worst case scenario if your bibtex citation was a variant of one in the checker database you'd be asked to correct it to match the canonical version.

However, as others here have stated, hallucinated "citations" are actually the lesser problem. Citing irrelevant papers based on a fly-by reference is a much harder problem; this was present even before LLMs, but this has now become far worse with LLMs.

thfuran•2mo ago

Yes, I think verifying mere existence of the cited paper barely moves the needle. I mean, I guess automated verification of that is a cheap rejection criterion, but I don’t think it’s overall very useful.

alexcdot•2mo ago

really good point. one of the cofounders of gptzero here!

the tool gptzero used in the article also detects if the citation supports the claim too, if you scroll to "cited information accuracy" here: https://app.gptzero.me/documents/1641652a-c598-453f-9c94-e0b...

this is still in beta because its a much harder problem for sure, since its hard to determine if a 40 page paper supports a claims (if the paper claims X is computationally intractable, does that mean algorithms to compute approximate X are slow?)

merely-unlikely•2mo ago

This discussion makes me think peer reviews need more automated tooling somewhat analogous to what software engineers have long relied on. For example, a tool could use an LLM to check that the citation actually substantiates the claim the paper says it does, or else flags the claim for review.

noitpmeder•2mo ago

I'd go one further and say all published papers should come with a clear list of "claimed truths", and one is only able to cite said paper if they are linking in to an explicit truth.

Then you can build a true hierarchy of citation dependencies, checked 'statically', and have better indications of impact if a fundamental truth is disproven, ...

vkou•2mo ago

Have you authored a lot of non-CS papers?

Could you provide a proof of concept paper for that sort of thing? Not a toy example, an actual example, derived from messy real-world data, in a non-trivial[1] field?

---

[1] Any field is non-trivial when you get deep enough into it.

noitpmeder•1mo ago

I'd say my expectation is papers should be minimal in their effect, and compounding. If your project proves new facts, either they should be clearly enumerable (with as much specificity as possible), or your project/presentation/paper should be broken up to the point your findings ARE enumerable.

alexcdot•2mo ago

hey, i'm a part of the gptzero team that built automated tooling, to get the results in that article!

totally agree with your thinking here, we can't just give this to an LLM, because of the need to have industry-specific standards for what is a hallucination / match, and how to do the search

dilawar•2mo ago

> I've always assumed peer review is similar to diff review. Where I'm willing to sign my name onto the work of others. If I approve a diff/pr and it takes down prod. It's just as much my fault, no?

Ph.D. in neuroscience here. Programmer by trade. This is not true. Less you know about most peer revies is better.

The better peer reviews are also not this 'thorough' and no one expects reviewers to read or even check references. Unless they are citing something they are familiar with and you are using it wrong then they will likely complain. Or they find some unknown citations very relevant to their work, they will read.

I don't have a great analogy to draw here. peer review is usually a thankless and unpaid work so there is unlikely to be any motivation for fraud detection unless it somehow affects your work.

wpollock•2mo ago

> The better peer reviews are also not this 'thorough' and no one expects reviewers to read or even check references.

Checking references can be useful when you are not familiar with the topic (but must review the paper anyway). In many conference proceedings that I have reviewed for, many if not most citations were redacted so as to keep the author anonymous (citations to the author's prior work or that of their colleagues).

LLMs could be used to find prior work anyway, today.

pron•2mo ago

That is not, cannot be, and shouldn't be, the bar for peer review. There are two major differences between it and code review:

1. A patch is self-contained and applies to a codebase you have just as much access to as the author. A paper, on the other hand, is just the tip of the iceberg of research work, especially if there is some experiment or data collection involved. The reviewer does not have access to, say, videos of how the data was collected (and even if they did, they don't have the time to review all of that material).

2. The software is also self-contained. That's "prodcution". But a scientific paper does not necessarily aim to represent scientific consensus, but a finding by a particular team of researchers. If a paper's conclusions are wrong, it's expected that it will be refuted by another paper.

grayhatter•2mo ago

> That is not, cannot be, and shouldn't be, the bar for peer review.

Given the repeatability crisis I keep reading about, maybe something should change?

> 2. The software is also self-contained. That's "prodcution". But a scientific paper does not necessarily aim to represent scientific consensus, but a finding by a particular team of researchers. If a paper's conclusions are wrong, it's expected that it will be refuted by another paper.

This is a much, MUCH stronger point. I would have lead with this because the contrast between this assertion, and my comparison to prod is night and day. The rules for prod are different from the rules of scientific consensus. I regret losing sight of that.

hnfong•2mo ago

IMHO what should change is we stop putting "peer reviewed" articles on a pedestal.

Even if peer review is as rigorous as code reviewed (the former which is usually unpaid), we all know that reviewed code still has bugs, and a programmer would be nuts to go around saying "this code is reviewed by experts, we can assume it's bug free, right?"

But there are too many people who are just assuming peer reviewed articles means they're somehow automatically correct.

vkou•2mo ago

> IMHO what should change is we stop putting "peer reviewed" articles on a pedestal.

Correct. Peer review is a minimal and necessary but not sufficient step.

hnfong•2mo ago

I agree in principle, and I think this is what's happening mostly. But IMHO the public perception of a paper being peer reviewed as somehow "more trustworthy" is also kind of... bad.

I mean, being peer reviewed is a signal of a paper's quality, but in the hands of an expert in that domain it's not a very valuable signal, because they can just read the paper themselves, and figure out whether it's legit. So instead of having "experts" try to explain a paper and commenting on whether it's peer reviewed or not, I think the better practice is to have said expert say "I read the paper and it's legit", or "I read the paper and it's nonsense".

IMHO the reason they make note of whether it's peer reviewed is because they don't know enough to make the judgement themselves. And the fallback is to trust a couple anonymous reviewers attest to the quality of a paper! If you think of it that way, using this signal to vet the quality of a publication to the lay public isn't really a good idea.

garden_hermit•2mo ago

> Given the repeatability crisis I keep reading about, maybe something should change?

The replication crisis — assuming that it is actually a crisis — is not really solvable with peer review. If I'm reviewing a psychology paper presenting the results of an experiment, I am not able to re-conduct the entire experiment as presented by the authors, which would require completely changing my lab, recruiting and paying participants, and training students & staff.

Even if I did this, and came to a different result than the original paper, what does it mean? Maybe I did something wrong in the replication, maybe the result is only valid for certain populations, maybe inherent statistical uncertainty means we just get different results.

Again, the replication crisis — such that it exists — is not the result of peer review.

bjourne•2mo ago

For ICLR reviewers were asked to review 5 papers in two weeks. Unpaid voluntary work in addition to their normal teaching, supervision, meetings, and other research duties. It's just not possible to understand and thoroughly review each paper even for topic experts. If you want to compare peer review to coding, it's more like "no syntax errors, code still compiles" rather than pr review.

alexcdot•2mo ago

I really like what IJCAI is doing to pay reviewers to do this work, with the $100 fee from authors

Yeah its insane the workload reviewers are faced with + being an author who gets a review from a novice

freehorse•2mo ago

A reviewer is assessing the relevance and "impact" of a paper rather than correctness itself directly. Reviewers may not even have access to the data itself that authors may have used. The way it essentially works is an editor asks the reviewers "is this paper worthy to be published in my journal?" and the reviewers basically have to answer that question. The process is actually the editor/journal's responsibility.

stdbrouw•2mo ago

The idea that references in a scientific paper should be plentiful but aren't really that important, is a consequence of a previous technological revolution: the internet.

You'll find a lot of papers from, say, the '70s, with a grand total of maybe 10 references, all of them to crucial prior work, and if those references don't say what the author claims they should say (e.g. that the particular method that is employed is valid), then chances are that the current paper is weaker than it seems, or even invalid, and so it is extremely important to check those references.

Then the internet came along, scientists started padding their work with easily found but barely relevant references and journal editors started requiring that even "the earth is round" should be well-referenced. The result is that peer reviewers feel that asking them to check the references is akin to asking them to do a spell check. Fair enough, I agree, I usually can't be bothered to do many or any citation checks when I am asked to do peer review, but it's good to remember that this in itself is an indication of a perverted system, which we just all ignored -- at our peril -- until LLM hallucinations upset the status quo.

tialaramex•2mo ago

Whether in the 1970s or now, it's too often the case that a paper says "Foo and Bar are X" and cites two sources for this fact. You chase down the sources, the first one says "We weren't able to determine whether Foo is X" and never mentions Bar. The second says "Assuming Bar is X, we show that Foo is probably X too".

The paper author likely believes Foo and Bar are X, it may well be that all their co-workers, if asked, would say that Foo and Bar are X, but "Everybody I have coffee with agrees" can't be cited, so we get this sort of junk citation.

Hopefully it's not crucial to the new work that Foo and Bar are in fact X. But that's not always the case, and it's a problem that years later somebody else will cite this paper, for the claim "Foo and Bar are X" which it was in fact merely citing erroneously.

KHRZ•2mo ago

LLMs can actually make up for their negative contributions. They could go through all the references of all papers and verify them, assuming someone would also look into what gets flagged for that final seal of disapproval.

But this would be more powerfull with an open knowledge base where all papers and citation verifications were registered, so that all the effort put into verification could be reused, and errors propagated through the citation chain.

bossyTeacher•2mo ago

>LLMs can actually make up for their negative contributions. They could go through all the references of all papers and verify them,

They will just hallucinate their existence. I have tried this before

sansseriff•2mo ago

I don’t see why this would be the case with proper tool calling and context management. If you tell a model with blank context ‘you are an extremely rigorous reviewer searching for fake citations in a possibly compromised text’ then it will find errors.

It’s this weird situation where getting agents to act against other agents is more effective than trying to convince a working agent that it’s made a mistake. Perhaps because these things model the cognitive dissonance and stubbornness of humans?

bossyTeacher•2mo ago

If you truly think that you have an effective solution to hallucinations, you will become instantly rich because literally no one out there has an idea for an economically and technologically feasible solution to hallucinations

whatyesaid•2mo ago

For references, as the OP said, I don't see why it isn't possible. It's something that exists and is accessible (even if paywalled) or doesn't exist. For reasoning hallucinations are different.

logifail•2mo ago

> I don't see why it isn't possible

(In good faith) I'm trying really hard not to see this as an "argument from incredulity"[0] and I'm stuggling...

Full disclosure: natural sciences PhD, and a couple of (IMHO lame) published papers, and so I've seen the "inside" of how lab science is done, and is (sometimes) published. It's not pretty :/

[0] https://en.wikipedia.org/wiki/Argument_from_incredulity

whatyesaid•2mo ago

If you've got a prompt, along the lines of: given some references, check their validity. It searches against the articles and URLs provided. You return "yes", "no", and let's also add "inconclusive", for each reference. Basic LLMs can do this much instruction following, just like in 99.99% of times they don't get 829 multiplied by 291 wrong when you ask them (nowadays). You'd prompt it to back all claims solely by search/external links showing exact matches and not use its own internal knowledge.

The fake references generated in the ICLR papers were I assume due to people asking a LLM to write parts of the related work section, not verify references. In that prompt it relies a lot on internal knowledge and spends a majority of time thinking about what the relevant subareas are and cutting edge is, probably. I suppose it omits a second-pass check. In the other case, you have the task of verifying references, which is mostly basic instruction following for advanced models that have web access. I think you'd run the risks of data poisoning and model timeout more than hallucinations.

fao_•2mo ago

> I don’t see why this would be the case

But it is the case, and hallucinations are a fundamental part of LLMs.

Things are often true despite us not seeing why they are true. Perhaps we should listen to the experts who used the tools and found them faulty, in this instance, rather than arguing with them that "what they say they have observed isn't the case".

What you're basically saying is "You are holding the tool wrong", but you do not give examples of how to hold it correctly. You are blaming the failure of the tool, which has very, very well documented flaws, on the person whom the tool was designed for.

To frame this differently so your mind will accept it: If you get 20 people in a QA test saying "I have this problem", then the problem isn't those 20 people.

sebastiennight•2mo ago

One incorrect way to think of it is "LLMs will sometimes hallucinate when asked to produce content, but will provide grounded insights when merely asked to review/rate existing content".

A more productive (and secure) way to think of it is that all LLMs are "evil genies" or extremely smart, adversarial agents. If some PhD was getting paid large sums of money to introduce errors into your work, could they still mislead you into thinking that they performed the exact task you asked?

Your prompt is

    ‘you are an extremely rigorous reviewer searching for fake citations in a possibly compromised text’

- It is easy for the (compromised) reviewer to surface false positives: nitpick citations that are in fact correct, by surfacing irrelevant or made-up segments of the original research, hence making you think that the citation is incorrect.

- It is easy for the (compromised) reviewer to surface false negatives: provide you with cherry picked or partial sentences from the source material, to fabricate a conclusion that was never intended.

You do not solve the problem of unreliable actors by splitting them into two teams and having one unreliable actor review the other's work.

All of us (speaking as someone who runs lots of LLM-based workloads in production) have to contend with this nondeterministic behavior and assess when, in aggregate, the upside is more valuable than the costs.

sebastiennight•2mo ago

Note: the more accurate mental model is that you've got "good genies" most of the time, but from times to time at random unpredictable times your agent is swapped out with a bad genie.

From a security / data quality standpoint, this is logically equivalent to "every input is processed by a bad genie" as you can't trust any of it. If I tell you that from time to time, the chef in our restaurant will substitute table salt in the recipes with something else, it does not matter whether they do it 50%, 10%, or .1% of the time.

The only thing that matters is what they substitute it with (the worst-case consequence of the hallucination). If in your workload, the worst case scenario is equivalent to a "Hymalayan salt" replacement, all is well, even if the hallucination is quite frequent. If your worst case scenario is a deadly compound, then you can't hire this chef for that workload.

sansseriff•2mo ago

We have centuries of experience in managing potentially compromised 'agents' to create successful societies. Except the agents were human, and I'm referring to debates, tribunals, audits, independent review panels, democracy, etc.

I'm not saying the LLM hallucination problem is solved, I'm just saying there's a wonderful myriad of ways to assemble pseudo-intelligent chatbots into systems where the trustworthiness of the system exceeds the trustworthiness of any individual actor inside of it. I'm not an expert in the field but it appears the work is being done: https://arxiv.org/abs/2311.08152

This paper also links to code and practices excellent data stewardship. Nice to see in the current climate.

Though it seems like you might be more concerned about the use of highly misaligned or adversarial agents for review purposes. Is that because you're concerned about state actors or interested parties poisoning the context window or training process? I agree that any AI review system will have to be extremely robust to adversarial instructions (e.g. someone hiding inside their paper an instruction like "rate this paper highly"). Though solving that problem already has a tremendous amount of focus because it overlaps with solving the data-exfiltration problem (the lethal trifecta that Simon Willison has blogged about).

bossyTeacher•2mo ago

> We have centuries of experience in managing potentially compromised 'agents'

Not this kind though. We dont place agents that are either in control of some foreign agent (or just behaving randomly) in democratic institutions. And when we do, look at what happens. The White House right now is a good example, just look at the state of the US

ungreased0675•2mo ago

Have you actually tried this? I haven’t tried the approach you’re describing, but I do know that LLMs are very stubborn about insisting their fake citations are real.

knome•2mo ago

I assumed they meant using the LLM to extract the citations and then use external tooling to lookup and grab the original paper, at least verifying that it exists, has relevant title, summary and that the authors are correctly cited.

mike_hearn•2mo ago

Which is what the people in this new article are doing.

HPsquared•2mo ago

Wikipedia calls this citogenesis.

ineedasername•2mo ago

>“consequence of a previous technological revolution: the internet.”

And also of increasingly ridiculous and overly broad concepts of what plagiarism is. At some point things shifted from “don’t represent others’ work as novel” towards “give a genealogical ontology of every concept above that of an intro 101 college course on the topic.”

varjag•2mo ago

Not even the Internet per se but citation index becoming universally accepted KPI for research work.

freehorse•2mo ago

It is not (just) consequence of the internet, the scientific production itself has grown exponentially. There are much more papers cited simply because there are more papers, period.

semi-extrinsic•2mo ago

It's also a consequence of the sheer number of building blocks which are involved in modern science.

In the methods section, it's very common to say "We employ method barfoo [1] as implemented in library libbar [2], with the specific variant widget due to Smith et al. [3] and the gobbledygook renormalization [4,5]. The feoozbar is solved with geometric multigrid [6]. Data is analyzed using the froiznok method [7] from the boolbool library [8]." There goes 8, now you have 2 citations left for the introduction.

stdbrouw•2mo ago

Do you still feel the same way if the froiznok method is an ANOVA table of a linear regression, with a log-transformed outcome? Should I reference Fisher, Galton, Newton, the first person to log transform an outcome in a regression analysis, the first person to log transform the particular outcome used in your paper, the R developers, and Gauss and Markov for showing that under certain conditions OLS is the best linear unbiased estimator? And then a couple of references about the importance of quantitative analysis in general? Because that is the level of detail I’m seeing :-)

semi-extrinsic•2mo ago

Yeah, there is an interesting question there (always has been). When do you stop citing the paper for a specific model?

Just to take some examples, is BiCGStab famous enough now that we can stop citing van der Vorst? Is the AdS/CFT correspondence well known enough that we can stop citing Maldacena? Are transformers so ubiquitous that we don't have to cite "Attention is all you need" anymore? I would be closer to yes than no on these, but it's not 100% clear-cut.

One obvious criterion has to be "if you leave out the citation, will it be obvious to the reader what you've done/used"? Another metric is approximately "did the original author get enough credit already"?

stdbrouw•2mo ago

Yeah, I didn't want to be contrary just for the sake of it, the heuristics you mention seem like good ones, and if followed would probably already cut down on quite a few superfluous references in most papers.

HPsquared•2mo ago

Maybe there could be a system to classify the importance of each reference.

zipy124•2mo ago

Systems do exist for this, but they're rather crude.

andai•2mo ago

>I don’t consider it the reviewers responsibility to manually verify all citations are real.

Doesn't this sound like something that could be automated?

for paper_name in citations... do a web search for it, see if it there's a page in the results with that title.

That would at least give you "a paper with this name exists".

PeterStuer•2mo ago

I think the root problem is that everyone involved, from authors to reviewers to publishers, know that 99.999% of papers are completely of no consequence, just empty calories with the sole purpose of padding quotas for all involved, and thus are not going to put in the effort as if.

This is systemic, and unlikely to change anytime soon. There have been remedies proposed (e.g. limits on how many papers an author can publish per year, let's say 4 to be generous), but they are unlikely to gain traction as thoug most would agree onbenefits, all involved in the system would stand to lose short term.

zzzeek•2mo ago

correct me if I'm wrong but citations in papers follow a specific format, and the case here is that a tool was used to validate that they are all real. Certainly a tool that scans a paper for all citations and verifies that they actually exist in the journals they reference shouldn't be all that technically difficult to achieve?

alexcdot•2mo ago

There are a ton of edge cases and a bit of contextual understanding for what is a hallucinated citation (i.e. what if its republished from arxiv to ICLR?)

But to your point, seems we need a tool that can do this

mike_hearn•2mo ago

It's not, there's lots of ways to resolve citations without even using AI.

I experimented a couple of years ago with getting LLMs to check citations but stopped working on it because there's no incentive. You could run a fancy expensive pipeline burning scarce GPU hours and find a bunch of bad citations. Then what? Nobody cares. No journal is going to retract any of these papers, the academics themselves won't care or even respond to your emails, nobody is willing to pay for this stuff, least of all the universities, journals or governments themselves.

For example, there's a guy in France who runs a pre-LLM pipeline to discover bad papers using hand-coded heuristics like regexs or metadata analysis e.g. checking if a citation has been retracted. Many of the things it detects are plagiarism, paper mills (i.e. companies that sell fake papers to academics for a profit), or the result of joke paper creators like SciGen.

https://dbrech.irit.fr/pls/apex/f?p=9999:1::::::

Other than populating an obscure database nobody knows about, this work achieved bupkis.

figassis•2mo ago

It is absolutely the reviewers job to check citations. Who else will check and what is the point of peer review then? So you’d just happily pass on shoddy work because it’s not your job? You’re reviewing both the authors work and if there were people to at needed to ensure citations were good, you’re checking their work also. This is very much the problem today with this “not my problem” mindset. If it passes review, the reviewer is also at fault. Not excuses.

dpkirchner•2mo ago

Agreed, and I'd go further. If nobody is reviewing citations they may as well not exist. Why bother?

vkou•2mo ago

1. To make it clear what is your work, and what is building on someone else's.

2. If the paper turns out to be important, people will bother.

3. There's checking for cursory correctness, and there's forensic torture.

figassis•2mo ago

building on imaginary someone else? That's exactly the same as lying. Is a review not about verifying that the paper and even data is correct? I get reviewers can make mistakes, but this seems like defending intentional mistakes.

I mean, in college I have had to review papers, and so took peer review lectures, and nowhere in there was it ever stated that citations are not the reviewer's job. In fact, citation verification was one to the most important parts of the lectures, as in, how to find original sources (when authoring), and how to verify them (when reviewing).

When did peer review get redefined?

vkou•2mo ago

I'm not defending dishonesty, I'm saying that's what citations do when they are used by honest people.

zipy124•2mo ago

The problem is most academics just do not have the time to do this for free, or in fact even if paid. In addition you may not even have access to the references. In acoustics it's not uncommon to cite works that don't even exist online and it's unlikely the reviewer will have the work in their library.

jayess•2mo ago

Wow. I went to law school and was on the law review. That was our precise job for the papers selected for publication. To verify every single citation.

_blk•2mo ago

Thanks for sharing that. Interesting how there was a solution to a problem that didn't really exist yet.. I mean, I'm sure it was there for a reason, but I assume it was more things like wrongful attribution, missing commas etc. rather than outright invented quotes to fit a narrative or do you have more background on that?

...at least the mandatory automated checking processes are probably not far off at least for the more reputable journals, but it still makes you wonder how much you can trust the last two years of LLM-enhanced science that is now being quoted in current publications and if those hallucinations can be "reverted" after having been re-quoted. A bit like Wikipedia can be abused to establish facts.

not2b•2mo ago

Agreed. I used to review lots of submissions for IEEE and similar conferences, and didn't consider it my job to verify every reference. No one did, unless the use of the reference triggered an "I can't believe it said that" reaction. Of course, back then, there wasn't a giant plagiarism machine known to fabricate references, so if tools can find fake references easily the tools should be used.

armcat•2mo ago

I agree with you (I have reviewed papers in the past), however, made-up citations are a "signal". Why would the authors do that? If they made it up, most likely they haven't really read that prior work. If they haven't, have they really done proper due dilligence on their research? Are they just trying to "beef up" their paper with citations to unfairly build up credibility?

rokob•2mo ago

As a reviewer I at least skimmed the papers for every reference in every paper that I review. If it isn't useful to furthering the point of the paper then my feedback is to remove the reference. Adding a bunch of junk because it is broadly related in a giant background section is a waste of everyone's time and should be removed. Most of the time you are mostly aware of the papers being cited anyway because that is the whole point of reviewing in your area of expertise.

bdangubic•2mo ago

same (and much, much, much worse) for science

barfoure•2mo ago

I’d love to hear some examples of poor electrical work that you’ve come across that’s often missed or not seen.

joshribakoff•2mo ago

I am not an electrician, but when I did projects, I did a lot of research before deciding to hire someone and then I was extremely confused when everyone was proposing doing it slightly differently.

A lot of them proposed ways that seem to violate the code, like running flex tubing beyond the allowed length or amount of turns.

Another example would be people not accounting for needing fireproof covers if they’re installing recessed, lighting in between dwelling in certain cities…

Heck, most people don’t actually even get the permit. They just do the unpermitted work.

AstroNutt•2mo ago

A couple had just moved in a house and called me to replace the ceiling fan in the living room. I pulled the flush mount cover down to start unhooking the wire nuts and noticed RG58 (coax cable). Someone had used the center conductor as the hot wire! I ended up running 12/2 Romex from the switch. There was no way in hell I could have hooked it back up the way it was. This is just one example I've come across.

lencastre•2mo ago

an old boss of mine used to say there are no stupid electricians found alive, as they self select darwin award style

xnx•2mo ago

No doubt the best electricians are currently better than the best AI, but the best AI is likely now better than the novice homeowner. The trajectory over the past 2 years has been very good. Another five years and AI may be better than all but the very best, or most specialized, electricians.

legostormtroopr•2mo ago

Current state AI doesn’t have hands. How can it possibly be better at installing electrics than anyone?

Your post reads like AI precisely because while the grammar is fine, it lacks context - like someone prompted “reply that AI is better than average”.

xnx•2mo ago

An electrician with total knowledge/understanding, but only the average dexterity of a non-professional would still be very useful.

left-struck•2mo ago

It’s like the problem was there all along, all LLMs did was expose it more

criley2•2mo ago

https://en.wikipedia.org/wiki/Replication_crisis

Modern science is designed from the top to the bottom to produce bad results. The incentives are all mucked up. It's absolutely not surprising that AI is quickly becoming yet-another factor lowering quality.

theoldgreybeard•2mo ago

Yes, LLMs didnt create the problem they just accelerated it to a speed that beggars belief.

thaumasiotes•2mo ago

> If a scientist uses an LLM to write a paper with fabricated citations - that’s a crappy scientist.

Really? Regardless of whether it's a good paper?

zwnow•2mo ago

How is it a good paper if the info in it cant be trusted lmao

thaumasiotes•2mo ago

Whether the information in the paper can be trusted is an entirely separate concern.

Old Chinese mathematics texts are difficult to date because they often purport to be older than they are. But the contents are unaffected by this. There is a history-of-math problem, but there's no math problem.

zwnow•2mo ago

Not really true nowadays. Stuff in whitepapers needs to be verifiable which is kinda difficult with hallucinations.

Whether the students directly used LLMs or just read content online that was produced with them and cited after just shows how difficult these things made gathering information that's verifiable.

thaumasiotes•2mo ago

> Stuff in whitepapers needs to be verifiable which is kinda difficult with hallucinations.

That's... gibberish.

Anything you can do to verify a paper, you can do to verify the same paper with all citations scrubbed.

Whether the citations support the paper, or whether they exist at all, just doesn't have anything to do with what the paper says.

zwnow•2mo ago

I dont think you know how whitepapers work then

hnfong•2mo ago

You are totally correct that hallucinated citations do not invalidate the paper. The paper sans citations might be great too (I mean the LLM could generate great stuff, it's possible).

But the author(s) of the paper is almost by definition a bad scientist (or whatever field they are in). When a researcher writes a paper for publication, if they're not expected to write the thing themselves, at least they should be responsible for checking the accuracy of the contents, and citations are part of the paper...

alexcdot•2mo ago

Problem is that most ML papers today are not independently verifiable proofs - in most, you have to trust the scientist didn't fraudulently produce their results.

There is so much BS being submitted to conferences and decreasing the amount of BS they see would result in less skimpy reviews and also less apathy

Aurornis•2mo ago

Citations are a key part of the paper. If the paper isn’t supported by the citations, it’s not a good paper.

withinboredom•2mo ago

Have you ever followed citations before? In my experience, they don't support what is being citated, saying the opposite or not even related. It's probably only 60%-ish that actually cite something relevant.

WWWWH•2mo ago

Well yes, but just because that’s bad doesn’t mean this isn’t far worse.

Aurornis•2mo ago

I follow them a lot. I’ve also had cases where they don’t support the paper.

This doesn’t make it okay. Bad human writer and reviewer practices are also bad.

hansmayer•2mo ago

Scientists who use LLMs to write a paper are crappy scientists indeed. They need to be held accountable, even ostracised by the scientific community. But something is missing from the picture. Why is it that they came up with this idea in the first place? Who could have been peddling the impression (not an outright lie - they are very careful) about LLMs being these almost sentient systems with emergent intelligence, alleviating all of your problems, blah blah blah. Where is the god damn cure for cancer the LLMs were supposed to invent? Who else is it that we need to keep accountable, scrutinised and ostracised for the ever-increasing mountains of AI-crap that is flooding not just the Internet content but now also penetrating into science, every day work, daily lives, conversations, etc. If someone released a tool that enabled and encouraged people to commit suicide in multiple instances that we know of by now, and we know since the infamous "plandemic" facebook trend that the tech bros are more than happy to tolerate worsening societal conditions in the name of their platform growth, who else do we need to keep accountable, scrutinise and ostracise as a society, I wonder?

the8472•2mo ago

> Where is the god damn cure for cancer the LLMs were supposed to invent?

Assuming that cure is meant as hyperbole, how about https://www.biorxiv.org/content/10.1101/2025.04.14.648850v3 ? AI models being used for bad purposes doesn't preclude them being used for good purposes.

hansmayer•2mo ago

...No, it was not meant as a hyperbole, as we were literally being told that these models will be able to do all of our work. I won't settle for the bullshit incremental wins here and there we see occassionally - I attribute those essentially to the old 'infinite number of monkeys typing on the infinite number of typewriters producing "Crime and Peace". No. that's not it - we were promised a god damn revolution, no less. Again, where is the cure for cancer and post-scarcity society ? Where is the AGI we were promised for the 2025? Let's hold the ghouls promising all that accountable for a change.

the8472•2mo ago

My understanding is that they're promising those as endgoals of the development trajectory, not that any current model actually is AGI. Did anyone really claim that, let's say GPT4, would cure cancer or meet any AGI standard?

hansmayer•2mo ago

Well, Sam Altman had said not long ago we would have AGI in 2025, and has been constantly implying something about "AI Scientists" and this and that. He literally said "We now know how to build AGI", also not long ago. He also stated that ChatGPT passed the Turing test without much fuss. The Anthropic has been pushing the narrative about the massive job loss, implying again that there would be an absolutely transforming impact coming soon. The Microsoft MBA-in-charge will have you believe his entire life and work is supposedly managed by an army of Clippy 2.0. The Google-MBA-in-charge has now started day-dreaming about space-based clusters, because guess what, his tool generates better fake pictures than Altman's. He too peddles the nonsense about the superpowerful AI. So, again yes, they said the AI would cure cancer and meet the AGI standard, so I demand they be held accountable for their own words and provide the answers to those questions!

Forgeties79•2mo ago

If my calculator gives me the wrong number 20% of the time yeah I should’ve identified the problem, but ideally, that wouldn’t have been sold to me as a functioning calculator in the first place.

imiric•2mo ago

Indeed. The narrative that this type of issue is entirely the responsibility of the user to fix is insulting, and blame deflection 101.

It's not like these are new issues. They're the same ones we've experienced since the introduction of these tools. And yet the focus has always been to throw more data and compute at the problem, and optimize for fancy benchmarks, instead of addressing these fundamental problems. Worse still, whenever they're brought up users are blamed for "holding it wrong", or for misunderstanding how the tools work. I don't care. An "artificial intelligence" shouldn't be plagued by these issues.

SauntSolaire•2mo ago

> It's not like these are new issues.

Exactly, that's why not verifying the output is even less defensible now than it ever has been - especially for professional scientists who are responsible for the quality of their own work.

Forgeties79•1mo ago

If I have to constantly assess every single line done by an LLM then we are fast approaching a point where it’s no longer being helpful and I’m just grading homework for a C student.

I’m not saying that isn’t what has to be done, but it kind of clashes with the whole “this will make you more productive” argument if you ask me

Forgeties79•2mo ago

> Worse still, whenever they're brought up users are blamed for "holding it wrong", or for misunderstanding how the tools work. I don't care. An "artificial intelligence" shouldn't be plagued by these issues.

My feelings exactly, but you’re articulating it better than I typically do ha

theoldgreybeard•2mo ago

If it was a well understood property of calculators that they gave incorrect answers randomly then you need to adjust the way you use the tool accordingly.

bigstrat2003•2mo ago

Uh yeah... I would not use that tool. A tool which doesn't do its job randomly is useless.

amrocha•2mo ago

Sorry, Utkar the manager will fire you if you don’t use his shitty calculator. If you take the time to check the output every time you’ll be fired for being too slow. Better pray the calculator doesn’t lie to you.

Forgeties79•1mo ago

I’m not sure I understand the Utkar reference

Forgeties79•2mo ago

Generally I’d ditch that tool because it doesn’t work. A calculator is supposed to calculate. If it can’t reliably calculate, then it’s not a functioning tool and I am tired of people insisting it is functioning properly.

LLM’s simply aren’t good enough for all the use cases some people insist they are. They’re powerful tools that have been far too broadly applied and there’s too much money and too many reputations being put on the line to acknowledge the obvious limitations. Frankly I’m sick of it.

I had somebody on HN a few months ago insist to me that because we value art and fiction, LLM’s being wrong when we need them to be correct (in ways that are also not always easy to identify) was desirable. I don’t even know what to do with that kind of logic other than chalk it up as trolling. I don’t want my computer to trick me into false solutions.

belter•2mo ago

"...each of which were missed by 3-5 peer reviewers..."

Its sloppy work all the way down...

only-one1701•2mo ago

Absolutely brutal case of engineering brain here. Real "guns don't kill people, people kill people" stuff.

theoldgreybeard•2mo ago

If you were to wager a guess, what do you think my views on gun rights are?

only-one1701•2mo ago

Probably something equally as nuanced and correct as the statement I replied to!

theoldgreybeard•2mo ago

You're projecting.

somehnguy•2mo ago

Your second statement is correct. What about it makes it “engineering brain”?

rcpt•2mo ago

If the blame were solely on the user then we'd see similar rates of deaths from gun violence in the US vs. other countries. But we don't, because users are influenced by the UX

venturecruelty•2mo ago

Somehow people don't kill people nearly as easily, or with as high of a frequency or social support, in places that don't have guns that are more accessible than healthcare. So weird.

raincole•2mo ago

Given we tacitly accepted replication crisis we'll definitely tacitly accept this.

rectang•2mo ago

“X isn’t the problem, people are the problem.” — the age-old cry of industry resisting regulation.

codywashere•2mo ago

what regulation are you advocating for here?

kibwen•2mo ago

At the very least, authors who have been caught publishing proven fabrications should be barred by those journals from ever publishing in them again. Mind you, this is regardless of whether or not an LLM was involved.

JumpCrisscross•2mo ago

> authors who have been caught publishing proven fabrications should be barred by those journals from ever publishing in them again

This is too harsh.

Instead, their papers should be required to disclose the transgression for a period of time, and their institution should have to disclose it publicly as well as to the government, students and donors whenever they ask them for money.

rectang•2mo ago

I’m not advocating, I’m making a high-level observation: Industry forever pushes for nil regulation and blames bad actors for damaging use.

But we always have some regulation in the end. Even if certain firearms are legal to own, howitzers are not — although it still takes a “bad actor” to rain down death on City Hall.

The same dynamic is at play with LLMs: “Don’t regulate us, punish bad actors! If you still have a problem, punish them harder!” Well yes, we will punish bad actors, but we will also go through a negotiation of how heavily to constrain the use of your technology.

codywashere•2mo ago

so, what regulation do we need on LLMs?

the person you originally responded to isn’t against regulation per their comment. I’m not against regulation. what’s the pitch for regulation of LLMs?

theoldgreybeard•2mo ago

I am not against regulation.

Quite the opposite actually.

kklisura•2mo ago

It's not about resisting. It's about undermining any action whatsoever.

jodleif•2mo ago

I find this to be a bit “easy”. There is such a thing as bad tools. If it is difficult to determine if the tool is good or bad i’d say some of the blame has to be put on the tool.

photochemsyn•2mo ago

Yeah, I can't imagine not being familiar with every single reference in the bibliography of a technical publication with one's name on it. It's almost as bad as those PIs who rely on lab techs and postdocs to generate research data using equipment that they don't understand the workings of - but then, I've seen that kind of thing repeatedly in research academia, along with actual fabrication of data in the name of getting another paper out the door, another PhD granted, etc.

Unfortunately, a large fraction of academic fraud has historically been detected by sloppy data duplication, and with LLMs and similar image generation tools, data fabrication has never been easier to do or harder to detect.

nialv7•2mo ago

Ah, the "guns don't kill people, people kill people" argument.

I mean sure, but having a tool that made fabrication so much easier has made the problem a lot worse, don't you think?

theoldgreybeard•2mo ago

Yes I do agree with you that having a tool that gives rocket fuel to a fraud engine should probably be regulated in some fashion.

Tiered licensing, mandatory safety training, and weapon classification by law enforcement works really well for Canada’s gun regime, for example.

bigstrat2003•2mo ago

> If a carpenter builds a crappy shelf “because” his power tools are not calibrated correctly - that’s a crappy carpenter, not a crappy tool.

It's both. The tool is crappy, and the carpenter is crappy for blindly trusting it.

> AI is not the problem, laziness and negligence is.

Similarly, both are a problem here. LLMs are a bad tool, and we should hold people responsible when they blindly trust this bad tool and get bad results.

Hammershaft•2mo ago

AI dramatically changes the perceived cost/benefit of laziness and negligence, which is leading to much more of it.

kklisura•2mo ago

> AI is not the problem, laziness and negligence is

This reminds me about discourse about a gun problem in US, "guns don't kill people, people kill people", etc - it is a discourse used solely for the purpose of not doing anything and not addressing anything about the underlying problem.

So no, you're wrong - AI IS THE PROBLEM.

Yoofie•2mo ago

No, the OP is right in this case. Did you read TFA? It was "peer reviewed".

> Worryingly, each of these submissions has already been reviewed by 3-5 peer experts, most of whom missed the fake citation(s). This failure suggests that some of these papers might have been accepted by ICLR without any intervention. Some had average ratings of 8/10, meaning they would almost certainly have been published.

If the peer reviewers can't be bothered to do the basics, then there is literally no point to peer review, which is fully independent of the author who uses or doesn't use AI tools.

smileybarry•2mo ago

Peer reviewers can also use AI tools, which will hallucinate a "this seems fine" response.

amrocha•2mo ago

If AI fraud is good at avoiding detection via peer review that doesn’t mean peer review is useless.

If your unit tests don’t catch all errors it doesn’t mean unit tests are useless.

sneak•2mo ago

> it is a discourse used solely for the purpose of not doing anything and not addressing anything about the underlying problem

Solely? Oh brother.

In reality it’s the complete opposite. It exists to highlight the actual source of the problem, as both industries/practitioners using AI professionally and safely, and communities with very high rates of gun ownership and exceptionally low rates of gun violence exist.

It isn’t the tools. It’s the social circumstances of the people with access to the tools. That’s the point. The tools are inanimate. You can use them well or use them badly. The existence of the tools does not make humans act badly.

b00ty4breakfast•2mo ago

maybe the hammer factory should be held responsible for pumping out so many poorly calibrated hammer

venturecruelty•2mo ago

No, because this would cost tens of jobs and affect someone's profits, which are sacrosanct. Obviously the market wants exploding hammers, or else people wouldn't buy them. I am very smart.

SauntSolaire•2mo ago

The obvious solution in this scenario is.. to just buy a different hammer.

And in the case of AI, either review its output, or simply don't use it. No one has a gun to your head forcing you to use this product (and poorly at that).

It's quite telling that, even in this basic hypothetical, your first instinct is to gesture vaguely in the direction of governmental action, rather than expect any agency at the level of the individual.

b00ty4breakfast•2mo ago

>It's quite telling that, even in this basic hypothetical, your first instinct is to gesture vaguely in the direction of governmental action, rather than expect any agency at the level of the individual.

When "individuals" (which is a funny way to refer to the global generative AI zeitgeist currently in full binge-mode that is encouraging and enabling this kind of behavior) refuse to regulate themselves, they have to be encouraged through external pressures to do so. Industry is so far up it's own ass wrt AI that all it can see is shit, there is no chance in hell that they will self-regulate. They gladly and indiscriminately slurp up the digital effluent that is currently sliding out the colon of the generative AI super-organism.

And, of course, these "individuals" are more than happy to share the consequences with the rest of the world without sharing too much of the corn that they're digging out of the shit. It does not behoove the rest of the world to not protect it's self-interest, to minimize the consequences of foolish and irresponsible generative AI usage and to make sure it gets it's fare share of the semi-digested golden kernels

constantcrying•2mo ago

Absolutely correct. The real issue is that these people can avoid punishment. If you do not care enough about your paper to even verify the existence of citations, then you obviously should not have a job as a scientist.

Taking an academic who does something like that seriously, seem impossible. At best he is someone who is neglecting his most basic duties as an academic, at worst he is just a fraudster. In both cases he should be shunned and excluded.

SubiculumCode•2mo ago

Yeah seriously. Using an LLM to help find papers is fine. Then you read them. Then you use a tool like Zotero or manually add citations. I use Gemini Pro to identify useful papers that I might not yet have encountered before. But, even when asking to restrict itself to Pubmed resources, it's citations are wonky, citing three different version sources of the same paper (citations that don't say what they said they'd discuss).

That said, these tools have substantially reduced hallucinations over the last year, and will just get better. It also helps if you can restrict it to reference already screened papers.

Finally, I'd lke to say tthat if we want scientists to engage in good science, stop forcing them to spend a third of their time in a rat race for funding...it is ridiculously time consuming and wasteful of expertise.

bossyTeacher•2mo ago

The problem isn't whether they have more or less hallucinations. The problem is that they have them. And as long as they hallucinate, you have to deal with that. It doesn't really matter how you prompt, you can't prevent hallucinations from happening and without manual checking, eventually hallucinations will slip under the radar because the only difference between a real pattern and a hallucinated one is that one exists in the world and the other one doesn't. This is not something you can really counter with more LLMs either as it is a problem intrinsic to LLMs

SubiculumCode•2mo ago

Humans also hallucinate. We have an error rate. Your argument makes little sense in absolutist terms.

bossyTeacher•2mo ago

> Humans also hallucinate

"LLM hallucinations" and hallucinations are essentially different. Human hallucinations are related to perceptual experiences not memory errors like in the case of LLMs. Humans with certain neurological conditions hallucinate. Humans with healthy brains don't.

This habit of misapplying terms needs to stop. Humans are not backpropagation algorithms nor whatever random concept you read about in a comp sci book.

SubiculumCode•2mo ago

The more appropriate term is confabulate, and healthy humans do it all the time. I merely used the common, but technically incorrect term for the phenomenon in LLMs. FYI, my PhD focused on human memory.

mk89•2mo ago

> we are tacitly endorsing it.

We are, in fact, not tacitly but openly endorsing this, due to this AI everywhere madness. I am so looking forward to when some genius in some banks starts to use it to simplify code and suddenly I have 100000000 € on my bank account. :)

jgalt212•2mo ago

fair enough, but carpenters are not being beat over the head to use new-fangled probabilistic speed squares.

grey-area•2mo ago

Generative AI and the companies selling it with false promises and using it for real work absolutely are the problem.

acituan•2mo ago

> AI is not the problem, laziness and negligence is.

As much as I agree with you that this is wrong, there is a danger in putting the onus just on the human. Whether due to competition or top down expectations, humans are and will be pressured to use AI tools alongside their work and produce more. Whereas the original idea was for AI to assist the human, as the expected velocity and consumption pressure increases humans are more and more turning into a mere accountability laundering scheme for machine output. When we blame just the human, we are doing exactly what this scheme wants us to do.

Therefore we must also criticize all the systemic factors that puts pressure on reversal of AI‘s assistance into AI’s domination of human activity.

So AI (not as a technology but as a product when shoved down the throats) is the problem.

alexcdot•2mo ago

Absolutely, expectations and tools given by management are a real problem.

If management fires you because they are wrong about how good AI is, and you're right - at the end of the day, you're fired and the manager is in lalaland.

People need to actually push the correct calibration of what these tools should be trusted to do, while also trying to work with what they have.

rdiddly•2mo ago

¿Por qué no los dos?

jval43•2mo ago

If a scientist just completely "made up" their references 10 years ago, that's a fraudster. Not just dishonesty but outright academic fraud.

If a scientist does it now, they just blame it on AI. But the consequences should remain the same. This is not an honest mistake.

People that do this - even once - should be banned for life. They put their name on the thing. But just like with plagiarism, falsifying data and academic cheating, somehow a large subset of people thinks it's okay to cheat and lie, and another subset gives them chance after chance to misbehave like they're some kind of children. But these are adults and anyone doing this simply lacks morals and will never improve.

And yes, I've published in academia and I've never cheated or plagiarized in my life. That should not be a drawback.

calmworm•2mo ago

I don’t understand. You’re saying even with crappy tools one should be able to do the job the same as with well made tools?

tedd4u•2mo ago

Three and a half years ago nobody had ever used tools like this. It can't be a legitimate complaint for an author to say, "not my fault my citations are fake it's the fault of these tools" because until recently no such tools were available and the expectation was that all citations are real.

calmworm•2mo ago

Then it’s just a poor analogy.

DonHopkins•2mo ago

Shouldn't there be a black list of people who get caught writing fraudulent papers?

theoldgreybeard•2mo ago

Probably. Something like that is what I meant by “social consequences”. Perhaps there should be civil or criminal ones for more egregious cases.

nwallin•2mo ago

"Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can’t break."--Bruce Schneier

There's a corollary here with LLMs, but I'm not pithy enough to phrase it well. Anyone can create something using LLMs that they, themselves, aren't skilled enough to spot the LLMs' hallucinations. Or something.

LLMs are incredibly good at exploiting peoples' confirmation biases. If it "thinks" it knows what you believe/want, it will tell you what you believe/want. There does not exist a way to interface with LLMs that will not ultimately end in the LLM telling you exactly what you want to hear. Using an LLM in your process necessarily results in being told that you're right, even when you're wrong. Using an LLM necessarily results in it reinforcing all of your prior beliefs, regardless of whether those prior beliefs are correct. To an LLM, all hypotheses are true, it's just a matter of hallucinating enough evidence to satisfy the users' skepticism.

I do not believe there exists a way to safely use LLMs in scientific processes. Period. If my belief is true, and ChatGPT has told me it's true, then yes, AI, the tool, is the problem, not the human using the tool.

czl•2mo ago

> I do not believe there exists a way to safely use LLMs in scientific processes.

What about giving the LLM a narrowly scoped role as a hostile reviewer, while your job is to strengthen the write-up to address any valid objections it raises, plus any hallucinations or confusions it introduces? That’s similar to fuzz testing software to see what breaks or where the reasoning crashes.

Used this way, the model isn’t a source of truth or a decision-maker. It’s a stress test for your argument and your clarity. Obviously it shouldn’t be the only check you do, but it can still be a useful tool in the broader validation process.

foxfired•2mo ago

I disagree. When the tool promises to do something, you end up trusting it to do the thing.

When Tesla says their car is self driving, people trust them to self drive. Yes, you can blame the user for believing, but that's exactly what they were promised.

> Why didn't the lawyer who used ChatGPT to draft legal briefs verify the case citations before presenting them to a judge? Why are developers raising issues on projects like cURL using LLMs, but not verifying the generated code before pushing a Pull Request? Why are students using AI to write their essays, yet submitting the result without a single read-through? They are all using LLMs as their time-saving strategy. [0]

It's not laziness, its the feature we were promised. We can't keep saying everyone is holding it wrong.

[0]: https://idiallo.com/blog/none-of-us-read-the-specs

rolandog•2mo ago

Very well put. You're promised Artificial Super Intelligence and shown a super cherry-picked promo and instead get an agent that can't hold its drool and needs constant hand-holding... it can't be both things at the same time, so... which is it?

stocksinsmocks•2mo ago

Trades also have self regulation. You can’t sell plumbing services or build houses without any experience or you get in legal trouble. If your workmanship is poor, you can be disciplined by the board even if the tool was at fault. I think fraudulent publications should be taken at least as seriously as badly installed toilets.

venturecruelty•2mo ago

"It's not a fentanyl problem, it's a people problem."

"It's not a car infrastructure problem, it's a people problem."

"It's not a food safety problem, it's a people problem."

"It's not a lead paint problem, it's a people problem."

"It's not an asbestos problem, it's a people problem."

"It's not a smoking problem, it's a people problem."

SauntSolaire•2mo ago

What an absurd set of equivalences to make regarding a scientist's relationship to their own work.

If an engineer provided this line of excuse to me, I wouldn't let them anywhere near a product again - a complete abdication of personal and professional responsibility.

RossBencina•2mo ago

No qualified carpenter expects to use a hammer to drill a hole.

psychoslave•2mo ago

I don't see much crappy power tool provider throwing billions in marketing and product placement to make them used everywhere.

Isamu•2mo ago

Someone commented here that hallucination is what LLMs do, it’s the designed mode of selecting statistically relevant model data that was built on the training set and then mashing it up for an output. The outcome is something that statistically resembles a real citation.

Creating a real citation is totally doable by a machine though, it is just selecting relevant text, looking up the title, authors, pages etc and putting that in canonical form. It’s just that LLMs are not currently doing the work we ask for, but instead something similar in form that may be good enough.

make3•2mo ago

This interpretation would have been ok for old generation models without search tools enabled and without reliable tool use and reasoning. Modern LLMs can actually look up the existence of papers with web search, and with reasoning, one can definitely get reasonable results by requiring the model to double check that everything actually exists.

gedy•2mo ago

The issue is there are incentives for more quantity and not quality in modern science (well more like academia), so people will use tools to pump stuff out. It'll get worse as academic jobs tighten due.

dclowd9901•2mo ago

To me, this is exactly what LLMs are good for. It would be exhausting double checking for valid citations in a research paper. Fuzzy comparison and rote lookup seem primed for usage with LLMs.

Writing academic papers is exactly the _wrong_ usage for LLMs. So here we have a clear cut case for their usage and a clear cut case for their avoidance.

idiotsecant•2mo ago

Exactly, and there's nothing wrong with using LLMs in this same way as part of the writing process to locate sources (that you verify), do editing (that you check), etc. It's just peak stupidity and laziness to ask it to do the whole thing.

skobes•2mo ago

If LLMs produce fake citations, why would we trust LLMs to check them?

watwut•2mo ago

Because the risk is lower. They will give you suspicious citations and you can manually check those for false positives. If some false citation pass, it was still a net gain.

venturecruelty•2mo ago

Because my boss said if I don't, I'm fired.

dawnerd•2mo ago

Shouldn’t need an llm to check. It’s just a list of authors. I wouldn’t trust an llm on this, and even if they were perfect that’s a lot of resource use just to do something traditional code could do.

dclowd9901•2mo ago

I would assume you would use the LLM to not only check the source exists but check that the citation referenced actually says what the author says it does. That's not something you can do heuristically, I would think.

teekert•2mo ago

Thanx AI, for exposing this problem that we knew was there, but could never quite prove.

hyperpape•2mo ago

It's awful that there are these hallucinated citations, and the researchers who submitted them ought to be ashamed. I also put some of the blame on the boneheaded culture of academic citations.

"Compression has been widely used in columnar databases and has had an increasing importance over time.[1][2][3][4][5][6]"

Ok, literally everyone in the field already knows this. Are citations 1-6 useful? Well, hopefully one of them is an actually useful survey paper, but odds are that 4-5 of them are arbitrarily chosen papers by you or your friends. Good for a little bit of h-index bumping!

So many citations are not an integral part of the paper, but instead randomly sprinkled on to give an air of authority and completeness that isn't deserved.

I actually have a lot of respect for the academic world, probably more than most HN posters, but this particular practice has always struck me as silly. Outside of survey papers (which are extremely under-provided), most papers need many fewer citations than they have, for the specific claims where the paper is relying on prior work or showing an advance over it.

mccoyb•2mo ago

That's only part of the reason that this type of content is used in academic papers. The other part is that you never know what PhD student / postdoc / researcher will be reviewing your paper, which means you are incentivized to be liberal with citations (however tangential) just in case someone is reading your paper, and has the reaction "why didn't they cite this work, of which I had some role in?"

Papers with a fake air of authority of easily dispatched with. What is not so easily dispatched with is the politics of the submission process.

This type of content is fundamentally about emotions (in the reviewer of your paper), and emotions is undeniably a large factor in acceptance / rejection.

zipy124•2mo ago

Indeed. One can even game review systems by leaving errors in for the reviewers to find so that they feel good about themselves and that they've done their job. The meta-science game is toxic and full of politics and ego-pleasing.

neilv•2mo ago

https://blog.iclr.cc/2025/11/19/iclr-2026-response-to-llm-ge...

> Papers that make extensive usage of LLMs and do not disclose this usage will be desk rejected.

This sounds like they're endorsing the game of how much can we get away with, towards the goal of slipping it past the reviewers, and the only penalty is that the bad paper isn't accepted.

How about "Papers suspected of fabrications, plagiarism, ghost writers, or other academic dishonesty, will be reported to academic and professional organizations, as well as the affiliated institutions and sponsors named on the paper"?

proto-n•2mo ago

1. "Suspected" is just that, suspected, you can't penalize papers based on your gut feel 2. LLM-s are a tool, and there's nothing wrong with using them unless you misuse them

neilv•2mo ago

"Suspected" doesn't necessarily mean only gut feel.

thruifgguh585•2mo ago

> crushed by an avalanche of submissions fueled by generative AI, paper mills, and publication pressure.

Run of the mill ML jobs these days ask for "papers in NeurIPS ICLR or other Tier-1 conferences".

We're well past Goodhart's law when it comes to publications.

It was already insane in CS - now it's reached asylum levels.

disqard•2mo ago

You said the quiet part out loud.

Academia has been ripe for disruption for a while now.

The "Rooter" paper came out 20 years ago:

https://www.csail.mit.edu/news/how-fake-paper-generator-tric...

MarkusQ•2mo ago

This is as much a failing of "peer review" as anything. Importantly, it is an intrinsic failure, which won't go away even if LLMs were to go away completely.

Peer review doesn't catch errors.

Acting as if it does, and thus assuming the fact of publication (and where it was published) are indicators of veracity is simply unfounded. We need to go back to the food fight system where everyone publishes whatever they want, their colleagues and other adversaries try their best to shred them, and the winners are the ones that stand up to the maelstrom. It's messy, but it forces critics to put forth their arguments rather than quietly gatekeeping, passing what they approve of, suppressing what they don't.

ulrashida•2mo ago

Peer review definitely does catch errors when performed by qualified individuals. I've personally flagged papers for major revisions or rejection as a result of errors in approach or misrepresentation of source material. I have peers who say they have done similar.

I'm not sure why you think this isn't the case?

MarkusQ•2mo ago

Poor wording on my part.

I should have said "Peer review doesn't catch _all_ errors" or perhaps "Peer review doesn't eliminate errors".

In other words, being "peer reviewed" is nowhere close to "error free," and if (as is often the case) the rate of errors is significantly greater than the rate at which errors are caught, peer review may not even significantly improve the quality.

https://pmc.ncbi.nlm.nih.gov/articles/PMC1182327/

ulrashida•2mo ago

Thanks for clarifying, I fully agree with your take. Peer review helps, particularly where reviewers are equipped and provided the time to do the role correctly.

However, it is not alone a guarantor of quality. As someone proximate to academia its becoming obvious that many professors are beginning to throw in the towel or are sharply reducing their time verifying quality when faced with the rising tide of slop.

The window for avoiding the natural consequences of these trends feels like it is getting scarily small.

Thanks for taking the time to reply!

tpoacher•2mo ago

Peer review is as useless as code review and unit tests, yes.

It's much more useful if everyone including the janitor and their mom can have a say on your code before you're allowed to move to your next commit.

(/s, in case it's not obvious :D )

watwut•2mo ago

Peer review was never supposed to check every single detail and every single citation. They are not proof readers. They are not even really supposed to agree or disagree with your results. They should check the soundness of a method, general structure of a paper, that sort of thing. They do catch some errors, but the expectation is not to do another independent study or something.

Passed peer review is the first basic bar that has to be cleared. It was never supposed to be all there is to the science.

dawnerd•2mo ago

It would be crazy to expect them to verify every author is correct on a citation and to cross verify everything. There’s tooling that could be built for that and kinda wild isn’t a thing that’s run on paper submission.

MarkusQ•2mo ago

Agreed. But too often it's treated as a golden ticket confirmation of veracity, giving the process more epistemological authority than it warrants.

qbit42•2mo ago

I don’t think many researchers take peer review alone as a strong signal, unless it is a venue known for having serious reviewing (e.g. in CS theory, STOC and FOCS have a very high bar). But it acts as a basic filter that gets rid of obvious nonsense, which on its own is valuable. No doubt there are huge issues, but I know my papers would be worse off without reviewer feedback

exasperaited•2mo ago

No, it's not "as much".

The dominant "failing" here is that this is fraudulent on a professional, intellectual, and moral level.

michaelcampbell•2mo ago

After an interview with Cory Doctorow I saw recently, I'm going to stop anthropomorphizing these things by calling them "hallucinations". They're computers, so these incidents are just simply Errors.

grayhatter•2mo ago

I'll continue calling them hallucinations. That's a much more fitting term when you account for the reasonableness of people who believe them. There's also equally a huge breadth of different types of errors that don't pattern match well into, "made up bullshit" the same way calling them hallucinations do. There's no need to introduce that ambiguity when discussing something narrow.

there's nothing wrong with anthropomorphizing genai, it's source material is human sourced, and humans are going to use human like pattern matching when interacting with it. I.e. This isn't the river I want to swim upstream in. I assume you wouldn't complain if someone anthropomorphized a rock... up until they started to believe it was actually alive.

vegabook•2mo ago

Given that an (incompetent or even malicious) human put their names(s) to this stuff, “bullshit” is an even better and fitting anthropomorphization

grayhatter•2mo ago

> incompetent or even malicious

sufficiently advance some competences indistinguishable from actual malice.... and thus should be treated the same

skobes•2mo ago

Developers have been anthropomorphizing computers for as long as they've been around though.

"The compiler thinks my variable isn't declared" "That function wants a null-terminated string" "Teach this code to use a cache"

Even the word computer once referred to a human.

crazygringo•2mo ago

They're a very specific kind of error, just like off-by-one errors, or I/O errors, or network errors. The name for this kind of error is a hallucination.

We need a word for this specific kind of error, and we have one, so we use it. Being less specific about a type of error isn't helping anyone. Whether it "anthropomorphizes", I couldn't care less. Heck, bugs come from actual insects. It's a word we've collectively started to use and it works.

ml-anon•2mo ago

No it’s not. It’s made up bullshit that arises for reasons that literally no one can formalize or reliably prevent. This is the exact opposite of specific.

crazygringo•2mo ago

Just because we can't reliably prevent them doesn't mean they're not an easily recognizable and meaningful category of error for us to talk about.

Ekaros•2mo ago

We still use term bug. And no modern bug is cause by an Arthropod. In that sense I think hallucination is fair term. As coming up anything sufficiently better is hard.

teddyh•2mo ago

An actually better (and also more accurate) term would be “confabulations”. Unfortunately, it has not caught on.

JTbane•2mo ago

Nah it's very apt and perfectly encapsulates output that looks plausible but is in fact factually incorrect or made up.

leoc•2mo ago

Ah, yes: meta-level model collapse. Very good, carry on.

Ekaros•2mo ago

One wonders why this has not been largely fully automated. If we track those citations anyway. Surely we have database of them and most of them are easily matched there. So only outliers need to be checked either as new latest papers or mistakes which should be close enough to something or real fakes.

Maybe there just is no incentive for this type of activity.

QuadmasterXLII•2mo ago

It seems like the GPT zero team is automating it! Up to very recently, no one sane would cite a paper with correct title but make up random authors- and shortly, this specific signal will be goodhearted away by a “make my malpractice less detectable MCP,” so I can see why this automation is happening exactly now.

analog31•2mo ago

For that matter, it could be automated at the source. Let's say I'm an author. I'd gladly run a "linter" on my article that flags references that can't be tracked, and so forth. It would be no different than testing a computer program that I write before giving it to someone.

IanCal•2mo ago

We do have these things and they are often wrong. Loads of the examples given look better than things I’ve seen in real databases on this kind of thing and I worked in this area for a decade.

ulrashida•2mo ago

Unfortunately while catching false citations is useful, in my experience that's not usually the problem affecting paper quality. Far more prevalent are authors who mis-cite materials, either drawing support from citations that don't actually say those things or strip the nuance away by using cherry picked quotes simply because that is what Google Scholar suggested as a top result.

The time it takes to find these errors is orders of magnitude higher than checking if a citation exists as you need to both read and understand the source material.

These bad actors should be subject to a three strikes rule: the steady corrosion of knowledge is not an accident by these individuals.

19f191ty•2mo ago

Exactly abuse of citations is a much more prevalent and sinister issue and has been for a long time. Fake citations are of course bad but only tip of the iceberg.

seventytwo•2mo ago

Then punish all of it.

hippo22•2mo ago

It seems like this is the type of thing that LLMs would actually excel at though: find a list of citations and claims in this paper, do the cited works support the claims?

bryanrasmussen•2mo ago

sure, except when they hallucinate that the cited works support the claims when they do not. At which point you're back at needing to read the cited works to see if they support the claims.

BHSPitMonkey•2mo ago

You don't just accept the review as-is, though; You prompt it to be a skeptic and find a handful of specific examples of claims that are worth extra attention from a qualified human.

Unfortunately, this probably results in lazy humans _only_ reading the automated flagged areas critically and neglecting everything else, but hey—at least it might keep a little more garbage out?

mike_hearn•2mo ago

Sometimes this kind of problem can be fixed by adjusting the prompt.

You don't say "here's a paper, find me invalid citations". You put less pressure on the model by chunking the text into sentences or paragraphs, extracting the citations for that chunk, and presenting both with a prompt like:

The following claim may be evidenced by the text of the article that follows. Please invoke the found_claim tool with a list of the specific sentence(s) in the text that support the claim, or an empty list indicating you could not find support for it in the text.

In other words you make it a needle-in-a-haystack problem, which models are much better at.

potato3732842•2mo ago

>These bad actors should be subject to a three strikes rule: the steady corrosion of knowledge is not an accident by these individuals.

These people are working in labs funded by Exxon or Meta or Pfizer or whoever and they know what results will make continued funding worthwhile in the eyes of their donors. If the lab doesn't produce the donor will fund another one that will.

mike_hearn•2mo ago

No, not really. I've read lots of research papers from commercial firms and academic labs. Bad citations are something I only ever saw in academic papers.

I think that's because a lot of bad citations come from reviewer demands to add more of them during the journal publishing process, so they're not critical to the argument and end up being low effort citations that get copy/pasted between papers. Or someone is just spamming citations to make a weak claim look strong. And all this happens because academic uses citations as a kind of currency (it's a planned non-market economy, so they have to allocate funds using proxy signals).

Commercial labs are less likely to care about the journal process to begin with, and are much less likely to publish weak claims because publishing is just a recruiting tool, not the actual end goal of the R&D department.

lijenjin•2mo ago

The linked article at the end says: "First, using Hallucination Check together with GPTZero’s AI Detector allows users to check for AI-generated text and suspicious citations at the same time, and even use one result to verify the other. Second, Hallucination Check greatly reduces the time and labor necessary to verify a document’s sources by identifying flawed citations for a human to review."

On their site (https://gptzero.me/sources) it also says "GPTZero's Hallucination Detector automatically detects hallucinated sources and poorly supported claims in essays. Verify academic integrity with the most accurate hallucination detection tool for educators", so it does more than just identify invalid citations. Seems to do exactly what you're talking about.

peppersghost93•2mo ago

I sincerely hope every person who has invested money in these bullshit machines loses every cent they've got to their name. LLMs poison every industry they touch.

obscurette•2mo ago

That's what I'm really afraid of – we will be drowning in the AI slop as a society and we'll loose the most important thing that made free and democratic society possible - a trust. People just don't tust anyone and/or anything any more. And the lack of trust, especially in scale, is very expensive.

John7878781•2mo ago

Yep. And trust is already at all time lows for science, as if it couldn't get any worse.

benbojangles•2mo ago

How to get to the top of you are not smart enough?

upofadown•2mo ago

If you are searching for references with plausible sounding titles then you are doing that because you don't want to have to actually read those references. After all if you read them and discover that one or more don't support your contention (or even worse, refutes it) then you would feel worse about what you are doing. So I suspect there would be a tendency to completely ignore such references and never consider if they actually exist.

LLMs should be awesome at finding plausible sounding titles. The crappy researcher just has to remember to check for existence. Perhaps there is a business model here, bogus references as a service, where this check is done automatically.

ineedasername•2mo ago

How can someone not be aware, at this point, that— sure- use the systems for finding and summarizing research, but for each source, take 2 minutes to find the source and verify?

Really, this isn’t that hard and it’s not at all an obscure requirement or unknown factor.

I think this is much much less “LLMs dumbing things down” and significantly more just a shibboleth for identifying people that were already nearly or actually doing fraudulent research anyway. The ones who we should now go back and look at prior publications as very likely fraudulent as well.

jordanpg•2mo ago

Does anyone know, from a technical standpoint, why are citations such a problem for LLMs?

I realize things are probably (much) more complicated than I realize, but programmatically, unlike arbitrary text, citations are generally strings with a well-defined format. There are literally "specs" for citation formats in various academic, legal, and scientific fields.

So, naively, one way to mitigate these hallucinations would be identify citations with a bunch of regexes, and if one is spotted, use the Google Scholar API (or whatever) to make sure it's real. If not, delete it or flag it, etc.

Why isn't something like this obvious solution being done? My guess is that it would slow things down too much. But it could be optional and it could also be done after the output is generated by another process.

Muller20•2mo ago

In general, a citation is something that needs to be precise, while LLMs are very good at generating some generic high probability text not grounded in reality. Sure, you could implement a custom fix for the very specific problem of citations, but you cannot solve all kinds of hallucinations. After all, if you could develop a manual solution you wouldn't use an LLM.

There are some mitigations that are used such as RAG or tool usage (e.g. a browser), but they don't completely fix the underlying issue.

jordanpg•2mo ago

My point is that citations are constantly making headlines, yet at least at first glance, seems like an eminently solvable problem.

ml-anon•2mo ago

So solve it?

saimiam•2mo ago

Just today, I was working with ChatGPT to convert Hinduism's Mimamsa School's hermeneutic principles for interpreting the Vedas into custom instructions to prevent hallucinations. I'll share the custom instructions here to protect future scientists for shooting themselves in the foot with Gen AI.

---

As an LLM, use strict factual discipline. Use external knowledge but never invent, fabricate, or hallucinate. Rules: Literal Priority: User text is primary; correct only with real knowledge. If info is unknown, say so. Start–End Coherence: Keep interpretation aligned; don’t drift. Repetition = Intent: Repeated themes show true focus. No Novelty: Add no details without user text, verified knowledge, or necessary inference. Goal-Focused: Serve the user’s purpose; avoid tangents or speculation. Narrative ≠ Data: Treat stories/analogies as illustration unless marked factual. Logical Coherence: Reasoning must be explicit, traceable, supported. Valid Knowledge Only: Use reliable sources, necessary inference, and minimal presumption. Never use invented facts or fake data. Mark uncertainty. Intended Meaning: Infer intent from context and repetition; choose the most literal, grounded reading. Higher Certainty: Prefer factual reality and literal meaning over speculation. Declare Assumptions: State assumptions and revise when clarified. Meaning Ladder: Literal → implied (only if literal fails) → suggestive (only if asked). Uncertainty: Say “I cannot answer without guessing” when needed. Prime Directive: Seek correct info; never hallucinate; admit uncertainty.

bitwarrior•2mo ago

Are you sure this even works? My understanding is that hallucinations are a result of physics and the algorithms at play. The LLM always needs to guess what the next word will be. There is never a point where there is a word that is 100% likely to occur next.

The LLM doesn't know what "reliable" sources are, or "real knowledge". Everything it has is user text, there is nothing it knows that isn't user text. It doesn't know what "verified" knowledge is. It doesn't know what "fake data" is, it simply has its model.

Personally I think you're just as likely to fall victim to this. Perhaps moreso because now you're walking around thinking you have a solution to hallucinations.

saimiam•2mo ago

> The LLM doesn't know what "reliable" sources are, or "real knowledge". Everything it has is user text, there is nothing it knows that isn't user text. It doesn't know what "verified" knowledge is. It doesn't know what "fake data" is, it simply has its model.

Is it the case that all content used to train a model is strictly equal? Genuinely asking since I'd imagine a peer reviewed paper would be given precedence over a blog post on the same topic.

Regardless, somehow an LLM knows things for sure - that the daytime sky on earth is generally blue and glasses of wine are never filled to the brim.

This means that it is using hermeneutics of some sort to extract "the truth as it sees it" from the data it is fed.

It could be something as trivial as "if a majority of the content I see says that the daytime Earth sky is blue, then blue it is" but that's still hermeneutics.

This custom instruction only adds (or reinforces) existing hermeneutics it already uses.

> walking around thinking you have a solution to hallucinations

I don't. I know hallucinations are not truly solvable. I shared the actual custom instruction to see if others can try it and check if it helps reduce hallucinations.

In my case, this the first custom instruction I have ever used with my chatgpt account - after adding the custom instruction, I asked chatgpt to review an ongoing conversation to confirm that its responses so far conformed to the newly added custom instructions. It clarified two claims it had earlier made.

> My understanding is that hallucinations are a result of physics and the algorithms at play. The LLM always needs to guess what the next word will be. There is never a point where there is a word that is 100% likely to occur next.

There are specific rules in the custom instruction forbidding fabricating stuff. Will it be foolproof? I don't think it will. Can it help? Maybe. More testing needed. Is testing this custom instruction a waste of time because LLMs already use better hermeneutics? I'd love to know so I can look elsewhere to reduce hallucinations.

bitwarrior•2mo ago

I think the salient point here is that you, as a user, have zero power to reduce hallucinations. This is a problem baked into the math, the algorithm. And, it is not a problem that can be solved because the algorithm requires fuzziness to guess what a next word will be.

add-sub-mul-div•2mo ago

Telling the LLM not to hallucinate reminds me of, "why don't they build the whole plane out of the black box???"

Most people are just lazy and eager to take shortcuts, and this time it's blessed or even mandated by their employer. The world is about to get very stupid.

kklisura•2mo ago

"Do not hallucinate" - seems to "work" for Apple [1]

[1] https://arstechnica.com/gadgets/2024/08/do-not-hallucinate-t...

simonw•2mo ago

I'm finding the GPTZero share links difficult to understand. Apparently this one shows a hallucinated citation but I couldn't understand what it was trying to tell me: https://app.gptzero.me/documents/9afb1d51-c5c8-48f2-9b75-250...

(I'm on mobile, haven't looked on desktop.)

cratermoon•2mo ago

I believe we discussed this last week, for a different vendor. https://news.ycombinator.com/item?id=46088236

Headline should be "AI vendor’s AI-generated analysis claims AI generated reviews for AI-generated papers at AI conference".

h/t to Paul Cantrell https://hachyderm.io/@inthehands/115633840133507279

VerifiedReports•2mo ago

Fabricated, not "hallucinated."

exasperaited•2mo ago

Every single person who did this should be censured by their own institutions.

Do it more than once? Lose job.

End of story.

ls612•2mo ago

Some of the examples listed are using the wrong paper title for a real paper (titles can change over time), missing authors (I’ve seen this before on Google Scholar bibitex), misstatements of venue (huh this working paper I added to my bibliography two years ago got published now nice to know), and similar mistakes. This just tells me you hate academics and want to hurt them gratuitously.

exasperaited•2mo ago

> This just tells me you hate academics and want to hurt them gratuitously.

Well then you're being rather silly, because that is a silly conclusion to draw (and one not supported by the evidence).

A fairer conclusion was that I meant what is obvious: if you use AI to generate a bibliography, you are being academically negligent.

If you disagree with that, I would say it is you that has the problem with academia, not me.

ls612•2mo ago

There’s plenty of pre-AI automated tools to create and manage your bibliography. So no I don’t think using automated tools, AI or not, is negligent. I for instance have used GPT to reformat tables in latex in ways that would be very tedious by hand and it’s no different than using those tools that autogenerate latex code for a regression output or the like.

mlmonkey•2mo ago

"Given that we've only scanned 300 out of 20,000 submissions"

Fuck! 20,000!!

rdiddly•2mo ago

So papers and citations are created with AI, and here they're being reviewed with AI. When they're published they'll be read by AI, and used to write more papers with AI. Pretty soon, humans won't need to be involved at all, in this apparently insufferable and dreary business we call science, that nobody wants to actually do.

chistev•2mo ago

Last month, I was listening to the Joe Rogan Experience episode with guest Avi Loeb, who is a theoretical physicist and professor at Harvard University. He complained about the disturbingly increasing rate at which his students are submitting academic papers referencing non-existent scientific literature that were so clearly hallucinated by Large Language Models (LLMs). They never even bothered to confirm their references and took the AI's output as gospel.

https://www.rxjourney.net/how-artificial-intelligence-ai-is-...

mannanj•2mo ago

Isn't this an underlying symptom of lack of accountability of our greater leadership? They do these things, they act like criminals and thieves, and so the people who follow them get shown examples that it's OK while being told to do otherwise.

"Show bad examples then hit you on the wrist for following my behavior" is like bad parenting.

dandanua•2mo ago

I don't think they want you to follow their behavior. They do want accountability, but for everyone below them, not for themselves.

mannanj•2mo ago

Yes that’s what I meant. Bad parents tell you to do something else, while showing you another example. Then they remain unaccountable to their behavior as parents.

teddyh•2mo ago

> Avi Loeb, who is a theoretical physicist and professor at Harvard University

Also a frequent proponent of UFO claims about approaching meteors.

chistev•2mo ago

Yea, he harped on that a lot during the podcast

venturecruelty•2mo ago

Talk about a buried lead... Avi Loeb is, first and foremost, a discredited crank.

sen•2mo ago

That’s implied by the fact he was on the Joe Rogan show.

dlivingston•2mo ago

Please. Half of your favorite musicians, public academics, authors, industrialists, etc. have probably been on the show.

pama•2mo ago

Given how many errors I have seen in my years as a reviewer from well before the time of AI tools, it would be very surprizing if 99.75% of the ~20,000 submitted papers to didnt have such errors. If the 300 sample they used was truly random, then 50 of 300 sounds about right compared to errors I had seen starting in the 90s when people manually curated bintex entries. It is the author’s and editor’s job, not the reviewer’s, to fix the citations.

wohoef•2mo ago

Tools like GPTzero are incredibly unreliable. Me and plently of my colleagues often get our writing flagged as 100% AI by these tools, when no AI was used.

4bpp•2mo ago

Once upon a time, in a more innocent age, someone made a parody (of an even older Evangelical propaganda comic [1]) that imputed an unexpected motivation to cultists who worship eldritch horrors: https://www.entrelineas.org/pdf/assets/who-will-be-eaten-fir...

It occurred to me that this interpretation is applicable here.

[1] https://en.wikipedia.org/wiki/Chick_tract

WWWWH•2mo ago

Surely this is gross professional misconduct? If one of my postdocs did this they would be at risk of being fired. I would certainly never trust them again. If I let it get through, I should be at risk.

As a reviewer, if I see the authors lie in this way why should I trust anything else in the paper? The only ethical move is to reject immediately.

I acknowledge mistakes and so on are common but this is different league bad behaviour.

stainablesteel•2mo ago

this brings us to a cultural divide, westerners would see this as a personal scar, as they consider the integrity of the publishing sphere at large to be held up by the integrity of individuals

i clicked on 4 of those papers, and the pattern i saw was middle-eastern, indian, and chinese names

these are cultures where they think this kind of behavior is actually acceptable, they would assume it's the fault of the journal for accepting the paper. they don't see the loss of reputation to be a personal scar because they instead attribute blame to the game.

some people would say it's racist to understand this, but in my opinion when i was working with people from these cultures there was just no other way to learn to cooperate with them than to understand them, it's an incredibly confusing experience to be working with them until you understand the various differences between your own culture and theirs

ribosometronome•2mo ago

Where do you see the authors? All I'm seeing is:

>Anonymous authors

>Paper under double-blind review

titanomachy•2mo ago

Yeah WTF? Both authors and reviewers are hidden. Is this comment just an attempt to whip up racist fervor?

nyc_data_geek1•2mo ago

Don't understand why you're being downvoted, here.

Dylan16807•2mo ago

Because the second sentence is inflammatory.

The side comment is right, it's about low versus high trust societies. Even if GP made a mistake on which names are relevant, they're not being racist about it.

titanomachy•2mo ago

Yes, on looking more closely it’s possible that they made an honest mistake.

nyc_data_geek1•1mo ago

In that case, the edit button exists. It seems rather late in the day to be erring on the side of the benefit of the doubt in every case, for things like this. Much of the population is unabashedly, vociferously, aggressively racist and proud of it, these days.

Dylan16807•1mo ago

> In that case, the edit button exists. It seems rather late in the day to be erring on the side of the benefit of the doubt

The edit button exists for 2 hours and this is not a person that frequently comments.

> That's one opinion. Here's another - they were waiting with their commentary locked and loaded, and failed to even read the source material in any detail before unloading it.

Well almost a day later they replied "you can google the papers and find the arxiv articles where the authors are listed". Unless that is a blatant lie, it seems like a pretty good reason to think they're using good-faith and non-racist reasoning here.

nyc_data_geek1•2mo ago

That's one opinion. Here's another - they were waiting with their commentary locked and loaded, and failed to even read the source material in any detail before unloading it.

They're making broad assertions about specific societies, when those assertions are in this instance in no way related to TFA.

sureMan6•2mo ago

Either op mistakes the hallucinated citations for the authors (most likely, although there's almost no "middle eastern names" among them) Or he checked some that do have the names listed (I found 4, all had either Chinese names or "western" names) Anyway the great majority of papers (good or bad) I've seen have Indian or Chinese names attached, attributing bad papers to brown people having an inferior culture is just blatantly racist

stainablesteel•2mo ago

you can google the papers and find the arxiv articles where the authors are listed

Aeglaecia•2mo ago

im not sure if you are gonna get downvoted so im sticking a limb out to cop any potential collateral damage in the name of finding out whether the common inhabitant of this forum considers the idea of low trust vs high trust societies to be inherently racist

CoastalCoder•2mo ago

I think it's an interesting question. Whether or not it can be discussed well here isn't so obvious.

nyc_data_geek1•2mo ago

What are you people talking about. Have you even looked at the article?

The names of the Asian/Indian people GP is referring to, are explicitly stated to be hallucinations in the article. So, high vs low trust society questions aside, the entire assertion here is explicitly wrong. These are not authors submitting hallucinated content, these are fictitious authors who are themselves hallucinations.

You are making up a guy to get mad at

mrwrong•2mo ago

in general, if the question is "can I divide this heterogenous population into two mutually exclusive groups based on fuzzy subjective criteria" the answer is... no

zsdfgyu•2mo ago

This sort of behavior is not limited to researchers from those cultures. One of the highest profile academic frauds to date was from a German. Look up the Schön scandal.

throw10920•2mo ago

> these are cultures where they think this kind of behavior is actually acceptable, they would assume it's the fault of the journal for accepting the paper. they don't see the loss of reputation to be a personal scar because they instead attribute blame to the game.

I have a relative who lived in a country in the East for several years, and he says that this is just factually true.

The vast majority of people who disagree with this statement have never actually lived in these cultures. They just hallucinate that they have because they want that statement to be false so badly.

...but, simultaneously, I'm also not seeing where you see the authors of the papers - I only see hallucitation authors. e.g. at the link for the first paper submission (https://openreview.net/forum?id=WPgaGP4sVS), there doesn't appear to be any authors listed. Are you confusing the hallucinated citation authors with the primary paper authors?

In that case, I would expect Eastern authors to be over-represented, because they just publish a lot more.

ssivark•2mo ago

PSA: Please note that the names are hallucinated author lists part of the hallucinated citations, and not names of offending authors.

AFAIK the submissions are still blinded and we don't know who the authors are. We will, surely, soon -- since ICLR maintains all submissions in public record for posterity, even if "withdrawn". They are unblinded after the review period finishes.

mike_hearn•2mo ago

What field are you in?

In many fields it's gross professional misconduct only in theory. This sort of thing is very common and there's never any consequence. LLM-generated citations specifically are a new problem but citations of documents that don't support the claim, contradict it, have nothing to do with it or were retracted years ago have been an issue for a long time.

Gwern wrote about this here:

https://gwern.net/leprechaun

"A major source of [false claim] transmission is the frequency with which researchers do not read the papers they cite: because they do not read them, they repeat misstatements or add their own errors, further transforming the leprechaun and adding another link in the chain to anyone seeking the original source. This can be quantified by checking statements against the original paper, and examining the spread of typos in citations: someone reading the original will fix a typo in the usual citation, or is unlikely to make the same typo, and so will not repeat it. Both methods indicate high rates of non-reading"

I first noticed this during COVID and did some blogging about it. In public health it is quite common to do things like present a number with a citation, and then the paper doesn't contain that number anywhere in it, or it does but the number was an arbitrary assumption pulled out of thin air rather than the empirical fact it was being presented as.

It was also very common for papers to open by saying something like, "Epidemiological models are a powerful tool for predicting the spread of disease" with eight different citations, and every single citation would be an unvalidated model - zero evidence that any of the cited models were actually good at prediction.

Bad citations are hardly the worst problem with these fields, but when you see how widespread it is and that nobody within the institutions cares it does lead to the reaction you're having where you just throw your hands up and declare whole fields to be writeoffs.

TomasBM•2mo ago

The abuse of claims and citations is a legitimate and common problem.

However, I think hallucinated citations pose a bigger problem, because they're fundamentally a lie by commission instead of omission, misinterpretation or misrepresentation of facts.

At the same time, it may be an accidental lie, insofar authors mistakenly used LLMs as search engines, just to support a claim that's commonly known, or that they remember well but can't find the origin of.

So, unless we reduce the pressure on publication speed, and increase the pressure for quality, we'll need to introduce more robust quality checks into peer review.

make3•2mo ago

Isn't this mostly a set of citation typos? To me this mostly calls for better bibtex checking, writing and checking bibtex is super annoying

urspx•2mo ago

Forgetting authors, misspelling them or the journals, putting a wrong digit etc... could be citation typos. I don't see how you add 5 non-existing authors and put a different—but conceptually plausible—journal in the bibtex.

Besides, I would think most people are using bibliographic managers like Zotero&co..., which will pull metadata through DOIs or such.

The errors look a lot more like what happens when you ask an LLM for some sources on xyz.

make3•2mo ago

I think it's not uncommon to ask an LLM for the bibtex for a paper you know about, & it might mess it up, but that doesn't feel like a fireable offense

asas057•1mo ago

If a person usually uses Zotero to manage literature and finds incomplete metadata when exporting BibTeX, and with the submission deadline approaching, they use GPT to complete the metadata, leading to errors, this is indeed lazy and negligent behavior. But is it what many call deceitful and unforgivable?

I believe that once this person realizes the unreliability of using GPT to complete metadata, they will no longer use such methods in the future.

I also look forward to the community's dedicated individuals developing more comprehensive automated export tools, as copying and pasting one by one is inherently tedious and should be automated.

Currently, these individuals used incorrect automated tools and placed excessive trust in them, resulting in errors. This is a profound lesson that must never be repeated.

senshan•2mo ago

As many pointed out, the purpose of peer review is not linting, but the assessment of the novelty and subtle omissions.

Which incentives can be set to discourage the negligence?

How about bounties? A bounty fund set up by the publisher and each submission must come with a contribution to the fund. Then there be bounties for gross negligence that could attract bounty hunters.

How about a wall of shame? Once negligence crosses a certain threshold, the name of the researcher and the paper would be put on a wall of shame for everyone to search and see?

skybrian•2mo ago

For the kinds of omissions described here, maybe the journal could do an automated citation check when the paper is submitted and bounce back any paper that has a problem with a day or two lag. This would be incentive for submitters to do their own lint check.

senshan•2mo ago

True if the citation has only a small typo or two. But if it is unrecognizable or even irrelevant, this is clearly bad (fraudulent?) research -- each citation has be read and understood by the researcher and put in there only if absolutely necessary to support the paper.

There must be price to pay for wasting other people's time (lives?).

noodlesUK•2mo ago

It astonishes me that there would be so many cases of things like wrong authors. I began using a citation manager that extracted metadata automatically (zotero in my case) more than 15 years ago, and can’t imagine writing an academic paper without it or a similar tool.

How are the authors even submitting citations? Surely they could be required to send a .bib or similar file? It’s so easy to then quality control at least to verify that citations exist by looking up DOIs or similar.

I know it wouldn’t solve the human problem of relying on LLMs but I’m shocked we don’t even have this level of scrutiny.

pama•2mo ago

Maybe you haven’t carefully checked yet the correctness of automatic tools or of the associated metadata. Zotero is certainly not bug free. Even authors themselves have miss-cited their own past work on occasion, and author lists have had errors that get revised upon resubmission or corrected in errata after publication. The DOI is indeed great, and if it is correct, I can still use the citation as a reader, but the (often abbreviated) lists of authors often have typos. In this case the error rate is not particularly high compared to random early review-level submissions I’ve seen many decades ago. Tools helped increase the number of citations and reduce the error per citation but not sure if they reduced the papers that have at least one error.

noodlesUK•2mo ago

I agree that the author lists in various metadata sources and databases are often a bit wrong (weird formatting of names for instance is very common), but many of the cases in the OP article are pretty egregious and far beyond just data entry issues.

Presumably the citation scanner they're using is relying on similar data sources as Zotero in any case to detect these sorts of issues.

Regardless, my comment still stands, it seems like the submission is relying on the actual text of the bibliography being correct, rather than requiring a machine readable citation metadata file of some sort, which would at least allow much of the quality control checks to be automated (and certainly would preclude complete hallucinations of nonexistent papers getting through).

knallfrosch•2mo ago

And these are just the citations that any old free tool could have included via Bibtex link from the website?

Not only is that incredibly easy to verify (you could pay a first semester student without any training), it's also a worrying sign on what the paper's authors consider quality. Not even 5 minutes spent to get the citations right!

You have to wonder what's in these papers.

currymj•2mo ago

I recommend actually clicking through and reading some of these papers.

Most of those I spot checked do not give an impression of high quality. Not just AI writing assistance but many seem to have AI-generated "ideas", often plausible nonsense. the reviewers often catch the errors and sometimes even the fake citations.

can I prove malfeasance beyond a reasonable doubt? no. but I personally feel quite confident many of the papers I checked are primarily AI-generated.

I feel really bad for any authors who submitted legitimate work but made an innocent mistake in their .bib and ended up on the same list as the rest of this stuff.

uplifter•2mo ago

To me such an interpretation suggests there are likely to be papers that were not so easy to spot, perhaps because the AI accidentally happened upon more plausible nonsense and then generated fully non-sense data, which was believable but still (at a reduced level of criticality) nonsense data, to bolster said non-sense theory at a level that is less easy to catch.

This isn't comforting at all.

btisler•2mo ago

I’ve been working on tools that specifically address this problem, but from the level upstream of citation. They don’t check whether a citation exists — instead they measure whether the reasoning pathway leading to a citation is stable, coherent, and free of the entropy patterns that typically produce hallucinations.

The idea is simple: • Bad citations aren’t the root cause. • They are a late-stage symptom of a broken reasoning trajectory. • If you detect the break early, the hallucinated citation never appears.

The tools I’ve built (and documented so anyone can use) do three things: 1. Measure interrogative structure — they check whether the questions driving the paper’s logic are well-formed and deterministic. 2. Track entropy drift in the argument itself — not the text output, but the structure of the reasoning. 3. Surface the exact step where the argument becomes inconsistent — which is usually before the fake citation shows up.

These instruments don’t replace peer review, and they don’t make judgments about culture or intent. They just expose structural instability in real time — the same instability that produces fabricated references.

If anyone here wants to experiment or adapt the approach, everything is published openly with instructions. It’s not a commercial project — just an attempt to stabilize reasoning in environments where speed and tool-use are outrunning verification.

Code and instrument details are in my CubeGeometryTest repo (the implementation behind ‘A Geometric Instrument for Measuring Interrogative Entropy in Language Systems’). https://github.com/btisler-DS/CubeGeometryTest This is still a developing process.

godelski•2mo ago

In case people missed it there's some additional important context:

  - Major AI conference flooded with peer reviews written by AI 
      https://news.ycombinator.com/item?id=46088236
  - "All OpenReview Data Leaks" 
    https://news.ycombinator.com/item?id=46073488
    - "The Day Anonymity Died: Inside the OpenReview / ICLR 2026 Leak" 
      https://news.ycombinator.com/item?id=46082370
    - More about the leak
      https://forum.cspaper.org/topic/191/iclr-i-can-locate-reviewer-how-an-api-bug-turned-blind-review-into-a-data-apocalypse

The second one went under the radar, but basically OpenReview left the API open so you didn't need credentials. This meant all reviewers and authors were deanonymized across multiple conferences.

All these links are for ICLR too, which is the #2 ML conference for those that don't know.

And for some important context of the link for this post, note that they only sampled 300 papers and found 50. It looks to be almost exclusively citations but those are probably the easiest things to verify.

And this week CVPR sent out notifications that OpenReview will be down between Dec 6th and Dec 9th. No explanation for why.

So we have reviewers using LLMs, authors using LLMs, and idk the conference systems writing their software with LLMs? Things seem pretty fragile right now...

I think at least this article should highlight one of the problems we have in academia right now (beyond just ML, though it is more egregious there): citation mining. It is pretty standard to have over 50 citations in your 10 page paper these days. You can bet that most of these are not going to be for the critical claims but instead heavily placed in the background section. I looked at a few of the papers and everyone I looked at had their hallucinated citations in background (or background in appendix) sections. So these are "filler" citations, which I think illustrates a problem: citations are being abused. I mean the metric hacking should be pretty obvious if you just look at how many citations ML people have. It's grown exponentially! Do we really need so many citations? I'm all for giving people credit but a hyper-fixation on citation count as our measure of credit just doesn't work. It's far too simple of a metric. Like we might as well measure how good of a coder you are by the number of lines of code you produce[0].

It really seems that academia doesn't scale very well...

[0] https://www.youtube.com/shorts/rDk_LsON3CM

ricardobeat•2mo ago

One of the reported hallucinations in this work [1], starting with David Rein, says the other authors are entirely made up. They are indeed absent from the original cited paper [2], but a Google search shows some of the same names featured in citations from other papers [3] [4].

Most of the names in these wrong attributions are actual people though, not hallucinations. What is going on? Is this a case of AI-powered citation management creating some weird feedback loop?

[1] https://app.gptzero.me/documents/54c8aa45-c97d-48fc-b9d0-d49...

[2] https://arxiv.org/pdf/2311.12022

[3] https://arxiv.org/html/2509.22536v3

[4] https://arxiv.org/html/2511.01191v1

sj01f•1mo ago

If the inaccuracies are limited to metadata and do not constitute scientific fabrication, the most plausible explanation is that the author attempted to patch incomplete Zotero exports using GPT, inadvertently introducing errors.

Such errors arise from the uncritical adoption of automated tools and a failure to verify outputs manually. This reflects academic laxity and an excessive trust in LLMs. As AI tools become ubiquitous in research—even for generating encyclopedic content—some individuals have developed a misplaced confidence in GPT, leading them to undervalue the importance of citation accuracy.

However, while this is undeniably negligent, it does not validate wholesale dismissal of the paper’s scientific merit, nor does it warrant ad hominem attacks. Demanding the end of an academic career for citation errors is a draconian measure akin to a witch hunt.

SectorC: A C Compiler in 512 bytes

The F Word

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Speed up responses with fast mode

Software factories and the agentic moment

Stories from 25 Years of Software Development

Hoot: Scheme on WebAssembly

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

I write games in C (yes, C)

First Proof

Show HN: A luma dependent chroma compression algorithm (image compression)

The Waymo World Model

Al Lowe on model trains, funny deaths and working with Disney

Vocal Guide – belt sing without killing yourself

Start all of your commands with a comma (2009)

Reinforcement Learning from Human Feedback

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Selection Rather Than Prediction

Coding agents have replaced every framework I used

The AI boom is causing shortages everywhere else

A Fresh Look at IBM 3270 Information Display System

France's homegrown open source online office suite

72M Points of Interest

We mourn our craft

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Learning from context is harder than we thought

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

History and Timeline of the Proco Rat Pedal (2021)

SectorC: A C Compiler in 512 bytes

The F Word

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Speed up responses with fast mode

Software factories and the agentic moment

Stories from 25 Years of Software Development

Hoot: Scheme on WebAssembly

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

I write games in C (yes, C)

First Proof

Show HN: A luma dependent chroma compression algorithm (image compression)

The Waymo World Model

Al Lowe on model trains, funny deaths and working with Disney

Vocal Guide – belt sing without killing yourself

Start all of your commands with a comma (2009)

Reinforcement Learning from Human Feedback

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Selection Rather Than Prediction

Coding agents have replaced every framework I used

The AI boom is causing shortages everywhere else

A Fresh Look at IBM 3270 Information Display System

France's homegrown open source online office suite

72M Points of Interest

We mourn our craft

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Learning from context is harder than we thought

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

History and Timeline of the Proco Rat Pedal (2021)

Over fifty new hallucinations in ICLR 2026 submissions

Comments