
GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers

https://gptzero.me/news/neurips/
273•segmenta•2h ago•151 comments

In Europe, Wind and Solar Overtake Fossil Fuels

https://e360.yale.edu/digest/europe-wind-solar-fossil-fuels
234•speckx•3h ago•177 comments

Qwen3-TTS Family Is Now Open Sourced: Voice Design, Clone, and Generation

https://qwen.ai/blog?id=qwen3tts-0115
156•Palmik•3h ago•31 comments

Tree-sitter vs. Language Servers

https://lambdaland.org/posts/2026-01-21_tree-sitter_vs_lsp/
99•ashton314•2h ago•28 comments

It looks like the status/need-triage label was removed

https://github.com/google-gemini/gemini-cli/issues/16728
39•nickswalker•1h ago•10 comments

Design Thinking Books You Must Read

https://www.designorate.com/design-thinking-books/
188•rrm1977•5h ago•89 comments

AnswerThis (YC F25) Is Hiring

https://www.ycombinator.com/companies/answerthis/jobs/r5VHmSC-ai-agent-orchestration
1•ayush4921•22m ago

Show HN: isometric.nyc – giant isometric pixel art map of NYC

https://cannoneyed.com/isometric-nyc/
26•cannoneyed•30m ago•4 comments

Launch HN: Constellation Space (YC W26) – AI for satellite mission assurance

https://constellation-io.com/
4•kmajid•19m ago•0 comments

Miami, Your Waymo Ride Is Ready

https://waymo.com/blog/2026/01/miami-your-waymo-ride-is-ready
14•ChrisArchitect•55m ago•3 comments

Ubisoft cancels six games including Prince of Persia and closes studios

https://www.bbc.co.uk/news/articles/c6200g826d2o
58•piqufoh•53m ago•36 comments

ISO PDF spec is getting Brotli – ~20% smaller documents with no quality loss

https://pdfa.org/want-to-make-your-pdfs-20-smaller-for-free/
89•whizzx•6h ago•38 comments

30 Years of ReactOS

https://reactos.org/blogs/30yrs-of-ros/
151•Mark_Jansen•9h ago•77 comments

Joe Armstrong and Jeremy Ruston – Intertwingling the Tiddlywiki with Erlang [video]

https://www.youtube.com/watch?v=Uv1UfLPK7_Q
21•kerim-ca•2d ago•1 comment

Show HN: Sweep, Open-weights 1.5B model for next-edit autocomplete

https://huggingface.co/sweepai/sweep-next-edit-1.5B
467•williamzeng0•18h ago•91 comments

Doctors in Brazil using tilapia fish skin to treat burn victims

https://www.pbs.org/newshour/health/brazilian-city-uses-tilapia-fish-skin-treat-burn-victims
219•kaycebasques•12h ago•71 comments

Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant

https://www.media.mit.edu/publications/your-brain-on-chatgpt/
471•misswaterfairy•18h ago•343 comments

We will ban you and ridicule you in public if you waste our time on crap reports

https://curl.se/.well-known/security.txt
747•latexr•6h ago•454 comments

Show HN: Interactive physics simulations I built while teaching my daughter

https://www.projectlumen.app/
40•anticlickwise•3d ago•4 comments

In Praise of APL (1977)

https://www.jsoftware.com/papers/perlis77.htm
74•tosh•8h ago•41 comments

Douglas Adams on the English–American cultural divide over "heroes"

https://shreevatsa.net/post/douglas-adams-cultural-divide/
244•speckx•3h ago•244 comments

Pragmatic Bitmap Filters in Microsoft SQL Server

https://www.vldb.org/cidrdb/2026/i-cant-believe-its-not-yannakakis-pragmatic-bitmap-filters-in-mi...
4•tanelpoder•5d ago•0 comments

eBay explicitly bans AI "buy for me" agents in user agreement update

https://www.valueaddedresource.net/ebay-bans-ai-agents-updates-arbitration-user-agreement-feb-2026/
258•bdcravens•20h ago•275 comments

Threat actors expand abuse of Microsoft Visual Studio Code

https://www.jamf.com/blog/threat-actors-expand-abuse-of-visual-studio-code/
243•vinnyglennon•17h ago•247 comments

Meet the Alaska Student Arrested for Eating an AI Art Exhibit

https://www.thenation.com/article/society/alaska-student-arrested-eating-ai-art-exhibit/
74•petethomas•3h ago•33 comments

Claude's new constitution

https://www.anthropic.com/news/claude-new-constitution
534•meetpateltech•1d ago•622 comments

Waiting for dawn in search: Search index, Google rulings and impact on Kagi

https://blog.kagi.com/waiting-dawn-search
413•josephwegner•23h ago•228 comments

Downtown Denver's office vacancy rate grows to 38.2%

https://coloradosun.com/2026/01/22/denver-downtown-office-vacancy-rate-tenants-workplace/
7•mooreds•10m ago•2 comments

The Science of Life and Death in Mary Shelley's Frankenstein

https://publicdomainreview.org/essay/the-science-of-life-and-death-in-mary-shelleys-frankenstein/
13•Anon84•5d ago•1 comment

Gathering Linux Syscall Numbers in a C Table

https://t-cadet.github.io/programming-wisdom/#2026-01-17-gathering-linux-syscall-numbers
82•phi-system•5d ago•34 comments

GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers

https://gptzero.me/news/neurips/
267•segmenta•2h ago

Comments

cogman10•1h ago
Yuck, this is going to really harm scientific research.

There is already a problem with papers falsifying data/samples/etc.; LLMs being able to put out plausible papers is just going to make it worse.

On the bright side, maybe this will get the scientific community and science journalists to finally take reproducibility more seriously. I'd love to see future reporting where, instead of "Research finds amazing chemical x which does y", you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

godzillabrennus•1h ago
Have they solved the issue where papers that cite already-invalidated research are themselves still being cited?
cogman10•1h ago
AFAIK, no, but I could see there being cause to push citations to also cite the validations. It'd be good if standard practice turned into something like

Paper A, by bob, bill, brad. Validated by Paper B by carol, clare, charlotte.

or

Paper A, by bob, bill, brad. Unvalidated.

gcr•1h ago
Academics typically use citation count and popularity as a rough proxy for validation. It's certainly not perfect, but it is something that people think about. Semantic Scholar in particular is doing great work in this area, making it easy to see who cites who: https://www.semanticscholar.org/

Google Scholar's PDF reader extension turns every hyperlinked citation into a popout card that shows citation counts inline in the PDF: https://chromewebstore.google.com/detail/google-scholar-pdf-...
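
If you want to poke at this programmatically: a minimal sketch using what I understand to be the public Semantic Scholar Graph API (the endpoint path and field names here are my assumptions, so check their docs):

    import requests

    # Fetch papers that cite a given paper. IDs can be a DOI ("DOI:..."),
    # an arXiv ID ("ARXIV:..."), or a Semantic Scholar paper ID.
    url = "https://api.semanticscholar.org/graph/v1/paper/ARXIV:1706.03762/citations"
    resp = requests.get(url, params={"fields": "title,year", "limit": 5}, timeout=10)
    resp.raise_for_status()
    for item in resp.json().get("data", []):
        citing = item["citingPaper"]
        print(citing.get("year"), citing.get("title"))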

reliabilityguy•1h ago
Nope.

I am still reviewing papers that propose solutions based on a technique X, conveniently ignoring research from two years ago that shows that X cannot be used on its own. Both the paper I reviewed and the research showing X cannot be used are in the same venue!

b00ty4breakfast•1h ago
does it seem to be legitimate ignorance or maybe folks pushing ahead regardless of x being disproved?
freedomben•1h ago
IMHO, it's mostly ignorance coming from the push/drive to "publish or perish." When the stakes are so high and output is so valued, and when reproducibility isn't required, thorough work is disincentivized. The system is set up in a way that makes it fail.

There is also the reality that "one paper" or "one study" can be found that contradicts almost anything, so if you just went with "some other paper/study debunks my premise" then you'd end up producing nothing. Plus many insiders know that there's a lot of slop out there that gets published, so they can (sometimes reasonably, IMHO) dismiss that "one paper" even when they do know about it.

It's (mostly) not fraud or malicious intent or ignorance, it's (mostly) humans existing in the system in which they must live.

reliabilityguy•18m ago
Poor scholarship.

However, given the feedback by other reviewers, I was the only one who knew that X doesn’t work. I am not sure how these people mark themselves as “experts” in the field if they are not following the literature themselves.

f311a•1h ago
For ML/AI/Comp sci articles, providing reproducible code is a great option. Basically, PoC or GTFO.
StableAlkyne•42m ago
The most annoying ones are those which loosely discuss the methodology but then fail to publish the weights or any real algorithms.

It's like buying a piece of furniture from IKEA, except you just get an Allen key, a hint at what parts to buy, and blurry instructions.

j45•1h ago
It will better expose the behaviour of false scientists.
StableAlkyne•1h ago
> I'd love to see future reporting where, instead of "Research finds amazing chemical x which does y", you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

Most people (that I talk to, at least) in science agree that there's a reproducibility crisis. The challenge is there really isn't a good way to incentivize that work.

Fundamentally (unless you're independently wealthy and funding your own work), you have to measure productivity somehow, whether you're at a university, government lab, or the private sector. That turns out to be very hard to do.

If you measure raw number of papers (more common in developing countries and low-tier universities), you incentivize a flood of junk. Some of it is good, but there is such a tidal wave of shit that most people will, as a heuristic, write off your work based on the other people in your cohort.

So, instead it's more common to try to incorporate how "good" a paper is, to reward people with a high quantity of "good" papers. That's quantifying something subjective though, so you might try to use something like citation count as a proxy: if a work is impactful, usually it gets cited a lot. Eventually you may arrive at something like the H-index, defined as the largest number H such that you have written H papers with at least H citations each. Now, the trouble with this method is people won't want to "waste" their time on incremental work.
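
To make that definition concrete, a minimal sketch of the H-index computation:

    def h_index(citations):
        # Largest h such that the h-th most-cited paper has at least h citations.
        ranked = sorted(citations, reverse=True)
        return max((i + 1 for i, c in enumerate(ranked) if c >= i + 1), default=0)

    print(h_index([10, 8, 5, 4, 3]))  # five papers with these counts -> H-index of 4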

And that's the struggle here; even if we funded and rewarded people for reproducing results, they will always be bumping up the citation count of the original discoverer. But it's worse than that, because literally nobody is going to cite your work. In 10 years, they just see the original paper, a few citing works reproducing it, and to save time they'll just cite the original paper only.

There's clearly a problem with how we incentivize scientific work. And clearly we want to be in a world where people test reproducibility. However, it's very very hard to get there when one's prestige and livelihood is directly tied to discovery rather than reproducibility.

warkdarrior•1h ago
> If you measure raw number of papers (more common in developing countries and low-tier universities), you incentivize a flood of junk.

This is exactly what rewarding replication papers (that reproduce and confirm an existing paper) will lead to.

pixl97•58m ago
And yet if we can't reproduce an existing paper, it's very possible that existing paper is junk itself.

Catch-22 is a fun game to get caught in.

maerF0x0•1h ago
> The challenge is there really isn't a good way to incentivize that work.

What if we got Undergrads (with hope of graduate studies) to do it? Could be a great way to train them on the skills required for research without the pressure of it also being novel?

StableAlkyne•54m ago
Those undergrads still need to be advised and they use lab resources.

If you're a tenure-track academic, your livelihood is much safer if you have them try new ideas (that you will be the corresponding author on, increasing your prestige and ability to procure funding) instead of doing incremental work.

And if you already have tenure, maybe you have the undergrad do just that. But the tenure process heavily filters for ambitious researchers, so it's unlikely this would be a priority.

If instead you did it as coursework, you could get them to maybe reproduce the work, but if you only have the students for a semester, that's not enough time to write up the paper and make it through peer review (which can take months between iterations).

suddenlybananas•39m ago
Unfortunately, that might just lead to a bunch of type II errors instead, if an effect requires very precise experimental conditions that undergrads lack the expertise for.
jimbokun•53m ago
> The challenge is there really isn't a good way to incentivize that work.

Ban publication of any research that hasn't been reproduced.

wpollock•18m ago
> Ban publication of any research that hasn't been reproduced.

Unless it is published, nobody will know about it and thus nobody will try to reproduce it.

gcr•16m ago
lol, how would the first paper carrying some new discovery get published?
poulpy123•47m ago
> I'd love to see future reporting where, instead of "Research finds amazing chemical x which does y", you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

But nobody wants to pay for it.

geokon•36m ago
usually you reproduce previous research as a byproduct of doing something novel "on top" of the previous result. I don't really see the problem with the current setup.

sometimes you can just do something new and assume the previous result, but that's more the exception. you're almost always going to at least in part reproduce the previous one. and if issues come up, it's often evident.

that's why citations work as a good proxy: X number of people have done work based around this finding and nobody has seen a clear problem.

gcr•9m ago
It's often quite common to see a citation say "BTW, we weren't able to reproduce X's numbers, but we got a fairly close number Y, so Table 1 includes that one next to an asterisk."

The difficult part is surfacing that information to readers of the original paper. The semantic scholar people are beginning to do some work in this area.

gcr•10m ago
I'd personally like to see top conferences grow a "reproducibility" track. Each submission would be a short tech report that chooses some other paper to re-implement. Cap 'em at three pages, have a lightweight review process. Maybe there could be artifacts (git repositories, etc) that accompany each submission.

This would especially help newer grad students learn how to begin to do this sort of research.

Maybe doing enough reproductions could unlock incentives. Like if you do 5 reproductions then the AC would assign your next paper double the reviewers. Or, more invasively, maybe you can't submit to the conference until you complete some reproduction.

agumonkey•37m ago
I think, at least I hope, that part of the value of LLMs will be to engineer their own retirement for specific needs. Instead of asking one to solve any problem, use it to build a restricted tool that can then help you reach your goal faster, without the statistical nature of LLMs.
mike_hearn•31m ago
Reproducibility is overrated and if you could wave a wand to make all papers reproducible tomorrow, it wouldn't fix the problem. It might even make it worse.

https://blog.plan99.net/replication-studies-cant-fix-science...

biophysboy•8m ago
? More samples reduce the variance of a statistic. Obviously that cannot identify systematic bias in a model, or establish causality, or make a "bad" question "good". It's not overrated though -- it would strengthen or weaken the case for many papers.
vld_chk•14m ago
In my mental model, the fundamental problem of reproducibility is that scientists have a very hard time finding a penny to fund such research. No one wants to grant “hey, I need $1M and 2 years to validate that paper from last year which looks suspicious”.

Until we can change how we fund science at the fundamental level, i.e. how we assign grants, it will indeed be a very hard problem to deal with.

benob•13m ago
Maybe it will also change the whole practice of using publication as the evaluation of science.
qwertox•1h ago
It would be great if those scientists who use AI without disclosing it get fucked for life.
direwolf20•1h ago
"scientists" FYI. Making shit up isn't science.
yesitcan•1h ago
One fuck seems appropriate.
oofbey•1h ago
Harsh sentiment. Pretty soon every knowledge worker will use AI every day. Should people disclose spellcheckers powered by AI? Disclosing is not useful. Being careful in how you use it and checking work is what matters.
ambicapter•1h ago
> Should people disclose spellcheckers powered by AI?

Thank you for that perfect example of a strawman argument! No, spellcheckers that use AI are not the main concern behind disclosing the use of AI in generating scientific papers, government reports, or any large block of nonfiction text that you paid for and that is supposed to make sense.

fisf•1h ago
People are accountable for the results they produce using AI. So a scientist is responsible for made up sources in their paper, which is plain fraud.
oofbey•1h ago
I completely agree. But “disclosing the use of AI” doesn’t solve that one bit.
barbazoo•1h ago
I don’t disclose what keyboard I use to write my code or if I applied spellcheck afterward. The result is 100% theirs.
eichin•35m ago
"responsible for made up sources" leads to the hilarious idea that if you cite a paper that doesn't exist, you're now obliged to write that paper (getting it retroactively published might be a challenge though)
Proziam•1h ago
False equivalence. This isn't about "using AI" it's about having an AI pretend to do your job.

What people are pissed about is the fact their tax dollars fund fake research. It's just fraud, pure and simple. And fraud should be punished brutally, especially in these cases, because the long tail of negative effects produces enormous damage.

freedomben•54m ago
I was originally thinking you were being way too harsh with your "punish criminally" take, but I must admit, you're winning me over. I think we would need to be careful to ensure we never (or realistically, very rarely) convict an innocent person, but this is in many cases outright theft/fraud when someone is making money or being "compensated" for producing work that is fraudulent.

For people who think this is too harsh, just remember we aren't talking about undergrads who cheat on a course paper here. We're talking about people who were given money (often from taxpayers) and committed fraud. This is textbook white collar crime, not some kid being lazy. At a minimum we should be taking all that money back from them and barring them from ever receiving grant money again. In some cases I think fines exceeding the money they received would be appropriate.

geremiiah•1h ago
What they are doing is plainly cheating the system to get their 3 conference papers so they can get their $150k+ job at FAANG. It's plain cheating with no value.
barbazoo•1h ago
People that cheat with AI now probably found ways to cheat before as well.
shermantanktop•1h ago
Cheating by people in high status positions should get the hammer. But it gets the hand-wringing what-have-we-come-to treatment instead.
WarmWash•1h ago
We are only looking at one side of the equation here, in this whole thread.

This feels a bit like the "LED stoplights shouldn't be used because they don't melt snow" argument.

vimda•1h ago
"Pretty soon every knowledge worker will use AI every day" is a wild statement considering the reporting that most companies deploying AI solutions are seeing little to no benefit, but also, there's a pretty obvious gap between spell checkers and tools that generate large parts of the document for you
PunchyHamster•1h ago
nice job moving the goalpost from "hallucinated the research/data" to "spellchecker error"
duskdozer•1h ago
>Pretty soon every knowledge worker will use AI every day.

Maybe? There's certainly a push to force the perception of inevitability.

Sharlin•28m ago
In general we're pretty good at drawing a line between purely editorial help, like using a spellchecker or even the services of a professional editor (no need to acknowledge), and independent intellectual contribution (must be acknowledged). There's no slippery slope.
bwfan123•1h ago
> It would be great if those scientists who use AI without disclosing it get fucked for life.

There need to be dis-incentives for sloppy work. There is a tension between quality and quantity in almost every product. Unfortunately academia has become a numbers-game with paper-mills.

pandemic_region•48m ago
Instead of publishing their papers in the prestigious zines - which is what they're after - we will publish them in "AI Slop Weekly" with name and picture. Up the submission risk a bit.
jordanpg•1h ago
If these are so easy to identify, why not just incorporate some kind of screening into the early stages of peer review?
DetectDefect•1h ago
Because real work takes time and effort, and there is no real incentive for it here.
tossandthrow•1h ago
What makes you believe they are easy to identify?
emil-lp•1h ago
One could require DOIs for each reference. That's both realistic to achieve and easy to verify.

Although then why not just cite existing papers for bogus reasons?
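
Existence is the easy part to automate, though. A minimal sketch of such a check, assuming the doi.org resolver behaves as documented (registered DOIs redirect to the publisher, unregistered ones return 404):

    import requests

    def doi_exists(doi: str) -> bool:
        # doi.org redirects registered DOIs to the publisher's landing page;
        # unregistered DOIs come back as 404.
        resp = requests.head(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
        return resp.status_code in (301, 302, 303)

    print(doi_exists("10.1038/nature14539"))     # real DOI -> True
    print(doi_exists("10.9999/not.a.real.doi"))  # fabricated -> False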

direwolf20•1h ago
Wow! They're literally submitting references to papers by Firstname Lastname, John Doe and Jane Smith and nobody is noticing or punishing them.
emil-lp•1h ago
They might (I hope) still be punished after discovery.
an0malous•1h ago
It’s the way of the future
heliumtera•1h ago
Maybe "muh science" was always a fucking joke and the only difference being now we can point to an undeniable proof it is a fucking joke?
azan_•1h ago
Yes, it only led to all advancements in the history of humanity, what a joke!
heliumtera•4m ago
I am sure all advancements in the history of humanity were properly peer reviewed!

Including Coca-Cola and Linux!

Sharlin•48m ago
Aaand "the insane take of the day" award goes to…
sigbottle•17m ago
I'm a Feyerabend sympathizer, but even he wouldn't have gone this far.

He was against establishment dogma, not in favor of anti-intellectualism.

CGMthrowaway•1h ago
Which is worse:

a) p-hacking and suppressing null results

b) hallucinations

c) falsifying data

Would be cool to see an analysis of this

Proziam•1h ago
All 3 of these should be categorized as fraud, and punished criminally.
internetter•1h ago
criminally feels excessive?
Proziam•1h ago
If I steal hundreds of thousands of dollars (salary, plus research grants and other funds) and produce fake output, what do you think is appropriate?

To me, it's no different than stealing a car or tricking an old lady into handing over her Fidelity account. You are stealing, and society says stealing is a criminal act.

WarmWash•1h ago
We have a civil court system to handle stuff like this already.
Proziam•1h ago
Stealing more than a few thousand dollars is a felony, and felonies are handled in criminal court, not civil.

EDIT - The threshold amount varies. Sometimes it's as low as a few hundred dollars. However, the point stands on its own, because there's no universe where the sum in question is in misdemeanor territory.

WarmWash•55m ago
It would fall under the domain of contract law, because maybe the contract of the grant doesn't prohibit what the researcher did. The way to determine that would be in court - civil court.

Most institutions aren't very chill with grant money being misused, so we already don't need to burden the state with getting Johnny municipal prosecutor to try and figure out if gamma crystallization imaging sources were incorrect.

wat10000•1h ago
We also have a criminal court system to handle stuff like this.
WarmWash•54m ago
No we don't. I've never seen a private contract dispute go to criminal court, probably because it's a civil matter.

If they actually committed theft, well then that already is illegal too.

But right now, doing "shitty research" isn't illegal and it's unlikely it ever will be.

wat10000•8m ago
The claim is that this would qualify as fraud, which is also illegal.

If you do a search for "contractor imprisoned for fraud" you'll find plenty of cases where a private contract dispute resulted in criminal convictions for people who took money and then didn't do the work.

I don't know if taking money and then merely pretending to do the research would rise to the level of criminal fraud, but it doesn't seem completely outlandish.

jacquesm•1h ago
You could make a good case for a white collar crime here, fraud for instance.
fulafel•1h ago
Is there a comparison to the rate of reference errors in other venues?
dtartarotti•1h ago
It is very concerning that these hallucinations passed through peer review. It's not like peer review is a fool-proof method or anything, but the fact that reviewers did not check the references and notice the clearly bogus ones is alarming, and could be a sign that the article authors weren't the only ones using LLMs in the process...
amanaplanacanal•1h ago
Is it common for peer reviewers to check references? Somehow I thought they mostly focused on whether the experiment looked reasonable and the conclusions followed.
emil-lp•1h ago
In journal publications it is, but without DOIs it's difficult.

In conference publications, it's less common.

Conference publications (like NeurIPS) are treated as announcements of results, not verified results.

empiko•1h ago
Nobody in ML or AI is verifying all your references. Reviewers will point out if you miss a super related work, but that's it. This is especially true with the recent (last two decades?) inflation in citation counts. You regularly have papers with 50+ references for all kinds of claims and random semi-related work. The citation culture is really uninspiring.
smallpipe•1h ago
Could you run a similar analysis for pre-2020 papers? It'd be interesting to know how prevalent making up sources was before LLMs.
tasuki•1h ago
Also, it'd be interesting to see how many pre-2020 papers their "AI detector" marks as AI-generated. I distrust LLMs somewhat, but I distrust AI detectors even more.
theptip•1h ago
Yeah, it’s kind of meaningless to attribute this to AI without measuring the base rate.

It’s for sure plausible that it’s increasing, but I’m certain this kind of thing happened with humans too.

bonsai_spool•1h ago
This suggests that nobody was screening these papers in the first place—so is it actually significant that people are using LLMs in a setting without meaningful oversight?

These clearly aren't being peer-reviewed, so there's no natural check on LLM usage (which is different than what we see in work published in journals).

emil-lp•1h ago
Speaking as someone who reviews 20+ papers per year: we don't have time to verify each reference.

We verify: is the stuff correct, and is it worthy of publication (in the given venue) given that it is correct.

There is still some trust in the authors not to submit made-up stuff, though it is diminishing.

paulmist•1h ago
I'm surprised the conference doesn't provide tooling to validate all references automatically.
Sharlin•39m ago
How would you do that? Even in cases where there's a standard format, a DOI on every reference, and some giant online library of publication metadata, including everything that only exists in dead tree format, that just lets you check whether the cited work exists, not whether it's actually a relevant thing to cite in the context.
gcr•1h ago
Academic venues don't have enough reviewers. This problem isn't new, and as publication volumes increase, it's getting sharply worse.

Consider the unit economics. Suppose NeurIPS gets 20,000 papers in one year. Suppose each author should expect three good reviews, so area chairs assign five reviewers per paper. In total, 100,000 reviews need to be written. It's a lot of work, even before factoring emergency reviewers in.

NeurIPS is one venue alongside CVPR, [IE]CCV, COLM, ICML, EMNLP, and so on. Not all of these conferences are as large as NeurIPS, but the field is smaller than you'd expect. I'd guess there are 300k-1m people in the world who are qualified to review AI papers.

khuey•1h ago
Seems like using tooling like this to identify papers with fake citations and auto-rejecting them before they ever get in front of a reviewer would kill two birds with one stone.
gcr•1h ago
It's not always possible to distinguish between fake citations and citations that are simply hard to find (e.g. wonderful old books that aren't on the Internet).

Another problem is that conferences move slowly and it's hard to adjust the publication workflow in such an invasive way. CVPR only recently moved from Microsoft's CMT to OpenReview to accept author submissions, for example.

There's a lot of opportunity for innovation in this space, but it's hard when everyone involved would need to agree to switch to a different workflow.

(Not shooting you down. It's just complicated because the people who would benefit are far away from the people who would need to do the work to support it...)

khuey•6m ago
Sure, I agree that it's far from trivial to implement.
alain94040•1h ago
When I was reviewing such papers, I didn't bother checking that 30+ citations were correctly indexed. I focused on the article itself, and maybe 1 or 2 citations that are important. That's it. For most citations, they are next to an argument that I know is correct, so why would I bother checking. What else do you expect? My job was to figure out if the article ideas are novel and interesting, not if they got all their citations right.
geremiiah•1h ago
A lot of research in AI/ML seems to me to be "fake it and never make it". Literally it's all about optics, posturing, connections, publicity. Lots of bullshit and little substance. This was true before AI slop, too. But the fact that AI slop can make it pass the review really showcases how much a paper's acceptance hinges on things, other than the substance and results of the paper.

I even know PIs who got fame and funding based on some research direction that supposedly is going to be revolutionary. Except all they had were preliminary results that from one angle, if you squint, you can envision some good result. But then the result never comes. That's why I say, "fake it, and never make it".

gcr•1h ago
I was getting completely AI-generated reviews for a WACV publication back in 2024. The area chairs are so overworked that authors don't have much recourse, which sucks but is also really hard to handle unless more volunteers step up to the plate to help organize the conference.

(If you're qualified to review papers, please email the program chair of your favorite conference and let them know -- they really need the help!)

As for my review, the review form has a textbox for a summary, a textbox for strengths, a textbox for weaknesses, and a textbox for overall thoughts. The review I received included one complete set of summary/strengths/weaknesses/closing thoughts in the summary text box, another distinct set of summary/strengths/weaknesses/closing thoughts in the strengths, another complete and distinct review in the weaknesses, and a fourth complete review in the closing thoughts. Each of these four reviews were slightly different and contradicted each other.

The reviewer put my paper down as a weak reject, but also said "the pros greatly outweigh the cons."

They listed "innovative use of synthetic data" as a strength, and "reliance on synthetic data" as a weakness.

Tom1380•1h ago
No ETH Zurich, let's go
gcr•1h ago
NeurIPS leadership doesn’t think hallucinated references are necessarily disqualifying; see the full article from Fortune for a statement from them: https://archive.ph/yizHN

> When reached for comment, the NeurIPS board shared the following statement: “The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference). As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities.”

Analemma_•1h ago
Kinda gives the whole game away, doesn’t it? “It doesn’t actually matter if the citations are hallucinated.”

In fairness, NeurIPS is just saying out loud what everyone already knows. Most citations in published science are useless junk: it’s either mutual back-scratching to juice h-index, or it’s the embedded and pointless practice of overcitation, like “Human beings need clean water to survive (Franz, 2002)”.

Really, hallucinated citations are just forcing a reckoning which has been overdue for a while now.

jacquesm•1h ago
There should be a way to drop any kind of circular citation ring from the indexes.
gcr•1h ago
It's tough because some great citations are hard to find/procure still. I sometimes refer to papers that aren't on the Internet (eg. old wonderful books / journals).
jacquesm•59m ago
But that actually strengthens those citations. The I-scratch-your-back-you-scratch-mine ones are the ones I'm getting at, and that is quite hard to do with old and wonderful stuff; the authors there are probably not in a position to reciprocate, by virtue of observing the grass from the other side.
gcr•45m ago
I think it's a hard problem. The semanticscholar folks are doing the sort of work that would allow them to track this; I wonder if they've thought about it.

A somewhat-related parable: I once worked in a larger lab with several subteams submitting to the same conference. Sometimes the work we did was related, so we both cited each other's paper which was also under review at the same venue. (These were flavor citations in the "related work" section for completeness, not material to our arguments.) In the review copy, the reference lists the other paper as written by "anonymous (also under review at XXXX2025)," also emphasized by a footnote to explain the situation to reviewers. When it came time to submit the camera-ready copy, we either removed the anonymization or replaced it with an arxiv link if the other team's paper got rejected. :-) I doubt this practice improved either paper's chances of getting accepted.

Are these the sorts of citation rings you're talking about? If authors misrepresented the work as if it were accepted, or pretended it was published last year or something, I'd agree with you, but it's not too uncommon in my area for well-connected authors to cite manuscripts in process. I don't think it's a problem as long as they don't lean on them.

jacquesm•40m ago
No, I'm talking about the ones where the citation itself is almost or even completely irrelevant and used as a way to inflate the citation count of the authors. You could find those by checking whether or not the value as a reference (ie: contributes to the understanding of the paper you are reading) is exceeded by the value of the linkage itself.
fc417fc802•46m ago
> Most citations in published science are useless junk:

Can't say that matches my experience at all. Once I've found a useful paper on a topic thereafter I primarily navigate the literature by traveling up and down the citation graph. It's extremely effective in practice and it's continued to get easier to do as the digitization of metadata has improved over the years.

empath75•1h ago
I think a _single_ instance of an LLM hallucination should be enough to retract the whole paper and ban further submissions.
gcr•1h ago
Going through a retraction and blacklisting process is also a lot of work -- collecting evidence, giving authors a chance to respond and mediate discussion, etc.

Labor is the bottleneck. There aren't enough academics who volunteer to help organize conferences.

(If a reader of this comment is qualified to review papers and wants to step up to the plate and help do some work in this area, please email the program chairs of your favorite conference and let them know. They'll eagerly put you to work.)

pessimizer•1h ago
That's exactly why the inclusion of a hallucinated reference is actually a blessing. Instead of going back and forth with the fraudster, just tell them to find the paper. If they can't, case closed. Massive amount of time and money saved.
gcr•1h ago
Isn't telling them to find the paper just "going back and forth with a fraudster"?

One "simple" way of doing this would be to automate it. Have authors step through a lint step when their camera-ready paper is uploaded. Authors would be asked to confirm each reference and link it to a google scholar citation. Maybe the easy references could be auto-populated. Non-public references could be resolved by uploading a signed statement or something.

There's no current way of using this metadata, but it could be nice for future systems.

Even the Scholar team within Google is woefully understaffed.

My gut tells me that it's probably more efficient to just drag authors who do this into some public execution or twitter mob after-the-fact. CVPR does this every so often for authors who submit the same paper to multiple venues. You don't need a lot of samples for deterrence to take effect. That's kind of what this article is doing, in a sense.

wing-_-nuts•1h ago
I dunno about banning them; humans without LLMs make mistakes all the time. But I would definitely place them under much harder scrutiny in the future.
pessimizer•1h ago
Hallucinations aren't mistakes, they're fabrications. The two are probably referred to by the same word in some languages.

Institutions can choose an arbitrary approach to mistakes; maybe they don't mind a lot of them because they want to take risks and be on the bleeding edge. But any flexible attitude towards fabrications is simply corruption. The connected in-crowd will get mercy and the outgroup will get the hammer. Anybody criticizing the differential treatment will be accused of supporting the outgroup fraudsters.

gcr•57m ago
Fabrications carry intent to deceive. I don't think hallucinations necessarily do. If anything, they're a matter of negligence, not deception.

Think of it this way: if I wanted to commit pure academic fraud maliciously, I wouldn't make up a fake reference. Instead, I'd find an existing related paper and merely misrepresent it to support my own claims. That way, the deception is much harder to discover and I'd have plausible deniability -- "oh I just misunderstood what they were saying."

I think most academic fraud happens in the figures, not the citations. Researchers are more likely to be successful at making up data points than making up references, because it's impossible to know without the data files.

andy99•1h ago

   For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex
This is equivalent to a typo. I’d like to know which “hallucinations” are completely made up, and which have a corresponding paper but contain some error in how it’s cited. The latter I don’t think matters.
burkaman•42m ago
If you click on the article you can see a full list of the hallucinations they found. They did put in the effort to look for plausible partial matches, but most of them are some variation of "No author or title match. Doesn't exist in publication."

Here's a random one I picked as an example.

Paper: https://openreview.net/pdf?id=IiEtQPGVyV

Reference: Asma Issa, George Mohler, and John Johnson. Paraphrase identification using deep contextualized representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 517–526, 2018.

Asma Issa and John Johnson don't appear to exist. George Mohler does, but it doesn't look like he works in this area (https://www.georgemohler.com/). No paper with that title exists. There are some with sort of similar titles (https://arxiv.org/html/2212.06933v2 for example), but none that really make sense as a citation in this context. EMNLP 2018 exists (https://aclanthology.org/D18-1.pdf), but that page range is not a single paper. There are papers in there that contain the phrases "paraphrase identification" and "deep contextualized representations", so you can see how an LLM might have come up with this title.

jklinger410•1h ago
> the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference)

Maybe I'm overreacting, but this feels like an insanely biased response. They found the one potentially innocuous reason and latched onto that as a way to hand-wave the entire problem away.

Science already had a reproducibility problem, and it now has a hallucination problem. Considering the massive influence the private sector has on both the work and the institutions themselves, the future of open science is looking bleak.

paulmist•1h ago
Isn't disqualifying X months of potentially great research due to a malformed but existing reference harsh? I don't think they'd be okay with references that are actually made up.
suddenlybananas•37m ago
It's a sign of dishonesty; not a perfect one, but an indicator.
orbital-decay•58m ago
The wording is not hand-wavy. They said "not necessarily invalidated", which could mean that innocuous reason and nothing extra.
derf_•1h ago
This will continue to happen as long as it is effectively unpunished. Even retracting the paper would do little good, as odds are it would not have been written if the author could not have used an LLM, so they are no worse off for having tried. Scientific publications are mostly a numbers game at this point. It is just one more example of a situation where behaving badly is much cheaper than policing bad behavior, and until incentives are changed to account for that, it will only get worse.
Aurornis•1h ago
> Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated.

This statement isn’t wrong, as the rest of the paper could still be correct.

However, when I see a blatant falsification somewhere in a paper I’m immediately suspicious of everything else. Authors who take lazy shortcuts when convenient usually don’t just do it once, they do it wherever they think they can get away with it. It’s a slippery slope from letting an LLM handle citations to letting the LLM write things for you to letting the LLM interpret the data. The latter opens the door to hallucinated results and statistics, as anyone who has experimented with LLMs for data analysis will discover eventually.

mlmonkey•31m ago
Why not run every submitted paper through GPTZero (before sending to reviewers) and summarily reject any paper with a hallucination?
gcr•19m ago
That's how GPTZero wants to situate themselves.

Who would pay them? Conference organizers are already unpaid and understaffed, and most conferences aren't profitable.

I think rejections shouldn't be automatic. Sometimes there are just typos. Sometimes authors don't understand BibTeX. This needs to be done in a way that reduces the workload for reviewers.

One way of doing this would be for GPTZero to annotate each paper during the review step. If reviewers could review a version of each paper with yellow-highlighted "likely-hallucinated" references in the bibliography, then they'd bring it up in their review and they'd know to be on their guard for other probable LLM-isms. If there's only a couple of likely typos in the references, then reviewers could understand that, and if they care about it, they'd bring it up in their reviews and the author would have the usual opportunity to rebut.

I don't know if GPTZero is willing to provide this service "for free" to the academic community, but if they are, it's probably worth bringing up at the next PAMI-TC meeting for CVPR.

Molitor5901•1h ago
AI might just extinguish the entire paradigm of publish or perish. The sheer volume of papers makes it nearly impossible to properly decide which papers have merit, which are non-replicable and suspect, and which are just a desperate rush to publish. The entire practice needs to end.
shermantanktop•1h ago
But how could we possibly evaluate faculty and researcher quality without counting widgets on an assembly line? /s

It’s a problem. The previous regime prior to publishing-mania was essentially a clubby game of reputation amongst peers based on cocktail party socialization.

The publication metrics came out of the harder sciences, I believe, and then spread to the softest of humanities. It was always easy to game a bit if you wanted to try, but now it’s trivial to defeat.

SJC_Hacker•18m ago
It's not publish or perish so much as get grant money or perish.

Publishing is just the way to get grants.

A PI explained it to me once, something like this

Idea(s) -> Grant -> Experiments -> Data -> Paper(s) -> Publication(s) -> Idea(s) -> Grant(s)

That's the current cycle ... remove any step and it's a dead end

TAULIC15•1h ago
OHHH IS GOOD
armcat•1h ago
This is awful but hardly surprising. Someone mentioned reproducible code with the papers - but there is a high likelihood of the code being partially or fully AI generated as well. I.e. AI generated hypothesis -> AI produces code to implement and execute the hypothesis -> AI generates paper based on the hypothesis and the code.

Also: there were 15,000 submissions that were rejected at NeurIPS; it would be very interesting to see what % of those were partially or fully AI-generated/hallucinated. Are the ratios comparable?

blackbear_•1h ago
Whether the code is AI-generated or not is not important; what matters is that it really works.

Sharing code enables others to validate the method on a different dataset.

Even before LLMs came around there were lots of methods that looked good on paper but turned out not to work outside of accepted benchmarks.

depressionalt•1h ago
This is nice and all, but what repercussions does GPTZero face when their bullshit AI detection hallucinates a student using AI? And when that student receives academic discipline because of it?

Many such cases of this. More than 100!

They claim to have custom detection for GPT-5, Gemini, and Claude. They're making that up!

freedomben•1h ago
Indeed. My son has been accused by bullshit AI detection of having used AI, and it has devastated his work quality. After being "disciplined" for using AI (when he didn't), he now intentionally tries to "dumb down" his writing so that it doesn't sound so much like AI. The result is he writes much worse. What a shitty, shitty outcome. I've even found myself leaving typos and things in (even on sites like HN) because if you write too well, inevitably some comment replier will call you out as being an LLM even when you aren't. I'm as annoyed by the LLM posts as everybody else, but the answer surely is not to dumb us down into Idiocracy.
Sharlin•36m ago
It's almost as if this whole LLM stuff wasn't a net benefit to the society after all.
theptip•1h ago
This is mostly an ad for their product. But I bet you can get pretty good results with a Claude Code agent using a couple simple skills.

Should be extremely easy for AI to successfully detect hallucinated references as they are semi-structured data with an easily verifiable ground truth.
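
As a rough illustration, a sketch of that kind of check against the Crossref REST API (the query parameter and response shape here are my assumptions about the current API):

    import requests
    from difflib import SequenceMatcher

    def closest_crossref_title(cited_title: str):
        # Ask Crossref for its best bibliographic match, then score how well
        # the returned title matches the title as cited in the paper.
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": cited_title, "rows": 1},
            timeout=10,
        )
        items = resp.json()["message"]["items"]
        if not items:
            return None, 0.0
        found = (items[0].get("title") or [""])[0]
        score = SequenceMatcher(None, cited_title.lower(), found.lower()).ratio()
        return found, score

    title, score = closest_crossref_title(
        "Paraphrase identification using deep contextualized representations"
    )
    print(f"{score:.2f} {title}")  # a low score hints the reference may not exist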

leggerss•1h ago
I don't understand: why aren't there automated tools to verify citations' existence? The data for a citation has a structured styling (APA, MLA, Chicago) and paper metadata is available via e.g. a web search, even if the paper contents are not

I guess GPTZero has such a tool. I'm confused why it isn't used more widely by paper authors and reviewers

gh02t•1h ago
Citations are too open-ended and prone to variation, and legitimate minor mistakes that wouldn't bother a human verifier would break automated tools, making citations hard to verify automatically in their current form. DOI was supposed to solve some of the literal mechanical variation in checking the existence of a source, but journal paywalls and limited adoption mean that is not a universal solution. Plus a DOI still doesn't easily verify the factual accuracy of a citation, like "does the source say what the citation says it does," which is the most important part.

In my experience you will see considerable variation in citation formats, even in journals that strictly define them and require using BibTeX. And lots of journals leave their citation format rules very vague. It's a problem that runs deep.

eichin•41m ago
Looks like GPTZero Source Finder was only released a year ago - if anything, I'm surprised slop-writers aren't using it preemptively, since they're "ahead of the curve" relative to reviewers on this sort of thing...
yepyeaisntityea•1h ago
No surprises. Machine learning has, at least since 2012, been the go-to field for scammers and grifters. Machine learning, and technology in general, is basically a few real ideas, a small number of honest hard workers, and then millions of fad chasers and scammers.
mt_•1h ago
It would be ironic if the very detection of hallucinations contained hallucinations of its own.
doug_durham•52m ago
Getting papers published is now more about embellishing your CV than about a sincere desire to present new research. I see this everywhere at every level. Getting a paper published anywhere is a checkbox in completing your resume. As an industry we need to stop taking this into consideration when reviewing candidates or deciding pay. In some sense it has become an anti-signal.
londons_explore•8m ago
I'd like to see a financial approach to deciding pay by giving researchers a small and perhaps nonlinear or time-bounded share of any profits that arise from their research.

Then people's CVs could say "My inventions have led to $1M in licensing revenue" rather than "I presented a useless idea at a decent conference because I managed to make it sound exciting enough to get accepted".

nerdjon•51m ago
The downstream effects of this are extremely concerning. We have already seen the damage caused by human-written research that was later retracted, like the "research" on vaccines causing autism.

As we get more and more papers that may be citing information that was originally hallucinated in the first place, we have a major reliability issue. What is worse, people who did not use AI in the first place will be caught in the crosshairs, since they will be referencing incorrect information.

There needs to be a serious amount of education done on what these tools can and cannot do and importantly where they fail. Too many people see these tools as magic since that is what the big companies are pushing them as.

Other than that we need to put in actual repercussions for publishing work created by an LLM without validating it (or just say you can’t in the first place but I guess that ship has sailed) or it will just keep happening. We can’t just ignore it and hope it won’t be a problem.

And yes, humans can make mistakes too. The difference is accountability, and the ability to actually be unsure about something, so you question yourself and validate.

pandemic_region•50m ago
What if they only accepted handwritten papers? Basically the current system is beyond repair, so we may as well go back to receiving 20 decent papers instead of 20k hallucinated ones.
ctoth•47m ago
How you know it's really real is that they clearly tell the FPR and compare against a pre-LLM baseline.

But I saw it in Apple News, so MISSION ACCOMPLISHED!

yobbo•47m ago
As long as these sorts of papers serve purposes more important to the careers of the authors than anything related to science or the discovery of knowledge, of course this happens and continues.

The best possible outcome is that these two purposes are disconflated, with follow-on consequences for the conferences and journals.

poulpy123•46m ago
All papers proven to have used an LLM beyond writing improvement should be automatically retracted.
brador•42m ago
The problem isn’t scale.

The problem is consequences (lack of).

Doing this should get you barred from research. It won’t.

CrzyLngPwd•40m ago
This is not the AI future we dreamed of, or feared.
nospice•37m ago
We've been talking about a "crisis of reproducibility" for years, and about the incentive to crank out high volumes of low-quality research. We now have a tool that brings the cost of producing plausible-looking research down to zero. So of course we're going to see that tool abused on a galactic scale.

But here's the thing: let's say you're a university or a research institution that wants to curtail it. You catch someone producing LLM slop, and you confirm it by analyzing their work and conducting internal interviews. You fire them. The fired researcher goes public saying that they were doing nothing of the sort and that this is a witch hunt. Their blog post makes it to the front page of HN, garnering tons of sympathy and prompting many angry calls to their ex-employer. It gets picked up by some mainstream outlets, too. This has happened a bunch of times.

In contrast, there are basically no consequences to institutions that let it slide. No one is angrily calling the employers of the authors of these 100 NeurIPS papers, right? If anything, there's the plausible deniability of "oh, I only asked ChatGPT to reformat the citations, the rest of the paper is 100% legit, my bad".

meindnoch•35m ago
Jamie, bring up their nationalities.
neom•18m ago
I wrote before about my embarrassing time with ChatGPT during a period (https://news.ycombinator.com/item?id=44767601) - I decided to go back through those old 4o chats with 5.2 pro extended thinking, and the reply was pretty funny because it first slightly ridiculed me, heh - but what it showed was: basically I would say "what 5 research papers from any area of science talk to these ideas" and it would find 1 and invent 4 if it didn't know 4 others, and not tell me, and then I'd keep working with it and it would invent what it thought might be in the papers along the way, making up new papers in its own work to cite to make its own work valid, lol. Anyway, I'm a moron, sure, and no real harm came of it for me, just still slightly shook I let that happen to me.
londons_explore•14m ago
And this is the tip of the iceberg, because these are the easy to check/validate things.

I'm sure plenty of more nuanced facts are also entirely without basis.

techIA•10m ago
They will turn it into a party drug.