Is this really not the case? I've read some of the AI papers in my field, and I know many other domain experts have as well. That said I do think that CS/software based work is generally easier to check than biology (or it may just be because I know very little bio).
Time-wise it depends where in the process you start!
Do you know what your target protein even is? I've seen entire PhDs spent trying to purify a single protein - every protein is purified differently, there are dozens of methods, and some work and some don't. If you can purify it, you can run a barrage of tests on the protein - is it a kinase, how does it behave in different binding assays, etc. That gives you a fairly broad idea of which area of activity your protein sits in.
If you know what it is, you can clone the gene into a vector and put it into an expression host like E. coli. Then E. coli will hopefully express it. That's a few weeks/months of work, depending on how much you want to double-check.
You can then use fluorescent tags like GFP to show you where in the cell your protein is located. Is it in the cell wall? Is it in the nucleus? That might give you an indication of function. But at this point you only have the location.
If your protein is in an easily kept organism like mice, you can run knock-out experiments, where you use different approaches to either turn off or delete the gene that produces the protein. That takes a few months too - and chances are nothing in your phenotype will change once the gene is knocked out; protein-protein networks are resilient, and another protein might jump in to do the job.
if you have an idea of what your protein does, you can confirm using protein-protein binding studies - I think yeast two-hybrid is still very popular for this? It tests whether two specific proteins - your candidate and another protein - interact or bind.
None of those tests will tell you 'this is definitely a nicotinamide adenine dinucleotide binding protein', every test (and there are many more!) will add one piece to the puzzle.
Edit: of course it gets extra-annoying when all these puzzle pieces contradict each other. In my past life I've done some work with the ndh genes that sit in the chloroplast genome of plants and are lost in orchids and some other groups of plants (including my babies), so it's interesting to see what they actually do and why they can be lost. It's called ndh because it was initially named NADH-dehydrogenase-like, because by sequence it kind of looks like an NADH dehydrogenase.
There's a guy in Japan (Toshiharu Shikanai) who worked on it most of his career and worked out that it certainly is NOT a NADH dehydrogenase and is instead a Fd-dependent plastoquinone reductase. https://www.sciencedirect.com/science/article/pii/S000527281...
Knockout experiments with ndh are annoying because it seems to be only really important in stress conditions - under regular conditions our ndh- plants behaved the same.
Again, this is only one protein, and since it's in chloroplasts it's ultra-common - most likely one of the more abundant proteins on earth (it's not in algae either). And we still call it ndh even though it is a Ferredoxin-plastoquinone reductase.
This is basically another argument that deep learning works only as [generative] information retrieval - i.e. a stochastic parrot - because the training data is a very lossy representation of the underlying domain.
Because the data/labels of genes do not always represent the underlying domain (biology) perfectly, the output can be false/invalid/nonsensical.
In cases where it works very well, there is data leakage, because by design LLMs are information retrieval tools. From an information-theory standpoint this is a fundamental "unknown unknown" for any model.
My takeaway is that it's not a fault of the algorithm; it's more the fault of the training dataset.
We humans operate fluidly in the domain of natural language, and even a kid can read and evaluate whether text makes sense or not - this explains the success of models trained on NLP.
But in domains where the training data represents the underlying domain lossily, the model will be imperfect.
The embedding space can represent relationships between words, sentences and paragraphs, and since those things can encode information about the underlying domain, you can query those relationships with text and get reasonable responses. The problem is it's not always clear what is being represented in those relationships as text is a messy encoding scheme.
But another weakness is that, as you say, it is generative: instead of hardcoding all possible questions and all possible answers into a database, we offload some of the data to an algorithm (next-token prediction) in order to allow an imprecise, probabilistic question/prompt (which is useful, because then you can ask anything).
But the problem is that no single algorithm can encode all possible answers to all possible questions in a domain-accurate way, so you lose some precision in the information. Or at least this is how I see LLMs at the moment.
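To make the "querying relationships in the embedding space" point concrete, here's a minimal sketch. It assumes the sentence-transformers package and the public all-MiniLM-L6-v2 model (any text encoder would do): related sentences land close together, even though nothing tells you what relationship is actually being captured.

    # Toy illustration of querying relationships in an embedding space.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = [
        "The kinase phosphorylates its substrate.",
        "This enzyme transfers a phosphate group onto a target protein.",
        "The stock market fell sharply on Tuesday.",
    ]
    emb = model.encode(sentences)

    # Cosine similarity: the first two sentences score far higher with each
    # other than either does with the third - but the vectors don't tell you
    # why, which is the messiness of text as an encoding scheme.
    print(util.cos_sim(emb[0], emb[1]))
    print(util.cos_sim(emb[0], emb[2]))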
But even if, for the sake of argument, we assume that is true without question, LLMs are still here to stay.
Like, think about how junior devs with average or less (programming) skill work: they "retrieve" the information about how to solve the problem from Stack Overflow, tutorials, etc.
So giving all your devs some reasonably well-done AI automation tools (not just a chat prompt!!) is like giving each of them a junior dev to delegate all the tedious simple tasks to, too - without having to worry about whether that task lets the junior dev grow and learn. And to top it off, if there is enough tooling (static code analysis, tests, etc.) in place, the AI tooling will handle the write code -> run tools -> fix issues loop just fine. And the price for that tool is what, 1/30th of that of a junior dev? That means more time to focus on the things which matter, including teaching your actual junior devs ;)
And while I would argue AI isn't fully there yet, I think the current foundation models _might_ already be good enough to get there with the right ways of wiring them up and combining them.
Whereas in biology, the natural domain is the physical/chemical/biological reactions occurring between organisms and molecules. The laws of interaction are not created by humans, but by the Creator(tm), and so the training dataset barely captures a tiny fraction of the richness of the domain and its interactions. Because of this, any model will be inadequate.
The worse science - publish-or-perish pulp - got more academic karma: Altmetric/citations -> $$$.
AI is the perfect academic: the science and curiosity are gone, and the ability to push out science-looking text is supermaxxed.
The tragic end solution: do the same and throw even more money at it.
> At a time when funding is being slashed, I believe we should be doing the opposite
AI has shown the world that academia is beyond broken in a way that can't be ignored, and academia won't get its head out of the granular sediments between 0.0625 mm and 2 mm in diameter.
Defund academia now.
I would have thought that the common catastrophic-failure stories about rewriting systems all at once, instead of fixing them bit by bit, would have helped IT people know better.
Except that we can't compare Twitter to a journal like Nature. Science is supposed to be immune to this kind of bullshit thanks to reputable journals and peer review blocking a publication before it does any harm.
Was that a failure of Nature?
We've taken this all too far. It is bad enough to lie to the masses in Pop-Sci articles. But we're straight up doing it in top tier journals. Some are good faith mistakes, but a lot more often they seem like due diligence just wasn't ever done. Both by researchers and reviewers.
I at least have to thank the journals. I've hated them for a long time and wanted to see their end - free up publishing from the bullshit of novelty-chasing and the narrowing of research. I just never thought they'd be the ones to put the knife through their own heart.
But I'm still not happy about that tbh. The only result of this is that the public grows to distrust science more and more. In a time where we need that trust more than ever. We can't expect the public to differentiate nuanced takes about internal quibbling. And we sure as hell shouldn't be giving ammunition to the anti-science crowds, like junk science does...
obviously this is hyperbole of two extremes, but i certainly trust a journal far more if it actively and loudly looks to correct mistakes over one that never corrects anything or buries its retractions.
a rather important piece of science is correcting mistakes by gathering and testing new information. we should absolutely be applauding when a journal loudly and proactively says “oh, it turns out we were wrong when we declared burying a chestnut under the oak tree on the third thursday of a full moon would cure your brother's infected toenail.”
But I think the problem is that "high quality" is seen as equal to "high impact", which means prestige and visibility are what matter. That likely lowers the threshold quite a lot, since being first to publish something possibly valid is seen as important.
> shouldn’t we expect a high quality journal to retract often as we gather more information?
This is complicated, and kinda sad tbh. But no. You need to carefully think about what "high quality journal" means. Typically it is based on something called Impact Factor[0], which is essentially the number of citations a journal's recent papers received over the last 2 years, per paper published. It sounds good on paper, but if you think about it for a second you'll notice there's a positive feedback loop. There's also no incentive for the cited work to actually be correct.
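For reference, a rough sketch of that calculation (per the Wikipedia definition in [0]); the numbers here are made up.

    # Impact factor for year Y = citations received in Y to items published
    # in Y-1 and Y-2, divided by the number of citable items from Y-1 and Y-2.
    def impact_factor(citations_to_prev_two_years: int,
                      citable_items_prev_two_years: int) -> float:
        return citations_to_prev_two_years / citable_items_prev_two_years

    # e.g. 2400 citations in 2024 to papers from 2022-2023,
    # and 600 citable items published in 2022-2023:
    print(impact_factor(2400, 600))   # -> 4.0

Note that nothing in the formula cares whether a citation means "this was great" or "this was wrong, here's why" - which is exactly the problem described next.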
For example, a false paper can often get cited far more than a true paper. This is because when you write the academic version of "XYZ is a fucking idiot, and here's why", you cite their paper. It's good to put their bullshit down, but it can also just end up being Streisand-effect-like. The journal is happy with its citations. Both people published in it, so it benefits from both directions. You keep the bad paper up for the record, and because, as long as the authors were actually acting in good faith, you don't actually want to take it down. The problem is... how do you know?
Another weird factor used is acceptance rates. This again sounds nice at first. You don't want a journal publishing just anything, right?[1] The problem comes when these actually become targets (which they are). Many of the ML conferences target about a 25% acceptance rate[2]. It fluctuates year to year. It should, right? Some years are just better science than other years. A good paper hits that changes things, and the next year should have a boom! But that's not the level of fluctuation we're talking about. If you look at the actual number of papers accepted in that repo, you'll see a disproportionate number of accepted-paper counts ending in a 0 or 5. Then you see the 1s and 6s, which are papers being squeezed in, often for political reasons. Here, I tallied the first 2 tables for you: you'll see the first has a very disproportionate share ending in 1 and 6, and CV loves 0, 1, and 3. These numbers should convince you that this is not a random process, though they should not convince you it is all funny business (much harder to prove). But it is at least enough to make you suspicious and encourage you to dig in more.
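If you want to run that kind of check yourself, here's a hedged sketch of the digit tally. The counts below are invented - the real ones are in the linked Conference-Acceptance-Rate repo [2] - and with only a handful of venues you'd want far more data points for any real statistical power.

    # Tally the last digits of accepted-paper counts and test against a
    # uniform distribution (assumes scipy; counts here are made up).
    from collections import Counter
    from scipy.stats import chisquare

    accepted_counts = [1591, 2580, 1305, 3496, 1966, 2145, 1675, 3340, 1050, 2665]
    last_digits = [n % 10 for n in accepted_counts]
    observed = [Counter(last_digits).get(d, 0) for d in range(10)]

    # Under "no target massaging", each last digit should show up roughly
    # equally often; a tiny p-value says the digits don't look random.
    stat, p = chisquare(observed)
    print(observed, p)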
There's a lot that's fucked up about the publishing system and academia. Lots of politics, lots of restricted research directions, lots of stupid. But also don't confuse this for people acting in bad faith or lying. Sure, that happens. But most people are trying to do good, and very few people in academia are blatantly publishing bullshit. It's just that everything gets political. And by political I don't mean government politics, I mean the same bullshit office politics. We're not immune from that same bullshit and it happens for exactly the same reasons. It just gets messier, because if you think it is hard to measure the output of an employee, try to measure the output of people whose entire job is to create things that no one has ever thought of before. It's sure going to look like they're doing a whole lot of nothing.
So I'll just leave you with this (it'll explain [1])
As a working scientist, Mervin Kelly (Director of Bell Labs (1925-1959)) understood the golden rule,
"How do you manage genius? You don't."
https://1517.substack.com/p/why-bell-labs-worked
There's more complexity like how we aren't good at pushing out frauds and stuff but if you want that I'll save it for another comment.
[0] https://en.wikipedia.org/wiki/Impact_factor
[1] Actually I do. As long as it isn't obviously wrong, plagiarized, or falsified, then I want that published. You did work, you communicated it, now I want it to get out into the public so that it can be peer reviewed. I don't mean a journal's laughable version of peer review (3-4 unpaid people who don't study your niche, are more concerned with whether it is "novel" or "impactful", and are quickly reading your paper as one of 4 on their desk they need to get through this week; it's incredibly subjective, and high-impact papers (like Nobel Prize winning papers) routinely get rejected). Peer review is the process of other researchers replicating your work, building on it, and/or countering it. Those are just new papers...
[2] https://github.com/lixin4ever/Conference-Acceptance-Rate
It has nothing to do with science, but rather people not finding that a sufficient justification for unpopular actions. For instance it's 100% certain that banning sugary drinks would dramatically improve public health, reduce healthcare costs, increase life expectancy, and just generally make society better in every single way.
So should we ban sugary drinks? It'd be akin to me trying to claim that if you say no then you're anti-science, anti-health, or whatever else. It's just a dumb, divisive, and meaningless label - exactly the sort politicians love so much nowadays.
Of course there's some irony in that it will become a self-fulfilling prophecy. The more unpopular things done in the name of 'the science', the more negative public sentiment towards 'the science' will become. Probably somewhat similar to how societies gradually became secular over time, as it became quite clear that actions done in the name of God were often not exactly pious.
---
Also the flat earth people actually aren't trying to argue against science (the process). They're arguing that everyone except them made either observational errors or reasoning errors.
https://plato.stanford.edu/entries/scientific-realism/#WhatS...
> Almost nobody is "anti-science".
Last I checked:
- 15% of Americans don't believe in Climate Change[0]
- 37% believe God created man in our current form within the last ~10k years (i.e. don't believe in evolution)[1]
I don't think these are just rounding errors. They're large enough numbers that you should know multiple people who hold these beliefs unless you're in a strong bubble.
I'm obviously with you in news and pop-sci being terrible. I hate IFuckingLoveScience. They're actually just IFuckingLoveClickbait. My point was literally about this bullshit.
90% of the time it is news and pop-sci miscommunicating papers. Where they clearly didn't bother to talk to authors and likely didn't even read the paper. "Scientists say <something scientists didn't actually say>". You see this from eating chocolate, drinking a glass of red wine, to eating red meat or processed meat. There are nuggets of truth in those things but they're about just as accurate as the grandma that sued McDonalds over coffee that was too hot. You sure bet this stuff creates distrust in science
[0] https://record.umich.edu/articles/nearly-15-of-americans-den...
[1] https://news.gallup.com/poll/647594/majority-credits-god-hum...
Me? I barely believe in the results of my experiments. But I also know what this poll is intending to ask and yeah, I read enough papers, processed enough data, did enough math, and tracked enough predictions that ended up coming true. That's enough to convince me it's pretty likely that those spending a fuck-ton more time on it (and who are the ones making those predictions that came true!) probably know what they're talking about.
That's almost weirder than declaring that 15% of people not believing in anthropogenic global warming is some sort of crisis. It's a theory that seems to fit the data (with caveats), not an Axiom of Science.
It's actually bizarre that 85% of people trust Science so much that they would believe in something that they have never seen any direct evidence of. That's a result of marketing. The public don't believe in global warming because it's "correct"; they have no idea if it's correct, and they often believe in things that are wrong that people in white coats on television tell them.
> According to your model, scientists who believe in God are anti-science.
In a way, yes. But every scientist I know that also believes in God is not shy about admitting their belief is unscientific. The reason I'm giving this a bit of a pass is because in science we need things that are falsifiable. The burden of proof should be on those believing in God. But such a belief is not falsifiable. You can't prove or disprove God. If they aren't pushy, are okay with admitting that, and don't make a big deal out of it, then I don't really care. That's just being a decent person.
But that's a very different thing from not believing in things we have strong physical evidence for, strong mathematical theories, and a long record of counterfactual predictions. The great thing about science is it makes predictions. Climate science has been making pretty good ones since the 80's. Every prediction comes with error bounds. Those are tightening, but the climate today matches those predictions within error. That's falsifiable.
Disagreeing with some consensus is not "anti-science". The term doesn't even make any sense, because it's a political and not a scientific term. I mean, imagine if we claimed that everybody who happens to believe MOND is more likely than WIMPs as an explanation for dark matter is "anti-science". It's just absolutely stupid. Yet we do exactly that on other topics, where suddenly you must agree with the consensus or you're just "anti-science"? I mean again, it makes no sense at all.
Anti-science means making claims that have no basis in that process, or categorically rejecting the body of work that was based on that process.
For instance none other than Einstein rejected a probabilistic interpretation of quantum physics, the Copenhagen Interpretation, all the way to his death. Many of his most famous quotes like 'God does not play dice with the universe.' or 'Spooky action at a distance.' were essentially sardonic mocking of such an interpretation, the exact one that we hold as the standard today. It was none other than Max Planck that remarked, 'Science advances one funeral at a time' [1], precisely because of this issue.
And so freedom to express, debate, and have 'wrong ideas' in the public mindshare is quite critical, because it may very well be that those wrong ideas are simply the standard of truth tomorrow. But most societies naturally turn against this, because they believe they already know the truth, and fear the possibility of society being misled away from that truth. And so it's quite natural to try to clamp down, implicitly or explicitly, on public dissenting views, especially if they start to gain traction.
> none other than Einstein rejected a probabilistic interpretation of quantum physics
That has been communicated to you wrong, and a subtle distinction makes a world of difference. Plenty of physicists then and now still work hard on trying to figure out how to remove uncertainty from quantum mechanics. It's important to remember that randomness is a measurement of uncertainty.
We can't move forward if the current paradigm isn't challenged. But the way it is challenged is important. Einstein wasn't going around telling everyone they were wrong, but he was trying to get help in the ways he was trying to solve it. You still have to explain the rest of physics to propose something new.
Challenging ideas is fine, it's even necessary, but at the end of the day you have to pony up.
The public isn't forming opinions about things like Einstein. They just parrot authority. Most HN users don't even understand Schrödinger's cat and think there's a multiverse.
For instance this is the complete context of his spooky action at a distance quote: "I cannot seriously believe in [the Copenhagen Interpretation] because the theory cannot be reconciled with the idea that physics should represent a reality in time and space, free from spooky action at a distance." Framing things like entanglement as "spooky action at a distance" was obviously being intentionally antagonistic on top of it all as well.
---
And yes, if it wasn't clear by my tone - I believe the West has gradually entered the exact sort of death-of-science phase I am speaking about. A century ago you had uneducated (formally at least) brothers working as bicycle repairmen pushing forward aerodynamics and building planes in their spare time. Today, as you observe, even people with excessive formal education, access to [relatively] endless resources, endless information, and more seem to have little ambition to exploit any of it, rather than passively consume it. It goes some way to explaining why some think LLMs might lead to AGI.
> Disagreeing with some consensus is not "anti-science".
Be careful of gymnastics. Yes, science requires the ability to disagree. You can even see me saying, in my comment history, that a scientist needs to be a bit anti-authoritarian!
But HOW one goes about disagreeing is critical.
Sometimes I only have a hunch that what others believe is wrong. They have every right to call me stupid for that. Occasionally I'll be able to gather the evidence and prove my hunch. Then they are stupid for not believing like I do, but only after evidenced. Most of the time I'm wrong though. Trying to gather evidence I fail and just support the status quo. So I change my mind.
Most importantly, I just don't have strong opinions about most things. Opinions are unavoidable, strong ones aren't. If I care about my opinion, I must care at least as much about the evidence surrounding my opinion. That's required for science.
Look at it this way. When arguing with someone are you willing to tell them how to change your mind? I will! If you're right, I want to know! But frankly, I find most people are arguing to defend their ego. As if being wrong is something to be embarrassed about. But guess what, we're all wrong. It's all about a matter of degree though. It's less wrong to think the earth is a sphere than flat because a sphere is much closer to an oblate spheroid.
If you can't support your beliefs and if you can't change your mind, I don't care who you listen to, you're not listening to science
The root causes can be argued...but keep that in mind.
No single paper is proof. Bodies of work across many labs, independent verification, etc is the actual gold standard.
[1] - https://search.brave.com/search?q=site%3Anytimes.com+Journal...
[2] - https://en.wikipedia.org/wiki/Replication_crisis#In_psycholo...
> higher retraction/unverified
Scientific consensus doesn't advance because a single new ground-breaking claim is made in a prestigious journal. It advances when enough other scientists have built on top of that work.
The current state of science is not 'bleeding edge stuff published in a journal last week'. That bleeding edge stuff might become part of scientific consensus in a month, or year or three, or five - when enough other people build on that work.
Anybody who actually does science understands this.
Unfortunately, people with poor media literacy who only read the headlines don't understand this, and assume that the whole process is all a crock.
I didn't think this was new? Like, it's been a few years since that whole replication crisis thing kicked off.
You have misplaced confidence in the scientific method. It was never immune to corruption, either by those deliberately manipulating it for their personal gain, or simply due to ignorance and bad methodology. We have examples of both throughout history. In either case, peer review is not infallible.
The new problem introduced by modern AI tools is that they drastically lower the skill requirement for anyone remotely capable in the field to generate data that appears correct on the surface, with relatively little effort and very quickly, while errors can only be discovered by actual experts in the field investing considerable amounts of time and resources. In some fields like programming the required resources to review code are relatively minor, but in fields like biology this (from what I've read) is much more difficult and expensive.
But, yes, science is being flooded with (m|d)isinformation, just like all our other media channels.
Partially due to the legacy of science historically being rooted in "it matters more who you (or your parents) are" societies (because that's who had the money in somewhat modern history) - or, as some would say, the "old white man problem", except it has nothing to do with skin color or being a man, and only a limited amount to do with being old.
Partially due to how much more "science (output)" is produced today - the QA approaches that once worked reasonably well don't work that well at today's scale anymore.
Partially due to how many flaws there are in the process.
Partially due to human nature (as in, people tend to care more about "exciting", "visible" things, etc.).
People have been pushing for change in a lot of ways like:
- pushing to make full reproducibility a must-have (but that is hard, especially for statistics-based work that only a few companies can even afford to try to run; it's also hard because it requires a lot of transparency and open data access, and the latter especially is something many owners of datasets are not okay with)
- pushing for more appreciation of null results and failures (to be clear, I mean appreciation both in the form of monetary support and in the traditional sense of the word - people (colleagues) appreciating it)
- pushing for more verification of papers by trying to reproduce them (both as in more money/time resources for it, and in changing the mindset from it being a daunting, unappreciated task to it being a nice thing to do)
But too little change happened in the end before modern LLM AI hit the scene, and now it has made things so much harder, as it's now easy to mass-produce sloppy but reasonable-looking (non)science.
> although later investigation suggests there may have been data leakage
I think this point is often forgotten. Everyone should assume data leakage until it is strongly evidenced otherwise. It is not on the reader/skeptic to prove that there is data leakage; it is the authors who have the burden of proof. It is easy to have data leakage on small datasets - datasets where you can look at everything. Data leakage is really easy to introduce and you often do it unknowingly. Subtle things easily spoil data.
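A minimal sketch of how "unknowingly" it can happen (assuming scikit-learn and numpy): selecting "informative" features on the full dataset before cross-validating. The features and labels below are pure noise, so honest accuracy should be about 50%.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10_000))       # 100 samples of pure noise
    y = rng.integers(0, 2, size=100)         # random labels

    # Leaky: pick the 20 "best" features using ALL rows (including the rows
    # that will later act as test rows), then cross-validate.
    X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
    print(cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y).mean())  # well above 0.5

    # Honest: do the feature selection inside each training fold only.
    pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
    print(cross_val_score(pipe, X, y).mean())                                     # ~0.5

Nothing in the leaky version looks wrong at a glance, which is the point: if this is hard to catch on 100 rows you can inspect, claims of "no leakage" over trillions of tokens deserve a lot of skepticism.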
Now, we're talking about gigantic datasets where there's no chance anyone can manually look through it all. We know the filter methods are imperfect, so how do we come to believe that there is no leakage? You can say you filtered it, but you cannot say there's no leakage.
Beyond that, we are constantly finding spoilage in the datasets we do have access to. So there's frequent evidence that it is happening.
So why do we continue to assume there's no spoilage? Hype? Honestly, it just sounds like a lie we tell ourselves because we want to believe. But we can't fix these problems if we lie to ourselves about them.
They tacked their actual point onto the end of a copy-paste of the OP comment's context, and ended up writing something barely grammatically correct.
In doing so they prove exactly why not to listen to the internet. So they have that going for them.
For example, Medicare and Medicaid had a fraud rate of 7.66%[1]. Yes, that is a lot of billions, and there is room for improvement, but that doesn't mean the entire system is failing: over 92% of cases are being covered as intended.
The same could be said with these models. If the spoilage rate is 10%, does that mean the whole system is bad? Or is it at a tolerable threshold?
[1]: https://www.cms.gov/newsroom/fact-sheets/fiscal-year-2024-im...
"Acceptable" thresholds are problem specific. For AI to make a meaningful contribution to protein function prediction, it must do substantially better than current methods, not just better than some arbitrary threshold.
There's also the problem of false negatives vs positives. If your goal is to cover 100% of true cases you can achieve that easily by just never denying a claim. That would of course yield stratospheric false positive rates (fraud). You have to understand both the FN rate (cost of missed fraud) vs the FP rate (cost of fraud fighting) and then balance them.
The same applies with using models in science to make predictions.
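As a hedged sketch of what "balancing them" can look like in practice: pick the decision threshold that minimizes expected cost, given per-error costs. Everything below (the fake model scores, the 8% base rate, the dollar figures) is invented for illustration; assumes numpy.

    import numpy as np

    rng = np.random.default_rng(1)
    is_fraud = rng.random(10_000) < 0.08                   # ~8% of claims are fraudulent
    # A fake model score: fraudulent claims tend to score higher.
    score = np.clip(np.where(is_fraud, 0.7, 0.3) + rng.normal(0, 0.2, 10_000), 0, 1)

    COST_FN = 5_000   # dollars lost per missed fraud case (false negative)
    COST_FP = 300     # dollars spent investigating a legitimate claim (false positive)

    def expected_cost(threshold):
        flagged = score >= threshold
        fn = np.sum(is_fraud & ~flagged)                   # fraud we let through
        fp = np.sum(~is_fraud & flagged)                   # legit claims we flag
        return fn * COST_FN + fp * COST_FP

    best = min(np.linspace(0, 1, 101), key=expected_cost)
    print(best, expected_cost(best))

Flagging nothing corresponds to a threshold of 1.0 and flagging everything to 0.0; the point is that the "acceptable" threshold falls out of the costs, not out of some universal percentage.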
CERT’s annual assessments do seem to involve a large-scale, rigorous analysis of an independent sample of 50,000 cases, though. And those case audits seem, at least on paper and to a layperson, to apply rather more thorough scrutiny than Medicare’s day-to-day policies and procedures.
As @patio11 says, and to your point, “the optimal amount of fraud is non-zero”… [2]
[0] https://www.cms.gov/data-research/monitoring-programs/improp...
[1] https://www.cms.gov/research-statistics-data-and-systems/mon...
[2] https://www.bitsaboutmoney.com/archive/optimal-amount-of-fra...
> The better question is: what is the acceptable threshold?
Currently we are unable to answer that question. AND THAT'S THE PROBLEM. I'd be fine if we could. Well, at least far less annoyed. I'm not sure what the threshold should be, but we should always try to minimize it. At least having error bounds would do a lot of good at making this happen. But right now we have no clue, and that's why this is such a big question that people keep bringing up. It's not that we don't point out specific levels of error because they are small and we don't want you looking at them; rather, we don't point them out because nobody has a fucking clue.
And until someone has a clue, you shouldn't trust that the error rate is low. The burden of proof is on the one making the claim of performance, not the one asking for evidence of that claim (i.e. skeptics).
Btw, I'd be careful with percentages. Especially when numbers are very high. e.g. LLMs are being trained on trillions of tokens. 10% of 1 trillion is 100 bn. The entire work of Shakespeare is 1.2M tokens... Our 10% error rate would be big enough to spoil any dataset. The bitter truth is that as the absolute number increases, the threshold for acceptable spoilage (in terms of percentage) needs to decrease.
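Just to spell out the arithmetic with the comment's own figures:

    training_tokens = 1_000_000_000_000     # ~1 trillion tokens of training data
    shakespeare_tokens = 1_200_000          # complete works of Shakespeare, ~1.2M tokens
    spoiled = 0.10 * training_tokens        # a "mere" 10% error rate
    print(spoiled / shakespeare_tokens)     # ~83,000 Shakespeare-sized corpora of spoilage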
I'm fine with 5% failure if my soup is a bit too salty. Not fine with 0.1% failure if it contains poison.
(We can even get more nuanced. What kind of poison?)
That is, the problem is not that the AI is wrong X% of the time. The problem is that, in the presence of a data leak, there is no way of knowing what the value of X even is.
This problem is recursive - in the presence of a data leak, you also cannot know for sure the quantity of data that has leaked.
And then I asked it for [ad lib cocktail request] and got back thorough instructions.
We did that with sand. That we got from the ground. And taught it to talk. And write C programs.
Never mind what? That I had to ask twice? Or five times?
What maximum number of requests do you feel like the talking sand needs to adequately answer your question in before you are impressed by the talking sand?
But I think people aren't arguing about how amazing it is, but about specific applicability. There's also a lot of toxic hype and FUD going around, which can be tiring and frustrating.
The disconnect here is that the cost of iteration is low and it's relatively easy to verify the quality of a generated C program (does the compiler issue warnings or errors? Does it pass a test suite?) or a recipe (basic experience is probably enough to tell if an ingredient seems out of place or proportions are wildly off).
In science, verifying a prediction is often super difficult and/or expensive, because at prediction time we're trying to shortcut around an expensive or intractable measurement or simulation. Unreliable models can really change the tradeoff point of whether AI accelerates science or just massively inflates the burn rate.
Let's suppose you read a paper that does "X with Y", but you are interested in "Z", so the brilliant idea is to do "Z with Y" and publish the new combination, citing the original.
Sometimes you cross your fingers and just try "Z with Y", but if the initial attempt fails or you are too cautious you try "X with Y" to ensure you understand the details of the original paper.
If the reproduction of "X with Y" is a success, you now try "Z with Y" and if it works you publish it.
If the reproduction of "X with Y" is a failure, you may email the authors of just drop the original paper in the recycle bin. Publishing a failure of a reproduction is too difficult. This is a bad incentive, but it's also too easy to make horrible mistakes and fail.
However, it rarely takes the form of explicit replication of the published findings. More commonly, the published work makes a claim, and such a claim leads to further hypotheses (predictions), which others may attempt to demonstrate/verify.
During this second demonstration/study, the claims of the first study are verified.
I believe that if we’re not even willing to carefully confirm whether our predictions match reality, then no matter how impressive the technology looks, it’s only a fleeting illusion.
That is true for any organisation or any person that's different from you. Companies ain't special here.
> The reason it isn't paying you is that its main goal is to make more money than it spends.
Making money is the main goal of many companies, but not all.
Almost any goals of any organisation or any person can be furthered by having more money rather than less. So everyone has a similar incentive to pay you less. (This includes charities. All else being equal, if you can pay your workers less, you can hand out more free malaria nets.) But as https://news.ycombinator.com/item?id=44179846 points out, they pay you, so that you work for them.
See also https://en.wikipedia.org/wiki/Instrumental_convergence
Why are people using transformers? Do they have any intuition that they could solve the challenge, let alone efficiently?
It's the same as "AI can code". It gets caught with failing spectacularly when the problem isn't in the training set over and over again, and people are surprised every time.
But yes, unmanaged and unchecked it absolutely cannot to the full job of really any human. It's not close.
Honestly, for straight-up classification? I’d pick SVM or logistic any day. Transformers are cool, but unless your data’s super clean, they just hallucinate confidently. Like giving GPT a multiple-choice test on gibberish—it will pick something, and say it with its chest.
Lately, I just steal embeddings from big models and slap a dumb classifier on top. Works better, runs faster, less drama.
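For what it's worth, that recipe in sketch form (assuming the sentence-transformers and scikit-learn packages; the model name is just one common public choice, and the data is a toy stand-in):

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    texts = ["great product", "terrible service", "loved it",
             "waste of money", "works as advertised", "broke after a day"]
    labels = [1, 0, 1, 0, 1, 0]

    # The big model does the heavy lifting once, as frozen embeddings.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    X = encoder.encode(texts)

    # The "dumb classifier on top": cheap, fast, and it can't answer
    # outside the label set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=1/3,
                                              stratify=labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))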
Appreciate this post. Needed that reality check before I fine-tune something stupid again.
Source: bitter, bitter experience. I once predicted the placebo effect perfectly using a random forest (just got lucky with the train/test split). Although I'd left academia at that point, I often wonder if I'd have dug in deeper if I'd needed a high impact paper to keep my job.
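For anyone who hasn't been bitten by this yet, here's a hedged toy version of "got lucky with the split" (assumes scikit-learn and numpy). Both features and labels are pure noise, yet the best of twenty random splits usually looks respectable.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(7)
    X = rng.normal(size=(40, 50))        # 40 subjects, 50 features of pure noise
    y = rng.integers(0, 2, size=40)      # "responder" labels, also pure noise

    scores = []
    for seed in range(20):               # twenty different random train/test splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
        clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))

    # The lucky split vs. the typical one - only one of them gets written up.
    print(max(scores), float(np.median(scores)))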
but that’s how science advances
there should be an arxiv for rebuttals maybe
Yeah, me too. There was a paper doing the rounds a few years back (computer programming is more related to language skill than maths), so I downloaded the data and looked at their approach, and it was garbage. Like, polynomial-regression-on-30-datapoints kind of bad.
And based on my experience during the PhD this is very common. It's not surprising though, given the incentive structure in science.
If I gave a classroom of under grad students a multiple choice test where no answers were correct, I can almost guarantee almost all the tests would be filled out.
Should GPT and other LLMs refuse to take a test?
In my experience it will answer with the closest answer, even if none of the options are even remotely correct.
1: People who have a financial stake in the AI hype
As such I would expect students to put in something. However, after class they would talk about how bad they think they did, because they are all self-aware enough to know where they guessed.
Humans have made progress by admitting when they don’t know something.
Believing an LLM should be exempt from this boundary of “responsible knowledge” is an untenable path.
As in, if you trust an ignorant LLM then by proxy you must trust a heart surgeon to perform your hip replacement.
A good analogy would be if someone claimed to be a doctor and when I asked if I should eat lead or tin for my health they said “Tin because it’s good for your complexion”.
Sure but this is still indirectly using transformers.
You may know this but many don't -- this is broadly known as "transfer learning".
I feel that we're wrong to be focusing so much on the conversational/inference aspect of LLMs. The way I see it, the true "magic" hides in the model itself. It's effectively a computational representation of understanding. I feel there's a lot of unrealized value hidden in the structure of the latent space itself. We need to spend more time studying it, make more diverse and hands-on tools to explore it, and mine it for all kinds of insights.
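In that spirit, a small hedged sketch of what "hands-on tools to explore it" can look like, assuming you've already dumped embeddings for a set of items (the file names are hypothetical; scikit-learn and numpy assumed):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    emb = np.load("embeddings.npy")                    # hypothetical (n_items, dim) matrix
    labels = open("labels.txt").read().splitlines()    # hypothetical item names, one per row

    # Local structure: who lives near whom in the latent space?
    nn = NearestNeighbors(n_neighbors=6).fit(emb)
    _, idx = nn.kneighbors(emb[:1])
    print([labels[i] for i in idx[0]])                 # item 0 and its 5 closest neighbours

    # Global structure: squash to 2-D and eyeball the clusters.
    coords = PCA(n_components=2).fit_transform(emb)
    print(coords[:5])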
if you wanna peek where their heads at, start here https://www.anthropic.com/research/mapping-mind-language-mod... not just another ai blog. actual systems brain behind it.
[meta] Here’s where I wish I could personally flag HN accounts.
and a bunch of phone/tablet keyboards do so, too
I like em dashes. In the past I had considered installing a plugin to reliably turn -- into an em dash; if I hadn't discarded that idea you would have seen some in this post ;)
And I think I have seen at least one spell-checking browser plugin which does stuff like that.
Oh, and some people use 3rd-party interfaces to interact with HN, some of which auto-convert consecutive dashes to em dashes.
In the places where I have been using AI from time to time, it's also not super common to use em dashes.
So IMHO "em dash" isn't a tall tell sign for something being AI written.
But then, wrt. the OP comment, I think you might be right anyway. Its writing style is ... strange. Like taking a writing style from a novel - and not just any writing style, but one that over-exaggerates that a story is currently being told inside a story - and then filling it with the semantics of an HN comment. Like what you might get if you ask an LLM to "tell a story" from your set of bullet points.
But this opens a question, if the story still comes from a human isn't it fine? Or is it offensive that they didn't just give us compact bullet points?
Putting that aside, there is always the option that the author is just very well read/written - maybe a book author, maybe a hobby author - and picked up such a writing style.
I have endash bound to ⇧⌥⌘0, and emdash bound to ⇧⌥⌘=.
This seems to be exactly the kind of results we would expect from a system that hallucinates, has no semantic understanding of the content, and is little more than a probabilistic text generator. This doesn't mean that it can't be useful when placed in the right hands, but it's also unsurprising that human non-experts would use it to cut corners in search of money, power, and glory, or worse—actively delude, scam, and harm others. Considering that the latter group is much larger, it's concerning how little thought and resources are put into implementing _actual_ safety measures, and not just ones that look good in PR statements.
Or even of the Internet in general.
I guess it's a common pitfall with information or communication technologies ?
(Heck, or with technologies in general, but non-information or communication ones rarely scale as explosively...)
This doesn't mean that there aren't very valid use cases for these technologies that can benefit humanity in many ways (and I mean this for both digital currencies and machine learning), but unfortunately those get drowned out by the opportunity seekers and charlatans that give the others the same bad reputation.
As usual, it's best to be highly critical of opinions on both extreme sides of the spectrum until (and if) we start climbing the Slope of Enlightenment.
(Not a binary -- ground truth is available enough for AI to be useful to lots of programmers.)
That's many times not easy to verify at all ...
- correct syntax
- passes lints
- type checking passes
- fast test suite passes
- full test suite passes
and every time it doesn't you feed it back into the LLM, automatically, in a loop, without your involvement.
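A hedged sketch of that loop (the check commands are examples - ruff/mypy/pytest here - and ask_llm is a placeholder for whatever model API and patch-application step you actually use):

    import subprocess

    CHECKS = [
        ["ruff", "check", "."],      # lints
        ["mypy", "."],               # type checking
        ["pytest", "-x", "-q"],      # fast test suite
    ]

    def run_checks():
        """Return the first failing command and its output, or (None, '') if all pass."""
        for cmd in CHECKS:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                return cmd, result.stdout + result.stderr
        return None, ""

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("call your model of choice here")

    for _ in range(10):              # bounded retries, so it can't loop forever
        failed_cmd, output = run_checks()
        if failed_cmd is None:
            break                    # everything green - human review comes next
        patch = ask_llm(f"`{' '.join(failed_cmd)}` failed:\n{output}\nPropose a fix.")
        # ...apply `patch` to the working tree here (omitted)...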
The results are often -- sadly -- too good to not slowly start using AI.
I say sadly because IMHO the IT industry has gone somewhere very wrong due to growing too fast, moving too fast, and getting so much money that the companies spearheading it could just throw more people at problems instead of fixing underlying issues. There is also a huge divergence between the science of development, programming, application composition, etc. (not to be confused with the science of, say, data structures and fundamental algorithms) and what the industry actually uses and how it advances.
Now I think normally the industry would auto correct at some point, but I fear with LLMs we might get even further away from any fundamental improvements, as we find even more ways to still go along and continue the mess we have.
Worse, LLM coding performance is highly dependent on how well very similar languages are represented in the training dataset, so new languages with breakthroughs or huge improvements will work less well with LLMs. If that trend continues, it would lock us in to very mid solutions long term.
Until the concept of consequences and punishment are part of AI systems, they are missing the biggest real world component of human decision making. If the AI models aren’t held responsible, and the creators / maintainers / investors are not held accountable, then we’re heading for a new Dark Age. Of course this is a disagreeable position because humans reading this don’t want to have negative repercussions - financially, reputationally, or regarding incarceration - so they will protest this perspective.
That only emphasizes how I’m right. AI doesn’t give a fuck about human life or its freedom because it has neither. Grow up and start having real conversations about this flaw, or make peace that eventually society will have an epiphany about this and react accordingly.
Deep, accurate, real-time code review could be of huge assistance in improving quality of both human- and AI-generated code. But all the hype is focused on LLMs spewing out more and more code.
The danger behind usage of LLMs is that managers do not see the diligent work needed to ensure whatever they come up with is correct. They just see a slab of text that is a mixture of reality and confabulation, though mostly the latter, and it looks reasonable enough, so they think it is magic.
Executives who peddle this nonsense don't realize that the proper usage requires a huge amount of patience and careful checking. Not glamorous work, as the author states, but absolutely essential to get good results. Without it, you are just trusting a bullshit artist with whatever that person comes up with.
Practically speaking, I think there are roles for current LLMs in research. One is in the peer review process. LLMs can assist in evaluating the data-processing code used by scientists. Another is for brainstorming and the first pass at lit reviews.
A bit like how you might write a paper yourself - starting with the data.
As it turned out, I thought the figures looked like data that might be from a paper referenced in a different lecturer's set of lectures (just on the conclusion - he hadn't shown the figures), so I went down to the library (this was in the days of non-digitized content - you had to physically walk the stacks) and looked it up, found the original paper, and then a follow-up paper by the same authors....
I like to think I was just doing my background research properly.
I told a friend about the paper and before you know it the whole class knew - and I had to admit to the lecturer that I'd found the original paper when he wondered why the whole class had done so well.
Obviously this would be trivial today with an electronic search.
We have rare but not unheard of issues with academic fraud. LLMs fake data and lie at the drop of a hat
We can do both known and novel reproductions. Like with both LLM training process and human learning, it's valuable to take it in two broad steps:
1) Internalize fully-worked examples, then learn to reproduce them from memory;
2) Train on solving problems for which you know the results but have to work out intermediate steps yourself (looking at the solution before solving the task)
And eventually:
3) Train on solving problems you don't know the answer to, have your solution evaluated by a teacher/judge (that knows the actual answers).
Even parroting existing papers is very valuable, especially early on, when the model is learning what papers and research look like.
Or maybe give it a paper full of statistics about some experimental observations, and have it reproduce the raw data?
The main reason people don't do it is because incentives are everything, and university/government management set bad incentives. The article points this out too. They judge academics entirely by some function of paper citations, so academics are incentivized to do the least possible work to maximize that metric. There's no positive incentive to publish more than necessary, and doing so can be risky because people might find flaws in your work by checking it. So a lot of researchers hide their raw data or code for as long as possible. They know this is wrong and will typically claim they'll publish it but there's a lot of foot dragging, and whatever gets released might not be what they used to make the paper.
In the commercial world the incentives are obviously different, but the outcomes are the same. Sometimes companies want the ideas to be used because they complement the core business; other times the ideas need to be protected to be turned into a core business. People like to think academic and industrial research are very different, but everyone is optimizing for some metric, whether they like it or not.
Producing novel ideas is the most famous trait of current LLMs, the thing people are spending all their time trying to prevent.
Could you please explain what you mean or give a simple example?
After ChatGPT, big corporations stopped sharing their main research, but it still happens in academia.
It would be the biggest boon to science since sci-hub though.
And since a large set of studies won't be reproducible, you need human supervision as well, at least at first.