Taking CV-filler from 80% to 95% of published academic work is yet another revolutionary breakthrough on the road to superintelligence.
Is it cynical to believe this is already true and has been forever?
Is it naive to hope that when AI can do this work, we will all admit that much of the work was never worth doing in the first place, our academic institutions are broken, and new incentives are sorely needed?
I’m reminded of a chapter in Abundance where Ezra Klein notes how successful (NIH?) grant awardees are getting older over time, nobody will take risks on young scientists, and everyone is spending more of their time churning out bureaucratic compliance than doing science.
> Most notably, it provides confidence levels in its findings, which Cheeseman emphasizes is crucial.
These 'confidence levels' are suspect. You can ask Claude today, "What is your confidence in __" and it will, unsurprisingly, give a 'confidence interval'. I'd like to better understand the system implemented by Cheeseman. Otherwise I find the whole thing, heh, cheesy!
The problem is that so much of the consensus is wrong, and it is going to start by giving you the consensus answer on anything.
There are subjects where I can get it to tell me the consensus answer, then say "what about x", and it completely changes and contradicts its first answer, because x contradicts the standard consensus orthodoxy.
To me it is not much different than going to the library to research something. The library is not useless because the books don't read themselves or because there are numerous books on a subject that contradict each other. Gaining insight from reading the book is my role.
I suspect much LLM criticism is from people who neither much use LLMs nor learn much of anything new anyway.
Throwing compute to mine a search space seems like one of the less controversial ways to use technology...
There should be some research results showing their fundamental limitations, as opposed to empirical observations. Can you point to them?
What about VLMs, VLAs, LMMs?
However you feel about LLMs (and I say this because you don't have to use them for very long before you witness how useful they can be with large datasets, so I'm guessing you're not a fan), they are undeniably incredible tools in some areas of science.
https://news.stanford.edu/stories/2025/02/generative-ai-tool...
Where by "good at" you mean "are totally shit at"?
They routinely hallucinate things even on tiny datasets like codebases.
But the latter doesn't invalidate the former.
I... don't even know how to respond to that.
Also. I didn't say they were useless. Please re-read the claim I responded to.
> Are they also really bad at some aspects of dealing with both? Absolutely. Dangerously, humorously bad sometimes.
Indeed.
Now combine "Finding patterns in large datasets is one of the things LLMs are really good at." with "they hallucinate even on small datasets" and "Are they also really bad at some aspects of dealing with both? Absolutely. Dangerously, humorously bad sometimes".
Translation, in case logic somehow eludes you: if an LLM finds a pattern in a large dataset, given that it often hallucinates (dangerously, humorously badly), what are the chances that the pattern it found isn't a hallucination (often a subtle one)?
Especially given the undeniable, verifiable fact that LLMs are shit at working with large datasets (unless they are explicitly trained on them, but even then it doesn't remove the problem of hallucinations).
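To make that explicit, here's a toy Bayes calculation; every rate below is invented for illustration, not measured from any real model. The point is just that if real patterns are rare and spurious reports are common, a reported pattern is weak evidence that a real one exists.

    # Toy Bayes sketch for the argument above; every rate is an assumption.
    p_real = 0.05           # assumed prior that a real pattern exists in the data
    p_report_if_real = 0.8  # assumed chance the model surfaces a real pattern
    p_report_if_fake = 0.3  # assumed rate of hallucinated/spurious "patterns"

    p_report = p_report_if_real * p_real + p_report_if_fake * (1 - p_real)
    p_real_given_report = p_report_if_real * p_real / p_report
    print(f"P(pattern is real | model reports one) = {p_real_given_report:.2f}")
    # with these made-up rates, roughly 0.12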
The first developed a model to calculate protein function based on DNA sequence - yet provides no results from testing the model. Until it does, it’s no better than the hundreds of predictive models thrown on the trash heap of science.
The second tested a model’s “ability to predict neuroscience results” (which reads really oddly). How did they test it? They pitted humans against LLMs in determining which published abstracts were correct.
Well yeah? That’s exactly what LLMs are good at - predicting language. But science is not advanced by predicting which abstracts of known science are correct.
It reminds me of my days working with computational chemists - we had an x-ray structure of the molecule bound to the target. You can’t get much better than that for hard, objective data.
“Oh yeah, if you just add a methyl group here you’ll improve binding by an order of magnitude”.
So we went back to the lab, spent a week synthesizing the molecule, sent it to the biologists for a binding study. And the new molecule was 50% worse at binding.
And that’s not to blame the computational chemist. Biology is really damn hard. Scientists are constantly being surprised by results that contradict current knowledge.
Could LLMs be used in the future to help come up with broad hypotheses in new areas? Sure! Are the hypotheses going to prove fruitless most of the time? Yes! But that’s science.
But any claim of a massive leap in scientific productivity (whether LLMs or something else) should be taken with a grain of salt.
Not disagreeing with your initial statement about LLMs being good at finding patterns in datasets, btw.
When asked about their confidence, these things are almost entirely useless. If the Magic Disruption Box is incapable of knowing whether or not it read "42/A" correctly, I'm not convinced it's gonna revolutionize science by doing autonomous research.
If you give the model the image and a prior prediction, what can it tell you? Asking it to produce a 1-10 figure in the same token stream as the actual task seems like a flawed strategy.
That’s how typical classification and detection CNNs work. Class and confidence value along with bounding box for detection CNNs.
CNNs and LLMs are not that different. You can train an LLM architecture to do the same thing that CNNs do with a few modifications, see Vision Transformers.
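For concreteness, here's a minimal sketch of what "class plus confidence" means for a classification head: the confidence is just the softmax probability of the predicted class. The logits below are made-up numbers, not the output of any real CNN or ViT.

    # Minimal sketch: a classification head's "confidence" is the softmax
    # probability of the argmax class. Logits are invented for illustration.
    import numpy as np

    def classify(logits, labels):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        top = int(np.argmax(probs))
        return labels[top], float(probs[top])

    labels = ["cat", "dog", "truck"]
    logits = np.array([2.1, 0.3, -1.0])  # hypothetical network output
    print(classify(logits, labels))      # ('cat', ~0.83)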
But the second-order 'confidence as a symbolic sequence in the stream' is only (very) vaguely tied to this. Numbers-as-symbols are of a different kind from numbers-as-next-token-probabilities. I don't doubt there is _some_ relation, but it's too much inferential distance away and thus worth almost nothing.
With that said, nothing really stops you from finetuning an LLM to produce accurately calibrated confidence values as symbols in the token stream. But you have to actually do that, it doesn't come for free by default.
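A sketch of the minimum that "actually do that" involves: collect the model's stated confidences alongside ground-truth correctness and measure the calibration gap. The (confidence, correct) pairs below are invented, and expected calibration error is just one common choice of metric, not anything specific to Anthropic's setup.

    # Sketch: measuring whether verbalized confidences are calibrated.
    # The (confidence, was_correct) pairs are invented for illustration.
    import numpy as np

    preds = [(0.9, True), (0.8, True), (0.95, False), (0.7, True),
             (0.9, False), (0.6, False), (0.85, True), (0.99, True)]

    def expected_calibration_error(preds, n_bins=5):
        conf = np.array([c for c, _ in preds])
        hit = np.array([float(ok) for _, ok in preds])
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            m = (conf > lo) & (conf <= hi)
            if m.any():
                # weight each bin's |stated confidence - actual accuracy| gap
                ece += m.mean() * abs(conf[m].mean() - hit[m].mean())
        return ece

    print(f"ECE: {expected_calibration_error(preds):.3f}")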
Are you implying that science done by humans is entirely error-free?
Reading hand-written digits was the 'hello world' of AI well before LLMs came along. I know, because I did it well before LLMs came along.
Obviously a simple model itself can't know whether it's right or wrong, as per a quote from Wittgenstein:
> If there were a verb meaning 'to believe falsely', it would not have any significant first person, present indicative.
That said, IMO it's not impossible (as Wittgenstein seemed to be claiming), as at the very least human brains are not single monolithic slabs of logic: https://www.lesswrong.com/posts/CFbStXa6Azbh3z9gq/wittgenste...
In the case of software, whatever system surrounds this unit of machine classification (be it scripts or more ML) can know how accurately the unit classifies things under certain conditions. In my own MNIST-hello-world example, I split the data into a training set and a test set; the test set tells you (roughly!) how good the training was. While that still won't tell you whether any given answer is wrong, it will tell you how many of those 40 million are probably wrong.
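A rough sketch of that last point, using sklearn's small digits dataset as a stand-in for MNIST: the held-out error rate won't flag any individual mistake, but it does estimate how many of a large batch of readings are probably wrong.

    # Sketch: held-out accuracy as an estimate of how many of N readings
    # are probably wrong. sklearn's digits dataset stands in for MNIST.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    error_rate = 1.0 - clf.score(X_te, y_te)

    N = 40_000_000  # the "40 million" readings mentioned above
    print(f"estimated error rate: {error_rate:.3%}")
    print(f"expected wrong readings out of {N:,}: about {int(error_rate * N):,}")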
Humans and complex AI can, in principle, know their own uncertainty. For example, I currently estimate my knowledge of physics to be around the level of a first-year undergraduate, because I have looked at what gets studied in the first year and at some past papers, and most of it is not surprising (just don't ask me which one is a kaon and which one is a pion).
Unfortunately "capable" doesn't mean "good", and indeed humans are also pretty bad at this, the general example is Dunning Kruger, and my personal experience of that from the inside is that I've spent the last 7.5 years living in Germany, and at all points I've been sure (with evidence, even!) that my German is around B1 level, and yet it has also been the case that with each passing year my grasp of the language has improved, so what I'm really sure of is that I was wrong 7 years ago, but I don't know if I still am or not, and will only find out at the end of next month when I get the results of an exam I have yet to sit.
Funny you say that.
As for why it’s featured on HN, do you think it’s less important than the politics of the Nobel Peace Prize story at the top of the front page at the moment?
The writing is on the wall that AI will continue to dominate discussion on here for the foreseeable future (yawn...)
It's time for me to move to a space where the AI bros are not, wherever that may be. I hope to see some of you there!
WD-42•2w ago
> Cheeseman finds Claude consistently catches things he missed. “Every time I go through I’m like, I didn’t notice that one! And in each case, these are discoveries that we can understand and verify,” he says.
Pretty vague and not really quantifiable. You would think an article making a bold claim would contain more than a single, hand-wavy quote from an actual scientist.
famouswaffles•2w ago
Why? What purpose would quotes serve better than a paper with numbers and code? Just seems like nitpicking here. The article could have gone without a single quote (or had several more) and it wouldn't really change anything. And that quote is not really vague in the context of the article.
catlifeonmars•2w ago
Even if the article is accurate, it still makes sense to question the motives of the publisher. Especially if they’re selling a product.
simonw•2w ago
Citation needed?
Closest I've seen to that was Dario saying AI would write 90% of the code, but that's very different from declaring the death of software development as an occupation.
throw234234234•2w ago
AI engineers aren't actually SWEs per se; they use code, but they see it as tedious non-core work, IMO. They are happy to automate their complement and rise in status vs SWEs, who typically, before all of this, had more employment opportunities and more practical ways to show value.
signatoremo•2w ago
Saying what? This describes three projects that use an Anthropic product. Do you need a third party to confirm that? Or do you need someone to tell you whether they are legit?
There are hundreds of announcements by vendors on HN. Did you object to them all, or only to the ones that go against your own beliefs?