Not to say the LLM is more intelligent or better at coding, but that computer science is an incredibly broad field (like chemistry). There's simply so much to know that the LLM has an inherent advantage. It can be trained with huge amounts of generalized knowledge far faster than a human can learn.
Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep. It has vast quantities of knowledge, but lacks depth. A human that specializes in a field is almost always going to outperform an LLM in that field, at least for the moment.
Not only this, but they're surprisingly talented at reading compiled binaries in a dozen different machine codes and bytecodes. I have seen one one-shot an applet rewrite from compiled Java bytecode to modern JavaScript.
file(1) can't program in Brainfuck while doing basic binary analysis.
Binwalk and Unicorn can't do that either. And they can't write to you in multiple natural languages either.
LLMs are at their best when the context capacity of the human is stretched and the task doesn't really take any reasoning but requires an extraction of some basic, common pattern.
That’s the very reason we built computers. If an LLM did not also meet this definition, there would be no point in it existing.
Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations.
A long time ago my OH was introduced to someone who claimed "to speak seven languages fluently".
Her response at the time was "Do they have anything interesting to say in any of them?"
Other half? I've never seen this acronym before.
I wonder how acronyms such as OTOH even become so well known that they can be used without fear of not being understood? When is that threshold reached? Is using OH now the beginning of a new well-known acronym? I guess only time will tell...
And to answer the question - the threshold is when people stop complaining about the use :)
It also isn't an exact synonym of "conversely".
I've been an extensive internet user for decades and I don't have it in memory, so I'm not sure how to feel about your assertion. I'm not the only person saying this.
I'm sure that depends on the tolerance. "assist" and "help"? "dog" and "canine"? "purchase" and "buy"?
What a ridiculous assumption.
Maybe they consider themselves and their partner to be equal halves of a whole. You know, the definition of half.
As a non native speaker you'll probably just feel upset/hopeless/angry.
From my experience, "non-native" here includes people who are "fluent".
So we arrive at the situation where my OH-so-beloved wife is fluent in English and is definitely better than me at writing clearly constructed English essays, but when it comes to the usage of random idioms/slang or understanding local (and foreign!) English accents, I have a very clear advantage.
But do they understand it? I mean, a child can use swear words, but does it understand the meaning of the swear words? In another comment, somebody's OH also mentioned the artistic ability and utility of the words spoken.
It doesn't matter to my employment prospects if the AI "understands" or "thinks", whatever is meant by that, but rather whether potential employers reckon it's good enough to not bother employing me.
Accepted 26 March 2025
Published 20 May 2025
Probably normal, but it shows the built-in obsolescence of the peer-reviewed journal article model in such a fast-moving field.
To me it looks like the paper was submitted last year but the peer reviewers identified issues with the paper which required revision before the final acceptance in March.
We can see the paper was updated since the 1 April 2024 version as it includes o1-preview (released September 2024, I believe), and GPT‑3.5 Turbo from August. I think a couple of other tested versions also post-date 1 April.
Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems, and didn't reflect the full diversity of available tools.
In any case, the main point of the paper was not the specific results of AI models available by the end of last year, but the development of a benchmark which can be used to evaluate models in general.
How has that work been made obsolete?
(Though even these obsolete models did better than the best humans and domain experts).
Good benchmark development is hard work. The paper goes into the details of how it was carried out.
Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end versions, and measure how the performance has changed over time.
You could also use their paper to help understand how to develop a new benchmark, perhaps to overcome some limitations in the benchmark.
That benchmark and the contents of that paper are not obsolete until there is a better benchmark and description of how to build benchmarks.
I'm also not sure it's a fair comparison to average human results like that. If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know that much about neurology and vice-versa. This is what they did here, it seems.
Somebody with a master's degree and 5 years of work experience will likely know more than a freshly graduated PhD.
How does that qualify them as "domain experts"? What domain is their expertise? All of chemistry?
Seriously, viewing LLMs as a cultural technology casts them as a super-interactive indexing system. I find that's a useful lens for understanding this kind of study.
This was neat to see but also raised some eyebrows from me. A clever kid with some pharmacology knowledge and basic organic chemistry understanding could get up to no good.
Especially since you can ask the model to use commonly available reagents + precursors and for synthesis routes that use the least amount of equipment and glassware.
Besides, the books PiHKAL and TiHKAL lay out how to make most psychoactive substances, and those books have been online for free for decades now.[1][2] Maybe there are easier routes and easier-to-acquire precursor recipes, but I doubt those would be hard to find. The hardest part by far is the chemistry intuition.
[1] https://erowid.org/library/books_online/pihkal/pihkal.shtml [2] https://erowid.org/library/books_online/tihkal/tihkal.shtml
There are various "one-pot" techniques for certain compounds if one is sufficiently clever.
For example, a certain cathinone can be produced by combining ephedrine/pseudoephedrine with a household product that reduces secondary alcohols to ketones and letting it sit.
Or Extractions & Ire, along with his other channel Explosions & Fire: a PhD student trying to do chemistry in his shed, literally, using stuff you can get from a well-stocked hardware store or such.
Often the steps seem straightforward, but there are details in the papers that are not covered, or the contaminants from using some brand household product rather than a pure source screw it up.
Still, his videos are usually quite entertaining regardless of results.
I’m a chemist and I asked it to show me the structure of a common molecule and it kept getting it really wrong.
Most chemists will begin to develop an intuition. This is where the issues develop.
This intuition is a combination of the chemist's mental model and how the sensory environment stimulates it. As a polymer chemist in a certain system, maybe brown means I see scattering, hence particles. My system is supposed to be homogeneous, so I bin the reaction.
It's well known that good grades don’t make good researchers. That’s because researchers aren’t doing rote recall.
So the issue is this: we ask the LLM how many proton environments are in this NMR?
We should ask: I’m intercalating Li into a perovskite using BuLi. Why does the solution turn pink?
All of that is to say that I don't think the classic engineering fields have some kind of knowledge or intuition that is truly inaccessible to LLMs, I just think that it is in a form that is too difficult right now to train on. However if you could train a model on them, I strongly suspect they would get to the same level they are at today with software.
Are they? Last time I checked (a couple of seconds ago), they still made silly mistakes and hallucinated wildly.
Example: https://imgur.com/a/Cj2y8km (AI teaching me about the Coltrane operator, that obviously does not exist).
Even 2.5 flash easily gets this https://imgur.com/a/OfW30eL
For example, 2.5 Flash fails to explain the difference between the short ternary operator (null coalescing) and the Elvis operator.
Even when I specify a language (therefore clearing up the confusion, supposedly), it still fails to even recognize the Elvis operator by its toupee, and mixes up the explanation (it doesn't even understand what I asked).
So, the point I'm trying to make is that they're not any better for programming than they are for chemistry.
When it failed, I replied: "in PHP".
You don't seem to understand what I'm trying to say and instead are trying to defend LLMs against a fault that is well known in the industry at large.
I'm sure that in short time, I could make 2.5 Pro hallucinate as well. If not on this question, on others.
This behavior is in line with the paper's conclusions:
> many models are not able to reliably estimate their own limitations.
(see Figure 3, they tested a variety of models of different qualities).
This is the kind of question a junior developer can answer with simple google searches, or by reading the PHP manual, or just by testing it on a REPL. Why do we need a fancy model in order to answer such a simple inquiry? Would a beginner know that the answer is incorrect and he should use a different model?
Also, from the paper:
> For very relevant topics, the answers that models provide are wrong.
> Given that the models outperformed the average human in our study, we need to rethink how we teach and examine chemistry.
That's true for programming as well. It outperforms the average human, but then it makes silly mistakes that could confuse beginners. It displays confidence in being plain wrong.
The study also used manually curated questions for evaluation, so my prompt is not some dirty trick. It's totally in line with the context of this discussion.
See if this looks any better (I don't know PHP): https://g.co/gemini/share/7849517fdb89
If it doesn't, what specifically is incorrect?
--
The JavaScript example should have mentioned the use of `||` (the OR operator) to achieve the same effect as a shorthand ternary. It's common knowledge.
In PHP specifically, `??` allows you to null coalesce array keys and other types of complex objects. You don't need to write `isset($arr[1]) ? $arr[1] : "ipsum"`, you can just `$arr[1] ?? "ipsum"`. TypeScript has it too and I would expect anyone answering about JavaScript to mention that, since it's highly relevant for the ecosystem.
Also in PHP, there is the `?:` that is similar to what `||` does in JavaScript in an assignment context, but due to type juggling, it can act as a null coalesce operator too (although not for arrays or complex types).
The PHP example they present, therefore, is plain wrong and would lead to a warning for trying to access an unset array key. Something that the `??` operator (not mentioned in the response) would solve.
I would go as far as explaining null-conditional accessors as well, `$foo?->bar` or `foo?.bar`. Those are often called Elvis operators colloquially and fall within the same overall problem-solving category.
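To make that concrete, here's a tiny sketch of my own (my illustration, not the model's output; assumes PHP 8+) covering the operators a complete answer should distinguish:

    <?php
    // Hypothetical example of my own, assuming PHP 8+ (not taken from the model's answer).
    $arr = ['lorem'];
    $obj = null;

    // Null coalescing (??): falls back only on null/unset, no warning for a missing key.
    $a = $arr[1] ?? 'ipsum';            // 'ipsum', no warning
    $b = '' ?? 'fallback';              // '' (empty string is not null)

    // Short ternary / Elvis (?:): falls back on any falsy value,
    // but $arr[1] ?: 'ipsum' would still emit an undefined-key warning first.
    $c = $arr[0] ?: 'ipsum';            // 'lorem'
    $d = '' ?: 'fallback';              // 'fallback'

    // Null-safe accessor (?->): short-circuits to null instead of erroring.
    $name = $obj?->name ?? 'anonymous'; // 'anonymous'

That's roughly the minimum I'd expect a correct explanation to cover.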
The LLM answer is a dangerous mix of incomplete and wrong. It could lead a beginner to adopt an old bad practice, or leave a beginner without a more thorough explanation. Worst of all, the LLM makes those mistakes with confidence.
--
What I think is going on is that null handling is such a basic task that programmers learn it in the first few years of their careers and almost never write about it. There's no need to. I'm sure a code-completion LLM can code using those operators effectively, but LLMs cannot talk about them consistently. They'll only get better at it if we get better at it, and we often don't need to write about it.
In this particular Elvis operator thing, there has been no significant improvement in the correctness of the answer in __more than 2 whole years__. Samples from ChatGPT in 2023 (note my image date): https://imgur.com/UztTTYQ https://imgur.com/nsqY2rH.
So, _for some things_, contrary to what you suggested before, LLMs are not getting that much better.
I don't agree that you can pick one cherry-picked example and use it to illustrate anything general about the progress of the models, though. There are far too many counterexamples to enumerate.
(Actually I suspect what will happen is that we'll change the way we write documentation to make it easy for LLMs to assimilate. I know I'm already doing that myself.)
Benchmarks and evaluations are made of cherry-picked examples. What makes my example invalid, and benchmark prompts valid? (It's a rhetorical question, you don't need to answer.)
> write documentation to make it easy for LLMs to assimilate.
If we ever do that, it means LLMs failed at their job. They are supposed to help and understand us, not the other way around.
If you buy into the whole AGI thing, I guess so, but I don't. We don't have a good definition of intelligence, so it's a meaningless question.
We do know how to make and use tools, though. And we know that all tools, especially the most powerful and/or hazardous ones, reward the work and care that we put into using them. Further, we know that tool use is a skill, and that some people are much better at it than others.
> What makes my example invalid, and benchmark prompts valid?
Your example is a valid case of something that doesn't work perfectly. We didn't exactly need to invent AI to come up with something that didn't work perfectly. I have examples of using LLMs to generate working, useful code in advanced, specialized disciplines, code that I frankly don't understand myself and couldn't have written without months of study, but that I can validate.
Just one of those examples is worth a thousand examples like yours, in my book. I can now do things that were simply impossible for me before. It would take some nerve to demand godlike perfection on top of that, or to demand useful results with little or no effort on my part.
It's the same principle. A tool is supposed to assist us, not the other way around.
An LLM, "AGI magic" or not, is supposed to write for me. It's a tool that writes for me. If I am writing for the tool, there's something wrong with it.
> I have examples [...] Just one of those examples is worth a thousand examples like yours
Please, share them. I shared my example. It can be a very small "bug report", but it's real and reproducible. Other people can build on it, either to improve their "tool skills" or to improve LLMs themselves.
An example that is shared is worth much more than an anecdote.
I started out brainstorming with o1-pro, trying to come up with ways to anticipate drift on multiple timescales, from multiple influences with differing lag times, and correct it using temperature trends measured a couple of inches away on a different component. It basically said, "Here, train this LSTM model to predict your drift observations from your observed temperature," and spewed out a bunch of cryptic-looking PyTorch code. It would have been familiar enough to an ML engineer, I'm sure, but it was pretty much Perl to me.
I was like, Okaaaaayyy....? but I tried it anyway, suggested hyperparameters and all, and it was a real road-to-Damascus moment. Again, I can't share the plots and they wouldn't make sense anyway without a lot of explanation, but the outcome of my initial tests was freakishly good.
Another model proved to be able to translate the Python to straight C for use by the onboard controller, which was no mean feat in itself (and also allowed me to review it myself), and now that problem is just gone. Basically for free. It was a ridiculous, silly thing to try, and it worked.
When this tech gets another 10x better, the customer won't need me anymore... and that is fucking awesome.
> It would have been familiar enough to an ML engineer, I'm sure, but it was pretty much Perl to me.
How can you be sure that the solution doesn't have obvious mistakes that an ML engineer would spot right away?
> When this tech gets another 10x better
A chainsaw is way better than a regular saw, but it's also more dangerous. Learning to use it can be fun. Learning not to cut your toes is also important.
I am looking for ways in which LLMs could potentially cut people's toes.
I know you don't want to hear that your favorite tool can backfire, and you're still skeptical despite having experienced the example I gave you firsthand. However, I was still hopeful that you could understand my point.
I can come up with prompts that make better models hallucinate (see post below).
I don't understand your objection. This is a known fact, LLMs hallucinate shit regardless of the model size.
Nothing matters in this business except the first couple of time derivatives.
However, I'm discussing this within the context of the study presented in the paper, not some future yet-to-be-achieved performance expectation.
If we step outside the context of the paper (not advised), I think any average developer is better than an LLM at energy efficiency. LLMs cheat by consuming more resources than a human. "Better" is quite relative. So, let's be reasonable.
https://chatgpt.com/share/685041db-c324-800b-afc6-5cb2c5ef31...
I would say odds are it's because of an impurity. My first guess might be the solvent if there is more in action than reagents or reactants. Maybe it could be confirmed or ruled out by some carefully figured filtration beforehand, which might not even be that difficult. I doubt I would try much further than that unless it was a bad problem.
Although for instance an alternate simple purification like distillation is pretty much routine for pure aniline to get some colorless material, and that's some pretty rough stuff to handle.
Now, I was once a young chemist facing AI. I ended up highly focused on going forward in ways that would not be "taken over" by AI, and I knew I couldn't be slow or the recession still might catch up with me, plus the 1990's were approaching fast ;)
By the mid 1990's I figured there's no way the stuff they have in this paper had not been well investigated.
I always knew it would take people that had way more megabytes than I could afford.
Sheesh, did I overestimate the progress people were making when I wasn't looking.
Is this a documentation problem? The LLMs are only trained on what is written down. Seems to track with another comment further down quoting:
"Models are limited in ability to answer knowledge-intensive questions, probably because the required knowledge cannot easily be accessed via papers but rather by lookup in specialized databases, which the humans used to answer such questions"
> [...] our analysis shows [...] performance of models is correlated with [...] size [...]. This [...] also indicates that chemical LLMs could, [...], be further improved by scaling them up.
Does that mean the world of chemists will be eaten by LLMs? Will LLMs just improve chemists' output or productivity? I'd be scared if this happened in my area of work.
Hopefully we'll see humans assisted by AI & induced demand for a good while, but the idea that people do knowledge work unassisted is going to go the way of artisan clothing.
This is all highly academic, and I'm highly industrial so take this with a grain of salt. Sodium salt or otherwise, your choice ;)
If you want things to be accomplished at the bench, you want any simulation to be made by those who have not been away from the bench for that many decades :)
Same thing with the industrial environment, some people have just been away from it for too long regardless of how much familiarity they once had. You need to brush up, sometimes the same plant is like a whole different world if you haven't been back in a while.