Not to say the LLM is more intelligent or better at coding, but that computer science is an incredibly broad field (like chemistry). There's simply so much to know that the LLM has an inherent advantage. It can be trained with huge amounts of generalized knowledge far faster than a human can learn.
Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep. It has vast quantities of knowledge, but lacks depth. A human that specializes in a field is almost always going to outperform an LLM in that field, at least for the moment.
Not only this, but they're surprisingly talented at reading compiled binaries in a dozen different machine codes and bytecodes. I have seen one one-shot an applet rewrite from compiled Java bytecode to modern JavaScript.
file(1) can't program in Brainfuck while doing basic binary analysis.
Binwalk and Unicorn can't do that either. And they can't write to you in multiple natural languages either.
LLMs are at their best when the context capacity of the human is stretched and the task doesn't really take any reasoning but requires an extraction of some basic, common pattern.
That’s the very reason we built computers. If an LLM did not also meet this definition, there would be no point in it existing.
Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations.
A long time ago my OH was introduced to someone who claimed "to speak seven languages fluently".
Her response at the time was "Do they have anything interesting to say in any of them?"
Other half? I've never seen this acronym before.
I wonder how acronyms such as OTOH even become so well known that they can be used without fear of not being understood? When is that threshold reached? Is using OH now the beginning of a new well-known acronym? I guess only time will tell...
And to answer the question - the threshold is when people stop complaining about the use :)
It also isn't an exact synonym of "conversely".
I've been an extensive internet user for decades and I don't have it in memory, so I'm not sure how to feel about your assertion. I'm not the only person saying this.
I'm sure that depends on the tolerance. "assist" and "help"? "dog" and "canine"? "purchase" and "buy"?
What a ridiculous assumption.
Maybe they consider themselves and their partner to be equal halves of a whole. You know, the definition of half.
As a non native speaker you'll probably just feel upset/hopeless/angry.
From my experience, "non-native" here includes people who are "fluent".
So we arrive at the situation where my OH-so-beloved wife is fluent in English and is definitely better than me at writing clearly constructed English essays, but when it comes to the usage of random idioms/slang or understanding local (and foreign!) English accents, I have a very clear advantage.
But do they understand it? I mean, a child can use swear words, but does it understand the meaning of the swear words? In another comment, somebody's OH also mentioned the artistic ability and utility of the words spoken.
It doesn't matter to my employment prospects if the AI "understands" or "thinks", whatever is meant by that, but rather whether potential employers reckon it's good enough to not bother employing me.
Accepted 26 March 2025
Published 20 May 2025
Probably normal, but it shows the built-in obsolescence of the peer-reviewed journal article model in such a fast-moving field.
To me it looks like the paper was submitted last year but the peer reviewers identified issues with the paper which required revision before the final acceptance in March.
We can see the paper was updated since the 1 April 2024 version as it includes o1-preview (released September 2024, I believe), and GPT‑3.5 Turbo from August. I think a couple of other tested versions also post-date 1 April.
Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems, and didn't reflect the full diversity of available tools.
In any case, the main point of the paper was not the specific results of AI models available by the end of last year, but the development of a benchmark which can be used to evaluate models in general.
How has that work been made obsolete?
(Though even these obsolete models did better than the best humans and domain experts).
Good benchmark development is hard work. The paper goes into the details of how it was carried out.
Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end versions, and measure how the performance has changed over time.
You could also use their paper to help understand how to develop a new benchmark, perhaps to overcome some limitations in the benchmark.
That benchmark and the contents of that paper are not obsolete until there is a better benchmark and description of how to build benchmarks.
I'm also not sure it's a fair comparison to average human results like that. If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know that much about neurology and vice-versa. This is what they did here, it seems.
Somebody with a master's degree and 5 years of work experience will likely know more than a freshly graduated PhD.
How does that qualify them as "domain experts"? What domain is their expertise? All of chemistry?
Seriously, viewing LLMs as a cultural technology casts them as a super-interactive indexing system. I find that's a useful lens for understanding this kind of study.
This was neat to see but also raised some eyebrows from me. A clever kid with some pharmacology knowledge and basic organic chemistry understanding could get up to no good.
Especially since you can ask the model to use commonly available reagents + precursors and for synthesis routes that use the least amount of equipment and glassware.
Besides, the books PiHKAL and TiHKAL lay out how to make most psychoactive substances, and those books have been online for free for decades now.[1][2] Maybe there are easier routes and easier-to-acquire precursor recipes, but I doubt those would be hard to find. The hardest part by far is the chemistry intuition.
[1] https://erowid.org/library/books_online/pihkal/pihkal.shtml [2] https://erowid.org/library/books_online/tihkal/tihkal.shtml
There are various "one-pot" techniques for certain compounds if one is sufficiently clever.
For example, a certain cathinone can be produced by combining ephedrine/pseudoephedrine with a household product that reduces secondary alcohols to ketones and letting it sit.
Or Extractions & Ire, along with his other channel Explosions & Fire: a PhD student trying to do chemistry in his shed, literally, using stuff you can get from a well-stocked hardware store or such.
Often the steps seem straightforward, but there are details in the papers that are not covered, or the contaminants from using some brand household product rather than a pure source screw it up.
Still, his videos are usually quite entertaining regardless of results.
I’m a chemist and I asked it to show me the structure of a common molecule and it kept getting it really wrong.
Most chemists will begin to develop an intuition. This is where the issues develop.
This intuition is a combination of the chemist's mental model and how the sensory environment stimulates it. As a polymer chemist in a certain system, maybe brown means I see scattering, hence particles. My system is supposed to be homogeneous, so I bin the reaction.
It's well known that good grades don’t make good researchers. That’s because researchers aren’t doing rote recall.
So the issue is this: we ask the LLM how many proton environments are in this NMR?
We should ask: I’m intercalating Li into a perovskite using BuLi. Why does the solution turn pink?
All of that is to say that I don't think the classic engineering fields have some kind of knowledge or intuition that is truly inaccessible to LLMs, I just think that it is in a form that is too difficult right now to train on. However if you could train a model on them, I strongly suspect they would get to the same level they are at today with software.
Are they? Last time I checked (a couple of seconds ago), they still made silly mistakes and hallucinated wildly.
Example: https://imgur.com/a/Cj2y8km (AI teaching me about the Coltrane operator, that obviously does not exist).
Even 2.5 flash easily gets this https://imgur.com/a/OfW30eL
For example, 2.5 Flash fails to explain the difference between the short ternary operator (null coalescing) and the Elvis operator.
Even when I specify a language (therefore clearing up the confusion, supposedly), it still fails to even recognize the Elvis operator by its toupee, and mixes up the explanation (it doesn't even understand what I asked).
So, the point I'm trying to make is that they're not any better for programming than they are for chemistry.
When it failed, I replied: "in PHP".
You don't seem to understand what I'm trying to say and instead are trying to defend LLMs against a fault that is well known in the industry at large.
I'm sure that in short time, I could make 2.5 Pro hallucinate as well. If not on this question, on others.
This behavior is in line with the paper's conclusions:
> many models are not able to reliably estimate their own limitations.
(see Figure 3, they tested a variety of models of different qualities).
This is the kind of question a junior developer can answer with simple google searches, or by reading the PHP manual, or just by testing it on a REPL. Why do we need a fancy model in order to answer such a simple inquiry? Would a beginner know that the answer is incorrect and he should use a different model?
Also, from the paper:
> For very relevant topics, the answers that models provide are wrong.
> Given that the models outperformed the average human in our study, we need to rethink how we teach and examine chemistry.
That's true for programming as well. It outperforms the average human, but then it makes silly mistakes that could confuse beginners. It displays confidence in being plain wrong.
The study also used manually curated questions for evaluation, so my prompt is not some dirty trick. It's totally in line with the context of this discussion.
See if this looks any better (I don't know PHP): https://g.co/gemini/share/7849517fdb89
If it doesn't, what specifically is incorrect?
--
The JavaScript example should have mentioned the use of `||` (the OR operator) to achieve the same effect as a shorthand ternary. It's common knowledge.
In PHP specifically, `??` allows you to null coalesce array keys and other types of complex objects. You don't need to write `isset($arr[1]) ? $arr[1] : "ipsum"`, you can just `$arr[1] ?? "ipsum"`. TypeScript has it too and I would expect anyone answering about JavaScript to mention that, since it's highly relevant for the ecosystem.
Also in PHP, there is the `?:` that is similar to what `||` does in JavaScript in an assignment context, but due to type juggling, it can act as a null coalesce operator too (although not for arrays or complex types).
The PHP example they present, therefore, is plain wrong and would lead to a warning for trying to access an unset array key. Something that the `??` operator (not mentioned in the response) would solve.
I would go as far as explaining null-conditional accessors as well, `$foo?->bar` or `foo?.bar`. Those are often called Elvis operators colloquially and fall within the same overall problem-solving category.
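To make that concrete, here's a tiny sketch of my own (my illustration, not the model's output; assumes PHP 8+) covering the operators a complete answer should distinguish:

    <?php
    // Hypothetical example of my own, assuming PHP 8+ (not taken from the model's answer).
    $arr = ['lorem'];
    $obj = null;

    // Null coalescing (??): falls back only on null/unset, no warning for a missing key.
    $a = $arr[1] ?? 'ipsum';            // 'ipsum', no warning
    $b = '' ?? 'fallback';              // '' (empty string is not null)

    // Short ternary / Elvis (?:): falls back on any falsy value,
    // but $arr[1] ?: 'ipsum' would still emit an undefined-key warning first.
    $c = $arr[0] ?: 'ipsum';            // 'lorem'
    $d = '' ?: 'fallback';              // 'fallback'

    // Null-safe accessor (?->): short-circuits to null instead of erroring.
    $name = $obj?->name ?? 'anonymous'; // 'anonymous'

That's roughly the minimum I'd expect a correct explanation to cover.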
The LLM answer is a dangerous mix of incomplete and wrong. It could lead a beginner to adopt an old bad practice, or leave a beginner without a more thorough explanation. Worst of all, the LLM makes those mistakes with confidence.
--
What I think is going on is that null handling is such a basic task that programmers learn it in the first few years of their careers and almost never write about it. There's no need to. I'm sure a code-completion LLM can code using those operators effectively, but LLMs cannot talk about them consistently. They'll only get better at it if we get better at it, and we often don't need to write about it.
In this particular Elvis operator thing, there has been no significant improvement in the correctness of the answer in __more than 2 whole years__. Samples from ChatGPT in 2023 (note my image date): https://imgur.com/UztTTYQ https://imgur.com/nsqY2rH.
So, _for some things_, contrary to what you suggested before, LLMs are not getting that much better.
I don't agree that you can pick one cherry-picked example and use it to illustrate anything general about the progress of the models, though. There are far too many counterexamples to enumerate.
(Actually I suspect what will happen is that we'll change the way we write documentation to make it easy for LLMs to assimilate. I know I'm already doing that myself.)
Benchmarks and evaluations are made of cherry-picked examples. What makes my example invalid, and benchmark prompts valid? (It's a rhetorical question, you don't need to answer.)
> write documentation to make it easy for LLMs to assimilate.
If we ever do that, it means LLMs failed at their job. They are supposed to help and understand us, not the other way around.
If you buy into the whole AGI thing, I guess so, but I don't. We don't have a good definition of intelligence, so it's a meaningless question.
We do know how to make and use tools, though. And we know that all tools, especially the most powerful and/or hazardous ones, reward the work and care that we put into using them. Further, we know that tool use is a skill, and that some people are much better at it than others.
> What makes my example invalid, and benchmark prompts valid?
Your example is a valid case of something that doesn't work perfectly. We didn't exactly need to invent AI to come up with something that didn't work perfectly. I have examples of using LLMs to generate working, useful code in advanced, specialized disciplines, code that I frankly don't understand myself and couldn't have written without months of study, but that I can validate.
Just one of those examples is worth a thousand examples like yours, in my book. I can now do things that were simply impossible for me before. It would take some nerve to demand godlike perfection on top of that, or to demand useful results with little or no effort on my part.
It's the same principle. A tool is supposed to assist us, not the other way around.
An LLM, "AGI magic" or not, is supposed to write for me. It's a tool that writes for me. If I am writing for the tool, there's something wrong with it.
> I have examples [...] Just one of those examples is worth a thousand examples like yours
Please, share them. I shared my example. It can be a very small "bug report", but it's real and reproducible. Other people can build on it, either to improve their "tool skills" or to improve LLMs themselves.
An example that is shared is worth much more than an anecdote.
I started out brainstorming with o1-pro, trying to come up with ways to anticipate drift on multiple timescales, from multiple influences with differing lag times, and correct it using temperature trends measured a couple of inches away on a different component. It basically said, "Here, train this LSTM model to predict your drift observations from your observed temperature," and spewed out a bunch of cryptic-looking PyTorch code. It would have been familiar enough to an ML engineer, I'm sure, but it was pretty much Perl to me.
I was like, Okaaaaayyy....? but I tried it anyway, suggested hyperparameters and all, and it was a real road-to-Damascus moment. Again, I can't share the plots and they wouldn't make sense anyway without a lot of explanation, but the outcome of my initial tests was freakishly good.
Another model proved to be able to translate the Python to straight C for use by the onboard controller, which was no mean feat in itself (and also allowed me to review it myself), and now that problem is just gone. Basically for free. It was a ridiculous, silly thing to try, and it worked.
When this tech gets another 10x better, the customer won't need me anymore... and that is fucking awesome.
> It would have been familiar enough to an ML engineer, I'm sure, but it was pretty much Perl to me.
How can you be sure that the solution doesn't have obvious mistakes that an ML engineer would spot right away?
> When this tech gets another 10x better
A chainsaw is way better than a regular saw, but it's also more dangerous. Learning to use it can be fun. Learning not to cut your toes is also important.
I am looking for ways in which LLMs could potentially cut people's toes.
I know you don't want to hear that your favorite tool can backfire, and you're still skeptical despite having experienced the example I gave you firsthand. However, I was still hopeful that you could understand my point.
I can come up with prompts that make better models hallucinate (see post below).
I don't understand your objection. This is a known fact, LLMs hallucinate shit regardless of the model size.
Nothing matters in this business except the first couple of time derivatives.
However, I'm discussing this within the context of the study presented in the paper, not some future yet-to-be-achieved performance expectation.
If we step outside the context of the paper (not advised), I think any average developer is better than an LLM at energy efficiency. LLMs cheat by consuming more resources than a human. "Better" is quite relative. So, let's be reasonable.
https://chatgpt.com/share/685041db-c324-800b-afc6-5cb2c5ef31...
I would say odds are it's because of an impurity. My first guess might be the solvent if there is more in action than reagents or reactants. Maybe it could be confirmed or ruled out by some carefully figured filtration beforehand, which might not even be that difficult. I doubt I would try much further than that unless it was a bad problem.
Although for instance an alternate simple purification like distillation is pretty much routine for pure aniline to get some colorless material, and that's some pretty rough stuff to handle.
Now, I was once a young chemist facing AI. I ended up highly focused on going forward in ways that would not be "taken over" by AI, and I knew I couldn't be slow or the recession still might catch up with me, plus the 1990's were approaching fast ;)
By the mid 1990's I figured there's no way the stuff they have in this paper had not been well investigated.
I always knew it would take people that had way more megabytes than I could afford.
Sheesh, did I overestimate the progress people were making when I wasn't looking.
Is this a documentation problem? The LLMs are only trained on what is written down. Seems to track with another comment further down quoting:
"Models are limited in ability to answer knowledge-intensive questions, probably because the required knowledge cannot easily be accessed via papers but rather by lookup in specialized databases, which the humans used to answer such questions"
> [...] our analysis shows [...] performance of models is correlated with [...] size [...]. This [...] also indicates that chemical LLMs could, [...], be further improved by scaling them up.
Does that mean the world of chemists will be eaten by LLMs? Will LLMs just improve chemists' output or productivity? I'd be scared if this happened in my area of work.
Hopefully we'll see humans assisted by AI & induced demand for a good while, but the idea that people do knowledge work unassisted is going to go the way of artisan clothing.
This is all highly academic, and I'm highly industrial so take this with a grain of salt. Sodium salt or otherwise, your choice ;)
If you want things to be accomplished at the bench, you want any simulation to be made by those who have not been away from the bench for that many decades :)
Same thing with the industrial environment, some people have just been away from it for too long regardless of how much familiarity they once had. You need to brush up, sometimes the same plant is like a whole different world if you haven't been back in a while.