Not to say the LLM is more intelligent or better at coding, but computer science is an incredibly broad field (like chemistry). There's simply so much to know that the LLM has an inherent advantage: it can be trained on huge amounts of generalized knowledge far faster than a human can learn.
Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep: enormous breadth, but little depth. A human who specializes in a field is almost always going to outperform an LLM in that field, at least for the moment.
Not only this, but they're surprisingly talented at reading compiled binaries in a dozen different machine codes and bytecodes. I have seen one one-shot an applet rewrite from compiled Java bytecode to modern JavaScript.
Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations.
A long time ago my OH was introduced to someone who claimed "to speak seven languages fluently".
Her response at the time was "Do they have anything interesting to say in any of them?"
Accepted 26 March 2025
Published 20 May 2025
Probably normal, but it shows the built-in obsolescence of the peer-reviewed journal article model in such a fast-moving field.
To me it looks like the paper was submitted last year, but the peer reviewers identified issues that required revision before the final acceptance in March.
We can see the paper was updated since the 1 April 2024 version as it includes o1-preview (released September 2024, I believe), and GPT‑3.5 Turbo from August. I think a couple of other tested versions also post-date 1 April.
Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems and didn't reflect the full diversity of available tools.
In any case, the main point of the paper was not the specific results of AI models available by the end of last year, but the development of a benchmark which can be used to evaluate models in general.
How has that work been made obsolete?
(Though even these obsolete models did better than the best humans and domain experts).
Good benchmark development is hard work. The paper goes into the details of how it was carried out.
Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end versions, and measure how the performance has changed over time.
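For example, a minimal evaluation harness might look something like the sketch below. To be clear, this is my own illustration, not the paper's pipeline: the JSONL file name, field names, model name, and the naive substring scoring are all placeholders.

```python
import json

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_model(question: str, model: str = "gpt-4o") -> str:
    """Send one benchmark question to a chat model and return its raw answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return (response.choices[0].message.content or "").strip()


def evaluate(benchmark_path: str, model: str) -> float:
    """Score a model on a JSONL benchmark of {"question", "answer"} records.

    Scoring here is a naive substring match, just to illustrate the loop;
    a real harness would parse and grade answers properly.
    """
    correct = total = 0
    with open(benchmark_path) as f:
        for line in f:
            item = json.loads(line)
            prediction = ask_model(item["question"], model=model)
            correct += int(item["answer"].lower() in prediction.lower())
            total += 1
    return correct / total


if __name__ == "__main__":
    # "chembench_sample.jsonl" is a hypothetical export of the benchmark questions.
    print(evaluate("chembench_sample.jsonl", model="gpt-4o"))
```

Rerun the same loop against each new model release and you get exactly the kind of longitudinal comparison the benchmark was built to support.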
You could also use their paper to help understand how to develop a new benchmark, perhaps one that overcomes some limitations of this one.
That benchmark and the contents of that paper are not obsolete until there is a better benchmark and a better description of how to build benchmarks.
I'm also not sure it's a fair comparison to average human results like that. If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know that much about neurology and vice versa. That seems to be what they did here.
Somebody with a master's degree and 5 years of work experience will likely know more than a freshly graduated PhD.
How does that qualify them as "domain experts"? What is their domain of expertise? All of chemistry?
This is all highly academic, and I'm highly industrial, so take this with a grain of salt. Sodium salt or otherwise, your choice ;)
If you want things to be accomplished at the bench, you want any simulation to be built by people who haven't been away from the bench for too many decades :)
Same thing with the industrial environment: some people have just been away from it for too long, regardless of how much familiarity they once had. You need to brush up; sometimes the same plant is like a whole different world if you haven't been back in a while.