Source: One of the most classic internet websites, zombo.com (sound on)
Edit: notice that I said "100%, or nearly so". I realize that 100% is an unrealistic metric for an LLM, but come on, the robots should be at least as competent as the humans they replace, and ideally much more so.
Of course not.
This idea that AI should be correct 100% of the time is like expecting autonomous vehicles to have a 0% crash rate to be considered successful. It is just a coping metric that allows humans to feel superior. In reality, autonomous vehicles already outperform humans in terms of crash rates.
I am pleasantly amused to see it’s the cutting edge tech bros shaking their fist at the young LLMs on their lawn. We got free intern-quality work. Take the win.
Breaking past 80% accuracy and solving the remaining 20% of problems will be the main challenge for next-gen (or next-next-gen) LLMs, not to mention that they still need to bring computing costs down.
EDIT: that said, solving 80% of problems with 80% accuracy while saving significant time is a solution worth considering, though we should stay skeptical, because the remaining 20% may get much harder if the 80% that was solved is of poor quality.
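Taken at face value, "solving 80% of problems with 80% accuracy" compounds to well under 80% of tasks done correctly. A trivial back-of-the-envelope sketch (my reading of the numbers, not necessarily the commenter's):

```python
# Back-of-the-envelope: if an LLM can attempt 80% of problems and is correct
# 80% of the time on those, the overall fraction of tasks solved correctly is
# the product of the two rates. Numbers are illustrative only.
coverage = 0.80   # fraction of problems the model can attempt at all
accuracy = 0.80   # correctness rate on the problems it does attempt
overall = coverage * accuracy
print(f"{overall:.2f}")  # 0.64 -- under two-thirds of tasks end up correct
```

Which is exactly why the quality of the "solved" 80% matters so much: the headline rates multiply, they don't average.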
Then, in a truly genius stroke of AI science, the current article extrapolates this to infinity and beyond, while hand-waving away the problem of “messiness”, which clearly calls the extrapolation into question:
> At the heart of the METR work is a metric the researchers devised called “task-completion time horizon.” It’s the amount of time human programmers would take, on average, to do a task that an LLM can complete with some specified degree of reliability, such as 50 percent. A plot of this metric for some general-purpose LLMs going back several years [main illustration at top] shows clear exponential growth, with a doubling period of about seven months. The researchers also considered the “messiness” factor of the tasks, with “messy” tasks being those that more resembled ones in the “real world,” according to METR researcher Megan Kinniment. Messier tasks were more challenging for LLMs [smaller chart, above]
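To see what the quoted extrapolation implies, here is a minimal sketch of the doubling-period math, with a hypothetical starting horizon (the specific numbers are illustrative, not METR's):

```python
# Sketch of the exponential trend described in the quoted passage: the
# 50%-reliability task-completion time horizon doubling roughly every
# 7 months. The starting horizon of 1 hour is a made-up example.

def projected_horizon_hours(current_hours: float, months_ahead: float,
                            doubling_period_months: float = 7.0) -> float:
    """Naively project the time horizon forward, assuming the trend holds."""
    return current_hours * 2 ** (months_ahead / doubling_period_months)

# Starting from a 1-hour horizon, five years (60 months) out:
print(round(projected_horizon_hours(1.0, 60)))  # ~381 hours -- if the trend held
```

The criticism above is precisely that this "if the trend held" clause is doing all the work, with the "messiness" results suggesting it may not.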
“AI might write a decent novel by 2030”? Have you read the absolute dreck they produce today? An LLM will NEVER produce a decent novel, for the same reason it will never independently create a decent game or movie: It can’t read the novel, play the game, or watch the movie, have an emotional response to it, or gauge its entertainment value. It has no way to judge whether a work of art will have an emotional impact on its audience, dial in the art to enhance that impact, or make a statement that resonates with people. Only people can do that.
All in all, this article is unscientific, filled with hand-waving “and then a miracle occurs”, and meaningless graphs that in no way indicate that LLMs will undergo the kind of step change transformation needed to reliably and independently accomplish complex tasks this decade. The study authors themselves give the game away when they use “50% success rate” as the yardstick for an LLM. You know what we call a human with a 50% success rate in the professional world? Fired.
I don’t think it was responsible of IEEE to publish this article and I expect better from the organization.
I think it'll be possible to publish a "book" as a series of prompts, which the LLMs can expand out into the narrative story.
It's a novel you can chat with. The new novel for the post-LLM era is more like publishing the whole author... which you can then "interview" as an LLM (reminiscent of Harry Potter, when Ron's sister finds the evil journal and basically "chats" with the notebook).
That said, I will make no definitive statements like “never” and “can’t” as it relates to AI in the next 5 years because it is already doing things that I would have thought unlikely just 5 years ago…and frankly would have thought functionally impossible back 40 years ago.
I appreciate your unwillingness to say "never" here, but I think the parent comment deserves credit for calling out something important that rarely gets discussed: the importance of emotion for producing great art. This is one of the classic themes of Asimov's entire Robot oeuvre, which spends many books digging into the differences between (far more advanced) AI and actual human intelligence.
There are fundamental, definable, structural deficiencies that separate LLMs from human thought; it's plainly incorrect to pretend otherwise, and the...extrapolationists...are neglecting that we have no idea how to solve these problems.
That’s subjective though. An opinion I agree with, but still subjective.
I think it’s within the realm of possibility given the advances we have seen so far that a near future AI given enough input describing emotion could simulate it just enough for people to accept that it created a “decent” work. Likely undetectable by most people as AI created vs human created.
> we have no idea how to solve these problems.
Yet. Obviously, time and talent moved us from Eliza to ChatGPT/Gemini. Is it really so unlikely that time and talent can push us over the artificial-emotion precipice as well? I am not betting against it.
Predictions from the METR AI scaling graph are based on a flawed premise - https://news.ycombinator.com/item?id=43885051 - May 2025 (25 comments)
AI's Version of Moore's Law - https://news.ycombinator.com/item?id=43835146 - April 2025 (1 comment)
Forecaster reacts: METR's bombshell paper about AI acceleration - https://news.ycombinator.com/item?id=43758936 - April 2025 (74 comments)
Measuring AI Ability to Complete Long Tasks – METR - https://news.ycombinator.com/item?id=43423691 - March 2025 (1 comment)
Does that tell us anything useful? No. They're LLMs, not chess engines, "word count" software, or a game of hangman*. You might as well add "make a sandwich" to the list of tasks.
Also, 50% is the bar? In most jobs, trainees only start actually being worth their wage once they reach about 99%; anything below that wastes the time of someone more competent.
I wonder how much money is being collectively wasted on trying to shove LLMs into areas where you'd really need AGI, rather than focusing resources on improving LLMs for those areas where they're actually useful.
* Though I do recommend attempting to play hangman with an LLM. It's highly entertaining.
Similarly, while I’m sure you could make good progress on starting a business in a month, it seems like that would take longer to genuinely complete from start to finish. Also, it seems like it’s necessarily a task that relies on external factors: Waiting for approval to come from various agencies, hiring employees, waiting for other parties to sign contracts, etc.
That is nothing. "git clone" can, with 100% reliability, "complete" tasks in a minute that take over 1,000,000 man hours. It even keeps the license.
It is a shame the IEEE now promotes this theft.