Source: One of the most classic internet websites, zombo.com (sound on)
Edit: notice that I said "100%, or nearly so". I realize that 100% is an unrealistic metric for an LLM, but come on, the robots should be at least as competent as the humans they replace, and ideally much more so.
Of course not.
This idea that AI should be correct 100% of the time is like expecting autonomous vehicles to have a 0% crash rate to be considered successful. It is just a coping metric that allows humans to feel superior. In reality, autonomous vehicles already outperform humans in terms of crash rates.
I am pleasantly amused to see it’s the cutting edge tech bros shaking their fist at the young LLMs on their lawn. We got free intern-quality work. Take the win.
Breaking past 80% accuracy and solving the remaining 20% of problems will be the main challenge for next-gen (or next-next-gen) LLMs, not to mention that they still need to bring computing costs down.
EDIT: that said, solving 80% of problems with 80% accuracy while saving significant time is a solution worth considering, though we should stay skeptical, because the remaining 20% may get much harder if the 80% that was solved is of poor quality.
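Taken at face value, "solving 80% of problems with 80% accuracy" compounds to well under 80% of tasks done correctly. A trivial back-of-the-envelope sketch (my reading of the numbers, not necessarily the commenter's):

```python
# Back-of-the-envelope: if an LLM can attempt 80% of problems and is correct
# 80% of the time on those, the overall fraction of tasks solved correctly is
# the product of the two rates. Numbers are illustrative only.
coverage = 0.80   # fraction of problems the model can attempt at all
accuracy = 0.80   # correctness rate on the problems it does attempt
overall = coverage * accuracy
print(f"{overall:.2f}")  # 0.64 -- under two-thirds of tasks end up correct
```

Which is exactly why the quality of the "solved" 80% matters so much: the headline rates multiply, they don't average.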
Then, in a truly genius stroke of AI science, the current article extrapolates this to infinity and beyond, while hand-waving away the problem of “messiness”, which clearly calls the extrapolation into question:
> At the heart of the METR work is a metric the researchers devised called “task-completion time horizon.” It’s the amount of time human programmers would take, on average, to do a task that an LLM can complete with some specified degree of reliability, such as 50 percent. A plot of this metric for some general-purpose LLMs going back several years [main illustration at top] shows clear exponential growth, with a doubling period of about seven months. The researchers also considered the “messiness” factor of the tasks, with “messy” tasks being those that more resembled ones in the “real world,” according to METR researcher Megan Kinniment. Messier tasks were more challenging for LLMs [smaller chart, above]
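To see what the quoted extrapolation implies, here is a minimal sketch of the doubling-period math, with a hypothetical starting horizon (the specific numbers are illustrative, not METR's):

```python
# Sketch of the exponential trend described in the quoted passage: the
# 50%-reliability task-completion time horizon doubling roughly every
# 7 months. The starting horizon of 1 hour is a made-up example.

def projected_horizon_hours(current_hours: float, months_ahead: float,
                            doubling_period_months: float = 7.0) -> float:
    """Naively project the time horizon forward, assuming the trend holds."""
    return current_hours * 2 ** (months_ahead / doubling_period_months)

# Starting from a 1-hour horizon, five years (60 months) out:
print(round(projected_horizon_hours(1.0, 60)))  # ~381 hours -- if the trend held
```

The criticism above is precisely that this "if the trend held" clause is doing all the work, with the "messiness" results suggesting it may not.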
“AI might write a decent novel by 2030”? Have you read the absolute dreck they produce today? An LLM will NEVER produce a decent novel, for the same reason it will never independently create a decent game or movie: It can’t read the novel, play the game, or watch the movie, have an emotional response to it, or gauge its entertainment value. It has no way to judge whether a work of art will have an emotional impact on its audience, dial in the art to enhance that impact, or make a statement that resonates with people. Only people can do that.
All in all, this article is unscientific, filled with hand-waving “and then a miracle occurs”, and meaningless graphs that in no way indicate that LLMs will undergo the kind of step change transformation needed to reliably and independently accomplish complex tasks this decade. The study authors themselves give the game away when they use “50% success rate” as the yardstick for an LLM. You know what we call a human with a 50% success rate in the professional world? Fired.
I don’t think it was responsible of IEEE to publish this article and I expect better from the organization.
I think it'll be possible to publish a "book" as a series of prompts, which the LLMs can expand out into the narrative story.
It's a novel you can chat with. The new novel for the post-LLM era is more like publishing the whole author... which you can then "interview" as an LLM (reminiscent of Harry Potter, when Ron's sister finds the evil journal and basically "chats" with the notebook).
That said, I will make no definitive statements like “never” and “can’t” as it relates to AI in the next 5 years because it is already doing things that I would have thought unlikely just 5 years ago…and frankly would have thought functionally impossible back 40 years ago.
I appreciate your unwillingness to say "never" here, but I think the parent comment deserves credit for calling out something important that rarely gets discussed: the importance of emotion for producing great art. This is one of the classic themes of Asimov's entire Robot oeuvre, which spends many books digging into the differences between (far more advanced) AI and actual human intelligence.
There are fundamental, definable, structural deficiencies that separate LLMs from human thought; it's plainly incorrect to pretend otherwise, and the...extrapolationists...are neglecting that we have no idea how to solve these problems.
That’s subjective though. An opinion I agree with, but still subjective.
I think it’s within the realm of possibility given the advances we have seen so far that a near future AI given enough input describing emotion could simulate it just enough for people to accept that it created a “decent” work. Likely undetectable by most people as AI created vs human created.
> we have no idea how to solve these problems.
Yet. Obviously, time and talent moved us from Eliza to ChatGPT/Gemini. Is it really so unlikely that time and talent can push us over the artificial-emotion precipice as well? I am not betting against it.
Predictions from the METR AI scaling graph are based on a flawed premise - https://news.ycombinator.com/item?id=43885051 - May 2025 (25 comments)
AI's Version of Moore's Law - https://news.ycombinator.com/item?id=43835146 - April 2025 (1 comment)
Forecaster reacts: METR's bombshell paper about AI acceleration - https://news.ycombinator.com/item?id=43758936 - April 2025 (74 comments)
Measuring AI Ability to Complete Long Tasks – METR - https://news.ycombinator.com/item?id=43423691 - March 2025 (1 comment)
Does that tell us anything useful? No. They're LLMs, not chess engines, "word count" software, or a game of hangman*. You might as well add "make a sandwich" to the list of tasks.
Also, 50% is the bar? In most jobs, trainees only start actually being worth their wage once they reach about 99%; anything below that wastes the time of someone more competent.
I wonder how much money is being collectively wasted on trying to shove LLMs into areas where you'd really need AGI, rather than focusing resources on improving LLMs for those areas where they're actually useful.
* Though I do recommend attempting to play hangman with an LLM. It's highly entertaining.
Similarly, while I’m sure you could make good progress on starting a business in a month, it seems like that would take longer to genuinely complete from start to finish. Also, it seems like it’s necessarily a task that relies on external factors: Waiting for approval to come from various agencies, hiring employees, waiting for other parties to sign contracts, etc.
That is nothing. "git clone" can, with 100% reliability, "complete" tasks in a minute that take over 1,000,000 man hours. It even keeps the license.
It is a shame the IEEE now promotes this theft.