Tell me you've never coded without telling me you've never coded.
I think the author meant it's easier to train (reasoning) LLMs on [coding] skills than on most other tasks. I agree with that. Data abundance, near-immediate feedback, and near-perfect simulators are why we've seen such rapid progress on most coding benchmarks so far.
I'm not sure if he included high-level software engineering skills such as designing the right software architecture for a given set of user requirements in that statement.
---
For humans, I think the fundamentals of coding are very natural and easy for people with certain mental traits, although that's obviously not the norm (which explains the high wages for some software engineers).
Coding on large, practical software systems is indeed much more complex with all the inherent and accidental complexity. The latter helps explain why AI agents for software engineering will require some human involvement until we actually reach full-fledged AGI.
LLMs are impressively good at automating small chunks of code generation, and can probably be made even more effective at it with CoT, agents and so on (adversarial LLMs generating code and writing unit tests, perhaps? Depending on how the compute cost of running many iterations of that starts to stack up).
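To make the "adversarial" idea concrete, here's a rough sketch of what that loop might look like. Note that call_llm is a hypothetical placeholder for whatever model API you'd use, not a real client, and the whole thing is illustrative rather than something I've benchmarked:

    import subprocess
    import tempfile
    from pathlib import Path

    def call_llm(prompt: str) -> str:
        """Hypothetical placeholder for an LLM call; swap in a real client."""
        raise NotImplementedError

    def adversarial_codegen(spec: str, max_rounds: int = 5) -> str | None:
        """One model writes code, another writes unit tests; iterate until the tests pass."""
        tests = call_llm(f"Write pytest unit tests for this spec:\n{spec}")
        code = call_llm(f"Write a Python module `solution.py` satisfying this spec:\n{spec}")
        for _ in range(max_rounds):
            with tempfile.TemporaryDirectory() as d:
                Path(d, "solution.py").write_text(code)
                Path(d, "test_solution.py").write_text(tests)
                result = subprocess.run(["pytest", "-q", d], capture_output=True, text=True)
            if result.returncode == 0:
                return code  # tests pass: accept this candidate
            # Feed the failure log back to the coder model and try again.
            code = call_llm(
                f"Spec:\n{spec}\n\nCurrent code:\n{code}\n\n"
                f"Failing test output:\n{result.stdout}\n\nFix the code."
            )
        return None  # budget exhausted; this is where the compute cost stacks up

Each failed round is another full model call (or several), so the cost question above is exactly the right one to ask.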
I'm still not convinced that they can do software engineering at all.
What?
E.g. he is a board member at Metaculus, an online forecasting platform. Also:
> On his free time, Peter makes money in prediction markets and is quickly becoming one of the top forecasters on Metaculus
https://theinsideview.ai/peter
> For certain projects Metaculus employs Pro Forecasters who have demonstrated excellent forecasting ability and who have a history of clearly describing their rationales.
He left Manifold last August but was highly ranked there: https://manifold.markets/PeterWildeford
This is why it's all so scary: almost nobody believes it'll happen until it's basically already happened or can't be stopped.
I see little evidence we're headed towards an exponential regime: the cost and resource usage versus capability haven't been trending that way.
Like plug that issue into known hardware limitations due to physical limits (i.e. we're already up against the wall re: feature sizes) and I'm even more skeptical.
If we were looking at a buy down in training resources which was accelerating, then I'd be a lot more interested.
What would being well versed in all related fields even mean?
Especially in the context of the output: a fictional, over-the-top geopolitics text that leaves the AI stuff at "at timestamp N+1, the model gets better".
It's made of the same stuff as fan fiction: layers of odd geopolitics, no real science fiction. Even at that, it is internally incoherent quite regularly (the White House looks to jail the USA's Champion AI Guy for some reason while they're in the midst of declaring an existential war against China).
Titillating, in that Important Things are happening. Sophomoric, in that the important things are off camera and an excuse to talk about something else.
I say that as someone who believes people 20 years from now will say it happened somewhere between Sonnet's agentic awareness and o3's uncanny post-human ability to turn a factual inquiry about the ending of a TV show into an incisive therapy session.
[1] https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-...
I'm not sure it was prophetic; it was a good survey of the field, but the claim was... a plot of grade-schooler-to-PhD against year.
I'm glad he got a paycheck from OpenAI at one point in time.
I got one from Google at one point in time.
Both of these projects are puffery, not scientific claims of anything, or claims of anything at all other than "at timestamp N+1, AI will be better than timestamp N, on an exponential curve"
Utterly bog-standard boring claim going back to 2016 AFAIK. Not the product of considered expertise. Not prophetic.
Since the people on LessWrong like Bayesian statistics: the probability of someone saying the right thing, given the assumption that there's a shitton of people saying different things, is... not surprisingly, high.
I know, it's a repeat of my submissions from the last few days, but it's hard not to feel like these people are building their own cult.
I also wonder why he had to defend a certain medicine heavily talked up on one side of the political spectrum...
Then you read some extracts about the outgroup and you think "oh, I'm just at a cafe with a Nazi sympathizer" (/s, but not too much) [1]
[1] https://www.eruditorumpress.com/blog/the-beigeness-or-how-to...
Thanks for sharing that blog post. I think it illustrates very well what I meant by employing argumentative sleights of hand to hide hollow ideas.
The more you dig, the more it's the old classism, racism, ableism, and misogyny, dressed in a shiny techbro coat. No surprise Musk and Thiel like them.
Who?
I guess it depends how many years YOU mean. They're absolutely not claiming that there will be armies of robots making chips in 3 years. They're claiming there will be some semblance of AGI that will be capable of improving/speeding-up the AI development loop within 3 years.
Motte and bailey: huge claim in the headline, little tiny claim in the body.
Scott Alexander (one of the writers), Yudkowsky, and the others (not the other authors, the other group of "thinkers" with similar ideas) are more or less AI doomers with no actual background in machine learning/AI.
I don't see why we should listen to them, especially when that blog page is formatted in a deceptive way to look like a research paper.
It's not science, it's science fiction
Weather vs. climate.
The question they're asking isn't about machine learning specifically, it's about the risks of generic optimisers optimising a utility function, and the difficulty of specifying a utility function in a way that doesn't have unfortunate side effects. The examples they give also work with biology (genetics and the difference between what your genes "want" and what your brain "wants") and with governance (laws and loopholes, cobra effects, etc.).
This is why a lot (I don't want to say "the majority") of people who do have an actual background in machine learning and AI pay attention to doomer arguments.
Some of them* may be business leaders using the same language to BS their way into regulatory capture, but my experience of "real" AI researchers is they're mostly also "safety is important, Yudkowsky makes good points about XYZ" even if they would also say "my P(doom) is only 10%, not 95% like Yudkowsky".
* I'm mainly thinking of Musk here, thanks to him saying "AI is summoning the demon" while also having an AI car company, funding OpenAI in the early years and now being in a legal spat with it that looks like it's "hostile takeover or interfere to the same end", funding another AI company, building humanoid robots and showing off ridiculous compute hardware, having brain implant chips, etc.
But you do need some kind of base knowledge, if you want to talk about this. Otherwise you're saying "what if we create God". And last time I checked it wasn't possible.
And what's with the existential risk obsession? That's like a bad retelling of Pascal's wager on the existence of God.
I'm relieved that, at least in Italy, I have yet to find anyone in AI who takes them into consideration for more than a few minutes during an ethics course (with students sneering at the idea of Bostrom's possible futures); and again, that course is held by a professor with no technical knowledge, with whom I often disagree because of this.
The base knowledge is game theory, not quite the same focus as the maths used to build an AI. And the problem isn't limited to "build god" — hence my examples of cobra effect, in which humans bred snakes because they were following the natural incentives of laws made by other humans who didn't see what would happen until it was so late that even cancelling the laws resulted in more snakes than they started with.
> And what's with the existential risk obsession? That's like a bad retelling of the Pascal bet on the existence of God.
And every "be careful what you wish for" story.
Is climate change a potentially existential threat? Is global thermonuclear war a potentially existential threat? Are pandemics, both those from lab leaks and those evolving naturally in wet markets, potentially existential threats?
The answer to all is "yes", even though these are systems with humans in the loop. (Even wet markets: people have been calling for better controls of them since well before Covid).
AI is automation. Automation has bugs. If the automation has a lot of bugs, you've got humans constantly checking things, despite which errors still gets past QA from time to time. If it's perfect automation, you wouldn't have to check it… but nobody knows how to do perfect automation.
"Perfect" automation would be god-like, but just as humans keep mistaking natural phenomena for deities, an AI doesn't have to actually be perfect for humans to set it running without checking the output and then be surprised when it all goes wrong. A decade ago the mistakes were companies doing blind dictionary merges on "Keep Calm and …" T-shirts, today it's LLMs giving legal advice (and perhaps writing US trade plans).
They (the humans) shouldn't be doing those things, but they do them anyway, because humans are like that.
And yes, you need some math background, otherwise you end up like Yudkowsky saying three years ago that we all might be dead by now or next year. Or using Bayesian probability in such a way that it makes you think they should have used their time better and taken a statistics course.
There are AI researchers, serious ones, studying AI risk, and I don't see anything wrong with that. But of course, their claims and papers are far less alarmist than the AI doomerism present in those circles. And one thing they do sound the alarm on is that doomerism and the TESCREAL movement and ideals proposed by the aforementioned Alexander, Yudkowsky, Bostrom, etc.
Realistically, probably yeah. On the other hand, if you manage to occupy the high ground then you might be able to protect yourself.
P(doom) seems quite murky to me because conquering the real world involves physical hardware. We've had billions of general intelligences crawling all over the world waging war with one another for a while now. I doubt every single AGI magically ends up aligned in a common bloc against humanity; all the alternatives to that are hopelessly opaque.
The worst case scenario that seems reasonably likely to me is probably AGI collectively not caring about us and wanting some natural resources that we happen to be living on top of.
And I could go on for hours inventing possible scenarios similar to Roko's basilisk:
The inverted basilisk: you created an AGI, but it's the wrong one. It's literally the devil. Game over.
You invented AGI, but it likes pizza and it's going to consume the entire universe to make pizza. Game over, but at least you'll eat pizza till the end.
You invented AGI, but it's depressed and refuses to actually do anything. You spent a huge amount of resources and all you have is a chatbot that tells you to leave it alone.
You don't invent AGI; it's not possible. I can hear the VCs crying from here.
You invented AGI, but it decides the only language it wants to use is one it invented, and you have no way to understand how to interact with it. Great, AGI is a non-verbal autistic AGI.
And well, one could continue for hours in the most hilarious ways that don't necessarily go in the direction of doom, but of course the idea of doom is going to have a wider reach. Then you read Yudkowsky's thoughts about how it would kill everyone with nanobots and you realize you're reading a science fiction piece. A bad one. At least Neuromancer was interesting.
They don't need to be aligned with each other, or even anything but their own short-term goals.
As evolution is itself an optimiser, covid can be considered one such agent, and that was pretty bad all by itself — even though the covid genome is not what you'd call "high IQ", and even with humans coordinating to produce vaccines, and even compensating for how I'm still seeing people today who think those vaccines were worse than the disease, it caused a lot of damage and killed a lot of people.
> The worst case scenario that seems reasonably likely to me is probably AGI collectively not caring about us and wanting some natural resources that we happen to be living on top of.
"The AI does not hate you, nor does it love you, but you are made of atoms which it can use for something else." — which is also true for covid, lions, and every parasite.
Rather than covid picture armed Boston Dynamics dogs except there are multiple different factions and some of them are at least loosely in favor of preventing the wanton murder of humans.
Nanotech takes that scenario and makes it even more opaque than it already was. But I think the general principle still applies. It isn't reasonable to assume that all AGI are simultaneously hostile towards humans while in perfect harmony with one another.
The hardware we ourselves are dependent on is increasingly automated; and even if it wasn't, there's plenty of failure modes where an agent destroys its own host and then itself dies off. Happens with predator-prey dynamics, happens with cancer, happens with ebola. People are worried it will happen with human-caused environmental degradation.
> Rather than covid picture armed Boston Dynamics dogs except there are multiple different factions and some of them are at least loosely in favor of preventing the wanton murder of humans.
Or imagine an LLM, with no hardware access at all, that's just used for advice. Much better than current ones, so that while it still has some weird edge cases where it makes mistakes (because the humans it learned from would also make mistakes), it's good enough that everyone ends up just trusting it the way we trust Google Maps for directions.
And then we metaphorically drive over the edge of a destroyed bridge.
Putting too much trust in a system — to the extent everyone just blindly accepts the answer — is both normal, and dangerous. At every level, from "computer says no" in customer support roles, to Bay of Pigs, to the 2007 financial crisis, to the Irish potato famine and the famine in the Great Leap Forward, etc.
It won't be a mistake along the lines of "there's a fire in Las Vegas and not enough water to put it out, so let's empty the dams in the north that aren't hydrologically connected".
But here's an open question with current tech: are algorithmic news feeds and similar websites, which optimise for engagement, making themselves so engaging that they preclude real social connection, and thereby reduce fertility?
Or another: are dating apps and websites motivated to keep people on the site, and therefore inherently prefer to pair people together if they're going to split quickly afterwards and be back in the dating market, again leading to lower fertility rates?
There are many ways to die, more than we can count, which is why we're vulnerable. If it was just about killer robots, we could say "don't build killer robots". The problem is that you could fill a library with non-fiction books about people who have already lived, getting exactly what they wished for and regretting it.
Claiming that the AI can’t even read a child’s drawing, for example, is therefore not super relevant to the timeline, unless you think it’s fundamentally never going to be possible.
I generally assumed ML was compute constrained, not code-monkey constrained. i.e. I'd probably tell my top N employees they had more room for experiments rather than hire N + 1, at some critical value N > 100 and N << 10000.
LLMs are still somewhat experimental, with various parts of the stack being new-ish, and therefore relatively un-optimised compared to where they could be. Let's say we took 10% of the training compute budget, and spent it on an army of AI coders whose job is to make the training process 12% more efficient. Could they do it? Given the relatively immature state of the stack, it sounds plausible to me (but it would depend a lot on having the right infrastructure and practices to make this work, and those things are also immature).
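Back-of-the-envelope version of that trade-off, with made-up illustrative numbers:

    # Illustrative arithmetic only: does diverting training compute to efficiency work pay off?
    budget = 1.0        # total training compute, normalised
    diverted = 0.10     # fraction redirected to AI coders optimising the stack
    speedup = 0.12      # efficiency gain they deliver on the remaining compute

    effective = (budget - diverted) * (1 + speedup)
    print(f"effective compute: {effective:.3f}x the original budget")
    # 0.90 * 1.12 = 1.008 -> roughly break-even. The bet only pays off if the gains
    # are larger, compound over time, or carry over to future training runs.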
The bull case would be the assumption that there's some order-of-magnitude speedup available, or possibly multiple such, but that finding it requires a lot of experimentation of the kind that tireless AI engineers might excel at. The bear case is that efficiency gains will be small, hard-earned, or specific to some rapidly-obsoleting architecture. Or, that efficiency gains will look good until the low-hanging fruit is gone, at which point they become weak again.
Simple and dense, sure. Highly optimized in a low level math and hardware sense but not in a higher level information theoretic sense when considering the model as a whole.
Consider that quantization and compression techniques can achieve on the order of 50% size reduction. That strongly suggests to me that current models aren't structured in a very efficient manner.
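A minimal sketch of what that ~50% figure means in practice: naive per-tensor int8 quantization of an fp16 weight matrix. Real schemes use per-channel or per-group scales and calibration, so treat this as illustrative only (assumes numpy):

    import numpy as np

    # Toy symmetric int8 quantization of an fp16 weight matrix.
    w = np.random.randn(4096, 4096).astype(np.float16)

    scale = float(np.abs(w).max()) / 127.0                        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)   # quantize to 8 bits
    w_hat = q.astype(np.float16) * scale                          # dequantize to check the damage

    print(f"fp16: {w.nbytes / 2**20:.0f} MiB, int8: {q.nbytes / 2**20:.0f} MiB")
    print(f"mean abs error: {np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).mean():.4f}")
    # Halving the bytes with modest error is the ~50% reduction mentioned above,
    # which suggests a lot of redundancy in how the weights encode information.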
Research is a like a maze, going faster on the wrong track doesn't bring you to the exit.
If performance decouples from capex (I expect multiple 'DeepSeek moments' in the future) it's time to be properly afraid.
Progress is driven by multiple factors, compute being one of them. While compute capex might slow earlier than other factors, the pace of algorithmic improvements alone can take us quite far.
Betteridge's Law: no.
Even the much more limited claim of AI replacing all white-collar keyboard-and-email jobs in that time frame looks questionable. And the capex is in trouble: https://www.reuters.com/technology/microsoft-pulls-back-more...
On the other hand, if that _does_ happen, what unemployment numbers are you expecting to see in that time period?
None, because we always want more. No technological advancement until now has caused people (in general) to stop wanting more and enjoy doing less work with the same output. We just increase output and consume that output too.
Create an image of an analog clock that shows 09:30 a.m.
Last time I checked, ChatGPT failed miserably; it took my 10-year-old nephew a minute.
Maybe it's bad to extrapolate those trends because there is no constant growth. What did the same graph look like when self-driving took off, and how does it look now?
It failed, even with chain-of-thought prompting, when I asked for an SVG image, because it didn't realise it needed to be careful when converting the angle of the hour hand to Cartesian coordinates. When prompted to pay extra attention to that, it succeeds again.
I would assume models with chain-of-thought prompting baked in would perform better on the first attempt even at an SVG.
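For concreteness, here's the geometry it kept fumbling, a quick sketch of my own (not the model's output), showing where the hands of a 9:30 clock should end up in SVG coordinates:

    import math

    def hand_endpoint(angle_deg, length, cx=100.0, cy=100.0):
        """Clock angles run clockwise from 12 o'clock; convert before using cos/sin,
        and remember that the SVG y-axis points down."""
        theta = math.radians(angle_deg - 90)  # 12 o'clock is -90 deg in math convention
        return cx + length * math.cos(theta), cy + length * math.sin(theta)

    hour, minute = 9, 30
    minute_angle = minute * 6                     # 360 deg / 60 min
    hour_angle = (hour % 12) * 30 + minute * 0.5  # 360 deg / 12 h, plus drift within the hour

    hx, hy = hand_endpoint(hour_angle, 50)        # 285 deg: between the 9 and the 10
    mx, my = hand_endpoint(minute_angle, 80)      # 180 deg: straight down at the 6
    print(f'<line x1="100" y1="100" x2="{hx:.1f}" y2="{hy:.1f}" />  <!-- hour hand -->')
    print(f'<line x1="100" y1="100" x2="{mx:.1f}" y2="{my:.1f}" />  <!-- minute hand -->')

The 0.5 deg/min drift on the hour hand is the detail that's easy to drop, which is how you end up with an hour hand pointing straight at the 9.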
Try something that outputs pixels.
Surely you don't expect a language model to move a child's hands and arms to produce the same pencil strokes. It would be the "do submarines swim like fishes" mistake again.
Because those are based on image training data. There is a bias in those images toward showing 10:10 because it's deemed the most aesthetic look.
Search for 'clock' on Google Images though and you'll instantly see why it only seems to know 10:10. I'll keep trying this one in the future, since it really shows the influence of training data on current models.
The hour hand is pointing directly at 9 when it should be between 9 and 10.
It got it wrong the first time (the minute hand was pointing at 5). I told it that and it apologised and fixed it.
Try something that outputs pixels.
Then you see the curse of limited training data.
After a bit of digging it turned out the author isn't lying. There actually is a contest for forecasters and of those who participated, he was consistently in relatively high positions. See 2022:
https://www.lesswrong.com/posts/gS8Jmcfoa9FAh92YK/crosspost-...
At the same time, he is the CEO of the Rethink Priorities think tank, which seems to do excellent work for society at large:
https://rethinkpriorities.org/
So I'm bookmarking this article and will read it during my reading session today.
So, because most engineering tasks I'm dealing with are quite complex and require multiple prompts, there is always the cost of accounting for the fact that the model will go bollocks at some point. Frankly, most of them do. It's not evident on simple tasks, but on more complex ones they simply start inserting their BS in spite of an often excellent start. But you are already "80% done". What do you do? Start from scratch with a different approach? Everybody has their own strategies (starting a new thread with the contents generated so far, etc.), but there's always a human cost associated.
From the papers about the tasks used in this estimation, we can easily find out that:
LLMs never exceed 20% success rate on tasks taking more than 4h:
HCAST: Human-Calibrated Autonomy Software Tasks https://metr.org/hcast.pdf
LLMs hit a score ceiling at around the 1-2h mark; performance drops off steeply as LOC increases, never breaching a 50% success rate past 800 lines and falling below 40% at the longest task, ~1600 lines.
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts https://arxiv.org/pdf/2411.15114
And from the METR paper itself, only one model exceeds the 30 min "human task length" with 50% success, reaching that rate at 59 min "human task length", and at 80% no models but one can go over 4 min and one gets to 15 min "human task length" at that rate.
It goes on to talk about extrapolating from this SOTA 59 min "time horizon" to ~168h, arguing that this is about one month of full-time work and that models which could breach 50% success rates at that time span could be considered transformative because they "would necessarily exceed human performance both at tasks including writing large software applications or founding startups (clearly economically valuable), and including novel scientific discoveries."
Now, note that "The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like ’write a good research paper’ would score between 9/16 and 15/16, depending on the specifics of the task."
Yet according to the METR paper, on the 22 '50% most messy' >1h tasks no model even breaches a 20% success rate. And that is for tasks above "messiness = 3.0" in a set whose messiness tops out at 3.2.
So there is absolutely no record of LLMs exceeding 50% success rates on 1h-long tasks above a 3.0 messiness rating, but they are happy to claim that they see a trend towards 50% success rates at 168h-long tasks approaching 9-15/16 messiness ratings? That's even assuming that 'writing a good research paper', their own estimate for which tops the chart, is comparable to 'novel scientific discoveries' or 'writing large software applications or founding startups', which would seem to be many times messier, let alone doable in one month.
Measuring AI Ability to Complete Long Tasks https://arxiv.org/pdf/2503.14499
So let's talk about the blog post claims that are not backed by the contents in these papers:
"Claude 3.7 could complete tasks at the end of February 2025 that would take a professional software engineer about one hour."
This is incredibly misleading. Claude 3.7 is the only model that was able to achieve a 50% success rate on tasks estimated to take humans at least 1h to complete. Note that the METR paper also shows that for the 50% "messier" tasks no model even breaches a 20% success rate, and that the HCAST set has 189 tasks, of which only 45 exceed the 1h baseline estimate. The METR paper uses a 'subset' of HCAST tasks, but it is not clear which ones, or what their baseline time cost estimates look like.
"o3 gives us a chance to test these projections. We can see this curve is still on trend, if not going faster than expected"
This was a rushed evaluation conducted on a set of tasks that is different from that in the original paper, making the comparison between datasets spurious.
Also, this seems relevant:
"For the HCAST tasks, the main resource constraint is in the form of a token budget which is set to a high enough number that we do not expect it to be a limiting factor to the agent’s performance. For our basic agent scaffold, this budget is 2 million tokens across input and output (including reasoning) tokens. For the token-hungry scaffolds used with o1 this budget is 8 million tokens. For o3 and o4-mini, which used the same scaffold as o1 but did not support generating multiple completions in a single request, this budget is 16 million tokens."
https://metr.github.io/autonomy-evals-guide/openai-o3-report...
Back to the blog post:
"We can actually attempt to use the METR paper to try to derive AGI timelines using the following equation:
days until AGI = log2({AGI task length} / {Starting task length}) * {doubling time}"
To find numbers to plug in there, it makes some assumptions like "the 1h45 upper bound seems low, let's boost it by 1.5x" and then "but the real world is messy, so let's just divide the time by 10", which flies in the face of the fact that for tasks above a messiness of 3/16 the models never breach 20% success rates. And considering the whole task set tops out at a 3.2/16 messiness score, anything above that is a null data point. So this "Let's assume true AGI tasks are 10x harder" assumption alone should drive "task length" to zero.
"Additionally, we likely need more than 50% reliability. Let’s assume we need 80% reliability, which adds a 4x penalty, plunging starting task length down further to 3min45sec."
At 80% reliability, only 2 models breach 4 minutes, with the best one approaching 15 minutes. And that is at a messiness score of at most 3.2. But the best model regresses relative to the 2 below it on the overall estimation, and none breach even 20% reliability above that.
So none of the modelling is valid, but the sleight of hand is this: once you accept the AGI formula and the 168h task length value for 'AGI' (which is of course spurious, because we're then talking about an 'AGI' that cannot do any task messier than 3.2/16), plus a doubling rate of 3 to 7 months, you are stuck accepting that AGI is at most 10 years away, as the blog post claims. But at most we can expect models that can do a few hundred lines of code editing at a 50% reliability rate on projects that would take around a month. And the METR paper has its own projection of a "1-month AI" (which it never claims to be AGI) with a 50% chance of arriving between the end of 2027 and early 2028. So let me know if you have any 1-month projects under 800 LOC that you may need, with a 50% chance of being able to use an LLM to have a 50% shot at eventually getting it right, in about 2-3 years!
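For what it's worth, here's the post's arithmetic made explicit. These numbers are my reading of its plugged-in assumptions (168h target, 3min45sec starting point, 3-7 month doubling time), so treat it as a sketch of their reasoning, not an endorsement:

    import math

    agi_task_length_h = 168          # "one month of full-time work"
    starting_task_h = 3.75 / 60      # 3min45sec, after the 10x "messiness" and 4x reliability penalties
    doublings = math.log2(agi_task_length_h / starting_task_h)

    for doubling_months in (3, 7):
        years = doublings * doubling_months / 12
        print(f"{doubling_months}-month doubling time: ~{doublings:.1f} doublings, ~{years:.1f} years")
    # ~11.4 doublings -> roughly 3 to 7 years, which is how the post lands on "AGI within a decade".
    # The whole extrapolation rests on a starting point measured only on low-messiness (<=3.2/16) tasks.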
https://xkcd.com/605/
Even just the training run for modern LLMs would take humans millennia to go through the same text.