Tell me you've never coded without telling me you've never coded.
I think the author meant it's easier to train (reasoning) LLMs on [coding] skills than on most other tasks. I agree with that. Data abundance, near-immediate feedback, and near-perfect simulators are why we've seen such rapid progress on most coding benchmarks so far.
I'm not sure if he included high-level software engineering skills such as designing the right software architecture for a given set of user requirements in that statement.
---
For humans, I think the fundamentals of coding are very natural and easy for people with certain mental traits, although that's obviously not the norm (which explains the high wages for some software engineers).
Coding on large, practical software systems is indeed much more complex with all the inherent and accidental complexity. The latter helps explain why AI agents for software engineering will require some human involvement until we actually reach full-fledged AGI.
LLMs are impressively good at automating small chunks of code generation, and can probably be made even more effective at it with CoT, agents and so on (adversarial LLMs generating code and writing unit tests, perhaps? Depending on how the compute cost of running many iterations of that starts to stack up).
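To make the "adversarial" idea concrete, here's a rough sketch of what that loop might look like. Note that call_llm is a hypothetical placeholder for whatever model API you'd use, not a real client, and the whole thing is illustrative rather than something I've benchmarked:

    import subprocess
    import tempfile
    from pathlib import Path

    def call_llm(prompt: str) -> str:
        """Hypothetical placeholder for an LLM call; swap in a real client."""
        raise NotImplementedError

    def adversarial_codegen(spec: str, max_rounds: int = 5) -> str | None:
        """One model writes code, another writes unit tests; iterate until the tests pass."""
        tests = call_llm(f"Write pytest unit tests for this spec:\n{spec}")
        code = call_llm(f"Write a Python module `solution.py` satisfying this spec:\n{spec}")
        for _ in range(max_rounds):
            with tempfile.TemporaryDirectory() as d:
                Path(d, "solution.py").write_text(code)
                Path(d, "test_solution.py").write_text(tests)
                result = subprocess.run(["pytest", "-q", d], capture_output=True, text=True)
            if result.returncode == 0:
                return code  # tests pass: accept this candidate
            # Feed the failure log back to the coder model and try again.
            code = call_llm(
                f"Spec:\n{spec}\n\nCurrent code:\n{code}\n\n"
                f"Failing test output:\n{result.stdout}\n\nFix the code."
            )
        return None  # budget exhausted; this is where the compute cost stacks up

Each failed round is another full model call (or several), so the cost question above is exactly the right one to ask.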
I'm still not convinced that they can do software engineering at all.
What?
E.g. he is a board member at Metaculus, an online forecasting platform. Also:
> On his free time, Peter makes money in prediction markets and is quickly becoming one of the top forecasters on Metaculus
https://theinsideview.ai/peter
> For certain projects Metaculus employs Pro Forecasters who have demonstrated excellent forecasting ability and who have a history of clearly describing their rationales.
He left Manifold last August but was highly ranked there: https://manifold.markets/PeterWildeford
This is why it's all so scary: almost nobody believes it'll happen until it's basically already happened or can't be stopped.
I see little evidence we're headed towards an exponential regime: the cost and resource usage versus capability haven't been trending that way.
Like plug that issue into known hardware limitations due to physical limits (i.e. we're already up against the wall re: feature sizes) and I'm even more skeptical.
If we were looking at a buy down in training resources which was accelerating, then I'd be a lot more interested.
What would being well versed in all related fields even mean?
Especially in the context of the output: a fictional, over-the-top geopolitics text that leaves the AI stuff at "at timestamp N+1, the model gets better".
It's made of the same stuff as fan fiction: layers of odd geopolitics, no real science fiction. Even at that, it is internally incoherent quite regularly (the White House looks to jail the USA's Champion AI Guy for some reason while they're in the midst of declaring an existential war against China).
Titillating, in that Important Things are happening. Sophomoric, in that the important things are off camera and an excuse to talk about something else.
I say that as someone who believes people 20 years from now will say it happened somewhere between Sonnet's agentic awareness and o3's uncanny post-human ability to turn a factual inquiry about the ending of a TV show into an incisive therapy session.
[1] https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-...
I'm not sure it was prophetic; it was a good survey of the field, but the claim was... a plot of grade-schooler-to-PhD against year.
I'm glad he got a paycheck from OpenAI at one point in time.
I got one from Google at one point in time.
Both of these projects are puffery, not scientific claims of anything, or claims of anything at all other than "at timestamp N+1, AI will be better than timestamp N, on an exponential curve"
Utterly bog-standard boring claim going back to 2016 AFAIK. Not the product of considered expertise. Not prophetic.
Since the people on LessWrong like Bayesian statistics: the probability of someone saying the right thing, given the assumption that there's a shitton of people saying different things, is... not surprisingly, high.
I know, it's a repeat of my submissions from the last few days, but it's hard not to feel like these people are building their own cult.
I also wonder why he had to defend a certain medicine heavily talked up on one side of the political spectrum...
Then you read some extracts about the outgroup and you think "oh, I'm just at a cafe with a Nazi sympathizer" (/s, but not too much) [1]
[1] https://www.eruditorumpress.com/blog/the-beigeness-or-how-to...
Thanks for sharing that blog post. I think it illustrates very well what I meant by employing argumentative sleights of hand to hide hollow ideas.
The more you dig, the more it's the old classism, racism, ableism, and misogyny, dressed in a shiny techbro coat. No surprise Musk and Thiel like them.
Who?
I guess it depends how many years YOU mean. They're absolutely not claiming that there will be armies of robots making chips in 3 years. They're claiming there will be some semblance of AGI that will be capable of improving/speeding-up the AI development loop within 3 years.
Motte and bailey: huge claim in the headline, little tiny claim in the body.
Scott Alexander (one of the writers), Yudkowsky, and the others (not the other authors, the other group of "thinkers" with similar ideas) are more or less AI doomers with no actual background in machine learning/AI.
I don't see why we should listen to them, especially when that blog page is formatted in a deceptive way to look like a research paper.
It's not science, it's science fiction
Weather vs. climate.
The question they're asking isn't about machine learning specifically, it's about the risks of generic optimisers optimising a utility function, and the difficulty of specifying a utility function in a way that doesn't have unfortunate side effects. The examples they give also work with biology (genetics and the difference between what your genes "want" and what your brain "wants") and with governance (laws and loopholes, cobra effects, etc.).
This is why a lot (I don't want to say "the majority") of people who do have an actual background in machine learning and AI pay attention to doomer arguments.
Some of them* may be business leaders using the same language to BS their way into regulatory capture, but my experience of "real" AI researchers is they're mostly also "safety is important, Yudkowsky makes good points about XYZ" even if they would also say "my P(doom) is only 10%, not 95% like Yudkowsky".
* I'm mainly thinking of Musk here, thanks to him saying "AI is summoning the demon" while also having an AI car company, funding OpenAI in the early years and now being in a legal spat with it that looks like it's "hostile takeover or interfere to the same end", funding another AI company, building humanoid robots and showing off ridiculous compute hardware, having brain implant chips, etc.
But you do need some kind of base knowledge, if you want to talk about this. Otherwise you're saying "what if we create God". And last time I checked it wasn't possible.
And what's with the existential risk obsession? That's like a bad retelling of Pascal's wager on the existence of God.
I'm relieved that, at least in Italy, I have yet to find anyone in AI who takes them into consideration for more than a few minutes during an ethics course (with students sneering at the idea of Bostrom's possible futures); and again, that course is held by a professor with no technical knowledge, with whom I often disagree because of this.
The base knowledge is game theory, not quite the same focus as the maths used to build an AI. And the problem isn't limited to "build god" — hence my examples of cobra effect, in which humans bred snakes because they were following the natural incentives of laws made by other humans who didn't see what would happen until it was so late that even cancelling the laws resulted in more snakes than they started with.
> And what's with the existential risk obsession? That's like a bad retelling of the Pascal bet on the existence of God.
And every "be careful what you wish for" story.
Is climate change a potentially existential threat? Is global thermonuclear war a potentially existential threat? Are pandemics, both those from lab leaks and those evolving naturally in wet markets, potentially existential threats?
The answer to all is "yes", even though these are systems with humans in the loop. (Even wet markets: people have been calling for better controls of them since well before Covid).
AI is automation. Automation has bugs. If the automation has a lot of bugs, you've got humans constantly checking things, despite which errors still gets past QA from time to time. If it's perfect automation, you wouldn't have to check it… but nobody knows how to do perfect automation.
"Perfect" automation would be god-like, but just as humans keep mistaking natural phenomena for deities, an AI doesn't have to actually be perfect for humans to set it running without checking the output and then be surprised when it all goes wrong. A decade ago the mistakes were companies doing blind dictionary merges on "Keep Calm and …" T-shirts, today it's LLMs giving legal advice (and perhaps writing US trade plans).
They (the humans) shouldn't be doing those things, but they do them anyway, because humans are like that.
And yes, you need some math background, otherwise you end up like Yudkowsky saying three years ago that we all might be dead by now or next year. Or using Bayesian probability in such a way that it makes you think they should have used their time better and taken a statistics course.
There are AI researchers, serious ones, studying AI risk, and I don't see anything wrong with that. But of course, their claims and papers are far less alarmist than the AI doomerism present in those circles. And one thing they do sound the alarm on is that doomerism and the TESCREAL movement and ideals proposed by the aforementioned Alexander, Yudkowsky, Bostrom, etc.
Realistically, probably yeah. On the other hand, if you manage to occupy the high ground then you might be able to protect yourself.
P(doom) seems quite murky to me because conquering the real world involves physical hardware. We've had billions of general intelligences crawling all over the world waging war with one another for a while now. I doubt every single AGI magically ends up aligned in a common bloc against humanity; all the alternatives to that are hopelessly opaque.
The worst case scenario that seems reasonably likely to me is probably AGI collectively not caring about us and wanting some natural resources that we happen to be living on top of.
And I could go on for hours inventing possible scenarios similar to Roko's basilisk:
The inverted basilisk: you created an AGI, but it's the wrong one. It's literally the devil. Game over.
You invented AGI, but it likes pizza and it's going to consume the entire universe to make pizza. Game over, but at least you'll eat pizza till the end.
You invented AGI, but it's depressed and refuses to actually do anything. You spent a huge amount of resources and all you have is a chatbot that tells you to leave it alone.
You don't invent AGI; it's not possible. I can hear the VCs crying from here.
You invented AGI, but it decides the only language it wants to use is one it invented, and you have no way to understand how to interact with it. Great, AGI is a non-verbal autistic AGI.
And well, one could continue for hours in the most hilarious ways that don't necessarily go in the direction of doom, but of course the idea of doom is going to have a wider reach. Then you read Yudkowsky's thoughts about how it would kill everyone with nanobots and you realize you're reading a science fiction piece. A bad one. At least Neuromancer was interesting.
They don't need to be aligned with each other, or even anything but their own short-term goals.
As evolution is itself an optimiser, covid can be considered one such agent, and that was pretty bad all by itself — even though the covid genome is not what you'd call "high IQ", and even with humans coordinating to produce vaccines, and even compensating for how I'm still seeing people today who think those vaccines were worse than the disease, it caused a lot of damage and killed a lot of people.
> The worst case scenario that seems reasonably likely to me is probably AGI collectively not caring about us and wanting some natural resources that we happen to be living on top of.
"The AI does not hate you, nor does it love you, but you are made of atoms which it can use for something else." — which is also true for covid, lions, and every parasite.
Rather than covid picture armed Boston Dynamics dogs except there are multiple different factions and some of them are at least loosely in favor of preventing the wanton murder of humans.
Nanotech takes that scenario and makes it even more opaque than it already was. But I think the general principle still applies. It isn't reasonable to assume that all AGI are simultaneously hostile towards humans while in perfect harmony with one another.
The hardware we ourselves are dependent on is increasingly automated; and even if it wasn't, there's plenty of failure modes where an agent destroys its own host and then itself dies off. Happens with predator-prey dynamics, happens with cancer, happens with ebola. People are worried it will happen with human-caused environmental degradation.
> Rather than covid picture armed Boston Dynamics dogs except there are multiple different factions and some of them are at least loosely in favor of preventing the wanton murder of humans.
Or imagine an LLM, with no hardware access at all, that's just used for advice. Much better than current ones, so that while it still has some weird edge cases where it makes mistakes (because the humans it learned from would also make mistakes), it's good enough that everyone ends up just trusting it the way we trust Google Maps for directions.
And then we metaphorically drive over the edge of a destroyed bridge.
Putting too much trust in a system — to the extent everyone just blindly accepts the answer — is both normal, and dangerous. At every level, from "computer says no" in customer support roles, to Bay of Pigs, to the 2007 financial crisis, to the Irish potato famine and the famine in the Great Leap Forward, etc.
It won't be a mistake along the lines of "there's a fire in Las Vegas and not enough water to put it out, so let's empty the dams in the north that aren't hydrologically connected".
But here's an open question with current tech: are algorithmic news feeds and similar websites, which optimise for engagement, making themselves so engaging that they preclude real social connection, and thereby reduce fertility?
Or another: are dating apps and websites motivated to keep people on the site, and therefore inherently prefer to pair people together if they're going to split quickly afterwards and be back in the dating market, again leading to lower fertility rates?
There are many ways to die, more than we can count, which is why we're vulnerable. If it was just about killer robots, we could say "don't build killer robots". The problem is that you could fill a library with non-fiction books about people who have already lived, getting exactly what they wished for and regretting it.
Claiming that the AI can’t even read a child’s drawing, for example, is therefore not super relevant to the timeline, unless you think it’s fundamentally never going to be possible.
I generally assumed ML was compute constrained, not code-monkey constrained. i.e. I'd probably tell my top N employees they had more room for experiments rather than hire N + 1, at some critical value N > 100 and N << 10000.
LLMs are still somewhat experimental, with various parts of the stack being new-ish, and therefore relatively un-optimised compared to where they could be. Let's say we took 10% of the training compute budget, and spent it on an army of AI coders whose job is to make the training process 12% more efficient. Could they do it? Given the relatively immature state of the stack, it sounds plausible to me (but it would depend a lot on having the right infrastructure and practices to make this work, and those things are also immature).
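Back-of-the-envelope version of that trade-off, with made-up illustrative numbers:

    # Illustrative arithmetic only: does diverting training compute to efficiency work pay off?
    budget = 1.0        # total training compute, normalised
    diverted = 0.10     # fraction redirected to AI coders optimising the stack
    speedup = 0.12      # efficiency gain they deliver on the remaining compute

    effective = (budget - diverted) * (1 + speedup)
    print(f"effective compute: {effective:.3f}x the original budget")
    # 0.90 * 1.12 = 1.008 -> roughly break-even. The bet only pays off if the gains
    # are larger, compound over time, or carry over to future training runs.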
The bull case would be the assumption that there's some order-of-magnitude speedup available, or possibly multiple such, but that finding it requires a lot of experimentation of the kind that tireless AI engineers might excel at. The bear case is that efficiency gains will be small, hard-earned, or specific to some rapidly-obsoleting architecture. Or, that efficiency gains will look good until the low-hanging fruit is gone, at which point they become weak again.
Simple and dense, sure. Highly optimized in a low level math and hardware sense but not in a higher level information theoretic sense when considering the model as a whole.
Consider that quantization and compression techniques can achieve on the order of 50% size reduction. That strongly suggests to me that current models aren't structured in a very efficient manner.
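A minimal sketch of what that ~50% figure means in practice: naive per-tensor int8 quantization of an fp16 weight matrix. Real schemes use per-channel or per-group scales and calibration, so treat this as illustrative only (assumes numpy):

    import numpy as np

    # Toy symmetric int8 quantization of an fp16 weight matrix.
    w = np.random.randn(4096, 4096).astype(np.float16)

    scale = float(np.abs(w).max()) / 127.0                        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)   # quantize to 8 bits
    w_hat = q.astype(np.float16) * scale                          # dequantize to check the damage

    print(f"fp16: {w.nbytes / 2**20:.0f} MiB, int8: {q.nbytes / 2**20:.0f} MiB")
    print(f"mean abs error: {np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).mean():.4f}")
    # Halving the bytes with modest error is the ~50% reduction mentioned above,
    # which suggests a lot of redundancy in how the weights encode information.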
Research is a like a maze, going faster on the wrong track doesn't bring you to the exit.
If performance decouples from capex (I expect multiple 'DeepSeek moments' in the future) it's time to be properly afraid.
Progress is driven by multiple factors, compute being one of them. While compute capex might slow earlier than other factors, the pace of algorithmic improvements alone can take us quite far.
Betteridge's Law: no.
Even the much more limited claim of AI replacing all white-collar keyboard-and-email jobs in that time frame looks questionable. And the capex is in trouble: https://www.reuters.com/technology/microsoft-pulls-back-more...
On the other hand, if that _does_ happen, what unemployment numbers are you expecting to see in that time period?
None, because we always want more. No technological advancement until now has caused people (in general) to stop wanting more and enjoy doing less work with the same output. We just increase output and consume that output too.
Create an image of an analog clock that shows 09:30 a.m.
Last time I checked, ChatGPT failed miserably; it took my 10-year-old nephew a minute.
Maybe it's bad to extrapolate those trends because there is no constant growth. What did the same graph look like when self-driving took off, and how does it look now?
It failed, even with chain-of-thought prompting, when I asked for an SVG image, because it didn't realise it needed to be careful when converting the angle of the hour hand to Cartesian coordinates. When prompted to pay extra attention to that, it succeeds again.
I would assume models with chain-of-thought prompting baked in would perform better on the first attempt even at an SVG.
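For concreteness, here's the geometry it kept fumbling, a quick sketch of my own (not the model's output), showing where the hands of a 9:30 clock should end up in SVG coordinates:

    import math

    def hand_endpoint(angle_deg, length, cx=100.0, cy=100.0):
        """Clock angles run clockwise from 12 o'clock; convert before using cos/sin,
        and remember that the SVG y-axis points down."""
        theta = math.radians(angle_deg - 90)  # 12 o'clock is -90 deg in math convention
        return cx + length * math.cos(theta), cy + length * math.sin(theta)

    hour, minute = 9, 30
    minute_angle = minute * 6                     # 360 deg / 60 min
    hour_angle = (hour % 12) * 30 + minute * 0.5  # 360 deg / 12 h, plus drift within the hour

    hx, hy = hand_endpoint(hour_angle, 50)        # 285 deg: between the 9 and the 10
    mx, my = hand_endpoint(minute_angle, 80)      # 180 deg: straight down at the 6
    print(f'<line x1="100" y1="100" x2="{hx:.1f}" y2="{hy:.1f}" />  <!-- hour hand -->')
    print(f'<line x1="100" y1="100" x2="{mx:.1f}" y2="{my:.1f}" />  <!-- minute hand -->')

The 0.5 deg/min drift on the hour hand is the detail that's easy to drop, which is how you end up with an hour hand pointing straight at the 9.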
Try something that outputs pixels.
Surely you don't expect a language model to move a child's hands and arms to produce the same pencil strokes. It would be the "do submarines swim like fishes" mistake again.
Because those are based on image training data. There is a bias in those images toward showing 10:10 because it's deemed the most aesthetic look.
Search for 'clock' on Google Images though and you'll instantly see why it only seems to know 10:10. I'll keep trying this one in the future, since it really shows the influence of training data on current models.
The hour hand is pointing directly at 9 when it should be between 9 and 10.
It got it wrong the first time (the minute hand was pointing at 5). I told it that and it apologised and fixed it.
Try something that outputs pixels.
Then you see the curse of limited training data.
After a bit of digging it turned out the author isn't lying. There actually is a contest for forecasters and of those who participated, he was consistently in relatively high positions. See 2022:
https://www.lesswrong.com/posts/gS8Jmcfoa9FAh92YK/crosspost-...
At the same time, he is the CEO of the Rethink Priorities think tank, which seems to do excellent work for society at large:
https://rethinkpriorities.org/
So I'm bookmarking this article and will read it during my reading session today.
So, because most engineering tasks I'm dealing with are quite complex and require multiple prompts, there is always the cost of accounting for the fact that the model will go bollocks at some point. Frankly, most of them do. It's not evident on simple tasks, but on more complex ones they simply start inserting their BS in spite of an often excellent start. But you are already "80% done". What do you do? Start from scratch with a different approach? Everybody has their own strategies (starting a new thread with the contents generated so far, etc.), but there's always a human cost associated.
From the papers about the tasks used in this estimation, we can easily find out that:
LLMs never exceed 20% success rate on tasks taking more than 4h:
HCAST: Human-Calibrated Autonomy Software Tasks https://metr.org/hcast.pdf
LLMs hit a score ceiling at around the 1-2h mark; performance drops off steeply as LOC increases, never breaching a 50% success rate past 800 lines and falling below 40% at the longest task, ~1600 lines.
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts https://arxiv.org/pdf/2411.15114
And from the METR paper itself, only one model exceeds the 30 min "human task length" with 50% success, reaching that rate at 59 min "human task length", and at 80% no models but one can go over 4 min and one gets to 15 min "human task length" at that rate.
It goes on to talk about extrapolating from this SOTA 59 min "time horizon" to ~168h, arguing that this is about one month of full-time work and that models which could breach 50% success rates at that time span could be considered transformative because they "would necessarily exceed human performance both at tasks including writing large software applications or founding startups (clearly economically valuable), and including novel scientific discoveries."
Now, note that "The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like ’write a good research paper’ would score between 9/16 and 15/16, depending on the specifics of the task."
Yet according to the METR paper, on the 22 '50% most messy' >1h tasks no model even breaches a 20% success rate. And that is for tasks above "messiness = 3.0" in a set whose messiness tops out at 3.2.
So there is absolutely no record of LLMs exceeding 50% success rates on 1h-long tasks above a 3.0 messiness rating, but they are happy to claim that they see a trend towards 50% success rates at 168h-long tasks approaching 9-15/16 messiness ratings? That's even assuming that 'writing a good research paper', their own estimate for which tops the chart, is comparable to 'novel scientific discoveries' or 'writing large software applications or founding startups', which would seem to be many times messier, let alone doable in one month.
Measuring AI Ability to Complete Long Tasks https://arxiv.org/pdf/2503.14499
So let's talk about the blog post claims that are not backed by the contents in these papers:
"Claude 3.7 could complete tasks at the end of February 2025 that would take a professional software engineer about one hour."
This is incredibly misleading. Claude 3.7 is the only model that was able to achieve a 50% success rate on tasks estimated to take humans at least 1h to complete. Note that the METR paper also shows that for the 50% "messier" tasks no model even breaches a 20% success rate, and that the HCAST set has 189 tasks, of which only 45 exceed the 1h baseline estimate. The METR paper uses a 'subset' of HCAST tasks, but it is not clear which ones, or what their baseline time cost estimates look like.
"o3 gives us a chance to test these projections. We can see this curve is still on trend, if not going faster than expected"
This was a rushed evaluation conducted on a set of tasks that is different from that in the original paper, making the comparison between datasets spurious.
Also, this seems relevant:
"For the HCAST tasks, the main resource constraint is in the form of a token budget which is set to a high enough number that we do not expect it to be a limiting factor to the agent’s performance. For our basic agent scaffold, this budget is 2 million tokens across input and output (including reasoning) tokens. For the token-hungry scaffolds used with o1 this budget is 8 million tokens. For o3 and o4-mini, which used the same scaffold as o1 but did not support generating multiple completions in a single request, this budget is 16 million tokens."
https://metr.github.io/autonomy-evals-guide/openai-o3-report...
Back to the blog post:
"We can actually attempt to use the METR paper to try to derive AGI timelines using the following equation:
days until AGI = log2({AGI task length} / {Starting task length}) * {doubling time}"
To find numbers to plug in there, it makes some assumptions like "the 1h45 upper bound seems low, let's boost it by 1.5x" and then "but the real world is messy, so let's just divide the time by 10", which flies in the face of the fact that for tasks above a messiness of 3/16 the models never breach 20% success rates. And considering the whole task set tops out at a 3.2/16 messiness score, anything above that is a null data point. So this "Let's assume true AGI tasks are 10x harder" assumption alone should drive "task length" to zero.
"Additionally, we likely need more than 50% reliability. Let’s assume we need 80% reliability, which adds a 4x penalty, plunging starting task length down further to 3min45sec."
At 80% reliability, only 2 models breach 4 minutes, with the best one approaching 15 minutes. And that is at a messiness score of at most 3.2. But the best model regresses relative to the 2 below it on the overall estimation, and none breach even 20% reliability above that.
So none of the modelling is valid, but the sleight of hand is this: once you accept the AGI formula and the 168h task length value for 'AGI' (which is of course spurious, because we're then talking about an 'AGI' that cannot do any task messier than 3.2/16), plus a doubling rate of 3 to 7 months, you are stuck accepting that AGI is at most 10 years away, as the blog post claims. But at most we can expect models that can do a few hundred lines of code editing at a 50% reliability rate on projects that would take around a month. And the METR paper has its own projection of a "1-month AI" (which it never claims to be AGI) with a 50% chance of arriving between the end of 2027 and early 2028. So let me know if you have any 1-month projects under 800 LOC that you may need, with a 50% chance of being able to use an LLM to have a 50% shot at eventually getting it right, in about 2-3 years!
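For what it's worth, here's the post's arithmetic made explicit. These numbers are my reading of its plugged-in assumptions (168h target, 3min45sec starting point, 3-7 month doubling time), so treat it as a sketch of their reasoning, not an endorsement:

    import math

    agi_task_length_h = 168          # "one month of full-time work"
    starting_task_h = 3.75 / 60      # 3min45sec, after the 10x "messiness" and 4x reliability penalties
    doublings = math.log2(agi_task_length_h / starting_task_h)

    for doubling_months in (3, 7):
        years = doublings * doubling_months / 12
        print(f"{doubling_months}-month doubling time: ~{doublings:.1f} doublings, ~{years:.1f} years")
    # ~11.4 doublings -> roughly 3 to 7 years, which is how the post lands on "AGI within a decade".
    # The whole extrapolation rests on a starting point measured only on low-messiness (<=3.2/16) tasks.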
https://xkcd.com/605/
Even just the training run for modern LLMs would take humans millennia to go through the same text.