The names Haiku, Sonnet, and Opus have not been chosen randomly.
I think it would easily have taken me 4+ hours to do that. It ran in 15 minutes while I played Kirby Air Riders and worked on the first try.
Afterward, I sort of had to reflect on the fact that I learned essentially nothing about building vector search. I wanted the feature more than I wanted to know how to build the feature. It kept me learning the thing I cared about rather than doing a side quest.
At best it plods along as you keep badgering Claude to fix it, until inevitably Claude reaches a point where it can't help. At which point you'll be forced to spend at least the 4 hours you would have originally spent trying to understand it so you can fix it yourself.
At worst the thing will actively break other things you do understand in ways you don't understand, and you'll have to spend at least 4 hours cleaning up the mess.
Either way it's not clear you've saved any time at all.
Perhaps not. If LLMs keep getting better, more competent models can help him stay on top of it lol.
My proto+sqlite+mesh project recently hit the point where it's too big for Claude to maintain a consistent "mental model" of how e.g. search and the db schemas are supposed to be structured. It kept taking hacky workarounds, like going directly to the db at the storage layer instead of through the API layer, etc., so I hit an insane amount of churn trying to get it to implement some of the features needed to get it production-ready.
Here's the whack-a-mole insanity documented in the git commit history: https://github.com/accretional/collector/compare/main...feat...
But now I know some new tricks and have better intuition for avoiding this situation going forward. Because I do understand the mental model behind what this is supposed to look like at its core, and I need to maintain some kind of human-friendly guard rails, I'm adding integration tests in a different repo and a README/project "constitution" that Claude can't change but is accountable for maintaining, and configuring it to keep them in context while working on my project.
Kind of a microcosm of startups' reluctance to institute employee handbooks/KPIs/PRDs, followed by resignation that they might truly be useful coordination tools.
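For the "constitution Claude can't change" part, the exact mechanics depend on your tooling. As a hypothetical sketch (the file name is a placeholder, and I'm assuming Claude Code's permission deny rules accept per-file Edit/Write patterns like its documented examples), something like this in .claude/settings.json would make the agent unable to rewrite its own guard rails:

    {
      "permissions": {
        "deny": [
          "Edit(CONSTITUTION.md)",
          "Write(CONSTITUTION.md)"
        ]
      }
    }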
Great for Opus because you’re now a captive customer.
I think that could work, but it can work in the same way that plenty of big companies have codebases that are a giant ball of mud and yet they somehow manage to stay in business and occasionally ship a new feature.
Meanwhile their rivals with well constructed codebases who can promptly ship features that work are able to run rings around them.
I expect that we'll learn over time that LLM-managed big ball of mud codebases are less valuable than LLM-managed high quality well architected long-term maintained codebases.
Honestly, I'm making stuff up, as I don't think it's feasible right now because of the context sizes. But given how fast things develop, maybe in a couple of years things might change.
AI is still incredibly useful used in tandem, but having it implement a full feature from one sentence usually leads to doom.
Opus/Anthropic is hands down the best in my experience. But using it feels like intellectual fast food (they all do). I hate the fact that I can build something like a neatly presentable one-off SPA tool (ty Simon) when I'm barely paying attention. It feels unsatisfying to use.
EDIT: because I'm rambling, I like "AI" as much as the next guy, probably more because I was there before it turned into LLMs"R"US, but I also like(d) the practice of sitting around listening to music solving problems with Scala. I don't know why we've decided to make work less fun...
There are just too many parts involved to do anything. For example, today I built a simple data collection app to use on my phone that involves inventories with photos, for a tedious workflow I have to do. I knew what I wanted but didn't know how to even choose which tools to bother learning. And just being able to try things to see whether an approach works, without spending hours learning one thing or another or wading through the hell of web search, is really great.
Things I learned today that I figure everyone else must know: if you want to take a photo from a webapp, I guess you need https. So I decided to try mTLS (knew it existed but never had the time) and asked Claude to write me a short tutorial about setting it up, creating keys, and importing them (including a cool single-line trick of spinning up a Python server and downloading the keys on my phone rather than finding a USB stick or whatever). And then helping me figure out a path out of the suffering of Chrome and Firefox hating a self-signed CA. At least I figured out how to make Firefox happy, but it would insist on prompting me for the certificate for every htmx request.

Chatting with Claude, I learn Caddy is pretty cool; it's Go. Claude suggests an auth boxcar (wtf is a boxcar? Claude clues me in). We build one in Go because I don't want to mix auth into my app. (Incidentally I can't write Go, but I can read it, and Go seems safer than a pile of Python for this simple thing.) The boxcar was fine, but Claude was struggling with getting headers to work. So while Claude is working on that, I do a quick Google about whether Caddy can have extensions. Interrupt Claude and suggest an extension instead of a boxcar. Claude's on board, so we ditch the boxcar. Have Claude and Codex evaluate the extension for security. They find important issues about things a jerk might do; fix them. So successful mTLS connections transition to session cookies, and my dumb CRUD tool doesn't have to worry about auth. Which it didn't have to do anyway, except browsers say so, etc., because my phone is literally only able to access the server via VPN anyway.
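To make that last design concrete: here's a minimal, hypothetical sketch of the "verified client cert becomes a session cookie" idea in plain Go net/http terms. It is not the actual Caddy extension (in the setup above Caddy terminates TLS, so the check lives there), and a production version would track issued tokens server-side:

    package main

    import (
        "crypto/rand"
        "encoding/hex"
        "fmt"
        "net/http"
    )

    // withSession sketches the "mTLS -> session cookie" idea: if the TLS
    // layer already verified a client certificate, mint a random session
    // cookie so the app behind it never has to think about auth itself.
    func withSession(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if _, err := r.Cookie("session"); err != nil {
                // r.TLS is non-nil only on TLS connections; PeerCertificates
                // is populated once the client cert has been verified.
                if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
                    http.Error(w, "client certificate required", http.StatusUnauthorized)
                    return
                }
                token := make([]byte, 32)
                rand.Read(token) // crypto/rand
                // A real version would remember this token server-side and
                // check it on later requests, not trust any cookie it sees.
                http.SetCookie(w, &http.Cookie{
                    Name:     "session",
                    Value:    hex.EncodeToString(token),
                    Path:     "/",
                    Secure:   true,
                    HttpOnly: true,
                    SameSite: http.SameSiteStrictMode,
                })
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello from the CRUD tool") // app stays auth-free
        })
        // TLS server config (client CA, RequireAndVerifyClientCert) omitted;
        // in the setup described above, Caddy enforces mTLS instead.
        http.ListenAndServe(":8080", withSession(app))
    }

The nice property is exactly the one described: the CRUD app behind it stays completely auth-free.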
Other things I learned today that only wasted 5 minutes of Claude's time rather than hours of mine: Firefox camera access can't control flash, focus, or zoom. So call out to the native app instead.
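(For anyone else who hits this: the usual way to "call out to the native app" from a web page is the file input's capture attribute, which hands off to the phone's camera app and returns the photo as an ordinary file upload.)

    <!-- opens the native camera app on most mobile browsers -->
    <input type="file" accept="image/*" capture="environment">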
This is all quite fun and the tool I'm building is going to really make my own life better.
I mean will you (we) retain all that it did after a few months go by? You may say we don't need to, but that sounds a little shallow given we're both on HN. Do you remember Gatsby's criticism of "Summer People"?
You could spend 4 hours (that you don't have) building that feature. Or... you could have the coding agent build it in the background for you in 15 minutes, then spend 30 minutes reading through what it did, tweaking it yourself and peppering it with questions about how it all works.
My hunch is that the 30 minutes of focused learning spent with a custom-built version that solves your exact problem is as effective as (or even more effective than) four hours spent mostly struggling to get something up and running and going down various rabbit holes of unrelated problem-solving.
Especially if realistically you were never going to carve out those four hours anyway.
Of course, this kind of interactive deep engagement with a topic is fast becoming obsolete. But the essence of "knowing", to me, is about doing and experiencing things, updating my Bayesian priors dialectically (to put it fancily).
I don't think that's incompatible with getting help from LLMs. I find that LLMs let me try so much more stuff, and at such a faster rate, that my learning pace has accelerated in a material way.
Something I'm really interested in right now is the balance in terms of the struggle required to learn something.
I firmly believe that there are things where the only way to learn how to do them is to go through the struggle. Writing essays, for example - I don't think you can shortcut learning to write well by having an LLM do that for you, even though actually learning to write is a painful and tiresome process.
But programming... I've seen so many people who quit learning to program because the struggle was too much. Those first six months of struggling with missing semicolons are absolutely miserable!
I've spoken to a ton of people over the past year who always wanted to learn to program but never managed to carve out that miserable six months... and now they're building software, because LLMs have shaved down that learning curve.
LLMs can hurt less experienced engineers by keeping them from building an intuition for why things work a certain way, or why an alternative won't work (or conversely, why an unconventional approach might not only be possible, but very useful and valuable!).
I think problem solving is optimization in the face of constraints. Generally using LLMs IME, the more you're able to articulate and understand your constraints, and prescriptively guide the LLM towards something it's capable of doing, the more effective they are and the more maintainable their output is for you. So it really helps to know when to break the rules or to create/do something unconventional.
Another way to put it is that LLMs have commodified conventional software, so learning when to break or challenge convention is going to be where most of the valuable work is going forward. And I think it's hard to actually do that unless you get into the weeds and battle/try things because you don't understand why they won't work. Sometimes they do.
What I don't believe is that it HAS to be like this. Maybe it's my natural optimism showing through here, but I'm confident it's possible to accelerate rather than slow down your learning progress with LLMs, if you're thoughtful about how you apply them.
An open question for me is how feasible it is to teach people how to teach themselves effectively using this new technology.
I have a core belief that everything is learnable, if people are motivated to learn. I have no idea how to help instill that motivation in people who don't yet have it though!
It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute-force.
It's interesting because, like the article notes, AI is really smashing benchmarks, but actual usefulness in automation of thought work is proving much more elusive. I think that collective experience of AI just not being that useful, or as useful as the benchmarks suggest it should be, is captured in this metric.
This is of course quite highly correlated with an AI system being able to churn through a task for a long time. But it's not necessarily the same thing.
Of course the big questions are going to arise if/when we start passing milestones like 8 hours (a whole work day) or 40 hours (a whole work week).
"Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind--probably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend on a given task is very different from the hours any particular person like you or I would spend.
Some of the appeal (and risk!!) of these things is specifically that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you.
I cannot imagine GPT-5.2 working on a task for more than 2 minutes, let alone 4 hours. I’m curious if you’ve run into this and figured out a way around it?
And rarely is software one-and-done; after a few rounds like this, the architecture becomes schizophrenic. Combating this tendency usually requires a lot of the work from these "long tasks" to be thrown away, and more closely limiting what the AI is trying to do as it happens. The success of one "long task" is not necessarily a good thing!
If true, how much of this is a result of:
1. Genuine technical advancement
or:
2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?
In other words… are we talking about genuine, sustainable innovation that we get to take with us moving forward and benefit from? Or are we talking about an “improvement” that is more akin to a mirage that will eventually disappear when the Ponzi scheme eventually collapses?
It matches my personal feeling when using progressively better models over time.