Yeah, listen... I'm glad these types of studies are being conducted. I'll say this though: the difference between pre- and post-Opus 4.5 has been night and day for me.
From August 2025 through November 2025 I led a complex project at work where I used Sonnet 4.5 heavily. It was very helpful, but my total productivity gains were around 10-15%, which is pretty much what the study found. Once Opus came out in November, though, it was like someone flipped a switch. It was much more capable at autonomous work and required way less hand-holding, intervention, or course-correction. 4.6 has been even better.
So I'm much more interested in reading studies like this over the next two years where the start period coincides with Opus 4.5's release.
Seems likely that process is holding things back. Planning has always been a "best-guess". There's lots you can't account for until you start a task.
Code review mostly exists because the cost of getting something wrong used to be high (because human coding is slow). If you can code faster, you can replace bad code faster. In other words, LLMs have lowered the cost of shipping and redeploying.
We can't honestly assess the new way of doing things when we bring along the baggage of the old way of doing things.
If the agent comes back in a few minutes with a tiny fix, it is probably a small task.
If the agent produces a large, convoluted solution that would need careful review, it is at least a medium task.
And if the agent gets stuck, runs into architectural constraints, etc., then it is definitely a hard task.