I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs. with it. The other is that it's tempting to time an AI with metrics like how long until the PR was opened or merged. But the AI workflow fundamentally shifts engineering hours so that a greater percentage of time is spent on refactoring, testing, and resolving issues later in the process, including after the code is initially approved and merged. I can see how it's easy for a developer to report that AI completed a task quickly because the PR was opened quickly, discounting the amount of future work that the PR created.
I'm not making any claim in either direction; the authors themselves recognize the study's limitations. I'm just trying to say that everyone should have far greater error bars. This technology is the weirdest shit I've seen in my lifetime; making deductions about productivity from anecdotes and dubious benchmarks is basically reading tea leaves.
But Figure 18 shows that time spent actively coding decreased (which might be where the feeling of a speed-up was coming from) and the gains were eaten up by time spent prompting, waiting for and then reviewing the AI output and generally being idle.
So maybe it's not a good idea to use LLMs for tasks that you could've done yourself in under 5 minutes.
Developers do spend their time totally differently, though; this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time when they have AI vs. not - in general, when these devs have AI, they spend a smaller % of time writing code and a larger % of time working with AI (which... makes sense).
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
The standard experimental design that solves this is to randomly assign participants to the experiment group (with AI) and the control group (without AI), which is what they did. This isolates the variable of interest (with or without AI) by averaging out uncontrollable individual, contextual, and environmental differences. You don't need to know how a single individual in a single context would have behaved in the other group. With a large enough sample size and effect size, you can establish statistical significance and attribute the difference between groups to the with-or-without-AI variable.
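To make that concrete, here's a minimal sketch (toy numbers of my own, not data from the study) of how randomization lets you test whether an observed with-AI vs. without-AI difference is bigger than chance, via a permutation test:

```python
# Hypothetical per-task completion times in minutes; the group labels were
# randomly assigned, so under the null hypothesis ("AI makes no difference")
# any relabeling of the pooled times is equally likely.
import numpy as np

rng = np.random.default_rng(0)
ai_times = np.array([95, 120, 80, 150, 110, 130])
no_ai_times = np.array([90, 100, 70, 120, 105, 95])

observed = ai_times.mean() - no_ai_times.mean()
pooled = np.concatenate([ai_times, no_ai_times])

n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)  # pretend the with/without-AI labels had been assigned differently
    diff = pooled[:len(ai_times)].mean() - pooled[len(ai_times):].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

print(f"observed difference: {observed:.1f} min, permutation p-value: {extreme / n_perm:.3f}")
```

With only a handful of made-up tasks the p-value won't be impressive; the point is just that the randomization itself is what licenses the comparison, with no per-person counterfactual needed.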
So I never feel like I'm getting any faster. 90% of my time is still spent in frustration, even when I'm producing twice the code at higher quality.
An approach I tried recently is to use it as a friction remover instead of a solution provider. I do the programming but use it to remove pebbles, such as that small bit of syntax I forgot, basically to keep up the velocity. However, I don't look at the wholesale code it offers. I think keeping the active thinking cap on results in code I actually understand while avoiding skill atrophy.
This was my pre-AI experience anyway, so getting that first chunk of time back is helpful.
Related: One of the better takes I've seen on AI from an experienced developer was, "90% of my skills just became worthless, and the other 10% just became 1,000 times more valuable." There's some hyperbole there, but I like the gist.
Otherwise he can shut the fuck up about being 1000x more valuable imo
Until AI can actually untangle our 14-year-old codebase full of hodge-podge code and read every commit message, JIRA ticket, and Slack conversation related to the changes in full context, it's not going to solve a lot of the hard problems at my job.
But nothing will make them stick to the one API version I use.
Models trained for tool use can do that. When I use Codex for some Rust stuff, for example, it can grep the source files in the directory where dependencies are stored, so looking up the current APIs is trivial for it. The same works for JavaScript and a bunch of other languages, as long as the sources are accessible somewhere via the tools it has available.
I don't mean to pick on your usage of this specifically, but I think it's noteworthy that the colloquial definition of "rubber ducking" seems to have expanded to include "using a software tool to generate advice/confirm hunches". I always understood the term to mean a personal process of talking through a problem out loud in order to methodically, explicitly understand a theoretical plan/process and expose gaps.
Based on a lot of articles/studies I've seen (admittedly I haven't dug into them too deeply), it seems like the use of chatbots to perform this type of task actually has negative cognitive impacts on some groups of users - the opposite of the personal value I thought rubber-ducking was supposed to provide.
I like to think of it this way: instead of having a seemingly endless amount of half-thoughts spinning around inside your head, you make an idea or thought more “fully formed” when you express it verbally or with written (or typed) words.
I believe this is part of why therapy can work: by actually expressing our thoughts, we’re kind of forced to face realities, and after doing so it’s often much easier to reflect on them. Therapists often recommend personal journals, which can also work for this.
I believe rubber ducking works because having to explain the problem forces you to gather your thoughts into something usable that you can more effectively reflect on.
I see no reason why doing the same thing except in writing to an LLM couldn’t be equally effective.
This is what human language does though, isn't it? Evolves over time, in often weird ways; like how many people "could care less" about something they couldn't care less about.
I just used it to write about 80 lines of new code like that, and there's no question it saves time.
I do think you're onto something with getting pebbles out of the road, inasmuch as once I know what I need to do, AI coding makes the doing much faster. Just yesterday I was playing around with removing things from a List object using the Java streams API and I kept running into ConcurrentModificationExceptions, which happen when the list is structurally modified while it's being iterated, for example by another thread that can't guarantee it has the latest copy of the list. I spent about an hour trying to write a method that deep copies the list, makes the change and then returns the copy, running into all sorts of problems, until I asked AI to build me a thread-safe list mutation method and it was like "Sure, this is how I'd do it, but also the API you're working with already has a method that just... does this." Cases like this are where AI is supremely useful - intricate but well-defined problems.
I think this may become a long horizon harvest for the rigorous OOP strategy, may Bill Joy be disproved.
Gray goo may not [taste] like steel-cut oatmeal.
Autocorrect is a scourge of humanity.
Even so... I still would be really surprised if there wasn't some systematic error here skewing the results, like the developers deliberately picked "easy" tasks that they already knew how to do, so implementing them themselves was particularly fast.
Seems like the authors had about as good a methodology as you can get for something like this. It's just really hard to test stuff like this. I've seen studies "proving" that code comments don't matter, for example... are you going to stop writing comments? No.
We explore this factor in section (C.2.5) - "Trading speed for ease" - in the paper [1]. It's labeled as a factor with an unclear effect: some developers seem to think so, and others don't!
> like the developers deliberately picked "easy" tasks that they already knew how to do
We explore this factor in (C.2.2) - "Unrepresentative task distribution." I think the effect here is unclear; these are certainly real tasks, but they are sampled from the smaller end of tasks developers would work on. I think the relative effect on AI vs. human performance is not super clear...
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
I can use it for parts of code, algorithms, error solving, and maybe sometimes a 'first draft'.
But there is no way I could finish an entire piece of software with AI only.
Define schemas, interfaces, and perhaps some base classes that define the attributes I'm thinking about.
Research libraries that support my cause, and include them.
Reference patterns I have established in other parts of the codebase; internal tooling for database, HTTP services, etc.
Instruct the agent to come up with a plan for a first pass at execution in markdown format. Iterate on this plan; "what about X?"
Splat a bunch of code down that supports the structure I'm looking for. Iterate. Cleanup. Iterate. Implement unit tests and get them to pass.
Go back through everything manually and adjust it to suit my personal style, while at the same time fully understanding what's being done and why.
I use STT a lot to have conversations with the agent as we go, and very rarely allow it to make sequential edits without reviewing first; this is a great opportunity to go back and forth and refine what's being written.
If we studied folks with _no_ AI experience, then we might underestimate speedup, as these folks are learning tools (see a discussion of learning effects in section (C.2.7) - Below-average use of AI tools - in the paper). If we studied folks with _only_ AI experience, then we might overestimate speedup, as perhaps these folks can't really program without AI at all.
In some sense, these are just two separate and interesting questions - I'm excited for future work to really dig in on both!
Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate
Not all payment is cash. Compute credits are still by all means compensation.
Such companies spit out "credits" all over the place in order to gain traction and establish themselves. I remember when cloud providers gave VPS credits to startups like they were peanuts. To me, it really means absolutely nothing.
In-kind compensation is still compensation.
Well, yes? I use compute for some personal projects so I would be absolutely fine if a part of my compensation was in compute credits.
As a company, even more so.
You're extrapolating; it doesn't say that anywhere.
> It's no more "compensation" than a chemistry researcher being "compensated" with test tubes.
Yes, that's compensation too. Thanks for contributing another example. Here's another one: it's no more compensation than a software engineer being compensated with a new computer.
Actually the situation here is way worse than your example. Unless the chemistry researcher is commissioned by Big Test Tube Corp. to conduct research on the outcome of using their test tubes, there's no conflict of interest there. But there is an obvious conflict of interest in AI research being financed by credits given by AI companies to use their own AI tools.
As a philosopher who is into epistemology and ontology, I find this to be as abhorrent as religion.
With science, it doesn't matter who publishes it. Science needs to be replicated.
The psychology replication crisis is a prime example of why peer review and publishing in a journal can matter close to zero.
Rather, it works as an example of a specific case where peer review doesn’t help as much. Peer review checks your arguments, not your data collection process (which the reviewer can’t audit for obvious reasons). It works fine in other scenarios.
Peer review is unrelated to replication problems, except to the extent to which confused people expect peer review to fix totally unrelated replication problems.
...Or should I say "were" very important? With the help of today's GenAI, any low-effort stuff can look high-effort without much extra effort.
One thing that did work in my favor is that I was clearly creating a failing repro test case and including before-and-after results along with the PR. That helped get the PR landed.
There are also a few PRs that never got accepted because the repro wasn't as strong or clear.
There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros to it too!
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.
Also, cool work, very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.
We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).
Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.
If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways to use AI that help a lot, and other ways that hurt more than it helps.
If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but reading the paper I didn't see any indication of that.
That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in section "Experimentally driven overuse of AI (C.2.1)" [1]
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
Did you measure subjective fatigue as one way to explain the misperception that AI was faster? As a developer-turned-manager I like AI because it's easier when my brain is tired.
TLDR: mixed evidence on whether developers find it less effortful, from quantitative and qualitative reports. Unclear effect.
Personally, I think when I tried tools like Void IDE, I was fighting with Void too much. It's beta software, it's buggy, but there's also the big one... the learning curve of the tool.
I haven't had the chance to try Cursor, but I imagine it's going to have a learning curve as a new tool.
So perhaps a slowdown is expected at first, but later, after you get your context and prompting down pat and are asking specifically for what you want, you get your speedup.
My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.
They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.
So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
A quarter of the participants saw increased performance; three quarters saw reduced performance.
One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:
> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.
My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.
Definitely. Effective LLM usage is not as straightforward as people believe. Two big things I see a lot of developers do when they share chats:
1. Talk to the LLM like a human. Remember when internet search first came out, and people were literally "Asking Jeeves" in full natural language? Eventually people learned that you don't need to type, "What is the current weather in San Francisco?" because "san francisco weather" gave you the same, or better, results. Now we've come full circle and people talk to LLMs like humans again, not out of any advanced prompt engineering, but just because it's so anthropomorphized it feels natural. But I can assure you that "pandas count unique values column 'Foo'" is just as effective an LLM prompt as "Using pandas, how do I get the count of unique values in the column named 'Foo'?" (see the sketch at the end of this comment). The LLM is also not insulted by you talking to it like this.
2. Don't know when to stop using the LLM. Rather than let the LLM take you 80% of the way there and then handle the remaining 20% "manually", they'll keep trying to prompt to get the LLM to generate what they want. Sometimes this works, but often it's just a waste of time and it's far more efficient to just take the LLM output and adjust it manually.
Much like so-called Google-fu, LLM usage is a skill and people who don't know what they're doing are going to get substandard results.
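For what it's worth, here's roughly what either phrasing of that pandas prompt in point 1 is asking for; the DataFrame is a toy of my own, and depending on intent the answer is either `nunique()` or `value_counts()`:

```python
# Toy example: the keyword prompt and the full-sentence prompt should both
# point at one of these one-liners.
import pandas as pd

df = pd.DataFrame({"Foo": ["a", "b", "a", "c"]})

print(df["Foo"].nunique())       # number of distinct values in 'Foo' -> 3
print(df["Foo"].value_counts())  # count of rows per distinct value
```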
It is not as straightforward as people are told to believe!
Maybe the LLM doesn't strictly need it, but typing it out does bring some clarity for the asker. I've found it helps a lot to catch myself: what am I even wanting from this?
I don't have any studies, but it seems to me reasonable to assume.
(Unlike google, where presumably it actually used keywords anyway)
In practice I have not had any issues getting information out of an LLM when speaking to them like a computer, rather than a human. At least not for factual or code-related information; I'm not sure how it impacts responses for e.g. creative writing, but that's not what I'm using them for anyway.
How can you be so sure? Did you compare in a systematic way or read papers by people who did it?
Now, I surely get results giving the LLM only snippets and keywords, but for anything complex, I do notice differences depending on the way I articulate. I'm not claiming there is a significant difference, but it seems that way to me.
No, but I didn't need to read scientific papers to figure how to use Google effectively, either. I'm just using a results-based analysis after a lot of LLM usage.
How do we get beyond that?
LLMs have made the distinction ambiguous because their capabilities are so poorly understood. When I say "you should talk to an LLM like it's a computer", that's a workflow statement; it's a more efficient way to accomplish the same goal. You can try it for yourself and see if you agree. I personally liken people who talk to LLMs in full, proper English, capitalization and all, to boomers who still type in full sentences when running a Google query. Is there anything strictly wrong with it? Not really. Do I believe it's a more efficient workflow to just type the keywords that will give you the same result? Yes.
Workflow efficiencies can't really be scientifically evaluated. Some people still prefer to have desktop icons for programs on Windows; my workflow is pressing winkey -> typing the first few characters of the program -> enter. Is one of these methods scientifically more correct? Not really.
So, yeah -- eventually you'll either find your own workflow or copy the workflow of someone you see who is using LLMs effectively. It really is "just trust me, bro."
IMO 80% is way too much. LLMs are probably good for things that are not in your domain knowledge and where you can afford to not be 100% correct, like rendering the Mandelbrot set, simple functions like that.
LLMs are not deterministic: sometimes they produce correct code and other times they produce wrong code. This means one has to audit LLM-generated code, and auditing code takes more effort than writing it, especially if you are not the original author of the code being audited.
Code has to be 100% deterministic. As programmers we write code, detailed instructions for the computer (CPU), and we have developed a lot of tools, such as unit tests, to make sure the computer does exactly what we wrote.
A codebase has a lot of context that you gain by writing the code: some things just look wrong, and you know exactly why because you wrote the code. There is also a lot of context that you should keep in your head as you write the code, context that you miss by simply prompting an LLM.
Noting a few important points here:
1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, the only concern most external reviewers had about experience was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (not because AI got better, but just because their without-AI baseline got much worse). In other words, we're sorta between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!
(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)
The over-optimism is indeed a really important takeaway, and agreed that it's not tool-dependent.
TLDR: over the first 8 issues, developers do not appear to get majorly less slowed down.
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
But: if all developers did 136 AI-assisted issues, why only analyze excluding the 1st 8, rather than, say, the first 68 (half)?
4% more idle time, 20% more AI interaction time.
The 28% less coding/testing/research is why developers reported 20% less work. You might be spending 20% more time overall "working" while you are really idle 5% more time and feel like you've worked less because you were drinking coffee and eating a sandwich between waiting for the AI and reading AI output.
I think the AI skill boost comes from having workflows that let you shave half that git-ops time and cut an extra 5% off coding; if you also cut the idle/waiting and do more prompting of parallel agents plus a bit more testing, then you really are a 2x dev.
This is going to be interesting long-term. Realistically people don't spend anywhere close to 100% of time working and they take breaks after intense periods of work. So the real benefit calculation needs to include: outcome itself, time spent interacting with the app, overlap of tasks while agents are running, time spent doing work over a long period of time, any skill degradation, LLM skills, etc. It's going to take a long time before we have real answers to most of those, much less their interactions.
My working hypothesis is that people who are fast at scanning lots of text (or code for that matter) have a serious advantage. Being able to dismiss unhelpful suggestions quickly and then iterating to get to helpful assistance is key.
Being fast at scanning code correlates with seniority, but there are also senior developers who can write at a solid pace, but prefer to take their time to read and understand code thoroughly. I wouldn't assume that this kind of developer gains little profit from typical AI coding assistance. There are also juniors who can quickly read text, and possibly these have an advantage.
A similar effect has been around with being able to quickly "Google" something. I wouldn't be surprised if this is the same trait at work.
I've found AI to be quite helpful in pointing me in the right direction when navigating an entirely new code-base.
When it's code I already know like the back of my hand, it's not super helpful, other than maybe doing a few automated tasks like refactoring, where there have already been some good tools for a while.
I totally agree with this. Although also, you can end up in a bad spot even after you've gotten pretty good at getting the AI tools to give you good output, because you fail to learn the code you're producing well.
A developer gets better at the code they're working on over time. An LLM gets worse.
You can use an LLM to write a lot of code fast, but if you don't pay enough attention, you aren't getting any better at the code while the LLM is getting worse. This is why you can get like two months of greenfield work done in a weekend but then hit a brick wall - you didn't learn anything about the code that was written, and while the LLM started out producing reasonable code, it got worse until you have a ball of mud that neither the LLM nor you can effectively work on.
So a really difficult skill in my mind is continually avoiding temptation to vibe. Take a whole week to do a month's worth of features, not a weekend to do two month's worth, and put in the effort to guide the LLM to keep producing clean code, and to be sure you know the code. You do want to know the code and you can't do that without putting in work yourself.
I agree. I have found that I can use agents most effectively by letting them write code in small steps. After each step I review the changes and polish them up (either by doing the fixups myself or by prompting). I have found that this helps me understand the code, but it also prevents the model from getting into a bad solution space or producing unmaintainable code.
I also think this kind of closed loop is necessary. Like yesterday, I let an LLM write a relatively complex data structure. It got the implementation nearly correct, but was stuck, unable to find an off-by-one comparison. In this case it was easy to catch because I let it write property-based tests (which I had to fix up to work properly; see the sketch below for the general flavor), but it's easy for things to slip through the cracks if you don't review carefully.
(This is all using Cursor + Claude 4.)
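To illustrate the kind of test that catches an off-by-one (a toy sketch of my own, not the commenter's data structure), a property-based test with Python's hypothesis might look like this:

```python
# Hypothetical example: insert_sorted has an off-by-one (the loop should run
# over range(len(xs))), so it never compares against the last element and,
# e.g., inserting into a one-element list always appends. The property below
# finds a counterexample quickly.
from hypothesis import given, strategies as st

def insert_sorted(xs, value):
    """Insert `value` into the already-sorted list `xs`, keeping it sorted."""
    for i in range(len(xs) - 1):  # bug: skips the last element
        if value <= xs[i]:
            return xs[:i] + [value] + xs[i:]
    return xs + [value]

@given(st.lists(st.integers()).map(sorted), st.integers())
def test_insert_keeps_list_sorted(xs, value):
    assert insert_sorted(xs, value) == sorted(xs + [value])
```

Running it (e.g. under pytest) shrinks to a small failing case such as `xs=[1], value=0`, which is exactly the kind of boundary mistake that's easy to miss in review.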
Everything else in your post is so reasonable, and then you still somehow ended up suggesting that LLMs should be quadrupling our output.
It'll also apply to isolated-enough features, which is still a small amount of someone's work (not often something you'd work on for a full month straight), but more people will have experience with this.
I’ve also noticed that, generally, nobody likes maintaining old systems.
So where does this leave us as software engineers? Should I be excited that it’s easy to spin up a bunch of code that I don’t deeply understand at the beginning of my project, while removing the fun parts of the project?
I’m still grappling with what this means for our industry in 5-10 years…
It’s been a majority of my projects for the past two months. Not because work changed, but because I’ve written a dozen tiny, personalised tools that I wouldn’t have written at all if I didn’t have Claude to do it.
Most of them were completed in less than an hour, to give you an idea of the size. Though it would have easily been a day on my own.
This is visible under the extreme time pressure of producing a working game in 72 hours (our team consistently scores top 100 in Ludum Dare, which is a somewhat high standard).
We use the popular Unity game engine, with which all LLMs have a wealth of experience (as with game development in general), but 80% of the output is so strangely "almost correct but not usable" that I cannot afford the luxury of letting it figure things out, and I use it as fancy autocomplete. And I also still check docs and Stack Overflow-style forums a lot, because of stuff it plainly makes up.
Maybe one of the reasons is that our game mechanics are often a bit off the beaten path, though the last game we made was literally a platformer with rope physics (the LLM could not produce a good idea for stable, simple rope physics that was codeable in 3 hours under our constraints).
Poor stack overflow, it looks like they are the ones really hurting from all this.
This is my intuition as well. I had a teammate use a pretty good analogy today. He likened vibe coding to vacuuming up a string in four tries when it only takes one try to reach down and pick it up. I thought that aligned well with my experience with LLM assisted coding. We have to vacuum the floor while exercising the "difficult skill [of] continually avoiding temptation to vibe"
You hit the nail on the head here.
I feel like I’ve seen a lot of people trying to make strong arguments that AI coding assistants aren’t useful. As someone who uses and enjoys AI coding assistants, I don’t find this research angle to be… uh… very grounded in reality?
Like, if you’re using these things, the fact that they are useful is pretty irrefutable. If one thinks there’s some sort of “productivity mirage” going on here, well OK, but to demonstrate that it might be better to start by acknowledging areas where they are useful, and show that your method explains the reality we’re seeing before using that method to show areas where we might be fooling ourselves.
I can maybe buy that AI might not be useful for certain kinds of tasks or contexts. But I keep pushing their boundaries and they keep surprising me with how capable they are, so it feels like it’ll be difficult to prove otherwise in a durable fashion.
You’ve been given a dubiously capable genie that can write code without you having to do it! If this thing can build first drafts of those side projects you always think about and never get around to, that in and of itself is useful! If it can do the yak-shaving required to set up those e2e tests you know you should have but never have time for it is useful!
Have it try out all the dumb ideas you have that might be cool but don’t feel worth your time to boilerplate out!
I like to think we’re a bunch of creative people here! Stop thinking about how it can make you money and use it for fun!
Took me a week to build those tools. It's much more reliable (and flexible) than any LLM and cost me nothing.
It comes with secure auth, email, admin, etc. etc. Doesn't cost me a dime and almost never has a common vulnerability.
Best part about it. I know how my side project runs.
LLMs have a very steep and long learning curve, as you posit (though note the points from the paper authors in the other reply).
Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.
I recall an adage about work estimation: as chunks get too big, people unconsciously substitute "how long will the work take to do" with "how possible does the final outcome feel."
People asked "how long did it take" could be substituting something else, such as "how alone did I feel while working on it."
One thing that happened here is that they aren't using current LLMs:
> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.
That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.
I've been hearing this for 2 years now
the previous model retroactively becomes total dogshit the moment a new one is released
convenient, isn't it?
Yes, it might make a difference, but it is a little tiresome that there's always a “this is based on a model that is x months old!” comment, because it will always be true: an academic study does not get funded, executed, written up, and published in less time.
"No, the 2.8 release is the first good one. It massively improves workflows"
Then, 6 months later, the study comes out.
"Ah man, 2.8 was useless, 3.0 really crossed the threshold on value add"
At some point, you roll your eyes and assume it is just snake oil sales
* the release of agentic workflow tools
* the release of MCPs
* the release of new models, Claude 4 and Gemini 2.5 in particular
* subagents
* asynchronous agents
All or any of these could have made for a big or small impact. For example, I’m big on agentic tools, skeptical of MCPs, and don’t think we yet understand subagents. That’s different from those who, for example, think MCPs are the future.
> At some point, you roll your eyes and assume it is just snake oil sales
No, you have to realize you’re talking to a population of people, and not necessarily the same person. Opinions are going to vary, they’re not literally the same person each time.
There are surely snake oil salesman, but you can’t buy anything from me.
Of course it's possible that at some point you get to a model that really works, irrespective of the history of false claims from the zealots, but it does mean you should take their comments with a grain of salt.
Right.
> except that that is the same thing the same people say for every model release,
I did not say that, no.
I am sure you can find someone who is in a Groundhog Day about this, but it’s just simpler than that: as tools improve, more people find them useful than before. You’re not talking to the same people, you are talking to new people each time who now have had their threshold crossed.
no, it's the same names, again and again
That sounds like a claim you could back up with a little bit of time spent using Hacker News search or similar.
(I might try to get a tool like o3 to run those searches for me.)
Sure, you may end up missing out on a good thing and then having to come late to the party, but coming early to the party too many times, only to find the beer watered down and the food full of grubs, is apt to make you cynical the next time a party announcement comes your way.
(Unless one believes the most grandiose prophecies of a technological-singularity apocalypse, that is.)
Like the boy who cried wolf, it'll eventually be true with enough time... But we should stop giving them the benefit of the doubt.
_____
Jan 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Feb 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Mar 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Apr 2025: [Ad nauseam, you get the idea]
Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.
For context, I've been using AI, a mix of OpenAi + Claude, mainly for bashing out quick React stuff. For over a year now. Anything else it's generally rubbish and slower than working without. Though I still use it to rubber duck, so I'm still seeing the level of quality for backend.
I'd say they're only marginally better today than they were even 2 years ago.
Every time a new model comes out you get a bunch of people raving how great the new one is and I honestly can't really tell the difference. The only real difference is reasoning models actually slowed everything down, but now I see its reasoning. It's only useful because I often spot it leaving out important stuff from the final answer.
Just two years ago, this failed.
> Me: What language is this: "esto está escrito en inglés"
> LLM: English
Gemini and Opus have solved questions that took me weeks to solve myself. And I'll feed some complex code into each new iteration and it will catch a race condition I missed even with testing and line by line scrutiny.
Consider how many more years of experience you need as a software engineer to catch hard race conditions just from reading code than someone who couldn't do it after trying 100 times. We take it for granted already since we see it as "it caught it or it didn't", but these are massive jumps in capability.
As with anything, your miles may vary: I’m not here to tell anyone that thinks they still suck that their experience is invalid, but to me it’s been a pretty big swing.
Same. For me the turning point was VS Code’s Copilot Agent mode in April. That changed everything about how I work, though it had a lot of drawbacks due to its glitches (many of these were fixed within 6 or so weeks).
When Claude Sonnet 4 came out in May, I could immediately tell it was a step-function increase in capability. It was the first time an AI, faced with ambiguous and complicated situations, would be willing to answer a question with a definitive and confident “No”.
After a few weeks, it became clear that VS Code’s interface and usage limits were becoming the bottleneck. I went to my boss, bullet points in hand, and easily got approval for the Claude Max $200 plan. Boom, another step-function increase.
We’re living in an incredibly exciting time to be a skilled developer. I understand the need to stay skeptical and measure the real benefits, but I feel like a lot of people are getting caught up in the culture war aspect and are missing out on something truly wonderful.
So are you using Claude Code via the max plan, Cursor, or what?
I think I'd definitely hit AI news exhaustion and was viewing people raving about this agentic stuff as yet more AI fanbois. I'd just continued using the AI separate as setting up a new IDE seemed like too much work for the fractional gains I'd been seeing.
There is a skill gap, like, I think of it like vim: at first it slows you down, but then as you learn it, you end up speeding up. So you may also find that it doesn't really vibe with the way you work, even if I am having a good time with it. I know people who are great engineers who still don't like this stuff, just like I know ones that do too.
[0]: https://marketplace.visualstudio.com/items?itemName=anthropi...
An LLM that can test the code it is writing and then iterate to fix the bugs turns out to be a huge step forward from LLMs that just write code without trying to then exercise it.
The jump has been massive.
Sure they may get even more useful in the future but that doesn’t change my present.
More generally, this phenomenon is quite simply explained and not surprising: new things improve, quickly. That does not mean that something is good or valuable, but it's how new tech gets introduced every single time, and it readily explains changing sentiment.
Generally, I do a couple of edits for clarity after posting and reading again. Sometimes that involves removing something that I feel could have been said better. If it does not work, I will just delete the comment. Whatever it was must not have been a super huge deal (to me).
Every hype cycle feels like this, and some of them are nonsense and some of them are real. We’ll see.
In contrast, what do I care if you believe in code generation AI? If you do, you are probably driving up pricing. I mean, I am sure that there are people that care very much, but there is little inherent value for me in you doing so, as long as the people who are building the AI are making enough profit to keep it running.
With regards to the VCs, well, how many VCs are there in the world? How many of the people who have something good to say about AI are likely VCs? I might be off by an order of magnitude, but even then it would really not be driving the discussion.
We're in a hype cycle, and it means we should be extra critical when evaluating the tech so we don't get taken in by exaggerated claims.
The people not buying into the hype, on the other hand, are actually the ones that have a very good reason to be invested, because if they turn out to be wrong they might face some very uncomfortable adjustments in the job landscape, and to the value of a lot of the skills that they worked so hard to gain and believed to be valuable.
As always, be wary of any claims, but the tension here is very much the reverse of crypto, and I don't think that's well appreciated.
If you pay attention to who says it, you'll find that people have different personal thresholds for finding llms useful, not that any given person like steveklabnik above keeps flip-flopping on their view.
This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...
I do not program for my day job and I vibe coded two different web projects. One in twenty mins as a test with cloudflare deployment having never used cloudflare and one in a week over vacation (and then fixed a deep safari bug two weeks later by hammering the LLM). These tools massively raise the capabilities for sub-average people like me and decrease the time / brain requirements significantly.
I had to make a little update to reset the KV store on cloudflare and the LLM did it in 20s after failing the syntax twice. I would’ve spent at least a few minutes looking it up otherwise.
It's been a very noticeable uptick in power, and although there have been some nice increases with past model releases, this has been both the largest and the one that has unlocked the most real value since I've been following the tech.
I would argue you don't need the "as a programming assistant" phrase: right now, from my experience over the past 2 years, literally every single AI tool is massively oversold as to its utility. I've literally not seen a single one that delivers on what it's billed as capable of.
They're useful, but right now they need a lot of handholding and I don't have time for that. Too much fact checking. If I want a tool I always have to double check, I was born with a memory so I'm already good there. I don't want to have to fact check my fact checker.
LLMs are great at small tasks. The larger the single task is, or the more tasks you try to cram into one session, the worse they fall apart.
The developer who has experience using Cursor saw a productivity increase not because he became better at using Cursor, but because he became worse at not using it.
A much simpler explanation is what your parent offered. And to many behavioralists it is actually the same explanation, as to a true scotsm... [cough] behavioralist personality is simply learned habits, so—by Occam’s razor—you should omit personality from your model.
Nobody is denying that people have personalities btw. Not even true behavioralists do that, they simply argue from reductionism that personality can be explained with learning contingencies and the reinforcement history. Very few people are true behavioralists these days though, but within the behavior sciences, scientists are much more likely to borrow missing factors (i.e. things that learning contingencies fail to explain) from fields such as cognitive science (or even further to neuroscience) and (less often) social science.
What I am arguing here, however, is that the appeal to personality is unnecessary when explaining behavior.
As for figuring out what personality is, that is still within the realm of philosophy. Maybe cognitive science will do a better job at explaining it than psychometricians have done for the past century. I certainly hope so, it would be nice to have a better model of human behavior. But I think even if we could explain personality, it still wouldn’t help us here. At best we would be in a similar situation as physics, where one model can explain things traveling at the speed of light, while another model can explain things at the sub-atomic scale, but the two models cannot be applied together.
Developers' own skills might atrophy, when they don't write that much code themselves, relying on AI instead.
And now when comparing with/without AI they're faster with. But a year ago they might have been that fast or faster without an AI.
I'm not saying that that's how things are. Just pointing out another way to interpret what GP said
I assume that many large companies have tested the efficiency gains and losses of their programmers much more extensively than the authors of this tiny study.
A survey of companies and their evaluations and conclusions would carry more weight (excluding companies selling AI products, of course).
I think an easy measure to help identify why a slowdown is happening would be to measure how much refactoring happened on the AI-generated code. Often it seems to be missing stuff like error handling, or it adds in unnecessary stuff. Of course this assumes it even had a working solution in the first place.
Most people who subscribe to that narrative have some connection to "AI" money, but there might be some misguided believers as well.
I've found that there are a couple of things you need to do to be very efficient.
- Maintain an architecture.md file (with AI assistance) that answers many of the questions and clarifies a lot of the ambiguity in the design and structure of the code.
- A bootstrap.md file(s) is also useful for a lot of tasks.. having the AI read it and start with a correct idea about the subject is useful and a time saver for a variety of kinds of tasks.
- Regularly asking the AI to refactor code, simplify it, modularize it - this is what the experienced dev is for. VIBE coding generally doesn't work as AI's tend to write messy non-modular code unless you tell them otherwise. But if you review code, ask for specific changes.. they happily comply.
- Read the code produced, and carefully review it. And notice and address areas where there are issues, have the AI fix all of these.
- Take over when there are editing tasks you can do more efficiently.
- Structure the solution/architecture in ways that you know the AI will work well with.. things it knows about.. it's general sweet spots.
- Know when to stop using the AI and code it yourself, particularly when the AI has entered the confusion doom loop. Time wasted trying to get the AI to figure out something it never will is better spent just fixing it yourself.
- Know when to just not ever try to use AI. Intuitively you know there's just certain code you can't trust the AI to safely work on. Don't be a fool and break your software.
----
I've found there's no guarantee that AI assistance will speed up any one project (and in some cases slow it down).. but measured cross all tasks and projects, the benefits are pretty substantial. That's probably others experience at this point too.
Are we still selling the "you are an expert senior developer" meme? I can completely see how, once you are working on a mature codebase, LLMs would only slow you down. Especially one that was not created by an LLM and where you are the expert.
I think LLMs shine when you need to write a higher volume of code that extends a proven pattern, quickly explore experiments that require a lot of boilerplate, or have multiple smaller tasks that you can set multiple agents upon to parallelize. I've also had success in using LLMs to do a lot of external documentation research in order to integrate findings into code.
If you are fine-tuning an algorithm or doing domain-expert-level tweaks that require a lot of contextual input-output expert analysis, then you're probably better off just coding on your own.
Context engineering has been mentioned a lot lately, but it's not a meme. It's the real trick to successful LLM agent usage. Good context documentation, guides, and well-defined processes (just like with a human intern) will mean the difference between success and failure.
AI has a lot of potential but it's way over-hyped right now. Listen to the people on the ground who are doing real work and building real projects, none of them are over-hyping it. It's mostly those who have tangentially used LLMs.
It's also not surprising that many in this thread are clinging to a basic premise that it's 3 steps backwards to go 5 steps forward. Perhaps that is true but I'll take the study at face value, it seems very plausible to me.
Yes, and I'll add that there is likely no single "golden workflow" that works for everybody, and everybody needs to figure it out for themselves. It took me months to figure out how to be effective with these tools, and I doubt my approach will transfer over to others' situations.
For instance, I'm working solo on smallish, research-y projects and I had the freedom to structure my code and workflows in a way that works best for me and the AI. Briefly: I follow an ad-hoc, pair-programming paradigm, fluidly switching between manual coding and AI-codegen depending on an instinctive evaluation of whether a prompt would be faster. This rapid manual-vs-prompt assessment is second nature to me now, but it took me a while to build that muscle.
I've not worked with coding agents, but I doubt this approach will transfer over well to them.
I've said it before, but this is technology that behaves like people, and so you have to approach it like working with a colleague, with all their quirks and fallibilities and potentially-unbound capabilities, rather than a deterministic, single-purpose tool.
I'd love to see a follow-up of the study where they let the same developers get more familiar with AI-assisted coding for a few months and repeat the experiment.
Actually, it works well so long as you tell them when you’ve made a change. Claude gets confused if things randomly change underneath it, but it has no trouble so long as you give it a short explanation.
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This is what I heard about strong type systems (especially Haskell's) about 15-20 years ago. "History does not repeat, but it rhymes."
If we rhyme "strong types will change the world" with "agentic LLMs will change the world," what do we get?
My personal theory is that we will get the same: some people will get modest-to-substantial benefits there, but changes in the world will be small if noticeable at all.
Also, my long experience is that even in PoC phase, using a type system adds almost zero extra time… of course if you know the type system, which should be trivial in any case after you’ve seen a few.
The study used 246 tasks across 16 developers, for an average of 15 tasks per developer. Divide that further in half because tasks were assigned as AI or not-AI assisted, and the sample size per developer is still relatively small. Someone would have to take the time to review the statistics, but I don’t think this is a case where you can start inferring that the developers who benefited from AI were just better at using AI tools than those who were not.
I do agree that it would be interesting to repeat a similar test on developers who have more AI tool assistance, but then there is a potential confounding effect that AI-enthusiastic developers could actually lose some of their practice in writing code without the tools.
The short version is that devs want to give instructions instead of asking for the outcome they want. When it doesn't follow the instructions, they double down by being more precise, the worst thing you can do. When non-devs don't get what they want, they add more detail to the description of the desired outcome.
Once you get past the control problem, then you have a second set of issues for devs where the things that should be easy or hard don’t necessarily map to their mental model of what is easy or hard, so they get frustrated with the LLM when it can’t do something “easy.”
Lastly, devs keep a shitload of context in their head - the project, what they are working on, application state, etc. - and they need to do that for LLMs too, but they have to repeat themselves often and "be" the external memory for the LLM. Most devs I have taught hate that; they actually would rather have it the other way around, where they get help with context and state but instruct the computer on their own.
Interestingly, the best AI assisted devs have often moved to management/solution architecture, and they find the AI code tools brought back some of the love of coding. I have a hypothesis they’re wired a bit differently and their role with AI tools is actually closer to management than it is development in a number of ways.
The CTO and VPEng at my company (very small; they still do technical work occasionally) both love the agent stuff so much. Part of it for them is that it gives them the opportunity to do technical work again with the limited time they have. Without having to distract an actual dev, or spend a long time reading through the codebase, they can quickly get context for and build small items themselves.
This suggests to me, though, that they are bad at coding, otherwise they would have stayed longer. And I can't find anything in your comment that would corroborate the opposite. So what gives?
I am not saying what you say is untrue, but you didn't give us any convincing arguments to believe otherwise.
Also, you didn't define the criteria of getting better. Getting better in terms of what exactly???
It's completely normal in development. How many years of programming experience do you need for almost any language? How many days/weeks do you need to use debuggers effectively? How long from first contact with version control until you get git?
I think it's the opposite actually - it's common that new classes of tools in tech need experience to use well. Much less if you're moving to something different within the same class.
Is that perhaps because of the nature of the category of "tech product"? In other domains, this certainly isn't the case. Especially if the goal is to get the best result instead of the optimum output/effort balance.
Musical instruments are a clear case where the best results are down to the user. Most crafts are similar. There is the proverb "A bad craftsman blames his tools" that highlights that there are entire fields where the skill of the user is considered to be the most important thing.
When a product is aimed at as many people as the marketers can find, that focus on individual ability is lost and the product targets the lowest common denominator.
They are easier to use, but less capable at their peak. I think of the state of LLMs analogous to home computing at a stage of development somewhere around Altair to TRS-80 level. These are the first ones on the scene, people are exploring what they are good for, how they work, and sometimes putting them to effective use in new and interesting ways. It's not unreasonable to expect a degree of expertise at this stage.
The LLM equivalent of a Mac will come, plenty of people will attempt to make one before it's ready. There will be a few Apple Newtons along the way that will lead people to say the entire notion was foolhardy. Then someone will make it work. That's when you can expect to use something without expertise. We're not there yet.
Maybe, but it isn't hard to think of developer tools where this is the case. This is the entire history of editor and IDE wars.
Imagine running this same study design with vim. How well would you expect the not-previously-experienced developers to perform in such a study?
If my phone keeps crashing or if the browser is slow or clunky then yes, it’s not on me, it’s the phone, but an LLM is a lot more open ended in what it can do. Unlike the phone example above where I expect it to work from a simple input (turning it on) or action (open browser, punch in a url), what an LLM does is more complex and nuanced.
Even the same prompt from different users might result in different output - so there is more onus on the user to craft the right input.
Perhaps that’s why AI is exempt for now.
Nothing new this time, except for people who have no vision and no ability to work hard not "getting it" because they don't have the cognitive capacity to learn.
The most useful thing of all would have been to have screen recordings of those 16 developers working on their assigned issues, so they could be reviewed for varying approaches to AI-assisted dev, and we could be done with this absurd debate once and for all.
Can someone point me to these 300k/yr jobs?
I feel like a proper study for this would involve following multiple developers over time, tracking how their contribution patterns and social standing change. For example, take three cohorts of relatively new developers: instruct one to go all in on agentic development, let one freely use AI tools, and prohibit one from using AI tools. Then teach these developers open source (like a course based on this book: https://pragprog.com/titles/a-vbopens/forge-your-future-with...) and have them work for a year to become part of a project of their choosing. Then, at the end, track a number of metrics such as leadership position in the community, coding/non-coding contributions, emotional connection to the project, social connections made with the community, knowledge of the code base, etc.
Personally, my prior is that the no-AI group would likely still be ahead overall.
Two things stand out to me: 1. it depends a lot on what kind of task you are having the LLM do, and 2. even if the LLM process takes more time, your cognitive effort was very likely still way less.
For sysadmin kinds of tasks, working with less often accessed systems, LLMs can read --help, man pages, and doc sites for you and give you the working command right there (and then run it, look at the output, and tell you why it failed, or how it worked and what it did). There is absolutely no question that second part is a big deal.
Sticking it onto my large open source project to fix a deep, esoteric issue or write some subtle documentation where it doesn't really "get" what I'm doing? Yeah, it is not as productive in that realm, and you might want to skip it for the thinking part there.
I think everyone is trying to figure out this question of "when and how" for LLMs. I think the sweet spot is tasks involving systems and technologies where you'd otherwise spend a lot of time googling, reading Stack Overflow, and reading man pages to get just the right parameters into commands and so forth. This is cognitive grunt work, and LLMs can do that part very well.
My week of effort with it was not really "coding on my open source project"; two examples were:
1. Running a bunch of Ansible playbooks that I wrote years ago on a new host, where OS upgrades had lots of snags. I worked with Claude to debug all the various error messages and places where the newer OS distribution had different packages, missing packages, etc. It was ENORMOUSLY helpful, since I never look at these playbooks and I don't even remember what I did; Claude can read and interpret them as well as you can.
2. I got a Bugzilla for a Fedora package that I packaged years ago, where they have some change to the directives used in specfiles that everyone has to make. I look at Fedora packaging workflows once every three years. I told Claude to read the BZ and just do it. IT DID IT. I had to get involved running the "mock" suite since it needed sudo, but Claude gave me the commands. Zero googling. Zero even reading the new format of the specfile (the BZ linked to a tool that does the conversion). From bug received to bug closed, I didn't do any typing at all outside of the prompt. Had it done before breakfast, since I didn't even need any glucose for the mental energy expended. This would have been a painful and frustrating mental effort otherwise.
So the studies have to get more nuanced and survey a lot more than 16 devs, I think.
But it has hampered me in that others, uninvited, toss an AI code review tool at some of my open PRs, and that spits out a 2-page document with cute emoji and formatted bullet points going over all aspects of a 30-line PR.
Just adds to the noise, so now I spend time deleting or hiding those comments in PRs, which means I have even _less_ time for actual useful maintenance work. (Not that I have much already.)
I use it like a know-it-all personal assistant that I can ask any question to; even [especially] the embarrassing, "stupid" ones.
> The only stupid question is the one we don't ask.
- On an old art teacher's wall
If you are looking for a 0.1% increase in productivity, then 16 is too small.
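For a rough sense of scale (illustrative numbers, not from the study): the standard two-sample sample-size approximation for detecting a mean difference $\delta$ when the outcome has standard deviation $\sigma$ is

$$ n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2} \quad \text{per group.} $$

At $\alpha = 0.05$ and 80% power ($z$-values of roughly 1.96 and 0.84), an effect of $\delta = 0.1\%$ against even an optimistically small spread of $\sigma = 10\%$ in task times gives $n \approx 2 \times 7.84 \times 0.01 / 0.000001 \approx 157{,}000$ observations per group. A 20% effect with the same spread, by contrast, needs only a handful.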
Soon, once the tools and how people use them improve, AI won't be a hindrance for advanced tasks like this, and soon after, AI will be able to do these PRs on its own. It's inevitable given the rate of improvement even since this study.
Currently AI is like a junior engineer, and if you don't have good experience managing junior engineers, AI isn't going to help you as much.
> This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
I wonder what could explain such a large difference between estimation/experience and reality. Any ideas?
Maybe our brains are measuring mental effort and distorting our experience of time?
The "economic experts" and "ml experts" are in many cases effectively the same group-- companies pushing AI coding tools have a vested interest in people believing they're more useful than they are. Executives take this at face value and broadly promise major wins. Economic experts take this at face value and use this for their forecasts.
This propagates further, and now novices and casual individuals begin to believe in the hype. Eventually, even for experienced engineers, it moves the "baseline" expectation much higher.
Unfortunately this is very difficult to capture empirically.
It was fun to watch, it’s super polished and sci-fi-esque. But after 15 minutes I felt braindead and was bored out of my mind lol
What if agentic coding sessions are triggering a similar dopamine feedback loop as social media apps? Obviously not to the same degree as social media apps, I mean coding for work is still "work"... but there's maybe some similarity in getting iterative solutions from the agent, triggering something in your brain each time, yes?
If that was the case, wouldn't we expect developers to have an overly positive perception of AI because they're literally becoming addicted to it?
https://softwarecrisis.dev/letters/llmentalist/
Plus there's a gambling mechanic: Push the button, sometimes get things for free.
My issue with calling this a 'negative' thing is that I'm not sure it is. It works off the same hunting/foraging instincts that keep us alive. If you feel addiction to something positive, is that bad?
Social media is negative because it addicts you to mostly low-quality filler content, content that doesn't challenge you. You are reading shitposts instead of reading a book or doing something better for you in the long run.
One could argue that's true for AI, but I'm not confident enough to make such a statement.
I wish there was a simple way to measure energy spent instead of time. Maybe nature is just optimizing for something else.
The developers might feel more productive because they're engaging with their code at a higher level of abstraction, even if it takes longer. This would be consistent with why they maintained positive perceptions despite the slowdown.
But taking a broader view, it's possible that these initial speed-ups are negated by the fact that I never really learn Go or Helm charts as deeply now that I use Claude Code. Over time, it's possible that my net productivity is still reduced. Hard to say for sure, especially considering I might not have even attempted these more difficult Go library modifications if I didn't have Claude Code to hold my hand.
Regardless, these tools are out there, increasing in effectiveness and I do feel like I need to jump on the train before it leaves me at the station.
Would be interesting (and in fact necessary to derive conclusions from this study) to see the aggregate number of tasks completed per developer with AI augmentation. That is, if time per task has gone up by 20% but we clear 2x as many tasks, that is a pretty important caveat to the results published here.
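To make that arithmetic concrete (illustrative numbers, not from the study): suppose a task that took 1 hour solo takes 1.2 hours with AI, but the developer keeps two AI-assisted tasks in flight at once. Latency per task is worse, yet throughput is better:

$$ \text{throughput}_{\text{solo}} = \frac{1\ \text{task}}{1\ \text{h}} = 1\ \text{task/h}, \qquad \text{throughput}_{\text{AI}} = \frac{2\ \text{tasks}}{1.2\ \text{h}} \approx 1.67\ \text{tasks/h} $$

Whether developers actually sustain that kind of parallelism is exactly what the aggregate task counts would reveal.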
you're addicted to the FEELING of productivity more than actual productivity. even knowing this, even seeing the data, even acknowledging the complete fuckery of it all, you're still gonna use me. i'm still gonna exist. you're all still gonna pretend this helps because the alternative is admitting you spent billions of dollars on spicy autocomplete.
Scaled up slightly, we use it to build plenty of internal tooling in our video content production pipeline (syncing between encoding tools and a status dashboard for our non-technical content team).
Using it in production code for anything more than boilerplate, well-defined but tedious refactors, or quickly demonstrating how to use an unfamiliar API, without a human taking a full pass at everything afterwards, is something I'm going to be wary of for a long time.
I feel like programming has become increasingly specialized, and even before the AI tool explosion it was way more possible to be ignorant of an enormous amount of "computing" than it used to be. I feel like a lot of "full stack" developers only understand things to the margins of their frameworks; above and below them they barely know how a computer works, what different wire protocols actually are, or what an OS might actually do at a lower level. Let alone the context in which an application sits beyond, let's say, a level above a Kubernetes pod and a kind of trial-and-error approach to poking at some YAML templates.
Do we all need to know about processor architectures and microcode and L2 caches and paging and OS distributions and system software and installers and OpenSSL engines (and how to make sure you have the one that uses native instructions) and TCP packets and Envoy and controllers and Raft systems and topic partitions and cloud IAM and CDN and DNS? Since that's not the case (nearly everyone has vast areas of ignorance yet still does a bunch of stuff), it's harder to sell the idea that whatever skills we lose to AI tools will somehow vaguely matter in the future.
I kind of miss when you had to know a little of everything and it also seemed like "a little bit" was a bigger slice of what there was to know. Now you talk to people who use a different framework in your own language and you feel like you're talking to deep specialists whose concerns you can barely understand the existence of, let alone have an opinion on.
Such as: do you end up spending more time finding and fixing issues? Does AI use reduce institutional knowledge? Will you be more inclined to start projects over from scratch?
* 3 weeks to transition from AI pairing to AI delegation to AI multitasking. So work gains are mostly week 3+. That's 120+ hours in, as someone pretty senior here.
* Speedup is the wrong metric. Think throughput, not latency. Some finite amount of work might take longer, but the volume of work should go up because AI can do more on a task and handle different tasks/projects in parallel.
Both perspectives seem consistent with the paper description...
For example, today I asked Claude to implement per-user rate limiting in my NestJS service, then iterated by asking it to implement specific unit tests and do some refactoring. It one-shot everything. I would say 90% time savings.
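For readers unfamiliar with NestJS, here is a minimal sketch of what a per-user rate-limiting guard can look like (hypothetical names and limits, not the code Claude produced; a production service would more likely use @nestjs/throttler or a Redis-backed store than an in-memory map):

```typescript
import {
  CanActivate,
  ExecutionContext,
  HttpException,
  HttpStatus,
  Injectable,
} from '@nestjs/common';

const WINDOW_MS = 60_000;  // 1-minute window (assumed value)
const MAX_REQUESTS = 100;  // allowed requests per user per window (assumed value)

@Injectable()
export class PerUserRateLimitGuard implements CanActivate {
  // Maps a user id to the timestamps of that user's requests in the current window.
  private readonly hits = new Map<string, number[]>();

  canActivate(context: ExecutionContext): boolean {
    const req = context.switchToHttp().getRequest();
    // Assumes an auth layer has already attached `req.user`; falls back to the IP.
    const userId: string = req.user?.id ?? req.ip;

    const now = Date.now();
    const recent = (this.hits.get(userId) ?? []).filter((t) => now - t < WINDOW_MS);

    if (recent.length >= MAX_REQUESTS) {
      throw new HttpException('Rate limit exceeded', HttpStatus.TOO_MANY_REQUESTS);
    }

    recent.push(now);
    this.hits.set(userId, recent);
    return true;
  }
}
```

Wired up with `@UseGuards(PerUserRateLimitGuard)` on a controller (or globally), this is exactly the kind of well-scoped, pattern-heavy task the comment describes.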
Unskilled people ask them "I have giant problem X, solve it" and end up with slop.
"One-shotting" apps, or even Cursor and so forth, seem like a waste of time. It feels like if you prompt it just right it might help, but then it never really does.
For everything else, I think you're right, and actually the dialog-oriented method is way better. If I learn an approach and apply some general example from ChatGPT, but I do the typing and implementation myself so I need to understand what I'm doing, I'm actually leveling up and I know what I'm finished with. If I weren't "experienced", I'd worry about what it was doing to my critical thinking skills, but I know enough about learning on my own at this point to know I'm doing something.
I'm not interested in vibe coding at all--it seems like a one-way process to automate what was already not the hard part of software engineering; generating tutorial-level initial implementations. Just more scaffolding that eventually needs to be cleared away.
This shows that everyone in the study (economic experts, ML experts, and even the developers themselves, even after getting experience) is a novice if we look at them from the perspective of the Dunning–Kruger effect [1].
[1] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
"The Dunning–Kruger effect is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities."
I wouldn't accept someone's copy and pasted code from another project if it were under an incompatible license, let alone something with unknown origin.
Anyway, AI as the tech currently stands is a new skill to use and takes us humans time to learn, but once we learn it well, it becomes a force multiplier.
I.e., see this: https://claude.ai/public/artifacts/221821f0-0677-409b-8294-3...
You can still do those things.
Cursor’s workflow exposes how differently different people track context. The best ways to work with Cursor may simply not work for some of us.
If Cursor isn’t working for you, I strongly encourage you to try CLI agents like Claude Code.
But isn't the important thing to measure... how long does it take to debug the resulting code at 3AM when you get a PagerDuty alert?
Similarly... how about the quality of this code over time? It's taken a lot of effort to bring some of the code bases I work in into a more portable, less coupled, more concise state through the hard work of
- bringing shared business logic up into shared folders
- working to ensure call chains flow top down towards root then back up through exposed APIs from other modules as opposed to criss-crossing through the directory structure
- working to separate business logic from API logic from display logic
- working to provide encapsulation through the use of wrapper functions creating portability
- using techniques like dependency injection to decouple concepts allowing for easier testing
etc
So, do we end up with better code quality that ends up being more maintainable, extensible, portable, and composable? Or do we just end up with lots of poor quality code that eventually grows to become a tangled mess we spend 50% of our time fighting bugs on?
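As a small, hypothetical TypeScript illustration of the dependency-injection point above: the service depends on a narrow interface rather than a concrete client, so tests can inject a stub and the real implementation stays swappable:

```typescript
// Hypothetical names for illustration only.
interface PaymentGateway {
  charge(customerId: string, cents: number): Promise<void>;
}

class CheckoutService {
  // The dependency is injected, not constructed inside the class.
  constructor(private readonly gateway: PaymentGateway) {}

  async checkout(customerId: string, cents: number): Promise<void> {
    if (cents <= 0) throw new Error('invalid amount');
    await this.gateway.charge(customerId, cents);
  }
}

// In tests, a fake gateway records calls instead of hitting a real API.
const calls: Array<[string, number]> = [];
const fakeGateway: PaymentGateway = {
  async charge(customerId, cents) {
    calls.push([customerId, cents]);
  },
};

const service = new CheckoutService(fakeGateway);
```

The open question in the parent comment is whether AI-generated code tends toward this shape or away from it.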
My personal hypothesis is that seeing the LLM write _so much_ code may create the feeling that the problems it is solving would take longer to solve by yourself.
I already know what I need to write, I just need to get it into the editor. I wouldn’t trade the precision I have with vim macros flying across multiple files for an AI workflow.
I do think AI is a good rubber ducky sometimes tho, but I despise letting it take over editing files.
To be clear it wasn't $75k each.
Paper is here: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf