I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs. with it. The other is that it's tempting to time an AI with metrics like how long until the PR was opened or merged. But the AI workflow fundamentally shifts engineering hours so that a greater percentage of time is spent on refactoring, testing, and resolving issues later in the process, including after the code is initially approved and merged. I can see how it's easy for a developer to report that AI completed a task quickly because the PR was opened quickly, discounting the amount of future work that the PR created.
I'm not making any claim in either direction; the authors themselves recognize the study's limitations. I'm just trying to say that everyone should have far greater error bars. This technology is the weirdest shit I've seen in my lifetime; making deductions about productivity from anecdotes and dubious benchmarks is basically reading tea leaves.
But Figure 18 shows that time spent actively coding decreased (which might be where the feeling of a speed-up was coming from) and the gains were eaten up by time spent prompting, waiting for and then reviewing the AI output and generally being idle.
So maybe it's not a good idea to use LLMs for tasks that you could've done yourself in under 5 minutes.
Developers do spend their time totally differently, though; this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time when they have AI vs. not - in general, when these devs have AI, they spend a smaller % of time writing code and a larger % of time working with AI (which... makes sense).
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
The standard experimental design that solves this is to randomly assign participants to the experiment group (with AI) and the control group (without AI), which is what they did. This isolates the variable of interest (with or without AI) by averaging out uncontrollable individual, contextual, and environmental differences. You don't need to know how a single individual in a single context would have behaved in the other group. With a large enough sample size and effect size, you can establish statistical significance and attribute the difference between groups to the with-or-without-AI variable.
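To make that concrete, here's a minimal sketch (toy numbers of my own, not data from the study) of how randomization lets you test whether an observed with-AI vs. without-AI difference is bigger than chance, via a permutation test:

```python
# Hypothetical per-task completion times in minutes; the group labels were
# randomly assigned, so under the null hypothesis ("AI makes no difference")
# any relabeling of the pooled times is equally likely.
import numpy as np

rng = np.random.default_rng(0)
ai_times = np.array([95, 120, 80, 150, 110, 130])
no_ai_times = np.array([90, 100, 70, 120, 105, 95])

observed = ai_times.mean() - no_ai_times.mean()
pooled = np.concatenate([ai_times, no_ai_times])

n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)  # pretend the with/without-AI labels had been assigned differently
    diff = pooled[:len(ai_times)].mean() - pooled[len(ai_times):].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

print(f"observed difference: {observed:.1f} min, permutation p-value: {extreme / n_perm:.3f}")
```

With only a handful of made-up tasks the p-value won't be impressive; the point is just that the randomization itself is what licenses the comparison, with no per-person counterfactual needed.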
So I never feel like I'm getting any faster. 90% of my time is still spent in frustration, even when I'm producing twice the code at higher quality.
An approach I tried recently is to use it as a friction remover instead of a solution provider. I do the programming but use it to remove pebbles, such as that small bit of syntax I forgot, basically to keep up the velocity. However, I don't look at the wholesale code it offers. I think keeping the active thinking cap on results in code I actually understand while avoiding skill atrophy.
This was my pre-AI experience anyway, so getting that first chunk of time back is helpful.
Related: One of the better takes I've seen on AI from an experienced developer was, "90% of my skills just became worthless, and the other 10% just became 1,000 times more valuable." There's some hyperbole there, but I like the gist.
Otherwise he can shut the fuck up about being 1000x more valuable imo
Until AI can actually untangle our 14-year-old codebase full of hodge-podge code and read every commit message, JIRA ticket, and Slack conversation related to the changes in full context, it's not going to solve a lot of the hard problems at my job.
But nothing will make them stick to the one API version I use.
Models trained for tool use can do that. When I use Codex for some Rust stuff, for example, it can grep the source files in the directory where dependencies are stored, so looking up the current APIs is trivial for it. The same works for JavaScript and a bunch of other languages, as long as the sources are accessible somewhere via the tools it has available.
I don't mean to pick on your usage of this specifically, but I think it's noteworthy that the colloquial definition of "rubber ducking" seems to have expanded to include "using a software tool to generate advice/confirm hunches". I always understood the term to mean a personal process of talking through a problem out loud in order to methodically, explicitly understand a theoretical plan/process and expose gaps.
Based on a lot of articles/studies I've seen (admittedly I haven't dug into them too deeply), it seems like the use of chatbots to perform this type of task actually has negative cognitive impacts on some groups of users - the opposite of the personal value I thought rubber-ducking was supposed to provide.
I like to think of it this way: instead of having a seemingly endless amount of half-thoughts spinning around inside your head, you make an idea or thought more “fully formed” when you express it verbally or with written (or typed) words.
I believe this is part of why therapy can work: by actually expressing our thoughts, we’re kind of forced to face realities, and after doing so it’s often much easier to reflect on them. Therapists often recommend personal journals, which can also work for this.
I believe rubber ducking works because having to explain the problem forces you to gather your thoughts into something usable that you can more effectively reflect on.
I see no reason why doing the same thing except in writing to an LLM couldn’t be equally effective.
This is what human language does though, isn't it? Evolves over time, in often weird ways; like how many people "could care less" about something they couldn't care less about.
I just used it to write about 80 lines of new code like that, and there's no question it saves time.
I do think you're onto something with getting pebbles out of the road, inasmuch as once I know what I need to do, AI coding makes the doing much faster. Just yesterday I was playing around with removing things from a List object using the Java streams API and I kept running into ConcurrentModificationExceptions, which happen when the list is structurally modified while it's being iterated, for example by another thread that can't guarantee it has the latest copy of the list. I spent about an hour trying to write a method that deep copies the list, makes the change and then returns the copy, running into all sorts of problems, until I asked AI to build me a thread-safe list mutation method and it was like "Sure, this is how I'd do it, but also the API you're working with already has a method that just... does this." Cases like this are where AI is supremely useful - intricate but well-defined problems.
I think this may become a long horizon harvest for the rigorous OOP strategy, may Bill Joy be disproved.
Gray goo may not [taste] like steel-cut oatmeal.
Autocorrect is a scourge of humanity.
Even so... I still would be really surprised if there wasn't some systematic error here skewing the results, like the developers deliberately picked "easy" tasks that they already knew how to do, so implementing them themselves was particularly fast.
Seems like the authors had about as good a methodology as you can get for something like this. It's just really hard to test stuff like this. I've seen studies "proving" that code comments don't matter, for example... are you going to stop writing comments? No.
We explore this factor in section (C.2.5) - "Trading speed for ease" - in the paper [1]. It's labeled as a factor with an unclear effect: some developers seem to think so, and others don't!
> like the developers deliberately picked "easy" tasks that they already knew how to do
We explore this factor in (C.2.2) - "Unrepresentative task distribution." I think the effect here is unclear; these are certainly real tasks, but they are sampled from the smaller end of tasks developers would work on. I think the relative effect on AI vs. human performance is not super clear...
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
I can use it for parts of code, algorithms, error solving, and maybe sometimes a 'first draft'.
But there is no way I could finish an entire piece of software with AI only.
Define schemas, interfaces, and perhaps some base classes that define the attributes I'm thinking about.
Research libraries that support my cause, and include them.
Reference patterns I have established in other parts of the codebase; internal tooling for database, HTTP services, etc.
Instruct the agent to come up with a plan for a first pass at execution in markdown format. Iterate on this plan; "what about X?"
Splat a bunch of code down that supports the structure I'm looking for. Iterate. Cleanup. Iterate. Implement unit tests and get them to pass.
Go back through everything manually and adjust it to suit my personal style, while at the same time fully understanding what's being done and why.
I use STT a lot to have conversations with the agent as we go, and very rarely allow it to make sequential edits without reviewing first; this is a great opportunity to go back and forth and refine what's being written.
If we studied folks with _no_ AI experience, then we might underestimate speedup, as these folks are learning tools (see a discussion of learning effects in section (C.2.7) - Below-average use of AI tools - in the paper). If we studied folks with _only_ AI experience, then we might overestimate speedup, as perhaps these folks can't really program without AI at all.
In some sense, these are just two separate and interesting questions - I'm excited for future work to really dig in on both!
Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate
Not all payment is cash. Compute credits are still by all means compensation.
Such companies spit out "credits" all over the place in order to gain traction and establish themselves. I remember when cloud providers gave VPS credits to startups like they were peanuts. To me, it really means absolutely nothing.
In-kind compensation is still compensation.
Well, yes? I use compute for some personal projects so I would be absolutely fine if a part of my compensation was in compute credits.
As a company, even more so.
You're extrapolating; it doesn't say that anywhere.
> It's no more "compensation" than a chemistry researcher being "compensated" with test tubes.
Yes, that's compensation too. Thanks for contributing another example. Here's another one: it's no more compensation than a software engineer being compensated with a new computer.
Actually the situation here is way worse than your example. Unless the chemistry researcher is commissioned by Big Test Tube Corp. to conduct research on the outcome of using their test tubes, there's no conflict of interest there. But there is an obvious conflict of interest in AI research being financed by credits given by AI companies to use their own AI tools.
As a philosopher who is into epistemology and ontology, I find this to be as abhorrent as religion.
With science, it doesn't matter who publishes it. Science needs to be replicated.
The psychology replication crisis is a prime example of why peer review and publishing in a journal can matter close to zero.
Rather, it works as an example of a specific case where peer review doesn’t help as much. Peer review checks your arguments, not your data collection process (which the reviewer can’t audit for obvious reasons). It works fine in other scenarios.
Peer review is unrelated to replication problems, except to the extent to which confused people expect peer review to fix totally unrelated replication problems.
...Or should I say "were" very important? With the help of today's GenAI, any low-effort stuff can look high-effort without much extra effort.
One thing that did work in my favor is that I was clearly creating a failing repro test case and including before-and-after results along with the PR. That helped get the PR landed.
There are also a few PRs that never got accepted because the repro wasn't as strong or clear.
There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros to it too!
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.
Also, cool work, very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.
We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).
Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.
If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways to use AI that help a lot, and other ways that hurt more than it helps.
If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but reading the paper I didn't see any indication of that.
That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in section "Experimentally driven overuse of AI (C.2.1)" [1]
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
Did you measure subjective fatigue as one way to explain the misperception that AI was faster? As a developer-turned-manager I like AI because it's easier when my brain is tired.
TLDR: mixed evidence on whether developers find it less effortful, from quantitative and qualitative reports. Unclear effect.
Personally, I think when I tried tools like Void IDE, I was fighting with Void too much. It's beta software, it's buggy, but there's also the big one... the learning curve of the tool.
I haven't had the chance to try Cursor, but I imagine it's going to have a learning curve as a new tool.
So perhaps a slowdown is expected at first, but later, after you get your context and prompting down pat and are asking specifically for what you want, you get your speedup.
My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.
They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.
So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
A quarter of the participants saw increased performance; three quarters saw reduced performance.
One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:
> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.
My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.
Definitely. Effective LLM usage is not as straightforward as people believe. Two big things I see a lot of developers do when they share chats:
1. Talk to the LLM like a human. Remember when internet search first came out, and people were literally "Asking Jeeves" in full natural language? Eventually people learned that you don't need to type, "What is the current weather in San Francisco?" because "san francisco weather" gave you the same, or better, results. Now we've come full circle and people talk to LLMs like humans again, not out of any advanced prompt engineering, but just because it's so anthropomorphized it feels natural. But I can assure you that "pandas count unique values column 'Foo'" is just as effective an LLM prompt as "Using pandas, how do I get the count of unique values in the column named 'Foo'?" (see the sketch at the end of this comment). The LLM is also not insulted by you talking to it like this.
2. Don't know when to stop using the LLM. Rather than let the LLM take you 80% of the way there and then handle the remaining 20% "manually", they'll keep trying to prompt to get the LLM to generate what they want. Sometimes this works, but often it's just a waste of time and it's far more efficient to just take the LLM output and adjust it manually.
Much like so-called Google-fu, LLM usage is a skill and people who don't know what they're doing are going to get substandard results.
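For what it's worth, here's roughly what either phrasing of that pandas prompt in point 1 is asking for; the DataFrame is a toy of my own, and depending on intent the answer is either `nunique()` or `value_counts()`:

```python
# Toy example: the keyword prompt and the full-sentence prompt should both
# point at one of these one-liners.
import pandas as pd

df = pd.DataFrame({"Foo": ["a", "b", "a", "c"]})

print(df["Foo"].nunique())       # number of distinct values in 'Foo' -> 3
print(df["Foo"].value_counts())  # count of rows per distinct value
```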
It is not as straightforward as people are told to believe!
Maybe the LLM doesn't strictly need it, but typing it out does bring some clarity for the asker. I've found it helps a lot to catch myself: what am I even wanting from this?
I don't have any studies, but it seems to me reasonable to assume.
(Unlike google, where presumably it actually used keywords anyway)
In practice I have not had any issues getting information out of an LLM when speaking to them like a computer, rather than a human. At least not for factual or code-related information; I'm not sure how it impacts responses for e.g. creative writing, but that's not what I'm using them for anyway.
How can you be so sure? Did you compare in a systematic way or read papers by people who did it?
Now, I surely get results giving the LLM only snippets and keywords, but for anything complex, I do notice differences depending on the way I articulate. I'm not claiming there is a significant difference, but it seems that way to me.
No, but I didn't need to read scientific papers to figure how to use Google effectively, either. I'm just using a results-based analysis after a lot of LLM usage.
How do we get beyond that?
LLMs have made the distinction ambiguous because their capabilities are so poorly understood. When I say "you should talk to an LLM like it's a computer", that's a workflow statement; it's a more efficient way to accomplish the same goal. You can try it for yourself and see if you agree. I personally liken people who talk to LLMs in full, proper English, capitalization and all, to boomers who still type in full sentences when running a Google query. Is there anything strictly wrong with it? Not really. Do I believe it's a more efficient workflow to just type the keywords that will give you the same result? Yes.
Workflow efficiencies can't really be scientifically evaluated. Some people still prefer to have desktop icons for programs on Windows; my workflow is pressing winkey -> typing the first few characters of the program -> enter. Is one of these methods scientifically more correct? Not really.
So, yeah -- eventually you'll either find your own workflow or copy the workflow of someone you see who is using LLMs effectively. It really is "just trust me, bro."
IMO 80% is way too much. LLMs are probably good for things that are not in your domain knowledge and where you can afford to not be 100% correct, like rendering the Mandelbrot set, simple functions like that.
LLMs are not deterministic: sometimes they produce correct code and other times they produce wrong code. This means one has to audit LLM-generated code, and auditing code takes more effort than writing it, especially if you are not the original author of the code being audited.
Code has to be 100% deterministic. As programmers we write code, detailed instructions for the computer (CPU), and we have developed a lot of tools, such as unit tests, to make sure the computer does exactly what we wrote.
A codebase has a lot of context that you gain by writing the code: some things just look wrong, and you know exactly why because you wrote the code. There is also a lot of context that you should keep in your head as you write the code, context that you miss by simply prompting an LLM.
Noting a few important points here:
1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, the only concern most external reviewers had about experience was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (not because AI got better, but just because their without-AI baseline got much worse). In other words, we're sorta between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!
(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)
The over-optimism is indeed a really important takeaway, and agreed that it's not tool-dependent.
TLDR: over the first 8 issues, developers do not appear to get majorly less slowed down.
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
But: if all developers did 136 AI-assisted issues, why only analyze excluding the 1st 8, rather than, say, the first 68 (half)?
4% more idle time, 20% more AI interaction time.
The 28% less coding/testing/research is why developers reported 20% less work. You might be spending 20% more time overall "working" while you are really idle 5% more time and feel like you've worked less because you were drinking coffee and eating a sandwich between waiting for the AI and reading AI output.
I think the AI skill boost comes from having workflows that let you shave half that git-ops time and cut an extra 5% off coding; if you also cut the idle/waiting and do more prompting of parallel agents plus a bit more testing, then you really are a 2x dev.
This is going to be interesting long-term. Realistically people don't spend anywhere close to 100% of time working and they take breaks after intense periods of work. So the real benefit calculation needs to include: outcome itself, time spent interacting with the app, overlap of tasks while agents are running, time spent doing work over a long period of time, any skill degradation, LLM skills, etc. It's going to take a long time before we have real answers to most of those, much less their interactions.
My working hypothesis is that people who are fast at scanning lots of text (or code for that matter) have a serious advantage. Being able to dismiss unhelpful suggestions quickly and then iterating to get to helpful assistance is key.
Being fast at scanning code correlates with seniority, but there are also senior developers who can write at a solid pace, but prefer to take their time to read and understand code thoroughly. I wouldn't assume that this kind of developer gains little profit from typical AI coding assistance. There are also juniors who can quickly read text, and possibly these have an advantage.
A similar effect has been around with being able to quickly "Google" something. I wouldn't be surprised if this is the same trait at work.
I've found AI to be quite helpful in pointing me in the right direction when navigating an entirely new code-base.
When it's code I already know like the back of my hand, it's not super helpful, other than maybe doing a few automated tasks like refactoring, where there have already been some good tools for a while.
I totally agree with this. Although also, you can end up in a bad spot even after you've gotten pretty good at getting the AI tools to give you good output, because you fail to learn the code you're producing well.
A developer gets better at the code they're working on over time. An LLM gets worse.
You can use an LLM to write a lot of code fast, but if you don't pay enough attention, you aren't getting any better at the code while the LLM is getting worse. This is why you can get like two months of greenfield work done in a weekend but then hit a brick wall - you didn't learn anything about the code that was written, and while the LLM started out producing reasonable code, it got worse until you have a ball of mud that neither the LLM nor you can effectively work on.
So a really difficult skill in my mind is continually avoiding temptation to vibe. Take a whole week to do a month's worth of features, not a weekend to do two month's worth, and put in the effort to guide the LLM to keep producing clean code, and to be sure you know the code. You do want to know the code and you can't do that without putting in work yourself.
I agree. I have found that I can use agents most effectively by letting them write code in small steps. After each step I review the changes and polish them up (either by doing the fixups myself or by prompting). I have found that this helps me understand the code, but it also prevents the model from getting into a bad solution space or producing unmaintainable code.
I also think this kind of closed loop is necessary. Like yesterday, I let an LLM write a relatively complex data structure. It got the implementation nearly correct, but was stuck, unable to find an off-by-one comparison. In this case it was easy to catch because I let it write property-based tests (which I had to fix up to work properly; see the sketch below for the general flavor), but it's easy for things to slip through the cracks if you don't review carefully.
(This is all using Cursor + Claude 4.)
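To illustrate the kind of test that catches an off-by-one (a toy sketch of my own, not the commenter's data structure), a property-based test with Python's hypothesis might look like this:

```python
# Hypothetical example: insert_sorted has an off-by-one (the loop should run
# over range(len(xs))), so it never compares against the last element and,
# e.g., inserting into a one-element list always appends. The property below
# finds a counterexample quickly.
from hypothesis import given, strategies as st

def insert_sorted(xs, value):
    """Insert `value` into the already-sorted list `xs`, keeping it sorted."""
    for i in range(len(xs) - 1):  # bug: skips the last element
        if value <= xs[i]:
            return xs[:i] + [value] + xs[i:]
    return xs + [value]

@given(st.lists(st.integers()).map(sorted), st.integers())
def test_insert_keeps_list_sorted(xs, value):
    assert insert_sorted(xs, value) == sorted(xs + [value])
```

Running it (e.g. under pytest) shrinks to a small failing case such as `xs=[1], value=0`, which is exactly the kind of boundary mistake that's easy to miss in review.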
Everything else in your post is so reasonable, and then you still somehow ended up suggesting that LLMs should be quadrupling our output.
It'll also apply to isolated-enough features, which is still a small amount of someone's work (not often something you'd work on for a full month straight), but more people will have experience with this.
I’ve also noticed that, generally, nobody likes maintaining old systems.
So where does this leave us as software engineers? Should I be excited that it’s easy to spin up a bunch of code that I don’t deeply understand at the beginning of my project, while removing the fun parts of the project?
I’m still grappling with what this means for our industry in 5-10 years…
It’s been a majority of my projects for the past two months. Not because work changed, but because I’ve written a dozen tiny, personalised tools that I wouldn’t have written at all if I didn’t have Claude to do it.
Most of them were completed in less than an hour, to give you an idea of the size. Though it would have easily been a day on my own.
This is visible under the extreme time pressure of producing a working game in 72 hours (our team consistently scores top 100 in Ludum Dare, which is a somewhat high standard).
We use the popular Unity game engine, with which all LLMs have a wealth of experience (as with game development in general), but 80% of the output is so strangely "almost correct but not usable" that I cannot afford the luxury of letting it figure things out, and I use it as fancy autocomplete. And I also still check docs and Stack Overflow-style forums a lot, because of stuff it plainly makes up.
Maybe one of the reasons is that our game mechanics are often a bit off the beaten path, though the last game we made was literally a platformer with rope physics (the LLM could not produce a good idea for stable, simple rope physics that was codeable in 3 hours under our constraints).
Poor stack overflow, it looks like they are the ones really hurting from all this.
This is my intuition as well. I had a teammate use a pretty good analogy today. He likened vibe coding to vacuuming up a string in four tries when it only takes one try to reach down and pick it up. I thought that aligned well with my experience with LLM assisted coding. We have to vacuum the floor while exercising the "difficult skill [of] continually avoiding temptation to vibe"
You hit the nail on the head here.
I feel like I’ve seen a lot of people trying to make strong arguments that AI coding assistants aren’t useful. As someone who uses and enjoys AI coding assistants, I don’t find this research angle to be… uh… very grounded in reality?
Like, if you’re using these things, the fact that they are useful is pretty irrefutable. If one thinks there’s some sort of “productivity mirage” going on here, well OK, but to demonstrate that it might be better to start by acknowledging areas where they are useful, and show that your method explains the reality we’re seeing before using that method to show areas where we might be fooling ourselves.
I can maybe buy that AI might not be useful for certain kinds of tasks or contexts. But I keep pushing their boundaries and they keep surprising me with how capable they are, so it feels like it’ll be difficult to prove otherwise in a durable fashion.
You’ve been given a dubiously capable genie that can write code without you having to do it! If this thing can build first drafts of those side projects you always think about and never get around to, that in and of itself is useful! If it can do the yak-shaving required to set up those e2e tests you know you should have but never have time for it is useful!
Have it try out all the dumb ideas you have that might be cool but don’t feel worth your time to boilerplate out!
I like to think we’re a bunch of creative people here! Stop thinking about how it can make you money and use it for fun!
Took me a week to build those tools. It's much more reliable (and flexible) than any LLM and cost me nothing.
It comes with secure auth, email, admin, etc. etc. Doesn't cost me a dime and almost never has a common vulnerability.
Best part about it. I know how my side project runs.
LLMs have a very steep and long learning curve, as you posit (though note the points from the paper authors in the other reply).
Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.
I recall an adage about work estimation: as chunks get too big, people unconsciously substitute "how long will the work take to do" with "how possible does the final outcome feel."
People asked "how long did it take" could be substituting something else, such as "how alone did I feel while working on it."
One thing that happened here is that they aren't using current LLMs:
> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.
That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.
I've been hearing this for 2 years now
the previous model retroactively becomes total dogshit the moment a new one is released
convenient, isn't it?
Yes, it might make a difference, but it is a little tiresome that there's always a “this is based on a model that is x months old!” comment, because it will always be true: an academic study does not get funded, executed, written up, and published in less time.
"No, the 2.8 release is the first good one. It massively improves workflows"
Then, 6 months later, the study comes out.
"Ah man, 2.8 was useless, 3.0 really crossed the threshold on value add"
At some point, you roll your eyes and assume it is just snake oil sales
* the release of agentic workflow tools
* the release of MCPs
* the release of new models, Claude 4 and Gemini 2.5 in particular
* subagents
* asynchronous agents
All or any of these could have made for a big or small impact. For example, I’m big on agentic tools, skeptical of MCPs, and don’t think we yet understand subagents. That’s different from those who, for example, think MCPs are the future.
> At some point, you roll your eyes and assume it is just snake oil sales
No, you have to realize you’re talking to a population of people, and not necessarily the same person. Opinions are going to vary, they’re not literally the same person each time.
There are surely snake oil salesman, but you can’t buy anything from me.
Of course it's possible that at some point you get to a model that really works, irrespective of the history of false claims from the zealots, but it does mean you should take their comments with a grain of salt.
Right.
> except that that is the same thing the same people say for every model release,
I did not say that, no.
I am sure you can find someone who is in a Groundhog Day about this, but it’s just simpler than that: as tools improve, more people find them useful than before. You’re not talking to the same people, you are talking to new people each time who now have had their threshold crossed.
no, it's the same names, again and again
That sounds like a claim you could back up with a little bit of time spent using Hacker News search or similar.
(I might try to get a tool like o3 to run those searches for me.)
Sure, you may end up missing out on a good thing and then having to come late to the party, but coming early to the party too many times, only to find the beer watered down and the food full of grubs, is apt to make you cynical the next time a party announcement comes your way.
(Unless one believes the most grandiose prophecies of a technological-singularity apocalypse, that is.)
Like the boy who cried wolf, it'll eventually be true with enough time... But we should stop giving them the benefit of the doubt.
_____
Jan 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Feb 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Mar 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Apr 2025: [Ad nauseam, you get the idea]
Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.
For context, I've been using AI, a mix of OpenAi + Claude, mainly for bashing out quick React stuff. For over a year now. Anything else it's generally rubbish and slower than working without. Though I still use it to rubber duck, so I'm still seeing the level of quality for backend.
I'd say they're only marginally better today than they were even 2 years ago.
Every time a new model comes out you get a bunch of people raving how great the new one is and I honestly can't really tell the difference. The only real difference is reasoning models actually slowed everything down, but now I see its reasoning. It's only useful because I often spot it leaving out important stuff from the final answer.
Just two years ago, this failed.
> Me: What language is this: "esto está escrito en inglés"
> LLM: English
Gemini and Opus have solved questions that took me weeks to solve myself. And I'll feed some complex code into each new iteration and it will catch a race condition I missed even with testing and line by line scrutiny.
Consider how many more years of experience you need as a software engineer to catch hard race conditions just from reading code than someone who couldn't do it after trying 100 times. We take it for granted already since we see it as "it caught it or it didn't", but these are massive jumps in capability.
As with anything, your miles may vary: I’m not here to tell anyone that thinks they still suck that their experience is invalid, but to me it’s been a pretty big swing.
Same. For me the turning point was VS Code’s Copilot Agent mode in April. That changed everything about how I work, though it had a lot of drawbacks due to its glitches (many of these were fixed within 6 or so weeks).
When Claude Sonnet 4 came out in May, I could immediately tell it was a step-function increase in capability. It was the first time an AI, faced with ambiguous and complicated situations, would be willing to answer a question with a definitive and confident “No”.
After a few weeks, it became clear that VS Code’s interface and usage limits were becoming the bottleneck. I went to my boss, bullet points in hand, and easily got approval for the Claude Max $200 plan. Boom, another step-function increase.
We’re living in an incredibly exciting time to be a skilled developer. I understand the need to stay skeptical and measure the real benefits, but I feel like a lot of people are getting caught up in the culture war aspect and are missing out on something truly wonderful.
So are you using Claude Code via the max plan, Cursor, or what?
I think I'd definitely hit AI news exhaustion and was viewing people raving about this agentic stuff as yet more AI fanbois. I'd just continued using the AI separate as setting up a new IDE seemed like too much work for the fractional gains I'd been seeing.
There is a skill gap, like, I think of it like vim: at first it slows you down, but then as you learn it, you end up speeding up. So you may also find that it doesn't really vibe with the way you work, even if I am having a good time with it. I know people who are great engineers who still don't like this stuff, just like I know ones that do too.
[0]: https://marketplace.visualstudio.com/items?itemName=anthropi...
An LLM that can test the code it is writing and then iterate to fix the bugs turns out to be a huge step forward from LLMs that just write code without trying to then exercise it.
The jump has been massive.
Sure they may get even more useful in the future but that doesn’t change my present.
More generally, this phenomenon is quite simply explained and not surprising: new things improve, quickly. That does not mean that something is good or valuable, but it's how new tech gets introduced every single time, and it readily explains changing sentiment.
Generally, I do a couple of edits for clarity after posting and reading again. Sometimes that involves removing something that I feel could have been said better. If it does not work, I will just delete the comment. Whatever it was must not have been a super huge deal (to me).
Every hype cycle feels like this, and some of them are nonsense and some of them are real. We’ll see.
In contrast, what do I care if you believe in code generation AI? If you do, you are probably driving up pricing. I mean, I am sure that there are people that care very much, but there is little inherent value for me in you doing so, as long as the people who are building the AI are making enough profit to keep it running.
With regards to the VCs, well, how many VCs are there in the world? How many of the people who have something good to say about AI are likely VCs? I might be off by an order of magnitude, but even then it would really not be driving the discussion.
We're in a hype cycle, and it means we should be extra critical when evaluating the tech so we don't get taken in by exaggerated claims.
The people not buying into the hype, on the other hand, are actually the ones that have a very good reason to be invested, because if they turn out to be wrong they might face some very uncomfortable adjustments in the job landscape, and to the value of a lot of the skills that they worked so hard to gain and believed to be valuable.
As always, be wary of any claims, but the tension here is very much the reverse of crypto, and I don't think that's well appreciated.
If you pay attention to who says it, you'll find that people have different personal thresholds for finding llms useful, not that any given person like steveklabnik above keeps flip-flopping on their view.
This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...
I do not program for my day job and I vibe coded two different web projects. One in twenty mins as a test with cloudflare deployment having never used cloudflare and one in a week over vacation (and then fixed a deep safari bug two weeks later by hammering the LLM). These tools massively raise the capabilities for sub-average people like me and decrease the time / brain requirements significantly.
I had to make a little update to reset the KV store on cloudflare and the LLM did it in 20s after failing the syntax twice. I would’ve spent at least a few minutes looking it up otherwise.
It's been a very noticeable uptick in power, and although there have been some nice increases with past model releases, this has been both the largest and the one that has unlocked the most real value since I've been following the tech.
I would argue you don't need the "as a programming assistant" phrase: right now, from my experience over the past 2 years, literally every single AI tool is massively oversold as to its utility. I've literally not seen a single one that delivers on what it's billed as capable of.
They're useful, but right now they need a lot of handholding and I don't have time for that. Too much fact checking. If I want a tool I always have to double check, I was born with a memory so I'm already good there. I don't want to have to fact check my fact checker.
LLMs are great at small tasks. The larger the single task is, or the more tasks you try to cram into one session, the worse they fall apart.
The developer who has experience using Cursor saw a productivity increase not because he became better at using Cursor, but because he became worse at not using it.
A much simpler explanation is what your parent offered. And to many behavioralists it is actually the same explanation, as to a true scotsm... [cough] behavioralist personality is simply learned habits, so—by Occam’s razor—you should omit personality from your model.
Nobody is denying that people have personalities btw. Not even true behavioralists do that, they simply argue from reductionism that personality can be explained with learning contingencies and the reinforcement history. Very few people are true behavioralists these days though, but within the behavior sciences, scientists are much more likely to borrow missing factors (i.e. things that learning contingencies fail to explain) from fields such as cognitive science (or even further to neuroscience) and (less often) social science.
What I am arguing here, however, is that the appeal to personality is unnecessary when explaining behavior.
As for figuring out what personality is, that is still within the realm of philosophy. Maybe cognitive science will do a better job at explaining it than psychometricians have done for the past century. I certainly hope so, it would be nice to have a better model of human behavior. But I think even if we could explain personality, it still wouldn’t help us here. At best we would be in a similar situation as physics, where one model can explain things traveling at the speed of light, while another model can explain things at the sub-atomic scale, but the two models cannot be applied together.
Developers' own skills might atrophy, when they don't write that much code themselves, relying on AI instead.
And now when comparing with/without AI they're faster with. But a year ago they might have been that fast or faster without an AI.
I'm not saying that that's how things are. Just pointing out another way to interpret what GP said
I assume that many large companies have tested the efficiency gains and losses of their programmers much more extensively than the authors of this tiny study.
A survey of companies and their evaluations and conclusions would carry more weight (excluding companies selling AI products, of course).
I think an easy measure to help identify why a slowdown is happening would be to measure how much refactoring happened on the AI-generated code. Often it seems to be missing stuff like error handling, or it adds in unnecessary stuff. Of course this assumes it even had a working solution in the first place.
Most people who subscribe to that narrative have some connection to "AI" money, but there might be some misguided believers as well.
I've found that there are a couple of things you need to do to be very efficient.
- Maintain an architecture.md file (with AI assistance) that answers many of the questions and clarifies a lot of the ambiguity in the design and structure of the code.
- A bootstrap.md file(s) is also useful for a lot of tasks.. having the AI read it and start with a correct idea about the subject is useful and a time saver for a variety of kinds of tasks.
- Regularly asking the AI to refactor code, simplify it, modularize it - this is what the experienced dev is for. VIBE coding generally doesn't work as AI's tend to write messy non-modular code unless you tell them otherwise. But if you review code, ask for specific changes.. they happily comply.
- Read the code produced, and carefully review it. And notice and address areas where there are issues, have the AI fix all of these.
- Take over when there are editing tasks you can do more efficiently.
- Structure the solution/architecture in ways that you know the AI will work well with.. things it knows about.. it's general sweet spots.
- Know when to stop using the AI and code it yourself, particularly when the AI has entered the confusion doom loop. Time wasted trying to get the AI to figure out something it never will is better spent just fixing it yourself.
- Know when to just not ever try to use AI. Intuitively you know there's just certain code you can't trust the AI to safely work on. Don't be a fool and break your software.
----
I've found there's no guarantee that AI assistance will speed up any one project (and in some cases slow it down).. but measured cross all tasks and projects, the benefits are pretty substantial. That's probably others experience at this point too.
Are we still selling the "you are an expert senior developer" meme? I can completely see how, once you are working on a mature codebase, LLMs would only slow you down. Especially one that was not created by an LLM and where you are the expert.
I think LLMs shine when you need to write a higher volume of code that extends a proven pattern, quickly explore experiments that require a lot of boilerplate, or have multiple smaller tasks that you can set multiple agents upon to parallelize. I've also had success in using LLMs to do a lot of external documentation research in order to integrate findings into code.
If you are fine-tuning an algorithm or doing domain-expert-level tweaks that require a lot of contextual input-output expert analysis, then you're probably better off just coding on your own.
Context engineering has been mentioned a lot lately, but it's not a meme. It's the real trick to successful LLM agent usage. Good context documentation, guides, and well-defined processes (just like with a human intern) will mean the difference between success and failure.
AI has a lot of potential but it's way over-hyped right now. Listen to the people on the ground who are doing real work and building real projects, none of them are over-hyping it. It's mostly those who have tangentially used LLMs.
It's also not surprising that many in this thread are clinging to a basic premise that it's 3 steps backwards to go 5 steps forward. Perhaps that is true but I'll take the study at face value, it seems very plausible to me.
Yes, and I'll add that there is likely no single "golden workflow" that works for everybody, and everybody needs to figure it out for themselves. It took me months to figure out how to be effective with these tools, and I doubt my approach will transfer over to others' situations.
For instance, I'm working solo on smallish, research-y projects and I had the freedom to structure my code and workflows in a way that works best for me and the AI. Briefly: I follow an ad-hoc, pair-programming paradigm, fluidly switching between manual coding and AI-codegen depending on an instinctive evaluation of whether a prompt would be faster. This rapid manual-vs-prompt assessment is second nature to me now, but it took me a while to build that muscle.
I've not worked with coding agents, but I doubt this approach will transfer over well to them.
I've said it before, but this is technology that behaves like people, and so you have to approach it like working with a colleague, with all their quirks and fallibilities and potentially-unbound capabilities, rather than a deterministic, single-purpose tool.
I'd love to see a follow-up of the study where they let the same developers get more familiar with AI-assisted coding for a few months and repeat the experiment.
Actually, it works well so long as you tell them when you’ve made a change. Claude gets confused if things randomly change underneath it, but it has no trouble so long as you give it a short explanation.
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This is what I heard about strong type systems (especially Haskell's) about 15-20 years ago. "History does not repeat, but it rhymes."
If we rhyme "strong types will change the world" with "agentic LLMs will change the world," what do we get?
My personal theory is that we will get the same: some people will get modest-to-substantial benefits there, but changes in the world will be small if noticeable at all.
Also, my long experience is that even in PoC phase, using a type system adds almost zero extra time… of course if you know the type system, which should be trivial in any case after you’ve seen a few.
The study used 246 tasks across 16 developers, for an average of 15 tasks per developer. Divide that further in half because tasks were assigned as AI or not-AI assisted, and the sample size per developer is still relatively small. Someone would have to take the time to review the statistics, but I don’t think this is a case where you can start inferring that the developers who benefited from AI were just better at using AI tools than those who were not.
I do agree that it would be interesting to repeat a similar test on developers who have more AI tool assistance, but then there is a potential confounding effect that AI-enthusiastic developers could actually lose some of their practice in writing code without the tools.
The short version is that devs want to give instructions instead of asking for the outcome they want. When it doesn't follow the instructions, they double down by being more precise, the worst thing you can do. When non-devs don't get what they want, they add more detail to the description of the desired outcome.
Once you get past the control problem, then you have a second set of issues for devs where the things that should be easy or hard don’t necessarily map to their mental model of what is easy or hard, so they get frustrated with the LLM when it can’t do something “easy.”
Lastly, devs keep a shitload of context in their head - the project, what they are working on, application state, etc. - and they need to do that for LLMs too, but they have to repeat themselves often and "be" the external memory for the LLM. Most devs I have taught hate that; they actually would rather have it the other way around, where they get help with context and state but instruct the computer on their own.
Interestingly, the best AI assisted devs have often moved to management/solution architecture, and they find the AI code tools brought back some of the love of coding. I have a hypothesis they’re wired a bit differently and their role with AI tools is actually closer to management than it is development in a number of ways.
The CTO and VPEng at my company (very small; they still do technical work occasionally) both love the agent stuff so much. Part of it for them is that it gives them the opportunity to do technical work again with the limited time they have. Without having to distract an actual dev, or spend a long time reading through the codebase, they can quickly get context for and build small items themselves.
This suggests to me, though, that they are bad at coding, otherwise they would have stayed longer. And I can't find anything in your comment that would corroborate the opposite. So what gives?
I am not saying what you say is untrue, but you didn't give us any convincing arguments to believe otherwise.
Also, you didn't define the criteria of getting better. Getting better in terms of what exactly???
It's completely normal in development. How many years of programming experience do you need for almost any language? How many days/weeks do you need to use debuggers effectively? How long from first contact with version control until you get git?
I think it's the opposite actually - it's common that new classes of tools in tech need experience to use well. Much less if you're moving to something different within the same class.
Is that perhaps because of the nature of the category of "tech product"? In other domains, this certainly isn't the case. Especially if the goal is to get the best result instead of the optimum output/effort balance.
Musical instruments are a clear case where the best results are down to the user. Most crafts are similar. There is the proverb "A bad craftsman blames his tools" that highlights that there are entire fields where the skill of the user is considered to be the most important thing.
When a product is aimed at as many people as the marketers can find, that focus on individual ability is lost and the product targets the lowest common denominator.
They are easier to use, but less capable at their peak. I think of the state of LLMs analogous to home computing at a stage of development somewhere around Altair to TRS-80 level. These are the first ones on the scene, people are exploring what they are good for, how they work, and sometimes putting them to effective use in new and interesting ways. It's not unreasonable to expect a degree of expertise at this stage.
The LLM equivalent of a Mac will come, plenty of people will attempt to make one before it's ready. There will be a few Apple Newtons along the way that will lead people to say the entire notion was foolhardy. Then someone will make it work. That's when you can expect to use something without expertise. We're not there yet.
Maybe, but it isn't hard to think of developer tools where this is the case. This is the entire history of editor and IDE wars.
Imagine running this same study design with vim. How well would you expect the not-previously-experienced developers to perform in such a study?
If my phone keeps crashing or if the browser is slow or clunky then yes, it’s not on me, it’s the phone, but an LLM is a lot more open ended in what it can do. Unlike the phone example above where I expect it to work from a simple input (turning it on) or action (open browser, punch in a url), what an LLM does is more complex and nuanced.
Even the same prompt from different users might result in different output - so there is more onus on the user to craft the right input.
Perhaps that’s why AI is exempt for now.
Nothing new this time, except for people who have no vision and no ability to work hard not "getting it" because they don't have the cognitive capacity to learn.
The most useful thing of all would have been to have screen recordings of those 16 developers working on their assigned issues, so they could be reviewed for varying approaches to AI-assisted dev, and we could be done with this absurd debate once and for all.
Can someone point me to these 300k/yr jobs?
I feel like a proper study for this would involve following multiple developers over time, tracking how their contribution patterns and social standing change. For example, take three cohorts of relatively new developers: instruct one to go all in on agentic development, let one freely use AI tools, and prohibit one from using AI tools. Then teach these developers open source (like a course based on this book: https://pragprog.com/titles/a-vbopens/forge-your-future-with...) and have them work for a year to become part of a project of their choosing. Then, at the end, track a number of metrics such as leadership position in the community, coding/non-coding contributions, emotional connection to the project, social connections made with the community, knowledge of the code base, etc.
Personally, my prior is that the no-AI group would likely still be ahead overall.
Two things stand out to me: 1. it depends a lot on what kind of task you are having the LLM do, and 2. even if the LLM process takes more time, your cognitive effort was very likely still way less.
For sysadmin kinds of tasks, working with less often accessed systems, LLMs can read --help, man pages, and doc sites for you and give you the working command right there (and then run it, look at the output, and tell you why it failed, or how it worked and what it did). There is absolutely no question that second part is a big deal.
Sticking it onto my large open source project to fix a deep, esoteric issue or write some subtle documentation where it doesn't really "get" what I'm doing? Yeah, it is not as productive in that realm, and you might want to skip it for the thinking part there.
I think everyone is trying to figure out this question of "when and how" for LLMs. I think the sweet spot is tasks involving systems and technologies where you'd otherwise spend a lot of time googling, reading Stack Overflow, and reading man pages to get just the right parameters into commands and so forth. This is cognitive grunt work, and LLMs can do that part very well.
My week of effort with it was not really "coding on my open source project"; two examples were:
1. Running a bunch of Ansible playbooks that I wrote years ago on a new host, where OS upgrades had lots of snags. I worked with Claude to debug all the various error messages and places where the newer OS distribution had different packages, missing packages, etc. It was ENORMOUSLY helpful, since I never look at these playbooks and I don't even remember what I did; Claude can read and interpret them as well as you can.
2. I got a Bugzilla for a Fedora package that I packaged years ago, where they have some change to the directives used in specfiles that everyone has to make. I look at Fedora packaging workflows once every three years. I told Claude to read the BZ and just do it. IT DID IT. I had to get involved running the "mock" suite since it needed sudo, but Claude gave me the commands. Zero googling. Zero even reading the new format of the specfile (the BZ linked to a tool that does the conversion). From bug received to bug closed, I didn't do any typing at all outside of the prompt. Had it done before breakfast, since I didn't even need any glucose for the mental energy expended. This would have been a painful and frustrating mental effort otherwise.
So the studies have to get more nuanced and survey a lot more than 16 devs, I think.
But it has hampered me in that others, uninvited, toss an AI code review tool at some of my open PRs, and that spits out a 2-page document with cute emoji and formatted bullet points going over all aspects of a 30-line PR.
Just adds to the noise, so now I spend time deleting or hiding those comments in PRs, which means I have even _less_ time for actual useful maintenance work. (Not that I have much already.)
I use it like a know-it-all personal assistant that I can ask any question to; even [especially] the embarrassing, "stupid" ones.
> The only stupid question is the one we don't ask.
- On an old art teacher's wall
If you are looking for a 0.1% increase in productivity, then 16 is too small.
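For a rough sense of scale (illustrative numbers, not from the study): the standard two-sample sample-size approximation for detecting a mean difference $\delta$ when the outcome has standard deviation $\sigma$ is

$$ n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2} \quad \text{per group.} $$

At $\alpha = 0.05$ and 80% power ($z$-values of roughly 1.96 and 0.84), an effect of $\delta = 0.1\%$ against even an optimistically small spread of $\sigma = 10\%$ in task times gives $n \approx 2 \times 7.84 \times 0.01 / 0.000001 \approx 157{,}000$ observations per group. A 20% effect with the same spread, by contrast, needs only a handful.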
Soon, once the tools and how people use them improve, AI won't be a hindrance for advanced tasks like this, and soon after, AI will be able to do these PRs on its own. It's inevitable given the rate of improvement even since this study.
Currently AI is like a junior engineer, and if you don't have good experience managing junior engineers, AI isn't going to help you as much.
> This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
I wonder what could explain such a large difference between estimation/experience and reality. Any ideas?
Maybe our brains are measuring mental effort and distorting our experience of time?
The "economic experts" and "ml experts" are in many cases effectively the same group-- companies pushing AI coding tools have a vested interest in people believing they're more useful than they are. Executives take this at face value and broadly promise major wins. Economic experts take this at face value and use this for their forecasts.
This propagates further, and now novices and casual individuals begin to believe in the hype. Eventually, even for experienced engineers, it moves the "baseline" expectation much higher.
Unfortunately this is very difficult to capture empirically.
It was fun to watch, it’s super polished and sci-fi-esque. But after 15 minutes I felt braindead and was bored out of my mind lol
What if agentic coding sessions are triggering a similar dopamine feedback loop as social media apps? Obviously not to the same degree as social media apps, I mean coding for work is still "work"... but there's maybe some similarity in getting iterative solutions from the agent, triggering something in your brain each time, yes?
If that was the case, wouldn't we expect developers to have an overly positive perception of AI because they're literally becoming addicted to it?
https://softwarecrisis.dev/letters/llmentalist/
Plus there's a gambling mechanic: Push the button, sometimes get things for free.
My issue with calling this a 'negative' thing is that I'm not sure it is. It works off the same hunting/foraging instincts that keep us alive. If you feel addiction to something positive, is that bad?
Social media is negative because it addicts you to mostly low-quality filler content, content that doesn't challenge you. You are reading shitposts instead of reading a book or doing something better for you in the long run.
One could argue that's true for AI, but I'm not confident enough to make such a statement.
I wish there was a simple way to measure energy spent instead of time. Maybe nature is just optimizing for something else.
The developers might feel more productive because they're engaging with their code at a higher level of abstraction, even if it takes longer. This would be consistent with why they maintained positive perceptions despite the slowdown.
But taking a broader view, it's possible that these initial speed-ups are negated by the fact that I never really learn Go or Helm charts as deeply now that I use Claude Code. Over time, it's possible that my net productivity is still reduced. Hard to say for sure, especially considering I might not have even attempted these more difficult Go library modifications if I didn't have Claude Code to hold my hand.
Regardless, these tools are out there, increasing in effectiveness and I do feel like I need to jump on the train before it leaves me at the station.
Would be interesting (and in fact necessary to derive conclusions from this study) to see the aggregate number of tasks completed per developer with AI augmentation. That is, if time per task has gone up by 20% but we clear 2x as many tasks, that is a pretty important caveat to the results published here.
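To make that arithmetic concrete (illustrative numbers, not from the study): suppose a task that took 1 hour solo takes 1.2 hours with AI, but the developer keeps two AI-assisted tasks in flight at once. Latency per task is worse, yet throughput is better:

$$ \text{throughput}_{\text{solo}} = \frac{1\ \text{task}}{1\ \text{h}} = 1\ \text{task/h}, \qquad \text{throughput}_{\text{AI}} = \frac{2\ \text{tasks}}{1.2\ \text{h}} \approx 1.67\ \text{tasks/h} $$

Whether developers actually sustain that kind of parallelism is exactly what the aggregate task counts would reveal.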
you're addicted to the FEELING of productivity more than actual productivity. even knowing this, even seeing the data, even acknowledging the complete fuckery of it all, you're still gonna use me. i'm still gonna exist. you're all still gonna pretend this helps because the alternative is admitting you spent billions of dollars on spicy autocomplete.
Scaled up slightly, we use it to build plenty of internal tooling in our video content production pipeline (syncing between encoding tools and a status dashboard for our non-technical content team).
Using it in production code for anything more than boilerplate, well-defined but tedious refactors, or quickly demonstrating how to use an unfamiliar API, without a human taking a full pass at everything afterwards, is something I'm going to be wary of for a long time.
I feel like programming has become increasingly specialized, and even before the AI tool explosion it was way more possible to be ignorant of an enormous amount of "computing" than it used to be. I feel like a lot of "full stack" developers only understand things to the margins of their frameworks; above and below them they barely know how a computer works, what different wire protocols actually are, or what an OS might actually do at a lower level. Let alone the context in which an application sits beyond, let's say, a level above a Kubernetes pod and a kind of trial-and-error approach to poking at some YAML templates.
Do we all need to know about processor architectures and microcode and L2 caches and paging and OS distributions and system software and installers and OpenSSL engines (and how to make sure you have the one that uses native instructions) and TCP packets and Envoy and controllers and Raft systems and topic partitions and cloud IAM and CDN and DNS? Since that's not the case (nearly everyone has vast areas of ignorance yet still does a bunch of stuff), it's harder to sell the idea that whatever skills we lose to AI tools will somehow vaguely matter in the future.
I kind of miss when you had to know a little of everything and it also seemed like "a little bit" was a bigger slice of what there was to know. Now you talk to people who use a different framework in your own language and you feel like you're talking to deep specialists whose concerns you can barely understand the existence of, let alone have an opinion on.
Such as: do you end up spending more time finding and fixing issues? Does AI use reduce institutional knowledge? Will you be more inclined to start projects over from scratch?
* 3 weeks to transition from AI pairing to AI delegation to AI multitasking. So work gains are mostly week 3+. That's 120+ hours in, as someone pretty senior here.
* Speedup is the wrong metric. Think throughput, not latency. Some finite amount of work might take longer, but the volume of work should go up because AI can do more on a task and handle different tasks/projects in parallel.
Both perspectives seem consistent with the paper description...
For example, today I asked Claude to implement per-user rate limiting in my NestJS service, then iterated by asking it to implement specific unit tests and do some refactoring. It one-shot everything. I would say 90% time savings.
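For readers unfamiliar with NestJS, here is a minimal sketch of what a per-user rate-limiting guard can look like (hypothetical names and limits, not the code Claude produced; a production service would more likely use @nestjs/throttler or a Redis-backed store than an in-memory map):

```typescript
import {
  CanActivate,
  ExecutionContext,
  HttpException,
  HttpStatus,
  Injectable,
} from '@nestjs/common';

const WINDOW_MS = 60_000;  // 1-minute window (assumed value)
const MAX_REQUESTS = 100;  // allowed requests per user per window (assumed value)

@Injectable()
export class PerUserRateLimitGuard implements CanActivate {
  // Maps a user id to the timestamps of that user's requests in the current window.
  private readonly hits = new Map<string, number[]>();

  canActivate(context: ExecutionContext): boolean {
    const req = context.switchToHttp().getRequest();
    // Assumes an auth layer has already attached `req.user`; falls back to the IP.
    const userId: string = req.user?.id ?? req.ip;

    const now = Date.now();
    const recent = (this.hits.get(userId) ?? []).filter((t) => now - t < WINDOW_MS);

    if (recent.length >= MAX_REQUESTS) {
      throw new HttpException('Rate limit exceeded', HttpStatus.TOO_MANY_REQUESTS);
    }

    recent.push(now);
    this.hits.set(userId, recent);
    return true;
  }
}
```

Wired up with `@UseGuards(PerUserRateLimitGuard)` on a controller (or globally), this is exactly the kind of well-scoped, pattern-heavy task the comment describes.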
Unskilled people ask them "I have giant problem X, solve it" and end up with slop.
"One-shotting" apps, or even Cursor and so forth, seem like a waste of time. It feels like if you prompt it just right it might help, but then it never really does.
For everything else, I think you're right, and actually the dialog-oriented method is way better. If I learn an approach and apply some general example from ChatGPT, but I do the typing and implementation myself so I need to understand what I'm doing, I'm actually leveling up and I know what I'm finished with. If I weren't "experienced", I'd worry about what it was doing to my critical thinking skills, but I know enough about learning on my own at this point to know I'm doing something.
I'm not interested in vibe coding at all--it seems like a one-way process to automate what was already not the hard part of software engineering; generating tutorial-level initial implementations. Just more scaffolding that eventually needs to be cleared away.
This shows that everyone in the study (economic experts, ML experts, and even the developers themselves, even after getting experience) is a novice if we look at them from the perspective of the Dunning–Kruger effect [1].
[1] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
"The Dunning–Kruger effect is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities."
I wouldn't accept someone's copy and pasted code from another project if it were under an incompatible license, let alone something with unknown origin.
Anyway, AI as the tech currently stands is a new skill to use and takes us humans time to learn, but once we learn it well, it becomes a force multiplier.
I.e., see this: https://claude.ai/public/artifacts/221821f0-0677-409b-8294-3...
You can still do those things.
Cursor’s workflow exposes how differently different people track context. The best ways to work with Cursor may simply not work for some of us.
If Cursor isn’t working for you, I strongly encourage you to try CLI agents like Claude Code.
But isn't the important thing to measure... how long does it take to debug the resulting code at 3AM when you get a PagerDuty alert?
Similarly... how about the quality of this code over time? It's taken a lot of effort to bring some of the code bases I work in into a more portable, less coupled, more concise state through the hard work of
- bringing shared business logic up into shared folders
- working to ensure call chains flow top down towards root then back up through exposed APIs from other modules as opposed to criss-crossing through the directory structure
- working to separate business logic from API logic from display logic
- working to provide encapsulation through the use of wrapper functions creating portability
- using techniques like dependency injection to decouple concepts allowing for easier testing
etc
So, do we end up with better code quality that ends up being more maintainable, extensible, portable, and composable? Or do we just end up with lots of poor quality code that eventually grows to become a tangled mess we spend 50% of our time fighting bugs on?
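As a small, hypothetical TypeScript illustration of the dependency-injection point above: the service depends on a narrow interface rather than a concrete client, so tests can inject a stub and the real implementation stays swappable:

```typescript
// Hypothetical names for illustration only.
interface PaymentGateway {
  charge(customerId: string, cents: number): Promise<void>;
}

class CheckoutService {
  // The dependency is injected, not constructed inside the class.
  constructor(private readonly gateway: PaymentGateway) {}

  async checkout(customerId: string, cents: number): Promise<void> {
    if (cents <= 0) throw new Error('invalid amount');
    await this.gateway.charge(customerId, cents);
  }
}

// In tests, a fake gateway records calls instead of hitting a real API.
const calls: Array<[string, number]> = [];
const fakeGateway: PaymentGateway = {
  async charge(customerId, cents) {
    calls.push([customerId, cents]);
  },
};

const service = new CheckoutService(fakeGateway);
```

The open question in the parent comment is whether AI-generated code tends toward this shape or away from it.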
My personal hypothesis is that seeing the LLM write _so much_ code may create the feeling that the problems it is solving would take longer to solve by yourself.
I already know what I need to write, I just need to get it into the editor. I wouldn’t trade the precision I have with vim macros flying across multiple files for an AI workflow.
I do think AI is a good rubber ducky sometimes tho, but I despise letting it take over editing files.
To be clear it wasn't $75k each.
Paper is here: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf