> It's common for engineers to end up working on projects which they don't have an accurate mental model of. Projects built by people who have long since left the company for pastures new. It's equally common for developers to work in environments where little value is placed on understanding systems, but a lot of value is placed on quickly delivering changes that mostly work. In this context, I think that AI tools have more of an advantage. They can ingest the unfamiliar codebase faster than any human can, and can often generate changes that will essentially work.
Reason: you cannot evaluate the work accurately if you have no mental model. If there's a bug given the system's unwritten assumptions, you may not catch it.
Having said that, it also depends on how important it is to write bug-free code in the given domain, I guess.
I like AI particularly for greenfield stuff and one-off scripts, as it lets you go faster there. Basically you build up the mental model as you're coding with the AI.
Not sure whether this breaks down at a certain codebase size, though.
So
> Reason: you cannot evaluate the work accurately if you have no mental model. If there's a bug given the system's unwritten assumptions, you may not catch it.
This is completely correct. It's a very fair statement. The problem is that a developer coming into a large legacy project is in this spot regardless of the existence of AI.
I've found that asking AI tools to generate a changeset in this case is actually a pretty solid way of starting to learn the mental model.
I want to see where it tries to make changes, what files it wants to touch, what libraries and patterns it uses, etc.
It's a poor man's proxy for having a subject matter expert in the code give you pointers. But it doesn't take anyone else's time, and as long as you're not just trying to dump the output into a PR, it can actually be a pretty good resource.
The key is not letting it dump out a lot of code; instead, use it for directional signaling.
e.g., prompts like "Which files should I edit to implement a feature which does [detailed description of feature]?" or "Where is [specific functionality] implemented in this codebase?" have been real timesavers for me.
The actual code generation has probably been a net time loss.
This. Leveraging the AI to start to develop the mental model is an advantage. But, using the AI is a non-trivial skill set that needs to be learned. Skepticism of what it's saying is important. AI can be really useful just like a 747 can be useful, but you don't want someone picked off the street at random flying it.
Is there any evidence that AI helps you build the mental model of an unfamiliar codebase more quickly?
In my experience trying to use AI for this it often leads me into the weeds
Before beginning the study, the average developer expected about a 20% productivity boost.
After ending the study, the average developer (potentially: you) believed they actually were 20% more productive.
In reality, they were 0% more productive at best, and 40% less productive at worst.
Think about what it would be like to be that developer; off by 60% about your own output.
If you can't even gauge your own output without being 40% off on average, 60% off at worst, be cautious about strong opinions on anything in life. Especially politically.
Edit 1: Also consider, quite terrifyingly, if said developers were in an online group, together, like... here. The one developer who said she thought it made everyone slower (the truth in this particular case), would be unanimously considered an idiot, downvoted to the full -4, even with the benefit of hindsight.
Edit 2: I suppose this goes to show that even on Hacker News, where there are relatively high-IQ and self-aware individuals present... 95% of the crowd can still possibly be wildly delusional. Stick to your gut, regardless of the crowd, and regardless of who is in it.
It's like Tog's study showing that people think the keyboard is faster than the mouse even when they are faster with the mouse, because they are measuring how they feel, not what is actually happening.
This one in particular:
> It takes two seconds to decide upon which special-function key to press.
seems to indicate the study was done on people with no familiarity at all with the software they were testing.
Either way, I don't think there is any evidence out there supporting the claim that either keyboard-only or mouse-only is faster than, or equivalent to, keyboard+mouse for well-known GUIs.
But despite the popularity of some of this (planning poker, particularly; PDCA for process improvements is sadly less popular) as ritual, those elements have become part of a cargo cult where almost no one remembers why we do them.
Yeah, this is me at my job right now. Every time I express even the mildest skepticism about the value of our Cursor subscription, I'm getting follow-up conversations basically telling me to shut up about it
It's been very demoralizing. You're not allowed to question the Emperor's new clothes
Using it beyond that is just more work. First parse the broken response, remove any useless junk, then have it reprocess with an updated query.
It’s a nice tool to have (just as search engines gave us easy access to multiple sources/forums), but its limitations are well known. Trying to use it 100% as intended is a massive waste of time and resources (energy use…)
I just started working on a 3-month old codebase written by someone else, in a framework and architecture I had never used before
Within a couple of hours, with the help of Claude Code, I had already created a really nice system to replicate data from staging to local development. Something I had built before in other projects, and I knew that doing it manually would take me a full day or two, especially without experience in the architecture
That immediately sped up my development even more, as now I had better data to test things locally
Then a couple of hours later, I had already pushed my first PR. All code following the proper coding style and practices of the existing project and the framework. That PR would have taken me at least a couple of days and up to two weeks to fully write out and test manually
So sure, AI won’t speed everyone or everything up. But at least in this one case, it gave me a huge boost
As I keep going, I expect things to slow down a bit, as the complexity of the project grows. However, it’s also given me the chance to get an amazing jumpstart
> It's equally common for developers to work in environments where little value is placed on understanding systems, but a lot of value is placed on quickly delivering changes that mostly work. In this context, I think that AI tools have more of an advantage. They can ingest the unfamiliar codebase faster than any human can, and can often generate changes that will essentially work.
Sadly, clickbait headlines like the OP's, "AI slows down open source developers," spread this misinformation, ensuring that a majority of people will have the same misapprehension.
It took me an embarrassingly long time to realize a simple fact: using AI well is a shallow skill that everyone can learn in days or even hours if they want. And then my small advantage of knowing AI tools will disappear. Since that realization I've always been upvoting articles that claim AI makes you less productive (like the OP).
Using Warp terminal (which used Claude), I was able to get past those barriers and achieve results that weren't happening at all before.
“When open source developers working in codebases that they are deeply familiar with use AI tools to complete a task, they take longer to complete that task”
I have anecdotally found this to be true as well, that an LLM greatly accelerates my ramp up time in a new codebase, but then actually leads me astray once I am familiar with the project.
If you are unfamiliar with the project, how do you determine that it wasn't leading you astray in the first place? Do you ever revisit what you had done with AI previously to make sure that, once you know your way around, it was doing it the right way?
What’s most useful about the LLM in the early stages is not the actual code it writes, but its reasoning that helps me learn about the structure of the project. I don’t take the code blind, I am more interested in the reasoning than the code itself. I have found this to be reliably useful.
Coming from other programming languages, I had a lot of questions that would be tough to nail down in a Google search, or by combing through docs and/or tutorials. In retrospect, it's super fast at finding answers to things that _don't exist_ explicitly, or are implied through the lack of documentation, or exist at the intersection of wildly different resources:
- Can I get compile-time type information of Enum values?
- Can I specialize a generic function/type based on Enum values?
- How can I use macros to reflect on struct fields?
- Can I use an enum without its enclosing namespace, as I can in C++?
- Does Rust have a 'with' clause?
- How do I avoid declaring lifetimes on my types?
- What is an idiomatic way to implement the Strategy pattern?
- What is an idiomatic way to return a closure from a function?
...and so on. This "conversation" happened here and there over the period of two weeks. Not only was ChatGPT up to the task, but it was able to suggest what technologies would get me close to the mark if Rust wasn't built to do what I had in mind. I'm now much more comfortable and competent in the language, but miles ahead of where I would have been without it.
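For example, the kind of answer I remember getting for the last two questions looked roughly like this (a from-memory sketch, not ChatGPT's actual output; the names are mine):
```
// Strategies as plain closures: `impl Fn(...) -> ...` is the idiomatic way to
// return a closure without boxing; `move` captures `factor` by value.
fn scaler(factor: f64) -> impl Fn(f64) -> f64 {
    move |x| x * factor
}

// Strategy pattern via a generic bound (static dispatch); use
// `Box<dyn Fn(f64) -> f64>` instead if strategies must be swapped at runtime.
fn apply_strategy<F: Fn(f64) -> f64>(values: &[f64], strategy: F) -> Vec<f64> {
    values.iter().map(|&v| strategy(v)).collect()
}

fn main() {
    let double = scaler(2.0);
    println!("{:?}", apply_strategy(&[1.0, 2.5, 4.0], double)); // [2.0, 5.0, 8.0]
}
```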
How does using AI impact the amount of time it takes you to become sufficiently familiar with the project to recognize when you are being led astray?
One of the worries I have with the fast ramp-up is that a lot of that ramp-up time isn't just grunt work to be optimized away, it's active learning, and bypassing too much of it can leave you with an incomplete understanding of the problem domain that slows you down perpetually.
Sometimes, there are real efficiencies to be gained; other times those perceived efficiencies are actually incurring heavy technical debt, and I suspect that overuse of AI is usually the latter.
Not always the case, but whenever I read about these strained studies or arguments about how AI is actually making people less productive, I can't help but wonder why nearly every programmer I know, myself included, finds value in these tools. I wonder if the same thing happened with higher-level programming languages where people argued, you may THINK not managing your own memory will lead to more productivity but actually...
Even if we weren't more "productive", millions prefer to use these tools, so it has to count for something. And I don't need a "study" to tell me that
Moreover, it acknowledges that for programmers working in most companies the first case is much more frequent.
Again, overwhelming anecdote and millions of users > "study"
In this case clearly anecdotes are not enough. If that quote from the article is accurate, it shows that you cannot trust the developers' time perception.
I agree, it's only one study and we should not take it as the final answer. It definitely justifies doing a few follow-up evaluations to see if this holds up.
https://mleverything.substack.com/p/garden-of-forking-paths-...
If I run my own tests on my own codebase I will definitely use some objective time measurement method and a subjective one. I really want to know if there is a big difference.
I really wonder if it's just the individual's bias showing. If you are pro-AI you might overestimate the benefit, and if you are against it you might underestimate it.
The scientific method goes right out the window when it comes to true believers. It reminds me of weed-smokers who insist getting high makes them wise or deep-thinkers: it feels that way in the moment, but if you've ever been a sober person caught up in a "deep" discussion with people high on THC, oh boy...
One of the more interesting findings of the study mentioned was that the LLM users, even where use of an LLM had apparently degraded their performance, tended to believe it had enhanced it. Anecdote is a _really_ bad argument against data that shows a _perception_ problem.
> Even if we weren't more "productive", millions prefer to use these tools, so it has to count for something.
I mean, on that basis, so does homeopathy.
Like, it's just one study. It's not the last word. But "my anecdotes disprove it" probably isn't a _terribly_ helpful approach.
Seems to happen every time, doesn't it?
What is your accuracy on software development estimates? I always see these productivity claims matched against “It would’ve taken me” timelines.
But, it’s never examined if we’re good at estimating. I know I am not good at estimates.
It’s also never examined if the quality of the PR is the same as it would’ve been. Are you skipping steps and system understanding which let you go faster, but with a higher % chance of bugs? You can do that without AI and get the same speed up.
I find that when working with an LLM the difference in knowledge is the same as learning a new language. Learning to understand another language is easier than learning to speak another language.
It's like my knowledge of C++. I can read it, and I can make modifications of existing files. But writing something from scratch without a template? That's a lot harder.
* I wasn’t trying to be dismissive of the article or the study, just wanted to present a different context in which AI tools do help a lot
* It’s not just code. It also helps with a lot of tasks. For example, Claude Code figured out how to “manually” connect to the AWS cluster that hosted the source db, tested different commands via docker inside the project containers and overall helped immensely with discovery of the overall structure and infrastructure of the project
* My professional experience as a developer, has been that 80-90% of the time, results trump code quality. That’s just the projects and companies I’ve been personally involved with. Mostly saas products in which business goals are usually considered more important than the specifics of the tech stack used. This doesn’t mean that 80-90% of code is garbage, it just means that most of the time readability, maintainability and shipping are more important than DRY, clever solutions or optimizations
* I don’t know how helpful AI is or could be for things that require super clever algorithms or special data structures, or where code quality is incredibly important
* Having said that, the AI tools I’ve used can write pretty good quality code, as long as they are provided with good examples and references, and the developer is on top of properly managing the context
* Additionally, these tools are improving almost on a weekly or monthly basis. My experience with them has drastically changed even in the last 3 months
At the end of the day, AI is not magic, it’s a tool, and I as the developer, am still accountable for the code and results I’m expected to deliver
16 devs. And they weren't allowed to pick which tasks they used the AI on. Ridiculous. Also using it on "old and >1 million line" codebases and then extrapolating that to software engineering in general.
Writers like this then theorize why AI isn't helpful, then those "theories" get repeated until it feels less like a theory and more like a fact and it all proliferates into an echo chamber of "AI isn't a useful tool." There have been too many anecdotes, and too much of my own personal experience, for me to believe that it isn't useful.
It is a tool and you have to learn it to be successful with it.
They were allowed to pick whether or not to use AI on a subset of tasks. They weren't forced to use AI on tasks that don't make sense for AI
"To directly measure the impact of AI tools on developer productivity, we conduct a randomized controlled trial by having 16 developers complete 246 tasks (2.0 hours on average) on well-known open-source repositories (23,000 stars on average) they regularly contribute to. Each task is randomly assigned to allow or disallow AI usage, and we measure how long it takes developers to complete tasks in each condition."
> If AI is allowed, developers can use any AI tools or models they choose, including no AI tooling if they expect it to not be helpful. If AI is not allowed, no generative AI tooling can be used.
AI is allowed not required
I do believe, however, that it's important to emphasize the fact that they didn't get to choose in general, which I think your wording (even though it is correct) does not make evident.
Can you bring up any specific issues with the METR study? Alternatively, can you cite a journal that critiques it?
They used 16 developers. The confidence intervals are wide and a few atypical issues per dev could swing the headline figure.
Veteran maintainers on projects they know inside-out. This is a bias.
Devs supplied the issue list (then randomized) which still leads to subtle self-selection bias. Maintainers may pick tasks they enjoy or that showcase deep repo knowledge—exactly where AI probably has least marginal value.
Time was not independently logged and was self-reported.
No direct quality metric is possible. Could the AI code be better?
The Hawthorne effect. Knowing they are observed and paid may make devs over-document, over-prompt, or simply take their time.
Many of the devs were new to Cursor
Bias in forecasting.
To the credit of the paper authors, they were very clear that they were not making a claim against software engineering in general. But everyone wants to reinforce their biases, so...
Metr may overall have an ok mission, but their motivation is questionable. They published something like this to get attention. Mission accomplished on that but they had to have known how this would be twisted.
I really like how the author then brought up the point that for most daily work we don't have the theory built, even a small fraction of it, and that this may or may not change the equation.
Just one problem with that...
(The SPACE [1] framework is a pretty good overview of considerations here; I agree with a lot of it, although I'll note that METR [2] has different motivations for studying developer productivity than Microsoft does.)
It can make some things faster and better than a human with a saw, but you have to learn how to use it right (or you will lose some fingers).
I personally find that agentic AI tools make me more ambitious in my projects; I can tackle things I wouldn't have thought about doing before. And I also delegate work that I don't like to them, because they are going to do it better and quicker than me. So my mind is free to think about the real problems, like architecture and the technical debt balance of my code...
The problem is that there's a temptation to let the AI agent do everything and just commit the result without understanding YOUR code (yes, it was generated by an AI, but if you sign the commit, YOU are responsible for that code).
So as with any tool try to take the time to understand how to better use it and see if it works for you.
This is insulting to all pre-2023 open source developers, who produced the entire stack that the "AI" robber barons use in their companies.
It is even more insulting because no actual software of value has been demonstrably produced using "AI".
Claude Code and Amp (equivalent from Sourcegraph) are created by humans using these same tools to add new features and fix bugs.
Having used both tools for some weeks I can tell you that they provide great value to me, enough that I see paying $100 monthly as a bargain relative to that value.
Edit: typo
Consider that these were internal tools that provided value to engineers at Anthropic, OpenAI, Google and others, and are now starting to be adopted by the general public.
Some people are overhyping it and some seem hurt because, I don't know, maybe they define themselves by their ability to write code by hand.
I have no horse in this race and I can only tell you about my experience and I can tell you that the change is coming.
Also if you don't trust a random HN nickname go read about the experiences of people like Armin Ronacher (Flask creator), Steve Yegge or Thomas H. Ptacek.
- https://lucumr.pocoo.org/2025/6/4/changes/
- https://sourcegraph.com/blog/the-brute-squad
- https://fly.io/blog/youre-all-nuts/
Github got massive adoption in a year, probably 100K developers and tens of thousands of projects including big names like Ruby on Rails.
I'm sure if I spent more than 2 minutes on this I'd have even more examples but this one is enough to neuter your claims.
This is a ridiculous comparison because the table saw is a precision tool (compared to manual woodworking) when agentic AI is anything but IMO.
I think this blog post is an interesting take on one specific factor that is likely contributing to slowdown. We discuss this in the paper [2] in the section "Implicit repository context (C.1.5)" -- check it out if you want to see some developer quotes about this factor.
> This is why AI coding tools, as they exist today, will generally slow someone down if they know what they are doing, and are working on a project that they understand.
I made this point in the other thread discussing the study, but in general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the full factors table on page 11).
> If there are no takers then I might try experimenting on myself.
This sounds super cool! I'd be very excited to see how you set this up + how it turns out... please do shoot me an email (in the paper) if you do this!
> AI slows down open source developers. Peter Naur can teach us why
Nit: I appreciate how hard it is to write short titles summarizing the paper (the graph title is the best I was able to do after a lot of trying) -- but I might have written this as "Early-2025 AI slows down experienced open-source developers. Peter Naur can give us more context about one specific factor." It's admittedly less of a catchy title, but I think getting the qualifications right is really important!
Thanks again for the sweet write-up! I'll hang around in the comments today as well.
[1] https://news.ycombinator.com/item?id=44522772
[2] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
Even that's too general, because it'll depend on what the task is. It's not as if open source developers in general never work on tasks where AI could save time.
Once I got the hang of identifying problems, or being more targeted, I was spending less time messing about and got things done quicker.
Or it's comparing how long the dev thought it should take with AI vs how long it actually took, which now includes the dev's guess of how AI impacts their productivity?
When it's hard to estimate how difficult an issue should be to complete, how does the study account for this? What percent speed up or slow down would be noise due to estimates being difficult?
I do appreciate that this stuff is very hard to measure.
Using the magic of statistics, if you have completed enough tickets, we can determine whether the null hypothesis holds (for a given level of statistical certainty), and if it doesn't, how large the difference is (with a margin of error).
That's not to say there couldn't be other causes for the difference (if there is one), but that's how science proceeds, generally.
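For what it's worth, a minimal sketch of the kind of comparison I mean, assuming you log completion time per ticket (the numbers below are made up purely for illustration, and it only computes Welch's t-statistic):
```
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn variance(xs: &[f64]) -> f64 {
    // Sample variance (n - 1 denominator).
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

fn main() {
    // Hypothetical self-tracked completion times, in hours.
    let with_ai = [2.1, 3.4, 1.8, 2.9, 4.0, 2.5];
    let without_ai = [1.9, 2.2, 2.8, 1.7, 3.1, 2.0];

    let (m1, m2) = (mean(&with_ai), mean(&without_ai));
    let (v1, v2) = (variance(&with_ai), variance(&without_ai));
    let (n1, n2) = (with_ai.len() as f64, without_ai.len() as f64);

    // Welch's t-statistic for the difference in means; with enough tickets,
    // |t| above roughly 2 is a common 95% threshold for "there is a difference".
    let t = (m1 - m2) / (v1 / n1 + v2 / n2).sqrt();
    println!("mean diff = {:.2} h, t = {:.2}", m1 - m2, t);
}
```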
That said, I think that what I wrote more or less encompasses three of the factors you call out as being likely to contribute: "High developer familiarity with repositories", "Large and complex repositories", and "Implicit repository context".
I thought more about experimenting on myself, and while I hope to do it - I think it will be very hard to create a controlled environment whilst also responding to the demands the job puts on me. I also don't have the luxury of a list of well-scoped tasks that could feasibly be completed in a few hours.
Ehhhh... not so much. It had serious design flaws in both the protocol and the analysis. This blog post is a fairly approachable explanation of what's wrong with it: https://www.argmin.net/p/are-developers-finally-out-of-a-job
A few notes if it's helpful:
1. This post is primarily worried about ordering considerations -- I think this is a valid concern. We explicitly call this out in the paper [1] as a factor we can't rule out -- see "Bias from issue completion order (C.2.4)". We have no evidence this occurred, but we also don't have evidence it didn't.
2. "I mean, rather than boring us with these robustness checks, METR could just release a CSV with three columns (developer ID, task condition, time)." Seconded :) We're planning on open-sourcing pretty much this data (and some core analysis code) later this week here: https://github.com/METR/Measuring-Early-2025-AI-on-Exp-OSS-D... - star if you want to dig in when it comes out.
3. As I said in my comment on the post, the takeaway at the end of the post is that "What we can glean from this study is that even expert developers aren’t great at predicting how long tasks will take. And despite the new coding tools being incredibly useful, people are certainly far too optimistic about the dramatic gains in productivity they will bring." I think this is a reasonable takeaway from the study overall. As we say in the "We do not provide evidence that:" section of the paper (Page 17), we don't provide evidence across all developers (or even most developers) -- and ofc, this is just a point-in-time measurement that could totally be different by now (from tooling and model improvements in the past month alone).
Thanks again for linking, and to the original author for their detailed review. It's greatly appreciated!
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
I think my bias as someone who spends too much time looking at social science papers is that the protocol allows for spillover effects that, to me, imply that the results must be interpreted much more cautiously than a lot of people are doing. (And then on top of that I'm trying to be hyper-cautious and skeptical when I see a paper whose conclusions align with my biases on this topic.)
Granted, that sort of thing is my complaint about basically every study on developer productivity when using LLMs that I've seen so far. So I appreciate how difficult this is to study in practice.
Everyone else was an absolute Cursor beginner with barely any Cursor experience. I don't find it surprising that using tools they're unfamiliar with slows software engineers down.
I don't think this study can be used to reach any sort of conclusion on use of AI and development speed.
If by "people" you mean "general public", and by "accomplish things" you mean solving some immediate problems, that may or may not involve authoring a script or even a small app - then yes, this is already happening, and is a big reason behind the AI hype as it is.
If by "people" you mean "experienced software engineers", and by "accomplish things" you mean meaningful contributions to a large software product, measured by high internal code and process quality standards, then no - AI tools may not help with that directly, though chances are greater when you have enough experience with those tools to reliably give them right context and steer away from failure modes.
Still, solving one-off problems != incremental improvements to a large system.
My post is a single sentence and I literally wrote "people with no experience"
No need for all the "if by people you mean" rigamarole
You said the second. You responded to the first.
Y = [experts]
Z = [noobs]
{Y, Z} ⊆ [all humans]
None of these statements are controversial. What we have to establish is- Does the experienced AI builder outperform the experienced manual coder?
1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, the only concern most external reviewers had about experience was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (not because AI got better, but just because the without-AI baseline got much worse). In other words, we're sorta between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!
(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)
I've been using Vim/Neovim for over a decade. I'm sure if I wanted to use something like Cursor, it would take me at least a month before I could be productive at even a fraction of my usual level.
Rather than fit two generally disparate things together it’s probably better to just use VSTO and C# (hammer and nails) rather than some unholy combination no one else has tried or suffered through. When it goes wrong there’s more info to get you unstuck.
Why is interacting with the OS’ API in a compiled language the wrong approach in 2025? Why must I use this managed Frankenstein’s monster of dotnet? I didn’t want to ship or expect a whole runtime for what should be a tiny convenience DLL. Insane
* spec out project goals and relevant context in a README and spec out all components; have the AI build out each component and compose them. I understand the high level but don't necessarily know all of the low-level details. This is particularly helpful when I'm not deeply familiar with some of the underlying technologies/libraries.
* having an AI write tests for code that I've verified is working. As we all know, testing is tedious - so of course I want to automate it. And well-written tests (for well-written code) can be pretty easy to review.
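As a rough illustration of the second point (Rust just as an example, with hypothetical names, and far smaller than anything real), the kind of test an AI drafts for already-verified code tends to be table-driven and easy to eyeball:
```
/// A small function I've already verified by hand: clamp a percentage to 0..=100.
fn clamp_percent(value: i64) -> i64 {
    value.max(0).min(100)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn clamps_expected_cases() {
        // Each pair is (input, expected); reviewing is just reading the table.
        let cases = [(-5, 0), (0, 0), (42, 42), (100, 100), (250, 100)];
        for &(input, expected) in &cases {
            assert_eq!(clamp_percent(input), expected);
        }
    }
}
```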
The serpent is devouring its own tail.
It scares me how much code is being produced by people without enough experience to spot issues or people that just gave up caring. We're going to be in for a wild ride when all the exploits start flowing.
...and then it still doesn't actually fix it
I recently had a nice conversation looking for some reading suggestions from an LLM. The first round of suggestions were superb, some of them I'd already read, some were entirely new and turned out great. Maybe a dozen or so great suggestions. Then it was like squeezing blood from a stone but I did get a few more. After that it was like talking to a babbling idiot. Repeating the same suggestions over and over, failing to listen to instructions, and generally just being useless.
LLMs are great on the first pass but the further you get away from that they degrade into uselessness.
Sometimes it works well the first time, and sometimes it spits out a summary where you can see what it is confused about, and you can guide it to create a better summary. Sometimes just having that summary in its context gets it over the hump and you can just say "actually I'm going to continue with you; please reference this summary going forward", and sometimes you actually do have to restart the LLM with the new context. And of course sometimes there's nothing that works at all.
I wrote a TON of LVGL code. The result wasn’t perfect for placement, but when I iterated a couple of times, it fixed almost all of the issues. The result is a little hacked together but a bit better than my typical first pass writing UI code. I think this saved me a factor of 10 in time. Next I am going to see how much of the cleanup and factoring of the pile of code it can do.
Next I had it write a bunch of low level code to init hardware. It saved me a little time compared to reading the reference manual, and was more pleasant, but it wasn’t perfectly correct. If I did not have domain expertise I would not have been able to complete the task with the LLM.
This kind of sums up my experience with LLMs too. They save me a lot of time reading documentation, but I need to review a lot of what they write, or it will just become too brittle and verbose.
From several month of deep work with LLMs I think they are amazing pattern matchers, but not problem solvers. They suggest a solution pattern based on their trained weights. This even could result in real solutions, e.g., when programming Tetris or so, but not when working on somewhat unique problems...
Writing front-end display code and instantiating components to look right is very much playing to the model’s strength, though. A carefully written sentence plus context would become 40 lines of detail-dense but formulaic code.
(I have also had a lot of luck asking it to make a first pass at typesetting things in Tex, too, for similar reasons)
"Find the root cause of this problem and explain it"
"Explain why the previous fix didn't work."
Often, it's best to undo the action and provide more context/tips.
Often, switching to Gemini 2.5 Pro when Claude is stumped helps a lot.
I asked it to remove the comment, which it enthusiastically agreed to, and then... didn't. I couldn't tell if it was the LLM being dense or just a bug in Copilot's implementation.
I admit a tendency to anthropomorphize the LLM and get irritated by this quirk of language, although it's not bad enough to prevent me from leveraging the LLM to its fullest.
The key when acknowledging fault is to show your sincerity through actual effort. For technical problems, that means demonstrating that you have worked to analyze the issue, take corrective action, and verify the solution.
But of course current LLMs are weak at understanding, so they can't pull that off. I wish that the LLM could say, "I don't know", but apparently the current tech can't know that it doesn't know.
And so, as the LLM flails over and over, it shamelessly kisses ass and bullshits you about the work it's doing.
I figure that this quirk of LLMs will be minimized in the near future by tweaking the language to be slightly less obsequious. Improved modeling and acknowledging uncertainty will be a heavier lift.
Granted, the compute required is probably more expensive than github would offer for free, and IDK whether it'd be within budget for many open-source projects.
Also granted, something like this may be useful for human-sourced PRs as well, though perhaps post-submission so that maintainers can see and provide some manual assistance if desired. (And also granted, in some cases maybe maintainers would want to provide manual assistance to AI submissions, but I expect the initial triaging based on whether it's a human or AI would be what makes sense in most cases).
In my rules I tell it that try/catches are completely banned unless I explicitly ask for one (an okay tradeoff, since usually my error boundaries are pretty wide and I know where I want them). I know the context length is getting too long when it starts ignoring that.
It has been for a while, AI just makes SPAM more effective:
If an actual developer wrote this code and submitted it willingly, it would either constitute malice, an attempt to sabotage the codebase or inject a trojan, or stupidity, for failing to understand the purpose of the error message. With an LLM we mostly have stupidity. Flagging it as such reveals the source of the stupidity, as LLMs do not actually understand anything.
I mean, they probably could've articulated it your way, but I think that's basically what they did... they point out the insufficient "fix" later, but the root cause of the "fix" was blind trust in AI output, so that's the part of the story they lead with.
It's clear he just took that feedback and asked the AI to make the change, and it came up with a change that gave them all very long, very unique names that just listed all the unique properties in the test case, to the point that they sort of became noise.
It's clear writing the PR was very fast for that developer, I'm sure they felt they were X times faster than writing it themselves. But this isn't a good outcome for the tool either. And I'm sure if they'd reviewed it to the extent I did, a lot of that gained time would have dissipated.
We asked the person why they made the change, and "silence". They had no reason. It became painfully clear that all they did was copy and paste the method into an LLM and say "add this thing" and it spit out a completely redone method.
So now we had a change that no one in the company actually knew just because the developer took a shortcut. (this change was rejected and reverted).
The scariest thing to me is no one actually knowing what code is running anymore with these models having a tendency to make change for the sake of making change (and likely not actually addressing the root thing but a shortcut like you mentioned)
FWIW, I have seen human developers do this countless times. In fact there are many people in engineering that will argue for these kinds of "fixes" by default. Usually it's in closed-source projects where the shittiness is hidden from the world, but trust me, it's common.
> I suspect their motivation was just to get a commit on their record. This is becoming a troubling trend with AI tools.
There was already a problem (pre-AI) with shitty PRs on GitHub made to try to game a system. Regardless of how they made the change, the underlying problem is a policy one: how to deal with people making shitty changes for ulterior motives. I expect the solution is actually more AI to detect shitty changes from suspicious submitters.
Another solution (that I know nobody's going to go for): stop using GitHub. Back in the "olden times", we just had CVS, mailing lists and patches. You had to perform some effort in order to get to the point of getting the change done and merged, and it was not necessarily obvious afterward that you had contributed. This would probably stop 99% of people who are hoping for a quick change to boost their profile.
When you have an AI that says "here is the race condition and here is the code change to make to fix it", that might be "faster" in the immediate sense, but it means you aren't understanding the program better or making it easier for anyone else to understand. There is also the question of whether this process is sustainable: does an AI-edited program eventually fall so far outside what is "normal" for a program that the AI becomes unable to model correct responses?
It's a myth that you can code a whole day long. I usually do intervals of 1-3 hours for coding, with some breaks in between. Procrastination can even happen on work related things, like reading other project members code/changes for an hour. It has a benefit to some extent, but during this time I don't get my work done.
Agentic AI works the best for me. Small refactoring tasks on a selected code snippet can be helpful, but aren't a huge time saver. The worst are AI code completions (first-version Copilot style); they are much more noise than help.
Like, I think 1h would be stretching it for mature codebases.
Like doom scrolling on social media: Let's see what the fancy new guy got done this week. I need to feel better, I'm just going to look at the commits of the guy in the other team that always breaks production. Let's see how close he got to that recently, ...
Just like we put a (2023) on articles here so they are considered in the right context, so too should this paper be. Blanket "AI tools slow down development" statements with a "look, this rigorous paper says so!" are ignoring a key variable: the rate of effectiveness improvement. If said paper evaluated with the current models, the picture would be different. Also in 3 months' time. AI tools aren't a static thing that either works or doesn't work indefinitely.
The most interesting point from the article wasn't about how well the AIs worked; rather, it was the gap between people's perception and their actual results.
Even within the study, there were some participants who saw mild improvements to productivity, but most had a significant drop in productivity. This thread is now full of people telling their story about huge productivity gains they made with AI, but none of the comments contend with the central insight of this study: that these productivity gains are illusions. AI is a product designed to make you value the product.
In matters of personal value, perception is reality, no question. Anyone relying heavily on AI should really be worried that it is mostly a tool for warping their self-perception, one that creates dependency and a false sense of accomplishment. After all, it speaks a highly optimized stream of tokens at you, and you really have to wonder what the optimization goal was.
Early in the chat it substituted a `-1` for an `i`, and everything that followed was garbage. There were also some errors that I spotted real-time and got it to correct itself.
But yeah, IDK, it presents itself so confidently and "knows" so much and is so easy to use, that it's hard not to try to use as a reference / teacher. But it's also quite dangerous if you're not confirming things; it can send you down incorrect paths and waste a ton of time. I haven't decided whether the cost is worth the benefit or not.
Presumably they'll get better at this over time, so in the long run (probably no more than a year) it'll likely easily exceed the ROI breakeven point, but for now, you do have to remain vigilant.
Or, "slow is smooth, and smooth is fast"
It sounds like a good thing, right? "Wow, mental model. I want that, I want to be good and have big brain", which encourages you to believe the bullshit.
The truth is, this paper is irrelevant and a waste of time. It only serves the purpose of creating discussion around the subject. It's not science, it's a cupholder for marketing.
A mental model of the software is what allows a programmer to intuitively know why the software is behaving a certain way, or what the most optimal design for a feature would be. In the vast majority of cases these intuitions are correct, and other programmers should pay attention to them. This ability is what separates those with a mental model and those without.
On the other hand, LLMs are unable to do this, and are usually not used in ways that help build a mental model. At best, they can summarize the design of a system or answer questions about its behavior, which can be helpful, but a mental model is an abstract model of the software, not a textual summary of its design or behavior. Those neural pathways can only be activated by natural learning and manual programming.
Explanation missing.
> If you've ever programmed, or worked with programmers, that is not an extraordinary claim at all.
One step ahead of you. I already said this is engineered to encourage the belief "I want to be good, big brain, and open source is good, I want to be good big brain".
It's marketing.
> A mental model of the software is what allows a programmer [yadda yadda]
I'm not saying it doesn't exist; I'm saying the paper doesn't provide any relevant information regarding the phenomenon.
> Those neural pathways can only be activated by natural learning and manual programming.
Again, probably true. But the paper doesn't provide any relevant information regarding this phenomenon.
---
Your answer seems to disagree with me, but displays a disjointed understanding of what I'm really addressing.
---
As a lighthearted fun analogy, I present:
https://isotropic.org/papers/chicken.pdf
The paper does not prove the existence of chickens. It says "chicken" a lot, but never addresses the phenomenon of chickens existing.
It's useless from the research perspective. But it is a cup-holder for marketing something.
I already laid this out very clearly in my first comment.
Developers (people?) in general for some reason just simply cannot see time. It’s why so many people don’t believe in estimation.
What I don’t understand is why. Is this like a general human brain limitation (like not being able to visualize four dimensions, or how some folks don’t have an internal monologue)?
Or is this more psychodynamic or emotional?
It’s been super clear and interesting to me how developers I work with want to believe AI (code generation) is saving them time when it’s clearly not.
Is it just the hope that one day it will? Is it fetishization of AI?
Why in an industry that so requires clarity of thinking and expression (computer processors don’t like ambiguity), can we be so bad at talking about, thinking about… time?
Don’t get me started on the static type enthusiasts who think their strong type system (another seeming fetish) is saving them time.
In boating, there's a notion of a "set and drift" which describes how wind and current pushes a boat off course. If a mariner isn't careful, they'll end up far from their destination because of it.
This is because when you're sitting in a boat, your perception of motion is relative and local. You feel the breeze on your face, and you see how the boat cuts through the surrounding water. You interpret that as motion towards your destination, but it can equally consist of wind and current where the medium itself is moving.
I think a similar effect explains all of these. Our perception of "making progress" is mostly a sense of motion and "stuff happening" in our immediate vicinity. It's not based on a perception of the goal getting closer, which is much harder to measure and develop an intuition for.
So people tend to choose strategies that make them feel like they're making progress even if it's not the most effective strategy. I think this is why people often take "shortcuts" when driving that are actually longer. All of the twists and turns keep them busy and make them feel like they're making more progress than zoning out on a boring interstate does.
AI tools make programming feel easier. That they might actually be less productive is interesting, but we humans prefer the easier shortcuts. Our memories of coding with AI tell us that we didn't struggle and therefore we made progress.
And I'm not sure about the other either. In my 20+ year career in aerospace software, the most memorable times were solving interesting problems, not days with no struggle just churning out code.
Generally memorable things are different than unmemorable things. Work is unmemorable. Driving is unmemorable except when something negative happens. Waze tries to give some positive feelings to the driving route. Waze knows that people want positive experiences sometimes more than efficiency.
Being stuck in a traffic jam is more memorable than not being so. Or we remember the negative feeling more than the fact that our drive actually wasn't inefficient.
AI tools make us have a less negative day of work, so we feel like we have no traffic jams. "I got so much done" really means "I didn't get stuck". But it's also removing the positive feelings too!
It's an illusion of progress through our feelings and memories.
Or programming with AI brings different feedback mechanisms and systems and different emotional engagements and different memory behaviours. It's very interesting!
The problem, of course, is that one might thoughtlessly invoke the ai tool when it would be faster to make the one line change directly
Edit: This could make sense with the driving analogy. If the road I was planning to take is closed, GPS will happily tell me to try something else. But if that fails too, it might go back to the original suggestion.
This is why pushing for new code, rewrites, new frameworks is so popular. https://www.joelonsoftware.com/2000/04/06/things-you-should-...
So a ton of AI-generated code is just that: never read. It's generated, tested against test functions - and that's it. I wouldn't be surprised if some of these devs themselves have only marginal ideas of what's in their codebases and why.
All the VCs are gonna lose a ton of money! OpenAI will be NopenAI, relegated to the dustbin of history.
We never asked for this, nobody wants it.
Companies using AI and promoting it in their products will be seen as tacky and cheap. Just like developers and artists that use it.
1. It did not make me faster. I don't know that I expected it to.
2. It's very possible that it made me slower.
3. The quality of my work was better.
Slower and better are related here, because I used these tools more to either check ideas that I had for soundness, or to get some fresh ideas if I didn't have a good one. In many cases the workflow would be: "I don't like that idea, what else do you have for me?"
There were also instances of being led by my tools into a rabbit hole that I eventually just abandoned, so that also contributes to the slowness. This might happen in instances where I'm using "AI" to help cover areas that I'm less of an expert in (and these were great learning experiences). In my areas of expertise, it was much more likely that I would refine my ideas, or the "AI" tool's ideas into something that I was ultimately very pleased with, hence the improved quality.
Now, some people might think that speed is the only metric that matters, and certainly it's harder to quantify quality - but it definitely felt worth it to me.
I will ask the AI for an idea and then start blowing holes in its idea, or will ask it to do the same for my idea.
And I might end up not going with its idea regardless, but it got me thinking about things I wouldn't have thought about.
Effectively it's like chatting to a coworker that has a reasonable idea about the domain and can bounce ideas around.
As with any other tool, AI is slow to adopt but has huge gains later on
I'm spending an inordinate amount of time turning that video into an essay, but I feel like I'm being scooped already, so here's my current draft in case anyone wants to get a sneak preview: https://valdottown--89ed76076a6544019f981f7d4397d736.web.val...
Feedback appreciated :)
Fully autonomous coding tools like v0, a0, or Aider work well as long as the context is small. But once the context grows—usually due to mistakes made in earlier steps—they just can’t keep up. There's no real benefit to the "try again" loop yet.
For now, I think simple VSCode extensions are the most useful. You get focused assistance on small files or snippets you’re working on, and that’s usually all you need.
Ever since my company made switching to Cursor mandatory, I have not been able to hit any kind of flow. I know my own productivity has plummeted and I suspect many others are as well, but no one is saying anything
I have spoken up once or twice and only been smacked down for my troubles, so I am not surprised everyone else is clammed up