This makes me metaphorically stabby.
What a lot of us must be wondering though is:
- how maintainable is the code being outputted
- how much is this newfound productivity saving (costing) on compute, given that we are definitely seeing more code
- how many livesite/security incidents will be caused by AI generated code that hasn't been reviewed properly
So no, I don't think persistence-through-time is a good metric. Probably better to look at cyclomatic complexity, and maybe for a given code path or module or class hierarchy, how many calls it makes within itself vs to things outside the hierarchy - some measure of how many files you need to jump between to understand it
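A very rough sketch of that second metric, in Python (the module path is hypothetical, and "internal" here just means a call to a name defined in the same file):

    import ast

    def locality(path):
        # Count calls that resolve to names defined in this file
        # vs. calls to anything imported or otherwise external.
        tree = ast.parse(open(path).read())
        defined = {n.name for n in ast.walk(tree)
                   if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}
        internal = external = 0
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                f = node.func
                name = f.id if isinstance(f, ast.Name) else getattr(f, "attr", None)
                if name in defined:
                    internal += 1
                elif name is not None:
                    external += 1
        return internal, external

    print(locality("some_module.py"))  # hypothetical file

The higher the share of calls that resolve outside the file (or hierarchy), the more jumping around a reader has to do.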
Is that a per-year number?
If a year has 200 working days that's still only about 40 lines of code a day.
When I'm in full-blown work mode with a decent coding agent (usually Claude Code) I'm genuinely producing 1,000+ lines of (good, tested, reviewed) code a day.
Maybe there is something to those absurd 10x multiplier claims after all!
(I still think there's plenty of work done by software engineers that isn't crunching out code, much of which isn't accelerated by AI assistance nearly as much. 40 lines of code per day felt about right for me a few years ago.)
An example from earlier today: https://github.com/simonw/llm-gemini/commit/fa6d147f5cff9ea9...
That commit added 33 lines and removed 13 - so I'm already at a 20-lines-a-day level just from that one commit (and I shipped a few more plus a release of llm-gemini: https://github.com/simonw/llm-gemini/commits/a2bdec13e03ca8a...)
It took about 3.5 minutes. I started from this issue someone had filed against my repo:
Then I opened Claude Code and said:
Run this command: uv run llm -m gemma-3-27b-it hi
That ran the command and returned the error message. I then said: Yes, fix that - the gemma models do not support media resolution
Which was enough for it to figure out the fix and run the tests to confirm it hadn't broken anything. I ran "git diff", thought about the change it had made for a moment, then committed and pushed it.
Here's the full Claude Code transcript: https://gistpreview.github.io/?62d090551ff26676dfbe54d8eebbc...
I verified the fix myself by running:
uv run llm -m gemma-3-27b-it hi
I pasted the result into an issue comment to prove to myself (and anyone else who cares) that I had manually verified the fix: https://github.com/simonw/llm-gemini/issues/116#issuecomment...
Here's a more detailed version of the transcript including timestamps, showing my first prompt at 10:01:13am and the final response at 10:04:55am: https://tools.simonwillison.net/claude-code-timeline?url=htt...
I built that claude-code-timeline application this morning too, and that thing is 2284 lines of code: https://github.com/simonw/tools/commits/main/claude-code-tim... - but that was much more of a vibe-coded thing, I hardly reviewed the code that was written at all and shipped it as soon as it appeared to work correctly. Since it's a standalone HTML file there's not too much that can go wrong if it has bugs in it.
I don't know if code quality really matters to most people or to the bottom line, but a good software engineer writes better code than Claude. It is a testament to library maintainers that Claude is able to code at all, in my opinion. One reason is that Claude uses APIs in wacky ways. For instance, by reading the SDL2 documentation I was able to find many places where Claude writes SDL2 code using archaic patterns held over from the old SDL days.
I think there are a lot of hidden ways AI booster types benefit from basic software engineering practices, even as they actively promote ideas that damage those same practices. Maybe it will only be 10 years from now that we learn that having good engineers is actually important.
Same here. So I tell it what improvements I want to make and watch it make them.
I've gained enough experience at prompting it that it genuinely is faster for me to tell it the change I want to make than it is for me to make that change myself, 90% of the time.
If I was hacking on the Linux kernel I would be delighted with myself for producing 40 lines of landed code in a single day.
A lot of people are oblivious to Zipf distributions in effort and output, and if you ever catch on to it as a productive person, it really reframes ideas about fairness and policy and good or bad management.
It also means that you can recognize a good team, and when a bunch of high performers are pushing and supporting each other and being held to account out in the open, amazing things happen that just make other workplaces look ridiculous.
My hope for AI is that instead of 20% of the humans doing 80% of the work, you end up with force multipliers and a ramping up, so that more workplaces look like high-functioning teams, making everything more fair and engaging and productive. But I suspect that once people get better with AI, at least up to the point of AGI, we're going to see the same distribution, just at 10x or 50x the productivity.
How maintainable is this code output? I saw an SPA HTML file produced by a model that looked almost like assembly code. If the code can only be maintained by the model, then an appropriate metric should be based on the long-term maintainability achieved, not on the instant generation of code.
I feel like we humans try to separate things and keep them short. We do this not because we think it's pretty; we do it so our human brains can still reason about a big system. As a result LOC is a bad measure, since being concise then hurts your productivity?
As a dev I very much subscribe to this line of thought, but I also have to admit most of the business-side people would disagree.
Unfortunately I’m not sure there are good metrics.
Also, my anecdotal experience is that LLM code is flat wrong sometimes. Like a significant percentage. I can't quote a number really, because I rarely do the same thing/similar thing twice. But it's a double digit percentage.
I would expect that code which continually changes, deprecating old features and creating new ones, is still looking for a good problem-domain fit.
I guess you can already derive this value if you sum the total lines changed by all PRs and divide it by (SLOC end - SLOC start). Ideally it should be a value only slightly greater than 1.
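A toy version of that calculation (all numbers made up):

    # Lines touched across all PRs, divided by net SLOC growth over the same period.
    prs = [
        {"added": 120, "removed": 30},
        {"added": 45, "removed": 40},
        {"added": 300, "removed": 10},
    ]
    sloc_start, sloc_end = 10_000, 10_385

    total_changed = sum(p["added"] + p["removed"] for p in prs)
    ratio = total_changed / (sloc_end - sloc_start)
    print(ratio)  # 545 / 385 ~= 1.42; closer to 1 means less churn

The further the ratio drifts above 1, the more of the written code is being rewritten or thrown away rather than kept.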
fyi: You headline with "cross-industry", lead with fancy engineering productivity graphics, then caption it with small print saying it's from your internal team data. Unless I'm completely missing something, it comes off as a little misleading and disingenuous. Maybe intro with what your company does and your data collection approach.
Also, I notice it when the LLMs are offline. It feels a bit like when the internet connection fails. You remember the old days of lower productivity.
Of course, there is a lot of junk/silly ways to approach these tools but all tools are just a lever, and need judgement/skill to use them well.
dakshgupta•5h ago
About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.
We decided to compile some of the most interesting findings into a report. This is the first time we've done this, so any feedback would be great, especially around what analytics we should include next time.
wrs•1h ago
So, do you have any quality metrics to go with these?
ChrisbyMe•1h ago
Would be interested in seeing the breakdown between uplift vs company size.
e.g. I work in a FAANG and have seen an uptick in the number of lines on PRs, partially due to AI coding tools and partially due to incentives for performance reviews.
dakshgupta•47m ago
An interesting subtrend is that Devin and other full async agents write the highest proportion of code at the largest companies. Ticket-to-PR hasn't worked nearly as well for startups as it has for the F500.
jacekm•45m ago
Which stats in the report come from such analysis? I see that most metrics are based on either data from your internal teams or publicly available stats from npm and PyPi.
Regardless of the source, it's still an interesting report, thank you for this!
chis•42m ago
Super interesting report though.