TLDR: I made a Claude Code plugin to measure coding productivity.
Using it, we measured a productivity increase of over 70% in iubenda's dev team.
Rationale and details follow.
For the past year I've been pretty obsessed with using AI to improve productivity, and I've been running initiatives to increase AI adoption within iubenda and team.blue, particularly amongst developers.
The challenge was to measure the results, but I saw problems with the most commonly used methods:
- % of developers using Claude Code: pretty moot; it just tells you who is using the tool. Fine for an initial rollout, but it doesn't really give you a sense of what the productivity gain is. It's the kind of "tick the box" approach that leaves many companies with very superficial AI adoption
- Number of MRs / PRs: not the worst metric, but very unreliable, since teams and developers differ in contribution style (few large changes vs. many small ones). More or fewer MRs / PRs doesn't necessarily mean a more or less productive team
- Story points: not all teams use them, and story-point scoring is a qualitative, subjective process. It also requires tracking story points across MRs / PRs / commits, which is complex: very few teams have a truly deterministic link between their git repo and their task management tool, so gaps in data coverage make this method unreliable even for teams that do use story points
- Lines of code changed: I really like the objectivity of this metric. If a team's code verbosity and its mix of change types (tests, translations, dependency updates, comments, refactors, genuinely new code) stay constant, it's not bad at all. In our tests, though, it still showed huge variability: large refactors or wide but low-value changes skewed the numbers completely
Several weeks into the rabbit hole, I landed on using lines of code changed, BUT scoring them with Haiku. In essence, the plugin will:
- Download all diffs from all repos you select, across all branches, and deduplicate them to avoid double-counting merge commits
- Score each file diff with Haiku, assigning it a weight: e.g. zero for some file changes, low for a translation change, low or zero for a library update, high for a genuine code change or refactor, etc. (this weight can also act as a code verbosity index)
- Calculate a sort of "weighted lines of code" metric that you can plot over time to measure productivity improvements
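The three steps above can be sketched roughly like this. This is illustrative Python, not the plugin's actual code: the commit structure and the weight values are assumptions, and in the real pipeline the per-file weights come from Haiku rather than being hard-coded. Deduplication fingerprints the diff body while ignoring hunk offsets, roughly what `git patch-id` does.

```python
# Illustrative sketch of the pipeline: dedupe diffs, then sum weighted LoC.
# Assumption: each commit is a dict with a raw `diff` string, a week label,
# and per-file `lines_changed` plus a model-assigned `weight` in [0, 1].
import hashlib
from collections import defaultdict

def dedupe_diffs(commits):
    """Drop commits whose diff content was already seen (e.g. via a merge).
    Fingerprints the diff body ignoring index/hunk-offset lines, similar
    in spirit to `git patch-id --stable`."""
    seen, unique = set(), []
    for c in commits:
        body = "\n".join(
            line for line in c["diff"].splitlines()
            if not line.startswith(("index ", "@@"))
        )
        fp = hashlib.sha256(body.encode()).hexdigest()
        if fp not in seen:
            seen.add(fp)
            unique.append(c)
    return unique

def weighted_loc(commits):
    """Sum lines_changed * weight per week: the plottable metric."""
    totals = defaultdict(float)
    for c in dedupe_diffs(commits):
        for f in c["files"]:
            totals[c["week"]] += f["lines_changed"] * f["weight"]
    return dict(totals)

commits = [
    {"week": "2025-W01", "diff": "@@ -1 +1 @@\n-a\n+b",
     "files": [{"lines_changed": 120, "weight": 0.9},    # genuine code change
               {"lines_changed": 400, "weight": 0.05}]}, # translation file
    {"week": "2025-W01", "diff": "@@ -9 +9 @@\n-a\n+b",  # same change via a merge
     "files": [{"lines_changed": 120, "weight": 0.9},
               {"lines_changed": 400, "weight": 0.05}]},
]
print(weighted_loc(commits))  # the duplicate diff is counted only once
```

Note how the second commit, identical except for hunk offsets, is dropped, so a change that reaches the default branch through several merges contributes once.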
Scoring is very cheap, at around $7 per thousand commits.
The plugin also has a number of other features, such as report generation, developer anonymization via local hashing, and the option to use BigQuery to share the database across a team.
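The anonymization step could look something like the following. This is a minimal sketch under the assumption that author emails are the identity being hashed; the salt value and the 12-character truncation are illustrative choices, not the plugin's actual scheme. Only someone holding the local salt can re-link a hash to an email by recomputing it, which is what keeps the shared BigQuery data pseudonymous.

```python
# Sketch of local developer anonymization via salted hashing (assumed scheme).
import hashlib

SALT = b"keep-this-local-and-secret"  # hypothetical value; stays on the local machine

def anonymize(author_email: str) -> str:
    """Stable pseudonymous ID: the same email always maps to the same hash,
    but the mapping can't be reversed without the salt."""
    digest = hashlib.sha256(SALT + author_email.lower().encode()).hexdigest()
    return digest[:12]  # short, stable pseudonym for reports and dashboards
```

Lower-casing before hashing makes `Dev@example.com` and `dev@example.com` collapse to one developer, which matters because git author strings are often inconsistent across machines.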
I'm publishing it so you can grill me on the methodology, cross-check it, find bugs, you name it. All contributions are welcome.
apothegm•22m ago
Really? We’re back to using LoC as a metric? Have we learned absolutely nothing in the past 50 years?
Oh, never mind, we already know the answer to that…
Facens•1h ago