Developer work is unusually public. Beyond git diffs, you can read PR comments and Linear threads, and get a sense of both the complexity of the work and how people collaborate.
I tried a little adversarial experiment (see the sketch after this list):

- Take recent commits and have an LLM infer the "spec" (simulating a Linear ticket, since I didn't build that step)
- Ask Claude Code to implement the same thing
- Use another LLM to compare the two solutions blindly
- If the LLM version is worse than the human version, keep giving it hints until it matches or exceeds the human contribution
- The more elaborate the hints needed, the higher the complexity score
- Evaluating comments is even simpler. I didn't try an adversarial approach there, but there's no reason it wouldn't work.
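Concretely, the loop might look something like this. This is a minimal sketch, not the library's actual code: `ask` and `claude_code_implement` are hypothetical stand-ins for whatever LLM and Claude Code plumbing you use, and hint *count* stands in for hint elaborateness.

```python
def ask(prompt: str) -> str:
    """Stub: send `prompt` to your LLM of choice and return its reply."""
    raise NotImplementedError

def claude_code_implement(spec: str, hints: list[str]) -> str:
    """Stub: have Claude Code implement `spec`, given any hints so far."""
    raise NotImplementedError

def complexity_score(commit_diff: str, max_hints: int = 5) -> int:
    # 1. Infer the "spec" from the human's commit (stand-in for the ticket).
    spec = ask(f"Write the ticket this change likely implements:\n{commit_diff}")

    hints: list[str] = []
    for n_hints in range(max_hints + 1):
        # 2. Ask Claude Code to implement the same spec.
        llm_diff = claude_code_implement(spec, hints)

        # 3. Blind comparison: the judge sees neither author
        #    (in practice you'd also randomize which side is which).
        verdict = ask(
            "You are judging two solutions to the same spec, blind to authorship.\n"
            "Answer 'A' if A is at least as good as B, otherwise 'B'.\n"
            f"Spec:\n{spec}\n\nA:\n{llm_diff}\n\nB:\n{commit_diff}"
        )

        # 4. If the LLM's attempt matches or exceeds the human's, the number
        #    of hints it needed is a crude proxy for complexity.
        if verdict.strip().upper().startswith("A"):
            return n_hints

        # 5. Otherwise, extract another hint from the human solution and retry.
        hints.append(ask(
            "Give one hint (without revealing the full solution) that would "
            f"help improve A toward B.\nA:\n{llm_diff}\n\nB:\n{commit_diff}"
        ))

    return max_hints + 1  # never caught up: maximally complex at this scale
```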
This turned into a small library I hacked together. You can score devs on repos for fun.
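Usage would look roughly like the snippet below. The module and function names are illustrative only, not the library's actual API:

```python
# Hypothetical usage; the real module and function names may differ.
from dev_scorer import score_repo  # illustrative import, not the real name

scores = score_repo("path/to/local/clone")  # {author: complexity score}
for author, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{author}: {score:.2f}")
```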
I wonder if managers use numbers simply because they can't hold all the context of a person's contributions, and so lose out on nuance. What if LLMs could hold all of the context of your work and give a fairer evaluation? Could we move away from PMs deciding the “what” and engineers deciding the “how”, to engineers deciding both?
PRs welcome!