I'm Nilesh. My brother Abhishek and I built ProdE. Carnegie Mellon and IIT Delhi.
We benchmarked four AI code documentation tools: ProdE, DeepWiki, Claude Code, and Google Code Wiki. ProdE scored highest on usefulness for coding agents. 15% ahead of DeepWiki, 38% ahead of Google, 40% ahead of Claude Code.
I know this might feel like self-praise, but we couldn't find an existing benchmark to use, so we created one ourselves and open-sourced it.
The biggest gap is coverage. Coding agents can only answer questions about parts of the codebase that are documented. If your docs cover routing but skip middleware, every middleware question becomes a hallucination. ProdE documents 114-140 files per project. Claude Code covers 13-17. So agents using Claude Code's docs are blind to roughly 90% of the codebase.
Zero hallucinations across all 9 evaluations. Every file path, function reference, and claim we checked pointed to real code. So it's not just that we cover more; what we cover is also accurate.
DeepWiki did really well on visuals -- 5x more diagrams per project than us, the best visual docs by far. Claude Code had the strongest writing quality of the four.
Honestly, if I saw this post I'd also assume the vendor rigged it. So here's everything we did to make it not that. Claude Opus judged all four tools using a published rubric. Claude Code's output was renamed to doc_x/ so the judge couldn't tell it was Claude Code. ProdE launched after Claude's training cutoff, so the judge had no prior knowledge of our tool. We don't use Claude anywhere in our pipeline. We ran 9 evaluation passes across 3 open-source repos (FastAPI, Pydantic, Mermaid), all pinned to exact commits to account for non-deterministic outputs.
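For anyone skimming the repo, the blinding step is conceptually simple. Here's a minimal sketch (hypothetical names, not our actual harness code) of how tool outputs can be anonymized before they reach the judge, so scores are only de-blinded afterward:

```python
import random

def blind_outputs(tool_dirs):
    # Assign each tool's docs directory an anonymous label like
    # doc_a, doc_b, ... in random order, so the judge prompt
    # never sees a tool name.
    labels = [f"doc_{chr(ord('a') + i)}" for i in range(len(tool_dirs))]
    random.shuffle(labels)
    return dict(zip(labels, tool_dirs))

# Example: the mapping is kept aside and used only to
# de-blind the scores after judging is done.
mapping = blind_outputs(
    ["prode_docs", "deepwiki_docs", "claude_docs", "google_docs"]
)
```

The actual pipeline in the repo does more (pinned commits, repeated passes), but the core idea is just this rename-and-shuffle before judging.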
We scored usefulness for coding agents and readability for humans as separate things, because what makes docs good for agents is different from what makes them good for humans. Agents need lots of references to specific files and functions. Humans need clear writing and good diagrams. The tool with the best writing scored lowest on usefulness for agents. Of course, usefulness for humans is better judged by humans.
Blog (full analysis): https://prode.ai/blogs/we-benchmarked-ai-code-documentation-...

Repo (everything, run it yourself): https://github.com/abhishek-curiousboxai/code-documentation-...

You can fork it and re-run. Everything is MIT licensed.