Over the past 2-3 months I’ve run this linking process on ~40k papers, and compiled a complementary dataset of ~400k structured review comments from the paper discussions.
This blog post has a few preliminary pieces of analysis, including the “Best Rejected Papers” from some recent ML conferences. These include RoBERTa (42k citations) and Improved Denoising Diffusion Probabilistic Models, a very influential paper in diffusion modeling.
Any feedback on the dataset, or suggestions for interesting further analysis, is welcome!