We built a benchmark to evaluate how well document parsers work on a dataset of 2,000 manually annotated PDFs, evaluating across multiple dimensions: charts, tables, text styling, text correctness, and attribution.
The benchmark evaluates performance on full pages (not selected parts of pages), and covers different OSS, frontier-model, and commercial approaches.
For transparency it is available as an HF leaderboard.
Yes, it evaluates parsing with frontier models from all 3 major providers (Google, Anthropic, and OpenAI). It is also easy to extend to evaluate new models (code/dataset is available); a rough sketch of what that could look like is below.
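A minimal, hypothetical sketch of plugging a new parser into a page-level evaluation loop, assuming the dataset is on the HF Hub; the dataset id, column names, and metric here are placeholders, not the benchmark's actual code:

```python
# Hypothetical sketch only: dataset id, column names, and the metric are
# placeholder assumptions, not the benchmark's real API.
from difflib import SequenceMatcher
from datasets import load_dataset  # pip install datasets

def my_parser(page_image) -> str:
    """Swap in the parser/model you want to benchmark; return parsed text."""
    raise NotImplementedError

def text_similarity(pred: str, ref: str) -> float:
    """Stand-in text-correctness score; the benchmark defines its own per-dimension metrics."""
    return SequenceMatcher(None, pred, ref).ratio()

ds = load_dataset("your-org/pdf-parsing-benchmark", split="test")  # placeholder id
scores = []
for example in ds:
    pred = my_parser(example["page_image"])  # parse the full page, not crops
    scores.append(text_similarity(pred, example["ground_truth"]))

print("mean text-correctness score:", sum(scores) / len(scores))
```

In the real benchmark, each annotated dimension (charts, tables, text styling, text correctness, attribution) would get its own metric rather than the single similarity score used above.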
pierre•1h ago
Paper: https://arxiv.org/abs/2604.08538