In a real extraction benchmark:

- Human baseline: physician-curated workbook (MD+PhD profile), ~2-3 days
- Pubmed-max: end-to-end run in ~3m30s
Measured deltas:

- Core non-empty clinical endpoint values: 186 -> 198 (+6.5%)
- Unique trials across task sheets: 23 -> 30 (+30.4%)
- Evidence rows: 94 -> 112 (+19.1%)
- Incremental dynamic evidence quality: 11/11 A-level
- Evidence mapping for non-empty core cells: 100%
- Unresolved extraction conflicts admitted to main tables: 0
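For reference, the percentage deltas above are plain relative increases over the human-baseline counts. A minimal sketch that reproduces them (the metric labels are just descriptions from the list, not identifiers from the repo):

```python
# Reproduce the relative deltas reported above.
# Each pair is (human baseline count, Pubmed-max count).
metrics = {
    "core non-empty clinical endpoint values": (186, 198),
    "unique trials across task sheets": (23, 30),
    "evidence rows": (94, 112),
}

for name, (before, after) in metrics.items():
    delta_pct = (after - before) / before * 100  # relative increase over baseline
    print(f"{name}: {before} -> {after} ({delta_pct:+.1f}%)")
```

Running this prints +6.5%, +30.4%, and +19.1%, matching the figures above.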
This is not a retrieval-only benchmark; it is a content-extraction benchmark with traceability constraints: every non-empty core cell must map to evidence, and unresolved extraction conflicts are kept out of the main tables. The repo and the full benchmark report are linked in the README.
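To make those two constraints concrete, here is a hypothetical validation pass; the `CoreCell` structure and its field names (`value`, `evidence_ids`, `conflict_resolved`) are illustrative assumptions, not the repo's actual schema:

```python
# Hypothetical sketch of the two traceability gates described above:
# (1) every non-empty core cell maps to at least one evidence row, and
# (2) no cell with an unresolved extraction conflict enters the main tables.
from dataclasses import dataclass, field

@dataclass
class CoreCell:
    value: str | None                   # extracted endpoint value, None if empty
    evidence_ids: list[str] = field(default_factory=list)  # linked evidence rows
    conflict_resolved: bool = True      # False while sources still disagree

def admit_to_main_table(cells: list[CoreCell]) -> list[CoreCell]:
    """Return only the cells that satisfy both traceability constraints."""
    admitted = []
    for cell in cells:
        if cell.value is None:
            continue  # empty cells carry no claim, nothing to trace
        if not cell.evidence_ids:
            continue  # non-empty cell without an evidence mapping: reject
        if not cell.conflict_resolved:
            continue  # unresolved conflict: keep out of the main tables
        admitted.append(cell)
    return admitted
```

Under this reading, "evidence mapping for non-empty core cells: 100%" and "unresolved extraction conflicts admitted to main tables: 0" are simply the pass rates of these two gates over the final output.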