PageIndex OCR is a long-context OCR approach that preserves a document's global structure. It detects the true heading hierarchy and the semantic relationships that span pages, which traditional OCR and PDF-to-Markdown pipelines often lose.
In internal tests, it consistently produced more accurate structures than other approaches we tried.
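To make "global structure" concrete, here is a minimal sketch of the kind of cross-page hierarchy described above. It is only an illustration and not PageIndex OCR's implementation or API: the `Node` class, the `build_tree` and `outline` helpers, and the sample headings are all hypothetical, and the sketch assumes the heading levels have already been recovered correctly, which is the hard part in practice.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One heading in the document tree (hypothetical structure)."""
    title: str
    level: int
    page: int
    children: List["Node"] = field(default_factory=list)


def build_tree(headings: List[Tuple[int, str, int]]) -> Node:
    """Nest a flat, reading-order list of (level, title, page) headings."""
    root = Node(title="(document)", level=0, page=0)
    stack = [root]
    for level, title, page in headings:
        # Pop until the top of the stack is a shallower heading: the parent.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(title, level, page)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, one per heading."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + f"{child.title} (p. {child.page})")
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up headings: sections continue across page boundaries.
headings = [
    (1, "1 Introduction", 1),
    (2, "1.1 Background", 1),
    (2, "1.2 Scope", 2),
    (1, "2 Methods", 3),
    (2, "2.1 Data", 3),
    (3, "2.1.1 Sampling", 4),
]
tree = build_tree(headings)
print("\n".join(outline(tree)))
```

In this sketch, "1.2 Scope" stays nested under "1 Introduction" even though it starts on a different page. A pipeline that handles each page in isolation has no parent heading in view at that point, so it can easily flatten or mislevel such sections.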
Feedback and ideas for improving multi-page document structure extraction are welcome.