For those who have built something similar recently, what approach did you use to reliably preserve document structure in the normalized markdown (headings, reading order, nested tables, page boundaries), especially when documents are messy or scanned? Did you extract page by page only, or did you use overlapping windows / multi-page context to handle tables and sections that span pages?
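For context, the overlapping-window idea I'm weighing looks roughly like the sketch below; the window size, overlap, and the notion of feeding each window (rather than a single page) to the extraction model are my assumptions, not a known-good recipe.

```python
from typing import Iterator

def page_windows(pages: list[str], size: int = 3, overlap: int = 1) -> Iterator[tuple[int, int, str]]:
    """Yield (first_page, last_page, text) spans that overlap by `overlap` pages,
    so a table split across a page break is seen whole at least once."""
    step = max(size - overlap, 1)
    start = 0
    while start < len(pages):
        chunk = pages[start:start + size]
        yield start + 1, start + len(chunk), "\n\n".join(chunk)
        if start + size >= len(pages):
            break
        start += step

# Each window would then go to the extraction step instead of a single page, e.g.:
#   for first, last, text in page_windows(raw_pages): ...
```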
On the indexing side, do you store only chunks + embeddings, or do you also persist richer metadata per chunk (page ranges, heading hierarchy, has_table/contains_image flags, extraction confidence/quality notes, source pointers)? If so, which fields proved most valuable, and how do they actually help the agent at retrieval time?
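To make that question concrete, this is roughly the per-chunk record I'm imagining; all field names and the example source pointer are mine, not from any particular library or store.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    """Per-chunk metadata I'm tempted to persist alongside the embedding."""
    chunk_id: str
    doc_id: str
    text: str
    page_start: int
    page_end: int
    heading_path: list[str] = field(default_factory=list)  # e.g. ["2 Methods", "2.3 Result tables"]
    has_table: bool = False
    contains_image: bool = False
    extraction_confidence: float | None = None  # heuristic or model-reported, 0..1
    source_pointer: str | None = None           # e.g. "s3://bucket/report.pdf#page=12"
```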
What prompt patterns worked best for layout-heavy pages (multi-column text, complex tables, footnotes, repeated headers/footers), and what failed in practice?
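The kind of per-page instruction I've been experimenting with is below; the wording is entirely mine and not a proven pattern, which is exactly what I'm asking about.

```python
# Hypothetical per-page prompt for layout-heavy pages; phrasing is my own guess.
LAYOUT_PROMPT = """\
Convert this page to GitHub-flavored markdown.
- Follow reading order: finish the left column before starting the right column.
- Reproduce tables as markdown tables; do not summarize or merge cells.
- Collect footnotes at the end of the page under a 'Footnotes' heading.
- Omit repeated running headers/footers and page numbers.
Return only markdown, no commentary.
"""
```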
How did you evaluate extraction quality at scale beyond spot checks (golden sets, automatic heuristics, diffing across runs/models, table-structure metrics)?
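On the automatic-heuristics point, I was picturing cheap per-page checks of the kind sketched below; both functions are made up just to show the granularity I mean, not metrics I've validated.

```python
import re

def numeric_recall(raw_text: str, markdown: str) -> float:
    """Cheap heuristic: what fraction of numbers in the raw text/OCR layer
    survive into the normalized markdown? Low values flag lossy pages."""
    raw_nums = re.findall(r"\d[\d,.]*", raw_text)
    md_nums = set(re.findall(r"\d[\d,.]*", markdown))
    if not raw_nums:
        return 1.0
    return sum(1 for n in raw_nums if n in md_nums) / len(raw_nums)

def table_row_drift(md_a: str, md_b: str) -> int:
    """Diff two runs/models by counting markdown table rows; large drift
    on the same page is worth a manual look."""
    def rows(md: str) -> int:
        return sum(1 for line in md.splitlines() if line.lstrip().startswith("|"))
    return abs(rows(md_a) - rows(md_b))
```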
Any lessons learned, anti-patterns, or “if I did it again” recommendations would be very helpful.