Some patterns that kept repeating:

• PDFs extracting differently after a small template or export-tool change
• headings collapsing or shifting levels
• hidden characters creeping into tokens
• tables losing their structure
• documents updated without being re-ingested
• different converters producing slightly different text layouts
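Of those, the hidden-character case is the easiest to check mechanically. A minimal sketch in Python; scanning for the Unicode "Cf" (format) category is one simple heuristic, not an exhaustive check:

```python
# Minimal sketch: flag invisible code points that survive extraction.
# Unicode category "Cf" (format) covers zero-width spaces/joiners,
# soft hyphens, byte-order marks, and direction marks.
import unicodedata

def find_hidden_chars(text: str) -> list[tuple[int, str]]:
    """Return (offset, character name) for each invisible character."""
    return [
        (i, unicodedata.name(ch, hex(ord(ch))))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

print(find_hidden_chars("token\u200bized"))  # [(5, 'ZERO WIDTH SPACE')]
```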
We only noticed the drift once we started diffing extraction output week-to-week and tracking token count variance. Running two extractors on the same file also revealed inconsistencies that weren’t obvious from looking at the text.
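Roughly what that two-extractor check looks like, as a sketch; pypdf and PyMuPDF are stand-ins for whichever pair you run, and "tokens" here is just whitespace splits:

```python
# Rough sketch: run two extractors over the same PDF and surface drift.
# pypdf and PyMuPDF (the `fitz` module) are stand-ins for your actual pair.
import difflib

import fitz  # PyMuPDF
from pypdf import PdfReader

def extract_pypdf(path: str) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

def extract_pymupdf(path: str) -> str:
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def compare(path: str) -> None:
    a, b = extract_pypdf(path), extract_pymupdf(path)
    # Whitespace token counts: crude, but variance here is the first signal.
    print(f"pypdf={len(a.split())} tokens, pymupdf={len(b.split())} tokens")
    # Zero-context unified diff keeps the output focused on the deltas.
    for line in difflib.unified_diff(
        a.splitlines(), b.splitlines(), "pypdf", "pymupdf", lineterm="", n=0
    ):
        print(line)
```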
Even with pinned extractor versions, mixed-format sources (Google Docs, Word, Confluence exports, scanned PDFs) still drifted subtly over time. The retriever was doing exactly what it was told; the input data just wasn't consistent anymore.
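One way to make that drift visible per document is to fingerprint the extracted text together with the extractor version, so any silent change upstream forces a re-ingest. A minimal sketch; the function names and the `store` mapping are hypothetical stand-ins for your own pipeline:

```python
# Minimal sketch: hash extracted text plus extractor name/version so any
# silent change (source doc, converter, library bump) shows up as a new
# fingerprint. `store` stands in for wherever fingerprints are persisted.
import hashlib
from importlib.metadata import version

def fingerprint(extracted_text: str, extractor: str = "pypdf") -> str:
    payload = f"{extractor}=={version(extractor)}\n{extracted_text}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def needs_reingest(doc_id: str, text: str, store: dict[str, str]) -> bool:
    fp = fingerprint(text)
    changed = store.get(doc_id) != fp
    store[doc_id] = fp
    return changed
```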
Curious if others have seen this. How do you keep ingestion stable in production RAG/Agentic AI systems?