The document is otherwise clean, so I’m trying to understand whether this is a known limitation of PyMuPDF or if there are better approaches for handling watermarked PDFs before OCR. I’m working with an RTX 4000 (8GB VRAM), so I’m also trying to stay within reasonable GPU constraints.
I’d really appreciate any ideas on:
- more robust OCR libraries or models that handle watermarks well
- preprocessing strategies to suppress watermark text
- better extraction pipelines for RAG use cases
- any general advice on improving this part of the system
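To make the preprocessing question concrete, here is a rough sketch of the kind of filter I have in mind, not something that is in the repo. It assumes the watermark lives in the PDF's text layer (it would not help if the watermark is baked into a scanned image), and that it is either rotated or repeated on most pages. The function names `horizontal_lines` and `drop_repeated_lines` and the thresholds are made up for illustration; the input is the dictionary PyMuPDF returns from `page.get_text("dict")`, where each line carries a `dir` (cosine, sine) writing direction.

```python
from collections import Counter


def horizontal_lines(page_dict, tolerance=0.01):
    """Return the text of lines written left to right, skipping rotated ones.

    `page_dict` is expected to look like the output of PyMuPDF's
    page.get_text("dict"). Diagonal watermarks have a non-zero sine
    component in their line direction, so they are filtered out here.
    """
    kept = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped naturally.
        for line in block.get("lines", []):
            cos, sin = line.get("dir", (1.0, 0.0))
            if abs(sin) > tolerance or cos < 0:
                continue  # rotated or upside-down text: likely a watermark
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                kept.append(text)
    return kept


def drop_repeated_lines(pages, max_fraction=0.6):
    """Drop lines that appear on more than `max_fraction` of the pages.

    `pages` is a list of per-page line lists. The 0.6 threshold is an
    arbitrary assumption, and this also removes running headers/footers.
    """
    if len(pages) < 3:
        return pages  # too few pages to tell what is "repeated"
    counts = Counter()
    for lines in pages:
        counts.update(set(lines))  # count each line once per page
    limit = max_fraction * len(pages)
    return [[ln for ln in lines if counts[ln] <= limit] for lines in pages]


# Hypothetical usage with PyMuPDF (imported as `fitz` or `pymupdf`):
#   doc = fitz.open("input.pdf")
#   pages = [horizontal_lines(p.get_text("dict")) for p in doc]
#   pages = drop_repeated_lines(pages)
```

The obvious weakness is that a horizontal watermark that appears on only some pages gets past both filters, which is part of why I'm asking whether there is a more robust approach.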
The project is open-source, and if anyone is interested in digging deeper, finding issues, or contributing improvements, here’s the repository:
GitHub: https://github.com/Hundred-Trillion/L88-Full
If you find it useful, starring the repo helps increase visibility so more people with domain expertise might notice it.
Thanks in advance for any insights.