Seeking Advice on Improving OCR for Watermarked PDFs in My RAG Pipeline

2•hundredtrillion•1d ago
I’ve been developing a small RAG pipeline and ran into a specific technical issue involving OCR. I’m using PyMuPDF for extraction, and whenever a PDF contains a centered watermark on each page, the OCR becomes noisy—text breaks, artifacts show up, and the output degrades enough that it affects chunking and retrieval accuracy downstream.

The document is otherwise clean, so I’m trying to understand whether this is a known limitation of PyMuPDF or if there are better approaches for handling watermarked PDFs before OCR. I’m working with an RTX 4000 (8GB VRAM), so I’m also trying to stay within reasonable GPU constraints.
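One CPU-only check that may be worth trying before reaching for a heavier OCR model, sketched under assumptions: if the watermark is a real text layer drawn at an angle (not baked into a scanned image), PyMuPDF's `page.get_text("dict")` output reports a writing direction `dir` for every line, so non-horizontal lines can be dropped. The helper name `drop_rotated_lines` and the `tolerance` parameter are hypothetical, and the function works on the plain dict so it can be tried without opening a PDF.

```python
def drop_rotated_lines(page_dict, tolerance=1e-3):
    """Return the text of horizontal lines only.

    page_dict is assumed to have the shape produced by PyMuPDF's
    page.get_text("dict"): blocks -> lines -> spans, where each line
    carries a "dir" unit vector (1, 0) for left-to-right horizontal text.
    """
    kept = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            dx, dy = line.get("dir", (1.0, 0.0))
            if abs(dx - 1.0) > tolerance or abs(dy) > tolerance:
                continue  # rotated line, likely a diagonal watermark
            text = "".join(span["text"] for span in line.get("spans", []))
            if text.strip():
                kept.append(text)
    return "\n".join(kept)


# Hand-built sample mimicking one page with a diagonal watermark.
sample_page = {
    "blocks": [
        {"lines": [{"dir": (1.0, 0.0), "spans": [{"text": "Body text"}]}]},
        {"lines": [{"dir": (0.7071, -0.7071), "spans": [{"text": "DRAFT"}]}]},
    ]
}
cleaned_text = drop_rotated_lines(sample_page)
```

With a real file, the dict would come from something like `fitz.open(path)[0].get_text("dict")`. This does nothing for a horizontal watermark or for one that is part of a page image.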

I’d really appreciate any ideas on:

- more robust OCR libraries or models that handle watermarks well
- preprocessing strategies to suppress watermark text
- better extraction pipelines for RAG use cases
- or any general advice on improving this part of the system
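For the preprocessing item, one possible strategy is a frequency filter: a watermark that appears on every page tends to produce the same extracted line on every page, so lines that recur on most pages can be dropped before chunking. This is a minimal sketch, not the project's method. The function name `strip_repeated_lines` and the `min_fraction` threshold are assumptions, and the input is assumed to be one plain-text string per page (for example from PyMuPDF's `page.get_text()`).

```python
import math
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.8):
    """Remove lines that occur on at least min_fraction of the pages.

    pages: list of str, one extracted text string per page.
    Returns a new list of page strings with the repeated lines removed.
    """
    if not pages:
        return []

    # Count each distinct non-empty line once per page.
    counts = Counter()
    for text in pages:
        counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})

    # Require at least two occurrences so a one-page document is left intact.
    threshold = max(2, math.ceil(min_fraction * len(pages)))
    repeated = {ln for ln, n in counts.items() if n >= threshold}

    cleaned = []
    for text in pages:
        kept = [ln for ln in text.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Intro\nCONFIDENTIAL\nmore",
    "CONFIDENTIAL\nChapter 2",
    "Final\nCONFIDENTIAL",
]
cleaned_pages = strip_repeated_lines(pages)
```

The same filter also removes running headers and footers, which is usually welcome for retrieval but worth knowing about. It will not help when the watermark text is interleaved into body lines by the extractor, since those lines differ from page to page.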

The project is open-source, and if anyone is interested in digging deeper, finding issues, or contributing improvements, here’s the repository:

GitHub: https://github.com/Hundred-Trillion/L88-Full

If you find it useful, starring the repo helps increase visibility so more people with domain expertise might notice it.

Thanks in advance for any insights.

Comments

NoahZuniga•13h ago
I've had a lot of success doing "OCR" with gemini-<n>-pro. It gives incredibly accurate text (most documents of ~20 pages have 0 errors), but no coordinates for the text. I don't need the coordinates, so that's fine by me.