Seeking Advice on Improving OCR for Watermarked PDFs in My RAG Pipeline

I’ve been developing a small RAG pipeline and ran into a specific issue with OCR. I’m using PyMuPDF for extraction, and whenever a PDF contains a centered watermark on each page, the OCR output becomes noisy: text breaks, artifacts appear, and the degradation is bad enough to hurt chunking and retrieval accuracy downstream.

The document is otherwise clean, so I’m trying to understand whether this is a known limitation of PyMuPDF or if there are better approaches for handling watermarked PDFs before OCR. I’m working with an RTX 4000 (8GB VRAM), so I’m also trying to stay within reasonable GPU constraints.
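To make "handling the watermark before OCR" concrete, here is a minimal sketch of one idea, not a tested fix. It assumes the watermark is noticeably lighter than the body text, so that pushing every light pixel to white removes it while leaving dark glyphs alone. The function name `suppress_light_pixels` and the threshold of 170 are hypothetical choices for illustration; on a real page the grayscale values would come from a rendered page image (for example via PyMuPDF's `page.get_pixmap()`), which is not shown here.

```python
def suppress_light_pixels(gray, threshold=170):
    """Return a copy of a grayscale image with light pixels forced to white.

    `gray` is a list of rows, each a list of ints in 0..255
    (0 = black, 255 = white). Any pixel at or above `threshold`
    is treated as watermark or paper and set to 255.
    The threshold value is an assumption and needs tuning per document.
    """
    return [[255 if px >= threshold else px for px in row] for row in gray]


# Tiny synthetic "page": 0 = black body text, 200 = light gray watermark,
# 255 = blank paper.
page = [
    [255, 0, 200, 255],
    [200, 0, 0, 200],
]
cleaned = suppress_light_pixels(page)
print(cleaned)  # [[255, 0, 255, 255], [255, 0, 0, 255]]
```

This only helps when the watermark and the text differ clearly in brightness; a dark or colored watermark that overlaps glyphs would need a different approach.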

I’d really appreciate any ideas on:

more robust OCR libraries or models that handle watermarks well

preprocessing strategies to suppress watermark text

better extraction pipelines for RAG use cases

or any general advice on improving this part of the system
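As an illustration of the "suppress watermark text" item, here is a rough sketch that works on the extracted text instead of the image. It assumes the watermark comes out of extraction as its own line and repeats on most pages, so any line seen on at least a given fraction of pages is dropped. The name `drop_repeated_lines` and the `min_fraction=0.8` default are hypothetical, and this would not help when watermark characters are interleaved with body text.

```python
from collections import Counter


def drop_repeated_lines(pages, min_fraction=0.8):
    """Remove lines that recur on at least `min_fraction` of the pages.

    `pages` is a list of strings, one per page of extracted text.
    Lines are compared after stripping surrounding whitespace.
    """
    page_lines = [[ln.strip() for ln in text.splitlines()] for text in pages]

    # Count each distinct non-empty line once per page.
    counts = Counter()
    for lines in page_lines:
        counts.update({ln for ln in lines if ln})

    cutoff = min_fraction * len(pages)
    repeated = {ln for ln, n in counts.items() if n >= cutoff}

    return ["\n".join(ln for ln in lines if ln not in repeated)
            for lines in page_lines]


pages = [
    "Intro text\nCONFIDENTIAL\nMore intro",
    "Chapter two\nCONFIDENTIAL",
    "CONFIDENTIAL\nClosing notes",
]
filtered = drop_repeated_lines(pages)
print(filtered)  # ['Intro text\nMore intro', 'Chapter two', 'Closing notes']
```

Legitimate running headers and footers would also be removed by this filter, which may or may not be desirable before chunking.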

The project is open-source, and if anyone is interested in digging deeper, finding issues, or contributing improvements, here’s the repository:

GitHub: https://github.com/Hundred-Trillion/L88-Full

If you find it useful, starring the repo helps increase visibility so more people with domain expertise might notice it.

Thanks in advance for any insights.
