Seeking Advice on Improving OCR for Watermarked PDFs in My RAG Pipeline

1•hundredtrillion•2h ago
I’ve been developing a small RAG pipeline and ran into a specific technical issue involving OCR. I’m using PyMuPDF for extraction, and whenever a PDF contains a centered watermark on each page, the OCR output becomes noisy: text breaks, artifacts show up, and the quality degrades enough to hurt chunking and retrieval accuracy downstream.

The document is otherwise clean, so I’m trying to understand whether this is a known limitation of PyMuPDF or if there are better approaches for handling watermarked PDFs before OCR. I’m working with an RTX 4000 (8GB VRAM), so I’m also trying to stay within reasonable GPU constraints.

I’d really appreciate any ideas on:

more robust OCR libraries or models that handle watermarks well

preprocessing strategies to suppress watermark text

better extraction pipelines for RAG use cases

or any general advice on improving this part of the system
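
As a rough illustration of the preprocessing item above, here is a minimal sketch of one possible text-level filter. It assumes the watermark survives extraction as the same standalone line on most pages, which may not hold for this document. The function name `drop_repeated_lines` and the `min_page_fraction` parameter are hypothetical, and the input is assumed to be a list of per-page strings (for example, what PyMuPDF's `page.get_text()` returns for each page).

```python
from collections import Counter


def drop_repeated_lines(pages, min_page_fraction=0.8):
    """Remove lines that recur on at least `min_page_fraction` of pages.

    `pages` is a list of strings, one per page of already-extracted text.
    Assumption: the watermark appears as an identical standalone line
    on most pages. Hypothetical helper, not part of any library.
    """
    # With fewer than two pages there is nothing to compare against.
    if len(pages) < 2:
        return list(pages)

    # Count each distinct non-empty line at most once per page.
    page_counts = Counter()
    for text in pages:
        unique_lines = {line.strip() for line in text.splitlines() if line.strip()}
        page_counts.update(unique_lines)

    threshold = min_page_fraction * len(pages)
    repeated = {line for line, n in page_counts.items() if n >= threshold}

    # Rebuild each page without the repeated lines.
    cleaned = []
    for text in pages:
        kept = [line for line in text.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


if __name__ == "__main__":
    sample_pages = [
        "Intro to the method\nCONFIDENTIAL\nWe begin with setup.",
        "Results\nCONFIDENTIAL\nAccuracy improved.",
        "Conclusion\nCONFIDENTIAL\nFuture work remains.",
    ]
    for page in drop_repeated_lines(sample_pages):
        print(page)
        print("---")
```

This only helps when the watermark comes out as its own line. If the watermark characters are interleaved with body text, which the broken text described above may indicate, a filter like this will miss them and an image-level or layout-aware approach would be needed instead.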

The project is open-source, and if anyone is interested in digging deeper, finding issues, or contributing improvements, here’s the repository:

GitHub: https://github.com/Hundred-Trillion/L88-Full

If you find it useful, starring the repo helps increase visibility so more people with domain expertise might notice it.

Thanks in advance for any insights.
