A Java library for extracting tables from Text-Based PDFs and scanned PDFs

https://github.com/ExtractPDF4J/ExtractPDF4J

1•mehulimukherjee•2h ago

Comments

mehulimukherjee•2h ago

Hi HN,

Over the past year I’ve been working on ExtractPDF4J, an open-source Java library for extracting tables from real-world PDFs.

Many document processing pipelines rely on PDFs like bank statements, financial reports, or invoices. In practice these files are inconsistent: some are text-based, others are scanned images, and many contain irregular layouts or multi-page tables.

Many popular tools in this space are used from Python (like Camelot or tabula-py). In JVM-heavy environments this often means running a separate Python service or building a hybrid stack.

ExtractPDF4J was designed to solve this problem directly in Java.

Key ideas behind the project:

• Hybrid parsing strategies (stream + lattice detection)
• OCR fallback for scanned documents
• CLI and service modules for production workflows
• Maven Central distribution for easy integration
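To make the hybrid idea concrete, here is a minimal sketch of a "try stream, then lattice" fallback chain. It is not ExtractPDF4J's API: the class, the method names, and the toy text-based page model (whitespace gaps for stream, `|` characters standing in for ruling lines) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from ExtractPDF4J.
class FallbackExtractionSketch {

    // Toy "stream" pass: columns are separated by runs of two or more spaces.
    static List<List<String>> streamPass(String page) {
        return splitRows(page, "\\s{2,}");
    }

    // Toy "lattice" pass: ruling lines are modeled as '|' characters.
    static List<List<String>> latticePass(String page) {
        return splitRows(page, "\\s*\\|\\s*");
    }

    // Splits each line on the delimiter and keeps lines with at least two cells.
    static List<List<String>> splitRows(String page, String delimiterRegex) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : page.split("\n")) {
            String[] cells = line.trim().split(delimiterRegex);
            if (cells.length >= 2) {
                rows.add(Arrays.asList(cells));
            }
        }
        return rows;
    }

    // Tries each strategy in insertion order and names the first one that finds rows.
    static String firstMatchingStrategy(String page) {
        Map<String, Function<String, List<List<String>>>> strategies = new LinkedHashMap<>();
        strategies.put("stream", FallbackExtractionSketch::streamPass);
        strategies.put("lattice", FallbackExtractionSketch::latticePass);
        for (Map.Entry<String, Function<String, List<List<String>>>> entry : strategies.entrySet()) {
            if (!entry.getValue().apply(page).isEmpty()) {
                return entry.getKey();
            }
        }
        // A real pipeline could hand the page to an OCR pass at this point.
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(firstMatchingStrategy("Date  Amount\n2024-01-01  10.00")); // prints stream
        System.out.println(firstMatchingStrategy("Date|Amount\n2024-01-01|10.00"));   // prints lattice
    }
}
```

Keeping the strategies in an ordered map means the cheapest pass runs first and a more expensive one (such as OCR) only runs when the earlier passes find nothing.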

The latest release also introduced a BOM module to simplify dependency management and a full documentation site.

Project: https://github.com/ExtractPDF4J/ExtractPDF4J

Docs: https://extractpdf4j.github.io/ExtractPDF4J/

I’d really appreciate feedback from people who have dealt with messy PDF extraction problems. Suggestions and contributions are welcome. If you find the project useful, starring the repo helps it reach more of the Java community. Thank you!
