So you want to parse a PDF?

https://eliot-jones.com/2025/8/pdf-parsing-xref
44•UglyToad•2h ago

Comments

JKCalhoun•1h ago
Yeah, PDF didn't anticipate streaming. That pesky trailer dictionary at the end means you have to wait for the file to fully load to parse it.

Having said that, I believe there are "streamable" PDFs where there is enough info up front to render the first page (but only the first page).

(But I have been out of the PDF loop for over a decade now so keep that in mind.)
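To illustrate why the end of the file matters: a reader normally looks for the last `startxref` keyword near the end of the file and reads the byte offset that follows it. A minimal Python sketch, assuming the whole file is already in memory; `find_startxref` and `tail_size` are hypothetical names, not taken from the article or any library:

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset written after the last 'startxref' keyword.

    Only the final `tail_size` bytes are searched (an assumed window size),
    because the keyword is expected just before the closing %%EOF marker.
    """
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near end of file")
    # With incremental saves there can be several; the last one wins.
    return int(matches[-1])
```

The returned offset is only where the cross-reference data is supposed to start; real files may get it wrong.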

UglyToad•1h ago
Yes, you're right there are Linearized PDFs which are organized to enable parsing and display of the first page(s) without having to download the full file. I skipped those from the summary for now because they have a whole chunk of an appendix to themselves.
wackget•1h ago
> So you want to parse a PDF?

Absolutely not. For the reasons in the article.

yoyohello13•1h ago
One of the very first programming projects I tried, after learning Python, was a PDF parser to try to automate grabbing maps for one of my DnD campaigns. It did not go well lol.
simonw•1h ago
I convert the PDF into an image per page, then dump those images into either an OCR program (if the PDF is a single column) or a vision-LLM (for double columns or more complex layouts).

Some vision LLMs can accept PDF inputs directly too, but you need to check that they're going to convert to images and process those rather than attempting and failing to extract the text some other way. I think OpenAI, Anthropic and Gemini all do the images-version of this now, thankfully.

trebligdivad•55m ago
Sadly this makes some sense; pdf represents characters in the text as offsets into its fonts, and often the fonts are incomplete fonts; so an 'A' in the pdf is often not good old ASCII 65. In theory there are two optional systems that should tell you it's an 'A' - except when they don't; so the only way to know is to use the font to draw it.
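As a rough illustration of one such optional mechanism, a font's ToUnicode CMap can map character codes back to Unicode. The Python sketch below assumes single-byte character codes and handles only `beginbfchar` sections; `parse_bfchar` and `decode_codes` are hypothetical helper names, and real CMaps also use `beginbfrange` and multi-byte codes:

```python
import re

def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Collect <code> <unicode> pairs from beginbfchar...endbfchar sections."""
    mapping: dict[int, str] = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values are UTF-16BE encoded hex strings.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_codes(codes: bytes, mapping: dict[int, str]) -> str:
    """Map each single-byte code to text; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)
```

When the map is missing or incomplete, codes fall through to the replacement character, which is the case where rendering the glyph is the only option left.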
UglyToad•39m ago
If you don't have a known set of PDF producers this is really the only way to safely consume PDF content. Type 3 fonts alone make pulling text content out unreliable or impossible, before even getting to PDFs containing images of scans.

I expect the current LLMs significantly improve upon the previous ways of doing this, e.g. Tesseract, when given an image input? Is there any test you're aware of for model capabilities when it comes to ingesting PDFs?

simonw•16m ago
I've been trying it informally and noting that it's getting really good now - Claude 4 and Gemini 2.5 seem to do a perfect job now, though I'm still paranoid that some rogue instruction in the scanned text (accidental or deliberate) might result in an inaccurate result.
throwaway840932•56m ago
As a matter of urgency PDF needs to go the way of Flash, same goes for TTF. Those that know, know why.
internetter•38m ago
I think a PDF 2.0 would just be an extension of a single file HTML page with a fixed viewport
farkin88•55m ago
Great rundown. One thing you didn't mention that I thought was interesting to note is incremental-save chains: the first startxref offset is fine, but the /Prev links that Acrobat appends on successive edits may point a few bytes short of the next xref. Most viewers (PDF.js, MuPDF, even Adobe Reader in "repair" mode) fall back to a brute-force scan for obj tokens and reconstruct a fresh table so they work fine while a spec-accurate parser explodes. Building a similar salvage path is pretty much necessary if you want to work with real-world documents that have been edited multiple times by different applications.
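A minimal Python sketch of that kind of salvage path, under simplifying assumptions: it scans the raw bytes for `N G obj` headers and records where each one starts. `rebuild_xref` is a hypothetical name, not how PDF.js, MuPDF or PdfPig implement it, and this version can be fooled by matching bytes inside stream data and cannot see objects stored in compressed object streams:

```python
import re

# "<object number> <generation> obj", not preceded by another digit.
OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")

def rebuild_xref(data: bytes) -> dict[tuple[int, int], int]:
    """Map (object number, generation) to the byte offset of its header."""
    table: dict[tuple[int, int], int] = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # Later definitions overwrite earlier ones, mirroring how an
        # incremental save supersedes an older copy of the same object.
        table[key] = match.start()
    return table
```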
UglyToad•45m ago
You're right, this was a fairly common failure state seen in the sample set. The previous reference or one in the reference chain would point to an offset of 0 or outside the bounds of the file, or just be plain wrong.

What prompted this post was trying to rewrite the initial parse logic for my project PdfPig[0]. I had originally ported the Java PDFBox code but felt like it should be 'simple' to rewrite more performantly. The new logic falls back to a brute-force scan of the entire file if a single xref table or stream is missed and just relies on those offsets in the recovery path.

However it is considerably slower than the code before it and it's hard to have confidence in the changes. I'm currently running through a 10,000 file test-set trying to identify edge-cases.

[0]: https://github.com/UglyToad/PdfPig/pull/1102

farkin88•15m ago
That robustness-vs-throughput trade-off is such a staple of PDF parsing. My guess is that the new path is slower because the recovery scan now always walks the whole byte range and has to inflate any object streams it meets before it can trust the offsets even when the first startxref would have been fine.

The 10k-file test set sounds great for confidence-building. Are the failures clustering around certain producer apps like Word, InDesign, scanners, etc.? Or is it just long-tail randomness?

Reading the PR, I like the recovery-first mindset. If the common real-world case is that offsets lie, treating salvage as the default is arguably the most spec-conformant thing you can do. Slow-and-correct beats fast-and-brittle for PDFs any day.

coldcode•45m ago
I parsed the original Illustrator format in 1988 or 1989, which is a precursor to PDF. It was simpler than today's PDF, but of course I had zero documentation to guide me. I was mostly interested in writing Illustrator files, not importing them, so it was easier than this.
sergiotapia•35m ago
I did some exploration using LLMs to parse, understand, then fill in PDFs. It was brutal but doable. I don't think I could build a "generalized" solution like this without LLMs. The internals are spaghetti!

Also, god bless the open source developers. Without them it would also be impossible to do this in a timely fashion. pymupdf is incredible.

https://www.linkedin.com/posts/sergiotapia_completed-a-reall...

diptanu•34m ago
Disclaimer - Founder of Tensorlake, we built a Document Parsing API for developers.

This is exactly the reason why Computer Vision approaches for parsing PDFs work so well in the real world. Relying on metadata in files just doesn't scale across different sources of PDFs.

We convert PDFs to images, run a layout understanding model on them first, then apply specialized models like text recognition and table recognition to them, and stitch the results back together to get acceptable results for domains where accuracy is table stakes.

rkagerer•28m ago
So you've outsourced the parsing to whatever software you're using to render the PDF as an image.
bee_rider•25m ago
Seems like a fairly reasonable decision given all the high quality implementations out there.
throwaway4496•23m ago
How is it reasonable to render the PDF, rasterize it, OCR it, use AI, instead of just using the "quality implementation" to actually get structured data out? Sounds like "I don't know programming, so I will just use AI".
throwaway4496•24m ago
So you parse PDFs, but also OCR images, to somehow get better results?

Do you know you could just use the parsing engine that renders the PDF to get the output? I mean, why raster it, OCR it, and then use AI? Sounds like creating a problem to use AI to solve it.

Alex3917•23m ago
> This is exactly the reason why Computer Vision approaches for parsing PDFs work so well in the real world.

One of the biggest benefits of PDFs though is that they can contain invisible data. E.g. the spec allows me to embed cryptographic proof that I've worked at the companies I claim to have worked at within my resume. But a vision-based approach obviously isn't going to be able to capture that.

throwaway4496•18m ago
Cryptographic proof of job experience? Please explain more. Sounds interesting.
throwaway4496•22m ago
This is the parallel of some of the dotcom peak absurdities. We are in the AI peak now.
HocusLocus•22m ago
Thanks kindly for this well done and brave introduction. There are few people these days who'd even recognize the bare ASCII 'Postscript' form of a PDF at first sight. First step is to unroll into ASCII of course and remove the first wrapper of Flate/ZIP, LZW, RLE. I recently teased Gemini for accepting .PDF and not .EPUB (html inna zip basically, with almost-guaranteed paragraph streams of UTF-8) and it lamented apologetically that its pdf support was opaque and library oriented. That was very human of it. Aside from a quick recap of the most likely LZW wrapper format, a deep dive into Linearization and reordering the objects by 'first use on page X' and writing them out again would be a good pain project.

UglyToad is a good name for someone who likes pain. ;-)

userbinator•12m ago
As someone who has written a PDF parser - it's definitely one of the weirdest formats I've seen, and IMHO much of it is caused by attempting to be a mix of both binary and text; and I suspect at least some of these weird cases of bad "incorrect but close" xref offsets may be caused by buggy code that's dealing with LF/CR conversions.

What the article doesn't mention is a lot of newer PDFs (v1.5+) don't even have a regular textual xref table, but the xref table is itself inside an "xref stream", and I believe v1.6+ can have the option of putting objects inside "object streams" too.
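For a sense of what an "xref stream" holds, here is a minimal Python sketch that splits an already-decompressed stream body into entries using the three field widths from the stream's /W array. `decode_xref_stream` is a hypothetical name, and the sketch assumes any Flate or predictor filtering has already been undone and ignores the /Index array:

```python
def decode_xref_stream(data: bytes, widths: tuple[int, int, int]) -> list[tuple[str, int, int]]:
    """Split decoded xref stream bytes into (kind, field2, field3) entries."""
    w1, w2, w3 = widths
    row = w1 + w2 + w3
    if row == 0:
        raise ValueError("field widths must not all be zero")
    kinds = {0: "free", 1: "in_use", 2: "compressed"}
    entries = []
    for start in range(0, len(data) - row + 1, row):
        # A zero-width type field means the entry type defaults to 1.
        kind = int.from_bytes(data[start:start + w1], "big") if w1 else 1
        field2 = int.from_bytes(data[start + w1:start + w1 + w2], "big")
        field3 = int.from_bytes(data[start + w1 + w2:start + row], "big")
        # in_use: (byte offset, generation)
        # compressed: (object stream number, index inside that stream)
        # free: (next free object number, generation)
        entries.append((kinds.get(kind, "unknown"), field2, field3))
    return entries
```

Entries of the "compressed" kind are the ones that point into object streams, so a plain byte-offset lookup is not enough for them.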
