Absolutely not, for the reasons in the article.
Some vision LLMs can accept PDF inputs directly too, but you need to check that they're going to convert the PDF to images and process those, rather than attempting (and failing) to extract the text some other way. I think OpenAI, Anthropic and Gemini all do the image version of this now, thankfully.
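If you want to be sure the model sees images rather than a text-extraction attempt, one option is to rasterize the pages yourself before upload. A minimal sketch using PyMuPDF (the file name and zoom factor are placeholders):

```python
import fitz  # PyMuPDF

def pdf_to_page_images(path, zoom=2.0):
    """Render each page to PNG bytes at zoom x the default 72 dpi."""
    images = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            images.append(pix.tobytes("png"))
    return images

# Send these to whichever vision model you use, instead of the raw PDF.
pages = pdf_to_page_images("input.pdf")
```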
I expect current LLMs significantly improve on the previous ways of doing this (e.g. Tesseract) when given an image input? Is there any benchmark you're aware of that tests model capabilities when it comes to ingesting PDFs?
What prompted this post was trying to rewrite the initial parse logic for my project PdfPig[0]. I had originally ported the Java PDFBox code, but felt like it should be 'simple' to rewrite it more performantly. The new logic falls back to a brute-force scan of the entire file if a single xref table or stream is missed, and relies on the offsets found there in the recovery path.
However, it is considerably slower than the code it replaced, and it's hard to have confidence in the changes. I'm currently running through a 10,000-file test set trying to identify edge cases.
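For anyone curious what that recovery path looks like conceptually, here is a simplified sketch of the brute-force idea (illustrative Python, not PdfPig's actual C# implementation): scan the raw bytes for object headers and rebuild the object-number-to-offset map from scratch, which is also why it costs so much more than trusting the xref.

```python
import re

# "N G obj" headers mark the start of each indirect object.
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

def rebuild_offsets(data: bytes) -> dict[int, int]:
    offsets = {}
    for match in OBJ_HEADER.finditer(data):
        # Later definitions win, mirroring incremental-update semantics.
        offsets[int(match.group(1))] = match.start()
    return offsets

with open("broken.pdf", "rb") as f:
    offsets = rebuild_offsets(f.read())
```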
The 10k-file test set sounds great for confidence-building. Are the failures clustering around certain producer apps like Word, InDesign, scanners, etc.? Or is it just long-tail randomness?
Reading the PR, I like the recovery-first mindset. If the common real-world case is that offsets lie, treating salvage as the default is arguably the most spec-conformant thing you can do. Slow-and-correct beats fast-and-brittle for PDFs any day.
Also, god bless the open source developers. Without them it would have been impossible to do this in a timely fashion. pymupdf is incredible.
https://www.linkedin.com/posts/sergiotapia_completed-a-reall...
This is exactly why Computer Vision approaches to parsing PDFs work so well in the real world. Relying on metadata in the files just doesn't scale across different sources of PDFs.
We convert PDFs to images, run a layout-understanding model on them first, then apply specialized models (text recognition, table recognition) to the detected regions, and stitch the results back together. That gets acceptable results in domains where accuracy is table stakes.
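Roughly, the per-page flow looks like the sketch below. The rasterization and OCR calls are real (PyMuPDF, pytesseract); detect_layout and recognize_table are hypothetical stand-ins for whatever layout and table models you plug in:

```python
import io
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def detect_layout(image):
    """Stand-in for a layout-understanding model; should yield
    (kind, bbox) regions in reading order. Hypothetical."""
    raise NotImplementedError

def parse_page(page):
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # rasterize the page
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    results = []
    for kind, bbox in detect_layout(image):
        crop = image.crop(bbox)  # bbox = (left, top, right, bottom)
        if kind == "table":
            results.append(recognize_table(crop))  # hypothetical table model
        else:
            results.append(pytesseract.image_to_string(crop))
    return "\n".join(results)  # stitch regions back together in reading order
```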
Do you know you could just use the parsing engine that renders the PDF to get the output? I mean, why rasterize it, OCR it, and then use AI? It sounds like creating a problem just to use AI to solve it.
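For comparison, direct extraction through a parsing engine is a couple of lines, assuming the file actually carries a usable text layer (file name is a placeholder):

```python
import fitz  # PyMuPDF

with fitz.open("input.pdf") as doc:
    text = "\n".join(page.get_text() for page in doc)
```

The catch, and the parent's point, is that this only works when the text layer exists and isn't lying; scanned or badly produced files give you nothing useful back.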
One of the biggest benefits of PDFs, though, is that they can contain invisible data. E.g. the spec allows me to embed, inside my resume, cryptographic proof that I've worked at the companies I claim to have worked at. But a vision-based approach obviously isn't going to capture that.
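Embedded file attachments are a concrete example of a payload a rasterize-and-OCR pipeline silently drops. A sketch using PyMuPDF's embedded-file API (the file name is a placeholder):

```python
import fitz  # PyMuPDF

with fitz.open("resume.pdf") as doc:
    for name in doc.embfile_names():
        payload = doc.embfile_get(name)  # raw bytes, e.g. a signed attestation
        print(name, len(payload), "bytes")
```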
UglyToad is a good name for someone who likes pain. ;-)
What the article doesn't mention is that a lot of newer PDFs (1.5+) don't even have a regular textual xref table: the xref table is itself inside an "xref stream", and the same revision also added the option of putting objects inside "object streams".
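A quick way to check which flavor a given file uses is to follow the startxref pointer and look at what it lands on: a classic table begins with the keyword xref, while a 1.5+ cross-reference stream begins with an ordinary object header. A rough sketch (it only inspects the last trailer and ignores incremental updates):

```python
import re

def xref_flavor(path):
    data = open(path, "rb").read()
    match = re.search(rb"startxref\s+(\d+)", data[-2048:])
    if not match:
        return "no startxref found"
    offset = int(match.group(1))
    return "classic table" if data[offset:offset + 4] == b"xref" else "xref stream"
```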
Having said that, I believe there are "streamable" (linearized) PDFs where there is enough info up front to render the first page (but only the first page).
(But I have been out of the PDF loop for over a decade now so keep that in mind.)
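Those are linearized ("fast web view") PDFs, and they still exist. The linearization parameter dictionary has to sit within the first 1024 bytes of the file, so a cheap (if crude) check is a byte scan:

```python
def is_linearized(path):
    with open(path, "rb") as f:
        # The /Linearized parameter dictionary must appear near the file start.
        return b"/Linearized" in f.read(1024)
```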