Absolutely not, for the reasons in the article.
Some vision LLMs can accept PDF inputs directly too, but you need to check that they're going to convert the PDF to images and process those, rather than attempting (and failing) to extract the text some other way. I think OpenAI, Anthropic and Gemini all do the image-based version of this now, thankfully.
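If you'd rather control that step yourself instead of trusting the provider's PDF handling, rasterizing up front is only a few lines. A minimal sketch, assuming PyMuPDF (pip install pymupdf); the paths and zoom factor are arbitrary:

    import fitz  # PyMuPDF

    def pdf_to_page_images(pdf_path: str, out_dir: str = ".") -> list:
        """Render every page to a PNG so the model sees pixels, not extracted text."""
        paths = []
        with fitz.open(pdf_path) as doc:
            for i, page in enumerate(doc):
                # 2x zoom (~144 DPI) is usually plenty for vision models / OCR.
                pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
                path = f"{out_dir}/page_{i:04d}.png"
                pix.save(path)
                paths.append(path)
        return paths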
I'd expect current LLMs to improve significantly on the previous ways of doing this, e.g. Tesseract, when given an image input? Is there any benchmark you're aware of for model capabilities when it comes to ingesting PDFs?
Also, absolutely not to your "single-file HTML" theory: it would still allow JavaScript and random image formats (via data: URIs); conversely, I don't _think_ you can embed fonts in a single-file HTML document (e.g. not via the same data: URI trick); and to the best of my knowledge there's no cryptographic signing for HTML at all.
It would also suffer from the linearization problem mentioned elsewhere, in that one could not display the document while it was still streaming in (browsers work around this by janking items around the page as the various .css and .js files resolve and parse).
I'd offer Open XPS as an alternative, even given its Empire of Evil origins, because I'll take XML over a pseudo-text, pseudo-binary file format all day every day: https://en.wikipedia.org/wiki/Open_XML_Paper_Specification#C...
I've also heard people cite DjVu https://en.wikipedia.org/wiki/DjVu as an alternative, but I've never had a good experience with it, its format doesn't appear to be an ECMA standard, and (lol) its linked reference file is a .pdf.
What prompted this post was trying to rewrite the initial parse logic for my project PdfPig[0]. I had originally ported the Java PDFBox code but felt like it should be 'simple' to rewrite more performantly. The new logic falls back to a brute-force scan of the entire file if even a single xref table or stream is missed, and the recovery path just relies on the offsets found by that scan.
However it is considerably slower than the code it replaces, and it's hard to have confidence in the changes. I'm currently running it through a 10,000-file test set trying to identify edge cases.
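The recovery idea boils down to something like this (a rough Python sketch for illustration only, not the actual PdfPig code; a real scan also has to avoid false positives from "obj" appearing inside stream data):

    import re

    # Rebuild the object-number -> byte-offset map by scanning the raw file
    # for "N G obj" headers instead of trusting the xref table's offsets.
    OBJ_HEADER = re.compile(rb"(\d{1,10})\s+(\d{1,5})\s+obj\b")

    def scan_for_objects(raw: bytes) -> dict:
        offsets = {}
        for m in OBJ_HEADER.finditer(raw):
            key = (int(m.group(1)), int(m.group(2)))  # (object number, generation)
            offsets[key] = m.start()  # later definitions win, as with incremental updates
        return offsets

    with open("input.pdf", "rb") as f:
        object_offsets = scan_for_objects(f.read())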
The 10k-file test set sounds great for confidence-building. Are the failures clustering around certain producer apps like Word, InDesign, scanners, etc.? Or is it just long-tail randomness?
Reading the PR, I like the recovery-first mindset. If the common real-world case is that offsets lie, treating salvage as the default is arguably the most spec-conformant thing you can do. Slow-and-correct beats fast-and-brittle for PDFs any day.
Also, god bless the open source developers. Without them it would also be impossible to do this in a timely fashion. pymupdf is incredible.
https://www.linkedin.com/posts/sergiotapia_completed-a-reall...
This is exactly why Computer Vision approaches to parsing PDFs work so well in the real world. Relying on metadata in the files just doesn't scale across different sources of PDFs.
We convert PDFs to images, run a layout understanding model on them first, and then apply specialized models like text recognition and table recognition models on them, stitch them back together to get acceptable results for domains where accuracy is table stakes.
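At a hand-wavy level the pipeline looks like the skeleton below. The three model calls are hypothetical placeholders rather than our actual models or any particular vendor's API; only the PyMuPDF rasterization is real:

    import fitz  # PyMuPDF

    # Placeholder model hooks -- swap in real layout / OCR / table models.
    def detect_layout(image_bytes: bytes) -> list:
        raise NotImplementedError("plug in a layout-understanding model")

    def recognize_text(image_bytes: bytes, bbox) -> str:
        raise NotImplementedError("plug in a text-recognition model")

    def recognize_table(image_bytes: bytes, bbox) -> list:
        raise NotImplementedError("plug in a table-recognition model")

    def parse_pdf(pdf_path: str) -> list:
        results = []
        with fitz.open(pdf_path) as doc:
            for page in doc:
                # Work from pixels, never from the PDF's own text objects.
                img = page.get_pixmap(matrix=fitz.Matrix(2, 2)).tobytes("png")
                regions = []
                for region in detect_layout(img):
                    handler = recognize_table if region["type"] == "table" else recognize_text
                    regions.append({**region, "content": handler(img, region["bbox"])})
                # Stitch the regions back together in rough reading order.
                regions.sort(key=lambda r: (r["bbox"][1], r["bbox"][0]))
                results.append({"page": page.number, "regions": regions})
        return results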
PDFs don't always use UTF-8; sometimes they assign random-seeming numbers to individual glyphs (this is common when unused glyphs are stripped from an embedded font, for example),
etc. etc.
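A toy illustration of why the bytes alone aren't enough; the mapping here stands in for a font's ToUnicode CMap and is entirely made up:

    # Entirely made-up mapping standing in for a font's ToUnicode CMap.
    to_unicode = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}

    # The bytes a Tj (show text) operator might carry for a subsetted font.
    raw_show_text = bytes([0x01, 0x02, 0x03, 0x03, 0x04])

    print(raw_show_text.decode("utf-8", errors="replace"))  # unreadable control codes
    print("".join(to_unicode.get(b, "\ufffd") for b in raw_show_text))  # "Hello"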
If you were leading Tensorlake, running on early stage VC with only 10 employees (https://pitchbook.com/profiles/company/594250-75), you'd focus all your resources on shipping products quickly, iterating over unseen customer needs that could make the business skyrocket, and making your customers so happy that they tell everyone and buy lots more licenses.
Because you're a stellar tech leader and strategist, you wouldn't waste a penny reinventing low-level plumbing that's available off the shelf, either cheaply or as free OSS. You'd be thinking about the inevitable opportunity costs: if I build X then I can't build Y, simply because a tiny startup doesn't have enough resources to build both X and Y. You'd quickly conclude that building a homegrown, robust PDF parser would be an open-ended tar pit that would preclude you from focusing on making your customers happy and growing the business.
And the rest of us would watch in awe, seeing truly great tech leadership at work, making it all look easy.
Misspellings, default names, a mixture, home brew naming schemes, meticulous schemes, I’ve seen it all. It’s definitely easier to just rasterize it and OCR it.
Because the underlying "structured data" is never checked while the visual output is checked by dozens of people.
"Truth" is the stuff that the meatbags call "truth" as seen by their squishy ocular balls--what the computer sees doesn't matter.
Do you know you could just use the parsing engine that renders the PDF to get the output? I mean, why rasterize it, OCR it, and then use AI? It sounds like creating a problem just to use AI to solve it.
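For what it's worth, "just use the parsing engine" can be as simple as this with PyMuPDF, assuming the PDF actually has a usable text layer (which is exactly the assumption that breaks on scans and mangled encodings):

    import fitz  # PyMuPDF

    with fitz.open("input.pdf") as doc:
        for page in doc:
            print(page.get_text("text"))  # plain text, as the engine extracts it
            # page.get_text("dict") gives positioned spans, fonts and sizes instead.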
Another thing is that most document parsing tasks are going to run into a significant volume of PDFs which are actually just a bunch of scans/images of paper, so you need to build this capability anyways.
TL;DR: PDFs are basically steganography
One of the biggest benefits of PDFs though is that they can contain invisible data. E.g. the spec allows me to embed cryptographic proof that I've worked at the companies I claim to have worked at within my resume. But a vision-based approach obviously isn't going to be able to capture that.
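For example, something along these lines attaches a machine-readable payload that a rasterize-and-OCR pipeline will never see. This assumes a recent PyMuPDF where Document.embfile_add / embfile_get exist, and the payload is just a placeholder:

    import fitz  # PyMuPDF

    proof = b'{"employer": "Example Corp", "signature": "..."}'  # placeholder payload

    with fitz.open("resume.pdf") as doc:
        doc.embfile_add("employment-proof", proof, desc="signed employment claim")
        doc.save("resume-with-proof.pdf")

    # Invisible to anything that only looks at rendered pixels.
    with fitz.open("resume-with-proof.pdf") as doc:
        print(doc.embfile_get("employment-proof"))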
Well, to be fair, in many cases there's no way around it anyway since the documents in question are only scanned images. And the hardest problems I've seen there are narrative typography artbooks, department store catalogs with complex text and photo blending, as well as old city maps.
Searching big PDFs with it is dramatically faster than with everything else I've tried, and I've often wondered why the others can't be as fast as mupdf-gl.
Thanks for any insights!
Printing a PDF and scanning it to send it by email would normally be worthy of major ridicule.
But you’re basically doing that to parse it.
I get it, and have heard of others doing the same. It just seems damn frustrating that this is necessary. The world sure doesn't parse HTML that way!
UglyToad is a good name for someone who likes pain. ;-)
What the article doesn't mention is that a lot of newer PDFs (v1.5+) don't even have a regular textual xref table: the xref table is itself inside an "xref stream", and I believe v1.5+ also has the option of putting objects inside "object streams" too.
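One quick, non-spec-grade way to see which flavour a given file uses is to follow the final startxref offset and check what's sitting there; a rough Python sketch:

    import re

    def xref_kind(path: str) -> str:
        with open(path, "rb") as f:
            raw = f.read()
        # The last startxref in the file points at the active cross-reference.
        offsets = re.findall(rb"startxref\s+(\d+)", raw)
        if not offsets:
            return "no startxref found"
        offset = int(offsets[-1])
        # Classic files have the literal keyword "xref" at that offset;
        # files with a cross-reference stream have an object header there instead.
        return "classic xref table" if raw[offset:offset + 4] == b"xref" else "xref stream (1.5+)"

    print(xref_kind("input.pdf"))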
By god it's so annoying. I don't think I would be able to manage it without the help of Claude Code, with it just iterating through different libraries and methods over and over again.
Can we just write things in Markdown from now on? I really, really, really don't care that the images you put in are nicely aligned to the right side and everything is boxed together nicely.
Just give me the text and let me render it however I want on my end.
Having said that, I believe there are "streamable" PDFs where there is enough info up front to render the first page (but only the first page).
(But I have been out of the PDF loop for over a decade now so keep that in mind.)
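Those are the "linearized" (a.k.a. "Fast Web View") PDFs: as I recall they put a /Linearized parameter dictionary near the very start of the file, which gives a cheap, if rough, way to check for them:

    def looks_linearized(path: str, window: int = 2048) -> bool:
        # Linearized files place a /Linearized parameter dictionary in the
        # first object, near the very start of the file.
        with open(path, "rb") as f:
            return b"/Linearized" in f.read(window)

    print(looks_linearized("input.pdf"))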