In 2025 LLMs can 'fake it' using Trilobites of memory and Petaflops. It's funny actually, like a supercomputer being emulated in real time on a really fast Jacquard loom. By 2027 even simple handheld-calculator addition will be billed in kilowatt-hours.
Trilobites? Those were truly primitive computers.
OCR'ing a fixed, monospaced font from a pristine piece of paper really is "solved." It's all the nasties of the real world that are the issue.
As I mockingly demonstrated, kerning, character similarity, grammar, and lexing all present large and hugely time-consuming problems in the processes where OCR is most useful.
That said, over the last two years I've come across many use cases for parsing PDFs, and each has its own requirements (e.g., figuring out titles, removing page numbers, extracting specific sections, etc.). And each requires a different approach.
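To make the "each requires a different approach" point concrete, here is a rough sketch of just the page-number case. It assumes the text has already been extracted one page at a time, and the pattern and the `strip_page_numbers` name are hypothetical; real documents will need their own rules.

```python
import re

# Hypothetical heuristic: a line counts as a page number if it is only a
# number, optionally dressed up as "Page 3", "3 / 10", "3 of 10" or "- 3 -".
PAGE_NUMBER = re.compile(
    r"^\s*(?:page\s+)?-?\s*\d{1,4}\s*-?\s*(?:(?:/|of)\s*\d{1,4})?\s*$",
    re.IGNORECASE,
)


def strip_page_numbers(page_text: str) -> str:
    """Drop a page-number line from the top and bottom edge of one page."""
    lines = page_text.splitlines()
    # Only the first and last lines are checked, so bare numbers inside the
    # body (table cells, list items) are left alone.
    if lines and PAGE_NUMBER.match(lines[-1]):
        lines = lines[:-1]
    if lines and PAGE_NUMBER.match(lines[0]):
        lines = lines[1:]
    return "\n".join(lines)
```

Titles and section extraction would each need a separate heuristic of their own, which is the whole problem.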
My point is, this is awesome, but I wonder if there needs to be a broader push / initiative to stop leveraging PDFs so much when things like HTML, XML, JSON and a million other formats exist. It's a hard undertaking, I know, but it's not unheard of to drop a technology (e.g., fax) for a better one.
I’d just prefer that any images and diagrams are copied over, and rendered into a popular format like markdown.
For each page:
- Extract text as usual.
- Capture the whole page as an image (~200 DPI).
- Optionally extract images/graphs within the page and include them in the same LLM call.
- Optionally add a bit of context from neighboring pages.
Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.
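The steps above can be sketched roughly as follows. This only assembles the per-page request; it assumes the text and the rendered page image have already been produced by whatever PDF library you use. The `Page` class, the `build_page_request` function, the prompt wording, and the dictionary keys are all made up for illustration and are not any vendor's API.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class Page:
    text: str    # text layer, extracted as usual
    png: bytes   # whole page rendered as an image (~200 DPI)
    figures: list[bytes] = field(default_factory=list)  # optional cropped images/graphs


# Hypothetical prompt: asks for structured output and says how to handle graphs.
PROMPT = (
    "Convert this PDF page to Markdown. Use the extracted text as a hint and "
    "the page image as ground truth. Describe each graph in a short caption. "
    "Return JSON with the keys 'markdown' and 'figures'."
)


def _b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


def build_page_request(pages: list[Page], i: int, context_chars: int = 300) -> dict:
    """Bundle one page's text, image, figures and neighbor context."""
    page = pages[i]
    before = after = ""
    if context_chars > 0:
        # A bit of context: the tail of the previous page, the head of the next.
        if i > 0:
            before = pages[i - 1].text[-context_chars:]
        if i + 1 < len(pages):
            after = pages[i + 1].text[:context_chars]
    return {
        "prompt": PROMPT,
        "page_number": i + 1,
        "text": page.text,
        "page_image_png_b64": _b64(page.png),
        "figure_images_png_b64": [_b64(f) for f in page.figures],
        "context_before": before,
        "context_after": after,
    }
```

Each returned dictionary would then be mapped onto whichever multimodal API you call, one request per page, so pages can be processed in parallel and retried individually.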
At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.
Yeah, it’s a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement and quite flexible and powerful. It works across almost any format, and Markdown is both AI- and human-friendly and surprisingly maintainable.
It all depends on the scale you need; with the API it's easy to generate millions of tokens without thinking.
It is hype-compatible so it is good.
It is AI so it is good.
It is blockchain so it is good.
It is cloud so it is good.
It is virtual so it is good.
It is UML so it is good.
It is RPN so it is good.
It is a steam engine so it is good.
Yawn...
It's not.
LLMWhisperer (from Unstract), Docling (IBM), Marker (Surya OCR), Nougat (Facebook Research), LlamaParse.
david_draco•6h ago
firesteelrain•6h ago
https://github.com/ocrmypdf/OCRmyPDF
No LLMs required.
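For anyone who wants to script it, a minimal sketch of calling the OCRmyPDF command line from Python. It assumes the `ocrmypdf` executable is installed and on the PATH; the two function names are hypothetical wrappers, not part of OCRmyPDF.

```python
import shutil
import subprocess


def ocrmypdf_command(src: str, dst: str, language: str = "eng") -> list[str]:
    """Build the argument list for one OCRmyPDF run."""
    # --skip-text leaves pages that already have a text layer untouched;
    # --deskew straightens crooked scans before OCR; -l sets the OCR language.
    return ["ocrmypdf", "--skip-text", "--deskew", "-l", language, src, dst]


def add_text_layer(src: str, dst: str) -> bool:
    """Run OCRmyPDF if it is available; return True when it exits cleanly."""
    if shutil.which("ocrmypdf") is None:
        return False  # not installed, nothing was done
    return subprocess.run(ocrmypdf_command(src, dst)).returncode == 0
```

The output is a searchable PDF with a text layer added, which ordinary text extraction can then read.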
moritonal•5h ago
enjaydee•5h ago
https://threadreaderapp.com/thread/1955355127818358929.html
constantinum•2h ago
ethan_smith•24m ago