As I understand it, the researchers found a clever architectural hack to stop the AI from hoarding memory when reading long documents.
Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or hits a cap). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
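To put rough numbers on that linear growth, here is a back-of-the-envelope sketch. The layer count, KV head count, head dimension, and 16-bit precision are hypothetical defaults I picked for illustration, not the configuration of the model in the paper.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size for a standard decoder with full attention.

    All default sizes are hypothetical; plug in your own model's config.
    """
    # The leading 2 counts one key tensor plus one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


if __name__ == "__main__":
    for tokens in (1_000, 10_000, 100_000):
        mib = kv_cache_bytes(tokens) / 2**20
        print(f"{tokens:>7} tokens: {mib:,.0f} MiB")
```

Doubling the number of tokens doubles the cache, which is the O(N) growth described above.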
Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:
Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.
Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
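A minimal sketch of how I picture those two paths as an attention mask, assuming the window is counted in tokens and that every generated token can see all reference (image) tokens. The function names and the mask layout are my own illustration, not the paper's implementation.

```python
def rswa_mask(num_ref, num_gen, window):
    """Build a boolean attention mask for generated tokens.

    Row i is generated token i. Columns are laid out as
    [reference tokens | generated tokens]. True means "may attend".
    """
    mask = []
    for i in range(num_gen):
        # Global reference: the document tokens are always visible.
        ref_part = [True] * num_ref
        # Local generation: only the last `window` generated tokens
        # (including the current one) are visible.
        lo = max(0, i - window + 1)
        gen_part = [lo <= j <= i for j in range(num_gen)]
        mask.append(ref_part + gen_part)
    return mask


def cached_entries(num_ref, num_generated, window):
    """Number of KV entries kept: all reference tokens plus a capped window."""
    return num_ref + min(num_generated, window)


if __name__ == "__main__":
    for row in rswa_mask(num_ref=3, num_gen=5, window=2):
        print("".join("x" if visible else "." for visible in row))
    print(cached_entries(num_ref=3, num_generated=1000, window=128))
```

Under these assumptions the cache stops growing once the output is longer than the window, no matter how much text has been generated.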
Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!
Class Act.
A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect.
Oras•52m ago
I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?
vulture916•47m ago
"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
ta988•46m ago
Oras•44m ago
ta988•39m ago
j16sdiz•27m ago
CJK has lots of characters and a high confusion rate.
Arabic scripts are complex and have lots of morphs.
Vietnamese has easily confused diacritics.
Thai has lots of non-standard fonts.
JodieBenitez•11m ago
cannonpalms•42m ago
ta988•38m ago
chpatrick•42m ago
JohnKemeny•42m ago
In your opinion, what is SOTA here?
sscaryterry•41m ago
wongarsu•15m ago
But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc., vision LLMs are a giant step forward. But you need the context of the previous page to make good decisions about the current page, which is where things quickly get janky (or slow, if you choose the naive approach).
Vision LLMs also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago: it all "just works" without you even having to remember that this can happen.
Aboutplants•16m ago
joss82•6m ago
OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.
ljouhet•3m ago
- marker (with --force-ocr) gives me the best results
- Mistral OCR (seems really great, but I never managed to get it to work)
- Mathpix (tried a long time ago)
- docling (gives me garbage, I must be using it wrong)
- Unlimited OCR (will try it)
- ???