Unlimited OCR: One-Shot Long-Horizon Parsing

82•ingve•1h ago

Comments

Oras•52m ago

OCR has been solved long time ago with vision models. Solutions are consistent, reliable, and stable. What is the point of reinventing the wheel?

I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?

vulture916•47m ago

I haven't done much long-run OCR, so unsure of the current state, but it would seem they overcome this (from their paper):

"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."

ta988•46m ago

Cost, throughput, latency...

Oras•44m ago

Traditional OCR is faster, cheaper, and much more reliable than LLMs

ta988•39m ago

I don't think that's a universal statement that applies to every kind of document and language. Mistral OCR is able to do things no "traditional" OCR was ever able to.

j16sdiz•27m ago

If you consider non-English script, traditional OCR is not more reliable.

CJK scripts have lots of characters and a high confusion rate.

Arabic scripts are complex and have lots of morphs.

Vietnamese has easily confused diacritics.

Thai has lots of non-standard fonts.

JodieBenitez•11m ago

I wish it were. Alas...

cannonpalms•42m ago

I guess, in theory, the prior distribution of language would allow for improved performance in some cases, especially where input quality is low.

ta988•38m ago

This is already used in OCR; Tesseract does that.

chpatrick•42m ago

It absolutely hasn't been solved, it's just got pretty decent in recent years.

JohnKemeny•42m ago

OCR has definitely not "been solved long time ago", what are you talking about?

In your opinion, what is SOTA here?

sscaryterry•41m ago

Detecting characters: almost. Layout: no.

wongarsu•15m ago

Exactly my experience. If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. Vision-llms can improve a bit on character recognition, but at the cost of harder-to-detect failure modes.

But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc, vision-llms are a giant step forward. But you need the context of the previous page to make good decisions about the current page, which is where things quickly get janky (or slow, if you choose the naive approach)

Vision-llms also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago, all "just works" without you even having to remember that this can happen

Aboutplants•16m ago

lol nope it hasn’t been solved. I deal with this constantly and we still have a longggg ways to go

joss82•6m ago

I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you.

OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.

ljouhet•3m ago

Real question: what tool do you use? (for long/complex documents with tables, code, maths)

- marker (with --force-ocr) gives me the best results

- Mistral OCR (seems really great, but I never managed to get it to work)

- Mathpix (tried a long time ago)

- docling (gives me garbage, I must use it wrong)

- Unlimited OCR (will try it)

- ???

robotswantdata•46m ago

Very interesting.

The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.

Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or the output gets capped). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
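To put rough numbers on that linear growth, here is a back-of-the-envelope sketch. The layer count, KV head count, and head dimension are made-up illustrative values, not the configuration of Unlimited OCR or any other specific model, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    Each layer stores one key vector and one value vector per cached token,
    hence the leading factor of 2. bytes_per_value=2 assumes fp16/bf16.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical decoder configuration, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Full cache: memory scales with the number of tokens kept, i.e. O(N).
for n in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(n, **cfg) / 2**20
    print(f"{n:>7} cached tokens: {mib:,.0f} MiB")

# If only the last `window` generated tokens are kept, the cached text is
# bounded by the window size no matter how long the output gets.
window = 128
for n in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(min(n, window), **cfg) / 2**20
    print(f"{n:>7} generated tokens, window={window}: {mib:,.0f} MiB")
```

Under these assumed numbers each cached token costs 128 KiB, so the full cache grows in proportion to the output length while the windowed cache stays flat.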

Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:

Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.

Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
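As a rough illustration of those two paths, here is a minimal sketch of an attention mask in which every generated token can see all reference (image) tokens but only the most recent few generated tokens. This is my reading of the description above, not the paper's implementation; `rswa_mask` and its parameters are hypothetical names.

```python
def rswa_mask(num_ref, num_gen, window):
    """Build a boolean attention mask, one row per generated token.

    Columns 0 .. num_ref-1 are reference (document image) tokens.
    Columns num_ref .. num_ref+num_gen-1 are generated text tokens.
    True means the row's token may attend to that column.
    """
    mask = []
    for i in range(num_gen):
        # Global reference: every generated token sees the whole document.
        ref_part = [True] * num_ref
        # Local generation: causal, and limited to the last `window`
        # generated tokens (including the current one).
        gen_part = [i - window < j <= i for j in range(num_gen)]
        mask.append(ref_part + gen_part)
    return mask


if __name__ == "__main__":
    # Tiny example: 4 reference tokens, 6 generated tokens, window of 2.
    for row in rswa_mask(num_ref=4, num_gen=6, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a mask like this, keys and values for generated tokens that have fallen out of the window are never attended to again, so they can be dropped from the cache, while the reference part has a fixed size set by the document image.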

Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!

KitN•38m ago

"We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas."

Class Act.

gcr•13m ago

I don’t understand the shade being thrown?

manipalite•26m ago

Whatever happened to Reducto? It was very promising 12-15 months ago.

ramon156•13m ago

I love that the entire goal is to push Deepseek OCR further. The west can learn greatly from these companies

pmarreck•4m ago

my attempts at using AI to do OCR have always resulted in invented artifacts, which is not production feasible. does this suffer from that as well?

A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect
