Unlimited OCR: One-Shot Long-Horizon Parsing

https://github.com/baidu/Unlimited-OCR
82•ingve•1h ago

Comments

Oras•52m ago
OCR was solved a long time ago with vision models. The solutions are consistent, reliable, and stable. What is the point of reinventing the wheel?

I would definitely understand post-processing, like extracting data or answering questions, but why redo the OCR engine itself?

vulture916•47m ago
I haven't done much long-run OCR, so unsure of the current state, but it would seem they overcome this (from their paper):

"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
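To put rough numbers on the quoted claim, here is a minimal sketch of how a standard KV cache grows with output length. The model dimensions (24 layers, 8 KV heads, head size 128, fp16) are hypothetical defaults picked for illustration, not figures from the paper.

```python
def kv_cache_bytes(seq_len, n_layers=24, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes held by the KV cache for one sequence of seq_len tokens.

    All dimensions are illustrative assumptions, not the paper's model.
    The leading 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


if __name__ == "__main__":
    # The cache grows linearly: doubling the tokens doubles the memory.
    for tokens in (1_000, 10_000, 100_000):
        mib = kv_cache_bytes(tokens) / 2**20
        print(f"{tokens:>7} tokens -> {mib:,.1f} MiB")
```

With these assumed dimensions each token costs 96 KiB of cache, so memory scales directly with how much text the decoder has already produced.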

ta988•46m ago
Cost, throughput, latency...
Oras•44m ago
Traditional OCR is faster, cheaper, and much more reliable than LLMs
ta988•39m ago
I don't think that's a universal statement that applies to every kind of document and language. Mistral OCR is able to do things no "traditional" OCR was ever able to.
j16sdiz•27m ago
If you consider non-English scripts, traditional OCR is not more reliable.

CJK scripts have lots of characters and a high confusion rate.

Arabic scripts are complex and have lots of morphs.

Vietnamese has easily confused diacritics.

Thai has lots of non-standard fonts.

JodieBenitez•11m ago
I wish it were. Alas...
cannonpalms•42m ago
I guess, in theory, the prior distribution of language would allow for improved performance in some cases, especially where input quality is low.
ta988•38m ago
This is already used in OCR; Tesseract does this.
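The idea of a language prior can be sketched as rescoring candidate readings: combine the recognizer's visual confidence with how plausible the word is in the language. This is an illustration of the general technique only, not Tesseract's actual implementation; `pick_reading`, `word_prior`, and the probabilities are made-up names and values.

```python
import math


def pick_reading(candidates, word_prior, prior_weight=1.0, floor=1e-6):
    """Choose the candidate with the best combined log-score.

    candidates: list of (text, visual_probability) pairs, probabilities > 0.
    word_prior: dict mapping lowercase words to a language probability.
    Unknown words fall back to `floor` (an assumed smoothing value).
    """
    def score(item):
        text, p_visual = item
        p_lang = word_prior.get(text.lower(), floor)
        return math.log(p_visual) + prior_weight * math.log(p_lang)

    return max(candidates, key=score)[0]


if __name__ == "__main__":
    readings = [("c1ean", 0.55), ("clean", 0.45)]  # the image slightly favors "c1ean"
    prior = {"clean": 0.01}
    print(pick_reading(readings, prior))                    # prior overrides the image
    print(pick_reading(readings, prior, prior_weight=0.0))  # image evidence only
```

The same mechanism cuts both ways: a strong prior rescues low-quality input, but it can also "correct" text that was deliberately unusual.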
chpatrick•42m ago
It absolutely hasn't been solved, it's just got pretty decent in recent years.
JohnKemeny•42m ago
OCR has definitely not "been solved long time ago", what are you talking about?

In your opinion, what is SOTA here?

sscaryterry•41m ago
Detecting characters: almost. Layout: no.
wongarsu•15m ago
Exactly my experience. If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. Vision-llms can improve a bit on character recognition, but at the cost of harder-to-detect failure modes.

But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc, vision-llms are a giant step forward. But you need the context of the previous page to make good decisions about the current page, which is where things quickly get janky (or slow, if you choose the naive approach)

Vision-llms also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago: it all "just works" without you even having to remember that this can happen.
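The page-by-page approach with carried-over context mentioned above can be sketched like this. `transcribe_page` is a hypothetical callable standing in for whatever vision model you use; no real library API is implied, and the character-based context limit is an arbitrary assumption.

```python
def ocr_document(pages, transcribe_page, context_chars=500):
    """Transcribe pages one by one, passing a bounded tail of earlier output.

    pages: iterable of page inputs (images, paths, anything the callable accepts).
    transcribe_page: hypothetical function (page, context) -> text.
    context_chars: how much previous text to carry forward (assumed unit: characters).
    """
    outputs = []
    context = ""
    for page in pages:
        text = transcribe_page(page, context)
        outputs.append(text)
        # Keep only the tail of what was produced so the context stays bounded.
        combined = f"{context}\n{text}".strip()
        context = combined[-context_chars:] if context_chars > 0 else ""
    return "\n".join(outputs)
```

Carrying the full previous output instead of a tail is the naive variant: it gives the model everything, but the cost per page grows with document length.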

Aboutplants•16m ago
lol nope it hasn’t been solved. I deal with this constantly and we still have a longggg ways to go
joss82•6m ago
I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you.

OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.

ljouhet•3m ago
Real question: what tool do you use? (for long/complex documents with tables, code, maths)

- marker (with --force-ocr) gives me the best results

- Mistral OCR (seems really great, but I never managed to get it to work)

- Mathpix (tried a long time ago)

- docling (gives me garbage, I must be using it wrong)

- Unlimited OCR (will try it)

- ???

robotswantdata•46m ago
Very interesting.

The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.

Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or hits a cap). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.

Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:

Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.

Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
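The two paths described above can be sketched as an attention mask. This follows the comment's description only and is not taken from the Unlimited OCR code; `rswa_mask`, `keys_kept`, and the token layout (reference tokens first, generated tokens after) are assumptions made for illustration.

```python
def rswa_mask(n_ref, n_gen, window):
    """Return mask[t][j]: may generated token t attend to key position j?

    Key positions 0..n_ref-1 are reference (image) tokens.
    Key positions n_ref..n_ref+n_gen-1 are generated text tokens.
    """
    mask = []
    for t in range(n_gen):
        row = [True] * n_ref          # global reference: always visible
        lo = max(0, t - window + 1)   # oldest generated token still in the window
        for s in range(n_gen):
            row.append(lo <= s <= t)  # local generation: causal and windowed
        mask.append(row)
    return mask


def keys_kept(n_ref, n_gen, window):
    """Cached keys needed at the last decoding step under this scheme."""
    return n_ref + min(n_gen, window)


if __name__ == "__main__":
    for row in rswa_mask(n_ref=3, n_gen=5, window=2):
        print("".join("x" if ok else "." for ok in row))
```

Under these assumptions the cache stops growing once the output exceeds the window: the number of keys kept depends on the reference size and the window, not on how much text has been generated.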

Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!

KitN•38m ago
"We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas."

Class Act.

gcr•13m ago
I don’t understand the shade being thrown?
manipalite•26m ago
Whatever happened to Reducto? It was very promising 12-15 months ago.
ramon156•13m ago
I love that the entire goal is to push Deepseek OCR further. The west can learn greatly from these companies
pmarreck•4m ago
My attempts at using AI to do OCR have always resulted in invented artifacts, which is not feasible in production. Does this suffer from that as well?

A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect
