GLM-OCR: Accurate × Fast × Comprehensive

65•ms7892•4d ago

Comments

aliljet•1h ago

This is actually the thing I really desperately need. I'm routinely analyzing contracts that were faxed to me, scanned with monstrously poor resolution, wet signed, all kinds of shit. The big LLM providers choke on this raw input and I burn up the entire context window for 30 pages of text. Understandable evals of the quality of these OCR systems (which are moving wicked fast) would be helpful...

And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.

cinntaile•1h ago

Deciphering fax messages? What is this, the 90s?

daveguy•1h ago

If your needs are that sensitive, I doubt you'll find anything anytime soon that doesn't require a human in the loop. Even SOTA models only average 95% accuracy on messy inputs. If that's a per-character accuracy (which is how OCR is generally measured), that's going to be 5+ errors per page of 100+ words. If you really can't afford mistakes, you have to treat the OCR output as inaccurate. If you have key components like "days to respond" and "units vacant", you need to identify the presence of those specifically, with a bias in favor of false positives (over false negatives), and have a human confirm the OCR output against the source.
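To put rough numbers on that estimate, here is a minimal sketch. It assumes about 5 characters per word and independent per-character errors; both are simplifying assumptions of mine, not figures from the thread. The function names are hypothetical.

```python
def expected_char_errors(num_chars: int, char_accuracy: float) -> float:
    """Expected number of wrong characters at a given per-character accuracy."""
    return num_chars * (1.0 - char_accuracy)


def prob_error_free(num_chars: int, char_accuracy: float) -> float:
    """Chance that every character is right, assuming independent errors."""
    return char_accuracy ** num_chars


# Assumption: a 100-word page is roughly 500 characters.
chars_per_page = 100 * 5
print(expected_char_errors(chars_per_page, 0.95))
print(prob_error_free(chars_per_page, 0.95))
```

Under these assumptions the expected error count per page is well above the "5+" floor, and an error-free page is essentially never produced, which supports the point that a human has to confirm the critical fields.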

coder543•58m ago

If you want OCR with the big LLM providers, you should probably be passing one page per request. Having the model focus on OCR for only a single page at a time seemed to help a lot in my anecdotal testing a few months ago. You can even pass all the pages in parallel in separate requests, and get the better quality response much faster too.

But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.
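A minimal sketch of the one-page-per-request idea, run in parallel. The `ocr_page` callable is a placeholder for whatever provider call you use (it is hypothetical, not a real API); the sketch only shows the fan-out and that results come back in page order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def ocr_pages_in_parallel(
    pages: Sequence[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 8,
) -> List[str]:
    """Send each page as its own request and return the text in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs,
        # even if the requests finish out of order.
        return list(pool.map(ocr_page, pages))


# Stand-in for a real OCR request, for demonstration only.
def fake_ocr(page_image: bytes) -> str:
    return page_image.decode("utf-8").upper()


print(ocr_pages_in_parallel([b"page one", b"page two"], fake_ocr))
```

Threads fit here because each request mostly waits on the network. A real version would also need retries and rate limiting, which are left out.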

HPsquared•19m ago

You could maybe then do a second pass on the whole text (as plain text not OCR) to look for likely mistakes.
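One cheap version of such a second pass, as a sketch: flag any token that mixes letters and digits, since that is a common shape for OCR confusions such as "O" for "0". This heuristic is my own illustration, not something proposed in the thread. It is deliberately biased toward false positives, so it will also flag harmless tokens like "10th".

```python
import re
from typing import List


def flag_suspicious_tokens(text: str) -> List[str]:
    """Return tokens containing both letters and digits, for human review."""
    flagged = []
    for token in re.findall(r"\w+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_alpha = any(ch.isalpha() for ch in token)
        if has_digit and has_alpha:
            flagged.append(token)
    return flagged


print(flag_suspicious_tokens("Respond within 1O days; 4 units vacant"))
```

A correctly read number such as "4" passes through unflagged, so this catches only one class of mistake and does not replace checking the critical fields against the source.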

chrsw•39m ago

I'm keeping my eye on progress in this area as well. I need to free engineering design data from tens of thousands of PDF pages and make them easily and quickly accessible to LLMs.

aliljet•35m ago

All of healthcare is crying. Trust me.

Imustaskforhelp•31m ago

I suppose tears of joy?

fragmede•6m ago

Of sadness because they're not allowed to use it yet.

yieldcrv•17m ago

> I burn up the entire context window for 30 pages of text

We analyze 200-page contracts with no problem.

I think you're doing it wrong or in an antiquated way (until context window sizes improve)

Are you doing this programmatically at all or are you doing something closer to dropping a contract into a chat window?

We use a main agent to classify the pages, and we build subagents that are familiar with the page classifications and are fed page ranges. Each one has its own full context window and prompt.
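The hand-off from classification to page ranges could look something like this sketch. It assumes the main agent has already produced one label per page; the label names and the function name are made up for illustration. Consecutive pages with the same label are merged into one range for a subagent.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def group_page_ranges(labels: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Merge runs of identically labeled pages into (label, first, last) ranges.

    Page numbers are 1-based and inclusive.
    """
    ranges = []
    page = 1
    for label, run in groupby(labels):
        length = len(list(run))
        ranges.append((label, page, page + length - 1))
        page += length
    return ranges


# Hypothetical per-page labels from the classifying agent.
labels = ["cover", "terms", "terms", "signature"]
print(group_page_ranges(labels))
```

Each resulting range would then go to its own subagent with a fresh context window, so no single request has to hold the whole contract.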

coder543•1h ago

There are a bunch of new OCR models.

I’ve also heard very good things about these two in particular:

- LightOnOCR-2-1B: https://huggingface.co/lightonai/LightOnOCR-2-1B

- PaddleOCR-VL-1.5: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5

The OCR leaderboards I’ve seen leave a lot to be desired.

With the rapid release of so many of these models, I wish there were a better way to know which ones are actually the best.

I also feel like most/all of these models don’t handle charts, other than to maybe include a link to a cropped image. It would be nice for the OCR model to also convert charts into markdown tables, but this is obviously challenging.

StableAlkyne•13m ago

How do these compare to something like Tesseract?

I remember that one clearing the scoreboard for many years, and usually it's the one I grab for OCR needs due to its reputation.

rdos•31m ago

Is it possible for such a small model to outperform Gemini 3, or is this a case of benchmarks not reflecting reality? I would love to be hopeful, but so far an open source model has never been better than a closed one, even when benchmarks showed that it was.

amluto•28m ago

Off the top of my head: for a lot of OCR tasks, it’s kind of worse for the model to be smart. I don’t want my OCR to make stuff up or answer questions. I want it to recognize what is actually on the page.

rdos•17m ago

Interesting. Won't stuff like entity extraction suffer? Especially in multilingual use cases. My worry is that a smaller model might not realize some text is actually a person's name because it is very unusual.

alaanor•11m ago

There have been so many OCR models released in the past few months, all of them VLMs, and yet none of them handles Korean well. Every time I try with a random screenshot (not an A4 document), they just fail at a "simple" task. Funnily enough, Qwen3 8B VL is the best model and usually gets it right (although I couldn't get the bbox to work quite well). Even funnier, whatever is running locally on CPU on an iPhone is insanely good, same with Google's OCR API. I don't know why we don't get more of the traditional OCR stuff. Paddlepaddle v5 is the closest I could find. At this point, I feel like I might be doing something wrong with those VLMs.

