Because my experience is not at all like that. If I use both Google Translate and ChatGPT on an image, ChatGPT is pretty much always better. It can even translate handwritten Japanese menus quite well, with the added benefit that it can add context and explain what the dishes are.
With the big commercial offerings like ChatGPT, I'd fully expect them to work fine, given the absolutely massive horsepower in use.
https://huggingface.co/echo840/MonkeyOCR/blob/main/Recogniti...
We have a benchmark for evaluating VLMs on document understanding tasks: https://idp-leaderboard.org/ . Unfortunately, it does not include image-to-markdown as a task. The problem with evaluating image-to-markdown is that the output can still be correct even if two blocks appear in a different order. E.g., if seller info and buyer info sit side by side in the image, one model can extract the seller info first and another model can extract the buyer info first. Both models are correct, but if you fuzzy-match against the ground truth, whichever model happens to follow the ground truth's order gets the higher accuracy.
Normally, a company trains and tests on datasets annotated with the same convention (either left block first or right block first), so other models can score low on that company's benchmark simply because they were trained on the opposite annotation order.
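To make the ordering problem concrete, here is a minimal sketch using only Python's standard library. It is not how idp-leaderboard or any vendor benchmark actually scores; the function names and the seller/buyer strings are made up for illustration. It compares fuzzy matching on the concatenated text with a score that tries every block ordering and keeps the best one.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, a, b).ratio()


def order_sensitive_score(pred_blocks, truth_blocks) -> float:
    """Fuzzy-match the concatenated text, so block order affects the score."""
    return similarity("\n".join(pred_blocks), "\n".join(truth_blocks))


def order_invariant_score(pred_blocks, truth_blocks) -> float:
    """Best mean per-block similarity over all orderings of the prediction.

    Brute force over permutations, so only suitable for a handful of blocks.
    Assumes both sides contain the same number of blocks.
    """
    if len(pred_blocks) != len(truth_blocks):
        raise ValueError("this sketch assumes equal block counts")
    if not truth_blocks:
        return 1.0
    best = 0.0
    for perm in permutations(pred_blocks):
        total = sum(similarity(p, t) for p, t in zip(perm, truth_blocks))
        best = max(best, total / len(truth_blocks))
    return best


# Hypothetical ground truth annotated left block first.
truth = ["Seller: Acme Corp, 1 Main St", "Buyer: Jane Doe, 2 Oak Ave"]
model_a = list(truth)            # same order as the annotation
model_b = list(reversed(truth))  # same content, opposite order

print(order_sensitive_score(model_a, truth))
print(order_sensitive_score(model_b, truth))
print(order_invariant_score(model_b, truth))
```

Model B extracts exactly the same text as model A, yet only the order-invariant score treats them as equally correct. For more than a few blocks, an assignment solver would be a better fit than enumerating permutations.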
We are currently working on creating completely handwritten document datasets for our next model release.
Result:
Perhaps it needed more than 1K tokens? But it took about an hour (number 28 in queue) to generate that and I didn't feel like trying again.
How many tokens does it usually take to represent a page of text with 554 characters?
Regarding the token limit, it depends on the text. We are using the qwen-2.5-vl tokenizer in case you are interested in reading about it.
You can run it very easily in a Colab notebook, which should be faster than the demo: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
There are incorrect words in the extraction, so I would suggest waiting for the handwritten text model's release.
PixelPanda•6h ago
Excited to share Nanonets-OCR-s, a powerful and lightweight (3B) VLM that converts documents into clean, structured Markdown. The model is trained to understand document structure and content context (tables, equations, images, plots, watermarks, checkboxes, etc.). Key Features:
LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.
Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.
Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.
Smart Checkbox & Radio Button Handling: Converts checkboxes to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.
Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
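For downstream parsing, the tagged spans can be pulled out of the Markdown with a small regex helper. This is only a sketch: the sample output below is invented, `extract_tag` is a hypothetical helper name, and it assumes each tag is emitted as a plain, non-nested `<tag>...</tag>` pair without attributes, which the post does not spell out.

```python
import re


def extract_tag(markdown: str, tag: str) -> list[str]:
    """Return the stripped contents of every <tag>...</tag> pair.

    Assumes plain, non-nested tags without attributes (an assumption,
    not something confirmed by the model's documentation here).
    """
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(markdown)]


# Invented sample; real model output may be laid out differently.
sample = """# Invoice 0042
<watermark>CONFIDENTIAL</watermark>

Total due: $120.00

<img>Company logo with a blue mountain</img>

<signature>J. Smith</signature>
"""

for tag in ("watermark", "img", "signature"):
    print(tag, extract_tag(sample, tag))
```

A regex is enough for flat tags like these. If the output can nest tags or carry attributes, a proper HTML parser would be the safer choice.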
Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s
Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
mvac•2h ago