Because my experience is not at all like that. If I use both Google Translate and ChatGPT on an image, ChatGPT is pretty much always better. It can even translate Japanese hand written menus quite well. With the added benefit of it being able to add context and explain what the dishes are.
With the big commercial offerings like ChatGPT I'd fully expect them to work fine, due to the absolutely massive horsepower in use.
https://huggingface.co/echo840/MonkeyOCR/blob/main/Recogniti...
My understanding:
- Weights available: You can download the weights.
- Open weight: You can download the weights, and it is licensed freely (e.g. public domain, CC BY-SA, MIT).
- Open source: (Debated) You can download the weights, it is licensed freely, and the training dataset is also available and licensed freely.
For context:
> You're right. The Apache-2.0 license was mistakenly listed, and I apologize for the confusion. Since it's a derivative of Qwen-2.5-VL-3B, it will have the same license as the base model (Qwen RESEARCH LICENSE AGREEMENT). Thanks for pointing this out.
We have a benchmark for evaluating VLMs on document understanding tasks: https://idp-leaderboard.org/ . Unfortunately, it does not include image-to-markdown as a task. The problem with evaluating image-to-markdown is that even if two blocks come out in a different order, the result can still be correct. E.g. if the image has seller info and buyer info side by side, one model can extract the seller info first and another model the buyer info first. Both models are correct, but depending on how the ground truth is ordered, fuzzy matching will score one of them higher than the other.
Normally, a company will train and test on data annotated in the same block order (either left block first or right block first), so every other model can get a low score on that benchmark simply because it was trained on the opposite annotation order.
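To make that concrete, here is a toy illustration (plain difflib, not the leaderboard's actual metric) of how naive fuzzy matching penalizes a perfectly correct extraction that just emits the blocks in the other order:

```python
# Toy illustration: order-sensitive fuzzy matching vs. equally correct outputs.
from difflib import SequenceMatcher

seller = "Seller: Acme GmbH, Berlin"
buyer = "Buyer: Foo Ltd, London"

ground_truth = f"{seller}\n{buyer}"   # annotation happens to list the left block first
model_a = f"{seller}\n{buyer}"        # same order as the annotation
model_b = f"{buyer}\n{seller}"        # equally correct, opposite order

def fuzzy(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

print(fuzzy(ground_truth, model_a))   # 1.0
print(fuzzy(ground_truth, model_b))   # noticeably lower, despite being correct
```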
OCR that has lower accuracy, but where the inaccurate parts are left blank or flagged, is far superior. Mistral OCR also suffers from this problem.
If your OCR produced bounding boxes for every text line and ran a traditional OCR on the text, this could alleviate it. Or at the very least, bounding boxes let users cross-correlate with output from traditional OCR.
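As a sketch of what that cross-check could look like (Tesseract via pytesseract here is just an example, not something this model ships with): word-level boxes plus confidences make it easy to flag regions a user should double-check.

```python
# Sketch: get per-word bounding boxes and confidences from a traditional OCR
# engine, so low-confidence regions can be flagged or compared with VLM output.
from PIL import Image
import pytesseract
from pytesseract import Output

page = Image.open("page.png")  # hypothetical input image
data = pytesseract.image_to_data(page, output_type=Output.DICT)

for text, conf, x, y, w, h in zip(
    data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
):
    if text.strip() and float(conf) < 60:  # 60 is an arbitrary threshold
        print(f"low-confidence word {text!r} at ({x}, {y}, {w}x{h}), conf={conf}")
```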
Also, a small note: it's probably best not to say your product beats Mistral when it hasn't even been tested against it. Having more features doesn't make a product better if the accuracy on those features isn't better.
I don't mean to be discouraging, this is an important space and it looks like you have a very feature rich model. I'd like to see a good solution be developed!
We are currently working on creating completely handwritten document datasets for our next model release.
Result:
Perhaps it needed more than 1K tokens? But it took about an hour (number 28 in queue) to generate that and I didn't feel like trying again.
How many tokens does it usually take to represent a page of text with 554 characters?
Regarding the token limit, it depends on the text. We are using the qwen-2.5-vl tokenizer in case you are interested in reading about it.
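If you want a rough number yourself, counting text tokens with the model's tokenizer is straightforward (the repo id is the one linked elsewhere in the thread; image patches add further tokens on top of this):

```python
# Rough text-token count for a passage, using the model's own tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nanonets/Nanonets-OCR-s")
sample = "replace this with the 554-character page text"
print(len(tok(sample)["input_ids"]))
```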
You can run it very easily in a Colab notebook. This should be faster than the demo https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
There are incorrect words in the extraction, so I would suggest waiting for the handwritten text model's release.
Apologies if there's some unspoken nuance in this exchange, but by "working correctly" did you just mean that it ran to completion? I don't even recognize some of the Unicode characters that it emitted (or maybe you're using some kind of strange font, I guess?)
Don't misunderstand me: a ginormous number of floating-point numbers attempting to read that handwriting is already doing better than I can. I was just trying to understand whether that outcome was what you expected.
Page# 8
Log: MA 6100 2.03.15
34 cement emitter resistors - 0.33R 5W 5% measure 0.29R 0.26R
35 replaced R436, R430 emitter resistors on R-chn P.O. brd w/new WW 5W .33R 5% w/ ceramic lead insulators
36 applied de-oxit d100 to speaker outs, card terminals, terminal blocks, output trans jacks
37 replace R-chn drivers and class A BJTs w/ BD139/146, & TIP31AG
38 placed boards back in
39 desoldered grnd lug from volume control
40 contact cleaner, Deoxit D5, faderlube on pots & switches teflon lube on rotor joint
41 cleaned ground lug & resoldered, reattached panel
Log: MA 6100 Z. O 3. 15
<table>
  <tr><td>34</td><td>cement emitter resistors -</td></tr>
  <tr><td></td><td>0.33 R SW 5% measure</td></tr>
  <tr><td></td><td>0.29 R, 0.26 R</td></tr>
  <tr><td>35</td><td>replaced R'4 36, R4 30</td></tr>
  <tr><td></td><td>emitter resistor on R-44</td></tr>
  <tr><td></td><td>0.0. 3rd w/ new WW 5W .33R</td></tr>
  <tr><td>36</td><td>% w/ ceramic lead insulators</td></tr>
  <tr><td></td><td>applied de-oat d100 to Speak</td></tr>
  <tr><td></td><td>outs, card terminals, terminal</td></tr>
  <tr><td></td><td>blocks, output tran jacks</td></tr>
  <tr><td>37</td><td>replace &-clun diviers</td></tr>
  <tr><td></td><td>and class A BJTs w/ BD139/140</td></tr>
  <tr><td></td><td>& TIP37A2</td></tr>
  <tr><td>38</td><td>placed boards back in</td></tr>
  <tr><td>39</td><td>desoldered ground lus from volume</td></tr>
  <tr><td></td><td>(con 48)</td></tr>
  <tr><td>40</td><td>contact cleaner, Deox. t DS, facel/42</td></tr>
  <tr><td></td><td>on pots & switches</td></tr>
  <tr><td></td><td>· teflon lube on rotor joint</td></tr>
  <tr><td>41</td><td>reably cleaned ground lus &</td></tr>
  <tr><td></td><td>resoldered, reattatched panel</td></tr>
</table>
You can paste it in https://markdownlivepreview.com/ and see the extraction. This is using the Colab notebook I have shared before.
Which Unicode characters are you mentioning here?
The inline $\sigma_0$ is mangled as "<sup>s</sup> 0", and $f(t)$ is mangled as "f~~t*!". The current model gets them both correct.
I ran both with no setting specified, and with force_ocr, and I didn't see the issues either time.
I’m currently using the Datalab online playground with default settings - does that enable inline math recognition?
We're working on improving the playground generally now - expect a big update tomorrow, which among other things will default to format lines.
Thanks for the kind words! The team was just me until pretty recently, but we're growing quickly and will be addressing a lot of issues quickly in the next few weeks.
Also, we extract complex tables as HTML instead of Markdown.
Have you considered using something like Pandoc's method of marking up footnotes? Footnotes are a fairly common part of scanned pages, and Markdown that doesn't indicate that a footnote is a footnote can be fairly incomprehensible.
(I'm a MyST contributor)
As a project, the tooling to parse MyST Markdown was built on top of Sphinx, which primarily expects ReST as input. Now, I would not be surprised if most _new_ Sphinx users are using MyST Markdown (but I have no data there!)
Subsequently, the Jupyter Book project that built those tools has pivoted to building a new document engine that's better focused on the use-cases of our audience and leaning into modern tooling.
The fact that someone would go to all the work to build a model to extract the structure of documents, then choose an output format strictly less expressive than XML, speaks poorly of the state of cross-generational knowledge sharing within the industry.
If the goal is to parse this output programmatically, then I agree that a more structured markup language is a better choice.
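That said, the custom tags and the HTML tables are regular enough that the Markdown output can already be mined programmatically. A rough sketch (tag names taken from the feature list in the announcement; the parsing approach itself is my own assumption):

```python
# Sketch: pull structured pieces out of the model's Markdown/HTML output.
import re
from bs4 import BeautifulSoup

markdown_output = "...model output for one page..."  # placeholder

# Complex tables come back as HTML <table> blocks.
tables = BeautifulSoup(markdown_output, "html.parser").find_all("table")

# The custom tags are simple enough to pull out with a regex.
def tagged(tag: str, text: str) -> list[str]:
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)

image_descriptions = tagged("img", markdown_output)
signatures = tagged("signature", markdown_output)
watermarks = tagged("watermark", markdown_output)
```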
Now I need a catalog, archive, or historian function that archives and pulls the elements easily. Amazing work!
It does work, but it is very slow on my older GPU (Nvidia 1080 8GB). I would say it's taking at least 5 minutes per page right now, but maybe more.
Edit: If anyone is interested in trying a PDF-to-markdown conversion utility built on this and hosted on Cloud Run (with GPU support), let me know. It should be done in about an hour or so and I will post a link up here when it's done.
THE ANIMATE
AND THE INANIMATE
WILLIAM JAMES SIDIS
<img>A black-and-white illustration of a figure holding a book with the Latin phrase "ARTI et VERITATI" below it.</img>
BOSTON
RICHARD G. BADGER, PUBLISHER
THE GORHAM PRESS
Digitized by Google
I haven't seen ANY errors in what it has done, which is quite impressive. Here, it's doing tables of contents (I used a slightly different copy of the PDF than I linked to):
<table>
<tr>
<td>Chapter</td>
<td>Page</td>
</tr>
<tr>
<td>PREFACE</td>
<td>3</td>
</tr>
<tr>
<td>I. THE REVERSE UNIVERSE</td>
<td>9</td>
</tr>
<tr>
<td>II. REVERSIBLE LAWS</td>
<td>14</td>
</tr>
Other than the fact that it is ridiculously slow, this seems to be quite good at doing what it says it does. E.g. https://www.japanracing.de/Teilegutachten/Teilegutachten-JR1... : page 1 has a rowspan, page 29 a colspan.
I can comfortably run 27B models on my Mac and I'd much rather process my PDF library with something that is less prone to hallucinations and handles multiple languages better…
PixelPanda•7mo ago
Excited to share Nanonets-OCR-s, a powerful and lightweight (3B) VLM that converts documents into clean, structured Markdown. The model is trained to understand document structure and content context (tables, equations, images, plots, watermarks, checkboxes, etc.).
Key Features:
- LaTeX Equation Recognition: converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
- Image Descriptions for LLMs: describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.
- Signature Detection & Isolation: finds and tags signatures in scanned documents, outputting them in <signature> blocks.
- Watermark Extraction: extracts watermark text and stores it within a <watermark> tag for traceability.
- Smart Checkbox & Radio Button Handling: converts checkboxes to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.
- Complex Table Extraction: handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s
Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
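If you'd rather not use the notebook, a minimal local sketch along the lines of the generic Hugging Face image-text-to-text flow should also work (the prompt below is only illustrative; check the model card for the recommended instruction and generation settings):

```python
# Minimal local inference sketch; prompt text and file name are placeholders.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nanonets/Nanonets-OCR-s"
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("page.png")  # one document page rendered as an image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Convert this page to markdown."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=4096)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```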
michaelt•7mo ago
Honestly I was expecting the opposite - a repetition penalty kicking in after it repeated zero too many times, resulting in too few zeros - but apparently not. So you might want to steer clear of this model if your document has a trillion pages.
Other than that, it did a solid job - I've certainly seen worse attempts to OCR a table.
[1] https://imgur.com/a/8rJeHf8