frontpage.

UnAutomating the Economy: More Labor but at What Cost?

https://www.greshm.org/blog/unautomating-the-economy/
1•Suncho•5m ago•1 comments

Show HN: Gettorr – Stream magnet links in the browser via WebRTC (no install)

https://gettorr.com/
1•BenaouidateMed•6m ago•0 comments

Statin drugs safer than previously thought

https://www.semafor.com/article/02/06/2026/statin-drugs-safer-than-previously-thought
1•stareatgoats•8m ago•0 comments

Handy when you just want to distract yourself for a moment

https://d6.h5go.life/
1•TrendSpotterPro•9m ago•0 comments

More States Are Taking Aim at a Controversial Early Reading Method

https://www.edweek.org/teaching-learning/more-states-are-taking-aim-at-a-controversial-early-read...
1•lelanthran•11m ago•0 comments

AI will not save developer productivity

https://www.infoworld.com/article/4125409/ai-will-not-save-developer-productivity.html
1•indentit•16m ago•0 comments

How I do and don't use agents

https://twitter.com/jessfraz/status/2019975917863661760
1•tosh•22m ago•0 comments

BTDUex Safe? The Back End Withdrawal Anomalies

1•aoijfoqfw•24m ago•0 comments

Show HN: Compile-Time Vibe Coding

https://github.com/Michael-JB/vibecode
5•michaelchicory•27m ago•1 comments

Show HN: Ensemble – macOS App to Manage Claude Code Skills, MCPs, and Claude.md

https://github.com/O0000-code/Ensemble
1•IO0oI•30m ago•1 comments

PR to support XMPP channels in OpenClaw

https://github.com/openclaw/openclaw/pull/9741
1•mickael•31m ago•0 comments

Twenty: A Modern Alternative to Salesforce

https://github.com/twentyhq/twenty
1•tosh•32m ago•0 comments

Raspberry Pi: More memory-driven price rises

https://www.raspberrypi.com/news/more-memory-driven-price-rises/
1•calcifer•38m ago•0 comments

Level Up Your Gaming

https://d4.h5go.life/
1•LinkLens•42m ago•1 comments

Di.day is a movement to encourage people to ditch Big Tech

https://itsfoss.com/news/di-day-celebration/
3•MilnerRoute•43m ago•0 comments

Show HN: AI generated personal affirmations playing when your phone is locked

https://MyAffirmations.Guru
4•alaserm•44m ago•3 comments

Show HN: GTM MCP Server- Let AI Manage Your Google Tag Manager Containers

https://github.com/paolobietolini/gtm-mcp-server
1•paolobietolini•45m ago•0 comments

Launch of X (Twitter) API Pay-per-Use Pricing

https://devcommunity.x.com/t/announcing-the-launch-of-x-api-pay-per-use-pricing/256476
1•thinkingemote•45m ago•0 comments

Facebook seemingly randomly bans tons of users

https://old.reddit.com/r/facebookdisabledme/
1•dirteater_•47m ago•1 comments

Global Bird Count Event

https://www.birdcount.org/
1•downboots•47m ago•0 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
2•soheilpro•49m ago•0 comments

Jon Stewart – One of My Favorite People – What Now? with Trevor Noah Podcast [video]

https://www.youtube.com/watch?v=44uC12g9ZVk
2•consumer451•51m ago•0 comments

P2P crypto exchange development company

1•sonniya•1h ago•0 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
2•jesperordrup•1h ago•0 comments

Write for Your Readers Even If They Are Agents

https://commonsware.com/blog/2026/02/06/write-for-your-readers-even-if-they-are-agents.html
1•ingve•1h ago•0 comments

Knowledge-Creating LLMs

https://tecunningham.github.io/posts/2026-01-29-knowledge-creating-llms.html
1•salkahfi•1h ago•0 comments

Maple Mono: Smooth your coding flow

https://font.subf.dev/en/
1•signa11•1h ago•0 comments

Sid Meier's System for Real-Time Music Composition and Synthesis

https://patents.google.com/patent/US5496962A/en
1•GaryBluto•1h ago•1 comments

Show HN: Slop News – HN front page now, but it's all slop

https://dosaygo-studio.github.io/hn-front-page-2035/slop-news
7•keepamovin•1h ago•1 comments

Show HN: Empusa – Visual debugger to catch and resume AI agent retry loops

https://github.com/justin55afdfdsf5ds45f4ds5f45ds4/EmpusaAI
1•justinlord•1h ago•0 comments

Nanonets-OCR-s – OCR model that transforms documents into structured markdown

https://huggingface.co/nanonets/Nanonets-OCR-s
361•PixelPanda•7mo ago

Comments

PixelPanda•7mo ago
Full disclaimer: I work at Nanonets

Excited to share Nanonets-OCR-s, a powerful and lightweight (3B) VLM model that converts documents into clean, structured Markdown. This model is trained to understand document structure and content context (like tables, equations, images, plots, watermarks, checkboxes, etc.). Key Features:

LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.

Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.

Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.

Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.

Smart Checkbox & Radio Button Handling: Converts checkboxes to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.

Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.

Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s

Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
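If anyone wants a quick local test outside the demo, here is a minimal sketch using Hugging Face transformers. The prompt text and generation settings are my own assumptions for illustration, so check the model card above for the exact recommended prompt and pre-processing:

    # Minimal sketch: run Nanonets-OCR-s on a single page image via transformers.
    # Prompt wording and generation settings are assumptions; see the model card.
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "nanonets/Nanonets-OCR-s"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

    image = Image.open("page.png")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this document page to structured markdown."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
    # Drop the prompt tokens so only the generated markdown is decoded.
    generated = output_ids[:, inputs["input_ids"].shape[1]:]
    print(processor.batch_decode(generated, skip_special_tokens=True)[0])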

mvac•7mo ago
Correct link for Docext: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
generalizations•7mo ago
Does it have a way to extract the images themselves, or is that still a separate process later?
j45•7mo ago
If you are after extracting images from pdfs there’s plenty of tools that do that just fine without LLMs.
generalizations•7mo ago
I mean, ideally it would be in context, so the generated markdown references the correct image at the correct location in the doc. Unless that's what you're talking about? In which case I don't know about those tools.
RicoElectrico•7mo ago
Could it be used (maybe with the help of a downstream LLM) to parse a photo/PDF of a restaurant menu into a JSON file conforming to a schema? Or would bigger, hosted multimodal LLMs work better in such a case?
gibsonf1•7mo ago
Does it hallucinate with the LLM being used?
nattaylor•7mo ago
The base model is Qwen2.5-VL-3B and the announcement says a limitation is "Model can suffer from hallucination"
gibsonf1•7mo ago
Seems a bit scary that the "source" text from the pdfs could actually be hallucinated.
prats226•7mo ago
Given that the input is an image and not a raw PDF, it's not completely unexpected.
michaelt•7mo ago
Sometimes. I just fed the huggingface demo an image containing some rather improbable details [1] and it OCRed "Page 1000000000000" with one extra trailing zero.

Honestly I was expecting the opposite - a repetition penalty to kick in having repeated zero too many times, resulting in too few zeros - but apparently not. So you might want to steer clear of this model if your document has a trillion pages.

Other than that, it did a solid job - I've certainly seen worse attempts to OCR a table.

[1] https://imgur.com/a/8rJeHf8

wisdomseaker•7mo ago
Would any of this be able to handle magazine layouts? I've yet to find anything that can follow their fairly random layouts with text at varying angles etc
arkh•7mo ago
So it feels like this will finally let me do one thing I've wanted for some time: scan printed documents and generate structured PDFs (and not PDFs that are just picture containers).
uselesswords•7mo ago
Have you found it has better accuracy or scales with larger models? Or are the improvements, if any, marginal compared to the 3B VLM model?
silversmith•7mo ago
I'm curious, how does it do with non-english texts? It's my understanding that LLM-based OCR solutions fall way behind traditional ones once you introduce other languages.
wickedsight•7mo ago
Understanding or experience?

Because my experience is not at all like that. If I use both Google Translate and ChatGPT on an image, ChatGPT is pretty much always better. It can even translate Japanese hand written menus quite well. With the added benefit of it being able to add context and explain what the dishes are.

silversmith•7mo ago
I'm passively interested in small, local LLM OCR, due to a couple of ideas kicking around between my ears. I tried some models a while ago, but most of my recent knowledge is second-hand. I'm waiting for someone to exclaim "hey, this works now!" before committing more time :)

With the big commercial offerings like chatgpt I'd fully expect them to work fine, due to the absolutely massive horsepower in use.

raus22•7mo ago
With models like these, when multilingual support is not mentioned, they tend to perform really badly on real-life non-English PDFs.
souvik3333•7mo ago
The model was primarily trained on English documents, which is why English is listed as the main language. However, the training data did include a smaller proportion of Chinese and various European languages. Additionally, the base model (Qwen-2.5-VL-3B) is multilingual. Someone on Reddit mentioned it worked on Chinese: https://www.reddit.com/r/LocalLLaMA/comments/1l9p54x/comment...
progval•7mo ago
It's not open-source (nor open-weight): https://huggingface.co/nanonets/Nanonets-OCR-s/discussions/2
souvik3333•7mo ago
Hi, author of the model here. It is an open-weight model, you can download it from here: https://huggingface.co/nanonets/Nanonets-OCR-s
gardnr•7mo ago
Interestingly, another OCR model based on Qwen2.5-VL-3B just dropped which is also published as Apache 2.0. It's right next to Nanonets-OCR-s on the HF "Trending" list.

https://huggingface.co/echo840/MonkeyOCR/blob/main/Recogniti...

CaptainFever•7mo ago
IMO weights being downloadable doesn't mean it's open weight.

My understanding:

    - Weight available: You can download the weights.
    - Open weight: You can download the weights, and it is licensed freely (e.g. public domain, CC BY-SA, MIT).
    - Open source: (Debated) You can download the weights, it is licensed freely, and the training dataset is also available and licensed freely.
For context:

> You're right. The Apache-2.0 license was mistakenly listed, and I apologize for the confusion. Since it's a derivative of Qwen-2.5-VL-3B, it will have the same license as the base model (Qwen RESEARCH LICENSE AGREEMENT). Thanks for pointing this out.

tensor•7mo ago
There are no benchmarks or accuracy measures on a hold out set?
souvik3333•7mo ago
Hi, author of the model here..

We have a benchmark for evaluating VLMs on document understanding tasks: https://idp-leaderboard.org/ . But unfortunately, it does not include image-to-markdown as a task. The problem with evaluating image-to-markdown is that even if the order of two blocks differs, the output can still be correct. E.g., if an image has seller info and buyer info side by side, one model can extract the seller info first and another model can extract the buyer info first. Both models are correct, but depending on the ground truth, fuzzy matching will score one model higher than the other.

Normally, a company will train and test on a dataset annotated in a single consistent order (either left block first or right block first), and all other models can get a low score on their benchmark because they were trained on the opposite annotation order.
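A toy illustration of that ordering problem, with hypothetical strings and plain difflib fuzzy matching (this is just a sketch, not how the leaderboard scores anything):

    # Toy example: two correct extractions of side-by-side blocks, scored with a
    # naive order-sensitive fuzzy match (difflib). The strings are hypothetical.
    from difflib import SequenceMatcher

    ground_truth = "Seller: Acme Corp\nBuyer: Jane Doe"   # annotated left block first
    model_a = "Seller: Acme Corp\nBuyer: Jane Doe"        # extracts left block first
    model_b = "Buyer: Jane Doe\nSeller: Acme Corp"        # extracts right block first

    def score(pred, ref):
        return SequenceMatcher(None, pred, ref).ratio()

    print(score(model_a, ground_truth))  # 1.0, only because the order matches
    print(score(model_b, ground_truth))  # well below 1.0, despite identical content

    # One mitigation: normalize block order before scoring, e.g. sort the blocks,
    # so the metric measures content rather than reading order.
    def normalized_score(pred, ref):
        canon = lambda s: "\n".join(sorted(s.splitlines()))
        return SequenceMatcher(None, canon(pred), canon(ref)).ratio()

    print(normalized_score(model_b, ground_truth))  # 1.0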

krapht•7mo ago
If this is the only issue, can't this be addressed by normalizing the post-processed data before scoring? (that is, if it really is just a matter of block ordering)
tensor•7mo ago
The more important thing to me with any VLM is base OCR performance and hallucinations. It's not too hard to get improved average accuracy on very low quality scans using language models. Unfortunately these also typically produce large numbers of hallucinations, which are a deal breaker if you are trying to get out values for financial or legal purposes.

OCR that has lower accuracy, but where the inaccurate parts are left blank or flagged, is far superior. Mistral OCR also suffers from this problem.

If your OCR produced bounding boxes for every text line, and ran a traditional OCR on the text, this could alleviate it. Or at the very least bounding boxes let users cross-correlate with output from traditional OCR.

Also a small note, it's probably best not to say your product beats Mistral when it's not even tested against it. Having more features doesn't make a product better if the accuracy is not better on those features.

I don't mean to be discouraging, this is an important space and it looks like you have a very feature rich model. I'd like to see a good solution be developed!
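For what it's worth, here is a rough sketch of that cross-correlation idea, with pytesseract standing in for the traditional OCR and stdlib difflib for the matching; the 0.6 threshold is an arbitrary assumption:

    # Sketch: flag lines of VLM output that no traditional-OCR line corroborates,
    # so likely hallucinations can be sent for human review.
    from difflib import SequenceMatcher
    from PIL import Image
    import pytesseract

    def suspicious_lines(vlm_markdown, page_image_path, threshold=0.6):
        tess_lines = [l.strip()
                      for l in pytesseract.image_to_string(Image.open(page_image_path)).splitlines()
                      if l.strip()]
        flagged = []
        for line in (l.strip() for l in vlm_markdown.splitlines() if l.strip()):
            best = max((SequenceMatcher(None, line, t).ratio() for t in tess_lines),
                       default=0.0)
            if best < threshold:
                flagged.append(line)  # no traditional-OCR line supports this text
        return flagged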

Eisenstein•7mo ago
How does it do with handwriting?
souvik3333•7mo ago
We have not trained explicitly on handwriting datasets (completely handwritten documents), but there is a lot of form data with handwriting present in the training set. So do try it on your files; there is a Hugging Face demo where you can quickly test: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s

We are currently working on creating completely handwritten document datasets for our next model release.

Eisenstein•7mo ago
Document:

* https://imgur.com/cAtM8Qn

Result:

* https://imgur.com/ElUlZys

Perhaps it needed more than 1K tokens? But it took about an hour (number 28 in queue) to generate that and I didn't feel like trying again.

How many tokens does it usually take to represent a page of text with 554 characters?

souvik3333•7mo ago
Hey, the reason for the long processing time is that lots of people are using it, probably with larger documents. I tested your file locally and it seems to be working correctly: https://ibb.co/C36RRjYs

Regarding the token limit, it depends on the text. We are using the qwen-2.5-vl tokenizer in case you are interested in reading about it.

You can run it very easily in a Colab notebook. This should be faster than the demo https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...

There are incorrect words in the extraction, so I would suggest waiting for the handwritten-text model's release.
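If you want a rough idea of the token count for your page, you can run your transcript through the base model's tokenizer yourself. A sketch (the file name is a placeholder; this counts only text tokens, since the image is encoded as separate vision tokens):

    # Rough text-token count using the base model's tokenizer.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
    page_text = open("transcript.txt").read()  # e.g. your 554-character ground truth
    print(len(tokenizer.encode(page_text)))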

mdaniel•7mo ago
> I tested your file locally seems to be working correctly

Apologies if there's some unspoken nuance in this exchange, but by "working correctly" did you just mean that it ran to completion? I don't even recognize some of the unicode characters that it emitted (or maybe you're using some kind of strange font, I guess?)

Don't misunderstand me, a ginormous number of floating point numbers attempting to read that handwriting is already doing better than I can, but I was just trying to understand if you thought that outcome is what was expected

Eisenstein•7mo ago
It actually did a decent job. Perhaps the font is weird? For reference, here is the 'ground truth' content, not in markdown:

Page# 8

Log: MA 6100 2.03.15

34 cement emitter resistors - 0.33R 5W 5% measure 0.29R 0.26R

35 replaced R436, R430 emitter resistors on R-chn P.O. brd w/new WW 5W .33R 5% w/ ceramic lead insulators

36 applied de-oxit d100 to speaker outs, card terminals, terminal blocks, output trans jacks

37 replace R-chn drivers and class A BJTs w/ BD139/146, & TIP31AG

38 placed boards back in

39 desoldered grnd lug from volume control

40 contact cleaner, Deoxit D5, faderlube on pots & switches teflon lube on rotor joint

41 cleaned ground lug & resoldered, reattached panel

souvik3333•7mo ago
This is the result:

```
Page 1 of 1 Page # <page_number>8</page_number>

Log: MA 6100 Z. O 3. 15

<table>
<tr> <td>34</td> <td>cement emitter resistors -</td> </tr>
<tr> <td></td> <td>0.33 R SW 5% measure</td> </tr>
<tr> <td></td> <td>0.29 R, 0.26 R</td> </tr>
<tr> <td>35</td> <td>replaced R'4 36, R4 30</td> </tr>
<tr> <td></td> <td>emitter resistor on R-44</td> </tr>
<tr> <td></td> <td>0.0. 3rd w/ new WW 5W .33R</td> </tr>
<tr> <td>36</td> <td>% w/ ceramic lead insulators</td> </tr>
<tr> <td></td> <td>applied de-oat d100 to Speak</td> </tr>
<tr> <td></td> <td>outs, card terminals, terminal</td> </tr>
<tr> <td></td> <td>blocks, output tran jacks</td> </tr>
<tr> <td>37</td> <td>replace &-clun diviers</td> </tr>
<tr> <td></td> <td>and class A BJTs w/ BD139/140</td> </tr>
<tr> <td></td> <td>& TIP37A2</td> </tr>
<tr> <td>38</td> <td>placed boards back in</td> </tr>
<tr> <td>39</td> <td>desoldered ground lus from volume</td> </tr>
<tr> <td></td> <td>(con 48)</td> </tr>
<tr> <td>40</td> <td>contact cleaner, Deox. t DS, facel/42</td> </tr>
<tr> <td></td> <td>on pots & switches</td> </tr>
<tr> <td></td> <td>· teflon lube on rotor joint</td> </tr>
<tr> <td>41</td> <td>reably cleaned ground lus &</td> </tr>
<tr> <td></td> <td>resoldered, reattatched panel</td> </tr>
</table>
```

You can paste it in https://markdownlivepreview.com/ and see the extraction. This is using the Colab notebook I have shared before.

Which Unicode characters are you mentioning here?

mvac•7mo ago
How does it compare to Datalab/Marker https://github.com/datalab-to/marker ? We evaluated many PDF->MD converters and this one performed the best, though it is not perfect.
wittjeff•7mo ago
I am just getting started with my own cross-comparison; I would appreciate your list of considered candidates if you have it handy.
nxobject•7mo ago
As anecdotal evidence, it serves my complex-enough purposes very well - mathematics and code interspersed together. One of my "litmus test" papers is this old paper on a Fortran inverse-Laplace transform algorithm [1] that intersperses inline and display equations, and monospace code blocks, while requiring OCR from scratch, and very few models currently do a satisfactory job, i.e. in the following page transcribed by Marker,

https://imgur.com/a/Q7UYIfW

the inline $\sigma_0$ is mangled as "<sup>s</sup> 0", and $f(t)$ is mangled as "f~~t*!". The current model gets them both correct.

vikp•7mo ago
Hi, author of marker here - I tried your image, and I don't see the issues you're describing with the newest version of marker (1.7.5).

I ran both with no setting specified, and with force_ocr, and I didn't see the issues either time.

nxobject•7mo ago
Hi there - thanks for getting back to me. I do genuinely want this workflow to work - Marker has been very useful for other purposes for me!

I’m currently using the Datalab online playground with default settings - does that enable inline math recognition?

vikp•7mo ago
I assume you're using a PDF, and not the image you shared? You need to set force ocr or format lines to get inline math with a PDF (for images, we just OCR everything anyways, so you don't need any settings).

We're working on improving the playground generally now - expect a big update tomorrow, which among other things will default to format lines.

Thanks for the kind words! The team was just me until pretty recently, but we're growing quickly and will be addressing a lot of issues quickly in the next few weeks.

nxobject•7mo ago
Perfect - it works! Yes, I’m glad for all the time you’ve spent on this project: one of my ulterior goals is to make technical documentation for old systems and their programming environments accessible to LLMs, so that programming in retro computing can benefit from the advances in productivity that modern languages have. I’m sure you’ll find plenty of other user stories like that :)
ks2048•7mo ago
It’s a shame all these models target markdown and not something with more structure and a specification. There are different flavors of Markdown and limited support for footnotes, references, figures, etc.
souvik3333•7mo ago
Actually, we have trained the model to convert to markdown and do semantic tagging at the same time. E.g., equations will be extracted as LaTeX equations, and images (plots, figures, and so on) will be described within `<img>` tags. Same with `<signature>`, `<watermark>`, and `<page_number>`.

Also, we extract the tables as HTML tables instead of markdown for complex tables.
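If you want to consume those tags downstream, they are easy to pull back out of the generated markdown. A small sketch (the regexes here are illustrative, not an official parser):

    # Sketch: extract the semantic tags from Nanonets-OCR-s markdown output.
    import re

    def extract_tags(markdown):
        grab = lambda tag: re.findall(rf"<{tag}>(.*?)</{tag}>", markdown, re.DOTALL)
        return {
            "images": grab("img"),              # natural-language image descriptions
            "signatures": grab("signature"),
            "watermarks": grab("watermark"),
            "page_numbers": grab("page_number"),
            "tables_html": re.findall(r"<table>.*?</table>", markdown, re.DOTALL),
        }

The HTML tables can then be handed to a proper HTML parser (e.g. pandas.read_html) instead of being re-parsed as markdown.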

jtbayly•7mo ago
What happens to footnotes?
souvik3333•7mo ago
They will be extracted on a new line as normal text; it will be the last line.
jtbayly•7mo ago
So I’m left to manually link them up?

Have you considered using something like Pandoc’s method of marking them up? Footnotes are a fairly common part of scanned pages, and markdown that doesn’t indicate that a footnote is a footnote can be fairly incomprehensible.

agoose77•7mo ago
I am lazily posting this all over the thread, but do check out MyST Markdown too! https://mystmd.org. We handle footnotes as a structured object.
mgr86•7mo ago
Have you considered XML? TEI, for example, is very robust and mature for marking up documents.
esafak•7mo ago
First I heard of it. https://en.wikipedia.org/wiki/Text_Encoding_Initiative
mgr86•7mo ago
Understandable. I work in academic publishing, and while the "XML is everywhere" crowd is graying, retiring, or even dying :( it still remains an excellent option for document markup. Additionally, a lot of government data produced in the US and EU makes heavy use of XML technologies; I imagine they could be interested consumers of Nanonets-OCR. TEI could be a good choice, as well-tested and well-developed conversions exist to other popular, less structured formats.
jxramos•7mo ago
maybe even epub, which is xhtml
agoose77•7mo ago
Do check out MyST Markdown (https://mystmd.org)! Academic publishing is a space where MyST is being used, for example by https://www.elementalmicroscopy.com/ via Curvenote.

(I'm a MyST contributor)

viraptor•7mo ago
Do you know why MyST got traction, instead of RST, which seems to have all the custom tagging and extensibility built in from the beginning?
agoose77•7mo ago
MyST Markdown (the MD flavour, not the same-named Document Engine) was inspired by ReST. It was created to address the main pain-point of ReST for incoming users (it's not Markdown!).

As a project, the tooling to parse MyST Markdown was built on top of Sphinx, which primarily expects ReST as input. Now, I would not be surprised if most _new_ Sphinx users are using MyST Markdown (but I have no data there!)

Subsequently, the Jupyter Book project that built those tools has pivoted to building a new document engine that's better focused on the use-cases of our audience and leaning into modern tooling.

lukev•7mo ago
Yeah this really hurts. If your goal is to precisely mark up a document with some structural elements, XML is strictly superior to Markdown.

The fact that someone would go to all the work to build a model to extract the structure of documents, then choose an output format strictly less expressive than XML, speaks poorly of the state of cross-generational knowledge sharing within the industry.

prats226•7mo ago
I think the choice mainly stems from how you want to use the output. If the output is going to be fed to another LLM, then you want to select a markup language where 1) the grammar does not cause too many issues with tokenization, 2) the LLM has seen a lot of it in the past, and 3) it generates a minimal number of tokens. I think markdown fits that much better than other markup languages.

If the goal is to parse this output programmatically, then I agree a more structured markup language is a better choice.
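It's easy to sanity-check the token-count part of that for your own documents: tokenize the same table in both representations. A sketch, using the GPT-2 tokenizer purely as a stand-in for whatever tokenizer the downstream LLM uses:

    # Compare token counts of the same small table as markdown vs. HTML.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")

    md_table = "| Chapter | Page |\n|---|---|\n| Preface | 3 |\n| The Reverse Universe | 9 |"
    html_table = ("<table><tr><td>Chapter</td><td>Page</td></tr>"
                  "<tr><td>Preface</td><td>3</td></tr>"
                  "<tr><td>The Reverse Universe</td><td>9</td></tr></table>")

    print("markdown tokens:", len(tok.encode(md_table)))
    print("html tokens:", len(tok.encode(html_table)))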

starkparker•7mo ago
I was more excited to hear about "structured Markdown" than the LLM OCR model, but the extent of it just seems to be tagging certain elements. It's useful in the LLM context but not as much outside of it.
agoose77•7mo ago
Feel free to check out MyST Markdown, which very much aims to specify "structured Markdown": https://mystmd.org
el_don_almighty•7mo ago
I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.

Now I need a catalog, archive, or historian function that archives and pulls the elements easily. Amazing work!

pxc•7mo ago
Can't you just start with unoconv or pandoc, then maybe use an LLM to clean up after converting to plain text?
toledocavani•7mo ago
Which decade? DOCX and PPTX are just zipped XML; that seems pretty standard to me.
constantinum•7mo ago
It would be interesting to know how it compares with Llamaparse, LLMWhisperer, Marker, Reducto
prats226•7mo ago
Unfortunately my Reducto account was disabled right after this launch, but I will be uploading benchmarks for the rest at https://idp-leaderboard.org/
nehalem•7mo ago
How does it do with multi-column text and headers and footers?
souvik3333•7mo ago
We have trained the model on tables with hierarchical column headers and with rowspan and colspan >1. So it should work fine. This is the reason we predict the table in HTML instead of markdown.
nehalem•7mo ago
Thank you. I was rather thinking of magazine like layouts with columns of text and headers and footers on every page holding article title and page number.
souvik3333•7mo ago
It should work there also. We have trained on research papers with two columns of text. Generally, papers have references in the footer and contain page numbers.
kordlessagain•7mo ago
I created a Powershell script to run this locally on any PDF: https://gist.github.com/kordless/652234bf0b32b02e39cef32c71e...

It does work, but it is very slow on my older GPU (Nvidia 1080 8GB). I would say it's taking at least 5 minutes per page right now, but maybe more.

Edit: If anyone is interested in trying a PDF-to-markdown conversion utility built on this and hosted on Cloud Run (with GPU support), let me know. It should be done in about an hour or so and I will post a link here when it's done.

kordlessagain•7mo ago
Reporting back on this, here's some sample output from https://www.sidis.net/animate.pdf:

  THE ANIMATE
  AND THE INANIMATE

  WILLIAM JAMES SIDIS

  <img>A black-and-white illustration of a figure holding a book with the Latin phrase "ARTI et VERITATI" below it.</img>

  BOSTON

  RICHARD G. BADGER, PUBLISHER

  THE GORHAM PRESS

  Digitized by Google
I haven't seen ANY errors in what it has done, which is quite impressive.

Here, it's doing tables of contents (I used a slightly different copy of the PDF than I linked to):

  <table>
    <tr>
      <td>Chapter</td>
      <td>Page</td>
    </tr>
    <tr>
      <td>PREFACE</td>
      <td>3</td>
    </tr>
    <tr>
      <td>I. THE REVERSE UNIVERSE</td>
      <td>9</td>
    </tr>
    <tr>
      <td>II. REVERSIBLE LAWS</td>
      <td>14</td>
    </tr>
Other than the fact it is ridiculously slow, this seems to be quite good at doing what it says it does.
2pointsomone•7mo ago
Very very interested!
kordlessagain•7mo ago
Ok, I have it built but things came up and I'm testing this morning (probably still broken but the code is all there):

https://github.com/kordless/gnosis-ocr

ZQ-Dev8•7mo ago
How's this compare with docling (https://github.com/docling-project/docling)?
temp0826•7mo ago
I have a Shipibo (indigenous Peruvian language) to Spanish dictionary that I've been trying to translate into a Shipibo-to-English dictionary using a couple of different LLMs, but I keep struggling with the formatting (two columns, strange line breaks, and the mix of Shipibo and Spanish in the definitions makes it hard to grok), all on top of a pretty poor scan. May need to give this a try.
Bestora•7mo ago
How does it handle documents with multi-column or multi-row tables?

e.g. https://www.japanracing.de/Teilegutachten/Teilegutachten-JR1... (page 1: rowspan, page 29: colspan)

CMCDragonkai•7mo ago
Can this work on diagrams? Like boxes and lines?
jwr•7mo ago
Thank you! This is very interesting — I'm just curious, why use such a small model?

I can comfortably run 27B models on my Mac and I'd much rather process my PDF library with something that is less prone to hallucinations and handles multiple languages better…

nnurmanov•7mo ago
Are there benchmarks for these kinds of tools? How does it handle tables? Different languages?
huqedato•7mo ago
Can it extract data from scientific graphs like bar charts, time series, etc.?