The general field is called "document structure analysis" or "document layout analysis." There's been lots of work on it; from a cursory glance at this article, I'm not sure they've discussed that literature.
I worked on a similar problem a decade or so ago, although our work was done mostly by hand. We were trying not only to read in (bilingual) dictionaries using OCR, but to turn them into dictionary entries and then parse each entry into its parts (headword, part of speech, definitions or glosses, example sentences, subentries...). I won't go into details, but to our surprise one of the most difficult things for the machine to get right was recognizing bold or italicized text.
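To make the target concrete, here is a rough Python sketch of the kind of entry structure such a parse aims to produce. The field names are purely illustrative, not our actual schema:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Sense:
        part_of_speech: str                              # e.g. "n." or "v."
        glosses: List[str]                               # definitions or translations
        examples: List[str] = field(default_factory=list)

    @dataclass
    class Entry:
        headword: str
        senses: List[Sense] = field(default_factory=list)
        subentries: List["Entry"] = field(default_factory=list)

    # A hand-built example of what one parsed entry might look like:
    entry = Entry(
        headword="run",
        senses=[Sense(part_of_speech="v.",
                      glosses=["to move quickly on foot"],
                      examples=["She runs every morning."])],
    )

Getting from raw OCR output to something like this is where the bold/italic problem bites: in many dictionaries the typeface is the only cue that separates a headword from a gloss or an example.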
While this isn't something I need on a regular basis, it's nice to hear about someone making progress on what seems like it ought to be a straightforward problem to solve. As the results of my own efforts show, it must not be nearly as simple as one might expect.
Solutions using things like img2table or pymupdf are really bad (pymupdf is not even reliable for text-based PDFs). Hand-crafting the extraction around your particular dataset is the only way to get high performance.
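For context, the kind of baseline people usually hand-craft on top of looks roughly like the sketch below (assuming a recent PyMuPDF, 1.23+, where find_tables() exists; the file name is made up). How usable the output is depends heavily on the PDF:

    import fitz  # PyMuPDF

    doc = fitz.open("example.pdf")   # hypothetical file
    page = doc[0]

    # Text blocks come back as (x0, y0, x1, y1, text, block_no, block_type).
    for x0, y0, x1, y1, text, _, block_type in page.get_text("blocks"):
        if block_type == 0:          # 0 = text block, 1 = image block
            print(round(x0), round(y0), text.strip()[:60])

    # Built-in table detection (PyMuPDF >= 1.23); it often misses or mangles
    # tables without ruled lines, which is the complaint above.
    for table in page.find_tables().tables:
        print(table.extract())

In practice the hand-crafted part is everything that comes after this: merging blocks into columns, re-ordering reading flow, and patching the cases the library gets wrong for your specific documents.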
JKCalhoun•4mo ago
It was surfaced in iOS a decade ago as the "tap to zoom" feature for PDFs. It's funny: as with a lot of things, there was a lot of sophisticated engineering under the hood, and then marketing simply wanted it to detect a tap on a paragraph and zoom to its bounds (essentially the hit test sketched below).
I can't think of the last time I read a PDF on my phone or I would test it to see if it still works as I remember.
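I don't know what Apple actually shipped, but the basic effect is easy to approximate with PyMuPDF: hit-test the tap point against the page's text block rectangles and zoom the viewport to whichever block contains it. A rough sketch, with made-up file name and tap coordinates:

    import fitz  # PyMuPDF

    def block_at_tap(page, x, y):
        """Return the bounding rect of the text block containing the tap point,
        or None if the tap landed outside every block."""
        tap = fitz.Point(x, y)
        for x0, y0, x1, y1, text, _, block_type in page.get_text("blocks"):
            rect = fitz.Rect(x0, y0, x1, y1)
            if block_type == 0 and rect.contains(tap):
                return rect
        return None

    doc = fitz.open("example.pdf")          # hypothetical file
    rect = block_at_tap(doc[0], 200, 300)   # made-up tap coordinates
    if rect:
        print("zoom viewport to", rect)     # a real app would animate to this rect

The hard part, of course, is that the "blocks" have to be right in the first place, which is exactly the layout-analysis problem the rest of this thread is about.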
jez•4mo ago
But for PDFs that are otherwise really hard to read on a phone, it's a really nice investment.