PDF to Text, a challenging problem

https://www.marginalia.nu/log/a_119_pdf/
113•ingve•3h ago

Comments

rad_gruchalski•3h ago
So many of these problems have been solved by mozilla pdf.js together with its viewer implementation: https://mozilla.github.io/pdf.js/.
zzleeper•2h ago
Any sense of how PDF.js compares against other tools such as pdfminer?
rad_gruchalski•1h ago
I don’t know. I use pdf.js for everything PDF.
egnehots•2h ago
I don't think so. pdf.js is able to render a PDF's content.

Which is different from extracting "text". Text in a PDF can be encoded in many ways: in an actual image, in shapes (think segments, quadratic Bézier curves...), or in an XML format (really easy to process).

PDF viewers render text the way a printer would, processing commands that end up as pixels on the screen.

But paragraphs, text layout, columns, and tables are often lost in the process, even though you can see them: so close, yet so far. That is why AI is quite strong at this task.

lionkor•2h ago
Correct me if I'm wrong, but pdf.js actually has a lot of methods to manipulate PDFs, no?
rad_gruchalski•1h ago
You are wrong. Pdf.js can extract text and has all facilities required to render and extract formatting. The latest version can also edit PDF files. It’s basically the same engine as the Firefox PDF viewer. Which also has a document outline, search, linking, print preview, scaling, scripting sandbox… it does not simply „render” a file.

Regarding tables, this here https://www.npmjs.com/package/pdf-table-extractor does a very good job at table interpretation and works on top of pdf.js.

I also didn’t say what works better or worse, neither do I go into PDF being good or bad.

I simply said that a ton of problems have been covered by

iAMkenough•1h ago
A good PDF reader makes the problems easier to deal with, but does not solve the underlying issue.

The PDF itself is still flawed, even if pdf.js interprets it perfectly, which is still a problem for non-pdf.js viewers and tasks where "viewing" isn't the primary goal.

bartread•2h ago
Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where tables are simply a loose assemblage of graphical and text elements that, only when rendered, are easily discernible as a table because they're positioned in such a way that they render as a table.

I've actually had decent luck extracting tabular data from PDFs by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out columns and extract values for each row.

It's kind of groaty but it seems reliable for what I need. Certainly much more so than going via formatted plaintext, which has issues with inconsistent spacing and the insertion of newlines into the middle of rows.
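
A rough Python sketch of the column step described above. It is an illustration only, not the commenter's code: it assumes the positioned text has already been pulled out of the converted HTML as tuples, and the names `assign_columns` and `tolerance` are hypothetical.

```python
from collections import defaultdict

def assign_columns(header_cells, value_cells, tolerance=5.0):
    """Group positioned text into table rows using the header's x-coordinates.

    header_cells: list of (x, text) for the table header.
    value_cells:  list of (x, y, text) for the cells below the header.
    Both are assumed inputs; extracting them from the HTML is not shown.
    """
    rows = defaultdict(dict)
    for x, y, text in value_cells:
        # Pick the header whose left edge is closest to this cell's left edge.
        col_x, col_name = min(header_cells, key=lambda h: abs(h[0] - x))
        if abs(col_x - x) <= tolerance:
            rows[y][col_name] = text
    # Rows come back top to bottom (smaller y first, as in HTML coordinates).
    return [rows[y] for y in sorted(rows)]

# Made-up sample data for illustration.
header = [(50, "Item"), (200, "Qty"), (300, "Price")]
cells = [(50, 120, "Widget"), (201, 120, "2"), (299, 120, "9.50"),
         (50, 140, "Gadget"), (200, 140, "1"), (300, 140, "4.00")]
table = assign_columns(header, cells)
print(table)
```

Matching on left edges is the simplest choice; right-aligned numeric columns, or rows whose cells sit at slightly different y values, would need a looser comparison.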

j45•2h ago
PDFs inherently are a markup / xml format, the standard is available to learn from.

It's possible to create the same PDF in many, many, many ways.

Some might lean towards exporting a layout containing text and graphics from a graphics suite.

Others might lean towards exporting text and graphics from a word processor, which is words first.

The lens of how the creating app deals with information is often something that has input on how the PDF is output.

If you're looking for an off the shelf utility that is surprisingly decent at pulling structured data from PDFs, tools like cisdem have already solved enough of it for local users. Lots of tools like this out there, many do promise structured data support but it needs to match what you're up to.

layer8•2h ago
> PDFs inherently are a markup / xml format

This is false. PDFs are an object graph containing imperative-style drawing instructions (among many other things). There’s a way to add structural information on top (akin to an HTML document structure), but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.

davidthewatson•1h ago
Thanks for your comment.

Indeed. Therein lies the rub.

Why?

Because no matter the fact that I've spent several years of my latent career crawling and parsing and outputting PDF data, I see now that pointing my LLM stack at a directory of *.pdf just makes the invisible encoding of the object graph visible. It's a skeptical science.

The key transclusion may be to move from imperative to declarative tools or conditional to probabilistic tools, as many areas have in the last couple decades.

I've been following Jon Sterling's OCaml work for a while on related topics and the ideas floating around have been a good influence on me in forests and their forester which I found resonant given my own experience:

https://www.jonmsterling.com/index/index.xml

https://github.com/jonsterling/forest

I was gonna email john and ask whether it's still being worked on as I hope so, but I brought it up this morning as a way out of the noise that imperative programming PDF has been for a decade or more where turtles all the way down to the low-level root cause libraries mean that the high level imperative languages often display the exact same bugs despite significant differences as to what's being intended in the small on top of the stack vs the large on the bottom of the stack. It would help if "fitness for a particular purpose" decisions were thoughtful as to publishing and distribution but as the CFO likes to say, "Dave, that ship has already sailed." Sigh.

¯\_(ツ)_/¯

j45•2h ago
Part of a problem being challenging is recognizing if it's new, or just new to us.

We get to learn a lot when something is new to us.. at the same time the untouchable parts of PDF to Text are largely being solved with the help of LLMs.

I built a tool to extract information from PDFs a long time ago, and the break through was having no ego or attachment to any one way of doing it.

Different solutions and approaches offered different depth or quality of solutions and organizing them to work together in addition to anything I built myself provided what was needed - one place where more things work.. than not.

xnx•2h ago
Weird that there's no mention of LLMs in this article even though the article is very recent. LLMs haven't solved every OCR/document data extraction problem, but they've dramatically improved the situation.
j45•2h ago
LLMs are definitely helping with some problems that couldn't be approached until now.
simonw•2h ago
I've had great results against PDFs from recent vision models. Gemini, OpenAI and Claude can all accept PDFs directly now and treat them as image input.

For longer PDFs I've found that breaking them up into images per page and treating each page separately works well - feeding a thousand page PDF to even a long context model like Gemini 2.5 Pro or Flash still isn't reliable enough that I trust it.

As always though, the big challenge of using vision LLMs for OCR (or audio transcription) tasks is the risk of accidental instruction following - even more so if there's a risk of deliberately malicious instructions in the documents you are processing.

marginalia_nu•2h ago
Author here: LLMs are definitely the new gold standard for smaller bodies of shorter documents.

The article is in the context of an internet search engine, the corpus to be converted is of order 1 TB. Running that amount of data through an LLM would be extremely expensive, given the relatively marginal improvement in outcome.

mediaman•2h ago
Corpus size doesn't mean much in the context of a PDF, given how variable that can be per page.

I've found Google's Flash to cut my OCR costs by about 95+% compared to traditional commercial offerings that support structured data extraction, and I still get tables, headers, etc from each page. Still not perfect, but costs were less than one tenth of a cent per page, and 100 GB collections of PDFs ran to a few hundred dollars.

constantinum•41m ago
True indeed, but there are a few problems — hallucinations and trusting the output(validation). More here https://unstract.com/blog/why-llms-struggle-with-unstructure...
svat•2h ago
One thing I wish someone would write is something like the browser's developer tools ("inspect elements") for PDF — it would be great to be able to "view source" a PDF's content streams (the BT … ET operators that enclose text, each Tj operator for setting down text in the currently chosen font, etc), to see how every “pixel” of the PDF is being specified/generated. I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains.

There are a few tools that allow inspecting a PDF's contents (https://news.ycombinator.com/item?id=41379101) but they stop at the level of the PDF's objects, so entire content streams are single objects. For example, to use one of the PDFs mentioned in this post, the file https://bfi.uchicago.edu/wp-content/uploads/2022/06/BFI_WP_2... has, corresponding to page number 6 (PDF page 8), a content stream that starts like (some newlines added by me):

    0 g 0 G
    0 g 0 G
    BT
    /F19 10.9091 Tf 88.936 709.041 Td
    [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
    -16.936 -21.922 Td
    [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
    0 -21.923 Td
and it would be really cool to be able to see the above “source” and the rendered PDF side-by-side, hover over one to see the corresponding region of the other, etc, the way we can do for an HTML page.
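
To make the `TJ` arrays above a little less opaque, here is a small Python sketch that turns one back into readable text. In a `TJ` array the numbers are position adjustments in thousandths of a text-space unit, and large negative values move the next glyph to the right, which in this sample is how word gaps are encoded. The function name `tj_to_text` and the `-200` cutoff are assumptions chosen to fit this one sample, and the sketch assumes the string bytes are plain ASCII, which depends on the font's encoding and is often not true.

```python
import re

# A literal string "( ... )" without nested parentheses, or a number.
TOKEN = re.compile(r"\(((?:\\.|[^\\()])*)\)|(-?\d+(?:\.\d+)?)")

def tj_to_text(tj_array, space_threshold=-200.0):
    """Rebuild text from a TJ array, treating big negative kerns as spaces.

    Escape sequences are matched but not decoded, and nested parentheses
    are not handled; both are simplifications for this sketch.
    """
    out = []
    for string, number in TOKEN.findall(tj_array):
        if number:
            if float(number) <= space_threshold:
                out.append(" ")   # wide gap: assume a word boundary
            # small adjustments are ordinary kerning and are ignored
        else:
            out.append(string)
    return "".join(out)

line = ("[(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)"
        "-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)"
        "-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ")
text = tj_to_text(line)
print(text)
```

A fixed threshold is the weak point: the right cutoff depends on the font and on how the producing application justified the line, so real extractors compare gaps against the width of the font's space glyph instead.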
whenc•2h ago
Try with cpdf (disclaimer, wrote it):

  cpdf -output-json -output-json-parse-content-streams in.pdf -o out.json
Then you can play around with the JSON, and turn it back to PDF with

  cpdf -j out.json -o out.pdf
No live back-and-forth though.
svat•2h ago
The live back-and-forth is the main point of what I'm asking for — I tried your cpdf (thanks for the mention; will add it to my list) and it too doesn't help; all it does is, somewhere 9000-odd lines into the JSON file, turn the part of the content stream corresponding to what I mentioned in the earlier comment into:

        [
          [ { "F": 0.0 }, "g" ],
          [ { "F": 0.0 }, "G" ],
          [ { "F": 0.0 }, "g" ],
          [ { "F": 0.0 }, "G" ],
          [ "BT" ],
          [ "/F19", { "F": 10.9091 }, "Tf" ],
          [ { "F": 88.93600000000001 }, { "F": 709.0410000000001 }, "Td" ],
          [
            [
              "Subsequen",
              { "F": 28.0 },
              "t",
              { "F": -374.0 },
              "to",
              { "F": -373.0 },
              "the",
              { "F": -373.0 },
              "p",
              { "F": -28.0 },
              "erio",
              { "F": -28.0 },
              "d",
              { "F": -373.0 },
              "analyzed",
              { "F": -373.0 },
              "in",
              { "F": -374.0 },
              "our",
              { "F": -373.0 },
              "study",
              { "F": 83.0 },
              ",",
              { "F": -383.0 },
              "Bridge's",
              { "F": -373.0 },
              "paren",
              { "F": 27.0 },
              "t",
              { "F": -373.0 },
              "compan",
              { "F": 28.0 },
              "y",
              { "F": -373.0 },
              "Ne",
              { "F": -1.0 },
              "wGlob",
              { "F": -27.0 },
              "e",
              { "F": -374.0 },
              "reduced"
            ],
            "TJ"
          ],
          [ { "F": -16.936 }, { "F": -21.922 }, "Td" ],
This is just a more verbose restatement of what's in the PDF file; the real questions I'm asking are:

- How can a user get to this part, from viewing the PDF file? (Note that the PDF page objects are not necessarily a flat list; they are often nested at different levels of “kids”.)

- How can a user understand these instructions, and “see” how they correspond to what is visually displayed on the PDF file?
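
One partial answer to the second question, as a Python sketch: accumulate the `Td` offsets to see where each line of text starts. It consumes only the JSON shape shown above (numbers wrapped as `{"F": ...}`, operator name last); the helper names are hypothetical, bare numbers are accepted as an assumption, and other positioning operators (`Tm`, `TD`, `T*`, `'`) are ignored.

```python
import json

def num(value):
    """Unwrap {"F": 1.0}; bare numbers are accepted as an assumption."""
    return value["F"] if isinstance(value, dict) else float(value)

def line_origins(ops):
    """Return the (x, y) start of each TJ, in text-space units."""
    x = y = 0.0
    origins = []
    for op in ops:
        name = op[-1]
        if name == "BT":
            x = y = 0.0              # BT resets the text line position
        elif name == "Td":
            x += num(op[0])          # Td is relative to the current line start
            y += num(op[1])
        elif name == "TJ":
            origins.append((x, y))
    return origins

# Abbreviated version of the ops above (TJ arrays shortened).
sample = json.loads("""
[
  [ "BT" ],
  [ "/F19", { "F": 10.9091 }, "Tf" ],
  [ { "F": 88.936 }, { "F": 709.041 }, "Td" ],
  [ [ "Subsequen", { "F": 28.0 }, "t" ], "TJ" ],
  [ { "F": -16.936 }, { "F": -21.922 }, "Td" ],
  [ [ "the", { "F": -438.0 }, "n" ], "TJ" ],
  [ { "F": 0.0 }, { "F": -21.923 }, "Td" ]
]
""")
origins = line_origins(sample)
for x, y in origins:
    print(f"line starts at x={x:.3f}, y={y:.3f}")
```

Read this way, the first line starts about 17 units to the right of the second and the lines are about 22 units apart, which is consistent with an indented first line of a paragraph set in a roughly 11-unit font.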

dleeftink•2h ago
Have a look at this notebook[0], not exactly what you're looking for but does provide a 'live' inspector of the various drawing operations contained in a PDF.

[0]: https://observablehq.com/@player1537/pdf-utilities

svat•2h ago
Thanks, but I was not able to figure out how to get any use out of the notebook above. In what sense is it a 'live' inspector? All it seems to do is to just decompose the PDF into separate “ops” and “args” arrays (neither of which is meaningful without the other), but it does not seem “live” in any sense — how can one find the ops (and args) corresponding to a region of the PDF page, or vice-versa?
dleeftink•1h ago
You can load up your own PDF and select a page up front after which it will display the opcodes for this page. Operations are not structurally grouped, but decomposed in three aligned arrays which can be grouped to your liking based on opcode or used as coordinates for intersection queries (e.g. combining the ops and args arrays).

The 'liveness' here is that you can derive multiple downstream cells (e.g. filters, groupings, drawing instructions) from the initial parsed PDF, which will update as you swap out the PDF file.

kccqzy•43m ago
When you use PDF.js from Mozilla to render a PDF file in DOM, I think you might actually get something pretty close. For example I suppose each Tj becomes a <span> and each TJ becomes a collection of <span>s. (I'm fairly certain it doesn't use <canvas>.) And I suppose it must be very faithful to the original document to make it work.
wrs•2h ago
Since these are statistical classification problems, it seems like it would be worth trying some old-school machine learning (not an LLM, just an NN) to see how it compares with these manual heuristics.
marginalia_nu•2h ago
I imagine that would work pretty well given an adequate and representative body of annotated sample data. Though that is also not easy to come by.
ted_dunning•1h ago
Actually, it is easy to come up with reasonably decent heuristics that can auto-tag a corpus. From that you can look for anomalies and adjust your tagging system.

The problem of getting a representative body is (surprisingly) much harder than the annotation. I know. I spent quite some time years ago doing this.

andrethegiant•2h ago
Cloudflare’s ai.toMarkdown() function available in Workers AI can handle PDFs pretty easily. Judging from speed alone, it seems they’re parsing the actual content rather than shoving into OCR/LLM.

Shameless plug: I use this under the hood when you prefix any PDF URL with https://pure.md/ to convert to raw text.

burkaman•2h ago
If you're looking for test cases, this is the first thing I tried and the result is very bad: https://pure.md/https://docs.house.gov/meetings/IF/IF00/2025...
andrethegiant•2h ago
Apart from lacking newlines, how is the result bad? It extracts the text for easy piping into an LLM.
burkaman•1h ago
- Most of the titles have incorrectly split words, for example "P ART 2—R EPEAL OF EPA R ULE R ELATING TO M ULTI -P OLLUTANT E MISSION S TANDARDS". I know LLMs are resilient against typos and mistakes like this, but it still seems not ideal.

- The header is parsed in a way that I suspect would mislead an LLM: "BRETT GUTHRIE, KENTUCKY FRANK PALLONE, JR., NEW JERSEY CHAIRMAN RANKING MEMBER ONE HUNDRED NINETEENTH CONGRESS". Guthrie is the chairman and Pallone is the ranking member, but that isn't implied in the text. In this particular case an LLM might already know that from other sources, but in more obscure contexts it will just have to rely on the parsed text.

- It isn't converted into Markdown at all, the structure is completely lost. If you only care about text then I guess that's fine, and in this case an LLM might do an ok job at identifying some of the headers, but in the context of this discussion I think ai.toMarkdown() did a bad job of converting to Markdown and a just ok job of converting to text.

I would have considered this a fairly easy test case, so it would make me hesitant to trust that function for general use if I were trying to solve the challenges described in the submitted article (Identifying headings, Joining consecutive headings, Identifying Paragraphs).

I see that you are trying to minimize tokens for LLM input, so I realize your goals are probably not the same as what I'm talking about.

Edit: Another test case, it seems to crash on any Arxiv PDF. Example: https://pure.md/https://arxiv.org/pdf/2411.12104.

andrethegiant•10m ago
> it seems to crash on any Arxiv PDF

Fixed, thanks for reporting :-)

marginalia_nu•1h ago
That PDF actually has some weird corner cases.

First it's all the same font size everywhere, it's also got bolded "headings" with spaces that are not bolded. Had to fix my own handling to get it to process well.

This is the search engine's view of the document as of those fixes: https://www.marginalia.nu/junk/congress.html

Still far from perfect...

mdaniel•35m ago
> That PDF actually has some weird corner cases.

Heh, in my experience with PDFs that's a tautology

_boffin_•2h ago
You’re aware that PDFs are containers that can hold various formats, which can be interlaced in different ways, such as on top, throughout, or in unexpected and unspecified ways that aren’t “parsable,” right?

I would wager that they’re using OCR/LLM in their pipeline.

andrethegiant•2h ago
Could be. But their pricing for the conversion is free, which leads me to believe LLMs are not involved.
cpursley•2h ago
How does their function do on complex data tables, charts and that sort of stuff?
bambax•1h ago
It doesn't seem to handle multi-column PDFs well?
bob1029•2h ago
When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.

kapitalx•2h ago
This is approximately the approach we're taking also at https://doctly.ai, add to that a "multiple experts" approach for analyzing the image (for our 'ultra' version), and we get really good results. And we're making it better constantly.
layer8•2h ago
If you assume standardized documents, you can impose the use of Tagged PDF: https://pdfa.org/resource/tagged-pdf-q-a/
dwheeler•2h ago
The better solution is to embed, in the PDF, the editable source document. This is easily done by LibreOffice. Embedding it takes very little space in general (because it compresses well), and then you have MUCH better information on what the text is and its meaning. It works just fine with existing PDF readers.
layer8•2h ago
That’s true, but it also opens up the vulnerability of the source document being arbitrarily different from the rendered PDF content.
kerkeslager•1h ago
That's true, but it's dependent on the creator of the PDF having aligned incentives with the consumer of the PDF.

In the e-Discovery field, it's commonplace for those providing evidence to dump it into a PDF purely so that it's harder for the opposing side's lawyers to consume. If both sides have lots of money this isn't a barrier, but for example public defenders don't have funds to hire someone (me!) to process the PDFs into a readable format, so realistically they end up taking much longer to process the data, which takes a psychological toll on the defendant. And that's if they process the data at all.

The solution is to make it illegal to do this: wiretap data, for example, should be provided in a standardized machine-readable format. There's no ethical reason for simple technical friction to be affecting the outcomes of criminal proceedings.

giovannibonetti•1h ago
I wonder if AI will solve that
GaggiX•14m ago
There are specialized models, but even generic ones like Gemini 2.0 Flash are really good and cheap, you can use them and embed the OCR inside the PDF to index to the original content.
carabiner•1h ago
I bet 90% of the problem space is legacy PDFs. My company has thousands of these. Some are crappy scans. Some have Adobe's OCR embedded, but most have none at all.
lelandfe•1h ago
The better solution to a search engine extracting text from existing PDFs is to provide advice on how to author PDFs?

What's the timeline for this solution to pay off?

Obscurity4340•2h ago
Is this what GoodPDF does?
reify•1h ago
https://github.com/jalan/pdftotext

pdftotext -layout input.pdf output.txt

pip install pdftotext

EmilStenstrom•1h ago
I think using Gemma3 in vision mode could be a good use-case for converting PDF to text. It’s downloadable and runnable on a local computer, with decent memory requirements depending on which size you pick. Did anyone try it?
CaptainFever•1h ago
Kind of unrelated, but Gemma 3's weights are unfree, so perhaps LLaVA (https://ollama.com/library/llava) would be a good alternative.
ted_dunning•1h ago
One of my favorite documents for highlighting the challenges described here is the PDF for this article:

https://academic.oup.com/auk/article/126/4/717/5148354

The first page is classic with two columns of text, centered headings, a text inclusion that sits between the columns and changes the line lengths and indentations for the columns. Then we get the fun of page headers that change between odd and even pages and section header conventions that vary drastically.

Oh... to make things even better, paragraphs don't get extra spacing and don't always have an indented first line.

Some of everything.
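
As a sketch of how the odd/even page header problem might be approached, the Python below drops a page's first line when the same line, with digits masked so page numbers do not matter, opens most pages of the same parity. Everything here is hypothetical: the function name, the `min_share` cutoff, and the sample lines are made up and are not taken from the linked article.

```python
import re
from collections import Counter

def strip_running_headers(pages, min_share=0.6):
    """pages: list of pages, each a list of text lines in reading order."""
    def key(line):
        # Mask digits so "718 ..." and "720 ..." count as the same header.
        return re.sub(r"\d+", "#", line.strip())

    counts, totals = Counter(), Counter()
    for i, lines in enumerate(pages):
        if lines:
            counts[(i % 2, key(lines[0]))] += 1
            totals[i % 2] += 1

    cleaned = []
    for i, lines in enumerate(pages):
        parity = i % 2
        repeated = (
            bool(lines)
            and totals[parity] >= 2
            and counts[(parity, key(lines[0]))] / totals[parity] >= min_share
        )
        cleaned.append(list(lines[1:]) if repeated else list(lines))
    return cleaned

# Made-up pages with alternating headers.
pages = [
    ["718 SMITH ET AL.", "text a"],
    ["BIRD SONG 719", "text b"],
    ["720 SMITH ET AL.", "text c"],
    ["BIRD SONG 721", "text d"],
]
cleaned = strip_running_headers(pages)
print(cleaned)
```

This only looks at the first line of each page and assumes the extractor emits the header first, which is not guaranteed; footers and multi-line headers would need the same treatment applied to other positions.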

JKCalhoun•50m ago
The API in CoreGraphics (MacOS) for PDF, at a basic level, simply presented the text, per page, in the order in which it was encoded in the dictionaries. And 95% of the time this was pretty good — and when working with PDFKit and Preview on the Mac, we got by with it for years.

If you stepped back you could imagine the app that originally had captured/produced the PDF — perhaps a word processor — it was likely rendering the text into the PDF context in some reasonable order from its own text buffer(s). So even for two columns, you rather expect, and often found, that the text flowed correctly from the left column to the right. The text was therefore already in the correct order within the PDF document.

Now, footers, headers on the page — that would be anyone's guess as to what order the PDF-producing app dumped those into the PDF context.

devrandoom•52m ago
I currently use ocrmypdf for my private library. Then Recoll to index and search. Is there a better solution I'm missing?
constantinum•46m ago
PDF parsing is hell indeed, with all sorts of edge cases that break business workflows, more on that here https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
gibsonf1•44m ago
We[1] create "Units of Thought" from PDFs and then work with those for further discovery, where a "Unit of Thought" is any paragraph, title, or note heading: something that stands on its own semantically. We then create a hierarchy of objects from that PDF in the database for search and conceptual search, all at scale.

[1] https://graphmetrix.com/trinpod-server https://trinapp.com

kbyatnal•32m ago
"PDF to Text" is a bit simplified IMO. There are actually a few classes of problems within this category:

1. reliable OCR from documents (to index for search, feed into a vector DB, etc)

2. structured data extraction (pull out targeted values)

3. end-to-end document pipelines (e.g. automate mortgage applications)

Marginalia needs to solve problem #1 (OCR), which is luckily getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.

Problems #2 and #3 are much more tricky. There's still a large gap for businesses in going from raw OCR outputs —> document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. The future is definitely moving in this direction though.

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.ai)

anonu•23m ago
They should have called it NDF - Non-Portable Document Format.
dobraczekolada•16m ago
Reminds me of github.com/docwire/docwire
90s_dev•13m ago
Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?

I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... google tells me it was "tesseract" and that sounds familiar.

90s_dev•11m ago
I also remember that there was an alternative to tesseract, and that one of them was much better for my particular needs, but there was some slight drawback to it. Maybe I should have documented all these exploits for the slight entertainment of random people or even myself in the distant future.
downboots•5m ago
No different than a fire ant whose leaf got knocked over by the wind and it moved on to the next.
PeterStuer•13m ago
I guess I'm lucky the PDFs I need to process are mostly rather dull, unadventurous layouts. So far I've had great success using docling.
keybored•12m ago
For people who want people to read their documents[1] they should have their PDF point to a more digital-friendly format, an alt document.

Looks like you’ve found my PDF. You might want this version instead:

PDFs are often subpar. Just see the first example: standard Latex serif section title. I mean, PDFs often aren’t even well-typeset for what they are (dead-tree simulations).

[1] No sarcasm or truism. Some may just want to submit a paper to whatever publisher and go through their whole laundry list of what a paper ought to be. Wide dissemination is not the point.

1vuio0pswjnm7•4m ago
"The crux of the problem is that the file format isn't a text format at all, but a graphical format."

This seems to suggest that PDF is a "graphics only" format.

Below is a PDF. It is a text file. I can save it as a .pdf and open it in a PDF viewer. I can make changes in a text editor. For example, by editing the text file, I can change the text displayed on the screen when the PDF is opened, the font, font size, line spacing, the maximum characters per line, number of lines per page, the paper width and height, as well as portrait versus landscape mode.

   %PDF-1.4
   1 0 obj
   <<
   /CreationDate (D:2025)
   /Producer 
   >>
   endobj
   2 0 obj
   <<
   /Type /Catalog
   /Pages 3 0 R
   >>
   endobj
   4 0 obj
   <<
   /Type /Font
   /Subtype /Type1
   /Name /F1
   /BaseFont /Times-Roman
   >>
   endobj
   5 0 obj
   <<
     /Font << /F1 4 0 R >>
     /ProcSet [ /PDF /Text ]
   >>
   endobj
   6 0 obj
   <<
   /Type /Page
   /Parent 3 0 R
   /Resources 5 0 R
   /Contents 7 0 R
   >>
   endobj
   7 0 obj
   <<
   /Length 8 0 R
   >>
   stream
   BT
   /F1 50 Tf
   1 0 0 1 50 752 Tm
   54 TL
   (PDF is)' 
   ((a) a text format)'
   ((b) a graphics format)'
   ((c) (a) and (b).)'
   ()'
   ET
   endstream
   endobj
   8 0 obj
   53
   endobj
   3 0 obj
   <<
   /Type /Pages
   /Count 1
   /MediaBox [ 0 0 612 792 ]
   /Kids [ 6 0 R ]
   >>
   endobj
   xref
   0 9
   0000000000 65535 f 
   0000000009 00000 n 
   0000000113 00000 n 
   0000000514 00000 n 
   0000000162 00000 n 
   0000000240 00000 n 
   0000000311 00000 n 
   0000000391 00000 n 
   0000000496 00000 n 
   trailer
   <<
   /Size 9
   /Root 2 0 R
   /Info 1 0 R
   >>
   startxref
   599
   %%EOF
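
One catch with editing such a file by hand is that the `xref` table and `startxref` hold byte offsets, so they go stale as soon as the text changes length (many viewers will repair this silently, but not all tools do). The Python sketch below writes a similar one-page PDF and computes those offsets instead of typing them. The function name `build_pdf` and the object layout are choices made for this sketch, not the commenter's file.

```python
def build_pdf(objects):
    """Serialize objects and build the xref table from real byte offsets.

    objects: list of bytes; objects[i] is the body of object number i + 1,
    and object 1 is assumed to be the catalog.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                 # where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"              # entry for free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset      # each entry is 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

stream = (b"BT\n/F1 50 Tf\n1 0 0 1 50 752 Tm\n54 TL\n"
          b"(PDF is)'\n((a) a text format)'\n((b) a graphics format)'\nET")
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Count 1 /Kids [ 3 0 R ] /MediaBox [ 0 0 612 792 ] >>",
    b"<< /Type /Page /Parent 2 0 R "
    b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Times-Roman >>",
    b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
]
pdf = build_pdf(objects)
# To view it: open("hello.pdf", "wb").write(pdf)
```

The content stream is still plain text that can be edited freely; only the bookkeeping around it is generated.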