Llama-Scan: Convert PDFs to Text W Local LLMs

https://github.com/ngafar/llama-scan
93•nawazgafar•7h ago

Comments

david_draco•6h ago
Looking at the code, this converts PDF pages to images, then transcribes each image. I might have expected a pdftotext post-processor. The complexity of PDF I guess ...
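A minimal sketch of the flow described above (render each page to an image, then transcribe each image). This is not the project's actual code: `render_page` and `transcribe_image` are hypothetical callables standing in for a PDF rasterizer and a vision-model call, and the stubs exist only so the example runs on its own.

```python
from typing import Callable, List


def pdf_to_text(
    page_count: int,
    render_page: Callable[[int], bytes],
    transcribe_image: Callable[[bytes], str],
) -> str:
    """Render each page to an image, transcribe it, and join the results."""
    pages: List[str] = []
    for index in range(page_count):
        image = render_page(index)             # hypothetical rasterizer
        pages.append(transcribe_image(image))  # hypothetical vision-model call
    return "\n\n".join(pages)


# Stand-in stubs (assumptions) so the sketch runs without a PDF or a model.
def fake_render(index: int) -> bytes:
    return f"page-{index}".encode()


def fake_transcribe(image: bytes) -> str:
    return f"text of {image.decode()}"


result = pdf_to_text(2, fake_render, fake_transcribe)
print(result)
```

Passing the two steps in as callables keeps the loop independent of any particular rasterizer or model, so a pdftotext-style extractor could be swapped in for pages that already carry a text layer.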
firesteelrain•6h ago
There is a very popular Python module called ocrmypdf. I used it to help my HOA OCR its old PDFs.

https://github.com/ocrmypdf/OCRmyPDF

No LLMs required.

moritonal•5h ago
I imagine part of the issue is how many PDFs are just a series of images anyway.
enjaydee•5h ago
Saw this tweet the other day that helped me understand just how crazy PDF parsing can be

https://threadreaderapp.com/thread/1955355127818358929.html

constantinum•2h ago
There are a few other reasons why PDF parsing is Hell! > https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
ethan_smith•24m ago
Image-based extraction often preserves layout and handles PDFs with embedded fonts, scanned content, or security restrictions better than direct text extraction methods.
firesteelrain•6h ago
Ironically, Ollama likely is using Tesseract under the hood. Python library ocrmypdf uses Tesseract too. https://github.com/ocrmypdf/OCRmyPDF
wittjeff•5h ago
Please add a license file. Thanks!
HocusLocus•5h ago
By 1990 Omnipage 3 and its successors were 'good enough' and with their compact dictionaries and letter form recognition were miracles of their time at ~300MB installed.

In 2025 LLMs can 'fake it' using Trilobites of memory and Petaflops. It's funny actually, like a supercomputer being emulated in real time on a really fast Jacquard loom. By 2027 even simple hand held calculator addition will be billed in kilowatt-hours.

Y_Y•3h ago
https://en.wikipedia.org/wiki/Trilobite

Trilobites? Those were truly primitive computers.

privatelypublic•3h ago
If you think 1990's ocr- even 2000's OCR is remotely as good as modern OCR... I`v3 g0ta bnedge to sell.
skygazer•2h ago
I had an on-screen OCR app on my Amiga in the early 90s that was amazing, so long as the captured text image used a system font. Avoid all the mess of reality, like optics, perspective, sensors and physics, and it could be basically perfect.
privatelypublic•1h ago
If you want to go back to the start, look up MICR. Used to sort checks.

OCR'ing a fixed, monospaced font from a pristine piece of paper really is "solved." It's all the nasties of the real world that make it an issue.

As I mockingly demonstrated- kerning, character similarity, grammar, lexing- all present large and hugely time consuming problems to solve in processes where OCR is the most useful.

fcoury•5h ago
I really wanted this to be good. Unfortunately, when I gave it a page containing a table of the kind that converters usually struggle with, I got a full page with "! Picture 1:" and nothing else. On top of that, it hung at page 17 of a 25-page document and never resumed.
thorum•5h ago
I’ve been trying to convert a dense 60 page paper document to Markdown today from photos taken on my iPhone. I know this is probably not the best way to do it but it’s still been surprising to find that even the latest cloud models are struggling to process many of the pages. Lots of hallucination and “I can’t see the text” (when the photo is perfectly clear). Lots of retrying different models, switching between LLMs and old fashioned OCR, reading and correcting mistakes myself. It’s still faster than doing the whole transcription manually but I thought the tech was further along.
bugglebeetle•5h ago
Try this:

https://github.com/rednote-hilab/dots.ocr

cronoz30•5h ago
Does this work with images embedded in the PDF and rasterized images?
kaycey2022•1h ago
It converts each page into an image and feeds it to Qwen2.5VL, so it should be fine.
ahmedhawas123•5h ago
This may be a bit of an irrelevant and at best imaginative rant, but there is no shortage of solutions that are mediocre or near perfect for specific use cases out there to parse PDFs. This is a great addition to that.

That said, over the last two years I've come across many use cases to parse PDFs and each has its own requirements (e.g., figuring out titles, removing page numbers, extracting specific sections, etc). And each requires a different approach.

My point is, this is awesome, but I wonder if there needs to be a broader push / initiative to stop leveraging PDFs so much when things like HTML, XML, JSON and a million other formats exist. It's a hard undertaking, no doubt, but it's not unheard of to drop a technology (e.g., fax) for a better one.

mdaniel•4h ago
That ship has sailed, and I'd guess the majority of the folks in these threads are in the same boat I am: you do not get to choose what files your customers send you; you have to meet them where they are.
bm-rf•4h ago
For the purposes of an LLM "reading" a PDF, it just renders it as an image. The file format does not matter. If you have documents that already exist, a robust OCR solution that can handle tables and diagrams could be very valuable.
evolve2k•4h ago
“Turn images and diagrams into detailed text descriptions.”

I’d just prefer that any images and diagrams are copied over, and rendered into a popular format like markdown.

ggnore7452•3h ago
I’ve done a similar PDF → Markdown workflow.

For each page:

- Extract text as usual.

- Capture the whole page as an image (~200 DPI).

- Optionally extract images/graphs within the page and include them in the same LLM call.

- Optionally add a bit of context from neighboring pages.

Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.

At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.

Yeah, it’s a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement and quite flexible and powerful. It works across almost any format, and Markdown is both AI and human friendly, and surprisingly maintainable.
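The per-page steps above can be sketched roughly as follows. Everything named here (`PageInput`, `build_page_request`, `zoom_for_dpi`, the prompt wording, the shape of the returned dict) is a hypothetical illustration, not any particular model's API. The one fixed fact it relies on is that PDF page sizes are measured in points at 72 per inch by default, so rendering at ~200 DPI means scaling by 200/72.

```python
from dataclasses import dataclass, field
from typing import List, Optional

POINTS_PER_INCH = 72  # default PDF user-space unit


def zoom_for_dpi(dpi: float) -> float:
    """Scale factor that renders a PDF page at the requested DPI."""
    return dpi / POINTS_PER_INCH


@dataclass
class PageInput:
    """Everything gathered for one page (hypothetical container)."""
    number: int
    extracted_text: str
    page_image: bytes
    figures: List[bytes] = field(default_factory=list)


def build_page_request(
    page: PageInput,
    previous_text: Optional[str] = None,
    next_text: Optional[str] = None,
    context_chars: int = 200,
) -> dict:
    """Bundle text, images and neighbor context into one request payload."""
    parts = [
        # Prompt wording is an assumption; adjust to the output you want.
        "Convert this page to Markdown. Describe each graph in a short caption.",
        f"Extracted text for page {page.number}:\n{page.extracted_text}",
    ]
    if previous_text:
        parts.append("End of previous page:\n" + previous_text[-context_chars:])
    if next_text:
        parts.append("Start of next page:\n" + next_text[:context_chars])
    return {
        "prompt": "\n\n".join(parts),
        "images": [page.page_image, *page.figures],
    }


# A US Letter page is 612 x 792 points; at 200 DPI that is 1700 x 2200 pixels.
width_px = round(612 * zoom_for_dpi(200))
height_px = round(792 * zoom_for_dpi(200))

request = build_page_request(
    PageInput(number=2, extracted_text="hello", page_image=b"img", figures=[b"fig"]),
    previous_text="abc",
)
```

The returned dict would then be handed to whichever model client you use; that call is deliberately left out because its shape depends on the provider.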

GaggiX•2h ago
>are cheap and strong enough to make this practical.

It all depends on the scale at which you need them; with the API it's easy to generate millions of tokens without thinking.

KnuthIsGod•3h ago
Sub-2010 level OCR using LLM.

It is hype-compatible so it is good.

It is AI so it is good.

It is blockchain so it is good.

It is cloud so it is good.

It is virtual so it is good.

It is UML so it is good.

It is RPN so it is good.

It is a steam engine so it is good.

Yawn...

GaggiX•2h ago
>Sub-2010 level OCR

It's not.

deepsquirrelnet•2h ago
Give the nanonets-ocr-s model a try. It’s a fine-tune of Qwen 2.5 VL which I’ve had good success with for Markdown and LaTeX with image captioning. It uses a simple tagging scheme for page numbers, captions and tables.
captainregex•2h ago
I desperately wanted Qwen VL to work but it just unleashes rambling hallucinations off basic screencaps. Going to try nanonets!
constantinum•2h ago
Other tools worthy of mention that help with OCR'ing PDFs/scans to Markdown/layout-preserved text:

LLMWhisperer (from Unstract), Docling (IBM), Marker (Surya OCR), Nougat (Facebook Research), LlamaParse.

Areibman•2h ago
Similar project used to organize PDFs with Ollama https://github.com/iyaja/llama-fs
ekianjo•1h ago
Careful if you plan on using this. It leverages PyMuPDF, which is AGPL.
