
France's homegrown open source online office suite

https://github.com/suitenumerique
105•nar001•1h ago•48 comments

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
345•theblazehen•2d ago•117 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
49•AlexeyBrin•2h ago•10 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
734•klaussilveira•17h ago•230 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
29•onurkanbkrc•2h ago•2 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
990•xnx•22h ago•562 comments

Coding agents have replaced every framework I used

https://blog.alaindichiappari.dev/p/software-engineering-is-back
75•alainrk•2h ago•71 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
114•jesperordrup•7h ago•52 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
84•videotopia•4d ago•17 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
144•matheusalmeida•2d ago•39 comments

Ga68, a GNU Algol 68 Compiler

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
24•matt_d•3d ago•5 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
247•isitcontent•17h ago•27 comments

Cross-Region MSK Replication: K2K vs. MirrorMaker2

https://medium.com/lensesio/cross-region-msk-replication-a-comprehensive-performance-comparison-o...
6•andmarios•4d ago•1 comment

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
257•dmpetrov•17h ago•135 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
6•sandGorgon•2d ago•2 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
350•vecti•19h ago•157 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
518•todsacerdoti•1d ago•252 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
400•ostacke•23h ago•104 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
52•helloplanets•4d ago•51 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
316•eljojo•20h ago•196 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
365•aktau•23h ago•189 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
445•lstoll•23h ago•293 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
99•quibono•4d ago•26 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
79•kmm•5d ago•12 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
285•i5heu•20h ago•238 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
48•gmays•12h ago•21 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
26•bikenaga•3d ago•15 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
162•vmatsiiako•22h ago•73 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1098•cdrnsf•1d ago•479 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
70•gfortaine•15h ago•29 comments

Llama-Scan: Convert PDFs to Text W Local LLMs

https://github.com/ngafar/llama-scan
221•nawazgafar•5mo ago

Comments

roscas•5mo ago
Almost perfect: on the PDF I tested, it missed only a few symbols.

But that is something I will use for sure. Thank you.

nawazgafar•5mo ago
Glad to hear it! What types of symbols did it miss?
david_draco•5mo ago
Looking at the code, this converts PDF pages to images, then transcribes each image. I might have expected a pdftotext post-processor. The complexity of PDF I guess ...
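
A minimal sketch of that pipeline (rasterize each page with PyMuPDF, then transcribe via the Ollama Python client; the model tag and prompt are assumptions, not the project's actual code):

  import fitz  # PyMuPDF
  import ollama

  def transcribe_pdf(path, model="qwen2.5vl"):  # assumed Ollama tag
      pages = []
      for page in fitz.open(path):
          # Rasterize at 2x zoom so small glyphs stay legible
          pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
          resp = ollama.chat(
              model=model,
              messages=[{"role": "user",
                         "content": "Transcribe this page to plain text.",
                         "images": [pix.tobytes("png")]}],
          )
          pages.append(resp["message"]["content"])
      return "\n\n".join(pages)
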
firesteelrain•5mo ago
There is a very popular Python module called ocrmypdf. I used it to help my HOA with OCR'ing old PDFs.

https://github.com/ocrmypdf/OCRmyPDF

No LLMs required.
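
It's a one-liner from Python, too (a sketch; the file names are placeholders):

  import ocrmypdf

  # Adds a Tesseract-generated text layer on top of the scanned pages
  ocrmypdf.ocr("scanned.pdf", "searchable.pdf", deskew=True, language="eng")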

cess11•5mo ago
It's nice, I've used it as a fallback text extraction method in an ETL flow that chugged through tens of thousands of corporate and legal PDF files.
dreamcompiler•5mo ago
20 years ago I tried in vain to get my HOA to use the virtual printer for PDF documents so they'd be searchable. The capability was built into both Mac and Windows even way back then.

No luck. They just could not grasp it. So they kept using their process of printing out the file on paper and then scanning it back in as a PDF image file.

I finally quit trying. Now of course they've seen the light and are painstakingly OCRing all that old stuff.

firesteelrain•5mo ago
Ouch! I am on the BOD, so as an IT/engineering professional I can influence things better.
westurner•5mo ago
Shell: GNU parallel, pdftotext

Python: PyPDF2, pdfminer.six, Grobid, PyMuPDF; pytesseract (wraps the C++ Tesseract engine)

paperetl is built on grobid: https://github.com/neuml/paperetl

annotateai: https://github.com/neuml/annotateai :

> annotateai automatically annotates papers using Large Language Models (LLMs). While LLMs can summarize papers, search papers and build generative text about papers, this project focuses on providing human readers with context as they read.

pdf.js-hypothes.is: https://github.com/hypothesis/pdf.js-hypothes.is:

> This is a copy of Mozilla's PDF.js viewer with Hypothesis annotation tools added

Hypothesis is built on the W3C Web Annotations spec.

dokieli implements W3C Web Annotations and many other Linked Data Specs: https://github.com/dokieli/dokieli :

> Implements versioning and has the notion of immutable resources.

> Embedding data blocks, e.g., Turtle, N-Triples, JSON-LD, TriG (Nanopublications).

A dokieli document interface to LLMs would be basically the anti-PDF.

Rust crates: rayon handles parallel processing, pdf-rs, tesseract (C++)

pdf-rs examples/src/bin/extract_page.rs: https://github.com/pdf-rs/pdf/blob/master/examples/src/bin/e...
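
Of the Python options above, a common pattern is plain extraction with an OCR fallback for image-only pages; a rough sketch (PyMuPDF plus pytesseract, parameter values are assumptions):

  import io
  import fitz  # PyMuPDF
  import pytesseract
  from PIL import Image

  def extract_text(path):
      out = []
      for page in fitz.open(path):
          text = page.get_text().strip()
          if not text:  # likely a scanned/image-only page: fall back to OCR
              pix = page.get_pixmap(dpi=300)
              img = Image.open(io.BytesIO(pix.tobytes("png")))
              text = pytesseract.image_to_string(img)
          out.append(text)
      return "\n".join(out)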

moritonal•5mo ago
I imagine part of the issue is how many PDFs are just a series of images anyway.
enjaydee•5mo ago
Saw this tweet the other day that helped me understand just how crazy PDF parsing can be

https://threadreaderapp.com/thread/1955355127818358929.html

constantinum•5mo ago
There are a few other reasons why PDF parsing is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
ethan_smith•5mo ago
Image-based extraction often preserves layout and handles PDFs with embedded fonts, scanned content, or security restrictions better than direct text extraction methods.
firesteelrain•5mo ago
Ironically, Ollama likely is using Tesseract under the hood. Python library ocrmypdf uses Tesseract too. https://github.com/ocrmypdf/OCRmyPDF
rafram•5mo ago
> Ironically, Ollama likely is using Tesseract under the hood.

No, it isn’t.

no_creativity_•5mo ago
Which llama model would have the best results for transcribing an image, I wonder. Say, for a screen grab of a newspaper page.
wittjeff•5mo ago
Please add a license file. Thanks!
nawazgafar•5mo ago
Will do!
leodip•5mo ago
Nice! I wonder what hardware is required to run qwen2.5vl locally. Can a 6 GB, 2-CPU VPS handle it?
mdaniel•5mo ago
It does not appear that qwen2.5vl is one thing, so it would depend a great deal on the size you wish to use

Also, watch out, it seems the weights do not carry a libre license https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main...

kaycey2022•5mo ago
What is a libre license and why is it important?
HocusLocus•5mo ago
By 1990 Omnipage 3 and its successors were 'good enough' and with their compact dictionaries and letter form recognition were miracles of their time at ~300MB installed.

In 2025 LLMs can 'fake it' using Trilobites of memory and Petaflops. It's funny actually, like a supercomputer being emulated in real time on a really fast Jacquard loom. By 2027 even simple hand held calculator addition will be billed in kilowatt-hours.

Y_Y•5mo ago
https://en.wikipedia.org/wiki/Trilobite

Trilobites? Those were truly primitive computers.

__alexs•5mo ago
Didn't the discworld books have these?
privatelypublic•5mo ago
If you think 1990's ocr- even 2000's OCR is remotely as good as modern OCR... I`v3 g0ta bnedge to sell.
skygazer•5mo ago
I had an on-screen OCR app on my Amiga in the early 90s that was amazing, so long as the captured text image used a system font. Avoiding all the mess of reality (optics, perspective, sensors, physics), it could be basically perfect.
privatelypublic•5mo ago
If you want to go back to the start, look up MICR. Used to sort checks.

OCR'ing a fixed, monospaced font from a pristine piece of paper really is "solved." It's all the nasties of the real world that are the issue.

As I mockingly demonstrated- kerning, character similarity, grammar, lexing- all present large and hugely time consuming problems to solve in processes where OCR is the most useful.

Someone•5mo ago
MacPaint had that in 1983, but it never shipped because Bill Atkinson “was afraid that if he left it in, people would actually use it a lot, and MacPaint would be regarded as an inadequate word processor instead of a great drawing program” (https://www.folklore.org/MacPaint_Evolution.html)

Also shows a way to do that fast:

“ First, he wrote assembly language routines to isolate the bounding box of each character in the selected range. Then he computed a checksum of the pixels within each bounding box, and compared them to a pre-computed table that was made for each known font, only having to perform the full, detailed comparison if the checksum matched.”
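
The trick translates to a few lines in any language; a toy Python sketch of the idea (the bitmaps are hypothetical stand-ins, not Atkinson's code):

  # Hypothetical 3x3 glyph bitmaps standing in for a real font
  known_glyphs = {
      "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
      "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
  }

  # Precomputed once per known font: checksum -> candidate character
  font_table = {hash(g): ch for ch, g in known_glyphs.items()}

  def recognize(box_pixels):
      ch = font_table.get(hash(box_pixels))
      # Only on a checksum hit do the full pixel-by-pixel comparison
      return ch if ch and known_glyphs[ch] == box_pixels else None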

bayindirh•5mo ago
Tesseract can do wonders for scanned paper (and web generated PDFs) both in its old and new version. If you want to pay for something closed, Prizmo on macOS is extremely good as well.

On the other hând, LLm5 are sl0wwer, moré resource hangry and l3ss accurale fr their outpu1z.

We shoulD stop gl0rıfying LLMs for 3verylhin9.

agentcoops•5mo ago
I've worked extensively with Tesseract, ABBYY, etc in a personal and professional context. Of course they work well for English-language documents without any complexity of layout that are scanned without the slightest defect. At this point, based on extensive testing for work, state of the art LLMs simply have better accuracy -- and an order of magnitude so if you have non-English documents with complex layouts and less than ideal scans. I'll give you speed, but the accuracy is so much greater (and the need for human intervention so much less) that in my experience it's a worthwhile trade-off.

I'm not saying this applies to you, but my sense from this thread is that many are comparing the results of tossing an image into a free ChatGPT session with an "OCR this document" prompt to a competent Tesseract-based tool... LLMs certainly don't solve any and every problem, but this should be based on real experiments. In fact, OCR is probably the main area where I've found them to simply be the best solution for a professional system.

privatelypublic•5mo ago
Yea. As usual, I inarticulately didn't make a good argument for my point. A tuned system with an optimized workflow will by far have the best results. And maybe LLMs will be a key resource in bringing OCR into usable/profitable areas.

But there's also a ton of "I don't want to deal with this" type work items that can't justify a full workflow build-out, but that LLMs get near enough to perfect to be "good enough." The bad part is, the LLMs don't explain to people the kinds of mistakes to expect from them.

jchw•5mo ago
A bit ago I tried throwing a couple of random simple Japanese comics (think 4koma but I don't think either of the ones I threw in were actually 4 panels) from Pixiv into Gemma 3b on AI studio.

- It transcribed all of the text, including speech, labels on objects, onomatopoeias in actions, etc. I did notice a kana was missing a diacritic in a transcription, so the transcriptions were not perfect, but pretty close actually. To my eye all of the kanji looked right. Latin characters already OCR pretty well, but at least in my experience other languages can be a struggle.

- It also, unprompted, correctly translated the fairly simple Japanese to English. I'm not an expert, but the translations looked good to me. Gemini 2.5 did the same, and while it had a slightly different translation, both of them were functionally identical, and similar to Google Translate.

- It also explained the jokes, the onomatopoeias, etc. To my ability to verify these things they seemed to be correct, though notably the Japanese onomatopoeias used for actions in comics are pretty diverse and not necessarily super well-documented. But contextually it seemed right.

To me this is interesting. I don't want to anthropomorphize the models (at least unduly, though I am describing the models as if they chose to do these things, since it's natural to do so) but the fact that even relatively small local models such as Gemma can perform tasks like this on arbitrary images with handwritten Japanese text bodes well. Traditional OCR struggles to find and recognize text that isn't English or is stylized/hand-written, and can't use context clues or its own "understanding" to fill in blanks where things are otherwise unreadable; at best they can take advantage of more basic statistics, which can take you quite far but won't get you to the same level of proficiency at the job as a human. vLLMs however definitely have an advantage in the amount of knowledge embedded within them, and can use that knowledge to cut through ambiguity. I believe this gets them closer.

I've messed around with using vLLMs for OCR tasks a few times primarily because I'm honestly just not very impressed with more traditional options like Tesseract, which sometimes need a lot of help even just to find the text you want to transcribe, depending on how ideal the case is.

On the scale of AI hype bullshit, the use case of image recognition and transcription is damn near zero. It really is actually useful here. Some studies have shown that vLLMs are "blind" in some ways (in that they can be made to fail by tricking them, like Photoshopping a cat to have an extra leg and asking how many legs the animal in the photo has; in this case the priors of the model from its training data work against it) and there are some other limitations (I think generally when you use AI for transcription it's hard to get spatial information about what is being recognized, though I think some techniques have been applied, like recursively cutting an image up and feeding it to try to refine bounding boxes) but the degree to which it works is, in my honest opinion, very impressive and very useful already.

I don't think that this demonstrates that basic PDF transcription, especially of cleanly-scanned documents, really needs large ML models... But on the other hand, large ML models can handle both easy and hard tasks here pretty well if you are working within their limitations.

Personally, I look forward to seeing more work done on this sort of thing. If it becomes reliable enough, it will be absurdly useful for both accessibility and breaking down language barriers; machine translation has traditionally been a bit limited in how well it can work on images, but I've found Gemini, and surprisingly often even Gemma, can make easy work of these tasks.

I agree these models are inefficient, I mean traditional OCR aside, our brains do similar tasks but burn less electricity and ostensibly need less training data (at least certainly less text) to do it. It certainly must be physically possible to make more efficient machines that can do these tasks with similar fidelity to what we have now.

agentcoops•5mo ago
100%. My sense is that many in this thread have never gone through the misery of trying to use classical OCR for non-English documents or where you can't control scan quality. I did a test recently with 18th-century German documents, written in a well-known and standardized but archaic script. The accuracy of classical models specifically trained on this corpus was an order of magnitude lower than GPT5. I haven't experimented personally or professionally with smaller models, but your experience makes me hopeful that we might even get this accurate OCR on phones sooner rather than later...
bugglebeetle•5mo ago
William Mattingly has been doing a lot of work on similar documents in an archival context with VLLMs. You should check in on their work:

https://x.com/wjb_mattingly

https://github.com/wjbmattingly

abnry•5mo ago
I would really like a tool to reliably get the title of PDF. It is not as easy as it seems. If the PDF exists online (say a paper or course notes) a bonus would be to find that or related metadata.
s0rce•5mo ago
Zotero does an ok job at this for papers.
treetalker•5mo ago
I presume this doesn't handle handwriting.

Does anyone have a suggestion for locally converting PDFs of handwriting into text, say on a recent Mac? Use case would be converting handwritten journals and daily note-taking.

ntnsndr•5mo ago
+1. I have tried a bunch of local models (albeit on the smaller end, b/c hardware limits), and I can't get handwriting recognition working yet. But online Gemini and Claude do great. Hoping the local models catch up soon, as this is a wonderful LLM use case.

UPDATE: I just tried this with the default model on handwriting, and IT WORKED. Took about 5-10 minutes on my laptop, but it worked. I am so thrilled not to have to send my personal jottings into the cloud!

simonw•5mo ago
This one should handle handwriting - it's using Qwen 2.5 VL which is a vision LLM that is very good at handwritten text.
nawazgafar•5mo ago
Author here, I tested it with this PDF of a handwritten doc [1], and it converted both pages accurately.

1. https://github.com/pnshiralkar/text-to-handwriting/blob/mast...

treetalker•5mo ago
Amazing, can't wait to try it!

FYI, your GitHub link tells me it's unable to render because the pdf is invalid.

password4321•5mo ago
I don't know re: handwriting so only barely relevant but here is a new contender for a CLI "OCR Tool using Apple's Vision Framework API": https://github.com/riddleling/macocr which I found while searching for this recent discussion:

My iPhone 8 Refuses to Die: Now It's a Solar-Powered Vision OCR Server

https://news.ycombinator.com/item?id=44310944

phren0logy•5mo ago
If you use Docling, you can set your OCR engine to OCRMac then set it to use LiveText. It’s a good arrangement. You can send these as command-line arguments, but I generally configure it from the Python API.
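
Roughly like this from the Python API (a sketch from memory; treat the option and field names as assumptions):

  from docling.document_converter import DocumentConverter, PdfFormatOption
  from docling.datamodel.base_models import InputFormat
  from docling.datamodel.pipeline_options import PdfPipelineOptions, OcrMacOptions

  pipeline = PdfPipelineOptions(
      do_ocr=True,
      ocr_options=OcrMacOptions(framework="livetext"),  # Apple LiveText backend
  )
  converter = DocumentConverter(
      format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
  )
  print(converter.convert("scanned.pdf").document.export_to_markdown())
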
fcoury•5mo ago
I really wanted this to be good. Unfortunately, on a page containing a table that's usually very hard for converters to handle, I got a full page with "! Picture 1:" and nothing else. On top of that, it hung at page 17 of a 25-page document and never resumed.
nawazgafar•5mo ago
Author here, that sucks. I'd love to recreate this locally. Would you be willing to share the PDF?
threeducks•5mo ago
As far as I am aware, the "hanging" issue remains unsolved to this day. The underlying problem is that LLMs sometimes get stuck in a loop where they repeat the same text again and again until they reach the token limit. You can break the loop by setting a repeat penalty, but when your image contains repeated text, such as in tables, the LLM will output incorrect results to prevent repetition.

Here is the corresponding GitHub issue for your default model (Qwen2.5-VL):

https://github.com/QwenLM/Qwen2.5-VL/issues/241

You can mitigate the fallout of this repetition issue to some degree by chopping up each page into smaller pieces (paragraphs, tables, images, etc.) with a page layout model. Then at least only part of the text is broken instead of the entire page.

A better solution might be to train a model to estimate a heat map of character density for a page of text. Then, condition the vision-language model on character density by feeding the density to the vision encoder. Also output character coordinates, which can be used with the heat map to adjust token probabilities.
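
For what it's worth, the sampling-level mitigation looks like this with the Ollama Python client (a sketch; the model tag, file name, and option values are assumptions):

  import ollama

  resp = ollama.chat(
      model="qwen2.5vl",
      messages=[{"role": "user",
                 "content": "Transcribe this page.",
                 "images": ["page_17.png"]}],
      options={
          "repeat_penalty": 1.15,  # discourages the runaway loop, but can
                                   # corrupt genuinely repeated table text
          "num_predict": 4096,     # caps output so a loop can't hang forever
      },
  )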

thorum•5mo ago
I’ve been trying to convert a dense 60 page paper document to Markdown today from photos taken on my iPhone. I know this is probably not the best way to do it but it’s still been surprising to find that even the latest cloud models are struggling to process many of the pages. Lots of hallucination and “I can’t see the text” (when the photo is perfectly clear). Lots of retrying different models, switching between LLMs and old fashioned OCR, reading and correcting mistakes myself. It’s still faster than doing the whole transcription manually but I thought the tech was further along.
bugglebeetle•5mo ago
Try this:

https://github.com/rednote-hilab/dots.ocr

mdaniel•5mo ago
The code is MIT, and the weights are labeled MIT although the actual license file in the weights repo seems to be mostly Apache 2 https://huggingface.co/rednote-hilab/dots.ocr/blob/main/NOTI...

Seems to weigh about 6GB which feels reasonable to manage locally

cronoz30•5mo ago
Does this work with images embedded in the PDF and rasterized images?
kaycey2022•5mo ago
It converts each page into an image and feeds it to Qwen2.5VL, so it should be fine.
ahmedhawas123•5mo ago
This may be a bit of an irrelevant and at best imaginative rant, but there is no shortage of solutions that are mediocre or near perfect for specific use cases out there to parse PDFs. This is a great addition to that.

That said, over the last two years I've come across many use cases for parsing PDFs, and each has its own requirements (e.g., figuring out titles, removing page numbers, extracting specific sections, etc.). And each requires a different approach.

My point is, this is awesome, but I wonder if there needs to be a broader push / initiative to stop leveraging PDFs so much when things like HTML, XML, JSON and a million other formats exist. It's a hard undertaking I know, no doubt, but it's not unheard of to drop technologies (e.g., fax) for a better technology.

mdaniel•5mo ago
That ship has sailed, and I'd guess the majority of the folks in these threads are in the same boat I am: one does not get to choose what files your customers send you, you have to meet them where they are
bm-rf•5mo ago
For the purposes of an LLM "reading" a PDF, it just renders it as an image; the file format does not matter. For documents that already exist, a robust OCR solution that can handle tables and diagrams could be very valuable.
evolve2k•5mo ago
“Turn images and diagrams into detailed text descriptions.”

I’d just prefer that any images and diagrams are copied over, and rendered into a popular format like markdown.

ggnore7452•5mo ago
I’ve done a similar PDF → Markdown workflow.

For each page:

- Extract text as usual.

- Capture the whole page as an image (~200 DPI).

- Optionally extract images/graphs within the page and include them in the same LLM call.

- Optionally add a bit of context from neighboring pages.

Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.

At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.

Yeah, it's a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement, and quite flexible and powerful. It works across almost any format, Markdown is both AI- and human-friendly, and it's surprisingly maintainable.
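
A rough sketch of that per-page loop (PyMuPDF for the text and the ~200 DPI render, with a local model standing in for the cloud ones; the prompt and model tag are assumptions):

  import fitz  # PyMuPDF
  import ollama

  PROMPT = ("Convert this page to Markdown. Use the extracted text below to "
            "resolve OCR ambiguities; describe any figures.\n\n")

  def pdf_to_markdown(path):
      md = []
      for page in fitz.open(path):
          raw_text = page.get_text()          # step 1: plain text extraction
          pix = page.get_pixmap(dpi=200)      # step 2: ~200 DPI page image
          resp = ollama.chat(
              model="qwen2.5vl",
              messages=[{"role": "user",
                         "content": PROMPT + raw_text,
                         "images": [pix.tobytes("png")]}],
          )
          md.append(resp["message"]["content"])
      return "\n\n".join(md)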

GaggiX•5mo ago
>are cheap and strong enough to make this practical.

It all depends on the scale you need them, with the API it's easy to generate millions of tokens without thinking.

rdos•5mo ago
In that case you should run a model locally, this one for example: https://huggingface.co/ds4sd/docling-models
agentcoops•5mo ago
You don't need full reasoning to get accurate results, so even with GPT5 it's still pretty cheap for a one-time job and easy to reason about costs. It's certainly cheaper if you have data where reliability is key and classical OCR will undoubtedly require some manual data cleaning...

I can recommend the Mistral OCR API [1] if you have large jobs and don't want to think about it too much.

[1] https://mistral.ai/solutions/document-ai

KnuthIsGod•5mo ago
Sub-2010 level OCR using LLM.

It is hype-compatible so it is good.

It is AI so it is good.

It is blockchain so it is good.

It is cloud so it is good.

It is virtual so it is good.

It is UML so it is good.

It is RPN so it is good.

It is a steam engine so it is good.

Yawn...

GaggiX•5mo ago
>Sub-2010 level OCR

It's not.

deepsquirrelnet•5mo ago
Give the nanonets-ocr-s model a try. It’s a fine tune of Qwen 2.5 vl which I’ve had good success with for markdown and latex with image captioning. It uses a simple tagging scheme for page numbers, captions and tables.
captainregex•5mo ago
I desperately wanted Qwen VL to work, but it just unleashes rambling hallucinations off basic screencaps. Going to try nanonets!
davidwritesbugs•5mo ago
I've tried nanonets but it seems very sensitive to the prompt, changing it slightly turned the output to rubbish. When it worked it was pretty good.
deepsquirrelnet•5mo ago
This is true. It's not meant to be run with any prompt but the one they trained with. I found that out as well. It's only meant for OCR. Qwen 2.5 VL is better if you need that option.
constantinum•5mo ago
Other tools worthy of mention that help with OCR'ing PDF/Scans to markdown/layout-preserved text:

LLMWhisperer(from Unstract), Docling(IBM), Marker(Surya OCR), Nougat(Facebook Research), Llamaparse.

Areibman•5mo ago
Similar project used to organize PDFs with Ollama https://github.com/iyaja/llama-fs
ekianjo•5mo ago
Careful if you plan on using this: it leverages PyMuPDF, which is AGPL.
pyuser583•5mo ago
It seems we've entered the "AI is local" phase.
visarga•5mo ago
The crucial information is missing - accuracy comparison with other OCR providers. From my experience LLM based OCR might misread the layout and hallucinate values, it is very subtle but sometimes critically wrong. Classical OCR has more precision but doesn't get the layout at all. Combining both has other issues, no approach is 100% reliable.
WithinReason•5mo ago
Breaking up the page, feeding the pieces one-by-one and reassembling the output helps with that. I was expecting this project to do that but it can only feed a whole page.
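
The chopping itself is simple; a sketch with Pillow (strip count and overlap are arbitrary choices):

  from PIL import Image

  def page_strips(img_path, n=4, overlap=40):
      # Cut the page into n horizontal strips with a little overlap so a
      # line sliced in half appears intact in at least one strip
      img = Image.open(img_path)
      w, h = img.size
      step = h // n
      return [img.crop((0, max(0, i * step - overlap),
                        w, min(h, (i + 1) * step + overlap)))
              for i in range(n)]

  # Transcribe each strip separately, then dedupe the overlap on reassembly.
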
worldsayshi•5mo ago
Yes, I tried using an LLM for reading CVs a while back, and I really struggled to get it not to omit important information.
agentcoops•5mo ago
Have you evaluated this lately? Last year or even just earlier this year I would have mostly agreed with you. At this point, however, with at least the documents I have been working on, OCR reliability with GPT5 or Mistral OCR [1] has been much better than even domain-trained classical OCR. If the documents have even slightly complex layout (to say nothing of page numbers or page headings or an uncommon font), the accuracy of state of the art LLMs has been in my work an order of magnitude greater. The ability to have the LLM tentatively combine trailing sentences across pages, which is especially useful if you have to work with documents in German say, is invaluable.

[1] https://mistral.ai/news/mistral-ocr

zarzavat•5mo ago
I asked GPT-5 to OCR a table for me the other day, it hallucinated perhaps 10% of the values. This was a screenshot of a spreadsheet, with large font, not challenging except for the layout.

What's interesting is that I asked it to also read the background colors of the cells and it did much worse on that task.

I believe these models could be useful for a first pass if you are willing to manually review everything they output, but the failure mode is unsettling.

smusamashah•5mo ago
Any tool that takes a scanned PDF, then overlays the OCRed text on the scan so that the text becomes searchable?
Xmd5a•5mo ago
https://github.com/ocrmypdf/OCRmyPDF

>OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

I ... I nailed it.

quibono•5mo ago
Just a note that OCRmyPDF currently uses Tesseract
philips•5mo ago
It would be nice to provide a way to edit the prompt. I have a use case where I need to extract tabular handwritten data from PDFs scanned with a phone and I don't want it to extract the printed instructions on the form, etc.

I have a very similar Go script that does this. My prompt: Create a CSV of the handwritten text in the table. Include the package number on each line. Only output a CSV.

cast42•5mo ago
It would be nice to see how it performs on this benchmark: https://github.com/opendatalab/OmniDocBench
jonwinstanley•5mo ago
What else can be hooked up to Ollama? Can Cursor use it yet?
AmazingTurtle•5mo ago
Yet another Prompt Wrapper

TRANSCRIPTION_PROMPT = """Task: Transcribe the page from the provided book image.

- Reproduce the text exactly as it appears, without adding or omitting anything.
- Use Markdown syntax to preserve the original formatting (e.g., headings, bold, italics, lists).
- Do not include triple backticks (```) or any other code block markers in your response, unless the page contains code.
- Do not include any headers or footers (for example, page numbers).
- If the page contains an image, or a diagram, describe it in detail. Enclose the description in an <image> tag. For example:

<image> This is an image of a cat. </image>
"""

edwinjones•5mo ago
Agreed, it really looks like quite a small prompt wrapper: https://github.com/ngafar/llama-scan/blob/main/llama_scan/co...

The URL to connect to Ollama seems to be hard-coded, so I don't see why you couldn't point this at a different machine on your network rather than having Ollama running locally on every machine you need this for, as the readme implies.
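
Pointing it at a remote machine is a one-line change with the Ollama client (a sketch; the host and model tag are placeholders):

  from ollama import Client

  # Talk to a beefier box on the LAN instead of localhost
  client = Client(host="http://192.168.1.50:11434")
  resp = client.chat(model="qwen2.5vl",
                     messages=[{"role": "user",
                                "content": "Transcribe this page.",
                                "images": ["page.png"]}])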

fforflo•5mo ago
If you're interested in this sort of thing with an SQL flavor, you may find the pgpdf PostgreSQL extension useful https://github.com/Florents-Tselai/pgpdf .

It's basically an SQL wrapper around poppler.

jdthedisciple•5mo ago
If it's not nearly as accurate as state-of-the-art OCR (such as GPT's), then I'm not sure being offline is worth the tradeoff for me personally.

I'm personally on the lookout for the absolute best multilingual OCR performance, local or not, cost what it may (almost).