OpenDataLoader-PDF: An open source tool for structured PDF parsing

https://github.com/opendataloader-project/opendataloader-pdf
46•phobos44•4h ago

Comments

clueless•2h ago
Given current LLM context-size limits, what is the state of the art for feeding large documents or text blobs into an LLM for accurate processing?
simonw•2h ago
The current generation of models all support pretty long context now. The Gemini family has had 1M tokens for over a year and GPT-4.1 is 1M; interestingly, GPT-5 is back down to 400,000. Claude 4 is 200,000, but there's a mode of Claude Sonnet 4 that can do 1M as well.

The bigger question is how well they perform. There are needle-in-a-haystack benchmarks that test that, and models are mostly scoring quite highly on those now.

https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.

Here are a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/
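As a rough illustration of the context sizes quoted above, here is a minimal sketch that checks whether a document would fit and splits it into overlapping chunks when it does not. The 4-characters-per-token ratio, the reserve size, and the function names are all assumptions made for illustration; a real pipeline would use the model's own tokenizer.

```python
# Context limits in tokens, as quoted in the comment above.
# (The 1M mode of Claude Sonnet 4 is left out; 200,000 is the figure given for Claude 4.)
CONTEXT_LIMITS = {
    "Gemini": 1_000_000,
    "GPT-4.1": 1_000_000,
    "GPT-5": 400_000,
    "Claude 4": 200_000,
}

CHARS_PER_TOKEN = 4  # assumption: crude heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate: ceiling of len(text) / CHARS_PER_TOKEN."""
    return -(-len(text) // CHARS_PER_TOKEN)


def models_that_fit(text: str, reserve: int = 8_000) -> list[str]:
    """Models whose quoted limit covers the text plus a reserve for prompt and reply."""
    needed = estimate_tokens(text) + reserve
    return [name for name, limit in CONTEXT_LIMITS.items() if needed <= limit]


def chunk_text(text: str, max_tokens: int, overlap_tokens: int = 0) -> list[str]:
    """Split text into character windows of about max_tokens, with optional overlap."""
    if max_tokens <= 0 or not 0 <= overlap_tokens < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap_tokens < max_tokens")
    size = max_tokens * CHARS_PER_TOKEN
    step = (max_tokens - overlap_tokens) * CHARS_PER_TOKEN
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

Fitting inside the window says nothing about answer quality, which is what the benchmarks linked above try to measure.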

clueless•1h ago
Sorry, I should have been clearer: I meant open-source LLMs. I guess the question is, how are closed-source LLMs doing it so well? And if OS OpenNote is the best we have...
lysecret•1h ago
I generally use 2.5 Flash for this; it works incredibly well. So many traditionally hard things can now be solved by stuffing them into a pretty cheap LLM, haha.
mekael•32m ago
What do you mean by “traditionally hard” in relation to a PDF? Most if not all of the docs I’m tasked with parsing are secured, flattened, and handwritten, which can cause any tool (traditional or AI) to require a confidence score and manual intervention. It might also be that I just get stuck with the edge cases 90% of the time.
trevor-e•2h ago
I've been thinking lately that maybe we need a new AI-friendly file format rather than continuing to hack on top of PDF's complicated spec. PDF was designed for consistent, portable page rendering; being easy to parse was not a goal, afaik, which is why we have to jump through these crazy hoops. If you've ever looked at how text is stored internally in a PDF, this becomes immediately obvious.

I've been toying with the idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still achieve fairly consistent rendering. This format could be easily converted to PDF, although the opposite conversion would face the usual challenges. The main challenge is distribution, of course.
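To illustrate the point about how PDF stores text internally, here is a small sketch. The sample content stream is made up, but it uses real PDF text operators: `Tj` shows one string, and `TJ` shows an array of string fragments interleaved with kerning adjustments, so a single phrase can arrive in pieces. The parser and its names are assumptions for illustration only; it ignores escape sequences, compressed streams, and font encodings, all of which a real extractor has to handle.

```python
import re

# Hypothetical, uncompressed content stream for one page.
SAMPLE_STREAM = """
BT
/F1 12 Tf
72 720 Td
[(Acc) -20 (ount st) 15 (atement)] TJ
0 -14 Td
(Balance) Tj
ET
"""

# Either a TJ array "[ ... ] TJ" or a single string "( ... ) Tj".
TEXT_OP = re.compile(r"\[(?P<array>.*?)\]\s*TJ|\((?P<single>[^()]*)\)\s*Tj")
FRAGMENT = re.compile(r"\(([^()]*)\)")


def text_runs(stream: str) -> list[list[str]]:
    """Return the string fragments of each text-showing operator, in stream order."""
    runs = []
    for match in TEXT_OP.finditer(stream):
        if match.group("array") is not None:
            # Keep the strings, drop the numeric kerning adjustments between them.
            runs.append(FRAGMENT.findall(match.group("array")))
        else:
            runs.append([match.group("single")])
    return runs


if __name__ == "__main__":
    for fragments in text_runs(SAMPLE_STREAM):
        print(fragments, "->", "".join(fragments))
```

Nothing in the stream says where words, lines, or table cells begin and end; a parser has to infer that from positions, which is where the hoops come from.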

Jaxan•1h ago
Wouldn’t it be better to invest in a human-friendly format first (which could also be AI-friendly)?
trevor-e•41m ago
Not really sure what you mean by a "human-friendly" file format; can you elaborate? File formats are inherently not friendly to humans: they are a bag of bytes. But that doesn't mean they can't be better consumed by tools, which is what I mean by "AI friendly".
fedeb95•1h ago
Very cool. I'll probably use it, but not for AI. I have lots of PDFs for which an EPUB doesn't exist.

Or if anything I'll add it to the projects-that-already-do-this-but-havent-yet-found list.

agsqwe•47m ago
How does it compare to docling?
emilburzo•38m ago
I just tested it on one of my nemeses: PDF bank statements. They're surprisingly tough to work with if you want to get clean, structured transaction data out of them.

The JSON extract actually looks pretty good and seems to produce something usable in one shot, which is very good compared to all the other tools I've tried so far, but I still need to check it more in-depth.

Sharing here in case someone chimes in with "hey, doofus, $magic_project already solves this."

constantinum•13m ago
There is also Unstract, which is open source: structured data extraction plus ETL. https://github.com/Zipstack/unstract
