
OpenDataLoader-PDF: An open source tool for structured PDF parsing

https://github.com/opendataloader-project/opendataloader-pdf
109•phobos44•4mo ago

Comments

clueless•4mo ago
Given the current LLM context size limitations, what is the state of the art for feeding large doc/text blobs into an LLM for accurate processing?
simonw•4mo ago
The current generation of models all support pretty long context now - the Gemini family has had 1m tokens for over a year, GPT-4.1 is 1m, interestingly GPT-5 is back down to 400,000, Claude 4 is 200,000 but there's a mode of Claude Sonnet 4 that can do 1m as well.

The bigger question is how well they perform - there are needle-in-haystack benchmarks that test that, they're mostly scoring quite highly on those now.

https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.

Here's a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/

clueless•4mo ago
Sorry, I should have been more clear: I meant open source LLMs. I guess the question is, how are closed source LLMs doing it so well? And if OS OpenNote is the best we have...
simonw•4mo ago
Mainly I think it's that you need a LOT of VRAM to handle long context - server-class hardware is pretty much a requirement to work with more than ~10,000 tokens.
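
A rough back-of-the-envelope sketch of why context length eats memory: the key/value cache grows linearly with the number of tokens, on top of the model weights. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is a hypothetical configuration picked for illustration, not the spec of any model mentioned in this thread, and real runtimes add their own overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Estimate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values at every layer.
    bytes_per_value=2 assumes a 16-bit cache (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (10_000, 200_000):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens)
    print(f"{tokens:>7} tokens -> {size / 2**30:.2f} GiB of KV cache")
```

Under these assumptions the cache alone is about 1.22 GiB at 10,000 tokens and about 24.41 GiB at 200,000 tokens, before counting the weights.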
ranger_danger•4mo ago
On my i9 desktop with 128GB RAM and only 8GB VRAM, using llama.cpp I can split the work between both CPU/GPU and get the max 200k context to run on Qwen3 at a decent (human-reading) speed.
lysecret•4mo ago
Generally use 2.5 flash for this, works incredibly well. So many traditionally hard things can now be solved by stuffing them into a pretty cheap LLM haha.
mekael•4mo ago
What do you mean by “traditionally hard” in relation to a PDF? Most if not all of the docs I’m tasked with parsing are secured, flattened, and handwritten, which can cause any tool (traditional or AI) to require a confidence score and manual intervention. It also might be that I just get stuck with the edge cases 90% of the time.
trevor-e•4mo ago
I've been thinking lately that maybe we need a new AI-friendly file format rather than continuing to hack on top of PDF's complicated spec. PDF was designed to have consistent and portable page display rendering, it was not a goal for it to be easily parseable afaik, which is why we have to go through these crazy hoops. If you've ever looked at how text is stored internally in PDF this becomes immediately obvious.

I've been toying with an idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still achieve fairly consistent rendering. This format could be easily converted to PDF, although the opposite conversion would have the regular challenges. The main challenge is distribution of course.

Jaxan•4mo ago
Wouldn’t it be better to invest in a human-friendly format first (which could also be AI-friendly)?
trevor-e•4mo ago
Not really sure what you mean by a "human-friendly" file format, can you elaborate? File formats are inherently not friendly to humans, they are a bag of bytes. But that doesn't mean they can't be better consumed by tools which is what I mean by "AI friendly".
dotancohen•4mo ago
If you can convince your bank to make available your bank statement in Markdown, let us know.

Your transactions are probably already available in CSV.

s0rce•4mo ago
Doesn't Latex do this?
trevor-e•4mo ago
Yea I think Latex is capable of much of this but it's also cursed
s0rce•4mo ago
Don't need to convince me. I typeset my wife's PhD thesis in LaTeX and it looks great, but it was so frustrating that afterwards I did mine in Word.
kykat•4mo ago
Sounds like you want XML
fedeb95•4mo ago
Very cool. I'll probably use it, but not for AI. I have lots of pdfs for which an epub doesn't exist.

Or if anything I'll add it to the projects-that-already-do-this-but-havent-yet-found list.

agsqwe•4mo ago
How does it compare to docling?
favorited•4mo ago
Docling primarily uses AI models to extract PDF content; this project looks like it uses a custom parser written in Java, built atop veraPDF.
brumar•4mo ago
Correct me if I am wrong, but Docling can do both. Among other strategies, it also has a non-AI pipeline to determine the layout (based on qpdf I believe). So these projects are not that different.
favorited•4mo ago
While it has a PDF parser, my understanding is that it is mainly used to break a PDF document into chunks, which are then handed off to various specialized models. From its docs: "The main purpose of Docling is to run local models which are not sharing any user data with remote services."
emilburzo•4mo ago
I just tested it on one of my nemeses: PDF bank statements. They're surprisingly tough to work with if you want to get clean, structured transaction data out of them.

The JSON extract actually looks pretty good and seems to produce something usable in one shot, which is very good compared to all the other tools I've tried so far, but I still need to check it more in-depth.

Sharing here in case someone chimes in with "hey, doofus, $magic_project already solves this."

dleeftink•4mo ago
For 'zoned' extraction, Cermine[0] may be of use as a pre-processing step. Mileage may vary as it's tailored towards papers.

[0]: http://cermine.ceon.pl/about.html

vortex_ape•4mo ago
Camelot[1] worked very well for me with bank statements. Disclaimer: I'm one of the core contributors.

[1] https://github.com/camelot-dev/camelot

constantinum•4mo ago
There is also Unstract, which is open source. Structured data extraction + ETL. https://github.com/Zipstack/unstract
hermitcrab•4mo ago
I got excited until I read that it was Java/Python based.

I'm looking for a library that can extract data tables from PDF and can be called from a C++ program (for https://www.easydatatransform.com). If anyone can suggest something, I'm all ears.

therealpygon•4mo ago
What makes Java/Python not able to be called from C++, or did you mean you have other requirements that make the project unsuitable?
hermitcrab•4mo ago
I can fire up a Java program in a separate process. But it is slow and passing data backwards and forwards is clunky. Much better to be able to do it all in one process.
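
A minimal sketch of that separate-process pattern, to show where the clunkiness comes from: every call pays for process startup, and data has to be serialized on the way in and parsed on the way out. It is written in Python for brevity, and the child process here is a tiny stand-in script launched with the current interpreter, not the actual Java tool; `CHILD_CODE` and `extract_via_subprocess` are hypothetical names.

```python
import json
import subprocess
import sys

# Stand-in for an external extractor: reads text on stdin, prints JSON on
# stdout. A real setup would launch the Java program here instead.
CHILD_CODE = (
    "import json, sys; "
    "rows = [line.split(',') for line in sys.stdin.read().splitlines()]; "
    "print(json.dumps({'rows': rows}))"
)


def extract_via_subprocess(payload: str) -> dict:
    """Send payload to a child process and parse the JSON it writes back."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=payload,        # serialized input goes over stdin
        capture_output=True,  # collect stdout/stderr from the child
        text=True,
        check=True,           # raise if the child exits with an error
    )
    return json.loads(result.stdout)  # parse the serialized result


tables = extract_via_subprocess("date,amount\n2024-01-02,-12.50\n")
print(tables["rows"])
```

An in-process library call skips both the startup cost and the serialize/parse round trip, which is the appeal of doing it all in one process.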
4d66ba06•4mo ago
Just finished migrating to it to replace pdf2docx in a new project I’ve been working on and it is so much better. Thanks for open sourcing OpenDataLoader-PDF!