Show HN: doc2dict, a fast, open-source document-to-dict converter – No AI

3•jgfriedman1999•3h ago
doc2dict is a Python package that converts HTML and PDF documents into dictionaries while preserving their hierarchy. It also supports table extraction for HTML files. https://github.com/john-friedman/doc2dict

Speed:

* HTML: 500 pages per second, single-threaded.

* PDF: 200 pages per second. The PDF must have an underlying text structure. Multithreading is not possible due to the limitations of PDFium.

Here's an example output from Microsoft's Annual Report:

    "title": "PART I",
    "standardized_title": "parti",
    "class": "part",
    "contents": {
      "38": {
        "title": "ITEM 1. BUSINESS",
        "standardized_title": "item1",
        "class": "item",
        "contents": {
          "39": {
            "title": "GENERAL",
            "standardized_title": "",
            "class": "predicted header",
            "contents": {
              "40": {
                "title": "Embracing Our Future",
                "standardized_title": "",
                "class": "predicted header",
                "contents": {
                  "41": {
                    "text": "Microsoft is a technolo...

Raw: https://html-preview.github.io/?url=https://raw.githubuserco...

Parsed dictionary: https://github.com/john-friedman/doc2dict/blob/main/example_...

Simple description of algorithm:

* Take a complicated document, such as a PDF or HTML file, and create a simplified representation of it: a list of lists of dicts, where each dict is a text block with key features such as "bold", "font-size", etc., and each inner list represents a new HTML block or a line in a PDF.

* Convert the simplified representation to a dictionary using a set of predetermined rules, e.g. smaller font-size for a heading means it should be nested under the larger font-size heading.
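
To make the two steps concrete, here is a minimal sketch of the idea, not doc2dict's actual implementation. The sample data, the `nest_by_font_size` function, the `body_size` threshold, and the "bold and larger than body text means header" rule are all assumptions made for illustration. Only the "bold" and "font-size" keys and the title/contents/text output shape come from the post above.

```python
import json

# Hypothetical simplified representation: one inner list per line,
# one dict per text block. The sample content is made up.
SAMPLE_LINES = [
    [{"text": "Chapter 1", "bold": True, "font-size": 16}],
    [{"text": "Section 1.1", "bold": True, "font-size": 14}],
    [{"text": "First paragraph.", "bold": False, "font-size": 10}],
    [{"text": "Section 1.2", "bold": True, "font-size": 14}],
    [
        {"text": "Second", "bold": False, "font-size": 10},
        {"text": "paragraph.", "bold": False, "font-size": 10},
    ],
]


def nest_by_font_size(lines, body_size=10):
    """Nest lines into a dict: smaller headers go under larger ones.

    Assumed rule (illustrative only): a line is a header when every
    block is bold and its font size exceeds body_size.
    """
    root = {"contents": {}}
    # Stack of (font size, node); the root can never be popped.
    stack = [(float("inf"), root)]
    for index, line in enumerate(lines):
        text = " ".join(block["text"] for block in line)
        size = max(block["font-size"] for block in line)
        is_header = size > body_size and all(block["bold"] for block in line)
        if not is_header:
            # Plain text attaches to the most recent header.
            stack[-1][1]["contents"][str(index)] = {"text": text}
            continue
        # Close headers that are the same size or smaller than this one.
        while stack[-1][0] <= size:
            stack.pop()
        node = {"title": text, "contents": {}}
        stack[-1][1]["contents"][str(index)] = node
        stack.append((size, node))
    return root["contents"]


tree = nest_by_font_size(SAMPLE_LINES)
print(json.dumps(tree, indent=2))
```

In this sketch both "Section" headers end up under "Chapter 1", and each paragraph ends up under the section that precedes it. A stack keeps the pass linear in the number of lines, which is one plausible way to stay fast without any model inference.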

Note that I am working on making the last part more modular by creating predetermined instructions that users can tweak for their use-case without rewriting the parser. I call these "mapping dicts".
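
As a rough illustration of what rules-as-data could look like, here is a hypothetical sketch. It is not doc2dict's mapping dict format, which the post does not specify. The `MAPPING` structure, the `classify_header` function, the regex patterns, and the `level` field are all invented for this example. Only the class names "part", "item", and "predicted header" are taken from the example output above.

```python
import re

# Hypothetical mapping dict: rules are plain data, so a user could
# edit them without touching the parser code.
MAPPING = {
    "rules": [
        {"class": "part", "pattern": r"^PART\s+[IVX]+\b", "level": 0},
        {"class": "item", "pattern": r"^ITEM\s+\d+[A-Z]?\.", "level": 1},
    ],
    # Fallback for headers that match no explicit rule.
    "default": {"class": "predicted header", "level": 2},
}


def classify_header(title, mapping=MAPPING):
    """Return the class and level of the first rule matching the title."""
    for rule in mapping["rules"]:
        if re.match(rule["pattern"], title, flags=re.IGNORECASE):
            return {"class": rule["class"], "level": rule["level"]}
    return dict(mapping["default"])


for title in ["PART I", "ITEM 1. BUSINESS", "GENERAL"]:
    print(title, "->", classify_header(title))
```

Swapping in a different rule list would adapt the same parser loop to another document type, which is the kind of tweakability the note describes.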

doc2dict also includes visualization tools for the debugging process:

* visualize simplified representation https://html-preview.github.io/?url=https://github.com/john-...

* visualize output dictionary https://html-preview.github.io/?url=https://github.com/john-...

Why I made this: I'm currently working on another open-source Python package to make it easy to exploit Securities and Exchange Commission data. Writing a generalized document parser that can be tweaked is easier than writing 100 or so specialized parsers, one for each document type.

Also, converting HTML and PDF files to a dictionary representation reduces document size by a factor of 10 or so. I'm not sure what I can do with that yet, but I'm planning some fun NoSQL database experiments.

Link to other package (datamule) https://github.com/john-friedman/datamule-python

Is It MTU?

https://isitmtu.com/
1•thebetatester•40s ago•0 comments

Ask HN: Work on a human only internet? Are you involved/know of anything?

1•t0lo•3m ago•0 comments

512K Day: The Day the Internet Almost Broke [video]

https://www.youtube.com/watch?v=4r5IStRaG4E
1•kristofferR•4m ago•0 comments

The Artilect War – Cosmists vs. Terrans (2005) [pdf]

https://avalonlibrary.net/ebooks/Hugo%20de%20Garis%20-%20The%20Artilect%20War%20-%20Cosmists%20vs.%20Terrans.pdf
1•droideqa•6m ago•1 comments

Modification of Acetaminophen To Reduce Liver Toxicity And Enhance Drug Efficacy

https://www.societyforscience.org/regeneron-sts/2025-student-finalists/chloe-lee/
2•felineflock•6m ago•0 comments

The Florida tech scene has two flavors Bleak or Shady

1•greyjoyduck•7m ago•0 comments

Manifest: Startup Pitch Competition, Night Market, & Career Fair

https://news.manifold.markets/p/manifest-startup-pitch-competition
2•DavidChee•9m ago•0 comments

Star IT

https://github.com/magical-paperclip/sics-ground
1•magi-clip•9m ago•0 comments

The Challenge (2023 film)

https://en.wikipedia.org/wiki/The_Challenge_(2023_film)
3•bookmtn•12m ago•1 comments

What football will look like in the future

https://www.sbnation.com/a/17776-football
1•ajdude•14m ago•0 comments

Preparedness Paradox

https://en.wikipedia.org/wiki/Preparedness_paradox
2•JoshTriplett•15m ago•1 comments

Show HN: Outcry – Meta-Strategic AI for Activists and Protest Innovators

https://www.outcryai.com/
1•micahwhite•18m ago•0 comments

No, AI is not replacing DevOps engineers

https://old.reddit.com/r/Terraform/comments/1ktymm4/no_ai_is_not_replacing_devops_engineers/
1•iacguy•18m ago•0 comments

Show HN: GremlinGPT – Local Self-Evolving AI (No Cloud, No APIs, Just Autonomy)

https://github.com/statikfintechllc/AscendAI
1•GremlinGPT•19m ago•0 comments

Largest Hackathon Presented by Bolt

https://worldslargesthackathon.devpost.com/
2•danboarder•21m ago•0 comments

The Way of Code: The Timeless Art of Vibe Coding

https://www.thewayofcode.com/
3•CharlesW•24m ago•0 comments

Language and LLMs = Expression, Not Intelligence

https://karlvmuller.com/posts/llms-are-expression-not-intelligence/
1•KarlVM12•25m ago•0 comments

Unbundling ChatGPT in Waves

1•chany2•28m ago•0 comments

Ask HN: Which agentic framework/tool do you prefer and why?

1•baby•35m ago•0 comments

Digg co-founder offers to save Pocket as Mozilla winds it down

https://9to5mac.com/2025/05/23/digg-offers-to-save-pocket/
5•AdmiralAsshat•35m ago•0 comments

The End of California's Green Car Dream

https://www.carsandhorsepower.com/news/the-us-senate-s-vote-to-repeal-california-s-gas-powered-car-ban-a-clash-of-ideologies-industries-and-federalism
2•Anumbia•39m ago•2 comments

Deadlocks in Go: the dark side of concurrency (2021)

https://www.craig-wood.com/nick/articles/deadlocks-in-go/
2•leonidasv•45m ago•0 comments

Meta promised $1B for affordable housing. Then it walked away

https://www.mercurynews.com/2025/05/20/meta-facebook-billion-housing/
9•donsupreme•49m ago•1 comments

Show HN: FFmpeg Playground

https://bigballi.com/ffmpegPlayground
2•BigBalli•51m ago•0 comments

The Dying Art of Carb Tuning

https://www.carsandhorsepower.com/featured/carburetors-are-better-than-fuel-injection-fight-me
2•Anumbia•52m ago•1 comments

More than 100 National Security Council staffers put on administrative leave

https://www.cnn.com/2025/05/23/politics/national-security-council-administrative-leave-trump
7•anigbrowl•56m ago•0 comments

The mother who never stopped believing her son was still there

https://www.theatlantic.com/magazine/archive/2025/06/brain-injury-consciousness-science/682579/
2•namanyayg•57m ago•0 comments

Personalized software is coming, but not today. Maybe tomorrow?

https://mattsayar.com/personalized-software-really-is-coming-but-not-today-maybe-tomorrow/
2•namanyayg•57m ago•0 comments

The Demise of Silicon Valley Bank (2023)

https://www.netinterest.co/p/the-demise-of-silicon-valley-bank
1•namanyayg•57m ago•0 comments

Google DeepMind's Demis Hassabis on AGI, Innovation and More

https://www.nytimes.com/2025/05/23/podcasts/google-ai-demis-hassabis-hard-fork.html
2•kjhughes•57m ago•1 comments