Show HN: Doc2dict a fast, open-source document to dict converter – No AI

4•jgfriedman1999•8mo ago
doc2dict is a Python package that converts HTML and PDF documents into dictionaries that preserve the document hierarchy. It also supports table extraction for HTML files. https://github.com/john-friedman/doc2dict

Speed:

* HTML - 500 pages per second, single-threaded.

* PDF - 200 pages per second. The PDF must have an underlying text structure. Multithreading is not possible due to the limitations of PDFium.

Here's an example output from Microsoft's Annual Report:

    "title": "PART I",
    "standardized_title": "parti",
    "class": "part",
    "contents": {
        "38": {
            "title": "ITEM 1. BUSINESS",
            "standardized_title": "item1",
            "class": "item",
            "contents": {
                "39": {
                    "title": "GENERAL",
                    "standardized_title": "",
                    "class": "predicted header",
                    "contents": {
                        "40": {
                            "title": "Embracing Our Future",
                            "standardized_title": "",
                            "class": "predicted header",
                            "contents": {
                                "41": {
                                    "text": "Microsoft is a technolo...

Raw: https://html-preview.github.io/?url=https://raw.githubuserco...

Parsed dictionary: https://github.com/john-friedman/doc2dict/blob/main/example_...

Simple description of algorithm:

* Take a complicated document such as a PDF or HTML file and create a simplified representation of it: a list of lists of dicts, where each dict is a text block with key features such as "bold" and "font-size", and each inner list represents a new HTML block or a line of a PDF.

* Convert the simplified representation to a dictionary using a set of predetermined rules, e.g. a heading with a smaller font size is nested under the preceding heading with a larger font size.
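
To make the two steps concrete, here is a minimal Python sketch of the font-size nesting rule. It is not doc2dict's actual code: the function name `nest_by_font_size`, the `body_size` threshold, the sample data, and the exact field layout are all assumptions made for illustration. It only shows how a stack of open headings can turn a list of lists of dicts into a nested dictionary.

```python
# Hypothetical simplified representation: one inner list per line,
# one dict per text block. Field names and values are assumptions.
lines = [
    [{"text": "PART I", "font-size": 16, "bold": True}],
    [{"text": "ITEM 1. BUSINESS", "font-size": 14, "bold": True}],
    [{"text": "Body text under item 1.", "font-size": 10, "bold": False}],
    [{"text": "ITEM 2. PROPERTIES", "font-size": 14, "bold": True}],
    [{"text": "Body text under item 2.", "font-size": 10, "bold": False}],
]


def nest_by_font_size(lines, body_size=10):
    """Nest lines into a dict; anything larger than body_size is a heading."""
    root = {"contents": {}}
    # Stack of (font_size, node) for the currently open headings.
    # The root behaves like a heading with an infinitely large font.
    stack = [(float("inf"), root)]
    for index, line in enumerate(lines):
        text = " ".join(block["text"] for block in line)
        size = max(block["font-size"] for block in line)
        if size > body_size:
            # Close every open heading that is not strictly larger.
            while stack[-1][0] <= size:
                stack.pop()
            node = {"title": text, "contents": {}}
            stack[-1][1]["contents"][str(index)] = node
            stack.append((size, node))
        else:
            # Body text attaches to the innermost open heading.
            stack[-1][1]["contents"][str(index)] = {"text": text}
    return root["contents"]


tree = nest_by_font_size(lines)
```

In this sketch both "ITEM" headings end up inside the "contents" of "PART I", and each body line sits under the heading that precedes it. A real parser needs more signals than font size alone (bold, regex matches on titles such as "ITEM 1", and so on), which is presumably where tweakable rules come in.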

Note that I am working on making the second step more modular by creating predetermined instructions that users can tweak for their use case without rewriting the parser. I call these "mapping dicts".

doc2dict also includes visualization tools for the debugging process:

* visualize simplified representation https://html-preview.github.io/?url=https://github.com/john-...

* visualize output dictionary https://html-preview.github.io/?url=https://github.com/john-...

Why I made this: I'm currently working on another open source Python package to make it easy to exploit Securities and Exchange Commission data. Writing a generalized document parser that can be tweaked is easier than writing 100 or so specialized parsers, one for each document type.

Also, converting HTML and PDF files to a dictionary representation reduces document size by a factor of 10 or so. Not sure what I can do with that, but I'm planning on some fun NoSQL database experiments.

Link to other package (datamule) https://github.com/john-friedman/datamule-python