Show HN: udoc. Dependency-free document extraction in Rust

https://newelh.github.io/udoc/
2•newelh•51m ago

Comments

newelh•51m ago
I built udoc because most document extraction tools I've used require significant dependencies, only handle one format, or have restrictive licenses. I wanted a single binary that reads PDFs, Office docs (including legacy .doc/.xls/.ppt), ODF, and RTF — with no external parsers, no system packages, nothing to install. It's written in pure Rust with Python bindings via PyO3. If you have uv, you can try it right now without installing anything:

`curl -sL https://arxiv.org/pdf/1706.03762 | uvx udoc - | grep -A 18 '^Abstract'`

Highlights:

- A CLI, e.g. `udoc -J ingest.pdf | duckdb -c "COPY (SELECT * FROM read_json_auto('/dev/stdin')) TO 'pages.parquet'"`.
- One unified Document model across all formats: extracted documents are organized into five layers: Content, Metadata, Presentation, Relationships, and Interactions.
- Streaming page-by-page extraction, so a 10 GB PDF doesn't need to fit in memory.
- A JSONL-based hook protocol for plugging in OCR (Tesseract, cloud APIs), layout detection (DocLayout-YOLO), or vision-language models as subprocesses.
- A PDF rendering engine: `udoc render paper.pdf -o ./pages`.
- Typed diagnostics: recoverable issues like font fallbacks or malformed xref tables are reported as structured warnings you can filter on.
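To make the hook idea concrete, here is a rough sketch of what a JSONL subprocess loop generally looks like: one JSON object per input line, one JSON object per output line. This is not udoc's actual protocol. The field names (`page`, `text`, `engine`) and the helper names (`run_hook`, `fake_ocr`, `main`) are assumptions for illustration only; check the docs for the real schema.

```python
import json
import sys
from typing import Callable, Iterable, Iterator


def run_hook(lines: Iterable[str], process: Callable[[dict], dict]) -> Iterator[str]:
    """Yield one JSON response line per non-empty JSON request line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        request = json.loads(line)
        yield json.dumps(process(request))


def fake_ocr(request: dict) -> dict:
    """Stand-in for a real engine. Field names are hypothetical, not udoc's schema."""
    return {"page": request.get("page"), "text": "", "engine": "placeholder"}


def main() -> None:
    """Read requests from stdin and write responses to stdout, one per line."""
    for response in run_hook(sys.stdin, fake_ocr):
        # Flush after every line so the parent process can stream results.
        print(response, flush=True)
```

Calling `main()` from a script turns this into a subprocess that a parent tool could drive over pipes. Swapping `fake_ocr` for a call into Tesseract or a cloud API would be the only engine-specific part.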

A frequent question: if udoc is a full document toolkit, why does it not include OCR? Because OCR is not a parser; it is a model that reconstructs text from pixels. No parser can substitute for it. The relevant question is whether the parser knows when to invoke it.

udoc's approach:

- Automatic scan detection. Pages with one large image, fewer than five text spans, and no extractable glyph data are flagged as LikelyScanned on the diagnostics sink. The OCR hook fires only on those pages by default.
- OCR as a hook, not a built-in. Tesseract, GLM-OCR, DeepSeek-OCR, Textract, Document AI, Azure Form Recognizer: the right engine depends on the document, the language, the hardware, the budget, and the data-egress policy. udoc does not ship one. The hook protocol lets you wire whichever engine you need.
- Per-page granularity. The detector runs per page, not per document. OCR fires on scanned inserts and skips the digitally-generated body.
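The scan-detection rule above can be sketched as a small per-page predicate. This is only an illustration of the three signals described, not udoc's implementation. `PageStats`, `is_likely_scanned`, `pages_needing_ocr`, and the 0.8 coverage threshold for a "large" image are all assumptions, since the post does not say how "large" is measured.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary; not udoc's actual data model."""
    image_count: int
    largest_image_coverage: float  # fraction of page area covered, 0.0 to 1.0
    text_span_count: int
    has_glyph_data: bool


# Assumed cutoff for a "large" image; the post does not give a number.
LARGE_IMAGE_COVERAGE = 0.8


def is_likely_scanned(page: PageStats) -> bool:
    """True when a page matches all three signals described in the post."""
    one_large_image = (
        page.image_count == 1
        and page.largest_image_coverage >= LARGE_IMAGE_COVERAGE
    )
    few_text_spans = page.text_span_count < 5
    return one_large_image and few_text_spans and not page.has_glyph_data


def pages_needing_ocr(pages: list[PageStats]) -> list[int]:
    """Return zero-based indices of pages flagged as likely scanned."""
    return [i for i, page in enumerate(pages) if is_likely_scanned(page)]
```

Because the check runs on each page independently, a mostly digital document with a few scanned inserts would send only those inserts to the OCR hook.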

This is an alpha release. APIs and output format may still change. The docs are at https://newelh.github.io/udoc/ if you want to go deeper. Happy to answer questions about the parsing approach, format quirks, or anything else.

Pokemon Gen2 Compression Myth

https://old.reddit.com/r/TruePokemon/comments/hwluk9/while_it_is_true_that_iwata_did_write_a_new/
1•birdculture•4m ago•0 comments

The Letter S, by Donald Knuth [pdf]

https://gwern.net/doc/design/typography/1980-knuth.pdf
1•bambax•7m ago•0 comments

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models

https://www.computer.org/csdl/journal/tp/2026/06/11364256/2dAZTYlgxVu
1•teleforce•9m ago•0 comments

Vollebak alters emotions with new sonic jacket

https://www.designboom.com/technology/vollebak-sonic-jacket-emotional-resonance-chamber-body/
1•thunderbong•11m ago•0 comments

An AI system to help scientists write expert-level empirical software

https://www.nature.com/articles/s41586-026-10658-6
2•anigbrowl•11m ago•0 comments

I reverse engineered Apple's video wallpapers

https://github.com/kageroumado/phosphene
1•kageroumado•11m ago•1 comments

Show HN: Remote Job Board

https://www.remotejobs.place
1•beefive•13m ago•0 comments

SpaceX S-1 Filing

https://www.sec.gov/Archives/edgar/data/1181412/000162828026036936/spaceexplorationtechnologi.htm...
2•snewe•16m ago•0 comments

I built a tool to auto-generate iMessage replies

https://github.com/aditya-r123/iMessage-Bot
1•adityarao4•19m ago•0 comments

How Many Questions Can the World Afford to Ask AI?

https://www.chicagobooth.edu/review/how-many-questions-can-world-afford-ask-ai
1•chris_money202•20m ago•0 comments

New Zealand's close shave with the mongoose

https://thespinoff.co.nz/science/18-05-2026/nzs-close-shave-with-the-mongoose
1•HBcodes•21m ago•0 comments

Nvidia Rides Blistering Chip Sales to Another Record Quarter

https://www.wsj.com/tech/ai/nvidia-nvda-1st-quarter-earnings-report-2026-stock-c2bb9c1c
2•bookofjoe•23m ago•1 comments

Meta Begins AI-Driven Layoffs, Report Says. Can They Boost the Struggling Stock?

https://www.barrons.com/articles/meta-stock-ai-layoffs-f9dba997
1•1vuio0pswjnm7•26m ago•0 comments

InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents

https://inferencebench.ai/
1•matt_d•28m ago•0 comments

Valkey 9.1 trims memory 10% and pulls search into the core

https://thenewstack.io/valkey-91-cuts-memory/
1•ohjeez•29m ago•0 comments

RFC 9989: What's New in the Latest DMARC Specification

https://dmarcchecker.app/articles/rfc-9989-dmarc-changes
1•awulf•30m ago•0 comments

Tracking Capabilities for Safer Agents

https://arxiv.org/abs/2603.00991
1•matt_d•35m ago•0 comments

The Elements of Power (AI Supply Chain)

https://z-library.im/book/xkRN77V9kg/the-elements-of-power-a-story-of-war-technology-and-the-dirt...
1•mmirshekar•36m ago•0 comments

JAM: DSP audio engine programmable via AI chat

https://jeffsaudiomachines.com/my-story
1•jcward•39m ago•1 comments

Ask HN: Mac users – has Mac-to-Android texting broke for you, since macOS 26.5?

https://discussions.apple.com/verify-human/verify.html?next=/thread/256300699
1•akulbe•39m ago•2 comments

Marketing Help

https://itcrowd.io
1•chrisrichardson•39m ago•0 comments

Anti-Meme Explosion

https://www.jernesto.com/articles/anti_meme_explosion
3•slopranker•42m ago•2 comments

Search engines and me. Is the classic internet dying? [off-topic]

1•nullpwr•44m ago•3 comments

SpaceX files for IPO that could make Elon Musk a trillionaire

https://www.bbc.com/news/articles/cg4pe2953q1o
3•colinprince•45m ago•0 comments

Kickstarter apologises for its criticised adult content rules

https://www.eurogamer.net/kickstarter-adult-content-rules-censorship-apology
2•healsdata•45m ago•0 comments

Primer Pay – micropayments for WordPress on Base requiring Chrome extension

https://wordpress.org/plugins/primer-pay/
1•adrianwaj•48m ago•1 comments

Deep – CLI/REPL for generating and iterating on codebases using DeepSeek

https://github.com/cynchro/deepseekCLI
3•cynchro980•49m ago•0 comments

Understanding Bitcoin Inscriptions

https://www.learnbitcoin.com/glossary/inscriptions
3•granya•53m ago•2 comments

New Bitcoin Fee Pressure Signal

https://chainquery.com/reports/fee-pressure
2•granya•55m ago•0 comments