PDF Oxide is a PDF text extraction and manipulation library written in Rust with Python bindings (via PyO3). It is MIT licensed.
I started building it when I needed fast, reliable PDF extraction for a data pipeline and couldn't find a permissively-licensed option that was both fast and handled edge cases well. PyMuPDF is fast but AGPL. pypdf is MIT but 15x slower. pdfplumber is great for tables but too slow for batch processing.
Technical Architecture:
Zero-Dependency Parser: Built from scratch in Rust using nom combinators (no MuPDF, no Poppler).
Layout Analysis: Uses XY-Cut projection partitioning for multi-column layout detection.
Robust Font Decoding: Implements a multi-level encoding fallback chain (ToUnicode CMap -> Encoding differences -> Base encoding -> CIDFont CMap -> Adobe Glyph List -> Identity). This is where most libraries produce garbage on CJK documents.
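For the curious, the core of XY-Cut fits in a few lines: recursively split the page's word boxes along the widest empty gap in the horizontal or vertical projection profile until no gap wide enough remains. This is a toy sketch of the general algorithm, not pdf_oxide's actual implementation; the box tuple format and gap threshold are assumptions:

```python
def xy_cut(boxes, min_gap=10):
    """Recursively partition word boxes (x0, y0, x1, y1) into blocks by
    cutting along the widest empty gap in the x or y projection profile."""
    if not boxes:
        return []

    def widest_gap(axis):
        # Merge the boxes' intervals on one axis and find the widest gap.
        spans = sorted((b[axis], b[axis + 2]) for b in boxes)
        best, cut, end = 0, None, spans[0][1]
        for lo, hi in spans[1:]:
            if lo - end > best:
                best, cut = lo - end, (end + lo) / 2
            end = max(end, hi)
        return best, cut

    (gx, cx), (gy, cy) = widest_gap(0), widest_gap(1)
    if max(gx, gy) < min_gap:
        return [boxes]  # no gap wide enough: this is one block
    # Cut along whichever axis has the wider gap, so a two-column page
    # is split into columns before each column is split into lines.
    axis, cut = (0, cx) if gx >= gy else (1, cy)
    near = [b for b in boxes if b[axis + 2] <= cut]
    far = [b for b in boxes if b[axis + 2] > cut]
    return xy_cut(near, min_gap) + xy_cut(far, min_gap)
```

Cutting the wider gap first is what yields column-major reading order: the inter-column gap is usually wider than inter-line gaps, so columns are separated before lines.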
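The fallback chain amounts to trying each decoder in priority order and taking the first mapping that exists. A minimal sketch of the idea, where each decoder is modeled as a code-to-Unicode dict (the function name and signature are illustrative, not pdf_oxide's API):

```python
# Illustrative only: each "decoder" maps a character code to a Unicode
# string, or has no entry when it cannot decode that code.
def decode_char(code, to_unicode=None, differences=None, base=None,
                cid_cmap=None, glyph_list=None):
    chain = [
        to_unicode,   # 1. embedded ToUnicode CMap (most reliable)
        differences,  # 2. /Differences array in the /Encoding dict
        base,         # 3. base encoding (WinAnsiEncoding, MacRomanEncoding, ...)
        cid_cmap,     # 4. predefined CIDFont CMap (common for CJK fonts)
        glyph_list,   # 5. glyph name -> Unicode via the Adobe Glyph List
    ]
    for decoder in chain:
        if decoder is not None:
            mapped = decoder.get(code)
            if mapped is not None:
                return mapped
    # 6. Identity: pass the code through as-is rather than dropping it.
    return chr(code)
```

Libraries that stop after step 3 are the ones that emit mojibake on CJK documents, since CID-keyed fonts only resolve at steps 1 and 4.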
Benchmarks (Mean / p99 / Pass Rate) on 3,830 PDFs:

  pdf_oxide: 0.8ms / 9ms / 100% (MIT)
  PyMuPDF:   4.6ms / 28ms / 99.3% (AGPL-3.0)
  pypdfium2: 4.1ms / 42ms / 99.2% (Apache/BSD)
  pypdf:     12.1ms / 97ms / 98.4% (BSD)
Performance work was profiling-driven. For example, a bulk page tree cache took a 10,000-page PDF from 55s to 332ms, by replacing an O(n) page-tree walk per lookup (O(n^2) across the document) with an O(1) cached lookup.
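The win comes from the shape of the PDF /Pages tree: naively, finding page i means walking inner nodes from the root on every call. Flattening the tree into an array once makes every subsequent lookup an index. A rough sketch of the idea, using plain dicts in place of parsed PDF objects (the node layout here is invented for illustration; pdf_oxide caches actual object references):

```python
class PageTreeCache:
    """Flatten a PDF /Pages tree once so page lookups are O(1),
    instead of an O(n) walk from the root on every call."""

    def __init__(self, root):
        self._pages = []
        self._flatten(root)

    def _flatten(self, node):
        # Inner nodes are /Pages dictionaries with /Kids; leaves are /Page.
        if node.get("Type") == "Pages":
            for kid in node["Kids"]:
                self._flatten(kid)
        else:
            self._pages.append(node)

    def page_count(self):
        return len(self._pages)

    def page(self, i):
        return self._pages[i]  # O(1) after the one-time flatten
```

One pass over the tree is O(n), so a full-document extraction drops from O(n^2) total lookup cost to O(n).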
Quick Start (Python):
  from pdf_oxide import PdfDocument

  doc = PdfDocument("document.pdf")
  for i in range(doc.page_count()):
      print(doc.extract_text(i))
Quick Start (Rust):
  use pdf_oxide::PdfDocument;

  let mut doc = PdfDocument::open("document.pdf")?;
  for i in 0..doc.page_count()? {
      println!("{}", doc.extract_text(i)?);
  }
Capabilities: Text/Markdown/Image extraction, PDF creation from Markdown/HTML, form filling, and OCR (PaddleOCR via ONNX Runtime).
GitHub: https://github.com/yfedoseev/pdf_oxide
Docs: https://oxide.fyi
I would love to hear what you think—especially if you throw it at PDFs that other libraries struggle with. The best way to improve is finding edge cases in the wild.