PDF Oxide is a PDF text extraction and manipulation library written in Rust with Python bindings (via PyO3). It is MIT licensed.
I started building it when I needed fast, reliable PDF extraction for a data pipeline and couldn't find a permissively-licensed option that was both fast and handled edge cases well. PyMuPDF is fast but AGPL. pypdf is MIT but 15x slower. pdfplumber is great for tables but too slow for batch processing.
Technical Architecture:
Zero-Dependency Parser: Built from scratch in Rust using nom combinators (no MuPDF, no Poppler).
Layout Analysis: Uses XY-Cut projection partitioning for multi-column layout detection.
Robust Font Decoding: Implements a multi-level encoding fallback chain (ToUnicode CMap -> Encoding differences -> Base encoding -> CIDFont CMap -> Adobe Glyph List -> Identity). This is where most libraries produce garbage on CJK documents.
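For the curious, the core of XY-Cut fits in a few lines: recursively split the page's word boxes along the widest empty gap in the horizontal or vertical projection profile until no gap wide enough remains. This is a toy sketch of the general algorithm, not pdf_oxide's actual implementation; the box tuple format and gap threshold are assumptions:

```python
def xy_cut(boxes, min_gap=10):
    """Recursively partition word boxes (x0, y0, x1, y1) into blocks by
    cutting along the widest empty gap in the x or y projection profile."""
    if not boxes:
        return []

    def widest_gap(axis):
        # Merge the boxes' intervals on one axis and find the widest gap.
        spans = sorted((b[axis], b[axis + 2]) for b in boxes)
        best, cut, end = 0, None, spans[0][1]
        for lo, hi in spans[1:]:
            if lo - end > best:
                best, cut = lo - end, (end + lo) / 2
            end = max(end, hi)
        return best, cut

    (gx, cx), (gy, cy) = widest_gap(0), widest_gap(1)
    if max(gx, gy) < min_gap:
        return [boxes]  # no gap wide enough: this is one block
    # Cut along whichever axis has the wider gap, so a two-column page
    # is split into columns before each column is split into lines.
    axis, cut = (0, cx) if gx >= gy else (1, cy)
    near = [b for b in boxes if b[axis + 2] <= cut]
    far = [b for b in boxes if b[axis + 2] > cut]
    return xy_cut(near, min_gap) + xy_cut(far, min_gap)
```

Cutting the wider gap first is what yields column-major reading order: the inter-column gap is usually wider than inter-line gaps, so columns are separated before lines.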
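The fallback chain amounts to trying each decoder in priority order and taking the first mapping that exists. A minimal sketch of the idea, where each decoder is modeled as a code-to-Unicode dict (the function name and signature are illustrative, not pdf_oxide's API):

```python
# Illustrative only: each "decoder" maps a character code to a Unicode
# string, or has no entry when it cannot decode that code.
def decode_char(code, to_unicode=None, differences=None, base=None,
                cid_cmap=None, glyph_list=None):
    chain = [
        to_unicode,   # 1. embedded ToUnicode CMap (most reliable)
        differences,  # 2. /Differences array in the /Encoding dict
        base,         # 3. base encoding (WinAnsiEncoding, MacRomanEncoding, ...)
        cid_cmap,     # 4. predefined CIDFont CMap (common for CJK fonts)
        glyph_list,   # 5. glyph name -> Unicode via the Adobe Glyph List
    ]
    for decoder in chain:
        if decoder is not None:
            mapped = decoder.get(code)
            if mapped is not None:
                return mapped
    # 6. Identity: pass the code through as-is rather than dropping it.
    return chr(code)
```

Libraries that stop after step 3 are the ones that emit mojibake on CJK documents, since CID-keyed fonts only resolve at steps 1 and 4.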
Benchmarks (Mean / p99 / Pass Rate) on 3,830 PDFs:

  pdf_oxide: 0.8ms / 9ms / 100% (MIT)
  PyMuPDF:   4.6ms / 28ms / 99.3% (AGPL-3.0)
  pypdfium2: 4.1ms / 42ms / 99.2% (Apache/BSD)
  pypdf:     12.1ms / 97ms / 98.4% (BSD)
Performance work was profiling-driven. For example, a bulk page tree cache took a 10,000-page PDF from 55s to 332ms, by replacing an O(n) page-tree walk per lookup (O(n^2) across the document) with an O(1) cached lookup.
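The win comes from the shape of the PDF /Pages tree: naively, finding page i means walking inner nodes from the root on every call. Flattening the tree into an array once makes every subsequent lookup an index. A rough sketch of the idea, using plain dicts in place of parsed PDF objects (the node layout here is invented for illustration; pdf_oxide caches actual object references):

```python
class PageTreeCache:
    """Flatten a PDF /Pages tree once so page lookups are O(1),
    instead of an O(n) walk from the root on every call."""

    def __init__(self, root):
        self._pages = []
        self._flatten(root)

    def _flatten(self, node):
        # Inner nodes are /Pages dictionaries with /Kids; leaves are /Page.
        if node.get("Type") == "Pages":
            for kid in node["Kids"]:
                self._flatten(kid)
        else:
            self._pages.append(node)

    def page_count(self):
        return len(self._pages)

    def page(self, i):
        return self._pages[i]  # O(1) after the one-time flatten
```

One pass over the tree is O(n), so a full-document extraction drops from O(n^2) total lookup cost to O(n).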
Quick Start (Python):
  from pdf_oxide import PdfDocument

  doc = PdfDocument("document.pdf")
  for i in range(doc.page_count()):
      print(doc.extract_text(i))
Quick Start (Rust):
  use pdf_oxide::PdfDocument;

  let mut doc = PdfDocument::open("document.pdf")?;
  for i in 0..doc.page_count()? {
      println!("{}", doc.extract_text(i)?);
  }
Capabilities: Text/Markdown/Image extraction, PDF creation from Markdown/HTML, form filling, and OCR (PaddleOCR via ONNX Runtime).
GitHub: https://github.com/yfedoseev/pdf_oxide
Docs: https://oxide.fyi
I would love to hear what you think—especially if you throw it at PDFs that other libraries struggle with. The best way to improve is finding edge cases in the wild.