Zpdf: PDF text extraction in Zig – 5x faster than MuPDF

https://github.com/Lulzx/zpdf
55•lulzx•2h ago

Comments

lulzx•2h ago
I built a PDF text extraction library in Zig that's significantly faster than MuPDF for text extraction workloads.

~41K pages/sec peak throughput.

Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.

~5,000 lines, no dependencies, compiles in <2s.

Why it's fast:

  - Memory-mapped file I/O (no read syscalls)
  - Zero-copy parsing where possible
  - SIMD-accelerated string search for finding PDF structures
  - Parallel extraction across pages using Zig's thread pool
  - Streaming output (no intermediate allocations for extracted text)

What it handles:

  - XRef tables and streams (PDF 1.5+)
  - Incremental PDF updates (/Prev chain)
  - FlateDecode, ASCII85, LZW, RunLength decompression
  - Font encodings: WinAnsi, MacRoman, ToUnicode CMap
  - CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
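To make the decompression bullet concrete, here is a minimal Python sketch of three of the listed filters, built on the standard library. It is an illustration under assumptions, not zpdf's implementation: the function names and the `decode_stream` helper are hypothetical, and LZW is left out because the standard library has no decoder for it.

```python
import base64
import zlib


def flate_decode(data):
    # FlateDecode streams are zlib-wrapped deflate data.
    return zlib.decompress(data)


def ascii85_decode(data):
    # Strip the optional "<~" prefix and the "~>" end marker, then decode.
    data = data.strip()
    if data.startswith(b"<~"):
        data = data[2:]
    if data.endswith(b"~>"):
        data = data[:-2]
    return base64.a85decode(data)


def run_length_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        i += 1
        if n == 128:
            # 128 marks end of data.
            break
        if n < 128:
            # 0..127: copy the next n + 1 bytes literally.
            out += data[i:i + n + 1]
            i += n + 1
        else:
            # 129..255: repeat the next byte 257 - n times.
            out += data[i:i + 1] * (257 - n)
            i += 1
    return bytes(out)


# Filter names as they appear in a stream dictionary's /Filter entry.
FILTERS = {
    "FlateDecode": flate_decode,
    "ASCII85Decode": ascii85_decode,
    "RunLengthDecode": run_length_decode,
}


def decode_stream(data, filter_names):
    # Filters are applied in the order they are listed.
    for name in filter_names:
        data = FILTERS[name](data)
    return data
```

A stream whose `/Filter` is `[/ASCII85Decode /FlateDecode]` would be decoded with `decode_stream(raw, ["ASCII85Decode", "FlateDecode"])`.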
tveita•47m ago
What kind of performance are you seeing with/without SIMD enabled?

From https://github.com/Lulzx/zpdf/blob/main/src/main.zig it looks like the help text cites an unimplemented "-j" option to enable multiple threads.

There is a "--parallel" option, but that is only implemented for the "bench" command.

lulzx•34m ago
I have now made parallel extraction the default and added an option to enable multiple threads.

I haven't tested without SIMD.
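For readers wondering what "parallel by default" with a thread-count option can look like structurally, here is a small Python sketch. It is not zpdf's code or CLI: `extract_page`, `extract_all`, and the `jobs` parameter are hypothetical names, and the per-page work is a stand-in. In CPython, threads will not speed up pure-Python CPU work, so this shows only the shape of the approach.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page_bytes):
    # Stand-in for real per-page work; here it only decodes bytes to text.
    return page_bytes.decode("latin-1")


def extract_all(pages, jobs=None):
    """Extract every page, using up to `jobs` worker threads.

    `jobs=None` lets the executor pick a default; `jobs=1` runs serially.
    """
    if jobs == 1:
        return [extract_page(p) for p in pages]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() yields results in input order, so page order is preserved.
        return list(pool.map(extract_page, pages))
```

Because `map()` returns results in input order, the output reads the same whether extraction ran on one thread or several.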

cheshire_cat•43m ago
You've released quite a few projects lately, very impressive.

Are you using LLMs for parts of the coding?

What's your workflow when approaching a new project like this?

littlestymaar•12m ago
> Are you using LLMs for parts of the coding?

I can't talk about the code, but the readme and commit messages are most likely LLM-generated.

And when you take into account that the first commit happened just three hours ago, it feels like the entire project has been vibe coded.

lulzx•11m ago
Claude Code.
jeffbee•14m ago
What's fast about mmap?
agentifysh•58m ago
excellent stuff, what makes zig so fast?
observationist•16m ago
Not being slow - Zig compiles ahead of time straight to native machine code rather than being interpreted, and has aggressive, opinionated optimizations baked in by default, so it can be even faster than compiled C (under default conditions).

Contrast that with Python, which is interpreted, has a clunky runtime, minimal optimizations, and all sorts of choices that result in slow, redundant, and also slow, performance.

The price for performance is fewer safety checks, less redundancy, and how badly wrong things can go.

A good compromise is LuaJIT - you get some of the same aggressive optimizations through JIT compilation, with performance that can rival C but the convenience of an interpreted language, access to low-level things that can explode just as spectacularly as with Zig or C, and also a beautiful language.

agentifysh•3m ago
will add this to the list, now learning new languages is less of a barrier with LLMs
AndyKelley•14m ago
It makes your development workflow smooth enough that you have the time and energy to do stuff like all the bullet points listed in https://news.ycombinator.com/item?id=46437289
mpeg•45m ago
very nice, it'd be good to see a feature comparison as when I use mupdf it's not really just about speed, but about the level of support of all kinds of obscure pdf features, and good level of accuracy of the built-in algorithms for things like handling two-column pages, identifying paragraphs, etc.

the licensing is a huge blocker for using mupdf in non-OSS tools, so it's very nice to see this is MIT

python bindings would be good too

lulzx•16m ago
added a comparison, will improve further. https://github.com/Lulzx/zpdf?tab=readme-ov-file#comparison-...

also, added python bindings.

odie5533•37m ago
Now we just need Python bindings so I can use it in my trash language of choice.
lulzx•20m ago
added python bindings!
littlestymaar•10m ago
- First commit 3 hours ago.

- commit message: LLM-generated.

- README: LLM-generated.

I'm not convinced that projects vibe coded over the evening deserve the HN front page…

Edit: and of course the author's blog is also full of AI slop…

2026 hasn't even started and I already hate it.

kingkongjaffa•3m ago
Wait, but why?

If it's really better than what we had before, what does it matter how it was made? It's literally hacked together with the tools of the day (LLMs). Isn't that the very hacker ethos? Patching stuff together that works in a new and useful way.

5x speed improvements on PDF text extraction might be great for some applications I'm not aware of. I wouldn't just dismiss it out of hand because the author used $robot to write the code.

Presumably the thought to make the thing in the first place and decide what features to add and not add was more important than how the code is generated?
