Show HN: Extract (YC P25) – Fast, accurate document parsing

https://extract.page
1•soamikapadia•1h ago
Hey HN, we’re Soami, David, and Achyut, co-founders of Extract. Extract parses documents into structured data (text, tables, and figures). Teams use it for RAG, feeding LLMs, and populating databases and forms. Today we’re launching our first OCR model, now used in Extract.

You can try some examples here or upload your own (no signup required) to test it out: https://extract.page/demo

We built Extract out of YouLearn, where we were processing 70M+ pages and slow parsing was the bottleneck. We started with a purely algorithmic pipeline that pulled native text straight from the document and only ran OCR on pages that needed it. It was cheap and fast, but once we put it in front of our Extract customers and their hardest documents, it hit an accuracy ceiling. We wanted to keep the speed and cost while improving accuracy, so we trained our own VLM for the cases that broke. The model also provides element-level bounding boxes, so each result points back to its exact place on the page.
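For readers who want a concrete picture of the "native text first, OCR only where needed" idea, here is a minimal sketch. It is not Extract's actual pipeline: the `Page` type, the `needs_ocr` heuristic, its `min_chars` threshold, and the injected `run_ocr` callable are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    number: int
    native_text: str  # text layer read straight from the file; empty for scans


def needs_ocr(page: Page, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: almost no native text suggests a scanned page."""
    return len(page.native_text.strip()) < min_chars


def parse_document(pages: List[Page], run_ocr: Callable[[Page], str]) -> List[str]:
    """Return one text string per page, calling OCR only when it is needed."""
    results = []
    for page in pages:
        if needs_ocr(page):
            # Slow, expensive path: only pages without a usable text layer.
            results.append(run_ocr(page))
        else:
            # Fast, cheap path: reuse the text already embedded in the file.
            results.append(page.native_text)
    return results
```

A real router would likely look at more than character count (for example, garbled text layers or image-only regions), which is where a heuristic like this one would hit the kind of accuracy ceiling described above.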

That took one customer from 71% to 92% text accuracy in under a week, at the same speed and cost. This is possible because our synthetic data generation pipeline recreates the messy, real-world documents the model gets wrong, so we can retrain on those exact cases without having to hand-label data.

To see how this holds up against other providers, we benchmarked Extract against AWS Textract, Extend, Reducto, LlamaParse, and Unstructured on 130 human-labeled pages from difficult real-world documents. Extract is #1 on text accuracy (81.9%) and word-overlap F1 (84.5%), second on grounded accuracy, and competitive on layout IoU, while running at least 2x faster than every parser we tested.
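The post does not spell out how these metrics are computed, so the following is only a sketch of one common way to define word-overlap F1 and box IoU. The function names, the lowercase whitespace tokenization, and the `(x0, y0, x1, y1)` box layout are assumptions, not the benchmark's actual definitions.

```python
from collections import Counter
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def word_overlap_f1(predicted: str, reference: str) -> float:
    """Bag-of-words F1 between parsed text and a human-labeled reference."""
    pred = Counter(predicted.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # multiset intersection of word counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned bounding boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Under these definitions, a parser that drops words loses recall and one that hallucinates words loses precision, while IoU rewards predicted layout boxes that tightly match the labeled ones.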

Here are the benchmarks: https://extract.page/bench

Extract costs $3 per 1,000 pages, about 5x cheaper than AWS Textract (layout + table enabled). To see how it performs on your own docs, feel free to send us a few and we’ll run a benchmark on them. We’ll get back to you with the results within a few days of receiving the docs: https://cal.com/team/youlearnai/extract-intro

Thanks for reading this post! This is the first version of the model, and we're shipping further improvements for handwritten, multilingual, and table-heavy documents. We know there are documents it won't handle well yet. If you have one, we'd love to see it.

Async Rust: deep dive into cooperative scheduling and Tokio's architecture

https://kerkour.com/async-rust-cooperative-scheduling-tokio
1•Tomte•36s ago•0 comments

Supernatural isn't dead after all

https://www.theverge.com/news/941816/supernatural-health-meta-quest-vr
1•mikro•47s ago•0 comments

Iran attack on Kuwait airport injures at least 63 and damages terminal building

https://www.reuters.com/world/iran-war-live-us-says-iranian-strikes-bahrain-kuwait-failed-2026-06...
1•alephnerd•50s ago•0 comments

Missing IPsec Integrity Protection for IMS Sip Signaling in Verizon VoLTE

https://www.kb.cert.org/vuls/id/615987
1•luu•1m ago•0 comments

Rolling Redesigns: A Sneaky Smart Way to Refresh Your Website

https://www.culturefoundry.com/cultivate/design-ux/rolling-redesign/
1•mooreds•2m ago•0 comments

AI Search Visibility Checker

https://www.techwrath.com/ai-search-visibility-checker/
1•techwrath11•3m ago•0 comments

Xenomorph head and T. Ocellus – I built last year

https://old.reddit.com/r/scifi/comments/1sc0afl/xenomorph_head_and_t_ocellus_i_built_last_year/
1•vpuna•3m ago•1 comments

Gmail Bulk Automator

https://www.techwrath.com/gmail-bulk-automator-extensions/
1•techwrath11•4m ago•0 comments

The Thermodynamics of Decoupled Success

https://osf.io/preprints/socarxiv/m2nrj_v1
1•rendersgame•4m ago•0 comments

Intelligence per Dollar

https://tomtunguz.com/tokens-per-result/
1•swolpers•5m ago•0 comments

Spherical Voronoi Diagram

https://www.jasondavies.com/maps/voronoi/
1•marysminefnuf•6m ago•0 comments

The Next Chapter for Suno

https://suno.com/blog/series-d-announcement
1•doppp•6m ago•0 comments

SkyBrief – Pre-flight briefing dashboard for GA pilots (METARs, NOTAMs, SIGMETs)

https://skybrief.flights
1•phironaka•7m ago•0 comments

Android introduces fake call detection to stop deepfake scams

https://blog.google/security/android-fake-call-detection/
1•speckx•7m ago•0 comments

Jailbreaking the Lululemon Mirror [video]

https://www.youtube.com/watch?v=_0gtiMi5AzI
1•tortilla•8m ago•0 comments

Largest scorpion revealed from 415M-year-old fossils

https://www.manchester.ac.uk/about/news/worlds-largest-scorpion-revealed-from-415-million-year-ol...
1•gnufx•9m ago•0 comments

Atlasflow – Push to GitHub, we build and run it on bare metal

https://atlasflow.com/
2•tomhaerter•10m ago•0 comments

Ibn Tufayl

https://en.wikipedia.org/wiki/Ibn_Tufayl
2•simonebrunozzi•10m ago•0 comments

Show HN: The Last Museum – Semantic search across all museum art

https://lastmuseum.com
1•dbrereton•11m ago•0 comments

AkiraConsole – open-source handheld for embedded developers

https://www.crowdsupply.com/pen-engineering/akiraconsole
1•IlyaPen•11m ago•0 comments

Apple Developer Centers are expanding to Berlin

https://developer.apple.com/news/?id=f0jfy9py
1•alechash•12m ago•0 comments

Google's water stewardship commitments for local communities

https://blog.google/company-news/outreach-and-initiatives/sustainability/new-water-stewardship-co...
3•ChrisArchitect•14m ago•0 comments

AI has a water problem. Google thinks it has a fix

https://www.theverge.com/policy/942296/google-water-commitments-data-centers
1•Brajeshwar•16m ago•1 comments

Show HN: Solving complex optimization problems with Google OR-Tools in browser

https://github.com/Axelwickm/or-tools-wasm
2•AxelWickman•19m ago•0 comments

Find a Way

https://usefulfictions.substack.com/p/find-a-way
1•jger15•20m ago•0 comments

Companies Are Using Reddit to Manipulate ChatGPT and Google AI Search

https://www.404media.co/companies-are-using-reddit-to-manipulate-chatgpt-and-google-ai-search/
3•pavel_lishin•20m ago•1 comments

Porting Btrfs-Progs to Rust

https://xfbs.net/posts/2026/porting-btrfs-progs-to-rust/
3•qbane•24m ago•0 comments

Joel David Hamkins – Set Theory, Pluralism and the Multiverse View – About Logic

https://www.youtube.com/watch?v=060p4gKCCbg
2•FillMaths•25m ago•0 comments

YouTube overtakes Netflix in average daily viewing around the world

https://www.theguardian.com/technology/2026/jun/03/youtube-overtakes-netflix-in-average-daily-vie...
5•novaRom•25m ago•4 comments

ExtendDB – DynamoDB compatible adapter with pluggable storage back ends

https://aws.amazon.com/blogs/database/introducing-extenddb-an-open-source-dynamodb-compatible-ada...
1•tcp_handshaker•27m ago•0 comments