Show HN: Large Scale Article Extract of Newspapers 1730s-1960s

https://snewpapers.com/
2•brettnbutter•1h ago
Hello HN, over the past 7 months I've spent nearly 3,000 hours building SNEWPAPERS, the first historical newspaper archive with full-text extraction, nearly perfect OCR, a vast categorization taxonomy, and semantic and agentic search.

Problem: I wanted to search through newspaper archives, but every service I tried only lets you search by keyword and date, and gives you back raw images of the papers, far too many of them and with no context. A sea of noise.

Solution: I taught machines how to read the newspapers, and so far I've extracted the content from more than 600k pages (about 5 TB) of the Chronicling America collection. The problems I had to deal with included an endless variety of layouts, font sizes, scan qualities, resolutions, and aspect ratios, plus navigating around the images on each page. I also had to figure out how to get OCR to be nearly perfect so people wouldn't hate reading the extracts. I stitched together a multi-model pipeline (layout tech, OCR tech, LLM, vllm) with heuristics to go from layout -> segmentation -> classification. I put it all in OpenSearch / Postgres, made it semantically searchable, and put an agentic search tool on top that knows how to use the API really well and helps you write queries to find what you're looking for. Happy to discuss AWS architecture and scaling as well; that was tough!
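
As a rough illustration of the layout -> segmentation -> classification flow described above, here is a minimal Python sketch. It is not the SNEWPAPERS implementation: the `Region` type, the stage functions, the two-column layout rule, and the keyword classifier are all hypothetical stand-ins for the real models and heuristics, which the post does not detail.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    """A rectangular area of a scanned page (hypothetical structure)."""
    x: int
    y: int
    w: int
    h: int
    text: str = ""
    label: str = ""


def detect_layout(page_size: Tuple[int, int]) -> List[Region]:
    """Stand-in for a layout model: split the page into two columns."""
    width, height = page_size
    half = width // 2
    return [Region(0, 0, half, height), Region(half, 0, width - half, height)]


def segment(regions: List[Region], ocr: Callable[[Region], str]) -> List[Region]:
    """Attach text to each region; `ocr` is any callable Region -> str."""
    for region in regions:
        region.text = ocr(region)
    return regions


def classify(regions: List[Region]) -> List[Region]:
    """Toy keyword heuristic standing in for a model-based classifier."""
    for region in regions:
        is_ad = "for sale" in region.text.lower()
        region.label = "advertisement" if is_ad else "article"
    return regions


def run_pipeline(page_size: Tuple[int, int],
                 ocr: Callable[[Region], str]) -> List[Region]:
    """Chain the three stages in the order named in the post."""
    return classify(segment(detect_layout(page_size), ocr))


def fake_ocr(region: Region) -> str:
    """Assumed OCR stub so the sketch runs without any model or image."""
    return "FOR SALE: one horse" if region.x == 0 else "Council met on Tuesday"


results = run_pipeline((1000, 1600), fake_ocr)
for region in results:
    print(region.label, "|", region.text)
```

Keeping each stage as a plain function that takes and returns a list of regions makes it easy to swap a stub for a real model one stage at a time.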

If you have five minutes and you just want to jump in and have your own personalized experience, what I would suggest is:

Before searching for anything, go to the Sleuth page.

Ask it about anything from 1736 to 1963, maybe with 1 or 2 follow-up questions.

Then go to the search page so you can see the queries it wrote for you (bottom left, "saved queries") and uncover more info on whatever it is you're interested in.

If you think it's cool and you want to learn more, there are about 10 minutes of video guides on the various capabilities under "Guide" in the nav bar.

Some other people have also taken a crack at this, notably:

https://dell-research-harvard.github.io/resources/americanst... (very good attempt)

https://labs.loc.gov/work/experiments/newspaper-navigator/ (focused on images)

Comments

brettnbutter•1h ago
A few examples you can click on without having to authenticate or sign up for a free trial:

https://snewpapers.com/components/b2d40c08-db63-40e8-890f-09...

https://snewpapers.com/components/0fabc8e4-a60b-4f31-9ad1-b0...

https://snewpapers.com/components/cdde790f-4e97-4f2d-a2c2-95...

Why TUIs are making a comeback

https://wiki.alcidesfonseca.com/blog/why-tuis-are-back/
1•alcidesfonseca•1m ago•0 comments

ICANN opens applications for new top-level domains for the first time since 2012

https://www.theregister.com/2026/05/01/icann_new_gtld_applications/
1•thunderbong•1m ago•0 comments

Inverse Sapir-Whorf and programming languages

https://lukeplant.me.uk/blog/posts/inverse-sapir-whorf-and-programming-languages/
2•birdculture•4m ago•0 comments

The AI Race Is Charged by the Fear of Being Left Behind

https://thewalrus.ca/the-ai-race-is-charged-by-the-fear-of-being-left-behind/
1•pseudolus•11m ago•0 comments

"Gazump"

https://notoneoffbritishisms.com/2026/05/01/gazump/
1•jjgreen•14m ago•0 comments

We migrated 100 services from Nginx to Envoy in one month

https://www.qovery.com/blog/alan-from-nginx-to-envoy-what-actually-happens-when-you-swap-your-pro...
1•ev0xmusic•15m ago•0 comments

Bep/gallerydeluxe: Fast Hugo gallery theme/module suitable for lots of images

https://github.com/bep/gallerydeluxe
1•Tomte•20m ago•0 comments

PgAdmin: The Most Popular PostgreSQL Admin Tool

https://www.pgadmin.org/
1•doener•21m ago•0 comments

Some Notes on AI

https://www.math.columbia.edu/~woit/wordpress/?p=15672
1•jjgreen•25m ago•0 comments

RAG isn't memory. It's Ctrl+F with embeddings

https://medium.com/@vbcherepanov/rag-isnt-memory-it-s-ctrl-f-with-embeddings-c461b90ac7b1
2•vbcherepanov•27m ago•0 comments

How GitHub lost its way

https://substack.com/@usiddique09/p-196195940
2•usmansidd•29m ago•0 comments

clang-format configurator v2

https://clang-format-configurator.site/
1•gjvc•31m ago•0 comments

Apple just gave a clue that a big AI acquisition may be in the cards

https://www.marketwatch.com/story/apple-just-gave-a-subtle-clue-that-a-splashy-ai-acquisition-may...
2•dalvrosa•32m ago•0 comments

First Nations students are teaching themselves

https://www.cbc.ca/news/canada/edmonton/frog-lake-cree-language-app-9.7185348
2•01-_-•32m ago•0 comments

Convicted former Harvard scientist rebuilds brain computer lab in China

https://www.reuters.com/world/china/convicted-former-harvard-scientist-rebuilds-brain-computer-la...
3•01-_-•33m ago•0 comments

Looking for Employers for the job fair and hiring with Meeting C++

https://www.meetingcpp.com/meetingcpp/news/items/Looking-for-Employers-for-the-job-fair-and-hirin...
1•dalvrosa•34m ago•0 comments

Gall's Law – Yagnipedia

https://yagnipedia.com/wiki/galls-law
3•ankitg12•38m ago•0 comments

Neomd: A minimal terminal email client for people who write in Markdown

https://neomd.ssp.sh/
1•handfuloflight•39m ago•0 comments

The Discord migration that didn't happen

https://productimpossible.com/articles/discord-migration-that-didnt-happen/
2•sebakubisz•45m ago•0 comments

Show HN: Autorank – Rank on Google and AI search while you sleep

https://www.getautorank.ai/
1•alokjnv10•46m ago•0 comments

How fast is a macOS VM, and how small could it be?

https://eclecticlight.co/2026/05/02/how-fast-is-a-macos-vm-and-how-small-could-it-be/
7•moosia•46m ago•0 comments

ZenBusiness Data Breach

https://haveibeenpwned.com/Breach/ZenBusiness
1•amazonhut•48m ago•0 comments

How Casey Newton is revamping his newsletter to compete with AI

https://www.niemanlab.org/2026/04/more-scoops-less-aggregation-and-analysis-how-casey-newton-is-r...
1•giuliomagnifico•49m ago•0 comments

US to Withdraw Troops from Germany

https://www.dw.com/en/us-to-withdraw-thousands-of-troops-from-germany/a-77016071
2•pera•52m ago•0 comments

Dazzle Camouflage

https://en.wikipedia.org/wiki/Dazzle_camouflage
2•tosh•53m ago•0 comments

AMD Posts HDMI 2.1 FRL Patches for Their Amdgpu Linux Driver

https://www.phoronix.com/news/AMDGPU-HDMI-2.1-FRL-Patches
2•type0•55m ago•0 comments

Andrej Karpathy: From Vibe Coding to Agentic Engineering

https://www.youtube.com/watch?v=96jN2OCOfLs
3•swolpers•58m ago•0 comments

Study: AI models that consider user's feeling are more likely to make errors

https://arstechnica.com/ai/2026/05/study-ai-models-that-consider-users-feeling-are-more-likely-to...
1•rbanffy•1h ago•0 comments

Show HN: I built Male Hormone Lab Interpreter that does what LLMs can't

https://www.longevity-tools.com/male-hormones-interpreter
2•zsolt224•1h ago•0 comments