Show HN: Large Scale Article Extract of Newspapers 1730s-1960s

https://snewpapers.com/
2•brettnbutter•1h ago
Hello HN, over the past 7 months I've spent nearly 3,000 hours building SNEWPAPERS, the first historical newspaper archive with full-text extractions, nearly perfect OCR, a vast categorization taxonomy, and semantic and agentic search capabilities.

Problem: I wanted to search through newspaper archives, but every service I tried only lets you search by keyword and date, and gives you back raw images of the papers, too many of them and with no context. A sea of noise.

Solution: I taught machines how to read the newspapers, and so far I've extracted the content from > 600k pages (about 5TB) of the Chronicling America collection. Problems I had to deal with included a seemingly endless variety of layouts, font sizes, scan qualities, resolutions, and aspect ratios, plus navigating around the images on the page. I also had to figure out how to get OCR to be nearly perfect so people wouldn't hate reading the extracts. I stitched together a multi-model pipeline (layout tech, OCR tech, LLM, vllm) with heuristics to go from layout -> segmentation -> classification. I put it all in OpenSearch / Postgres, made it semantically searchable, and put an agentic search tool on top that knows how to use the API really well and helps you write queries to find what you're looking for. Happy to discuss AWS architecture and scaling as well; that was tough!
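To make the layout -> segmentation -> classification flow a bit more concrete, here is a minimal Python sketch. It is not the SNEWPAPERS pipeline: the `Block` and `Article` types, the font-size threshold, and the keyword table are all hypothetical stand-ins for what the layout model, OCR, and LLM stages would actually produce, and the toy reading order assumes blocks in the same column share the same `x`.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One text region on a scanned page (hypothetical layout + OCR output)."""
    x: int
    y: int
    text: str
    font_size: float


@dataclass
class Article:
    """A segmented article: one headline plus the body blocks under it."""
    headline: str
    body: list = field(default_factory=list)
    category: str = "unknown"


# Assumed keyword heuristics; a real system would likely use a model here.
KEYWORDS = {
    "obituary": ("died", "funeral"),
    "advertisement": ("for sale", "price"),
}


def segment(blocks, headline_font_size=14.0):
    """Group blocks into articles: a large-font block starts a new article."""
    articles = []
    # Toy reading order: column by column (x), then top to bottom (y).
    for block in sorted(blocks, key=lambda b: (b.x, b.y)):
        if block.font_size >= headline_font_size:
            articles.append(Article(headline=block.text))
        elif articles:
            articles[-1].body.append(block.text)
        # Body text that appears before any headline is dropped in this sketch.
    return articles


def classify(article):
    """Return the first category whose keywords appear in the article text."""
    text = " ".join([article.headline, *article.body]).lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "news"


def extract(blocks):
    """Run segmentation, then classification, over one page of blocks."""
    articles = segment(blocks)
    for article in articles:
        article.category = classify(article)
    return articles


page = [
    Block(0, 0, "JOHN SMITH DEAD", 18.0),
    Block(0, 30, "Mr. Smith died on Tuesday; the funeral is Friday.", 9.0),
    Block(120, 0, "FINE HORSES", 16.0),
    Block(120, 30, "Two mares for sale, price on request.", 9.0),
]
for article in extract(page):
    print(article.category, "|", article.headline)
```

The point of the sketch is only the staging: each step consumes the previous step's output, so a heuristic (such as the font-size rule) can be swapped for a model without touching the other stages.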

If you have five minutes and you just want to jump in and have your own personalized experience, what I would suggest is:

1. Before searching for anything, go to the Sleuth page.

2. Ask it about anything from 1736 to 1963, maybe 1 or 2 follow-up questions.

3. Then go to the search page so you can see the queries it wrote for you (bottom left, "saved queries") and uncover more info on whatever it is you're interested in.

If you think it's cool and you want to learn more, there are about 10 minutes of video guides on the various capabilities under "Guide" in the nav bar.

Some other people have also taken a crack at this, notably:

https://dell-research-harvard.github.io/resources/americanst... (very good attempt)

https://labs.loc.gov/work/experiments/newspaper-navigator/ (focused on images)

Comments

brettnbutter•1h ago
A few examples you can click on without having to authenticate or sign up for a free trial:

https://snewpapers.com/components/b2d40c08-db63-40e8-890f-09...

https://snewpapers.com/components/0fabc8e4-a60b-4f31-9ad1-b0...

https://snewpapers.com/components/cdde790f-4e97-4f2d-a2c2-95...

Show HN: Mljar Studio – local AI data analyst that saves analysis as notebooks

https://mljar.com/
3•pplonski86•12m ago•0 comments

Show HN: DAC – open-source dashboard as code tool for agents and humans

https://github.com/bruin-data/dac
11•karakanb•2d ago•0 comments

Show HN: Browser-based light pollution simulator using real photometric data

https://iesna.eu/?wasm=skyglow_demo
11•holg•1h ago•0 comments

Show HN: Filling PDF forms with AI using client-side tool calling

https://copilot.simplepdf.com/?share=a7d00ad073c75a75d493228e6ff7b11eb3f2d945b6175913e87898ec96ca...
12•nip•1h ago•5 comments

Show HN: SimDrive – a browser racing game with your phone as the controller:D

https://simdrive.xyz/
4•1000xcat•2d ago•2 comments

Show HN: Stop playing my matchstick puzzles, start building your own in seconds

https://mathstick.github.io
14•trangram•5h ago•13 comments

Show HN: I built Male Hormone Lab Interpreter that does what LLMs can't

https://www.longevity-tools.com/male-hormones-interpreter
2•zsolt224•1h ago•0 comments

Show HN: AI CAD Harness

https://fusion.adam.new/install
85•zachdive•16h ago•86 comments

Show HN: Agent-desktop – Native desktop automation CLI for AI agents

https://github.com/lahfir/agent-desktop
81•lahfir•8h ago•25 comments

Show HN: Create the right image sizes for social media

https://skills.sh/branding5/social-media-image-sizes/social-media-image-sizes
2•mnewme•2h ago•0 comments

Show HN: WhatCable, a tiny menu bar app for inspecting USB-C cables

https://github.com/darrylmorley/whatcable
506•sleepingNomad•1d ago•146 comments

Show HN: Glacier – A zero-config macOS terminal I vibecoded in Rust

https://github.com/pranjolm/glacier-terminal
2•ArqueNova•3h ago•0 comments

Show HN: Agent with its own computer on the cloud

https://pulsarbot.cloud/
2•akshayballal95•3h ago•0 comments

Show HN: Site Mogging

https://sitemogging.com
63•jilles•23h ago•73 comments

Show HN: Loopsy, a way for terminals and AI agents on different machines to talk

https://github.com/leox255/loopsy
51•todience•1d ago•8 comments

Show HN: GhostBox – Borrow a disposable little machine from the Global Free Tier

https://www.ghost.charity/
119•keepamovin•19h ago•85 comments

Show HN: Perfect Bluetooth MIDI for Windows

101•mayerwin•1d ago•31 comments

Show HN: Raptor – fast, energy efficient small file uploads to S3

https://github.com/proxylity/raptor
4•mlhpdx•6h ago•0 comments

Show HN: My Private GitHub on Postgres

https://github.com/calebwin/gitgres
39•calebhwin•16h ago•23 comments

Show HN: Omar – A TUI for managing 100 coding agents

https://omar.tech
14•karim7•15h ago•2 comments

Show HN: Blotter, a live map of police radio activity

https://blotter.fm
6•s_e__a___n•16h ago•2 comments

Show HN: MemHub, Turn Your GPT/Claude/Gemini History into LLM-Wiki Mindmap

https://github.com/XTraceAI/memhub-llm-wiki-guide
4•TristanX•9h ago•0 comments

Show HN: Pu.sh – a full coding-agent harness in 400 lines of shell

https://pu.dev/
88•nahimn•1d ago•26 comments

Show HN: Winpodx – run Windows apps on Linux as native windows

https://github.com/kernalix7/winpodx
96•kernalix7•1d ago•47 comments

Show HN: A new benchmark for testing LLMs for deterministic outputs

https://interfaze.ai/blog/introducing-structured-output-benchmark
58•khurdula•2d ago•28 comments

Show HN: WeSearch – Anonymous news aggregator with no algorithm, 700 sources

https://wesearch.press/
14•EGCstudy•1d ago•8 comments

Show HN: Drive any macOS app in the background without stealing the cursor

https://github.com/trycua/cua
186•frabonacci•3d ago•41 comments

Show HN: Fast, privacy-first macOS configuration bootstrapper for (MDM) Macs

https://mac.olegkoval.com/
2•orthodoz•12h ago•0 comments

Show HN: Rocky – Rust SQL engine with branches, replay, column lineage

https://github.com/rocky-data/rocky
119•hugocorreia90•3d ago•48 comments