I have a much simpler database: https://play.clickhouse.com/play?user=play#U0VMRUNUIHRpbWUsI...
I did something similar. I built a tool[1] to import the Project Arctic Shift dumps[2] of Reddit into SQLite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build an FTS5 index and import without WAL (--unsafe-mode), importing every Reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB.
SQLite offers a lot of cool JSON features that would let you store the raw JSON and operate on that, but I eschewed them in favor of parsing only once at load time. That also lets me normalize the data a bit.
I find that building the DB is pretty "fast", but queries run much faster if I immediately vacuum the DB after building it. The vacuum operation is actually slower than the original import, taking a few days to finish.
[1] https://github.com/Paul-E/Pushshift-Importer
[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
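For anyone who wants to try the same workflow on a smaller scale, here is a minimal sketch in Python (the tool above is written in Rust, so this is not its code). It parses each JSON line once at load time, stores the fields as plain columns, skips the journal during import, and vacuums at the end. The table layout, the field names and the exact PRAGMA settings are assumptions for illustration; they are not taken from the importer or from what --unsafe-mode actually does.

```python
import json
import sqlite3


def bulk_load(db_path, json_lines):
    """Parse each JSON line once and insert selected fields as plain columns."""
    # isolation_level=None gives autocommit mode, so transactions are explicit.
    conn = sqlite3.connect(db_path, isolation_level=None)
    # Assumed settings: trade durability for speed during a one-off import.
    conn.execute("PRAGMA journal_mode = OFF")
    conn.execute("PRAGMA synchronous = OFF")
    # Hypothetical, simplified schema (not the importer's real one).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comment ("
        "id TEXT PRIMARY KEY, author TEXT, body TEXT, score INTEGER)"
    )
    conn.execute("BEGIN")
    rows = (
        (rec["id"], rec.get("author"), rec.get("body"), rec.get("score"))
        for rec in map(json.loads, json_lines)
    )
    conn.executemany("INSERT OR IGNORE INTO comment VALUES (?, ?, ?, ?)", rows)
    conn.execute("COMMIT")
    # Rebuild the database file after loading; VACUUM must run outside a transaction.
    conn.execute("VACUUM")
    return conn
```

With journaling off, a crash mid-import can leave the file corrupt, so this only makes sense when the import can simply be rerun from the dumps.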
I watched it in the browser network panel and saw it fetch:
https://hackerbook.dosaygo.com/static-shards/shard_1636.sqlite.gz
https://hackerbook.dosaygo.com/static-shards/shard_1635.sqlite.gz
https://hackerbook.dosaygo.com/static-shards/shard_1634.sqlite.gz
As I paginated to previous days. It's reminiscent of that brilliant SQLite.js VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - except that one used HTTP range headers, while this one uses sharded files instead.
The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against; there are 1636 in total.
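The shard approach is easy to imitate outside the browser. Below is a rough sketch in Python, assuming each shard is an ordinary SQLite file that has been gzipped, which is what the .sqlite.gz names suggest. The URL pattern is copied from the requests listed above; the helper names are made up for illustration, and downloading the bytes (for example with urllib) is left out so the example stays offline.

```python
import gzip
import os
import sqlite3
import tempfile

# URL pattern as seen in the network panel.
SHARD_URL = "https://hackerbook.dosaygo.com/static-shards/shard_{n}.sqlite.gz"


def shard_url(n):
    """Build the URL for shard number n."""
    return SHARD_URL.format(n=n)


def open_shard(gz_bytes):
    """Decompress one gzipped shard to a temp file; return (connection, path)."""
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    with os.fdopen(fd, "wb") as f:
        f.write(gzip.decompress(gz_bytes))
    return sqlite3.connect(path), path


def query_shards(shards, sql, params=()):
    """Run the same query against each shard and concatenate the rows."""
    rows = []
    for gz_bytes in shards:
        conn, path = open_shard(gz_bytes)
        try:
            rows.extend(conn.execute(sql, params).fetchall())
        finally:
            conn.close()
            os.remove(path)
    return rows
```

Note that any ORDER BY or LIMIT in the query applies per shard here, so results that need a global ordering have to be sorted again after concatenation.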
Converting 22GB of uncompressed text into a video essay lands us at ~1PB or 1000TB.
keepamovin•3h ago
Go to this repo (https://github.com/DOSAYGO-STUDIO/HackerBook): you can download it. BigQuery -> ETL -> npx serve docs - that's it. 20 years of HN arguments and beauty, can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's your hands. That's my Year End gift to you all. Thank you for a wonderful year, have a happy and wonderful 2026. Make something of it.
carbocation•2h ago
Question - did you consider tradeoffs between duckdb (or other columnar stores) and SQLite?
keepamovin•2h ago
cess11•1h ago
It's different in that it is tailored to analytics: among other things, storage is columnar, and it can run directly off some common data analytics file formats.
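To make "columnar" concrete, here is a toy illustration in plain Python. It says nothing about how DuckDB is implemented; the field names and values are invented. The point is only that an aggregate over one field has to walk every record in a row layout, while in a column layout it reads a single contiguous list.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into one list per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def mean_score_rowwise(rows):
    """Row layout: every whole record is visited just to read one field."""
    return sum(row["score"] for row in rows) / len(rows)


def mean_score_columnar(columns):
    """Column layout: only the 'score' list is read; other fields are never touched."""
    scores = columns["score"]
    return sum(scores) / len(scores)


# Invented sample data for illustration.
rows = [
    {"id": 1, "by": "alice", "score": 10},
    {"id": 2, "by": "bob", "score": 4},
    {"id": 3, "by": "carol", "score": 7},
]
columns = to_columns(rows)
```

Storing similar values next to each other is also what makes per-column compression effective, which ties in with the compression point below.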
fsiefken•1h ago
It has transparent compression built-in and has support for natural language queries. https://buckenhofer.com/2025/11/agentic-ai-with-duckdb-and-s...
"DICT FSST (Dictionary FSST) represents a hybrid compression technique that combines the benefits of Dictionary Encoding with the string-level compression capabilities of FSST. This approach was implemented and integrated into DuckDB as part of ongoing efforts to optimize string storage and processing performance." https://homepages.cwi.nl/~boncz/msc/2025-YanLannaAlexandre.p...
simonw•1h ago
So you can dump e.g. all of Hacker News in a single multi-GB Parquet file somewhere and build a client-side JavaScript application that can run queries against that without having to fetch the whole thing.
You can run searches on https://lil.law.harvard.edu/data-gov-archive/ and watch the network panel to see DuckDB in action.
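The reason a client does not have to fetch the whole file is that a Parquet file keeps its metadata at the end: the last 8 bytes are a 4-byte little-endian metadata length followed by the magic bytes PAR1. A reader can request just that tail, then request the metadata, then request only the column chunks it needs. The sketch below, in Python, shows the byte arithmetic only; the function names are made up, and it is not meant to describe DuckDB's actual request sequence.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def range_header(start, length):
    """HTTP Range header value for `length` bytes beginning at offset `start`."""
    return f"bytes={start}-{start + length - 1}"


def footer_range(file_size, tail):
    """Given the last 8 bytes of a Parquet file, return (start, length) of its metadata."""
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet footer")
    (meta_len,) = struct.unpack("<I", tail[:4])
    start = file_size - 8 - meta_len
    if start < len(PARQUET_MAGIC):
        raise ValueError("metadata length is larger than the file")
    return start, meta_len
```

A client would first send a request with the header value "bytes=-8" (a suffix range for the final 8 bytes), pass the response to footer_range, and then use range_header on the result to fetch the metadata itself.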
linhns•2h ago
cobolcomesback•1h ago
formerly_proven•1h ago
Doesn't scream columnar database to me.
embedding-shape•1h ago
agolliver•1h ago
3eb7988a1663•1h ago
wslh•2h ago
With all due respect, it would be great if there were an official HN public dump available (one that doesn't require something like BigQuery, which is expensive).
yupyupyups•1h ago
Thank you btw
abixb•1h ago
I've been taking frequent "offline-only-day" breaks to consolidate whatever I've been learning, and Kiwix has been a great tool for reference (offline Wikipedia, StackOverflow and whatnot).
[0] https://kiwix.org/en/the-new-kiwix-library-is-available/
Barbing•6m ago
fao_•1h ago
> 20 years of HN arguments and beauty, can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's your hands
I'm really sorry to have to ask this, but this really feels like you had an LLM write it?
rantingdemon•1h ago
sundarurfriend•1h ago
JavGull•1h ago
naikrovek•16m ago
Ooh, I used “sequential”, ooh, I used an em dash. ZOMG AI IS COMING FOR US ALL
deadbabe•55m ago
walthamstow•1h ago
I wonder if there's something like this going on here. I never thought it was LLM on first read, and I still don't, but when you take snippets and point at them, it makes me think maybe they are.
naikrovek•31m ago
Ending a sentence with a question mark doesn’t automatically make your sentence a question. You didn’t ask anything. You stated an opinion and followed it with a question mark.
If you intended to ask if the text was written by AI, no, you don’t have to ask that.
I am so damn tired of the “that didn’t happen” and the “AI did that” people when there is zero evidence of either being true.
These people are the most exhausting people I have ever encountered in my entire life.
jesprenj•27m ago
tevon•1h ago
scsh•57m ago