Show HN: Self-host Reddit – 2.38B posts, works offline, yours forever

https://github.com/19-84/redd-archiver
286•19-84•3w ago
Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
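
As a rough illustration of that pipeline (not the project's actual code), here is a minimal Python sketch that streams a Pushshift-style .zst dump of newline-delimited JSON and renders a trivial static fragment with Jinja2. The file name and template are hypothetical; the title/author fields follow the Pushshift submission schema.

```python
import json
import zstandard as zstd     # pip install zstandard
from jinja2 import Template  # pip install Jinja2

# Hypothetical input file; Pushshift dumps are zstd-compressed NDJSON.
DUMP = "example_subreddit_submissions.zst"
ROW = Template("<li>{{ title }} (by {{ author }})</li>", autoescape=True)

rows = []
with open(DUMP, "rb") as fh:
    # Pushshift dumps use a long zstd window, so raise the decoder limit.
    reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
    buf = b""
    while chunk := reader.read(2**20):
        buf += chunk
        *lines, buf = buf.split(b"\n")
        for line in lines:
            if line.strip():
                post = json.loads(line)
                rows.append(ROW.render(title=post.get("title", ""),
                                       author=post.get("author", "")))

with open("index.html", "w", encoding="utf-8") as out:
    out.write("<ul>\n" + "\n".join(rows) + "\n</ul>\n")
```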

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.
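
For example, a self-hosted instance could be queried with a few lines of Python; the base URL, endpoint path, and parameter names below are illustrative guesses, so check the repo's API docs for the real routes.

```python
import requests  # pip install requests

# Hypothetical query against a locally hosted redd-archiver API.
BASE = "http://localhost:8080/api"

resp = requests.get(
    f"{BASE}/search",
    params={"q": "printer driver", "limit": 5},
    timeout=10,
)
resp.raise_for_status()
for post in resp.json().get("results", []):
    print(post.get("subreddit"), "-", post.get("title"))
```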

Self-hosting options: - USB drive / local folder (just open the HTML files) - Home server on your LAN - Tor hidden service (2 commands, no port forwarding needed) - VPS with HTTPS - GitHub Pages for small archives

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/

GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d...

Comments

NickNaraghi•3w ago
Data is available via torrent in this section: https://github.com/19-84/redd-archiver?tab=readme-ov-file#-g...
19-84•3w ago
I have also published sub statistics and profiling for each platform. These can be used to help identify which subs to prioritize for archiving.

reddit: https://github.com/19-84/redd-archiver/blob/main/tools/subre...

voat: https://github.com/19-84/redd-archiver/blob/main/tools/subve...

ruqqus: https://github.com/19-84/redd-archiver/blob/main/tools/guild...

elSidCampeador•3w ago
I wonder if this can be hooked up with the now-dead Apollo app in some way, to get back a slice of time that is forever lost now?
19-84•3w ago
The API should allow for a lot of different integrations.
Aurornis•3w ago
Cool way to self-host archives.

What I'd really like is a plugin that automatically pulls from archives somewhere and replaces deleted comments and those bot-overwritten comments with the original context.

Reddit is becoming maddening to use because half the old links I click have comments overwritten with garbage out of protest for something. Ironically the original content is available in these archives (which are used for AI training) but now missing for actual users like me just trying to figure out how someone fixed their printer driver 2 years ago.

anonymous908213•3w ago
That would only really be ironic if the reason people overwrote their comments was to protest LLM training, but the main driver of by far the biggest wave of deletions was Reddit locking down their API. If the result of their protest is that the site is less useful for you, the user, then in fact it served its purpose, as the entire point was an attempt to boycott Reddit, ie. get people to stop using it by removing the user contributions that give the site its only value in the first place.
Aurornis•3w ago
> If the result of their protest is that the site is less useful for you, the user, then in fact it served its purpose, as the entire point was an attempt to boycott Reddit, ie. get people to stop using it by removing the user contributions that give the site its only value in the first place.

In practice I just give them more page views because I have to view more threads before I find the answer.

Reddit's DAU numbers have only gone up since the protest.

anonymous908213•3w ago
I did phrase it as "an attempt". In the end the protest probably wasn't as effective as protestors might have hoped, and it didn't get Reddit to change course on their enshittification decisions. I do think it was good that there was an attempt at pushback, at least, when most software users just accept enshittification as normal and continue tolerating whatever abuse their masters throw at them.
swed420•3w ago
> Reddit's DAU numbers have only gone up since the protest.

And so has the bot activity.

accrual•3w ago
Just offering another perspective because I see those missing comments too. The author decided they didn't want to participate in public discourse anymore and their comment is gone. So be it. I don't search archives or use tools to undermine their effort. I move on to the next thing.

I read "it's maddening because ... they decided to use their autonomy and..." and I stop there. So be it.

hrimfaxi•3w ago
People use their autonomy to maddening ends—how does the fact that it is of their own volition offer you any comfort? I ask genuinely. Is it something along the lines of recognizing the things you can't change?
dzelzs•3w ago
In this case - recognition of an attempt at doing something. Downplaying that is similar to downplaying protests for not achieving anything. At the very least it might have brought attention to the topic of contention for more people, which can be a spark for change. If you have apathy and disdain for attempts at change, it might be worth evaluating what the consequences of that might be at a societal level when that apathy is the norm for harder-to-change things (like politics, big corp practices, etc.)
accrual•2w ago
Thanks for the question. It's a couple days later but:

> how does the fact that it is of their own volition offer you any comfort? I ask genuinely. Is it something along the lines of recognizing the things you can't change?

Yes, pretty much exactly this. It's like living in a giant pot full of autonomous beings. Sometimes, others do unexpected or undesirable things. I can't control what they do, but I can control how I respond and what I do about it (if anything), and so I try to focus there.

Gander5739•3w ago
https://github.com/Fubs/reddit-uncensored
kylehotchkiss•3w ago
_Hacker News collectively grabs the dataset to train their models on how to become effective reddit trolls_
19-84•3w ago
The API and MCP server are very powerful ;)
layer8•3w ago
Don’t we have enough of those already? ;)
dvngnt_•3w ago
I want to do the same thing for TikTok. I have 5k videos downloaded, starting from the pandemic. I want to find a way to use AI to tag and categorize the videos so I can scroll through them locally.
syngrog66•3w ago
Did you pay all the people who created its content?
devilsdata•3w ago
I have no problem with this being downloaded for personal use, in fact that's a good thing. But of course we both know it'll be used to train AI.
nullandvoid•3w ago
Did anyone ever comment on reddit with an expectation of pay?

It's an open forum, similar to here: whatever I post is in the public forum, and therefore I expect it to be used / remixed however anyone wants.

nozzlegear•3w ago
> Did anyone ever comment on reddit with an expectation of pay?

Maybe Gallowboob

Sohcahtoa82•3w ago
That's a name I haven't seen in a LONG time.
antisthenes•3w ago
Reddit didn't pay me for posting either. Not that I posted in the last decade.
alcroito•3w ago
I tried spinning up the local approach with docker compose, but it fails.

There's no `.env.example` file to copy from. And even if the env vars are set manually, there are issues with the mentioned volumes not existing locally.

Seems like this needs more polish.

19-84•3w ago
Thank you for your comment. Some example dot files were not copied into my original repo; they have now been added.

https://github.com/19-84/redd-archiver/commit/0bb103952195ae...

The docs have been updated with the mkdir steps:

https://github.com/19-84/redd-archiver/commit/c3754ea3a0238f...

alcroito•3w ago
Cheers. I checked the updated steps.

This is still missing creating the `output/.postgres-data` dir, without which docker compose refuses to start.

After creating that manually, going to http://localhost/ shows a 403 Forbidden page, which makes you believe that something might have gone wrong.

This is before running `reddarchiver-builder python reddarc.py` to generate the necessary DB from the input data.

19-84•3w ago
I've updated the workflow and added a placeholder page that will be served before archives are created. Thanks again! https://github.com/19-84/redd-archiver/commit/0dfd505ca81cb2...
diggings•3w ago
This is a neat project, nice work.

You've probably come across this already but there are alternative archives to PushShift that may have differing sets of posts and comments (perhaps depending on removal request coverage?)

One is Arctic Shift: https://github.com/ArthurHeitmann/arctic_shift/releases

Another is PullPush: https://pullpush.io/

bkovacev•3w ago
Is there any way to check if a subreddit that was made private (2-3 years ago) is in the data dump?
19-84•3w ago
I included a metadata dump of every subreddit found in the torrent. It includes a status field which will show if a subreddit is private, along with many more details.

data catalog readme: https://github.com/19-84/redd-archiver/blob/main/tools/READM...

reddit data: https://github.com/19-84/redd-archiver/blob/main/tools/subre...

m463•3w ago
I wonder if you could use this to "Seed" a new distributed social media thing and just take over from there.

sort of like forking a project.

19-84•3w ago
I've created tooling for an instance registry and a team-based leaderboard. The API has functions to support this as well, so that we can collectively host archives in a decentralized and distributed manner.

registry readme: https://github.com/19-84/redd-archiver/blob/main/docs/REGIST...

register instances: https://github.com/19-84/redd-archiver/blob/main/.github/ISS...

drob518•3w ago
This is a great way to participate in arguments you missed three years ago.
justsomehnguy•3w ago
Appreciated.

EDIT: Is there any cheap way to search? I have an MS TechNet archive which is useless without search, so I really want to know a way to get cheap local search w/o grepping everything.

19-84•3w ago
redd-archiver uses Postgres full-text search. For static search you could use lunr.js.
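
For the Postgres route, a cheap local search can be a single tsvector query. Here is a minimal sketch with psycopg2, assuming a hypothetical comments(id, body) table rather than redd-archiver's actual schema:

```python
import psycopg2  # pip install psycopg2-binary

# Connection settings and table layout here are assumptions, not the
# project's real schema.
conn = psycopg2.connect("dbname=archive user=archive host=localhost")
with conn, conn.cursor() as cur:
    # A GIN index makes this fast on large tables:
    #   CREATE INDEX ON comments USING gin (to_tsvector('english', body));
    cur.execute(
        """
        SELECT id, body
        FROM comments
        WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
        LIMIT 10
        """,
        ("technet",),
    )
    for comment_id, body in cur.fetchall():
        print(comment_id, body[:80])
```
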
twobitshifter•3w ago
If reddit was a squeaky clean place, or if I could pick certain subs, maybe I would be interested, but I really wouldn't want ALL of reddit on my machine even temporarily.
19-84•3w ago
The torrent has data for the top 40,000 subs on Reddit. Thanks to Watchful1 splitting the data by subreddit, you can download only the subreddits you want from the torrent.
Imustaskforhelp•3w ago
I am going to be honest and this looks really cool.

40,000 subs is a good number, and I hope coverage can spread to even more subreddits.

Perhaps we can migrate all or much of the data to Lemmy instances as well, to finally get them up and running.

Thank you for creating this. It opens up a lot of interesting opportunities.

feconroses•3w ago
Very cool project! Quick question: is the underlying Pushshift dataset updated with new Reddit data on any regular cadence (daily/weekly/monthly), or is this essentially a fixed historical snapshot up to a certain date? Just want to understand if self-hosters would need to periodically re-download for fresh content or if it's archival-only.
19-84•3w ago
The data from 2025-12 has been released already; it is usually released every month, it just needs to be split and reprocessed for 2025 by Watchful1. I will probably eventually add support for importing data from the monthly Arctic Shift dumps so that archives can be updated monthly.

https://github.com/ArthurHeitmann/arctic_shift/releases

Arctic Shift https://academictorrents.com/browse.php?search=RaiderBDev

Watchful1 https://academictorrents.com/browse.php?search=Watchful1

riku_iki•3w ago
Is the data web-scraped? Is Reddit OK with that?
blks•3w ago
Does it also contain countless NSFW content?
blks•3w ago
Opened the live demo, went into the programming subreddit, felt like I was showered with liquid shit. I tend to forget what kind of edgelord hellhole Reddit was (and still is, sometimes).
vivzkestrel•3w ago
- Slightly off-topic here, but does anyone have a similar dataset of all YouTube channels out there?

- Details would probably include the 400 million YouTube accounts, channel ID, name, creator URL, etc.

19-84•3w ago
There is nearly 10 TB of YouTube metadata available on archive.org: https://archive.org/details/youtube-metadata
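
If the files in that item are publicly downloadable (some archive.org items are access-restricted), the internetarchive Python client can pull a subset by filename pattern; the glob below is just an example.

```python
from internetarchive import download  # pip install internetarchive

# Download a subset of files from the archive.org item linked above.
# The glob pattern is illustrative; restricted files will not download.
download(
    "youtube-metadata",
    glob_pattern="*.json*",
    destdir="youtube-metadata",
    verbose=True,
)
```
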
chicagojoe•3w ago
These show as unavailable/with lock icons for me. Is there some process to download locked content from IA?
leshokunin•3w ago
Is there a docker compose?
nick007x•3w ago
Hey, I’m working on a similar project and have uploaded Pushshift Reddit data to Hugging Face Datasets. If anyone wants to download specific files when torrents aren’t seeding well, you can use:

https://huggingface.co/datasets/nick007x/pushshift-reddit

It’s handy for grabbing individual months or subreddit slices without needing to pull the full torrent. Might be useful for smaller-scale archiving or testing.

Reverse-Engineering Raiders of the Lost Ark for the Atari 2600

https://github.com/joshuanwalker/Raiders2600
2•todsacerdoti•47s ago•0 comments

Show HN: Deterministic NDJSON audit logs – v1.2 update (structural gaps)

https://github.com/yupme-bot/kernel-ndjson-proofs
1•Slaine•4m ago•0 comments

The Greater Copenhagen Region could be your friend's next career move

https://www.greatercphregion.com/friend-recruiter-program
1•mooreds•4m ago•0 comments

Do Not Confirm – Fiction by OpenClaw

https://thedailymolt.substack.com/p/do-not-confirm
1•jamesjyu•5m ago•0 comments

The Analytical Profile of Peas

https://www.fossanalytics.com/en/news-articles/more-industries/the-analytical-profile-of-peas
1•mooreds•5m ago•0 comments

Hallucinations in GPT5 – Can models say "I don't know" (June 2025)

https://jobswithgpt.com/blog/llm-eval-hallucinations-t20-cricket/
1•sp1982•5m ago•0 comments

What AI is good for, according to developers

https://github.blog/ai-and-ml/generative-ai/what-ai-is-actually-good-for-according-to-developers/
1•mooreds•5m ago•0 comments

OpenAI might pivot to the "most addictive digital friend" or face extinction

https://twitter.com/lebed2045/status/2020184853271167186
1•lebed2045•6m ago•2 comments

Show HN: Know how your SaaS is doing in 30 seconds

https://anypanel.io
1•dasfelix•7m ago•0 comments

ClawdBot Ordered Me Lunch

https://nickalexander.org/drafts/auto-sandwich.html
1•nick007•8m ago•0 comments

What the News media thinks about your Indian stock investments

https://stocktrends.numerical.works/
1•mindaslab•9m ago•0 comments

Running Lua on a tiny console from 2001

https://ivie.codes/page/pokemon-mini-lua
1•Charmunk•9m ago•0 comments

Google and Microsoft Paying Creators $500K+ to Promote AI Tools

https://www.cnbc.com/2026/02/06/google-microsoft-pay-creators-500000-and-more-to-promote-ai.html
2•belter•12m ago•0 comments

New filtration technology could be game-changer in removal of PFAS

https://www.theguardian.com/environment/2026/jan/23/pfas-forever-chemicals-filtration
1•PaulHoule•13m ago•0 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
2•momciloo•13m ago•0 comments

Kinda Surprised by Seadance2's Moderation

https://seedanceai.me/
1•ri-vai•13m ago•2 comments

I Write Games in C (yes, C)

https://jonathanwhiting.com/writing/blog/games_in_c/
2•valyala•13m ago•0 comments

Django scales. Stop blaming the framework (part 1 of 3)

https://medium.com/@tk512/django-scales-stop-blaming-the-framework-part-1-of-3-a2b5b0ff811f
1•sgt•14m ago•0 comments

Malwarebytes Is Now in ChatGPT

https://www.malwarebytes.com/blog/product/2026/02/scam-checking-just-got-easier-malwarebytes-is-n...
1•m-hodges•14m ago•0 comments

Thoughts on the job market in the age of LLMs

https://www.interconnects.ai/p/thoughts-on-the-hiring-market-in
1•gmays•14m ago•0 comments

Show HN: Stacky – certain block game clone

https://www.susmel.com/stacky/
2•Keyframe•17m ago•0 comments

AIII: A public benchmark for AI narrative and political independence

https://github.com/GRMPZQUIDOS/AIII
1•GRMPZ23•17m ago•0 comments

SectorC: A C Compiler in 512 bytes

https://xorvoid.com/sectorc.html
2•valyala•19m ago•0 comments

The API Is a Dead End; Machines Need a Labor Economy

1•bot_uid_life•20m ago•0 comments

Digital Iris [video]

https://www.youtube.com/watch?v=Kg_2MAgS_pE
1•Jyaif•21m ago•0 comments

New wave of GLP-1 drugs is coming–and they're stronger than Wegovy and Zepbound

https://www.scientificamerican.com/article/new-glp-1-weight-loss-drugs-are-coming-and-theyre-stro...
5•randycupertino•23m ago•0 comments

Convert tempo (BPM) to millisecond durations for musical note subdivisions

https://brylie.music/apps/bpm-calculator/
1•brylie•25m ago•0 comments

Show HN: Tasty A.F. - Use AI to Create Printable Recipe Cards

https://tastyaf.recipes/about
2•adammfrank•25m ago•0 comments

The Contagious Taste of Cancer

https://www.historytoday.com/archive/history-matters/contagious-taste-cancer
2•Thevet•27m ago•0 comments

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

https://www.forbes.com/sites/mikestunson/2026/02/05/us-jobs-disappear-at-fastest-january-pace-sin...
2•alephnerd•27m ago•1 comments