Building a personal archive of the web, the slow way

https://alexwlchan.net/2025/personal-archive-of-the-web/

8•ingve•8mo ago

Comments

gwern•8mo ago

OP's workflow might be much more efficient with use of https://github.com/gildas-lormeau/SingleFile/

It can handle most of what they describe: private/paywalled pages, media enclosures, completely self-contained archives that live locally, ease of use, editing before saving, and ensuring lazy-loaded images are there. You can view the snapshot immediately to check for breakage. It automatically works with adblock and NoScript, and it respects whatever you delete in the DOM using the picker, so you can clean each page very efficiently (create a bunch of rules in your adblock by picking elements like in uBlock, so you never have to do those again, then quickly mouse any remainder). And it stores the final DOM, so you can interact with stuff to make sure it is visible or archived.

So what I do ( https://gwern.net/archiving#preemptive-local-archiving ) is I have a script which calls SingleFile-CLI in a headless Chrome browser to automatically archive everything, and then opens up the original URL + snapshot in my normal Firefox, where I look at the snapshot and then the original. If the snapshot looks good, I simply close the 2 tabs after a few seconds and I'm done; if the snapshot looks bad, then I look at the original and make edits: use uBlock Origin to define any necessary rules (assuming the page isn't cleaned up by all the rules I previously defined), make any minor tweaks to the DOM, and then save it manually with the SingleFile browser extension.
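
A minimal sketch of that archive-then-review loop, not the actual Gwern.net script. It assumes a `single-file` executable (SingleFile-CLI) and `firefox` are both on the PATH; the hash-based file naming and the function names are hypothetical choices made for illustration.

```python
import hashlib
import subprocess
from pathlib import Path


def snapshot_path(url: str, archive_dir: Path) -> Path:
    """Derive a stable snapshot file name from the URL (hypothetical scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return archive_dir / f"{digest}.html"


def build_commands(url: str, archive_dir: Path) -> list[list[str]]:
    """Return the commands to run, in order, without executing them."""
    out = snapshot_path(url, archive_dir)
    return [
        # 1. Save the page as one self-contained HTML file.
        ["single-file", url, str(out)],
        # 2. Open the local snapshot for review.
        ["firefox", "--new-tab", out.resolve().as_uri()],
        # 3. Open the original page next to it for comparison.
        ["firefox", "--new-tab", url],
    ]


def archive_and_review(url: str, archive_dir: Path) -> None:
    """Archive one URL, then open snapshot and original in Firefox."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(url, archive_dir):
        subprocess.run(cmd, check=True)
```

Calling `archive_and_review("https://example.com/", Path("archive"))` would run the three commands in sequence. Keeping command construction separate from execution makes the sequence easy to inspect before anything is launched.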

If you use enough adblock rules, then you get a similar effect to the 'templates' described, since it looks like OP is mostly just trying to remove as much as possible. But since you're archiving the final DOM, you can do anything you like. Something I've done a few times is opening up multiple pages and copy-pasting the key DOM node from each of them into the first one, to create a single consolidated master page, in a way which is a lot easier & more reliable than messing around with the serialized HTML in Emacs.

You can also post-process them. (Because we use these local archives for 'previews' on Gwern.net, and a fully static self-contained HTML page can easily be 100MB+ with all its fonts and images and stuff, we take the SingleFile snapshots and for the large ones, we 'split' them back up, so loading the .html file doesn't necessarily load everything else: https://github.com/gwern/gwern.net/blob/master/build/deconst... And then you can save a lot of space by running standard optimization tools on the split-out files, eg OptiPNG on the revealed PNGs will save gigabytes of space because so many people fail to do the standard image optimizations.)
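
As a rough illustration of the optimization step only (not the splitting script linked above), the sketch below walks a directory of split-out assets and runs `optipng` on each PNG. It assumes `optipng` is installed and on the PATH; the directory layout, the function names, and the `dry_run` parameter are hypothetical.

```python
import subprocess
from pathlib import Path


def find_pngs(root: Path) -> list[Path]:
    """Collect every PNG file under root, matching the extension case-insensitively."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".png"
    )


def optimize_pngs(root: Path, dry_run: bool = True) -> list[list[str]]:
    """Build one optipng command per PNG; run them only when dry_run is False."""
    commands = [["optipng", str(p)] for p in find_pngs(root)]
    if not dry_run:
        for cmd in commands:
            # optipng rewrites the file in place when it finds a smaller encoding.
            subprocess.run(cmd, check=True)
    return commands
```

With the default `dry_run=True`, the function only reports what it would run, which is a cautious way to check the file list before touching an archive.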

Compared to "it typically takes me a few minutes to save a page", I handle the majority of pages in a few seconds, and even the nastiest page where I have to delete a lot is usually like a minute. And since I do like 10 URLs a day, this is quite manageable at scale. (I'm up to >15k snapshots, although an unknown fraction are from an initial bulk archiving so may not be of high quality.)
