By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli
For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!
Cool!
It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.
Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.
In any case, cool stuff :)
Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There is also a lot of very complete archives of websites out there which are not being indexed at all, and I would love to make them available for researchers. In any case, I'm interested in your project!
maxloh•1h ago
It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.
They also offer a CLI powered by Puppeteer. [1]
[0]: https://github.com/gildas-lormeau/singlefile
[1]: https://github.com/gildas-lormeau/single-file-cli
tamnd•1h ago
What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
tamnd•1h ago
HelloUsername•59m ago
nmstoker•53m ago