Show HN: Kage – Shadow any website to a single binary for offline viewing

86•tamnd•1h ago

Comments

maxloh•1h ago

I find SingleFile [0] to be a much more robust version of this.

It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.

They also offer a CLI powered by Puppeteer. [1]

[0]: https://github.com/gildas-lormeau/singlefile

[1]: https://github.com/gildas-lormeau/single-file-cli
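For readers unfamiliar with the base64 packing mentioned above, here is a minimal Python sketch of the general idea: turning an asset into a `data:` URI and swapping it into the markup. This is not SingleFile's actual code; the function names `to_data_uri` and `inline_image` are hypothetical, and a real tool would parse the HTML rather than use string replacement.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw asset bytes as a data: URI, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"  # fallback for unknown extensions
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_image(html: str, src: str, data: bytes) -> str:
    """Replace one src="..." reference with the inlined data: URI (naive string match)."""
    return html.replace(f'src="{src}"', f'src="{to_data_uri(data, src)}"')
```

The trade-off is size: base64 inflates each asset by roughly a third, in exchange for a page with no external dependencies.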

tamnd•1h ago

It seems this repo only saves one web page?

What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.

tamnd•1h ago

And thanks for the link. Let me implement this single HTML feature, it looks nice to have!

HelloUsername•59m ago

What's the difference from File -> Save As in any web browser on a computer?

nmstoker•53m ago

That's for a single page, this handles the whole site. Also the browser Save As options often work poorly.

gregwebs•1h ago

This seems like it has the potential to create a lot of load on a site. Are there settings to control how fast it clones, or to skip images/videos? Is there a way to get only a subset of a website?
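To make the question concrete, here is a rough Python sketch of what such controls could look like in any crawler: a scope filter that restricts fetching to a path prefix and optionally skips media, plus a minimum delay between requests. None of this reflects Kage's actual options; `should_fetch`, `Throttle`, and the extension list are all hypothetical.

```python
import time
from urllib.parse import urlparse

# Assumed list of media extensions to skip; a real tool would likely make this configurable.
MEDIA_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".mp4", ".webm")


def should_fetch(url: str, root: str, skip_media: bool = True) -> bool:
    """True if url is on root's host, under root's path, and not skipped media."""
    target, base = urlparse(url), urlparse(root)
    if target.netloc != base.netloc:
        return False
    # Simple prefix check; give root a trailing slash so "/docs/" does not match "/docs2/".
    if not target.path.startswith(base.path):
        return False
    if skip_media and target.path.lower().endswith(MEDIA_EXTENSIONS):
        return False
    return True


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time the previous request was released

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A crawl loop would call `throttle.wait()` before each request and drop any discovered link for which `should_fetch` returns False.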

tamnd•1h ago

Could you help create a new issue for that? I will do it later. It is already 1:00 AM my time, but I am happy that anyone is interested in it. : )

sanqui•1h ago

Cool concept. I would like to see this combined with mitmproxy for archive grade fidelity. You could be saving exactly the data served and at the same time a representation by a modern (contemporary) browser, with all JS having run. This combination would be my perfect replacement for the WARC format.

tamnd•1h ago

I'm working on WARC too, with format from Common Crawl!

By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli

sanqui•1h ago

That's neat! In my opinion, the WARC format is quite tricky and underspecified, especially since HTTP/2 introduced new semantics. It encodes too much in-band and requires rewriting of the server data. A mitmproxy capture is higher fidelity and supports capturing modern features such as WebSockets. I think if we could wrap Kage's crawler interactions with it and store its capture (the intercepted traffic), we could make a potentially nice new archival format.

tamnd•1h ago

I tried to follow well-known formats first, such as WARC and ZIM from Kiwix, so we could benefit from existing tooling support.

For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!

rahimnathwani•1h ago

So this is like using wget --mirror except that it works on pages that require javascript, right?

tamnd•1h ago

Yeah, it is. For example, openai.com is rendered with Next.js, so I will try to mirror it tomorrow.

wolttam•1h ago

One use I'd have for this is company wikis that you want to give folks easy offline access to (maybe the wiki has documentation that's useful at sites that don't have cellular coverage).

Cool!

It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.

Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
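One way this idea could look, as a rough Python sketch rather than anything Kage does today: the pages are base64-encoded into a JSON blob embedded in one HTML file, and a few lines of JavaScript pick a page by URL fragment and show it in an iframe. The names `SHIM` and `build_shim` are hypothetical, and links inside the archived pages would still need rewriting to `#/path` form for navigation to work.

```python
import base64
import json

# HTML entrypoint; __ARCHIVE__ is replaced with a JSON map of path -> base64 HTML.
SHIM = """<!doctype html>
<meta charset="utf-8">
<title>Offline archive</title>
<iframe id="view" style="width:100%;height:95vh;border:0"></iframe>
<script type="application/json" id="archive">__ARCHIVE__</script>
<script>
const archive = JSON.parse(document.getElementById("archive").textContent);
function show() {
  const path = location.hash.slice(1) || "/";
  const b64 = archive[path];
  const html = b64
    ? new TextDecoder().decode(Uint8Array.from(atob(b64), c => c.charCodeAt(0)))
    : "<p>Not in archive</p>";
  document.getElementById("view").srcdoc = html;
}
window.addEventListener("hashchange", show);
show();
</script>
"""


def build_shim(pages: dict[str, str]) -> str:
    """Return one self-contained HTML document embedding every page in `pages`."""
    encoded = {
        path: base64.b64encode(html.encode("utf-8")).decode("ascii")
        for path, html in pages.items()
    }
    # Escape "</" so a key can never close the surrounding <script> element early.
    payload = json.dumps(encoded).replace("</", "<\\/")
    return SHIM.replace("__ARCHIVE__", payload)
```

Opening the resulting file directly from disk would show the page stored under `/`, and `file.html#/about` would show the one stored under `/about`, with no serving process involved.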

tamnd•1h ago

Submitting this to Hacker News is the right place! Thanks for your idea. I will consider implementing that :)

Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.

grahamstanes17•1h ago

nice

dimiprasakis•54m ago

Neat project, I like the idea. One thing from a quick read: you launch Chrome with --no-sandbox. Is there a good reason for that? Security wise it's probably not a good idea. If there is no reason, I'd suggest leaving the sandbox on!

In any case, cool stuff :)

lolpython•49m ago

This is cool. I could see myself downloading the articles behind the first couple pages of hacker news with this, for viewing on a flight or long distance train ride with spotty internet

daviding•46m ago

Nice idea! fwiw, false positives and all, but the Windows 11 default Windows Security doesn't like it: `leakless.exe: Operation did not complete successfully because the file contains a virus or potentially unwanted software.`

delduca•33m ago

curl can do this

Igor_Wiwi•27m ago

This is quite a useful tool, especially for cases where internet access is limited (flights, for example). I implemented it as a separate feature in mdview.io: for example, you can export a document as an HTML file for offline usage, with all the presentation features like rich tables, Mermaid, etc. built in. Example: https://mdview.io/s/why-markdown-became-default-format-for-a... then try Export - Export HTML

telesilla•17m ago

I've been using httrack (https://www.httrack.com) to download wikis to read on flights, which isn't perfect but better than I'd found previously. I'll try this out, I'd be delighted to have good results. Thanks for the post.
