
Peeking Inside Gigantic Zips with Only Kilobytes

https://ritiksahni.com/blog/peeking-inside-gigantic-zips-with-only-kilobytes/
33•rtk0•3mo ago

Comments

rtk0•3mo ago
In this blog post, I wrote about the architecture of a ZIP file and how we can leverage HTTP range requests to list and extract individual files, in-browser, without downloading the whole archive.
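A minimal sketch of the technique (my own illustration based on the ZIP spec, not the author's code): grab the file's tail, locate the End Of Central Directory record, then fetch just the central directory. The `fetch_range` helper and the field offsets are assumptions from the published ZIP layout.

```python
import struct
import urllib.request

def fetch_range(url, start, length):
    """Fetch `length` bytes starting at `start` via an HTTP Range request."""
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def parse_eocd(tail):
    """Find the End Of Central Directory record in the file's tail and
    return (central_directory_size, central_directory_offset)."""
    at = tail.rfind(b"PK\x05\x06")  # EOCD signature
    if at < 0:
        raise ValueError("EOCD record not found")
    return struct.unpack("<II", tail[at + 12:at + 20])

def parse_central_directory(cd):
    """Yield member file names from raw central-directory bytes."""
    pos = 0
    while cd[pos:pos + 4] == b"PK\x01\x02":  # central directory entry signature
        name_len, extra_len, comment_len = struct.unpack(
            "<HHH", cd[pos + 28:pos + 34])
        yield cd[pos + 46:pos + 46 + name_len].decode("utf-8", "replace")
        pos += 46 + name_len + extra_len + comment_len
```

To list a remote archive: fetch the last 65,558 bytes (the 22-byte EOCD plus up to 64 KiB of comment), run `parse_eocd`, then `fetch_range(url, cd_offset, cd_size)` and iterate `parse_central_directory`. Zip64 archives additionally need the EOCD64 locator, which this sketch omits.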
gildas•3mo ago
For implementation in a library, you can use HttpRangeReader [1][2] in zip.js [3] (disclaimer: I am the author). It's a solid feature that has been in the library for about 10 years.

[1] https://gildas-lormeau.github.io/zip.js/api/classes/HttpRang...

[2] https://github.com/gildas-lormeau/zip.js/blob/master/tests/a...

[3] https://github.com/gildas-lormeau/zip.js

toomuchtodo•3mo ago
Based on your experience, is zip the optimal archive format for long-term digital archival in object storage, if the use case calls for reading archives via HTTP for scanning and cherry-picking? Or is there a more optimal archive format?
gildas•3mo ago
Unfortunately, I will have difficulty answering your question because my knowledge is limited to the zip format. In the use case presented in the article, I find that the zip format meets the need well. Generally speaking, in the context of long-term archiving, its big advantage is also that there are thousands of implementations for reading/writing zip files.
duskwuff•3mo ago
ZIP isn't a terrible format, but it has a number of flaws and limitations which make it less than ideal for long-term archiving. The biggest ones I'd call out are:

1) The format has limited and archaic support for file metadata - e.g. file modification times are stored as a MS-DOS timestamp with a 2-second (!) resolution, and there's no standard system for representing other metadata.

2) The single-level central directory can be awkward to work with for archives containing a very large number of members.

3) Support for 64-bit file sizes exists but is a messy hack.

4) Compression operates on each file as a separate stream, reducing its effectiveness for archives containing many small files. The format does support pluggable compression methods, but there's no straightforward way to support "solid" compression.

5) There is technically no way to reliably identify a ZIP file, as the end of central directory record can appear at any location near the end of the file, and the file can contain arbitrary data at its start. Most tools recognize ZIP files by the presence of a local file header at the start ("PK\x03\x04"), but that's not reliable.
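Point 5 is easy to demonstrate with Python's stdlib `zipfile` (my own illustration): prepend arbitrary bytes to a valid archive and it still opens fine, because readers locate the EOCD record from the end rather than checking a magic number at offset 0.

```python
import io
import zipfile

# Build a tiny valid archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("hello.txt", "hi")

# Prepend arbitrary data: there is now no ZIP signature at offset 0,
# yet the result is still a readable archive, because readers search
# backwards from the end for the EOCD record (this is also how
# self-extracting archives work).
payload = b"#!/bin/sh\necho not a zip\n" + buf.getvalue()
names = zipfile.ZipFile(io.BytesIO(payload)).namelist()
```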

Lammy•3mo ago
> there's no straightforward way to support "solid" compression.

I do it by ignoring ZIP's native compression entirely, using store-only ZIP files and then compressing the whole thing at the filesystem level instead.

Here's an example comparison of the same WWW site rip in a DEFLATE ZIP, in a store-only ZIP with zstd filesystem compression, in a tar with the same zstd filesystem compression (identical size, but less useful for seeking due to its lack of a trailing directory versus ZIP), and finally the raw size pre-zipping:

  982M preserve.mactech.com.deflate.zip
  408M preserve.mactech.com.store.zip
  410M preserve.mactech.com.tar
  3.8G preserve.mactech.com


  [Lammy@popola] zfs get compression spinthedisc/Backups/WWW
  NAME                     PROPERTY     VALUE           SOURCE
  spinthedisc/Backups/WWW  compression  zstd            local

This probably wouldn't help GP with their need for HTTP seeking since their HTTP server would incur a decompress+recompress at the filesystem boundary.
nicman23•3mo ago
lool why use zip then anyways? put them in a folder
Lammy•3mo ago
It's for when you have a very large number of mostly-identical files, like web pages with consistent header and footer. If 408MiB versus 3.8GiB is a meaningless difference to you then sure don't bother with compression, but why I want it should be very obvious to most people here.
nicman23•3mo ago
you completely missed what i asked you but ok
Lammy•3mo ago
I don't think I did, but please explain :)

The last example in my list of four file sizes is them in a folder. Filesystem compression works at the file level, so you have to turn many almost-identical files into one file in order to benefit from it. ZFS does have block-level deduplication, but that's its own can of worms that shouldn't be turned on flippantly, due to the resource requirements and `recordsize` tuning needed to really benefit from it.

nicman23•3mo ago
you do not need dedup just use reflinks for everything. if that workflow does not work then eh i understand why you would use zips

although zfs dedup is probably better in 2025

gildas•3mo ago
FYI, zip.js has no issues with 1 (it can be fixed with standard extra fields), 3 (zip64 support), and 5 (you cannot have more than 64K of comment data at the end of the file).
duskwuff•3mo ago
With regard to the first two - that's good for zip.js, but the problem is that support for those features isn't universal. There's been a lot of fragmentation over the last 36 years (!).

As far as the last (file type detection) goes, the generally agreed upon standard is that file formats should be "sniffable" by looking for a signature in the file's header - ideally within the first few bytes of the file. Having to search through 64 KB of the file's end for a signature is a major departure from that pattern.

xg15•3mo ago
This is really cool! Could also make a useful standalone command line tool.

I think the general pattern - using the range header + prior knowledge of a file format to only download the parts of a file that are relevant - is still really underutilized.

One small problem I see is that a server that does not support range requests would just try to send you the entire file in the first request, I think.

So maybe doing a preflight HEAD request first to see if the server sends back Accept-Ranges could be useful.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...
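That preflight can be sketched as follows (my own illustration; note that a server may honor range requests without advertising `Accept-Ranges`, so probing with a one-byte Range request and checking for a 206 response is the more reliable test):

```python
import urllib.request

def advertises_ranges(headers) -> bool:
    """True if the response headers advertise byte-range support."""
    return headers.get("Accept-Ranges", "").lower() == "bytes"

def head_supports_ranges(url: str) -> bool:
    """Preflight HEAD request, as suggested above."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return advertises_ranges(resp.headers)

def probe_supports_ranges(url: str) -> bool:
    """Stronger check: request one byte and look for 206 Partial Content."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(req) as resp:
        return resp.status == 206
```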

xp84•3mo ago
How common is it in practice today to not support ranges? I remember back in the early days of broadband (c. 2000), when having a download manager was something most nerds endorsed, that most servers already supported partial downloads. Aside from toy projects, has anyone encountered a server which didn't allow ranges (unless specifically configured to forbid them)?
xg15•3mo ago
I'd guess everything where support would have to be manually implemented.

For static files served by CDNs or "established" HTTP servers, I think support is pretty much a given (though e.g. Python's FastAPI only got support in 2020 [1]), but for anything dynamic, I doubt many devs would go to the trouble of implementing support if it wasn't strictly necessary for their use case.

E.g. the URL may point to a service endpoint that loads the file contents from a database or blob storage instead of the file system. Then the service would have to implement range support itself and translate ranges into the necessary storage/database calls (if those exist), etc. That's some effort you have to put in.

Even for static files, there may be reverse proxies in front that (unintentionally) remove the support again. E.g. [2]

[1] https://github.com/Kludex/starlette/issues/950

[2] https://caddy.community/t/cannot-seek-further-in-videos-usin...

jeffrallen•3mo ago
Here's the results of my investigation into the same question:

https://blog.nella.org/2016/01/17/seeking-http/

(Originally written for Advent of Go.)

rtk0•3mo ago
Lovely. I had so much fun exploring and writing about this topic. Thanks for sharing.
HPsquared•3mo ago
7-Zip does this. You can see it if you open (to view) a large ZIP file on a slow network drive; there's no way it is downloading the whole thing. You can also extract single files from the ZIP with only a little traffic.
dividuum•3mo ago
Would be surprised if that’s not how basically all tools behave, as I expect them all to seek to the central directory and to the referenced offset of individual files when extracting. Doesn’t really make a difference if that’s across a network file system or a local disc.
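The seek-and-extract flow described here can be sketched like this (my own illustration; `read_at` stands in for any seekable source, whether a local file, a network filesystem, or an HTTP range fetcher):

```python
import struct
import zlib

def extract_member(read_at, lh_offset, comp_size, method):
    """Extract one member given its local-header offset, compressed size,
    and compression method, all of which come from the central directory.
    `read_at(offset, length)` returns `length` bytes at `offset`."""
    header = read_at(lh_offset, 30)  # fixed-size part of the local file header
    if header[:4] != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    data = read_at(lh_offset + 30 + name_len + extra_len, comp_size)
    if method == 8:              # DEFLATE: raw stream, hence negative wbits
        return zlib.decompress(data, -15)
    return data                  # method 0: stored
```

Two seeks per member (local header, then data), regardless of where the bytes actually live.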
aeblyve•3mo ago
This is also quite easy to do with .tar files, not to be confused with .tar.gz files.
dekhn•3mo ago
tar does not have an index.
Lammy•3mo ago
> That question took me into the guts of the ZIP format, where I learned there’s a tiny index at the end that points to everything else.

Tangential, but any Free Software that uses `shared-mime-info` to identify files (any of your GNOMEs, KDEs, etc) are unable to correctly identify Zip files by their EOCD due to lack of accepted syntax for defining search patterns based on negative file offsets. Please show your support on this Issue if you would also like to see this resolved: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues... (linking to my own comment, so no this is not brigading)

Anything using `file(1)` does not have this problem: https://github.com/file/file/blob/280e121/magic/Magdir/zip#L...

silasb•3mo ago
I've been looking at this for gzipped files as well. There is a Rust crate that looks interesting: https://docs.rs/indexed_deflate/latest/indexed_deflate/. My goal is to be able to index MySQL dump files by table boundaries.
dabinat•3mo ago
I wrote a Rust command-line tool to do this for internal use in my SaaS. The motivation was to be able to index the contents of zip files stored on S3 without incurring significant egress charges. Is this something that people would generally find useful if it were open-sourced?
rtk0•3mo ago
Yes, the motivation to explore was something similar. I was curious if downloading ZIP files could be made more efficient over the web.
saulpw•3mo ago
Here's my Python library that does the same[0]. And it's incorporated into VisiData so you can view a .csv from within a .zip file over HTTP without downloading the whole .zip file.

[0] https://github.com/saulpw/unzip-http/

rtk0•3mo ago
Lovely! Thanks for sharing. I had so much fun learning about ZIP and writing the blog post.
jacknews•3mo ago
My 16-year-old son did exactly this over the last week as part of his Rust Minecraft mod manager, using HTTP range requests to get the file length, then the directory, then individual file data.

I'll dig up a link.