I've built a small tool to visualize how inefficient `docker pull` is, in preparation for standing up a new Docker registry + transport. It's bugged me for a while that updating one dependency with Docker drags along many other changes. It's a huge problem with Docker+robotics. With dozens or hundreds of dependencies, there's no "right" way to organize the layers that doesn't end up invalidating a bunch of layers on a single dependency update - and this is ignoring things like compiled code, embedded ML weights, etc. Even worse, many robotics deployments are on terrible internet, either due to being out in the boonies or due to customer shenanigans. I've been up at 4AM before supporting a field tech who needs to pull 100MB of mostly unchanged Docker layers to 8 robots on a 1Mbps connection. (and I don't think that robotics is the only industry that runs into this, either - see the ollama example, that's a painful pull)
What if Docker were smarter and knew about the files that were already on disk? How many copies of `python3.10` do I have floating around `/var/lib/docker`? For that matter, how many copies of it does DockerHub have? A registry that could address and deduplicate at the file level rather than just the layer level is surely cheaper to run.
This tool:
- Given two docker images, one you have and one you are pulling, finds how much data docker pull would use, as well as how much data is _actually_ required to pull
- Shows an estimate for how much time you will save on various levels of cruddy internet
- There's a bunch of examples given of situations where more intelligent pulls would help, but the two image names are free text, feel free to write your own values there and try it out (one at a time though, there's a work queue to analyze new image pairs)
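To make the comparison concrete, here's a rough sketch of the arithmetic in Python. It is not the tool's actual code: the `Layer` type, the function names, and the example digests and sizes are all made up for illustration, and it ignores compression by treating layer sizes and file sizes as the same kind of bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    digest: str  # layer digest (hypothetical values below)
    size: int    # bytes transferred if the whole layer is pulled
    files: dict  # path -> (content hash, size in bytes)


def layer_pull_bytes(have, want):
    """Bytes a layer-level pull transfers: every wanted layer not already local."""
    local = {layer.digest for layer in have}
    return sum(layer.size for layer in want if layer.digest not in local)


def file_pull_bytes(have, want):
    """Bytes needed if only file contents missing from disk were transferred."""
    local = {h for layer in have for (h, _) in layer.files.values()}
    needed = {}  # content hash -> size, so identical files are counted once
    for layer in want:
        for content_hash, size in layer.files.values():
            if content_hash not in local:
                needed[content_hash] = size
    return sum(needed.values())


def transfer_seconds(num_bytes, mbps):
    """Idealized transfer time at a given link speed (no protocol overhead)."""
    return num_bytes * 8 / (mbps * 1_000_000)


# Made-up example: one small dependency changes inside a 20 MB layer.
base = Layer("sha256:base", 80_000_000, {"/usr/bin/python3.10": ("h-py", 80_000_000)})
deps_v1 = Layer("sha256:deps-v1", 20_000_000,
                {"/opt/a.so": ("h-a1", 1_000_000), "/opt/b.so": ("h-b", 19_000_000)})
deps_v2 = Layer("sha256:deps-v2", 20_000_000,
                {"/opt/a.so": ("h-a2", 1_000_000), "/opt/b.so": ("h-b", 19_000_000)})

have, want = [base, deps_v1], [base, deps_v2]
layer_bytes = layer_pull_bytes(have, want)  # 20_000_000: whole layer re-pulled
file_bytes = file_pull_bytes(have, want)    # 1_000_000: only a.so changed
print(transfer_seconds(layer_bytes, 1.0), transfer_seconds(file_bytes, 1.0))
```

On the made-up numbers, that's 160 seconds versus 8 seconds on a 1Mbps link, for the same single-file change.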
The one thing I wish it had but haven't gotten around to fitting in the UI somehow is a visualization of the files that _didn't_ change but are getting pulled anyhow.
It was written entirely in Claude Code, which is a new experience for me. I don't know nextjs at all, I don't generally write frontends. I could have written the backend maybe a little slower than Claude, but the frontend would have taken me 4x as long and wouldn't have been as pretty. It helped that I knew what I wanted on the backend, I think.
The registry/transport/snapshotter(?) I'm building will allow sharing files across docker layers both on your local machine and in the registry. There's a bit of prior art with this, but only on the client side. The eStargz format allows splitting apart the metadata for a filesystem and the contents, while still remaining OCI compliant - but it does lazy pulls of the contents, and has no deduplication. I think it could easily compete with other image providers both on cost (due to using less storage and bandwidth...everywhere) as well as speed.
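As a toy illustration of what file-level deduplication means here (not the design of the registry I'm building, and not how eStargz works), a content-addressed store keeps each distinct file once and represents a layer as a path-to-hash manifest. `FileStore` and `add_layer` are hypothetical names, and the file contents are placeholder bytes.

```python
import hashlib


class FileStore:
    """Toy content-addressed store: each distinct file content is kept once."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> file contents

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # no-op if the content is already stored
        return key

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


def add_layer(store: FileStore, files: dict) -> dict:
    """Store a layer's files and return its manifest: path -> content hash."""
    return {path: store.put(data) for path, data in files.items()}


# Two layers from different images that both ship the same interpreter.
python_bin = b"pretend this is python3.10" * 1000
store = FileStore()
layer_a = add_layer(store, {"/usr/bin/python3.10": python_bin, "/app/a.py": b"print('a')"})
layer_b = add_layer(store, {"/usr/bin/python3.10": python_bin, "/app/b.py": b"print('b')"})

# The interpreter is stored once even though both manifests reference it.
print(len(store.blobs), store.stored_bytes())
```

The same manifest-plus-blobs split works on the client side too: a pull only needs to fetch the hashes that aren't already in the local store.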
If you'd be interested, please reach out.
PaulHoule•1h ago