Show HN: Docker pulls more than it needs to - and how we can fix it

4•a_t48•2h ago

Hi all!

I've built a small tool to visualize how inefficient `docker pull` is, in preparation for standing up a new Docker registry + transport. It's bugged me for a while that updating one dependency with Docker drags along many other changes. It's a huge problem with Docker+robotics. With dozens or hundreds of dependencies, there's no "right" way to organize the layers that doesn't end up invalidating a bunch of layers on a single dependency update - and this is ignoring things like compiled code, embedded ML weights, etc. Even worse, many robotics deployments are on terrible internet, either due to being out in the boonies or due to customer shenanigans. I've been up at 4AM before supporting a field tech who needed to pull 100MB of mostly unchanged Docker layers to 8 robots on a 1Mbps connection. (And I don't think that robotics is the only industry that runs into this, either - see the ollama example, that's a painful pull.)

What if Docker were smarter and knew which files were already on disk? How many copies of `python3.10` do I have floating around `/var/lib/docker`? For that matter, how many copies of it does DockerHub have? A registry that could address and deduplicate at the file level rather than just the layer level is surely cheaper to run.
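To make the file-level idea concrete, here is a minimal sketch of counting how many bytes a content-addressed store would keep versus what per-layer storage keeps. The `layers` structure (layer name to path to file bytes) and the function names are made up for illustration; this is not how Docker or any registry actually stores data, and it ignores compression and file metadata.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Content address for a file: the SHA-256 of its bytes."""
    return hashlib.sha256(data).hexdigest()


def dedup_stats(layers: dict[str, dict[str, bytes]]) -> tuple[int, int]:
    """Return (bytes stored per layer, bytes stored once per unique file).

    `layers` is a hypothetical in-memory stand-in for unpacked image
    layers: layer name -> {path: file contents}.
    """
    total_bytes = 0
    unique_sizes: dict[str, int] = {}  # digest -> size, one entry per unique file
    for files in layers.values():
        for content in files.values():
            total_bytes += len(content)
            unique_sizes[file_digest(content)] = len(content)
    return total_bytes, sum(unique_sizes.values())


if __name__ == "__main__":
    python_binary = b"py" * 1000  # stand-in for the same interpreter in two layers
    layers = {
        "layer-a": {"/usr/bin/python3.10": python_binary, "/etc/a.conf": b"a"},
        "layer-b": {"/opt/venv/bin/python3.10": python_binary, "/etc/b.conf": b"bb"},
    }
    total, unique = dedup_stats(layers)
    print(f"per-layer storage: {total} bytes, file-level storage: {unique} bytes")
```

The same file under two different paths hashes to the same digest, so it is counted once in the second number.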

This tool:

- Given two docker images, one you have and one you are pulling, finds how much data docker pull would use, as well as how much data is _actually_ required to pull

- Shows an estimate for how much time you will save on various levels of cruddy internet

- Comes with a bunch of examples of situations where more intelligent pulls would help, but the two image names are free text, so feel free to write your own values there and try it out (one at a time though, there's a work queue to analyze new image pairs)
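The comparison described in the list above can be sketched roughly like this. It is an illustration under stated assumptions, not the tool's actual implementation: `FileEntry`, `layer_level_bytes`, `file_level_bytes`, and `transfer_seconds` are hypothetical names, images are modeled as plain dicts, and compression is ignored (real layers travel as compressed tarballs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    digest: str  # content hash of the file (hypothetical field)
    size: int    # size in bytes


Layer = dict[str, FileEntry]  # path -> file entry
Image = dict[str, Layer]      # layer digest -> layer contents


def layer_level_bytes(have: Image, want: Image) -> int:
    """Bytes pulled when any layer whose digest is not on disk is fetched whole."""
    return sum(
        sum(f.size for f in layer.values())
        for layer_id, layer in want.items()
        if layer_id not in have
    )


def file_level_bytes(have: Image, want: Image) -> int:
    """Bytes pulled if only files with unseen content had to be fetched."""
    known = {f.digest for layer in have.values() for f in layer.values()}
    needed: dict[str, int] = {}  # digest -> size, so each new file counts once
    for layer in want.values():
        for f in layer.values():
            if f.digest not in known:
                needed[f.digest] = f.size
    return sum(needed.values())


def transfer_seconds(num_bytes: int, mbps: float) -> float:
    """Ideal transfer time, taking 1 Mbps as 1,000,000 bits per second."""
    return num_bytes * 8 / (mbps * 1_000_000)


if __name__ == "__main__":
    base = {"/bin/sh": FileEntry("sh", 100_000)}
    have = {
        "sha256:base": base,
        "sha256:deps-v1": {
            "/usr/bin/python3.10": FileEntry("py", 5_000_000),
            "/usr/lib/libfoo.so": FileEntry("foo-v1", 1_000_000),
        },
    }
    want = {
        "sha256:base": base,
        "sha256:deps-v2": {
            "/usr/bin/python3.10": FileEntry("py", 5_000_000),
            "/usr/lib/libfoo.so": FileEntry("foo-v2", 1_000_000),
        },
    }
    for label, size in [
        ("layer-level", layer_level_bytes(have, want)),
        ("file-level", file_level_bytes(have, want)),
    ]:
        print(f"{label}: {size} bytes, {transfer_seconds(size, 1.0):.0f} s at 1 Mbps")
```

In this toy case, bumping one library invalidates the whole dependency layer, so the layer-level pull re-downloads the unchanged interpreter as well. By the same arithmetic, 100MB over an ideal 1Mbps link is 800 seconds for a single pull, before protocol overhead.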

The one thing I wish it had but haven't gotten around to fitting in the UI somehow is a visualization of the files that _didn't_ change but are getting pulled anyhow.

It was written entirely in Claude Code, which is a new experience for me. I don't know nextjs at all, I don't generally write frontends. I could have written the backend maybe a little slower than Claude, but the frontend would have taken me 4x as long and wouldn't have been as pretty. It helped that I knew what I wanted on the backend, I think.

The registry/transport/snapshotter(?) I'm building will allow sharing files across Docker layers both on your local machine and in the registry. There's a bit of prior art with this, but only on the client side. The eStargz format allows splitting apart the metadata for a filesystem and the contents, while still remaining OCI compliant - but it does lazy pulls of the contents, and has no deduplication. I think what I'm building could easily compete with other image providers both on cost (due to using less storage and bandwidth...everywhere) as well as speed.

If you'd be interested, please reach out.

Comments

PaulHoule•1h ago

Back in the early 2010s I couldn't bring up Docker images at all on my 2mbps DSL because any attempt to download images would time out.

theamk•1h ago

Reminds me of OSTree and casync.

danudey•1h ago

If you're interested in implementing this directly into your dockerfiles with some minimal changes, Docker already supports this to a degree:

https://docs.docker.com/reference/dockerfile/#copy---link

The TL;DR:

If you change your dockerfile to use `COPY --link <foo> <bar>`, then docker will create a layer containing only the files that would be copied, and that layer is treated as independent of layers coming before it. The only caveat is that you need to have a build cache with previous builds and use --cache-from to specify it, which means saving build state.

That said, there are a lot of benefits you can get very quickly if you can implement it. For example, if you have a dockerfile which creates a container, builds your golang application in it, and then copies the result into a fresh alpine:3.23.3 image, and you use a local cache for that build, then when you update to alpine 3.23.4 it will see that the build layers have not changed, therefore the `COPY --link` layer has not changed. Thus, it can just directly apply that on top of the new alpine image without doing any extra work.

Apparently it can even be smart enough to realize that it doesn't need to pull down the new alpine:3.23.4 image; it can just create a manifest that references its layers and upload the manifest; the new alpine image layers are there, the original 'my application' layers are already there, so it just creates a new manifest and publishes it. No bandwidth used at all!
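A rough way to picture that manifest-only publish, as a toy model rather than BuildKit's actual logic: a manifest is treated as an ordered list of layer digests, and only blobs the registry lacks would need uploading. The function names and digest strings below are invented for illustration.

```python
def rebase_manifest(new_base_layers: list[str], app_layers: list[str]) -> list[str]:
    """New image = new base layers followed by the unchanged, independent app layers."""
    return list(new_base_layers) + list(app_layers)


def layers_to_upload(manifest_layers: list[str], registry_blobs: set[str]) -> list[str]:
    """Layers the registry does not already hold, in manifest order."""
    return [digest for digest in manifest_layers if digest not in registry_blobs]


if __name__ == "__main__":
    # Hypothetical digests: the registry already has the new base layer and
    # the application layer from the previous build.
    registry_blobs = {"sha256:alpine-old", "sha256:alpine-new", "sha256:my-app"}
    manifest = rebase_manifest(["sha256:alpine-new"], ["sha256:my-app"])
    print("manifest layers:", manifest)
    print("layers to upload:", layers_to_upload(manifest, registry_blobs))
```

When the upload list comes back empty, only the new manifest itself has to be pushed.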

> How many copies of `python3.10` do I have floating around `/var/lib/docker`.

Well, if you use 'FROM python:3.10' for your images then only one.

If you're careful, you can sort of pull together contents of multiple images by using `COPY --link`, and then even if you have 10 layers then changing from python:3.10 to python:3.14 only changes one of them.

Again, this does require that you maintain a cache, but that cache can live in a lot of places; it doesn't have to be the local filesystem: https://docs.docker.com/reference/cli/docker/buildx/build/#c...

a_t48•1h ago

I'm well aware of `COPY --link`, it doesn't solve the problem. I'm a heavy heavy user of it, combined with throwaway build stages. `COPY --link` won't help my `apt install` commands.

The use case here isn't `FROM python:3.10`, it's `FROM ubuntu; RUN apt install -y vim wget curl software-properties-common python3.10`/`RUN rosdep install`/`RUN --mount=type=cache,target=/root/.cache/uv --mount=type=bind,source=uv.lock,target=uv.lock --mount=type=bind,source=pyproject.toml,target=pyproject.toml uv sync --locked --no-install-project`. All of those dependencies get merged onto a single layer that isn't shared with anything else. You'd better hope something like tensorflow isn't one of those dependencies.

