frontpage.

The 1979 Design Choice Breaking AI Workloads

https://www.cerebrium.ai/blog/rethinking-container-image-distribution-to-eliminate-cold-starts
22•za_mike157•3h ago

Comments

PaulHoule•2h ago
I remember dealing with this BS back in 2017. It was clear to me that containers were, more than anything else, a system for turning 15MB of I/O into 15GB of I/O.

But it was all new and shiny, so if you told people that, they'd just plug their ears with their fingers.

pocksuppet•2h ago
This doesn't follow from anything in the article.
PaulHoule•46m ago
I was working with prototypical foundation models and having the exact same problem. My diagnosis wasn't quite the same; I think more radical gains could be had with a "stamp out unnecessary copies everywhere" policy, but it looks like he did get through a bottleneck. The thing is, he's happy with a 3x speedup whereas I was looking for more like 300x. Then again, if it takes you 20 min to sling containers and 5 min to do real work, you'll probably be happy to 3x the container slinging.
formerly_proven•2h ago
The gzip compression of layers is actually optional in OCI images, but iirc not in legacy docker images; the two formats are not the same. On SSDs, the overhead of building an index for a tar is not that high, if we're primarily talking about large files (the data/weights/cuda layers rather than the system layers). The approach from the article is of course still faster, especially for running many minor variations of containers, though I do wonder how common it is for only some parts of the weights to change. I would've assumed that most things you do with weights change close to 100% of them when viewed through 1M chunks. The lazy pulling probably has some rather dubious/interesting service latency implications.

The main annoyance imho with gzip here is that it was already slow when the format was new (unless you have Intel QAT and bothered to patch and recompile that into all the go binaries which handle these, which you do not).
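The tar-index idea above is cheap to sketch: Python's standard tarfile module exposes each member's data offset, so a single header scan yields a random-access index over an uncompressed layer. File names and sizes here are made up for illustration:

```python
import io
import tarfile

# Build a small uncompressed tar in memory (stand-in for an image layer).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, payload in [("weights.bin", b"W" * 4096), ("config.json", b"{}")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

# One sequential pass over the headers yields an offset index...
buf.seek(0)
index = {}
with tarfile.open(fileobj=buf, mode="r:") as tf:
    for member in tf:
        index[member.name] = (member.offset_data, member.size)

# ...after which any member is a single seek away: no decompression,
# no scan of the preceding entries.
offset, size = index["config.json"]
buf.seek(offset)
print(buf.read(size))  # b'{}'
```

With gzip in the way, none of this works, because there is no byte offset to seek to in the compressed stream.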

jono_irwin•20m ago
Yeah that’s fair. For weights specifically there often isn’t a huge dedupe win across versions since retraining tends to change most of them. That said, we generally don’t advocate including model weights in container images anyway. The main benefit for us is avoiding the need to pull the full image up front and only fetching the data actually touched during startup. On the latency side, reads happen over a local network with caching and prefetching, so the impact on request latency is typically minimal.
MontyCarloHall•2h ago
I ran into a similar issue years ago, where the base infrastructure occupied the lion's share of the container size, very similar to the sizes shown in the article:

   Ubuntu base      ~29 MB compressed
   PyTorch + CUDA   7 – 13 GB
   NVIDIA NGC       4.5+ GB compressed
The easy solution that worked for us was to bake all of these into a single base container, and force all production containers built within the company to use that base. We then preloaded this base container onto our cloud VM disk images, so that pulling the model container only needed to download comparatively tiny layers for model code/weights/etc. As a benefit, this forced all production containers to be up-to-date, since we regularly updated the base container which caused automatic rebuilding of all derived containers.
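The bake-everything-into-one-base pattern is ordinary image layering; a sketch of what such a Dockerfile might look like, with an entirely hypothetical internal registry, tag, and file layout:

```dockerfile
# Hypothetical company-wide base: Ubuntu + CUDA + PyTorch baked once and
# preloaded onto node VM disk images, so deploys only pull the thin layers below.
FROM internal-registry.example.com/ml-base:cuda12-torch2

# Only these small layers travel over the network at deploy time.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
CMD ["python", "src/serve.py"]
```

Rebuilding derived images whenever the base tag moves is what keeps everything up to date, as the comment above describes.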
jono_irwin•41m ago
That approach works really well when you have a stable shared base image.

Where it starts to get harder is when you have multiple base stacks (different CUDA versions, frameworks, etc.) or when you need to update them frequently. You end up with lots of slightly different multi-GB bases.

Chunked images keep the benefit you mentioned (we still cache heavily on the nodes) but the caching happens at a finer granularity. That makes it much more tolerant to small differences between images and to frequent updates, since unchanged chunks can still be reused.
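A toy illustration of why chunk-level caching tolerates small differences between images: hash fixed-size chunks of two nearly identical blobs and count how many digests (and thus cached chunks) are shared. Real systems often use content-defined rather than fixed-offset chunking; this is only a sketch:

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB fixed-size chunks


def chunk_digests(blob: bytes) -> list[str]:
    """Split a blob into fixed-size chunks and hash each one."""
    return [hashlib.sha256(blob[i:i + CHUNK]).hexdigest()
            for i in range(0, len(blob), CHUNK)]


# Two image versions that differ only in one small region:
v1 = bytes(8 * CHUNK)
v2 = bytearray(v1)
v2[3 * CHUNK + 17] = 0xFF  # a one-byte change inside chunk 3

d1, d2 = chunk_digests(v1), chunk_digests(bytes(v2))
reused = sum(a == b for a, b in zip(d1, d2))
print(f"{reused}/{len(d2)} chunks reused")  # 7/8 — only the touched chunk re-downloads
```

Whole-layer caching would treat the second blob as 100% new; chunked caching re-fetches only the chunk that actually changed.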

andrewvc•2h ago
They say an ideal container system would download portions of layers on demand, however it seems far from ideal for many production workloads. What if your service starts, works fine for an hour, then needs to read one file that is only available over the network, but that endpoint is unreachable? What if it is reachable but very, very slow?

The current system has issues with network stuff, but in a deploy process you can confine all of that to the container deployment step. Perhaps you try to deploy a new container and it fails because the network is slow or broken; rollback is simple there. Spreading network issues out over time makes debugging much harder.

The current system is simple and resilient but clearly not fast. Trading speed for more complex failure modes for such a widely distributed technology is hardly a clear win.

The de-duplication seems like a neat win however.

jono_irwin•48m ago
Good point, network dependency is a valid concern.

In practice these systems typically fetch data over a local, highly available network and aggressively cache anything that gets read. If that network path becomes unavailable, it usually indicates a much larger infrastructure issue since many other parts of the system rely on the same storage or registry endpoints.

So while it does introduce a different failure mode, in most production environments it ends up being a low practical risk compared to the startup latency improvements.

For us and our customers, the trade off is worth it.

pocksuppet•2h ago
Clickbait title. Summary: Their AI docker containers are slow to start up because they are 10GB layers that have to be gunzipped, and gzip doesn't support random access.
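The random-access limitation in that summary is easy to demonstrate: a gzip member is one sequential DEFLATE stream, so reaching bytes near the end means inflating everything before them. A minimal illustration with the standard library:

```python
import gzip
import zlib

# gzip is a single sequential DEFLATE stream: to read bytes near the end
# you must decompress everything that precedes them.
payload = bytes(range(256)) * 4096          # ~1 MiB of sample data
gz = gzip.compress(payload)

target_off, want = len(payload) - 16, payload[-16:]

# There is no seek-to-offset API for the compressed stream; the only way
# is to inflate from the start and discard what comes before the target.
d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip framing
out = bytearray()
for i in range(0, len(gz), 64 * 1024):
    out += d.decompress(gz[i:i + 64 * 1024])
    if len(out) >= target_off + 16:
        break
print(out[target_off:target_off + 16] == want)  # True
```

Formats with an index (or chunked storage, as in the article) avoid exactly this decompress-everything-first cost.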
alanfranz•2h ago
Looks like they'd like something like git repositories (maybe with transparent compression on top) rather than .tar.gz files. Just pull the latest head and you're done.
cosmotic•1h ago
Why does the model data need to be stored in the image? Download the model data on container startup using whatever method works best.
jono_irwin•58m ago
hey cosmotic, we're not really advocating for storing model weights in the container image.

even the smaller nvidia images (like nvidia/cuda:13.1.1-cudnn-runtime-ubuntu24.04) are about 2GB before adding any python deps, and that is a problem.

if you split the image into chunks and pull on-demand, your container will start much faster.

za_mike157•57m ago
You are correct! From our tests, storing model weights in the image actually isn't a preferred approach for model weights larger than ~1GB. We run a distributed, multi-layer cache system to combat this and we can load roughly 6-7GB of files in p99 of <2.5s
dsr_•1h ago
The problem: "containers that take far too long to start".

Somehow, they don't hit upon the solution other organizations use: having software running all the time.

I suppose if you have a lousy economic model where the cost of running your software is a large percentage of your overall costs, that's a problem. I can only advise them to move to a model where they provide more value for their clients.

za_mike157•59m ago
A lot of AI workloads require GPUs, which are expensive, so customers would waste money running idle machines 24/7 at low utilisation, which kills gross margins. Loading containers quickly means we can scale up as requests come in, and you only pay for usage.

This model is already successful for CPU workloads (AWS Lambda), but AI models and images are 50x the size.

dsr_•56m ago
As I said, if only you were providing more value rather than being a commodity, you could avoid all this.
notyourbiz•1h ago
Super helpful.
za_mike157•57m ago
Glad you liked it!

Building a Procedural Hex Map with Wave Function Collapse

https://felixturner.github.io/hex-map-wfc/article/
185•imadr•3h ago•30 comments

JSLinux Now Supports x86_64

https://bellard.org/jslinux/
107•TechTechTech•3h ago•20 comments

Show HN: The Mog Programming Language

https://moglang.org
54•belisarius222•2h ago•16 comments

DARPA's new X-76

https://www.darpa.mil/news/2026/darpa-new-x-76-speed-of-jet-freedom-of-helicopter
73•newer_vienna•3h ago•61 comments

Bluesky CEO Jay Graber is stepping down

https://bsky.social/about/blog/03-09-2026-a-new-chapter-for-bluesky
90•minimaxir•57m ago•77 comments

Launch HN: Terminal Use (YC W26) – Vercel for filesystem-based agents

46•filipbalucha•3h ago•25 comments

Fixfest is a global gathering of repairers, tinkerers, and activists

https://fixfest.therestartproject.org/
100•robtherobber•2h ago•9 comments

Restoring a Sun SPARCstation IPX part 1: PSU and NVRAM (2020)

https://www.rs-online.com/designspark/restoring-a-sun-sparcstation-ipx-part-1-psu-and-nvram
70•ibobev•4h ago•36 comments

Fontcrafter: Turn Your Handwriting into a Real Font

https://arcade.pirillo.com/fontcrafter.html
366•rendx•10h ago•122 comments

Flash media longevity testing – 6 years later

https://old.reddit.com/r/DataHoarder/comments/1q6xnun/flash_media_longevity_testing_6_years_later/
100•1970-01-01•1d ago•46 comments

Show HN: DenchClaw – Local CRM on Top of OpenClaw

https://github.com/DenchHQ/DenchClaw
49•kumar_abhirup•5h ago•48 comments

Florida judge rules red light camera tickets are unconstitutional

https://cbs12.com/news/local/florida-news-judge-rules-red-light-camera-tickets-unconstitutional
119•1970-01-01•2h ago•193 comments

Durdraw – ANSI art editor for Unix-like systems

https://durdraw.org/
6•caminanteblanco•1h ago•1 comment

Rethinking Syntax: Binding by Adjacency

https://github.com/manifold-systems/manifold/blob/master/docs/articles/binding_exprs.md
18•owlstuffing•1d ago•4 comments

Ireland shuts last coal plant, becomes 15th coal-free country in Europe (2025)

https://www.pv-magazine.com/2025/06/20/ireland-coal-free-ends-coal-power-generation-moneypoint/
731•robin_reala•9h ago•442 comments

Jolla on track to ship new phone with Sailfish OS, user-replaceable battery

https://liliputing.com/the-new-jolla-phone-with-sailfish-os-is-on-track-to-start-shipping-in-the-...
127•heresie-dabord•3h ago•80 comments

Reverse-engineering the UniFi inform protocol

https://tamarack.cloud/blog/reverse-engineering-unifi-inform-protocol
120•baconomatic•7h ago•49 comments

An opinionated take on how to do important research that matters

https://nicholas.carlini.com/writing/2026/how-to-win-a-best-paper-award.html
33•mad•3h ago•3 comments

FreeBSD Capsicum vs. Linux Seccomp Process Sandboxing

https://vivianvoss.net/blog/capsicum-vs-seccomp
92•vermaden•7h ago•34 comments

What I Always Wanted to Know about Second Class Values

https://dl.acm.org/doi/epdf/10.1145/3759427.3760373
16•todsacerdoti•3h ago•8 comments

US Court of Appeals: TOS may be updated by email, use can imply consent [pdf]

https://cdn.ca9.uscourts.gov/datastore/memoranda/2026/03/03/25-403.pdf
488•dryadin•13h ago•376 comments

Algebraic topology: knots links and braids

https://aeb.win.tue.nl/at/algtop-5.html
47•marysminefnuf•5h ago•4 comments

The Most Beautiful Freezer in the World: Notes on Baking at the South Pole

https://www.newyorker.com/culture/the-weekend-essay/the-most-beautiful-freezer-in-the-world
5•mitchbob•54m ago•1 comment

Velxio, Arduino Emulator

https://velxio.dev/
6•dmonterocrespo•1d ago•4 comments

Workers report watching Ray-Ban Meta-shot footage of people using the bathroom

https://arstechnica.com/gadgets/2026/03/workers-report-watching-ray-ban-meta-shot-footage-of-peop...
34•randycupertino•1h ago•7 comments

Uber reported to the state that I was fired for "annoying a coworker."

https://anon-ex-uber.medium.com/uber-reported-to-the-state-that-i-was-fired-for-annoying-a-cowork...
45•anon-ex-uber•41m ago•14 comments

Is legal the same as legitimate: AI reimplementation and the erosion of copyleft

https://writings.hongminhee.org/2026/03/legal-vs-legitimate/
152•dahlia•4h ago•144 comments

Unlocking Python's Cores: Energy Implications of Removing the GIL

https://arxiv.org/abs/2603.04782
110•runningmike•3d ago•76 comments

Show HN: VS Code Agent Kanban: Task Management for the AI-Assisted Developer

https://www.appsoftware.com/blog/introducing-vs-code-agent-kanban-task-management-for-the-ai-assi...
82•gbro3n•9h ago•40 comments

Grammarly is offering ‘expert’ AI reviews from famous dead and living writers

https://www.wired.com/story/grammarly-is-offering-expert-ai-reviews-from-your-favorite-authors-de...
111•jmsflknr•4d ago•147 comments