
The 1979 Design Choice Breaking AI Workloads

https://www.cerebrium.ai/blog/rethinking-container-image-distribution-to-eliminate-cold-starts
22•za_mike157•2h ago

Comments

PaulHoule•2h ago
I remember dealing with this BS back in 2017. It was clear to me that containers were, more than anything else, a system for turning 15MB of I/O into 15GB of I/O.

But it was the new shiny, so if you told people that, they would just plug their ears with their fingers.

pocksuppet•1h ago
This doesn't follow from anything in the article.
PaulHoule•11m ago
I was working with prototypical foundation models and having the exact same problem. My diagnosis wasn't quite the same; I think more radical gains could be had with a "stamp out unnecessary copies everywhere" policy, but it looks like he did get through a bottleneck.
formerly_proven•2h ago
The gzip compression of layers is actually optional in OCI images, but IIRC not in legacy Docker images; the two formats are not the same. On SSDs, the overhead of building an index for a tar is not that high if we're primarily talking about large files (the data/weights/CUDA layers rather than the system layers). The approach from the article is of course still faster, especially for running many minor variations of containers, though I wonder how common it is for only some parts of the weights to change. I would've assumed that most things you do with weights change close to 100% of them when viewed through 1 MB chunks. The lazy pulling probably has some rather dubious/interesting service-latency implications.

The main annoyance imho with gzip here is that it was already slow when the format was new (unless you have Intel QAT and bothered to patch and recompile that into all the go binaries which handle these, which you do not).
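The "index for a tar" idea above is straightforward to sketch: because an uncompressed tar stores each member at a known byte offset, you can record those offsets once and then read any single file with one seek, no full extraction. A minimal illustration using Python's stdlib `tarfile` (the function names here are mine, not from the article):

```python
import tarfile

def build_tar_index(path):
    """Map each file member's name to (data_offset, size) so that
    individual files can later be read with a single seek."""
    index = {}
    with tarfile.open(path, "r:") as tf:  # "r:" = uncompressed tar only
        for member in tf:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(path, index, name):
    """Read one member directly, without scanning the whole archive."""
    offset, size = index[name]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

This is exactly what gzip breaks: once the tar is wrapped in a `.tar.gz`, the offsets are only meaningful after decompressing the stream from the start.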

MontyCarloHall•1h ago
I ran into a similar issue years ago, where the base infrastructure occupied the lion's share of the container size, very similar to the sizes shown in the article:

   Ubuntu base      ~29 MB (compressed)
   PyTorch + CUDA   7–13 GB
   NVIDIA NGC       4.5+ GB (compressed)
The easy solution that worked for us was to bake all of these into a single base container, and force all production containers built within the company to use that base. We then preloaded this base container onto our cloud VM disk images, so that pulling the model container only needed to download comparatively tiny layers for model code/weights/etc. As a benefit, this forced all production containers to be up-to-date, since we regularly updated the base container which caused automatic rebuilding of all derived containers.
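The baked-base pattern described above can be sketched as two build stages (the image tag and package here are illustrative assumptions, not the poster's actual setup):

```dockerfile
# Hypothetical shared base: bake OS + CUDA + PyTorch once, publish it
# internally, and preload it onto the cloud VM disk images.
FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 AS company-base
RUN pip install torch

# Every production image derives from the preloaded base, so a deploy
# only pulls the comparatively tiny code/weights layers on top.
FROM company-base
COPY app/ /app/
CMD ["python", "/app/serve.py"]
```

In practice the two stages would live in separate Dockerfiles: the base is built and pinned centrally, and application Dockerfiles start with `FROM company-base`.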
jono_irwin•6m ago
That approach works really well when you have a stable shared base image.

Where it starts to get harder is when you have multiple base stacks (different CUDA versions, frameworks, etc.) or when you need to update them frequently. You end up with lots of slightly different multi-GB bases.

Chunked images keep the benefit you mentioned (we still cache heavily on the nodes) but the caching happens at a finer granularity. That makes it much more tolerant to small differences between images and to frequent updates, since unchanged chunks can still be reused.
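The finer-granularity caching described above can be sketched in a few lines: split a blob into fixed-size chunks, address each chunk by its hash, and download only the chunks a node hasn't seen. (Fixed-size chunking and the function names are my simplification; real systems often use content-defined chunking.)

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB fixed-size chunks

def chunk_digests(blob: bytes):
    """Split a blob into fixed-size chunks and return their SHA-256 digests."""
    return [
        hashlib.sha256(blob[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(blob), CHUNK_SIZE)
    ]

def pull_plan(image_chunks, node_cache):
    """Only chunks missing from the node's cache need downloading."""
    return [d for d in image_chunks if d not in node_cache]
```

Two images that differ in one chunk then share everything else: the pull plan for the second image is a single chunk, regardless of total image size.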

andrewvc•1h ago
They say an ideal container system would download portions of layers on demand, but that seems far from ideal for many production workloads. What if your service starts, works fine for an hour, then needs to read one file that is only available over the network, but that endpoint is unreachable? What if it is reachable but very, very slow?

The current system has network issues too, but in a deploy process you can confine them all to the moment of container deployment. Perhaps a new container fails to deploy because the network is slow or broken; rollback is simple there. Spreading network issues out over time makes debugging much harder.

The current system is simple and resilient but clearly not fast. Trading speed for more complex failure modes for such a widely distributed technology is hardly a clear win.

The de-duplication seems like a neat win however.

jono_irwin•13m ago
Good point, network dependency is a valid concern.

In practice these systems typically fetch data over a local, highly available network and aggressively cache anything that gets read. If that network path becomes unavailable, it usually indicates a much larger infrastructure issue since many other parts of the system rely on the same storage or registry endpoints.

So while it does introduce a different failure mode, in most production environments it ends up being a low practical risk compared to the startup latency improvements.

For us and our customers, the trade-off is worth it.

pocksuppet•1h ago
Clickbait title. Summary: their AI Docker containers are slow to start because they are 10 GB layers that have to be gunzipped, and gzip doesn't support random access.
alanfranz•1h ago
Looks like they'd like something like git repositories (maybe with transparent compression on top) rather than .tar.gz files. Just pull the latest head and you're done.
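The git-like model suggested here is content-addressed storage: objects are keyed by a hash of their content, so identical content is stored once and a pull transfers only objects the client lacks. A minimal sketch of the idea (the in-memory `dict` store and function names are mine; real git also has trees, commits, and packfiles):

```python
import hashlib
import zlib

def store_object(db: dict, content: bytes) -> str:
    """Store a blob git-style: the key is the SHA-1 of a typed header
    plus the content, the value is the zlib-compressed payload.
    Storing the same content twice is a no-op."""
    payload = b"blob %d\x00" % len(content) + content
    oid = hashlib.sha1(payload).hexdigest()
    if oid not in db:
        db[oid] = zlib.compress(payload)
    return oid

def load_object(db: dict, oid: str) -> bytes:
    """Decompress a stored object and strip its header."""
    payload = zlib.decompress(db[oid])
    _header, _, content = payload.partition(b"\x00")
    return content
```

Deduplication falls out for free: unchanged layers or weights hash to the same object ID across revisions, so "pull the latest head" fetches only what changed.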
cosmotic•56m ago
Why does the model data need to be stored in the image? Download the model data on container startup using whatever method works best.
jono_irwin•23m ago
Hey cosmotic, we're not really advocating for storing model weights in the container image.

Even the smaller NVIDIA images (like nvidia/cuda:13.1.1-cudnn-runtime-ubuntu24.04) are about 2 GB before adding any Python deps, and that is a problem.

If you split the image into chunks and pull on demand, your container will start much faster.

za_mike157•22m ago
You are correct! From our tests, storing model weights in the image isn't the preferred approach for weights larger than ~1 GB. We run a distributed, multi-layer cache system instead, and we can load roughly 6–7 GB of files with a p99 under 2.5 s.
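The shape of a "distributed, multi-layer cache" can be illustrated with a tiny two-tier sketch: a small in-memory LRU tier in front of a slower backing store (standing in here for a node-local disk or cluster cache). The class and its interface are hypothetical, not the vendor's actual system:

```python
from collections import OrderedDict

class TieredChunkCache:
    """Minimal two-tier cache sketch: fast LRU tier over a slow backing store."""

    def __init__(self, backing, capacity=4):
        self.backing = backing      # slow tier, e.g. chunk_id -> bytes
        self.lru = OrderedDict()    # fast tier with LRU eviction
        self.capacity = capacity
        self.hits = 0
        self.misses = 0

    def get(self, chunk_id):
        if chunk_id in self.lru:
            self.lru.move_to_end(chunk_id)  # refresh recency on a hit
            self.hits += 1
            return self.lru[chunk_id]
        self.misses += 1
        data = self.backing[chunk_id]       # slow path: fetch from backing tier
        self.lru[chunk_id] = data
        if len(self.lru) > self.capacity:
            self.lru.popitem(last=False)    # evict least recently used
        return data
```

The latency numbers quoted above come from repeated reads hitting the fast tiers; only cold chunks pay the slow-path cost.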
dsr_•32m ago
The problem: "containers that take far too long to start".

Somehow, they don't hit upon the solution other organizations use: having software running all the time.

I suppose if you have a lousy economic model where the cost of running your software is a large percentage of your overall costs, that's a problem. I can only advise them to move to a model where they provide more value for their clients.

za_mike157•25m ago
A lot of AI workloads require GPUs, which are expensive, so customers would waste money running idle machines 24/7 at low utilisation, which kills gross margins. Being able to load containers quickly means we can scale up as requests come in, and you only pay for usage.

This model is successful for CPU workloads (e.g. AWS Lambda), but AI models and images are 50x the size.

dsr_•21m ago
As I said, if only you were providing more value rather than being a commodity, you could avoid all this.
notyourbiz•31m ago
Super helpful.
za_mike157•22m ago
Glad you liked it!
