platers•1h ago
I'm struggling to understand where the gains are coming from. What is the intuition for why DiT training was so inefficient?
joshred•49m ago
Here's a high-level explanation of the simplest diffusion setup. The model trains by taking an image and iteratively adding noise to it until only noise remains. It then learns to reverse that sequence: starting from pure noise, it predicts the noise to remove at each step until it reaches the final step, which should reconstruct the original image (the training input).
That process means they may require a hundred or more training iterations on a single image. I haven't digested the paper, but it sounds like they are proposing something conceptually similar to skip layers (but significantly more involved).
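Concretely, a single training step in the classic DDPM formulation looks roughly like this sketch (PyTorch-flavored; `model`, its `(x_t, t)` signature, and the noise schedule `alphas_cumprod` are placeholders for illustration, not anything from the paper):

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alphas_cumprod, optimizer):
    """One DDPM-style training step: noise a clean image batch at random
    timesteps and train the model to predict the added noise."""
    batch_size = x0.shape[0]
    num_steps = alphas_cumprod.shape[0]

    # Sample one random timestep per image in the batch.
    t = torch.randint(0, num_steps, (batch_size,), device=x0.device)

    # Forward process: blend each clean image with Gaussian noise
    # according to the cumulative noise schedule at its timestep.
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # assumes NCHW images
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

    # The model is trained to predict the noise that was added.
    pred_noise = model(x_t, t)
    loss = F.mse_loss(pred_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice you don't replay the whole noising sequence for one image in a single step; you sample one random timestep per example, so over the course of training each image gets revisited at many different noise levels, which is part of why it's so expensive.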
earthnail•52m ago
Wow, Ommer’s students never fail to impress. 37x faster for a generic architecture, i.e. no domain-specific tricks. Insane.