But recently there was a new paper on autoregressive image modelling (a NeurIPS 2024 Best Paper award winner) that apparently outperforms diffusion models: https://arxiv.org/abs/2404.02905
The innovation is that it doesn't predict image patches (like older autoregressive image models) but instead does a kind of "next scale" or "next resolution" prediction, generating the image coarse-to-fine.
In the past, autoregressive image models did not perform as well as diffusion models, which is why most image models used diffusion. Now it seems autoregressive techniques have a clear advantage over diffusion models. Another advantage is that they can be integrated directly with autoregressive LLMs (multimodality), which is not possible with diffusion image models. In fact, the recent GPT-4o image generation is autoregressive according to OpenAI. I wonder whether diffusion models still have a future.
I'm not 100% convinced that diffusion models are dead. That paper fixes autoregression for 2D spaces by basically turning the generation problem from pixel-by-pixel prediction into iterative upsampling, but if 2D were the problem (and 1D was not), why don't we have more autoregressive models in 1D spaces like audio?
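For intuition, here's a minimal sketch of what that "next scale" loop looks like, assuming a VAR-style coarse-to-fine schedule; the predictor is a random stub standing in for the paper's transformer, and the vocabulary size and scale schedule are my own illustrative choices, not the paper's:

```python
# Sketch of next-scale autoregressive generation in the style of VAR
# (arXiv:2404.02905). The predictor is a random stub, not a real model;
# VOCAB and SCALES are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 4096                    # size of the discrete token vocabulary
SCALES = [1, 2, 4, 8, 16]       # token-map side lengths, coarse to fine

def predict_next_scale(context_tokens, side):
    """Stand-in for the transformer: returns a (side, side) map of token
    ids conditioned on all coarser-scale tokens generated so far."""
    # A real model would attend over `context_tokens`; we just sample.
    return rng.integers(0, VOCAB, size=(side, side))

context = []                    # flattened tokens from all previous scales
for side in SCALES:
    token_map = predict_next_scale(context, side)  # one step = one whole scale
    context.extend(token_map.ravel().tolist())

print(f"generated {len(context)} tokens in {len(SCALES)} steps")
# Pixel-by-pixel AR would need 16*16 = 256 sequential steps at the finest
# scale alone; next-scale prediction needs only len(SCALES) = 5 steps here.
```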
- Michelangelo
You can build an image generator that renders each word of the prompt on its own line in an image, and then uses a transformer architecture to morph that image of the words into what the words describe.
The only big difference is really efficiency, but we are just taking stabs in the dark at this point - there is work Google is doing that will eventually result in the optimal model for a given type of task.
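For what it's worth, here is a toy, untrained sketch of that "render the words, then morph" idea; the dimensions, the patchification, and the plain transformer encoder are all my own assumptions, not any existing system:

```python
# Toy sketch: rasterize the prompt (one word per line) into an image, then
# run a transformer over its patches. Untrained and purely illustrative.
import numpy as np
import torch
import torch.nn as nn
from PIL import Image, ImageDraw

def render_prompt(prompt: str, size: int = 64) -> torch.Tensor:
    """Rasterize the prompt's words as white-on-black pixels, one per line."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((2, 2), "\n".join(prompt.split()), fill=255)
    return torch.from_numpy(np.array(img, dtype=np.float32) / 255.0)

size, patch = 64, 8
x = render_prompt("a red apple")                       # (64, 64) text-pixels
patches = (x.view(size // patch, patch, size // patch, patch)
            .permute(0, 2, 1, 3)
            .reshape(-1, patch * patch))               # (64 patches, 64 dims)

# Untrained transformer "morphs" the text-pixel patches; a real system
# would train this end-to-end on (prompt, image) pairs.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=patch * patch, nhead=4,
                               batch_first=True),
    num_layers=2,
)
out = encoder(patches.unsqueeze(0)).squeeze(0)         # (64 patches, 64 dims)
out_img = (out.view(size // patch, size // patch, patch, patch)
              .permute(0, 2, 1, 3)
              .reshape(size, size))
print(out_img.shape)                                   # torch.Size([64, 64])
```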
So you always know where to go to restore the original image: the shortest distance back to the natural image manifold.
How do all these random images end up perpendicular to the manifold? High-dimensional statistics, plus the fact that the natural image manifold has a much lower dimension than the ambient space.
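You can check that claim numerically: a random unit vector in a high-dimensional space carries almost none of its energy inside any fixed low-dimensional subspace (a stand-in for the image manifold here). The dimensions below are arbitrary assumptions:

```python
# Numerical check: random directions are nearly orthogonal to any fixed
# low-dimensional subspace when the ambient dimension is large.
import numpy as np

rng = np.random.default_rng(0)
D, k = 10_000, 50               # ambient dimension vs. "manifold" dimension

# Orthonormal basis of a random k-dimensional subspace of R^D.
basis, _ = np.linalg.qr(rng.standard_normal((D, k)))

v = rng.standard_normal(D)
v /= np.linalg.norm(v)          # random unit direction ("random image")

# Fraction of v's energy inside the subspace: concentrates around k/D,
# i.e. tiny, so v points almost entirely perpendicular to the "manifold".
in_plane = np.linalg.norm(basis.T @ v) ** 2
print(f"energy in subspace: {in_plane:.4f}  (expected ~ {k/D:.4f})")
```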
It is short, has good lecture notes, and includes hands-on examples that are very approachable (with solutions available if you get stuck).
I found it to be the best resource to understand the material. That's certainly a good reference to delve deeper into the intuitions given by OP (it's about 5 hours of lectures, plus exercises).