But recently (the NeurIPS 2024 Best Paper) there was a new paper on autoregressive image modelling that apparently outperforms diffusion models: https://arxiv.org/abs/2404.02905
The innovation is that it doesn't predict image patches (like older autoregressive image models) but somehow does some sort of "next scale" or "next resolution" prediction.
In the past, autoregressive image models did not perform as well as diffusion models, which meant that most image models used diffusion. Now it seems autoregressive techniques have a strict advantage over diffusion models. Another advantage is that they can be integrated with autoregressive LLMs (multimodality), which is not possible with diffusion image models. In fact, the recent GPT-4o image generation is autoregressive according to OpenAI. I wonder whether diffusion models still have a future now.
I'm not 100% convinced that diffusion models are dead. That paper fixes autoregression for 2D spaces by basically turning the generation problem from pixel-by-pixel to iterative upsampling, but if 2D was the problem (and 1D was not), why don't we have more autoregressive models in 1D spaces like audio?
You could, because it's still autoregressive. It still generates patches left to right, top to bottom. It's just that we're not starting with patches at the target resolution.
Which means autoregressive image models are even ahead of diffusion on multiple fronts, i.e. both in whatever GPT-4o is doing and in the method described in the VAR paper.
Going off my bad memory, but I think I remember a comment saying the line-by-line generation was just a visual effect.
It still predicts image patches, left to right and top to bottom. The main difference is that you start with patches at a low resolution.
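A toy sketch of that coarse-to-fine loop may help. This is not the paper's implementation: `predict_scale` is a random stub standing in for the transformer, and the resolution schedule is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_scale(context, size):
    # Stand-in for the model: in the VAR-style scheme this would be a
    # transformer predicting all tokens of the next scale in one pass,
    # conditioned on the upsampled coarser scales. Here: just add noise.
    return context + 0.1 * rng.standard_normal((size, size))

def upsample(x, size):
    # Nearest-neighbour upsampling of a square map to size x size.
    idx = np.arange(size) * x.shape[0] // size
    return x[np.ix_(idx, idx)]

scales = [1, 2, 4, 8, 16]   # coarse-to-fine token-map resolutions
canvas = np.zeros((1, 1))
for s in scales:
    canvas = predict_scale(upsample(canvas, s), s)

# canvas now holds the finest-scale map; within each scale the tokens are
# still laid out left to right, top to bottom.
```

The autoregression is over scales rather than over individual patches at the target resolution.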
You can build an image generator that basically renders each word on one line in an image, and then uses a transformer architecture to morph the image of the words into what the words are describing.
The only big difference is really efficiency, but we are just taking stabs in the dark at this point - there is work that Google is doing that will eventually result in the most optimal model for a certain type of task.
This scaling is worse than exponential, which means we have nothing but tricks to try when solving any problem that we see in reality.
As an example, solving MNIST and its variants of 28x28 pixels exhaustively will be impossible until the 2100s, because we don't have enough memory to store the general tensor that records the interactions between each group of pixels and every other group of pixels.
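Back-of-the-envelope arithmetic behind that claim, assuming binary pixels for simplicity:

```python
import math

pixels = 28 * 28                  # 784 pixels in an MNIST image
# With binary pixels, a lookup table indexed by every possible image
# would need 2**784 entries.
digits = pixels * math.log10(2)   # log10 of 2**784
print(f"~10^{digits:.0f} table entries")   # ~10^236, vs ~10^80 atoms in the observable universe
```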
However, for example, a Transformer can be represented with just densely (fully) connected layers, albeit with a lot of zeros for weights.
So you will always know where to go to restore the original image: shortest distance to the natural image manifold.
How do all these random images end up perpendicular to the manifold? High-dimensional statistics, plus the fact that the natural image manifold has a much lower dimension than the ambient space.
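That perpendicularity claim can be checked numerically. Below, a random low-dimensional subspace stands in for the manifold's tangent space (an assumption for illustration, since the real manifold is curved and unknown); a random unit direction in the ambient space has almost no component along it.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10_000, 10   # ambient dimension vs. (toy) manifold dimension

# Random d-dimensional subspace as a stand-in for the tangent space of
# the natural image manifold.
B, _ = np.linalg.qr(rng.standard_normal((D, d)))  # orthonormal basis, D x d

v = rng.standard_normal(D)
v /= np.linalg.norm(v)              # random unit direction in ambient space

in_plane = np.linalg.norm(B.T @ v)  # fraction of v lying inside the subspace
# Expected value is sqrt(d / D) ~ 0.03: the random direction is almost
# entirely perpendicular to the subspace.
```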
Generative Visual Manipulation on the Natural Image Manifold
https://arxiv.org/abs/1609.03552
For me, the most intriguing aspect of LLMs (and friends) are the embedding space and the geometry of the embedded manifolds. Curious if anyone has looked into comparative analysis of the geometry of the manifolds corresponding to distinct languages. Intuitively I see translations as a mapping from one language manifold to another, with expressions being paths on that manifold, which makes me wonder if there is a universal narrative language manifold that captures 'human expression semantics' in the same way as a "natural image manifold".
It is short, has good lecture notes, and includes hands-on examples that are very approachable (with solutions available if you get stuck).
I found it to be the best resource to understand the material. That's certainly a good reference to delve deeper into the intuitions given by OP (it's about 5 hours of lectures, plus exercises).
My takeaway is that diffusion "samples all the tokens at once", incrementally, rather than getting locked in to a particular path, as in auto-regression, which can only look backward. The upside is global context, the downside is fixed-size output.
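A minimal caricature of that contrast, with random stubs in place of real models:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8  # sequence length

# Autoregressive: commit to one token at a time, left to right; each
# choice can only look backward at the already-fixed prefix.
ar_tokens = []
for t in range(T):
    ar_tokens.append(int(rng.integers(0, 10)))  # stub for sampling p(x_t | x_<t)

# Diffusion-style: keep a draft of ALL T positions and refine it jointly
# over many steps; every update sees the whole (global) context, but the
# number of positions is fixed up front.
target = np.linspace(0.0, 1.0, T)    # stub "clean" sequence being denoised toward
draft = rng.standard_normal(T)       # start from pure noise
for step in range(50):
    draft += 0.1 * (target - draft)  # nudge the entire draft at once
```

The autoregressive loop can never revisit `ar_tokens[0]`, while the diffusion-style loop keeps adjusting all positions until the end.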
How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile
Why is this sentence true? "That makes sure the model is paying a lot of attention to the caption."
Not in a "kind of like this" kind of way: PyTorch vector pipelines can't take arbitrarily sized inputs at runtime, right?
If your input has shape [x, y, z], you cannot pass [2x, 2y, 2z] into it.
Not… “it works but not very well”; like, it cannot execute the pipeline if the input dimensions aren’t exactly what they were when training.
Right? Isn’t that how it works?
So, is the image chunked into fixed patches and fed through in parts? Or something else?
For example, this toy implementation [1] resizes the input image to match the expected input, and always emits an output of a specific fixed size.
Which is what you would expect; but it also suggests that tools like Stable Diffusion work in a way that is distinctly different from what the trivial explanations tend to describe.
[1] - https://github.com/uygarkurt/UNet-PyTorch/blob/main/inferenc...
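The shape question can be demonstrated framework-agnostically. The NumPy sketch below is a stand-in for PyTorch layers, not PyTorch itself: a dense (fully connected) layer bakes the input size into its weight matrix, while a convolution only fixes kernel and channel sizes, so its spatial dimensions can vary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense layer: the weight matrix hard-codes the feature count, so a
# mismatched input fails immediately. This is the part of a network
# that cannot take arbitrarily sized inputs.
W = rng.standard_normal((8, 4))        # expects 8 input features
ok = rng.standard_normal((2, 8)) @ W   # shape (2, 4): works
try:
    rng.standard_normal((2, 16)) @ W   # wrong feature count
    dense_accepts_any_size = True
except ValueError:
    dense_accepts_any_size = False

# Convolution: only the kernel and channel sizes are fixed, so the
# spatial size is free to vary (toy single-channel 3x3 "valid" conv).
def conv2d_valid(img, kernel):
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

k = rng.standard_normal((3, 3))
small = conv2d_valid(rng.standard_normal((28, 28)), k)  # -> (26, 26)
large = conv2d_valid(rng.standard_normal((64, 64)), k)  # -> (62, 62)
```

This is consistent with the toy U-Net above resizing inputs: once any fully connected (or fixed-size attention) stage is present, the whole pipeline inherits its fixed shape.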
not quite right... anyone who has run models for >100 steps knows that you can go too far. what's the explanation for that?
Y_Y•1mo ago
"Hell, if I could explain it to the average person, it wouldn't have been worth the Nobel prize." - Richard Feynman
Y_Y•1mo ago
He did go on to write a very readable little book (from a lecture series) on the subject which has photons wearing little watches and waiting for the hands to line up. I'd say a keen eight-year-old could get something from that.
https://ia600101.us.archive.org/17/items/richard-feynman-pdf...