Diffusion language models are super data learners

https://jinjieni.notion.site/Diffusion-Language-Models-are-Super-Data-Learners-239d8f03a866800ab196e49928c019ac

218•babelfish•6mo ago

Comments

woadwarrior01•6mo ago

> During inference, generating sequences ranging from 16 to 4096 tokens incurs a 16× to 4700× increase in FLOPs compared to AR baselines.

I wonder why the increase in FLOPs has such a wide spectrum? Naively, I'd have expected the FLOPs to increase linearly with the number of tokens. OTOH, it sort of makes sense because because diffusion models are not autoregressive, as their name suggests.

ckjellqv•6mo ago

My guess is that autoregressive models can use Key Value (KV) caching to eliminate most of the FLOPs inside the self-attention block. Can't use KV caching inside diffusion (because it's not a causal model) but they sell this as a win anyway because they believe it leads to better reasoning.

godelski•6mo ago

This is interesting but I'm not sure some of the claims can be made without some more information. Terms like "downstream task", "in/out of distribution" are frequently used in the literature to mean many different things[0] and it is hard to know which one you mean from context. As a reader I *cannot know* what is in-distribution or not if I have no notion of what the training data[1] is. Consequently, I also can't know what downstream tasks are.

Though I'm very confused by this

  > This phenomenon persists for both in-domain and out-of-domain training data.

What does it mean for training data to be "out-of-domain"? The domain is any valid input into your function. Was this intended to be distribution? I'd still be a bit confused by that because it makes it sound like you're talking about training and validation data, both of which are in distribution.

  > Is validation loss a good metric for AR and DLM

In academic settings, does anyone seriously believe that the answer would be yes? I would be extremely concerned if people honestly believed that you could use loss as a strong indicator for comparing two different architectures[2]. These losses are not measuring the things we want to measure, they are proxies of them. The architectures themselves are a big part of forming that loss landscape. This would be a fine comparison if the metric were not a proxy but since it it then it isn't reliable unless we know what the divergence is[3]. This is all fine, but to advance as a field we need to remember what we don't know.

Overall, I'm still not sure what is meant by "Super Data Learners".

It seems like this is counted by information per parameter? I do think there is good discussion in the "causal" attention vs the free-form attention of diffusion, but I think there are also some potential oversteps in the conclusions here. A lower triangular matrix is still full-rank, so there is high representation power here, though it is correct that the free form has more (even when including the permutation and the untangling via the FFN layer in the transformer). I think if this part can be highlighted more and more time is spent on explaining then a much stronger case can be made. But I think some additional analysis is needed to determine if this is a diffusion vs transformer thing or triangular attention vs full rank attention thing. From a mathematical perspective the second question can be answered much more easily, but then there is a larger question about training these things because the problem of training free-form matrices is that they are... well... free form. There's actually some good discussions about this in the Normalizing Flow literature as they work through a similar problem of representation power and training/computational efficiencies. I think this work has the potential to open up a larger discussion for talking about the representation power of different architectures. Which, IMO, that is a really important topic that we need to discuss these days. Though I'm biased since I work on neural architectures.

Just for fun ;)

  Reviewer 2:
  Rating: 4: Borderline accept
  Confidence: 4: You are confident in your assessment, but not absolutely certain.
  Limitations: I think this is a sufficient work but with better clarity and some additional analysis (actually do  ̶t̶h̶e̶o̶r̶e̶t̶i̶c̶a̶l̶ mathematical analysis ;) I think it could be an excellent work and have much more impact than it has in its current form. There is much more to be said, but hey, we're on HN and this last part is being done half jokingly.

[0] Let's say you train on wikipedia and reddit and just train as entropy of next token. Is coding out-of-distribution? Arguably it isn't because there are code samples in both of those datasets. It is not even clear if this is OOD by task. It is even unclear if we strip out things we can identify as code as we aren't necessarily stripping out the discussion of code in natural language. We are, after all, talking about learning in extremely high dimensional spaces and so these 'little nuances' are rather critical in determining what is actually being done. This is deeply related to the 'black box' nature of all of this. As a clear counter, I don't think there is ambiguity when training on Shakespeare that there is ambiguity that coding tasks are OOD. I also think if you strip literal code from reddit and wiki we could say this task is at least not within the main distribution.

[1] Am I understanding correctly that these are the same as the referenced [2,3]? Put that experimental setting section up. I want to look backwards for this type of information, not forward. Because looking backwards I'll have a good idea of where I need to go and probably got some of that information before I start asking lots of questions.

[2] I suspect many people do and I do have this extreme concern. So I actually appreciate this being included.

[3] Which we can't calculate. After all, we're not using these proxy metrics for (just) computational efficiency, we are using them because we have no formal (mathematical) definition of our true objective metrics. We have no formal definition of "human like language" or "correct code given human language inputs".

BarakWidawsky•6mo ago

I wonder how much of this is due to Diffusion models having less capacity for memorization than auto regressive models

The auto regressive models consistently show better loss for the same number of training tokens

I find a lot of the conclusions compelling but I would’ve loved to see more epochs of training on the 1B model with a 10B dataset, as that model was showing epoch over epoch improvements

thesz•6mo ago

> I wonder how much of this is due to Diffusion models having less capacity for memorization than auto regressive models

Diffusion requires more computation resources than autoregressive models, compute excess is proportional to the length of sequence. Time dilated RNNs and adaptive computation in image recognition hint us that we can compute more with same weights and achieve better results.

Which, I believe, also hint at the at least one flaw of the TS study - I did not see that they matched DLM and AR by compute, they matched them only by weights.

heyitsguay•6mo ago

Do you have references on adaptive methods for image recognition?

godelski•6mo ago

I don't have an exact reference but there are a lot more hints that evidence the claim (compute more with same weights). In fact, I wouldn't even call them hints since they aren't subtle at all. For one, animal brains are perfect examples of this. But in the ML space, we could think of this purely from the mathematical perspective.

I think it might be confusing because neurons are neurons right? And they can only hold so much memory, so what's the difference? Well, that difference is architecture and training.

Let's think about signals for a moment and to help understand this, let's move to small dimensions[0]. Like 2D or 3D. (I'll use 3D, but you'll see why this can still ruin visualization) We're talking about universal approximates, so we can think of these as finite length strings, but have fixed end points. Our goal is then to untangle these strings. Oh no, this bundle has a knot! We can't actually untangle this string just by stretching. We also have a rule that we can't cut and glue things. We'd be stuck if we didn't have a trick up our sleeves. We can move into a higher dimension and untangle these strings there[1]. We'll need at least 2N-D. To the flatlander this will look like a cut, but it isn't.

The reason this needs to be understood is because we need to know where we get those dimensions. It is through architecture and training. But let's just think about that architecture. When we're learning these relationships we need to have the capacity to perform these higher dimensional movements, but once we already uncover the relationships we don't necessarily need to. The relationship it depends on the dimensionality of the relationship itself, not the data.

This is true for all models and is fundamentally why things like distillation even work. It is also why that FFN layer post attention in the transformer needs to project into a higher dimension before returning (typical is 4x and I think you can reason why that gives more flexibility than 2x). Also related to the latent manifold hypothesis.

If you ever wondered if math is useful to machine learning, I hope this gives some motivation to learn more. You don't need math to build good models, but even a little math goes a long way to help make better models.

[0] Note, we're doing a significant amount of simplification here. There's a lot of depth and complexity to all of this but I think this will be sufficient to point anyone in (mostly) the right direction.

[1] Think about a Klein bottle. In 4D it has a single surface. But the 3D projection of this shape makes it look like it is intersecting itself. Unfortunately we can't really visualize the 4D version :(

godelski•6mo ago

  > as that model was showing epoch over epoch improvements

Both of them were showing improvements. I agree with you that I'd like to see more, but I'm not sure more would significantly change the argument (which is a lot about how metrics aren't straight forward). Especially since the 96B token experiment shows.

IN FACT, those results are so similar I had to open them up in GIMP to align and spot the differences. Now I'm actually not convinced there wasn't a mistake. There are differences, just very minor. Harder to tell with the AR model because scale, but in the diffusion you can see a little bump in the second one right before the concavity change at the end. There some more bumps in the AR model earlier on that help show differences too, but the fact that the envelopes are nearly identical is... suspicious. I'm not claiming maliciousness because even if a mistake these things are so easy to make that they are common. I'm not even convinced there is a mistake, but it warrants extra thinking.

That said, money is finite and these are quite computationally heavy. Author looks to be a research fellow and so I'm assuming not backed by big tech.

cma•6mo ago

> The auto regressive models consistently show better loss for the same number of training tokens

I thought bi-directional transformers (non auto-regressive) show less loss than autoregressive for the same amount of training tokens.

pama•6mo ago

It is the other way around. If the data is causal and presented in the causal order, it is impossible to beat the loss of a pure auto-regressive model because it has the correct probability distribution for the dataset. Language data is mostly causal (as words follow in the context of previous words when they are spoken/written). Most of the remaining additional info in the extreme oversampling of the same data via diffusion models should be there by using fill-in-the-middle or order-reversal strategies with AR models as well and with significant compute savings during training.

cma•5mo ago

I mean models like BERT and not diffusion.

> Language data is mostly causal (as words follow in the context of previous words when they are spoken/written).

But where it isn't, the old KV is frozen in place and has to be ammended after what follows, where BERT like models take it all into account all over.

I have definitely heard they have less loss for the same amount of training tokens but are less efficient to compute and running next token prediction from them would be much more expensive.

semiinfinitely•6mo ago

Results probably just indicate that the ar baseline is fucked

bicsi•6mo ago

What if I told you that one can model bidirectional attention just by recurring over causal attention, and it’s still fast enough? Hint: It’s called chain of thought.

I strongly believe it’s time to discontinue diffusion models, solely on the fact that iterated auto-regression is faster, more parallelizable, and just as potent with proper prompting techniques (of course, unless you consider CoT as a form of diffusion, which it essentially is).

smus•6mo ago

Can you explain how CoT is a form of diffusion or models bidirectional attn?

fancyfredbot•6mo ago

I'd respectfully suggest that it's perhaps not time to "discontinue diffusion models". Minsky and Papert set AI back by decades by suggesting neural networks were a dead end which couldn't learn XOR. There's not a chance of an HN comment having the same effect of course but my point is that it's easy to dismiss things prematurely.

SalmoShalazar•6mo ago

Chain of thought is not a form of diffusion. Diffusion models clearly have characteristics that are useful and worthy of further research and should not be “discontinued“

Amazon no longer defend cloud customers against video patent infringement claims

Show HN: Medinilla – an OCPP compliant .NET back end (partially done)

How Does AI Distribute the Pie? Large Language Models and the Ultimatum Game

Resistance Infrastructure

Fire-juggling unicyclist caught performing on crossing

Restoring a lost 1981 Unix roguelike (protoHack) and preserving Hack 1.0.3

GPS and Time Dilation – Special and General Relativity

Show HN: Witnessd – Prove human authorship via hardware-bound jitter seals

Show HN: I built a clawdbot that texts like your crush

Scientists reverse Alzheimer's in mice and restore memory (2025)

Compiling Prolog to Forth [pdf]

Show HN: Cymatica – an experimental, meditative audiovisual app

GitBlack: Tracing America's Foundation

Horizon-LM: A RAM-Centric Architecture for LLM Training

We just ordered shawarma and fries from Cursor [video]

Correctio

Trying to make an Automated Ecologist: A first pass through the Biotime dataset

Watch Ukraine's Minigun-Firing, Drone-Hunting Turboprop in Action

Free Trial: AI Interviewer

FDA intends to take action against non-FDA-approved GLP-1 drugs

Supernote e-ink devices for writing like paper

We are QA Engineers now

Show HN: Measuring how AI agent teams improve issue resolution on SWE-Verified

Adversarial Reasoning: Multiagent World Models for Closing the Simulation Gap

Show HN: Poddley.com – Follow people, not podcasts

Layoffs Surge 118% in January – The Highest Since 2009

Papyrus 114: Homer's Iliad

DicePit – Real-time multiplayer Knucklebones in the browser

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Show HN: AI Agent Tool That Keeps You in the Loop

Amazon no longer defend cloud customers against video patent infringement claims

Show HN: Medinilla – an OCPP compliant .NET back end (partially done)

How Does AI Distribute the Pie? Large Language Models and the Ultimatum Game

Resistance Infrastructure

Fire-juggling unicyclist caught performing on crossing

Restoring a lost 1981 Unix roguelike (protoHack) and preserving Hack 1.0.3

GPS and Time Dilation – Special and General Relativity

Show HN: Witnessd – Prove human authorship via hardware-bound jitter seals

Show HN: I built a clawdbot that texts like your crush

Scientists reverse Alzheimer's in mice and restore memory (2025)

Compiling Prolog to Forth [pdf]

Show HN: Cymatica – an experimental, meditative audiovisual app

GitBlack: Tracing America's Foundation

Horizon-LM: A RAM-Centric Architecture for LLM Training

We just ordered shawarma and fries from Cursor [video]

Correctio

Trying to make an Automated Ecologist: A first pass through the Biotime dataset

Watch Ukraine's Minigun-Firing, Drone-Hunting Turboprop in Action

Free Trial: AI Interviewer

FDA intends to take action against non-FDA-approved GLP-1 drugs

Supernote e-ink devices for writing like paper

We are QA Engineers now

Show HN: Measuring how AI agent teams improve issue resolution on SWE-Verified

Adversarial Reasoning: Multiagent World Models for Closing the Simulation Gap

Show HN: Poddley.com – Follow people, not podcasts

Layoffs Surge 118% in January – The Highest Since 2009

Papyrus 114: Homer's Iliad

DicePit – Real-time multiplayer Knucklebones in the browser

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Show HN: AI Agent Tool That Keeps You in the Loop

Diffusion language models are super data learners

Comments