Hopefully it's not more Google abandonware, because it was wicked fast and a delight to use
Now, is it possible for a model to combine the advantages of both? Combine the fast generation and multidirectional causality of diffusion with the precision, capabilities, and generalization of autoregression?
Maybe. This paper is research in that direction. So far, it's not a clear upgrade over autoregressive LLMs.
So it’s more about the mask-modeling objective than diffusion itself.
Is it possible to quantify that and just have a linked slider for quality and speed? If I can get an answer that's 80% right in a tenth of the time, and then iterate on that, who comes out ahead?
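As a purely hypothetical sketch of what that slider could look like with a masked-diffusion decoder: the knob is the number of refinement passes, and DummyDenoiser below is a toy stand-in for a real model, not any actual API.

    # Hypothetical sketch: in masked-diffusion decoding, the quality/speed
    # slider is the number of refinement passes. DummyDenoiser is a toy
    # stand-in for a real model, not a real API.
    class DummyDenoiser:
        def denoise(self, tokens, commit):
            # Each pass commits up to `commit` masked positions.
            out, filled = list(tokens), 0
            for i, t in enumerate(out):
                if t == "<mask>" and filled < commit:
                    out[i] = f"tok{i}"
                    filled += 1
            return out

    def generate(model, length, num_passes):
        tokens = ["<mask>"] * length
        per_pass = -(-length // num_passes)  # ceil(length / num_passes)
        for _ in range(num_passes):
            tokens = model.denoise(tokens, per_pass)
        return tokens

    fast = generate(DummyDenoiser(), 16, num_passes=2)   # cheap, rough draft
    slow = generate(DummyDenoiser(), 16, num_passes=16)  # one token per pass

With num_passes equal to the output length you're back to one token per pass, i.e. the slow, autoregressive end of the slider.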
None of the big LLMs do an acceptable job. This is a task a trained human can do, but it's a lot of work. You have to learn not just the script style of the period (which can vary far more than people think), but even the idiosyncrasies of a given writer. All the time, you run into an unreadable word and need to look around for context that might give a clue, or for other places where the same word (or a similar-looking word) is used in cleaner contexts. It's very much not a beginning-to-end task: trying to read a document from start to finish would be like solving a crossword puzzle in strict left-to-right, top-to-bottom order.
Maybe autoregressive models can eventually become powerful enough that they can just do that! But so far, they haven't. And I have a lot more faith that the diffusion approach is closer to how you have to do it.
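As a toy sketch of that workflow (toy stand-ins, not any real transcription API): keep a draft transcript, revisit low-confidence words in light of the rest, and repeat until a pass changes nothing.

    # Toy sketch of the crossword-style workflow: keep a draft transcript,
    # re-read uncertain words using the rest of the draft as context, and
    # loop until a pass changes nothing. `reread` is a hypothetical stub.
    def revise(draft, reread):
        changed = False
        for i, (word, confident) in enumerate(draft):
            if not confident:
                guess = reread(i, draft)  # context-dependent second look
                if guess is not None:
                    draft[i] = (guess, True)
                    changed = True
        return changed

    # Toy rule: the smudged word in slot 1 only resolves once "anno" is read.
    def reread(i, draft):
        known = {w for w, ok in draft if ok}
        if i == 1 and "anno" in known:
            return "domini"
        return None

    draft = [("anno", True), ("???", False)]
    while revise(draft, reread):
        pass
    print(draft)  # [('anno', True), ('domini', True)]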
What you need is: good image understanding (at least GPT-5 tier), training in general-purpose reasoning over images, and then some domain-specific training, or at least some few-shot guidance to get it to adopt the correct reasoning patterns.
If I had to guess which model would be able to do it best out of the box, few-shot, I'd say Gemini 3 Pro.
There is nothing preventing an autoregressive LLM from revisiting images and rewriting the text as new clues come in. This is how they can solve puzzles like sudoku.
https://urn.digitalarkivet.no/URN:NBN:no-a1450-rg60085808000...
Over time we seem to have a tendency to build models that are well matched to our machines
But OP is referring to the fact that diffusion is friendlier on memory bandwidth and avoids large n^2 compute blocks in the critical path.
Diffusion just lets you spend more compute at the same time, so you don't redundantly access the same memory. It can only improve speed beyond the memory-bandwidth limit by committing multiple tokens on each pass.
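A back-of-envelope sketch of that limit (the bandwidth and model-size numbers are assumptions, not measurements):

    # Rough arithmetic for a memory-bandwidth-bound decoder. Assumed numbers:
    # ~1 TB/s of HBM bandwidth and a ~7B-parameter model at fp16, so each
    # pass streams about 14 GB of weights no matter how many tokens it
    # commits.
    BANDWIDTH_BYTES_PER_S = 1.0e12   # assumed 1 TB/s
    WEIGHT_BYTES = 14e9              # assumed 7B params * 2 bytes

    passes_per_s = BANDWIDTH_BYTES_PER_S / WEIGHT_BYTES  # ~71 passes/s

    for tokens_per_pass in (1, 4, 16):
        print(f"{tokens_per_pass:>2} tokens/pass -> "
              f"{passes_per_s * tokens_per_pass:7.0f} tokens/s")
    # 1 token/pass is the autoregressive ceiling; committing more tokens per
    # pass is the only way past it, which is the point above.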
Other linear models like Mamba get away from O(n^2) effects, but the type of neural architecture is orthogonal to the method of generation.
If you add a "cheat" rule that lets you deduce anything from something else, then replacing those cheat-rule applications with real subgoal proofs is denoising for Natural Deduction.
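A minimal Lean sketch of that analogy, using `sorry` as the cheat rule that closes any goal:

    -- Noisy proof: each subgoal is closed by the "cheat" rule (`sorry`),
    -- like a masked span waiting to be filled in.
    theorem and_swap (p q : Prop) : p ∧ q → q ∧ p := by
      intro h
      constructor
      · sorry  -- cheat: conclude q from nothing
      · sorry  -- cheat: conclude p from nothing

    -- Denoised proof: every cheat replaced by a real subgoal proof.
    theorem and_swap' (p q : Prop) : p ∧ q → q ∧ p := by
      intro h
      constructor
      · exact h.2
      · exact h.1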