
Diffusion Language Models Are Super Data Learners

https://jinjieni.notion.site/Diffusion-Language-Models-are-Super-Data-Learners-239d8f03a866800ab196e49928c019ac
66•babelfish•2h ago

Comments

woadwarrior01•44m ago
> During inference, generating sequences ranging from 16 to 4096 tokens incurs a 16× to 4700× increase in FLOPs compared to AR baselines.

I wonder why the increase in FLOPs has such a wide spectrum? Naively, I'd have expected the FLOPs to increase linearly with the number of tokens. OTOH, it sort of makes sense, because diffusion models are not autoregressive, as their name suggests.

ckjellqv•5m ago
My guess is that autoregressive models can use Key Value (KV) caching to eliminate most of the FLOPs inside the self-attention block. You can't use KV caching inside diffusion (because it's not a causal model), but they sell this as a win anyway because they believe it leads to better reasoning.
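A rough back-of-envelope sketch of why the gap widens with length (my own assumptions, not numbers from the post): if the DLM re-processes all L tokens at every denoising step and takes on the order of L steps, while the AR baseline processes one new token per step from the KV cache, then the ratio of token forward passes grows roughly linearly in L.

  # Toy FLOP proxy; "token forward passes" stands in for FLOPs, and the
  # L-steps schedule for the DLM is my assumption, not the paper's setting.
  def ar_token_passes(length):
      # One new token per decoding step; earlier tokens come from the KV cache.
      return length

  def dlm_token_passes(length, steps=None):
      # Every denoising step re-runs the full sequence (no KV cache).
      steps = length if steps is None else steps
      return steps * length

  for length in (16, 256, 4096):
      print(length, dlm_token_passes(length) / ar_token_passes(length))
  # 16 -> 16x, 256 -> 256x, 4096 -> 4096x: roughly the 16x-4700x spread quoted
  # above; the exact numbers will depend on the step schedule and model details.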
godelski•3m ago
This is interesting but I'm not sure some of the claims can be made without some more information. Terms like "downstream task", "in/out of distribution" are frequently used in the literature to mean many different things[0] and it is hard to know which one you mean from context. As a reader I *cannot know* what is in-distribution or not if I have no notion of what the training data[1] is. Consequently, I also can't know what downstream tasks are.

Though I'm very confused by this

  > This phenomenon persists for both in-domain and out-of-domain training data.
What does it mean for training data to be "out-of-domain"? The domain is any valid input into your function. Was this intended to be "out-of-distribution"? I'd still be a bit confused by that, because it makes it sound like you're talking about training and validation data, both of which are in distribution.

  > Is validation loss a good metric for AR and DLM
In academic settings, does anyone seriously believe that the answer would be yes? I would be extremely concerned if people honestly believed that you could use loss as a strong indicator for comparing two different architectures[2]. These losses are not measuring the things we want to measure; they are proxies of them. The architectures themselves are a big part of forming that loss landscape. This would be a fine comparison if the metric were not a proxy, but since it is, it isn't reliable unless we know what the divergence is[3]. This is all fine, but to advance as a field we need to remember what we don't know.

Overall, I'm still not sure what is meant by "Super Data Learners".

It seems like this is counted by information per parameter? I do think there is good discussion of the "causal" attention of AR models vs the free-form attention of diffusion, but I think there are also some potential oversteps in the conclusions here. A lower triangular matrix is still full rank, so there is high representational power here (a quick sketch of this is below), though it is correct that the free form has more (even when including the permutation and the untangling via the FFN layer in the transformer). If this part were highlighted more and more time were spent explaining it, a much stronger case could be made. But I think some additional analysis is needed to determine whether this is a diffusion-vs-transformer thing or a triangular-attention-vs-full-rank-attention thing.

From a mathematical perspective the second question can be answered much more easily, but then there is a larger question about training these things, because the problem with training free-form matrices is that they are... well... free form. There's actually some good discussion about this in the Normalizing Flow literature, as they work through a similar problem of representation power and training/computational efficiency. I think this work has the potential to open up a larger discussion about the representational power of different architectures, which, IMO, is a really important topic that we need to discuss these days. Though I'm biased, since I work on neural architectures.
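Just to make the rank point concrete (a toy sketch of my own, not anything from the post): a causal mask only forces the attention matrix to be lower triangular, which is still full rank because each softmax row puts nonzero mass on the diagonal, but it can never mix in information from later positions the way an unmasked matrix can.

  import torch

  L, d = 6, 8
  q, k = torch.randn(L, d), torch.randn(L, d)

  scores = q @ k.T / d**0.5
  mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # strictly-upper entries

  A_causal = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)  # lower triangular
  A_free   = torch.softmax(scores, dim=-1)                                   # dense

  # Both are full rank (the triangular one has a nonzero diagonal), but only
  # the free-form one can route information backwards from later positions.
  print(torch.linalg.matrix_rank(A_causal), torch.linalg.matrix_rank(A_free))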

Just for fun ;)

  Reviewer 2:
  Rating: 4: Borderline accept
  Confidence: 4: You are confident in your assessment, but not absolutely certain.
  Limitations: I think this is sufficient work, but with better clarity and some additional analysis (actually do  ̶t̶h̶e̶o̶r̶e̶t̶i̶c̶a̶l̶ mathematical analysis ;) it could be an excellent work and have much more impact than it has in its current form. There is much more to be said, but hey, we're on HN and this last part is half a joke.
[0] Let's say you train on wikipedia and reddit, and just train on next-token entropy. Is coding out-of-distribution? Arguably it isn't, because there are code samples in both of those datasets. It is not even clear whether this is OOD by task. It is unclear even if we strip out everything we can identify as code, since we aren't necessarily stripping out the discussion of code in natural language. We are, after all, talking about learning in extremely high dimensional spaces, so these 'little nuances' are rather critical in determining what is actually being done. This is deeply related to the 'black box' nature of all of this. As a clear counter, I don't think there is any ambiguity that coding tasks are OOD when training on Shakespeare. I also think that if you strip literal code from reddit and wiki we could say this task is at least not within the main distribution.

[1] Am I understanding correctly that these are the same as the referenced [2,3]? Put that experimental setting section up front. I want to look backwards for this type of information, not forward, because looking backwards I'll have a good idea of where I need to go and will probably have picked up some of that information before I start asking lots of questions.

[2] I suspect many people do and I do have this extreme concern. So I actually appreciate this being included.

[3] Which we can't calculate. After all, we're not using these proxy metrics for (just) computational efficiency, we are using them because we have no formal (mathematical) definition of our true objective metrics. We have no formal definition of "human like language" or "correct code given human language inputs".

Yes, a Moon Base

https://www.theatlantic.com/science/archive/2025/08/moon-base-nuclear-reactor/683802/
1•JumpCrisscross•1m ago•0 comments

South Korea's military has shrunk by 20% in six years as male population drops

https://www.channelnewsasia.com/east-asia/south-koreas-military-has-shrunk-20-in-six-years-male-population-drops-5287301
2•eagleislandsong•1m ago•0 comments

The zone zero secret: how ultra-low-stress exercise can change your life

https://www.theguardian.com/lifeandstyle/2025/aug/10/the-zone-zero-secret-how-ultra-low-stress-exercise-can-change-your-life
1•nickcotter•2m ago•0 comments

Climate Action Plan for Developers

https://github.com/social-impact/focus-areas/environmental-sustainability/climate-action-plan-for-developers
1•protontypes•2m ago•0 comments

Reinforcement Learning Conference 2025: Outstanding Paper Awards

https://rl-conference.cc/RLC2025Awards.html
1•smokel•2m ago•0 comments

Fixed Points with Event-Indexed Lipschitz Contractions

https://lightcapai.medium.com/fixed-points-with-event-indexed-lipschitz-contractions-298c5c9037a2
1•WASDAai•4m ago•1 comments

2012 (Rosy Retrospection)

https://brian.bearblog.dev/2012-rosy-retrospection/
1•brianalonso•9m ago•0 comments

Swimming Naked: Why Safety Nets Kill Startups

https://www.wizenheimer.dev/blog/opinionated-articulation
1•wizenheimer•9m ago•0 comments

Diagnosing Your Company's Strategy Problem

https://cutlefish.substack.com/p/tbm-271-diagnosing-your-companys
1•gpi•11m ago•0 comments

ICE Took Half Their Work Force. What Do They Do Now?

https://www.nytimes.com/2025/07/27/us/ice-glenn-valley-foods.html
2•JumpCrisscross•12m ago•0 comments

Self-attention mechanism explained

https://jtlicardo.com/blog/self-attention-mechanism/
1•jtlicardo•15m ago•0 comments

Conversations remotely detected from cell phone vibrations, researchers report

https://www.psu.edu/news/engineering/story/conversations-remotely-detected-cell-phone-vibrations-researchers-report
2•giuliomagnifico•18m ago•0 comments

Claude is competitive with humans in (some) cyber competitions

https://red.anthropic.com/2025/cyber-competitions/
1•Techbrunch•30m ago•0 comments

Hugging Face TTS Arena V2 Results (Papla and Async.ai Ahead of ElevenLabs)

https://tts-agi-tts-arena-v2.hf.space/leaderboard
1•zinagorc•31m ago•0 comments

Revenue Automation Series: Testing an Integration with Third-Party System

https://engineeringblog.yelp.com/2025/05/revenue-automation-series-testing-an-integration-with-third-party-system.html
1•initialg•36m ago•0 comments

The Enshittification of Generative AI

https://www.wheresyoured.at/the-enshittification-of-generative-ai/
6•rcy•37m ago•1 comments

AI Is Spam Technology

https://twitter.com/staysaasy/status/1954526861427437656
4•thisismytest•38m ago•0 comments

High-tech monitoring during heart surgery doesn't lower risk of complications

https://medicalxpress.com/news/2025-07-high-tech-heart-surgery-doesnt.html
2•PaulHoule•39m ago•0 comments

Show HN: Bolt – A super-fast, statically-typed scripting language written in C

https://github.com/Beariish/bolt
7•beariish•39m ago•2 comments

Buttercup is now open-source

https://blog.trailofbits.com/2025/08/08/buttercup-is-now-open-source/
1•wslh•40m ago•0 comments

Unveiling complexity in blue spaces and life expectancy

https://www.sciencedirect.com/science/article/pii/S0013935125012320
2•gnabgib•41m ago•0 comments

Copy Link to Highlight in Nightly – These Weeks in Firefox: Issue 185

https://blog.nightly.mozilla.org/2025/07/28/copy-link-to-highlight-in-nightly-these-weeks-in-firefox-issue-185/
1•Bogdanp•44m ago•0 comments

Assemblers in W64devkit

https://nullprogram.com/blog/2025/08/10/
1•ingve•48m ago•0 comments

We've been building Swarm agents incorrectly (starting from OpenAI's Swarm)

https://github.com/minki-j/agentic_classification
2•minkijung•51m ago•1 comments

Why Load Balancing at Scale Is Hard

https://startwithawhy.com/reverseproxy/2025/08/08/ReverseProxy-Deep-Dive-Part4.html
1•agentictime•52m ago•0 comments

Cryptoasset Realization: How Cryptocurrencies Are Frozen, Seized, and Forfeited

https://www.chainalysis.com/blog/cryptoasset-realization-explained/
1•paulpauper•56m ago•0 comments

AOL Underground

https://aolunderground.com/
1•chrisco255•56m ago•1 comments

Stephen Miran became Trump's top ideologue on tariffs

https://fortune.com/article/who-is-stephen-miran-paper-trump-tariffs/
3•TMWNN•1h ago•1 comments

Firecracker: Start a VM in less than a second (2021)

https://jvns.ca/blog/2021/01/23/firecracker--start-a-vm-in-less-than-a-second/
1•thunderbong•1h ago•0 comments

Sleeping in Airports

https://www.sleepinginairports.net/
2•bookofjoe•1h ago•0 comments