This paper presents a structural theory of dataset and model degradation under recursive training on synthetic data. Unlike prior work that attributes model collapse to entropy loss, noise accumulation, or data provenance, the paper identifies the loss distribution as the central object governing degradation. The core claim is that recursive self-training acts as a sharpening operator on the data distribution: low-loss (high-probability) samples become increasingly dominant, while rare and difficult cases—the tail of the distribution—systematically vanish. This process is formalized as an iterative distributional transformation that leads to progressive collapse and loss of structural diversity. The paper introduces a tail invariance principle, stating that stable long-term learning requires preservation of tail probability mass across model generations. The theoretical framework is supported by controlled experiments on discrete distributions, continuous models, and language models, using metrics such as KL divergence, entropy, and tail mass. The results demonstrate that common mitigation strategies (noise injection, anti-repetition heuristics, AI-content detection) do not address the root cause of collapse. Effective prevention requires explicit mechanisms to preserve the loss distribution, including real-data anchoring, dataset accumulation, distributed generators, and tail-mass correction. Overall, the work reframes model collapse as a structural consequence of loss distribution dynamics and provides a principled stability criterion for generative training pipelines.
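To make the described dynamic concrete, here is a minimal toy sketch of the "sharpening operator" idea on a discrete distribution, tracking the abstract's three metrics (entropy, tail mass, KL divergence). The paper does not specify its experimental setup here, so everything below is an illustrative assumption: the operator is modeled as a power-and-renormalize (temperature-style) transform with a hypothetical exponent `beta`, the starting distribution is a random Dirichlet draw, and the "tail" is taken to be the half of the outcomes that were rarest under the original distribution.

```python
import numpy as np

def sharpen(p, beta=1.2):
    """One generation of recursive self-training, modeled (as an assumption)
    as a sharpening operator: probabilities are raised to a power beta > 1
    and renormalized, so high-probability outcomes gain mass and the tail shrinks."""
    q = p ** beta
    return q / q.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid division by zero."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

def tail_mass(p, tail_idx):
    """Total probability currently assigned to the outcomes that were
    rarest under the original distribution."""
    return p[tail_idx].sum()

rng = np.random.default_rng(0)
p0 = rng.dirichlet(np.ones(1000) * 0.5)      # heavy-tailed starting distribution
tail_idx = np.argsort(p0)[: len(p0) // 2]    # bottom half of outcomes under p0

p = p0.copy()
for gen in range(1, 11):
    p = sharpen(p, beta=1.2)
    print(f"gen {gen:2d}  entropy={entropy(p):.3f}  "
          f"tail_mass={tail_mass(p, tail_idx):.4f}  KL(p0||p)={kl(p0, p):.3f}")
```

Running this, entropy and tail mass fall monotonically while KL divergence from the original distribution grows, which is the qualitative collapse pattern the abstract describes; a tail invariance constraint in this toy setting would amount to requiring `tail_mass` to stay bounded away from zero across generations.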