I'm Terry, the first author.
I spent the last 2.5+ years formalizing this theory to explain a strange anomaly I kept encountering in industry: models trained on vast, incredibly dirty, uncurated datasets were sometimes achieving state-of-the-art predictive performance, completely defying the "Garbage In, Garbage Out" mantra.
The TL;DR of the paper [https://arxiv.org/abs/2603.12288] is a formal mathematical proof showing why adding more error-prone variables can actually beat cleaning fewer variables to perfection.
The key is recognizing that complex systems often generate data through underlying latent structures. That structure lets you partition predictor-space noise into "Predictor Error" and "Structural Uncertainty," and the paper's results follow from that decomposition. The paper also formally connects latent architecture to the prerequisites for Benign Overfitting, showing that the structural conditions that let modern overparameterized models generalize well arise naturally from latent generative processes.
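To make the two noise components concrete, here is a toy generative sketch (my own illustration with made-up scales, not the paper's notation): a single unmeasured state drives both the outcome and a block of error-prone observed variables.

    # Toy latent data-generating process (illustrative only).
    set.seed(1)
    n <- 1000
    z <- rnorm(n)                                  # latent state (e.g., unmeasured physiology)

    # "Predictor Error": noise in how z shows up in the observed variables.
    predictor_error <- matrix(rnorm(n * 20, sd = 2), n, 20)
    X <- outer(z, runif(20, 0.5, 1.5)) + predictor_error

    # "Structural Uncertainty": outcome variation the latent state doesn't explain.
    structural_uncertainty <- rnorm(n)
    y <- 2 * z + structural_uncertainty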
The theory applies broadly across domains, but the work began as an attempt to explain a specific peer-reviewed result at Cleveland Clinic Abu Dhabi (published in PLOS Digital Health [https://journals.plos.org/digitalhealth/article?id=10.1371/j...]), where we achieved a 0.909 AUC predicting stroke/MI in 558k patients using thousands of uncurated EHR variables with no manual cleaning.
Important caveat: as detailed in the paper, this isn't a silver bullet. The framework strictly requires data with a latent hierarchical structure (e.g., medical diagnoses driven by unmeasured physiology, stock prices driven by hidden sentiment, sensor readings driven by underlying physical states). It also means your pre-processing effort shifts from data hygiene to data architecture.
I included a fully annotated R simulation in the repo so you can see the exact mechanism by which "Dirty Breadth" beats "Clean Parsimony."
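If you don't want to open the repo, here is a compressed sketch of that kind of comparison (my own toy setup and numbers, not the repo script): three carefully cleaned variables can't span five latent factors, while two hundred noisy proxies cover all of them and let the measurement errors average out.

    # Sketch: "Clean Parsimony" (3 well-measured variables) vs. "Dirty Breadth"
    # (200 noisy proxies), both driven by the same 5 latent factors.
    set.seed(2)
    n <- 5000; n_train <- 4000; k <- 5

    Z <- matrix(rnorm(n * k), n, k)                  # latent structure
    y <- drop(Z %*% rnorm(k)) + rnorm(n)             # structural uncertainty in the outcome

    make_proxies <- function(p, error_sd) {
      loadings <- matrix(rnorm(p * k, sd = 0.7), p, k)
      Z %*% t(loadings) + matrix(rnorm(n * p, sd = error_sd), n, p)  # predictor error
    }

    X_clean <- make_proxies(p = 3,   error_sd = 0.3) # few, carefully curated variables
    X_dirty <- make_proxies(p = 200, error_sd = 1.5) # many, uncurated variables

    idx <- seq_len(n_train)
    oos_r2 <- function(X) {
      fit  <- lm(y[idx] ~ X[idx, ])
      pred <- drop(cbind(1, X[-idx, ]) %*% coef(fit))
      1 - mean((y[-idx] - pred)^2) / var(y[-idx])
    }

    round(c(clean_parsimony = oos_r2(X_clean),
            dirty_breadth   = oos_r2(X_dirty)), 3)
    # The 3 clean variables capture only a 3-dimensional slice of the 5 latent
    # factors; the 200 dirty ones span all of them and their errors average out.

On a typical run the dirty-breadth model posts a clearly higher out-of-sample R^2, even though every individual proxy is far noisier than the "clean" ones.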
My team and I are currently operationalizing this into warehouse-native infrastructure (Snowflake, Databricks, etc.) because 80% of enterprise data is tabular, and companies are burning massive amounts of their ML budgets on data cleaning pipelines that they might not actually need.
I would love to hear your thoughts or criticisms on the theory, or how you handle high-dimensional noise in your own tabular pipelines.
I'll be hanging out in the comments to answer any questions!
elkomysara7•1h ago
I’d love to discuss the practical implications of this—especially how this theory might reshape feature selection strategies, data architecture design, and the balance between cleaning vs. redundancy in real-world enterprise environments. How do others see this influencing ML workflows, particularly in high‑dimensional tabular settings?
tjleestjohn•1h ago