I'm Terry, the first author.
I spent the last 2.5+ years formalizing this theory to explain a strange anomaly I kept encountering in industry: models trained on vast, incredibly dirty, uncurated datasets were sometimes achieving state-of-the-art predictive performance, completely defying the "Garbage In, Garbage Out" mantra.
The TL;DR of the paper [https://arxiv.org/abs/2603.12288] is a formal mathematical proof showing why adding more error-prone variables can actually beat cleaning fewer variables to perfection.
The key is recognizing that complex systems often generate data through underlying latent structures. That structure lets you partition predictor-space noise into "Predictor Error" and "Structural Uncertainty," and the paper's results follow from that decomposition. The paper also formally connects latent architecture to the prerequisites for Benign Overfitting, showing that the structural conditions that let modern overparameterized models generalize well arise naturally from latent generative processes.
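To make the two noise components concrete, here is a toy generative sketch (my own illustration with made-up scales, not the paper's notation): a single unmeasured state drives both the outcome and a block of error-prone observed variables.

    # Toy latent data-generating process (illustrative only).
    set.seed(1)
    n <- 1000
    z <- rnorm(n)                                  # latent state (e.g., unmeasured physiology)

    # "Predictor Error": noise in how z shows up in the observed variables.
    predictor_error <- matrix(rnorm(n * 20, sd = 2), n, 20)
    X <- outer(z, runif(20, 0.5, 1.5)) + predictor_error

    # "Structural Uncertainty": outcome variation the latent state doesn't explain.
    structural_uncertainty <- rnorm(n)
    y <- 2 * z + structural_uncertainty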
The theory applies broadly across domains, but the work began as an attempt to explain a specific peer-reviewed result at Cleveland Clinic Abu Dhabi (published in PLOS Digital Health [https://journals.plos.org/digitalhealth/article?id=10.1371/j...]), where we achieved a 0.909 AUC predicting stroke/MI in 558k patients using thousands of uncurated EHR variables with no manual cleaning.
Important caveat: as detailed in the paper, this isn't a silver bullet. The framework strictly requires data with a latent hierarchical structure (e.g., medical diagnoses driven by unmeasured physiology, stock prices driven by hidden sentiment, sensor readings driven by underlying physical states). It also means your pre-processing effort shifts from data hygiene to data architecture.
I included a fully annotated R simulation in the repo so you can see the exact mechanism by which "Dirty Breadth" beats "Clean Parsimony."
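If you don't want to open the repo, here is a compressed sketch of that kind of comparison (my own toy setup and numbers, not the repo script): three carefully cleaned variables can't span five latent factors, while two hundred noisy proxies cover all of them and let the measurement errors average out.

    # Sketch: "Clean Parsimony" (3 well-measured variables) vs. "Dirty Breadth"
    # (200 noisy proxies), both driven by the same 5 latent factors.
    set.seed(2)
    n <- 5000; n_train <- 4000; k <- 5

    Z <- matrix(rnorm(n * k), n, k)                  # latent structure
    y <- drop(Z %*% rnorm(k)) + rnorm(n)             # structural uncertainty in the outcome

    make_proxies <- function(p, error_sd) {
      loadings <- matrix(rnorm(p * k, sd = 0.7), p, k)
      Z %*% t(loadings) + matrix(rnorm(n * p, sd = error_sd), n, p)  # predictor error
    }

    X_clean <- make_proxies(p = 3,   error_sd = 0.3) # few, carefully curated variables
    X_dirty <- make_proxies(p = 200, error_sd = 1.5) # many, uncurated variables

    idx <- seq_len(n_train)
    oos_r2 <- function(X) {
      fit  <- lm(y[idx] ~ X[idx, ])
      pred <- drop(cbind(1, X[-idx, ]) %*% coef(fit))
      1 - mean((y[-idx] - pred)^2) / var(y[-idx])
    }

    round(c(clean_parsimony = oos_r2(X_clean),
            dirty_breadth   = oos_r2(X_dirty)), 3)
    # The 3 clean variables capture only a 3-dimensional slice of the 5 latent
    # factors; the 200 dirty ones span all of them and their errors average out.

On a typical run the dirty-breadth model posts a clearly higher out-of-sample R^2, even though every individual proxy is far noisier than the "clean" ones.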
My team and I are currently operationalizing this into warehouse-native infrastructure (Snowflake, Databricks, etc.) because 80% of enterprise data is tabular, and companies are burning massive amounts of their ML budgets on data cleaning pipelines that they might not actually need.
I would love to hear your thoughts or criticisms on the theory, or how you handle high-dimensional noise in your own tabular pipelines.
I'll be hanging out in the comments to answer any questions!
elkomysara7•1h ago
I’d love to discuss the practical implications of this—especially how this theory might reshape feature selection strategies, data architecture design, and the balance between cleaning vs. redundancy in real-world enterprise environments. How do others see this influencing ML workflows, particularly in high‑dimensional tabular settings?
tjleestjohn•1h ago