Show HN: A mathematical proof that more dirty features can beat fewer clean ones

https://github.com/tjleestjohn/from-garbage-to-gold
3•tjleestjohn•1h ago

Comments

elkomysara7•1h ago
I’ve been reviewing the From Garbage to Gold framework, and I’m intrigued by its central claim: that expanding the predictor space with error‑prone variables can outperform perfect cleaning of a smaller set. The distinction between Predictor Error and Structural Uncertainty feels like a powerful reframing of the ‘dirty data’ problem.

I’d love to discuss the practical implications of this—especially how this theory might reshape feature selection strategies, data architecture design, and the balance between cleaning vs. redundancy in real-world enterprise environments. How do others see this influencing ML workflows, particularly in high‑dimensional tabular settings?

tjleestjohn•1h ago
Practically, we see this shifting enterprise ML workflows towards what we term Proactive Data-Centric AI (P-DCAI) in the paper. Instead of the traditional, reactive approach of aggressively cleaning and pruning variables — which often strips away the redundancy needed to capture the full latent signal — P-DCAI treats data architecture as an upfront strategic design choice. Feature selection becomes less about finding pristine, uncorrelated inputs and more about deliberately engineering a portfolio optimized for "novelty" (to comprehensively cover all underlying latent drivers) and "informative redundancy" (to ensure statistical reliability even when individual predictors are highly error-prone).
tjleestjohn•57m ago
Hello HN,

I'm Terry, the first author.

I spent the last 2.5+ years formalizing this theory to explain a strange anomaly I kept encountering in industry: models trained on vast, incredibly dirty, uncurated datasets were sometimes achieving state-of-the-art predictive performance, completely defying the "Garbage In, Garbage Out" mantra.

The TL;DR of the paper [https://arxiv.org/abs/2603.12288] is a formal mathematical proof showing why adding more error-prone variables can actually beat cleaning fewer variables to perfection.

The key is recognizing that complex systems often generate data through underlying latent structures. This allows for the partitioning of predictor-space noise into "Predictor Error" and "Structural Uncertainty," and the results follow logically. The paper also formally connects latent architecture to the prerequisites for Benign Overfitting — showing that the structural conditions that enable modern overparameterized models to generalize well arise naturally from latent generative processes.
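As a rough illustration of that partition (a toy single-latent model assumed here for exposition, not the paper's proof), suppose each observed predictor is the latent driver plus two independent zero-mean noise terms: a structural term s_j with variance \sigma_s^2 that stays even after perfect measurement, and a measurement term e_j with variance \sigma_e^2 that cleaning would remove. The symbols z, s_j, e_j, k, and m are hypothetical names chosen for this sketch.

```latex
\begin{align}
x_j &= z + s_j + e_j, \qquad \operatorname{Var}(s_j) = \sigma_s^2, \quad \operatorname{Var}(e_j) = \sigma_e^2 \\
\bar{x}_k &= \frac{1}{k}\sum_{j=1}^{k} x_j = z + \frac{1}{k}\sum_{j=1}^{k}\,(s_j + e_j),
\qquad \operatorname{Var}(\bar{x}_k - z) = \frac{\sigma_s^2 + \sigma_e^2}{k} \\
\frac{\sigma_s^2 + \sigma_e^2}{m} &< \frac{\sigma_s^2}{k}
\quad\Longleftrightarrow\quad
m > k\left(1 + \frac{\sigma_e^2}{\sigma_s^2}\right)
\end{align}
```

Under these assumptions, cleaning k predictors perfectly sets \sigma_e^2 = 0 but leaves \sigma_s^2 / k untouched, while averaging m uncleaned predictors shrinks both terms. The last line gives the breadth at which the dirty set overtakes the clean one in this toy setting.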

The theory applies broadly across domains, but the work began as an attempt to explain a specific peer-reviewed result at Cleveland Clinic Abu Dhabi, published in PLOS Digital Health [https://journals.plos.org/digitalhealth/article?id=10.1371/j...], where we achieved an AUC of 0.909 predicting stroke/MI in 558k patients using thousands of uncurated EHR variables with no manual cleaning.

Important caveat: As detailed in the paper, this isn't a silver bullet. The framework strictly requires data with a latent hierarchical structure (e.g., medical diagnoses driven by unmeasured physiology, stock prices driven by hidden sentiment, sensor readings driven by underlying physical states). It also means your pre-processing effort shifts from data hygiene to data architecture.

I included a fully annotated R simulation in the repo so you can see the exact mechanisms of how "Dirty Breadth" beats "Clean Parsimony."
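For readers without R at hand, here is a minimal Python sketch of the same kind of comparison. It is not the repo's simulation: the single latent driver, the noise levels, the outcome noise, and the helper names `pearson` and `simulate` are all assumptions made for illustration. Each predictor is modeled as the latent value plus a structural noise term plus an optional measurement-error term, and the score is the plain average of the predictors.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def simulate(n_predictors, error_sd, n_rows=2000, structural_sd=1.0, seed=0):
    """Return R^2 between the averaged predictors and the outcome.

    Hypothetical toy model (an assumption, not the paper's setup):
      z   ~ N(0, 1)                      latent driver
      y   = z + N(0, 0.5)                outcome
      x_j = z + N(0, structural_sd)      noise that cleaning cannot remove
              + N(0, error_sd)           measurement error (0 means "cleaned")
    """
    rng = random.Random(seed)
    scores, outcomes = [], []
    for _ in range(n_rows):
        z = rng.gauss(0.0, 1.0)
        y = z + rng.gauss(0.0, 0.5)
        xs = [
            z + rng.gauss(0.0, structural_sd) + rng.gauss(0.0, error_sd)
            for _ in range(n_predictors)
        ]
        scores.append(sum(xs) / n_predictors)  # simple average as the predictor score
        outcomes.append(y)
    return pearson(scores, outcomes) ** 2


# "Clean Parsimony": 3 predictors with measurement error removed entirely.
clean_r2 = simulate(n_predictors=3, error_sd=0.0)
# "Dirty Breadth": 60 predictors, each with heavy measurement error.
dirty_r2 = simulate(n_predictors=60, error_sd=2.0)

print(f"clean, 3 predictors:  R^2 = {clean_r2:.3f}")
print(f"dirty, 60 predictors: R^2 = {dirty_r2:.3f}")
```

With these particular settings the dirty set should come out ahead, because averaging 60 noisy proxies suppresses the combined noise more than perfectly cleaning 3 of them does. Lowering `n_predictors` for the dirty run or setting `structural_sd` close to zero reverses the result, which matches the caveat that the effect depends on the latent structure.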

My team and I are currently operationalizing this into warehouse-native infrastructure (Snowflake, Databricks, etc.) because 80% of enterprise data is tabular, and companies are burning massive amounts of their ML budgets on data cleaning pipelines that they might not actually need.

I would love to hear your thoughts or criticisms on the theory, or how you handle high-dimensional noise in your own tabular pipelines.

I'll be hanging out in the comments to answer any questions!

The Reckless Women Who Changed Journalism

https://www.theatlantic.com/books/2026/02/women-who-reinvented-journalism/686146/
1•samclemens•53s ago•0 comments

AI, Human Cognition and Knowledge Collapse

https://www.nber.org/papers/w34910
1•netfortius•2m ago•0 comments

Factory waste heat could be used to cool data centers via novel thermal battery

https://www.datacenterdynamics.com/en/news/factory-waste-heat-could-help-cool-data-centers-thanks...
1•ZinedineF•3m ago•1 comments

Len Deighton, originator of modern spy fiction, dies

https://www.theguardian.com/books/2026/mar/17/len-deighton-obituary
1•zabzonk•3m ago•0 comments

Stop Losing Referrals: A Quick Note for Coaches and Consultants

https://substack.com/profile/461289274-willem/note/c-229234997
1•anqer•5m ago•0 comments

What can we remove? (2024)

https://stephango.com/remove
1•Sir_Twist•7m ago•0 comments

Genetically modified bacteria convert plastic waste into Parkinson's drug

https://www.heise.de/en/news/Genetically-modified-bacteria-convert-plastic-waste-into-Parkinson-s...
2•ohjeez•7m ago•0 comments

If you thought the code writing speed was your problem; you have bigger problems

https://andrewmurphy.io/blog/if-you-thought-the-speed-of-writing-code-was-your-problem-you-have-b...
3•mooreds•9m ago•0 comments

GhostNet: A Community-Driven Global OSINT Network over Shortwave Radio

https://github.com/s2underground/GhostNet
2•CGMthrowaway•9m ago•0 comments

Certifications with the best ROI per hour in 2026

1•OpenClawAura•10m ago•0 comments

Show HN: Specimen – Font Manager for macOS

https://getspecimen.app
1•yaniszaf•12m ago•0 comments

Nornr: Give your agent a spending mandate before it touches money

https://nornr.com/quickstart
2•dreadpirates•13m ago•1 comments

Arizona Attorney General sues Kalshi on illegal gambling charges

https://www.engadget.com/big-tech/arizona-attorney-general-sues-kalshi-on-illegal-gambling-charge...
5•spenvo•13m ago•0 comments

Anti-Private Equity Is Good Business

https://www.bloomberg.com/opinion/newsletters/2026-03-17/anti-private-equity-is-good-business
2•impish9208•14m ago•1 comments

Show HN: I built a 7-agent AI marketing crew – 235 replies, $0 revenue

https://vaos.sh/blog/i-built-a-7-agent-ai-marketing-crew
2•jmanhype•16m ago•0 comments

Real-World Industrial-Scale Verification: LLM-Driven Theorem Proving on SeL4

https://arxiv.org/abs/2602.08384
1•PaulHoule•17m ago•0 comments

Show HN: Complete Guide to AI Agent Observability in Production

https://vaos.sh/blog/complete-guide-to-ai-agent-observability
1•jmanhype•17m ago•0 comments

The Byzantine MCP Router – AI Safety and Security via Semantic Consensus

https://github.com/wdulz/byzantine-mcp-router
1•wdulz•18m ago•1 comments

More Big Tech Layoffs Loom as Meta Mulls 20% Cut to Its Workforce

https://www.investopedia.com/more-big-tech-layoffs-loom-as-meta-mulls-20-percent-cut-to-its-workf...
3•nigelgutzmann•18m ago•0 comments

Oils for Unix – A Pause in the Project

https://oils.pub/blog/2026/03/status-update.html
3•righthand•18m ago•0 comments

Free alternative to Harvey/Legora's tabular document review

https://www.usefolio.ai/blog/a-tabular-document-review-companion-for-your-claude-legal-skill
1•nibab•18m ago•0 comments

Pgtui, a Postgres TUI Client

https://kdwarn.net/programming/blog/227
3•birdculture•19m ago•0 comments

Migrating from DigitalOcean to Hetzner

https://isayeter.com/posts/digitalocean-to-hetzner-migration/
2•luispa•20m ago•0 comments

What Does the Future of Programming Look Like?

https://jackwsmth.com/what-does-the-future-of-programming-look-like/
1•fixedprog•20m ago•0 comments

Simulation we live in was created to develop AGI, and will soon be turned off

https://twitter.com/pmddomingos/status/2032223840403931541
2•reconnecting•21m ago•2 comments

RocketRide – Build and run AI/data pipelines within VS Code, Cursor etc.

https://github.com/rocketride-org/rocketride-server
3•shashidhar-babu•21m ago•1 comments

I canceled my Antigravity subscription today. Here is why

1•davidvartanian•22m ago•1 comments

Tab Organizer for Developer

https://github.com/gancio-xyz/dev-tab-organizer
1•alexfg93•24m ago•0 comments

Email for agents – agent doesn't need another Gmail

https://mails.dev/
2•guoyu•27m ago•0 comments

Webtool: Let AI agents control your live Chrome session with CDP

https://github.com/usewebtool/webtool
4•machinecontrol•28m ago•1 comments