
Model Collapse and the Need for Human-Generated Training Data

https://glthr.com/model-collapse-and-the-need-for-human-generated-training-data
16•glth•6mo ago

Comments

sans_souse•6mo ago
I think a point often missed is that it's not just the substance and quality of those sources and their decline, but also the overall decline in the number of sources, period. The first phases involved training the models on a massive backlog of raw knowledge accumulated over thousands of years, and for most of that span the world was very different from ours today; in short, all of our knowledge was "boots on the ground" knowledge, all of it served to aid our growth, and our record of it tells that story.

But our knowledge and growth today are so narrow in scope (in a sense), and there's an ever-looming scenario ready to present itself where our perceived growth is actually a recursion and the answer to "what is the purpose" becomes "there is none."

ipython•6mo ago
So I’ve heard of this model collapse theory. But I’ve also heard of model providers who are intentionally training with synthetically generated data (as a result of insufficient “real” data).

So I’m curious where the line is. Are there phases in the training / continued pre-training / alignment / RLHF pipeline where synthetic data isn’t just harmless but actually beneficial? Is it a question of quantity, or a question of how much novelty is in the training data?
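
As a rough illustration of the quantity side of that question, here is a minimal sketch (hypothetical names and ratios, not any provider's actual recipe) of capping the synthetic share of the data mix for a single training stage:

    import random

    def build_training_mix(real_docs, synthetic_docs, synthetic_fraction=0.2, seed=0):
        """Return a shuffled corpus in which at most `synthetic_fraction`
        of the documents are synthetic.

        The cap is the illustrative knob: synthetic data adds volume,
        while the human-generated portion keeps the distribution anchored.
        """
        rng = random.Random(seed)
        # Number of synthetic docs that keeps their share at synthetic_fraction.
        max_synth = int(len(real_docs) * synthetic_fraction / (1 - synthetic_fraction))
        synth_sample = rng.sample(synthetic_docs, min(max_synth, len(synthetic_docs)))
        corpus = list(real_docs) + synth_sample
        rng.shuffle(corpus)
        return corpus

    # Toy usage: an 80/20 real-to-synthetic mix for, say, a continued pre-training stage.
    corpus = build_training_mix(real_docs=["real doc"] * 8000,
                                synthetic_docs=["synthetic doc"] * 5000,
                                synthetic_fraction=0.2)

Whether the right cap differs by phase (pre-training vs. alignment vs. RLHF) is exactly the open question above.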

dinfinity•6mo ago
Let's remember that the basic defense against model collapse is just not training on AI-generated and other crap data.

Sure, there are places where determining whether data is AI-generated or 'real' is hard, but there are plenty of places where trust in the provider gives enough basis to include the data during curation. For example, it's not as if the NYT will suddenly start pumping out unchecked AI slop.

And then there is the enormous potential of data that is synthesized with the aid of AI, but not completely generated by it, and validated for accuracy through systematic means.
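
A minimal sketch of that curation idea, with a hypothetical allowlist, detector score, and validation hook (nothing here is a real provider's pipeline):

    TRUSTED_SOURCES = {"nytimes.com", "arxiv.org", "gutenberg.org"}  # hypothetical allowlist

    def keep_for_training(doc, ai_score_fn, validate_fn, max_ai_score=0.3):
        """Return True if `doc` should enter the training corpus.

        doc: dict with "source" and "text" keys (assumed schema).
        ai_score_fn: estimates the probability that the text is AI-generated.
        validate_fn: systematic accuracy check for AI-assisted data.
        """
        if doc["source"] in TRUSTED_SOURCES:
            return True                  # trust in the provider is the basis
        if ai_score_fn(doc["text"]) > max_ai_score:
            return False                 # likely fully AI-generated: drop it
        return validate_fn(doc["text"])  # AI-assisted data must pass validation

    # Toy usage with stand-in scorer and validator.
    docs = [{"source": "nytimes.com", "text": "..."},
            {"source": "randomblog.example", "text": "..."}]
    corpus = [d for d in docs
              if keep_for_training(d, ai_score_fn=lambda t: 0.9, validate_fn=lambda t: False)]
    # Keeps only the first document in this toy example.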

EVs Are a Failed Experiment

https://spectator.org/evs-are-a-failed-experiment/
1•ArtemZ•10m ago•2 comments

MemAlign: Building Better LLM Judges from Human Feedback with Scalable Memory

https://www.databricks.com/blog/memalign-building-better-llm-judges-human-feedback-scalable-memory
1•superchink•11m ago•0 comments

CCC (Claude's C Compiler) on Compiler Explorer

https://godbolt.org/z/asjc13sa6
1•LiamPowell•13m ago•0 comments

Homeland Security Spying on Reddit Users

https://www.kenklippenstein.com/p/homeland-security-spies-on-reddit
2•duxup•15m ago•0 comments

Actors with Tokio (2021)

https://ryhl.io/blog/actors-with-tokio/
1•vinhnx•17m ago•0 comments

Can graph neural networks for biology realistically run on edge devices?

https://doi.org/10.21203/rs.3.rs-8645211/v1
1•swapinvidya•29m ago•1 comments

Deeper into the sharing of one air conditioner for 2 rooms

1•ozzysnaps•31m ago•0 comments

Weatherman introduces fruit-based authentication system to combat deep fakes

https://www.youtube.com/watch?v=5HVbZwJ9gPE
2•savrajsingh•32m ago•0 comments

Why Embedded Models Must Hallucinate: A Boundary Theory (RCC)

http://www.effacermonexistence.com/rcc-hn-1-1
1•formerOpenAI•33m ago•2 comments

A Curated List of ML System Design Case Studies

https://github.com/Engineer1999/A-Curated-List-of-ML-System-Design-Case-Studies
3•tejonutella•37m ago•0 comments

Pony Alpha: New free 200K context model for coding, reasoning and roleplay

https://ponyalpha.pro
1•qzcanoe•42m ago•1 comments

Show HN: Tunbot – Discord bot for temporary Cloudflare tunnels behind CGNAT

https://github.com/Goofygiraffe06/tunbot
1•g1raffe•44m ago•0 comments

Open Problems in Mechanistic Interpretability

https://arxiv.org/abs/2501.16496
2•vinhnx•50m ago•0 comments

Bye Bye Humanity: The Potential AMOC Collapse

https://thatjoescott.com/2026/02/03/bye-bye-humanity-the-potential-amoc-collapse/
2•rolph•54m ago•0 comments

Dexter: Claude-Code-Style Agent for Financial Statements and Valuation

https://github.com/virattt/dexter
1•Lwrless•56m ago•0 comments

Digital Iris [video]

https://www.youtube.com/watch?v=Kg_2MAgS_pE
1•vermilingua•1h ago•0 comments

Essential CDN: The CDN that lets you do more than JavaScript

https://essentialcdn.fluidity.workers.dev/
1•telui•1h ago•1 comments

They Hijacked Our Tech [video]

https://www.youtube.com/watch?v=-nJM5HvnT5k
1•cedel2k1•1h ago•0 comments

Vouch

https://twitter.com/mitchellh/status/2020252149117313349
35•chwtutha•1h ago•6 comments

HRL Labs in Malibu laying off 1/3 of their workforce

https://www.dailynews.com/2026/02/06/hrl-labs-cuts-376-jobs-in-malibu-after-losing-government-work/
4•osnium123•1h ago•1 comments

Show HN: High-performance bidirectional list for React, React Native, and Vue

https://suhaotian.github.io/broad-infinite-list/
2•jeremy_su•1h ago•0 comments

Show HN: I built a Mac screen recorder Recap.Studio

https://recap.studio/
1•fx31xo•1h ago•1 comments

Ask HN: Codex 5.3 broke toolcalls? Opus 4.6 ignores instructions?

1•kachapopopow•1h ago•0 comments

Vectors and HNSW for Dummies

https://anvitra.ai/blog/vectors-and-hnsw/
1•melvinodsa•1h ago•0 comments

Sanskrit AI beats CleanRL SOTA by 125%

https://huggingface.co/ParamTatva/sanskrit-ppo-hopper-v5/blob/main/docs/blog.md
1•prabhatkr•1h ago•1 comments

'Washington Post' CEO resigns after going AWOL during job cuts

https://www.npr.org/2026/02/07/nx-s1-5705413/washington-post-ceo-resigns-will-lewis
4•thread_id•1h ago•1 comments

Claude Opus 4.6 Fast Mode: 2.5× faster, ~6× more expensive

https://twitter.com/claudeai/status/2020207322124132504
1•geeknews•1h ago•0 comments

TSMC to produce 3-nanometer chips in Japan

https://www3.nhk.or.jp/nhkworld/en/news/20260205_B4/
3•cwwc•1h ago•0 comments

Quantization-Aware Distillation

http://ternarysearch.blogspot.com/2026/02/quantization-aware-distillation.html
2•paladin314159•1h ago•0 comments

List of Musical Genres

https://en.wikipedia.org/wiki/List_of_music_genres_and_styles
1•omosubi•1h ago•0 comments