Model collapse is already happening

https://cacm.acm.org/blogcacm/model-collapse-is-already-happening-we-just-pretend-it-isnt/

15•zdw•1h ago

Comments

FeepingCreature•1h ago

Source: a bad study from 2023.

slowmovintarget•1h ago

Why is the study bad?

https://www.nature.com/articles/s41586-024-07566-y

levocardia•1h ago

Evidence: trust me bro. Really, where is the actual evidence that models are "collapsing" from too much AI-generated training material? Evals are up, subjective perception of model usefulness is up (for me, certainly), and if anything the slop levels are down, or at least stable. I find it hard to believe that seven-figure software engineers at top labs aren't being careful about how much post-ChatGPT-era internet content is going into their training data.

jrmg•1h ago

I find it hard to believe that seven-figure software engineers at top labs aren't being careful about how much post-ChatGPT-era internet content is going into their training data.

I agree - but as the Internet descends into all-slop-all-the-time (seriously, just do a search for reviews or travel advice or technical questions -or most anything - to see it), where do you expect the high quality training material on future things to come from? I have a hard time imagining it.

ctoth•1h ago

Your Claude Code sessions. Every interaction. Every time the model is asked to do something and then gets feedback on that something (this didn't work I got this traceback)

Textbooks, company wikis, news corpora, structured reports of all kinds from far more sources than what is available on the web.

chromacity•1h ago

There's some comedy in this article having all the hallmarks of LLM writing.

justonceokay•1h ago

Yeah a typo in the subtitle does not especially inspire confidence

niccl•1h ago

you've got me. What's the typo?

justonceokay•57m ago

It seems to me there is a word or two missing between “rich” and “slowly”. If I read the whole thing aloud I cannot parse it into a sentence. Or the word “rich” could be removed. That would be clunky but at least grammatically sensible.

“Make data get smoothed out” is a very strange way of saying “smooths out data”

quantified•35m ago

It might be weird if you haven't read a lot of English. It's actually quite normal to say that process X is a way to make effect Y happen. "Makes your mout water" is more effective than "waters your mouth". "Makes your breath fresh and tolerable" is better than "freshens and tolerablerizes your breath". Etc.

Actually, what you are describing is what happens when LLM-generated prose cycles and then trains humans to use equally dull thinking.

atmavatar•4m ago

I read the subtitle as

> The weird, rare, surprising patterns [that make data rich] slowly get smoothed out when an AI model trains on outputs from a previous model.

i.e., the patterns are responsible for making data rich, and they are slowly lost as each new generation model trains on the prior generation's output.

Or, if you'd prefer an analogy, we're using a copy machine to output new documents by taking the last copy spit out by the machine, adding some marks to it, and running it through the copier again. Over time, details present in much older copies blur and fade away in Nth generation copies.

kimi•1h ago

I have a pet-peeve with this. As a non-native English speaker, I find it very useful to dictate multiple notes, in different languages, and have the LLM produce clear English prose out of it. The prose may be LLM-generated, but I edit it when needed to make sure that the contents is 100% mine.

It's like dictating to a typist like they did in the 60's - he will make sure that your letter looks professional and will fix your grammar, but you will sign the letter. This is totally different from LLM spam, the kind that inflates a sentence into a three-page article full of nothing.

So - is it a problem if the language reverts to a mean? that is the point of a shared language, right?

SunshineTheCat•1h ago

I always find articles like this very odd and nebulous because they act as though AI models are just Google.

Type request, get info.

But that's such a narrow/one dimensional view of how LLMs are used. They can gather data or write an article, but that's probably a minority of use cases.

People have casual conversations with them, code written, brainstorming sessions, dictating a voice-recorded note, and the list goes on.

While data its getting trained on is important, the supposition is that this data consists only of what sits out there on the interwebs.

That as oppose to user input/interaction which, I'm guessing, has a pretty large role in training models. Maybe even more so in some cases than AI-written blog spam.

Meta Lays Off 700 Employees, While Rewarding Top Executives

Sculpting Code

Daniel Stenberg – Emails

Joycraft: Upgrade your Claude Code harness

Supreme Court Wipes Out Record Labels' $1B Piracy Judgment Against Cox

How to make good lecture slides with AI assistance

Marco Arment did something awesome

Strong Customer Authentication

Cella dev journey, a 3D space game in Rust

Tired of AI When will this era end?

On Claude Code

All non-government Claude services below two nines of uptime in March 2026

Bernie Sanders and AOC introduce bill to pause building of new datacenters

Interactive web pages. Is this a real defense against AI mode predation?

Chat Control: How Governments and Tech Lobby Try to Overturn EU Parliament

I installed Fedora and accidentally created a haunted house

We are developing software with a slot-machine

Show HN: Upload your pitch deck, get investor feedback

Show HN: GhostDesk – MCP server giving AI agents a full virtual Linux desktop

The EU still wants to scan your private messages and photos

DeleteMe acquires Tracey Chou's Block Party browser extension

Why I Got Out Of The Gambling Business

GTC the Game: Web gaming to prep for big events

Toyota cuts EV prices in China, some now under $15,000

EU Commission stands with Big Tech in an utterly wild letter

Left-Leaning Red-Black Trees Considered Harmful

Sodium-ion EV battery breakthrough delivers 11-min charging and 450 km range

Show HN: LiveDemo – open-source tool for creating interactive product demos

Our Approach to the Model Spec

Know your tokens. Own your costs