So I’m curious where the line is. Are there phases in the pretraining / continued pre-training / alignment / RLHF pipeline where synthetic data isn’t just harmless but actually beneficial? Is it a question of quantity, or of how much novelty is in the training data?
Sure, there are places where determining whether data is AI-generated or 'real' is hard, but there are plenty of others where trust in the provider is enough basis to include it during curation. For example, it's not as if the NYT will suddenly start pumping out unchecked AI slop.
And then there is the enormous potential of data synthesized with the aid of AI, but not completely generated by it, and validated for accuracy through systematic means.
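The "systematic validation" idea can be sketched minimally: keep a synthetic record only if an independent check reproduces its stated answer. The record format and the `validate` function here are illustrative assumptions, not any actual lab's pipeline.

```python
def validate(record: dict) -> bool:
    """Accept a synthetic arithmetic QA record only if re-evaluating
    the expression reproduces the stated answer."""
    try:
        # eval is tolerable here because the expression is machine-generated;
        # a real pipeline would use a proper expression parser.
        return eval(record["question"], {"__builtins__": {}}) == record["answer"]
    except Exception:
        # malformed or crashing records are dropped, not patched
        return False

candidates = [
    {"question": "17 * 24", "answer": 408},  # correct -> kept
    {"question": "17 * 24", "answer": 407},  # hallucinated -> dropped
    {"question": "10 / 0",  "answer": 0},    # malformed -> dropped
]

curated = [r for r in candidates if validate(r)]
```

The point of the sketch: the model can propose unlimited candidates cheaply, while a non-AI verifier decides what enters the training set, so generation errors don't compound.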
sans_souse•6mo ago
But our knowledge and growth today is so narrow in scope (in a sense), and there's an ever-looming scenario ready to present itself where our perceived growth is actually a recursion, and the answer to "what is the purpose" becomes "there is none".