1. To what extent scaling laws hold.
2. To what extent adding more amateur video data increases quality.
3. To what extent Google can stop other AI firms scraping the best quality stuff.
The OpenAI Whisper model is already nearly perfect, despite its habit of occasionally transcribing silence or noise as "Thanks for watching!". Adding another petabyte of data isn't going to make it better. Training on a gazillion terabytes of private videos probably won't get past the lawyers either, in case the model memorizes something sensitive. And the long tail of public videos nobody ever watches probably doesn't add anything.
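For anyone who hasn't seen that quirk, here's a minimal sketch using the open-source openai-whisper package; the file path is a placeholder and the thresholds shown are just the library defaults, tuning them is one common way people try to suppress the silence hallucinations.

```python
# Minimal sketch (assumes `pip install openai-whisper`); audio path is a placeholder.
import whisper

model = whisper.load_model("base")  # small model is enough to demo the behaviour

# On clips with long stretches of silence, Whisper sometimes emits filler like
# "Thanks for watching!". Adjusting these thresholds is a common mitigation.
result = model.transcribe(
    "clip_with_long_silence.mp3",
    no_speech_threshold=0.6,   # default; raise to drop more segments flagged as non-speech
    logprob_threshold=-1.0,    # default; tighten to reject low-confidence decodes
)
print(result["text"])
```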
LLM training shifted years ago toward securing access to unique, value-adding data, not just throwing more web-crawl data into the mix. That's why the big AI firms all pay PhDs to write transcripts of their reasoning as they solve hard problems, and similar exercises. YouTube is probably already tapped out as a source of really useful data, although of course there's a thin sliver of new high-quality content being uploaded all the time that helps keep the knowledge base fresh.
The problem for Google is that scraping can only be blocked at scale. If an AI lab uses enough proxies, VMs and the like, it can still grab a steady stream of new videos, and that's hard to stop (short of going full DRM on everything, and maybe not even then). Google can block bulk scrapes of YouTube, and is doing so, but that slams the stable door after the horse has bolted.
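To make the "enough proxies and VMs" point concrete, here's a rough sketch of the kind of low-rate, proxy-rotating fetching that's hard to distinguish from ordinary traffic. The proxy addresses, video IDs and rate limits are all hypothetical; real scraping operations are obviously far more elaborate.

```python
# Rough sketch of proxy-rotating fetches (uses the `requests` package).
# Proxy addresses and video IDs below are placeholders, not real endpoints.
import random
import time
import requests

PROXIES = [
    "http://10.0.0.1:3128",   # hypothetical proxy pool; in practice rented
    "http://10.0.0.2:3128",   # residential/datacenter IPs spread across regions
    "http://10.0.0.3:3128",
]

def fetch_page(video_id: str) -> str | None:
    """Fetch one watch page through a randomly chosen proxy."""
    proxy = random.choice(PROXIES)
    try:
        resp = requests.get(
            f"https://www.youtube.com/watch?v={video_id}",
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None  # a blocked or dead proxy just gets skipped over

for vid in ["abc123", "def456"]:          # placeholder IDs
    html = fetch_page(vid)
    time.sleep(random.uniform(2, 8))      # low, jittered request rate per IP
```

Spread across enough IPs, each individual address stays under any plausible rate limit, which is why blocking this kind of trickle is so much harder than blocking a bulk crawl.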
ArtemZ•2mo ago
I feel like it lacks traction because of how convenient and popular YouTube is... but at the same time, with more and more ads and the war on ad blockers, things can change.