> we manually curated a set of over 2,000 YouTube channels that release original openly licensed content containing speech. From these channels, we retrieved and transcribed (using Whisper) over 1.1 million openly licensed videos comprising more than 470,000 hours of content.
secret-noun•1h ago
This is why Gemini has such an advantage.
Also, here's a link to explore the data: https://huggingface.co/collections/common-pile/common-pile-v...
otherme123•32m ago
ACCount37•19m ago