OpenAI transcribed over a million hours of YouTube videos to train GPT-4
Google says it prohibits unauthorized YouTube scraping.
OpenAI spokesperson Lindsay Held told The Verge in an email that the company curates “unique” datasets for each of its models to “help their understanding of the world” and maintain its global research competitiveness. The New York Times article says that OpenAI exhausted its supply of useful data in 2021 and, after blowing through other resources, discussed transcribing YouTube videos, podcasts, and audiobooks. By then, it had trained its models on data that included computer code from GitHub, chess move databases, and schoolwork content from Quizlet.
Or read this on The Verge