OpenAI transcribed over a million hours of YouTube videos to train GPT-4

Google says it prohibits unauthorized YouTube scraping.

OpenAI spokesperson Lindsay Held told The Verge in an email that the company curates “unique” datasets for each of its models to “help their understanding of the world” and maintain its global research competitiveness. The Times article says that the company exhausted supplies of useful data in 2021 and discussed transcribing YouTube videos, podcasts, and audiobooks after blowing through other resources. By then, it had trained its models on data that included computer code from GitHub, chess move databases, and schoolwork content from Quizlet.
