OpenAI transcribed over a million hours of YouTube videos to train GPT-4


Google says it prohibits unauthorized YouTube scraping.

OpenAI spokesperson Lindsay Held told The Verge in an email that the company curates “unique” datasets for each of its models to “help their understanding of the world” and maintain its global research competitiveness. The Times article says that the company exhausted supplies of useful data in 2021, and discussed transcribing YouTube videos, podcasts, and audiobooks after blowing through other resources. By then, it had trained its models on data that included computer code from GitHub, chess move databases, and schoolwork content from Quizlet.
