Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft


Harvard University announced Thursday it's releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard's newly formed Institutional Data Initiative. Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta's Llama, the Institutional Data Initiative's database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to "level the playing field" by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble.

Read this on Slashdot
