Get the latest tech news

EleutherAI releases massive AI training dataset of licensed and open domain text

EleutherAI, an AI research organization, has released what it's claiming is one of the largest collections of licensed and open-domain text for training AI models.

While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission. “[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive.

Get the Android app

Or read this on TechCrunch