EleutherAI releases massive AI training dataset of licensed and open-domain text


EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission. “[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday.

The dataset, the Common Pile v0.1, can be downloaded from Hugging Face’s AI dev platform and GitHub. It was created in consultation with legal experts and draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive.

Read the full article on TechCrunch.
