Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content

OpenAI claimed it's "impossible" to build good AI models without using copyrighted data. An “ethically created” large language model and a giant AI dataset of public domain text suggest otherwise.

On Wednesday, a group of researchers backed by the French government released what they claim is the largest available AI training dataset for language models composed entirely of public domain text. Ed Newton-Rex founded the nonprofit Fairly Trained in January 2024 after quitting his executive role at image generation startup Stability AI because he disagreed with its policy of scraping content without permission.

Source: Wired