A major AI training data set contains millions of examples of personal data


Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models.

Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found. And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are] many downstream models that are all trained on this exact data set,” says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author.

When asked for comment, Florent Daudens of Hugging Face said that “maximizing the privacy of data subjects across the AI ecosystem takes a multilayered approach, which includes but is not limited to the widget mentioned,” and that the platform is “working with our community of users to move the needle in a more privacy-grounded direction.”
