Transparency is often lacking in datasets used to train large language models
The Data Provenance Explorer can help machine-learning practitioners make more informed choices about the data they train their models on, which could improve the accuracy of models deployed in the real world.
Building on these insights, they developed a user-friendly tool called the Data Provenance Explorer that automatically generates easy-to-read summaries of a dataset's creators, sources, licenses, and allowable uses.

"When you have misattribution and confusion about where data came from, you have a serious transparency issue," says Robert Mahari, a graduate student in the MIT Human Dynamics Group, a JD candidate at Harvard Law School, and co-lead author of the paper. For instance, if the licensing terms of a dataset are wrong or missing, someone could spend a great deal of money and time developing a model, only to be forced to take it down later because some of the training data contained private information.
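To make the idea concrete, here is a minimal sketch of what such a machine-readable provenance summary might look like. The ProvenanceSummary structure and its field names are illustrative assumptions for this article, not the Data Provenance Explorer's actual schema or API.

from dataclasses import dataclass

@dataclass
class ProvenanceSummary:
    """Hypothetical record of a dataset's origin and licensing terms.
    Field names are illustrative assumptions, not the Data Provenance
    Explorer's real schema."""
    name: str
    creators: list[str]
    sources: list[str]          # e.g., URLs or upstream datasets
    license: str                # SPDX-style identifier where possible
    allowable_uses: list[str]   # e.g., ["research", "commercial"]
    notes: str = ""

def commercial_use_permitted(summary: ProvenanceSummary) -> bool:
    """Conservative check: permit commercial use only when the summary
    explicitly lists it; missing or ambiguous terms default to False."""
    return "commercial" in summary.allowable_uses

# Example usage with made-up values:
example = ProvenanceSummary(
    name="example-instruction-dataset",
    creators=["Example Lab"],
    sources=["https://example.org/raw-corpus"],
    license="CC-BY-NC-4.0",
    allowable_uses=["research"],
)
print(commercial_use_permitted(example))  # False: no commercial permission listed

The design choice here mirrors the problem the researchers describe: when licensing terms are missing or ambiguous, the safe default is to deny a use rather than assume it is allowed.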