
The promise and perils of synthetic data

Big tech companies and startups alike are increasingly using synthetic data to train their AI models. But there are risks to this strategy.

Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. That, combined with fears of copyright lawsuits and of objectionable material making its way into open data sets, has forced a reckoning for AI vendors.

Synthetic data carries risks of its own. A 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose “quality or diversity progressively decrease.” Sampling bias (poor representation of the real world) causes a model’s diversity to worsen after a few generations of training, according to the researchers, although they also found that mixing in a bit of real-world data helps to mitigate this.
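The generational decline the researchers describe can be illustrated with a toy simulation. This is a minimal sketch, not the study's actual experiment: the "model" here is just a Gaussian fitted to a small data set, each new generation is trained on samples drawn from the previous fit, and the function name `simulate` and all of its parameters are hypothetical choices made for illustration. The spread (standard deviation) of the data stands in for "diversity."

```python
import random
import statistics


def simulate(generations=200, n=20, real_fraction=0.0, seed=0):
    """Return the spread of the training data after repeated self-training.

    Toy assumption: the 'real world' is a standard normal distribution,
    and each 'model' is a Gaussian fitted to its training set.
    """
    rng = random.Random(seed)
    # Generation 0 trains on real data only.
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # Build the next training set from fresh real samples plus
        # synthetic samples drawn from the fitted model.
        n_real = round(n * real_fraction)
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        data = real + synthetic
    return statistics.pstdev(data)


if __name__ == "__main__":
    print("synthetic only:", simulate(real_fraction=0.0))
    print("half real data:", simulate(real_fraction=0.5))
```

In this sketch, training purely on synthetic samples tends to shrink the spread toward zero over many generations, because each small sample slightly underestimates the variety of the one before it. Mixing real samples back in each round keeps the spread close to that of the real distribution. Real generative models are far more complex, so this only conveys the intuition.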


Read this on TechCrunch