The promise and perils of synthetic data


Big tech companies and startups alike are increasingly using synthetic data to train their AI models. But there are risks to this strategy.

Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. That, combined with fears of copyright lawsuits and of objectionable material making its way into open data sets, has forced a reckoning for AI vendors. Synthetic data carries risks of its own, however: a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose “quality or diversity progressively decrease.” Sampling bias, meaning poor representation of the real world, causes a model’s diversity to worsen after a few generations of training, according to the researchers (although they also found that mixing in a bit of real-world data helps to mitigate this).
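The generational effect the researchers describe can be illustrated with a toy simulation. The sketch below is not the study's method. It assumes the "model" is nothing more than a Gaussian fitted to its training data, and the function name, parameters, and sample sizes are all hypothetical choices made for illustration. Each generation, the model is refit on samples drawn from the previous fit, optionally topped up with fresh real-world samples.

```python
import random
import statistics


def simulate_generations(n_samples=10, generations=300, real_fraction=0.0, seed=0):
    """Toy illustration: repeatedly refit a Gaussian to its own samples.

    Returns the fitted standard deviation (a stand-in for "diversity")
    recorded at each generation. All names and defaults are hypothetical.
    """
    rng = random.Random(seed)
    # Generation 0 trains on "real" data drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    n_real = int(n_samples * real_fraction)
    history = []
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
        # Build the next training set: model output, optionally mixed
        # with fresh samples from the real distribution.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
    return history


if __name__ == "__main__":
    pure = simulate_generations(real_fraction=0.0)
    mixed = simulate_generations(real_fraction=0.5)
    print(f"synthetic only: first={pure[0]:.3g}, last={pure[-1]:.3g}")
    print(f"half real data: first={mixed[0]:.3g}, last={mixed[-1]:.3g}")
```

In this toy setting, the synthetic-only run tends to lose its spread as small estimation errors compound from one generation to the next, while the run that mixes in real samples stays anchored to the original distribution. Real generative models are far more complex, so this should be read only as intuition for the mechanism, not as a reproduction of the study's results.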

Read the full article on TechCrunch.
