
The promise and perils of synthetic data


Big tech companies and startups alike are increasingly using synthetic data to train their AI models. But there are risks to this strategy.

Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Synthetic data is an alternative, but it has limits: a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose “quality or diversity progressively decrease.” According to the researchers, sampling bias (poor representation of the real world) causes a model’s diversity to worsen after a few generations of training, although they also found that mixing in a bit of real-world data helps to mitigate this. Over successive generations, the researchers found, models lose their grasp of more esoteric knowledge, becoming more generic and often producing answers irrelevant to the questions they’re asked.
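The generational effect described above can be illustrated with a toy simulation. This is a minimal sketch, not the researchers' method: the "model" here is just a table of token frequencies, and every name in it (`train`, `generate`, `run_generations`, `make_real_data`, `real_fraction`) is a hypothetical chosen for illustration. Each generation is trained on samples drawn from the previous generation's model, optionally mixed with a share of real data.

```python
import random
from collections import Counter


def train(samples):
    """'Train' a toy model by counting how often each token appears."""
    counts = Counter(samples)
    tokens = sorted(counts)
    weights = [counts[t] for t in tokens]
    return tokens, weights


def generate(model, n, rng):
    """Sample n tokens from the toy model's learned frequencies."""
    tokens, weights = model
    return rng.choices(tokens, weights=weights, k=n)


def run_generations(real_data, generations, n, real_fraction, rng):
    """Return the number of distinct tokens the model knows after each generation."""
    model = train(real_data)  # generation 0 sees only real data
    diversity = [len(model[0])]
    n_real = int(n * real_fraction)  # how many real samples to mix in per generation
    for _ in range(generations):
        synthetic = generate(model, n - n_real, rng)
        real = rng.choices(real_data, k=n_real)
        model = train(synthetic + real)
        diversity.append(len(model[0]))
    return diversity


def make_real_data(vocab=50):
    """Build a long-tailed toy corpus: a few common tokens and many rare ones."""
    return [f"tok{i}" for i in range(vocab) for _ in range(max(1, 100 // (i + 1)))]


if __name__ == "__main__":
    real = make_real_data()
    # Assumed settings for illustration only: 20 generations, 200 samples each.
    print("synthetic only:", run_generations(real, 20, 200, 0.0, random.Random(0)))
    print("20% real data: ", run_generations(real, 20, 200, 0.2, random.Random(0)))
```

With `real_fraction` set to 0.0, a token that fails to appear in one generation's samples can never come back, so the distinct-token count can only hold steady or fall. Rare tokens, the toy stand-in for esoteric knowledge, are the ones most likely to drop out. Raising `real_fraction` gives them a chance to re-enter each generation.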

Read this on TechCrunch