The promise and perils of synthetic data
Big tech companies and startups alike are increasingly using synthetic data to train their AI models. But there are risks to this strategy.
Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Synthetic data offers an alternative, but it carries risks of its own. A 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose “quality or diversity progressively decrease.” Sampling bias, meaning poor representation of the real world, causes a model’s diversity to worsen after a few generations of training, according to the researchers (although they also found that mixing in a bit of real-world data helps to mitigate this). Over generations, models also lose their grasp of more esoteric knowledge, the researchers found, becoming more generic and often producing answers irrelevant to the questions they’re asked.
Read the full article on TechCrunch.