Beware of AI 'model collapse': How training on synthetic data pollutes the next generation


Oxford scholars led by Ilia Shumailov found that large language models fed a diet of 'cannibal' data created by other LLMs eventually sink into complete gibberish.

To arrive at that conclusion, the authors conducted an experiment using Meta's open-source AI model OPT ("open pre-trained transformer"), introduced in 2022. Shumailov's team used the Wikitext2 dataset of Wikipedia articles to "fine-tune" OPT, that is, to re-train it with additional data, a very common practice in generative AI. They then repeated the process recursively, using each fine-tuned model's output as the training data for the next. The authors provide examples of what happens after five such rounds: by generation five, the model's output is complete gibberish.
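The recursive setup described above can be illustrated with a toy sketch. This is not the authors' experiment (which fine-tuned OPT on Wikitext2); it is a deliberately simple character-bigram model, a hypothetical stand-in, that shows the same mechanic: each generation is trained only on text sampled from the previous generation, so the set of transitions the model knows can only shrink.

```python
import random
from collections import defaultdict

def train_bigram(text):
    """'Train' a toy model by counting character-bigram transitions."""
    model = defaultdict(list)
    for a, b in zip(text, text[1:]):
        model[a].append(b)
    return model

def sample(model, length, seed_char, rng):
    """Generate synthetic text by randomly walking the bigram model."""
    out = [seed_char]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break  # dead end: this character was never followed by anything
        out.append(rng.choice(successors))
    return "".join(out)

def distinct_bigrams(model):
    """How many distinct transitions the model has ever seen."""
    return sum(len(set(nxt)) for nxt in model.values())

rng = random.Random(0)
# Generation 0 trains on "human" text; every later generation trains
# only on the synthetic output of its predecessor.
text = "the quick brown fox jumps over the lazy dog. " * 20
diversity = []
for generation in range(6):
    model = train_bigram(text)
    diversity.append(distinct_bigrams(model))
    text = sample(model, 500, text[0], rng)

# Diversity is monotonically non-increasing: a bigram model can only
# emit transitions it already contains, so each generation's training
# data is drawn from a subset of the previous generation's knowledge.
print(diversity)
```

The toy captures only one ingredient of model collapse, the loss of distributional diversity; real LLMs also amplify their own sampling errors across generations, which is what drives the final descent into gibberish.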

Or read this on ZDNet

Related news:

Stop X’s Grok AI From Training on Your Tweets

The problem of 'model collapse': how a lack of human data limits AI progress

‘Model collapse’: Scientists warn against letting AI eat its own tail