Beware of AI 'model collapse': How training on synthetic data pollutes the next generation
Oxford scholars found that large language models fed a diet of 'cannibal' data created by other LLMs degenerate into complete gibberish.
To arrive at that conclusion, the authors conducted an experiment using Meta's open-source AI model OPT, short for "Open Pre-trained Transformer," introduced in 2022. Shumailov's team used the Wikitext2 dataset of Wikipedia articles to "fine-tune" OPT, that is, to re-train it with additional data, a very common practice in generative AI. The authors provided examples of what happens after five rounds of using each fine-tuned model's output as the training data for the next: by generation five, the output is complete gibberish.
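To make the recursive setup concrete, here is a minimal sketch of that kind of generational fine-tuning loop, assuming the Hugging Face "facebook/opt-125m" checkpoint and the "wikitext-2-raw-v1" dataset as stand-ins; the hyperparameters, prompt lengths, and sample counts are illustrative choices, not the authors' exact protocol.

```python
# Sketch: fine-tune OPT on Wikitext2, then repeatedly re-train on text
# generated by the previous generation ("cannibal" data). Assumes the
# Hugging Face transformers and datasets libraries; values are illustrative.
from datasets import Dataset, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

MODEL_NAME = "facebook/opt-125m"  # smallest OPT variant, for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Generation 0 trains on real Wikipedia text; later generations train only
# on text produced by the previous generation's model.
train_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
train_data = train_data.filter(lambda x: len(x["text"].strip()) > 0)

for generation in range(5):
    # Start each generation from the original pre-trained checkpoint
    # (an assumption for this sketch) and fine-tune on the current data.
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    tokenized = train_data.map(tokenize, batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=f"opt-gen{generation}",
            per_device_train_batch_size=4,
            num_train_epochs=1,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

    # Sample synthetic text from this generation to train the next one.
    synthetic_texts = []
    for prompt in train_data["text"][:1000]:
        inputs = tokenizer(prompt[:64], return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=128, do_sample=True)
        synthetic_texts.append(tokenizer.decode(output[0], skip_special_tokens=True))

    train_data = Dataset.from_dict({"text": synthetic_texts})
```

With each pass, the training set drifts further from human-written Wikipedia text, which is the mechanism the researchers describe as model collapse.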