Will AI systems perform poorly due to AI-generated material in training data?


Ever since ChatGPT was released to the public in November 2022, people have been using it to generate text, from emails to blog posts to bad poetry, much of which they post online. Since that release, the companies that build the large language models (LLMs) on which such chatbots are based—models such as OpenAI's GPT-3.5, the technology underlying ChatGPT—have continued to put out newer versions, training them on new text data, some of which they scraped off the Web.

LLMs work by learning the statistical distribution of so-called tokens—words or parts of words—within a language, examining billions of sentences garnered from sources including book databases, Wikipedia, and the Common Crawl dataset, a collection of material gathered from the Internet. Some curation happens naturally, Gal said; people do not post everything their chatbot creates to the Internet, weeding out material that contains false information or simply does not make sense, and that improves the training set.
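To give a rough sense of what "learning the statistical distribution of tokens" means, here is a minimal, hypothetical Python sketch: a toy bigram model that counts which token tends to follow which in a tiny made-up corpus and turns those counts into probabilities. Real LLMs learn far richer distributions with neural networks trained on billions of sentences; the corpus and function names below are purely illustrative.

```python
# Toy illustration only: a bigram model estimating the probability of the
# next token given the current one. Not how production LLMs are built.
from collections import Counter, defaultdict

# Hypothetical two-sentence corpus standing in for Web-scale training text.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Count token-to-next-token transitions.
transitions = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current, nxt in zip(tokens, tokens[1:]):
        transitions[current][nxt] += 1

def next_token_distribution(token):
    """Normalize the transition counts into a probability distribution."""
    counts = transitions[token]
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# After "the", each observed continuation is equally likely in this corpus.
print(next_token_distribution("the"))
# {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```

The concern the article raises follows from this setup: if AI-generated text dominates the corpus, the estimated distribution drifts toward what earlier models produced rather than toward human language.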
