How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell
Using a clever solution, researchers find GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.
Most people interested in generative AI likely already know that large language models (LLMs) — like those behind ChatGPT, Anthropic’s Claude, and Google’s Gemini — are trained on massive datasets: trillions of words pulled from websites, books, codebases, and, increasingly, other media such as images, audio, and video. But how much of that training data do models actually memorize verbatim, and how much do they merely distill into generalized statistical patterns?

The answer matters not only for better understanding how LLMs operate — and when they go wrong — but also as model providers defend themselves in copyright infringement lawsuits brought by data creators and owners, such as artists and record labels. If LLMs are shown to reproduce significant portions of their training data word for word, courts could be more receptive to plaintiffs' claims that the models unlawfully copy protected content. If not — if the models are found to generate outputs based on generalized patterns rather than exact replication — developers may be able to continue scraping and training on copyrighted data under existing legal defenses such as fair use.
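To put the headline figure in perspective, here is a minimal back-of-the-envelope sketch (not from the paper; the helper names and example model sizes below are illustrative assumptions) that converts the reported roughly 3.6 bits per parameter into an implied total memorization budget for a given model size:

```python
# Rough arithmetic only: multiply the reported ~3.6 bits-per-parameter figure
# by a model's parameter count to get an implied total memorization budget.

BITS_PER_PARAM = 3.6  # headline estimate reported by the researchers


def memorization_capacity_bits(num_params: float,
                               bits_per_param: float = BITS_PER_PARAM) -> float:
    """Implied raw memorization capacity, in bits, for a model of the given size."""
    return num_params * bits_per_param


def as_megabytes(bits: float) -> float:
    """Convert a bit count to megabytes (1 MB = 8 * 10**6 bits)."""
    return bits / 8 / 1e6


if __name__ == "__main__":
    # Hypothetical example model sizes, chosen only to illustrate the scale.
    for name, n_params in [("500M-parameter model", 5e8),
                           ("8B-parameter model", 8e9),
                           ("70B-parameter model", 7e10)]:
        bits = memorization_capacity_bits(n_params)
        print(f"{name}: ~{bits:.2e} bits, roughly {as_megabytes(bits):,.0f} MB")
```

Run as written, the sketch suggests a 500-million-parameter model would have on the order of 225 MB of raw memorization capacity under that estimate; the point is the scale of the budget, not a precise accounting of what any particular model has stored.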