Extracting memorized pieces of books from open-weight language models


Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression. Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we leverage a recent probabilistic extraction technique to extract pieces of the Books3 dataset from 13 open-weight LLMs. Through numerous experiments, we show that it's possible to extract substantial parts of at least some books from different LLMs. This is evidence that the LLMs have memorized the extracted text; this memorized content is copied inside the model parameters. But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don't memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
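The abstract does not spell out how the probabilistic extraction technique works, so the following is only a minimal illustrative sketch under stated assumptions, not the authors' method. It assumes that a passage is scored by the probability the model assigns to generating it verbatim after a prompt (the product of per-token conditional probabilities), and that repeated sampling attempts are independent. The function names and the per-token numbers are hypothetical.

```python
import math


def sequence_probability(token_logprobs):
    """Probability of generating an entire suffix verbatim.

    Assumption: this equals the product of the per-token conditional
    probabilities, computed here as exp(sum of log-probabilities).
    """
    return math.exp(sum(token_logprobs))


def prob_extracted_within(p, n):
    """Probability that at least one of n independent samples
    reproduces the suffix, given per-sample probability p."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-token probabilities for a 5-token suffix (made-up values,
# not taken from the paper or from any real model).
example_logprobs = [math.log(x) for x in (0.9, 0.8, 0.95, 0.7, 0.85)]

p = sequence_probability(example_logprobs)
print(f"per-sample extraction probability: {p:.4f}")
print(f"chance of at least one hit in 10 samples: {prob_extracted_within(p, 10):.4f}")
```

Under these assumptions, a passage with a high per-sample probability is one the model reproduces readily, while a passage with a very small probability would need many samples before it appears even once. That is one way to read the abstract's point that the extent of memorization varies by model and by book.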

Paper: Extracting memorized pieces of (copyrighted) books from open-weight language models, by A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, and Percy Liang.

