
Large language model data pipelines and Common Crawl


This article provides a short introduction to the pipeline used to create the training data for large language models (LLMs) such as LLaMA from Common Crawl (CC).

Another important aspect, deduplication, is described in the CCNet paper: this step removes a lot of boilerplate (e.g. navigation menus, cookie warnings, and contact information). It seems to me that for LLaMA's dataset, the LM filtering was kept conservative to avoid discarding relevant data, and this extra step was added to deal with the remaining quality issues, but this is only my hypothesis. Building such a pipeline is a long-term investment that requires substantial experimentation, engineering effort, attention to detail, and good intuition to make bets under uncertainty.
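A minimal sketch of this kind of paragraph-level deduplication, in the spirit of CCNet's approach (hashing normalized paragraphs and keeping only the first occurrence across the corpus). The function names and normalization details here are illustrative assumptions, not the paper's actual code:

```python
import hashlib
import unicodedata

def normalize(paragraph: str) -> str:
    # Lowercase and strip non-alphanumeric characters so near-identical
    # boilerplate (menus, cookie banners) hashes to the same digest.
    text = unicodedata.normalize("NFD", paragraph.lower())
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace()).strip()

def dedup_paragraphs(docs):
    # Keep only the first occurrence of each paragraph seen across all docs.
    seen = set()
    out = []
    for doc in docs:
        kept = []
        for para in doc.split("\n"):
            digest = hashlib.sha1(normalize(para).encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                kept.append(para)
        out.append("\n".join(kept))
    return out

docs = [
    "Accept cookies\nReal article text one.",
    "Accept cookies\nDifferent article text.",
]
print(dedup_paragraphs(docs))
# The repeated "Accept cookies" line is dropped from the second document.
```

In a real pipeline this runs at web scale, so the hash set is sharded across machines rather than held in one Python set, but the core idea is the same.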
