Overview of data used to train language models

What types of datasets are used to train LLMs?

This post provides a brief summary of several corpora used for training Large Language Models (LLMs), categorized into six groups: Books, CommonCrawl, Reddit links, Wikipedia, Code, and others. It is based on the survey paper “A Survey of Large Language Models.”

  1. Books: This category includes BookCorpus, a collection of over 11,000 books spanning various topics and genres, and Project Gutenberg, which contains over 70,000 public-domain literary books. Books1 and Books2, used to train GPT-3, are larger than BookCorpus but have not been publicly released.
  2. CommonCrawl: This is one of the largest open-source web crawling databases, containing petabyte-scale data. Filtered subsets of its web pages are used for training LLMs; four such datasets are C4 (https://huggingface.co/datasets/allenai/c4), CC-Stories, CC-News, and RealNews. A short example of loading C4 follows this list.
  3. Reddit Links: Reddit is a social media platform where users submit links and text posts. Highly upvoted posts are used to build high-quality datasets such as WebText and OpenWebText. Another Reddit-derived corpus is Pushshift.io, a continuously updated collection of historical Reddit data.
  4. Wikipedia: Wikipedia is an online encyclopedia containing high-quality articles on diverse topics. Filtered English-only versions of Wikipedia are widely used for training LLMs.
  5. Code: Code data is collected from open-source licensed code on the Internet, primarily from public code repositories such as GitHub and code-related question-answering platforms such as StackOverflow. Google’s BigQuery dataset also includes a substantial number of open-source licensed code snippets in various programming languages.
  6. Others: The Pile is a large-scale, diverse, and open-source text dataset consisting of over 800GB of data from multiple sources. ROOTS is composed of various smaller datasets and covers 59 different languages.
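
As a concrete illustration of working with one of these corpora, the snippet below streams a few records from the C4 dataset hosted at the Hugging Face link above. It is a minimal sketch, assuming the Hugging Face `datasets` library is installed; it is not part of the original survey.

```python
# Minimal sketch: stream a few records from the English split of C4 using the
# Hugging Face `datasets` library (install with `pip install datasets`).
# Streaming avoids downloading the full multi-terabyte corpus up front.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # Each record contains "text", "url", and "timestamp" fields.
    print(example["url"])
    print(example["text"][:200], "...")
    if i >= 2:  # only peek at the first three documents
        break
```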

In practice, pre-training LLMs requires a mixture of different data sources rather than a single corpus. For example, GPT-3 was trained on a mixed dataset of 300B tokens drawn from CommonCrawl, WebText2, Books1, Books2, and Wikipedia. PaLM used a 780B-token pre-training dataset sourced from social media conversations, filtered webpages, books, GitHub, multilingual Wikipedia, and news. LLaMA drew its training data from various sources, including CommonCrawl, C4, GitHub, Wikipedia, books, ArXiv, and StackExchange.
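
To make the idea of a data mixture concrete, here is a small, hypothetical sketch of weighted sampling across sources. The corpus names, contents, and weights are placeholders chosen for illustration; they are not the actual proportions used by GPT-3, PaLM, or LLaMA.

```python
import random

# Hypothetical mixture of pre-training sources; the weights are illustrative
# placeholders, not the proportions used by any real model.
corpora = {
    "commoncrawl": ["web page A", "web page B", "web page C"],
    "books":       ["book excerpt A", "book excerpt B"],
    "wikipedia":   ["wiki article A", "wiki article B"],
}
weights = {"commoncrawl": 0.6, "books": 0.25, "wikipedia": 0.15}


def sample_document(rng: random.Random) -> str:
    """Pick a source according to the mixture weights, then a document from it."""
    names = list(weights)
    source = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return rng.choice(corpora[source])


rng = random.Random(0)
print([sample_document(rng) for _ in range(4)])
```

Real pre-training pipelines apply the same idea at a much larger scale, with the sampling weights chosen to balance data quality and diversity across sources.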
