In the previous article on Building the LLM Stack, we covered various popular LLM architectures in depth, detailing their additions and changes compared to the original Transformer introduced in 2017. But LLMs can’t perform any tasks without training! An untrained architecture is as useless as an empty text editor.

In this article, we are going to cover the training process of LLMs, specifically the pre-training phase.

Stages of LLM Training

In most modern Large Language Models, training is executed in two stages: the pre-training stage and the fine-tuning stage.

In the pre-training stage, models are trained on an enormous corpus of text data, typically spanning many diverse domains, with text gathered from every corner of the internet. This training stage is intended to “teach” the model how our human language actually “works”. There is no particular task the training is aimed at, other than gaining a comprehensive understanding of human language as a whole.

In this stage, the most important component is not how the model is trained (as we will explore in this article, most models are trained using the same mechanism), but rather the data that is leveraged in the pre-training phase.

In the second stage of training, fine-tuning, the model is tailored to a specific task, whether that is a conversational chatbot, a question-answering assistant, or any other language-related use case.

Just as the training of Large Language Models is executed in two main stages, the journey of education mirrors this process closely. Imagine the pre-training phase of LLMs as the foundational years in school: elementary, middle, and high school. In these early years, students are exposed to a broad curriculum, covering a wide array of subjects from mathematics and science to literature and the arts. This broad-spectrum education is crucial, as it lays the groundwork for understanding the world at large. Similarly, during the pre-training stage, LLMs are fed a vast and diverse range of text, learning the intricacies of human language without a specific end goal in sight, much like a student learning about the world in its broadest strokes.

Transitioning to the fine-tuning stage of LLM training parallels entering college or university, where the choice of a major allows students to specialize in a field of their interest. Likewise, models, now equipped with a general understanding of language, are further trained on specific tasks: generating human-like chat responses, answering complex questions, or any other specialized language task. This stage allows the models to refine their capabilities, focusing their vast knowledge to excel in particular areas, just as a college education hones a student’s skills and knowledge in their chosen field.

In this article, our focus is on the pre-training stage of LLMs: how the training works and, most importantly, how the data is collected and processed. Furthermore, we will cover the evolution of pre-training datasets, from the original GPT model to modern LLMs, shedding light on the importance of data in Large Language Models.

The Pre-Training

There are generally two flavors of Large Language Model pre-training: Causal Language Modeling and Masked Language Modeling.
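As a rough sketch of the difference, consider how each objective might turn the same toy sentence into (input, target) training pairs. The token strings and the masking pattern below are purely illustrative; real models operate on integer token IDs produced by a tokenizer:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Causal Language Modeling: at each position, predict the next token
# from everything that came before it.
clm_inputs = tokens[:-1]   # ["The", "cat", "sat", "on", "the"]
clm_targets = tokens[1:]   # ["cat", "sat", "on", "the", "mat"]

# Masked Language Modeling: hide a random subset of tokens and predict
# the originals from the surrounding (bidirectional) context.
mlm_inputs = ["The", "[MASK]", "sat", "on", "the", "[MASK]"]
mlm_targets = {1: "cat", 5: "mat"}  # only masked positions are scored
```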

Causal Language Modeling was popularized largely by the original GPT paper, whose main contribution, Generative Pre-Training, became the nuts and bolts of modern LLMs. Causal Language Modeling involves training the model to predict the next token in a sequence of text. In other words, it can be described as “next-token prediction”, where the model attempts to generate the next token based on the previous context.
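Concretely, this objective reduces to a cross-entropy loss between the model’s predicted distribution at each position and the token that actually comes next. Below is a minimal PyTorch sketch of one training step. The function name `clm_loss` and the stand-in `model` are hypothetical, not from any particular library; a real causal LM would also use masked self-attention so that position t cannot peek at tokens after t.

```python
import torch
import torch.nn.functional as F

def clm_loss(model, token_ids):
    # Inputs are every token except the last; targets are the same
    # sequence shifted left by one position.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # shape: (batch, seq_len - 1, vocab_size)
    # Cross-entropy between the predicted distribution at each position
    # and the token that actually comes next.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Example usage with a toy stand-in "model" (embedding + linear layer;
# no attention, so it is not actually causal, just shape-compatible):
vocab_size, d_model = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)
batch = torch.randint(0, vocab_size, (4, 16))  # 4 sequences of 16 token IDs
loss = clm_loss(model, batch)
loss.backward()
```

Note that because of the one-position shift, every token in the batch serves as a training target for the prefix before it, so a single forward pass supervises every position in the sequence at once.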