Since the debut of the Transformer architecture in the groundbreaking paper “Attention is All You Need” seven years ago, the landscape of machine learning has been fundamentally transformed. The architecture and its self-attention mechanism have crept into every corner of machine learning, from computer vision to reinforcement learning. Modern Large Language Models, in particular, are predominantly built on the foundations of the Transformer architecture and its core principles. However, as new LLMs flood the field, the nuances of each model and the ways its architecture diverges from the original Transformer design are often poorly documented, and tracing the origins of and reasons for each modification typically requires digging through the papers.

This article aims to shed light on some of the most pivotal and influential LLM architectures, delving into the rationale behind their specific design choices. A clear understanding of the original Transformer architecture is assumed. To learn more, check out this article.

GPT-3

Undoubtedly, the LLM revolution began with the release of ChatGPT on November 30th, 2022. ChatGPT was based on the GPT-3 architecture and fine-tuned for conversation using Reinforcement Learning from Human Feedback (RLHF).

The GPT line of models has been more than game-changing: from GPT to GPT-4, nearly every iteration has reigned over the other NLP models of its time.

In the original GPT paper, the model set itself apart from its contemporaries by employing a decoder-only architecture. Some of the best-performing models of the time, such as BERT (Bidirectional Encoder Representations from Transformers), built on the encoder module of the original “Attention is All You Need” design, which itself pairs an encoder with a decoder. The decoder-only architecture improved computational efficiency and reduced model complexity, and almost every LLM released after the success of GPT-3 has adopted it.
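To make “decoder-only” concrete, here is a minimal sketch of the causal self-attention at the heart of such a block. PyTorch and the toy dimensions are illustrative assumptions, not details taken from the papers: the point is simply that each token may attend only to the tokens before it, and there is no cross-attention to an encoder.

```python
import torch

def causal_self_attention(q, k, v):
    # Scaled dot-product attention with a causal (lower-triangular) mask,
    # the core of a decoder-only block: token i can only attend to tokens <= i.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    mask = torch.tril(torch.ones(q.size(-2), k.size(-2))).bool()
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: a sequence of 4 tokens with 8-dimensional representations.
x = torch.randn(4, 8)
print(causal_self_attention(x, x, x).shape)  # torch.Size([4, 8])
```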

Inputs and Outputs

Outputs from the GPT-3 model are generated autoregressively, just as in the original Transformer, with one minor difference: the input length of GPT-3 is fixed at 2048 tokens, and any shorter input is padded with empty tokens until it reaches that length.

At each prediction step, the model produces the token most likely to follow the end of the input sequence. This output token is then appended to the input sequence and re-entered into the model for the next token prediction. This process repeats until the desired length of the output sequence is reached or the model determines it has reached the natural end of the response.

The scheme above applies to all modern Large Language Models.
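To make the loop concrete, here is a minimal greedy-decoding sketch. The `model` callable, `pad_id`, and `eos_id` are hypothetical stand-ins rather than GPT-3’s actual interface, and production systems usually sample from the output distribution rather than always taking the single most likely token.

```python
def generate(model, prompt_ids, max_new_tokens, eos_id, context_size=2048, pad_id=0):
    # `model` is assumed to map a fixed-length list of token ids to a
    # probability distribution (a list of floats) over the vocabulary.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Keep only the most recent tokens that fit in the fixed context window,
        # and pad shorter inputs, mirroring GPT-3's 2048-token input length.
        window = ids[-context_size:]
        window = [pad_id] * (context_size - len(window)) + window
        probs = model(window)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        ids.append(next_id)                                       # feed it back in
        if next_id == eos_id:   # the model signals the natural end of the response
            break
    return ids
```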

Byte Pair Encoding Tokenization

Notice how GPT-3 deals with tokens, not words or characters. Language models prior to the GPT family used a wide variety of tokenization methods; one widely used approach, space-and-punctuation tokenization, can loosely be described as splitting input sequences into individual words and punctuation marks.

Tokenization techniques like space-and-punctuation splitting typically generate an enormous vocabulary, encompassing every unique word and symbol seen in the model’s training data. Such methods not only increase computational complexity but also create the problem of handling words that fall outside the model’s vocabulary during inference.

GPT-3’s adoption of Byte Pair Encoding (BPE) addresses these challenges with a more efficient and adaptable tokenization strategy: BPE strikes a balance between the extremes of character-level and word-level tokenization. Most modern LLMs rely on their own variation of BPE.
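You can see subword tokenization in practice with OpenAI’s open-source tiktoken library, which ships the BPE vocabularies used by its models. To my understanding the “r50k_base” encoding corresponds to the original GPT-3 family, but treat that name as an assumption.

```python
import tiktoken

# "r50k_base" is (to my knowledge) the BPE vocabulary of the original GPT-3 models.
enc = tiktoken.get_encoding("r50k_base")

ids = enc.encode("Tokenization strikes a balance.")
print(ids)                             # a short list of integer token ids
print([enc.decode([i]) for i in ids])  # the subword pieces behind each id
print(enc.decode(ids))                 # decoding round-trips to the original text
```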

Here’s a simplified process of how BPE works:

  1. Initialization of Vocabulary: BPE starts with a vocabulary of individual characters. This vocabulary includes all characters appearing in the training corpus. Each unique character is treated as an initial token.

  2. Building the Vocabulary: The algorithm counts the frequency of each pair of adjacent tokens (initially characters) in the training data.

  3. Merging: It identifies the most frequent pair of consecutive tokens and merges them into a single new token. For example, if ‘h’ and ‘e’ are the most frequent pair, they are merged to form the token ‘he’. Counting and merging then repeat on the updated data, so newly created tokens can themselves be merged further.

  4. Determining the Number of Merge Operations: The number of merge operations is a hyperparameter set based on the desired vocabulary size. In GPT-3, this meant a large number of merges, leading to a vocabulary that efficiently encodes common sequences in the training data.
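The steps above can be condensed into a toy trainer. The sketch below is a simplified word-level BPE rather than GPT-3’s actual byte-level implementation, and the corpus and merge count are made up for illustration.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    # 1. Initialize: every word starts as a sequence of single-character tokens.
    words = [list(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # 2. Count every pair of adjacent tokens across the corpus.
        pairs = Counter()
        for word in words:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        # 3. Merge the most frequent pair into a single new token wherever it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = []
        for word in words:
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges

# 'h' followed by 'e' is the most frequent pair here, so it becomes the first merge.
print(train_bpe("the hen held her head", 3))
```

A real tokenizer would also record the learned merges in order and replay them on new text at inference time, which is how common sequences end up encoded as single tokens.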