Since Vaswani et al. introduced the Transformer architecture in 2017, it has become the de facto standard for large-scale natural language processing tasks. From the pioneering GPT model in 2018 to ChatGPT today, and even text-to-image synthesis models such as Stable Diffusion, modern systems are either based on or inspired by the Transformer.

Given its significance as a breakthrough in NLP and in machine learning as a whole, it is surprising that the original paper offers only a limited explanation of the architecture and, most critically, of how the metaphorical gears of a Transformer model turn and why they work so well. This article aims to fill that gap with a comprehensive, visual-first illustration of the Transformer and to “decode” what makes its comprehension of language so effective.

Why Transformers?

Along with the Transformer architecture itself, the authors of the paper proposed a revolutionary approach to sequence-based modeling: the self-attention mechanism. Previously, language modeling tasks were dominated by recurrence-based techniques such as RNNs, GRUs, and LSTMs, which struggled to retain information from earlier time steps in longer sequences, resulting in poor performance on tasks involving long-range dependencies. The Transformer sidesteps these limitations by processing the entire input sequence simultaneously: every position has direct access to every other position, so the model no longer has to compress the whole history into a single hidden state, and computation parallelizes far better. Self-attention itself extends the attention mechanisms introduced before the Transformer, which allow a model to focus on certain parts of the input, paying more “attention” to one part than another.
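To make this concrete, here is a minimal sketch of single-head, unbatched scaled dot-product self-attention in plain NumPy. The projection matrices `w_q`, `w_k`, `w_v` and the toy dimensions are illustrative assumptions rather than the exact setup used later in the article, but the core idea is the same: every position scores every other position and takes a weighted sum of the values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (unbatched sketch).

    x:             (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q = x @ w_q                      # queries
    k = x @ w_k                      # keys
    v = x @ w_v                      # values
    d_k = q.shape[-1]
    # Every position attends to every position in the sequence at once.
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) attention logits
    weights = softmax(scores)        # how much "attention" each token pays to the others
    return weights @ v               # weighted sum of the values

# Toy example: a 4-token sequence with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): one updated vector per input position
```

Each output row is a context-aware mixture of the whole sequence, which is exactly the property recurrent models struggled to provide.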

In the following section, we will delve into the fundamental methodology underlying the Transformer model and most sequence-to-sequence modeling approaches: the encoder and the decoder. This will serve as a springboard for dissecting the Transformer model architecture and gaining an in-depth understanding of its inner workings.

The Encoder-Decoder Concept

The Transformer model relies on the interaction between two separate, smaller models: the encoder and the decoder. The encoder receives the input, while the decoder outputs the prediction. Using an encoder and a decoder to process sequence-to-sequence data has been standard practice since around 2014, first with recurrence-based models and later in the Transformer.

Before the encoder-decoder architecture, predictions for sequence-based problems relied on the knowledge accumulated over the entire input sequence being “squeezed” into a single fixed-size representation. Although architectures such as the LSTM and GRU mitigated the long-range dependency issue, they did not resolve the underlying problem of RNNs: the inability to fully carry the information of long sequences through to the prediction.

In an encoder-decoder scheme, the encoder takes in the entire input sequence and transforms it into a vectorized representation that captures accumulated knowledge of the input at every time step. This representation is then fed into the decoder, which “decodes” the information collected by the encoder and attempts to make a valid prediction.
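As a rough illustration of this flow, the sketch below wires up PyTorch's built-in `nn.Transformer` (rather than an implementation from scratch) and runs its encoder and decoder stages separately. Positional encodings, masking, and training are omitted, and all sizes are placeholder values chosen for illustration.

```python
import torch
import torch.nn as nn

# Placeholder sizes, for illustration only.
vocab_size, d_model = 1000, 512
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (1, 10))  # source sequence: (batch, src_len)
tgt = torch.randint(0, vocab_size, (1, 7))   # decoder's own input so far: (batch, tgt_len)

# 1) The encoder maps the whole input sequence to a "memory":
#    one d_model-dimensional vector per source time step.
memory = model.encoder(embed(src))           # (1, 10, d_model)

# 2) The decoder attends to that memory while processing its own inputs;
#    a final linear layer turns its output into vocabulary scores.
decoded = model.decoder(embed(tgt), memory)  # (1, 7, d_model)
logits = to_vocab(decoded)                   # (1, 7, vocab_size)
```

Notice that the memory keeps one vector per input time step, rather than compressing the whole sequence into a single vector as classic RNN encoders did.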

In the context of the Transformer model, one can think of the encoder and the decoder as a researcher and a programmer: the researcher develops the initial ideas and discoveries, and the programmer implements a solution based on the researcher’s findings. The encoder, like the researcher, picks out the crucial aspects of the input, such as sentence structure, syntax, and semantics, and passes the insights it has learned on to the decoder. The decoder, like the programmer, then “programs” a practical solution based on the insights gained by the encoder, or, in our analogy, the researcher.

The decoder also receives its own sequence of inputs before decoding the information provided by the encoder: it bases its predictions not only on the original input sequence but also on its previous outputs, making it auto-regressive in nature. At each step, the decoder produces a single word and concatenates it with the words it predicted at previous time steps. This prediction loop continues until the maximum number of outputs is reached (usually a hyperparameter specified by the user) or the model deduces that it has reached the natural end of a sentence or phrase.
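The loop below sketches what this auto-regressive process might look like as simple greedy decoding, reusing the `model`, `embed`, and `to_vocab` modules from the previous snippet. The `bos_id`/`eos_id` start and end tokens and the `max_len` cap are hypothetical stand-ins for the user-specified hyperparameters mentioned above, and attention masks are again omitted for brevity.

```python
import torch

def greedy_decode(model, embed, to_vocab, src, bos_id, eos_id, max_len=50):
    """Sketch of the decoder's auto-regressive prediction loop (greedy decoding)."""
    memory = model.encoder(embed(src))        # encode the input sequence once
    output = torch.tensor([[bos_id]])         # start from a "beginning of sentence" token
    for _ in range(max_len):                  # stop at the maximum output length...
        decoded = model.decoder(embed(output), memory)
        logits = to_vocab(decoded[:, -1, :])  # scores for the next word only
        next_token = logits.argmax(dim=-1, keepdim=True)
        # ...concatenating each new word with everything predicted so far...
        output = torch.cat([output, next_token], dim=1)
        if next_token.item() == eos_id:       # ...or stop at the natural end of the sentence
            break
    return output
```

Each pass through the loop feeds the decoder its own growing output sequence alongside the encoder's memory, which is precisely what makes the process auto-regressive.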