Artificial Intelligence (AI) has evolved significantly since its inception, thanks to numerous breakthroughs and advancements that have led to complex models and algorithms. The transformer, proposed in 2017 by Vaswani et al. in “Attention is All You Need,” has quickly become a foundational concept in contemporary AI. Transformers, a family of deep learning models, excel at processing sequential input, such as text, audio, or time series data, and can handle entire sequences in parallel. Their self-attention mechanism, which weighs the importance of different elements of the input sequence simultaneously, is a key technological advancement. This is particularly valuable in complex natural language processing (NLP) tasks. Transformers have revolutionized NLP, becoming the preferred architecture for language modeling, text categorization, translation, and more. In this blog article, we delve into transformers, exploring their structure, components, applications, and integration with large language models (LLMs), with code snippets and mathematical formulas to support our insights.
Why Build Transformers? Transformers are widely used in large language models (LLMs) because of their capacity to analyze sequential input and capture long-range dependencies. Originally developed for sequence-to-sequence applications such as machine translation, the transformer architecture has since been adapted for many other NLP tasks, including text categorization, sentiment analysis, and question answering.

In LLMs, transformers frequently serve as the encoder component, which takes the input sequence of tokens (words or characters) and processes it into a continuous representation that the decoder component can use to generate the output sequence. Through the self-attention mechanism, the model can attend to different sections of the input sequence simultaneously and weigh their importance, which allows it to capture intricate contextual links between tokens.

Before transformers, NLP tasks were often handled by conventional recurrent neural networks (RNNs). RNNs, however, had certain drawbacks: because information from earlier time steps was frequently lost by the time the network processed later steps, they struggled with long-range dependencies, which made it difficult to represent intricate contextual links in sequences.
What is LLM? A Large Language Model (LLM) is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. Transformers made it possible to analyze sequential data more effectively and efficiently, which helped pave the way for LLM progress. LLMs have gained popularity in recent years due to their ability to generate text that frequently cannot be distinguished from human-written language.
Transformer Architecture: The Building Block: The transformer architecture consists of an encoder and a decoder. An input sequence of tokens, such as words or characters, is fed into the encoder, which generates a continuous representation of the input sequence. Based on the encoder’s output, the decoder then creates the output sequence, one token at a time.
1. Inputs and Input Embeddings: Tokens entered by the user are the inputs to the model. However, since models can only comprehend numerical data and not raw language, these inputs must be transformed into a numerical format known as “input embeddings.” Input embeddings represent words as numbers, which machine learning models can then process. Similar to a dictionary, these embeddings place words in a mathematical space where related words are clustered together, helping the model grasp their meaning. During training, the model learns how to build these embeddings so that words with related meanings are represented by similar vectors.
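As a minimal sketch of this step (assuming PyTorch and illustrative sizes for the vocabulary and embedding dimension), an embedding layer that maps token IDs to dense vectors might look like this:

```python
import torch
import torch.nn as nn

# Illustrative sizes only (not taken from the article)
vocab_size = 10000   # number of distinct tokens the model knows
d_model = 512        # dimensionality of each embedding vector

# The embedding layer is a lookup table: token ID -> learned vector
embedding = nn.Embedding(vocab_size, d_model)

# A toy batch of token IDs, as a tokenizer might produce
token_ids = torch.tensor([[5, 42, 7, 901]])

input_embeddings = embedding(token_ids)
print(input_embeddings.shape)  # torch.Size([1, 4, 512]) = (batch, sequence, d_model)
```

During training, the weights of this lookup table are updated so that tokens used in similar contexts end up with similar vectors.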
2. Positional Encoding: The arrangement of words in a phrase is essential for interpreting the meaning of a sentence in natural language processing. Traditional machine learning models, such as feed-forward neural networks, do not automatically recognize the order of their inputs. To solve this problem, positional encoding represents the location of each word in the input sequence as a vector of values, which is supplied to the Transformer model together with the input embeddings. By using positional encoding, the Transformer architecture lets LLMs comprehend the arrangement of words in a phrase more effectively and produce output that is both grammatically accurate and semantically relevant.
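One common realization is the sinusoidal scheme from the original paper. A short sketch (again assuming PyTorch and the same illustrative d_model) is shown below; note that the encodings are real-valued vectors that are simply added to the input embeddings:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sine/cosine position encodings."""
    position = torch.arange(max_len).unsqueeze(1)                                    # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=4, d_model=512)
# position-aware input = input_embeddings + pe
```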
3. Encoder: As a component of the neural network, the encoder analyzes the input text and creates a set of hidden states that represent the text’s context and meaning. The encoder in an LLM first tokenizes the input text, breaking it up into a series of tokens such as single words or subwords. A sequence of self-attention layers then produces hidden states that represent the input text at various levels of abstraction. The transformer uses several encoder layers. The encoder is a stack of identical layers, each of which has two sublayers: Multi-Head Self-Attention (MHSA) and a Position-wise Feed-Forward Network (FFN). The MHSA layer calculates the attention weights between each pair of tokens in the input sequence, and the FFN layer uses a fully connected feed-forward network to process the output of the MHSA layer. To create the final output probability distribution, the output of the last layer is processed via a linear layer and a softmax activation function.
Multi-Head Self-Attention (MHSA): MHSA captures the relationships between different elements of the input sequence. It does this by computing attention weights, which indicate how significant each element in the sequence is when computing the representation of a specific element.
MHSA may be modelled mathematically as follows:
Let Q, K, and V be matrices that represent the query, the key, and the value, respectively. The attention weights are calculated as:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Here d_k is the dimension of the key vectors and sqrt(d_k) is its square root. The softmax function normalizes the attention weights to guarantee that they sum to 1.
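A minimal sketch of this formula (single attention head, no masking, PyTorch assumed, shapes chosen only for illustration):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # pairwise similarity scores
    weights = F.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ V                                  # weighted sum of the values

# Toy example: a sequence of 4 tokens, each projected to 64 dimensions
Q = torch.randn(4, 64)
K = torch.randn(4, 64)
V = torch.randn(4, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 64)
```

In the multi-head variant, this computation is run several times in parallel with different learned projections of Q, K, and V, and the results are concatenated.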
Position-wise Feed-Forward Networks (FFN): The output of MHSA is transformed nonlinearly by a fully connected feed-forward network, applied to each position independently. This enables the model to learn intricate patterns in the data.
FFN can be represented mathematically as:

FFN(x) = ReLU(W1 * x + b1) * W2 + b2
Here, b1 and b2 are learned bias vectors, ReLU is the rectified linear unit activation function, and W1 and W2 are learned weight matrices.
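Written as a small PyTorch module (the dimensions here are assumptions for illustration; the original paper uses a model dimension of 512 and an inner dimension of 2048):

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = ReLU(W1 * x + b1) * W2 + b2, applied to each position independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W1, b1
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # W2, b2
        )

    def forward(self, x):
        return self.net(x)
```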
Code Snippet: This code defines an encoder consisting of six layers, each of which contains a self-attention mechanism followed by a feed-forward network (FFN). The output of the sixth layer is passed through a linear layer and a softmax activation function to produce the final output probability distribution.
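The snippet itself is not reproduced here, so below is a minimal sketch along the lines of that description, assuming PyTorch and illustrative sizes, and using nn.TransformerEncoderLayer as a stand-in for a hand-written layer:

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff, vocab_size = 512, 8, 2048, 10000   # assumed sizes

# Six identical layers, each containing multi-head self-attention followed by an FFN
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_ff, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Final projection to vocabulary probabilities, as described above
to_vocab = nn.Linear(d_model, vocab_size)

x = torch.randn(1, 4, d_model)                            # position-encoded input embeddings
hidden_states = encoder(x)                                # (1, 4, d_model)
probs = torch.softmax(to_vocab(hidden_states), dim=-1)    # (1, 4, vocab_size)
```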
4. Outputs (shifted right): During training, the decoder learns how to predict the next word by examining the preceding words. We do this by shifting the output sequence one position to the right, so that the decoder is limited to using the preceding words. The LLM is trained on a huge amount of text, which helps it learn the structure of the language it generates. The Common Crawl web corpus, the BooksCorpus dataset, and the English Wikipedia are a few of the text corpora used to train LLMs. These corpora contain billions of words and phrases, giving the LLM a wealth of linguistic information to draw upon.
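To make the “shifted right” idea concrete, here is a small sketch with hypothetical token IDs (1 standing in for <bos> and 2 for <eos>):

```python
import torch

# A hypothetical target sentence as token IDs
target = torch.tensor([1, 57, 934, 12, 2])   # [<bos>, ..., <eos>]

decoder_input = target[:-1]   # [<bos>, 57, 934, 12]  -> what the decoder sees
labels        = target[1:]    # [57, 934, 12, <eos>]  -> what it must predict

# At position i the decoder may only attend to decoder_input[:i+1];
# this is enforced with a causal (look-ahead) mask during self-attention.
```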
5. Output Embeddings: As with input embeddings, the model only understands numeric data, so the output must be converted to a numerical representation known as “output embeddings.” Like input embeddings, output embeddings also undergo positional encoding, which helps the model understand the sequential order of words in a phrase. In machine learning, the difference between a model’s predictions and the actual target values is measured using a loss function, which is especially important for complex models. Minimizing the loss drives the adjustments to the model’s parameters, which in turn improves its accuracy and overall performance. Output embeddings are used in both training and inference. During training, they are used to compute the loss function and update the model parameters. During inference, they produce the output text by mapping the model’s predicted probabilities for each token to the corresponding token in the vocabulary. The decoder, like the encoder, is made up of a stack of identical layers, each of which contains three sublayers: masked MHSA, multi-head cross-attention (MHCA) over the encoder’s output, and an FFN. The output of the final decoder layer is passed through a linear layer and softmax to compute the final output probability distribution.
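As a hedged illustration of the training side (shapes and token IDs are made up), the usual choice is a cross-entropy loss over the predicted token probabilities:

```python
import torch
import torch.nn as nn

vocab_size = 10000                                        # assumed vocabulary size
logits = torch.randn(4, vocab_size, requires_grad=True)   # decoder outputs for 4 positions
labels = torch.tensor([57, 934, 12, 2])                   # the shifted target tokens from above

loss_fn = nn.CrossEntropyLoss()   # applies log-softmax + negative log-likelihood
loss = loss_fn(logits, labels)
loss.backward()                   # gradients then drive the parameter updates
```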
6. Decoder: The decoder processes both the positionally encoded output embeddings and the positionally encoded input representation. It creates the output sequence from the encoded input sequence. During training, the decoder learns to predict the next word by studying the words that came before it. Based on the input sequence and the context the encoder has learned, the decoder produces natural language text. Like the encoder, the transformer employs multiple decoder layers.
Code Snippet:
Similar to the encoder, the decoder has multiple layers, each consisting of multi-head self-attention and multi-head encoder-decoder attention. The forward method takes target_ids (decoder inputs), memory (encoder outputs), and the appropriate attention masks as inputs.
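That snippet is likewise not reproduced here; a sketch matching the description, assuming PyTorch's nn.TransformerDecoder and the same illustrative sizes, might look like this:

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff = 512, 8, 2048   # assumed sizes

# Each layer contains masked self-attention, encoder-decoder attention, and an FFN
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_ff, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

tgt = torch.randn(1, 4, d_model)        # embedded + position-encoded target_ids
memory = torch.randn(1, 4, d_model)     # encoder outputs

# Causal mask so position i cannot attend to later positions
tgt_mask = torch.triu(torch.full((4, 4), float('-inf')), diagonal=1)

out = decoder(tgt, memory, tgt_mask=tgt_mask)   # (1, 4, d_model)
```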
7. Linear Layer and Softmax: After the decoder produces its output embeddings, the linear layer maps them into a higher-dimensional space, with one dimension per token in the vocabulary. The softmax function then converts these values into a probability distribution over the vocabulary for each output position, from which the next output token can be selected.
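A minimal sketch of that final projection (the vocabulary size is again an assumption):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000

to_vocab = nn.Linear(d_model, vocab_size)       # the final linear layer
decoder_output = torch.randn(1, 4, d_model)     # from the decoder sketch above

logits = to_vocab(decoder_output)               # (1, 4, vocab_size)
probs = torch.softmax(logits, dim=-1)           # probability distribution per position
next_token = probs[0, -1].argmax()              # e.g., a greedy choice of the next token
```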
Conclusion: In this blog post, we explored the transformer architecture and its components, including MHSA, FFN, and MHCA. We also discussed the role of transformers in large language models (LLMs) and how they have revolutionized the field of natural language processing. Understanding transformers is essential for building effective NLP models, and mastering these techniques helps in creating state-of-the-art models that can process and generate natural language text with ease.