An In-Depth View of the Transformer Architecture

Why was the transformer introduced?

For sequential tasks, the most widely used networks were RNNs. But RNNs suffer from the vanishing gradient problem, so LSTM and GRU networks were introduced to overcome it with the help of memory cells and gates. Even LSTM and GRU, however, fall short on long-term dependencies, because we are still relying on these gate/memory mechanisms to pass information from old time steps to the current one. If you don't know about LSTM and GRU, there is nothing to worry about; they are mentioned only to explain the evolution of the transformer, and this article has nothing to do with LSTM or GRU.

The transformer overcomes the long-term dependency problem by processing the whole sentence at once rather than word by word (sequentially). Each word has direct access to all the other words through a self-attention mechanism, which avoids the information loss of sequential processing. Several new NLP models that are making big changes in the AI industry, such as BERT, GPT-3, and T5, are based on the transformer architecture. The transformer was successful because it uses a special type of attention mechanism called self-attention, which we will see in depth.

Transformer Architecture

The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. The Transformer was proposed in the paper Attention Is All You Need. The figure below shows the Transformer architecture. We are going to break it down into subparts to understand it better.

This article breaks down the Transformer architecture as follows:

1. Encoder-Decoder Architecture

2. Encoder Architecture

2.1 Input Embedding and Positional Encoding

2.2 Self-Attention Mechanism

2.3 Multi-head Attention Mechanism

2.4 Feedforward Network

2.5 Add and Norm Component

3. Decoder Architecture

3.1 Masked Multi-head Attention

3.2 Multi-head Attention

3.3 Feedforward Network

3.4 Add and Norm Component

5. Putting Encoder-Decoder Together

1. Encoder-Decoder Architecture :

We feed the input sentence to the encoder. The encoder learns a representation of the input sentence using attention and a feedforward network. The decoder receives the representation learned by the encoder as input and generates the output.

For example, suppose we are building a machine translation model from English to German. Let us assume the input sentence given to the encoder is “How you doing?” and the output from the decoder should be “Wie geht’s?”. Refer to fig 2 below.

So, the raw input “How you doing” is given to the encoder, which captures the semantic meaning of the sentence in vectors, say a 100-dimensional vector for each word. The representation is then a (3, 100) matrix, where 3 is the number of words and 100 is the dimension of each word vector. That vector representation from the encoder is given to the decoder, which converts it into output in human-readable form.

The process behind this machine translation is usually a black box to us. But we will now see in detail how the encoder and decoder in the transformer convert the English sentence into the German sentence.

2. Encoder Architecture :

The Transformer does not consist of only one encoder as in fig 2. It has several encoders stacked on top of one another. The output of Encoder 1 is sent as input to Encoder 2, the output of Encoder 2 is sent as input to Encoder 3, and so on up to Encoder n, which returns the representation of the sentence “How you doing?” to the decoder as input, as shown in figure 3 below.

Each encoder block consists of 2 sublayers, Multi-head Attention and a Feed Forward Network, as shown in figure 4 above. This is the same in every encoder block: all encoder blocks have these 2 sublayers. Before diving into Multi-head Attention, the 1st sublayer, we will first see what the self-attention mechanism is.

2.1. Input Embedding and Positional Encoding

Input Embedding:

The input embedding is just an embedding layer. The embedding layer takes a sequence of words and learns a vector representation for each word. That vector representation, combined with the positional encoding, is what gets fed to the first encoder.
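As a minimal sketch, an embedding layer is just a learnable lookup table from token ids to vectors. The tiny vocabulary, the 100-dimensional size from this article's example, and the random initialization below are illustrative assumptions, not part of the original model.

```python
import numpy as np

# Toy vocabulary and embedding size (illustrative assumptions).
vocab = {"how": 0, "you": 1, "doing": 2}
d_model = 100  # embedding dimension used in this article's example

# The embedding "layer" is a learnable lookup table of shape (vocab_size, d_model).
embedding_table = np.random.randn(len(vocab), d_model) * 0.01

def embed(tokens):
    """Map a list of words to their embedding vectors: output shape (seq_len, d_model)."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

X = embed(["how", "you", "doing"])
print(X.shape)   # (3, 100) -- matches the (3, 100) input matrix described above
```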

Positional Encoding:

The position and order of words define the grammar and actual semantics of a sentence. An RNN inherently takes the order of words into account by parsing a sentence word by word. The transformer has no such recurrence, so the positional encoding block applies a function to the embedding matrix that allows the network to understand the relative position of each word vector.

The most basic idea is to create a new vector where every entry is its index number; this is absolute positional encoding. But this method is problematic because the scale of the numbers differs: if we have a sequence of 500 tokens, we end up with a 500 in our vector. In general, neural networks like their inputs and weights to hover around zero, roughly balanced between positive and negative. If not, you open yourself up to all sorts of problems, like exploding gradients and unstable training.

Instead, we create the positional encoding with the help of the sin and cos functions:
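From the paper Attention Is All You Need, the encoding for position pos and embedding dimension index i is:

PE(pos, 2i) = sin(pos / 10000^(2i/d))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where d is the embedding dimension, even dimensions use sin, and odd dimensions use cos.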

Take the sin part of the formula. Let us assume there are 5 words in the sentence, where d = 5 and (p0, p1, p2, p3, p4) are the positions of the words. Keeping i and d fixed and varying the position, we have,

If we plot a sin curve and vary “pos” (on the x-axis), we land up with different position values on the y-axis. Therefore, words at different positions will have different position embedding values.

There is a problem, though. Since the sin curve repeats in intervals, you can see in the figure above that P0 and P6 have the same position embedding values, despite being at two very different positions. This is where the “i” part of the equation comes into play.

If you vary “i” in the equation above, you get a bunch of curves with varying frequencies. Reading off the position embedding values at these different frequencies ends up giving different values at different embedding dimensions for P0 and P6.
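Here is a minimal numpy sketch of this sinusoidal encoding, using the illustrative 3-word, 100-dimensional example from this article; the function name is only for this sketch.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as defined in "Attention Is All You Need".

    Even embedding dimensions use sin, odd dimensions use cos, each with a
    different frequency, so every position gets a unique pattern of values.
    """
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 0, 2, 4, ...
    angle = positions / np.power(10000, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# The encoder input is the word embedding plus its positional encoding.
seq_len, d_model = 3, 100                          # illustrative sizes from the example
X = np.random.randn(seq_len, d_model)              # stand-in for the embedding matrix
encoder_input = X + positional_encoding(seq_len, d_model)
```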

2.2. Self-Attention Mechanism

A self-attention mechanism ensures that every word in a sentence has some knowledge of its context words. For example, take the famous sentences “The animal didn’t cross the street because it was too long” and “The animal didn’t cross the street because it was too tired”. In sentence 1, “it” refers to “street”, not “animal”, and in sentence 2, “it” refers to “animal”, not “street”.

So “it” depends entirely on the words “long” and “tired”: the word “long” points to “street” and “tired” points to “animal”. How do we make the model understand this? This is where the self-attention mechanism comes in. Self-attention makes sure each word is related to all the other words in the sentence.

Let us take the sentence “How you doing” again and let the word embedding of each word be of 100 dimensions. Then the input matrix X has dimensions [3, 100], where 3 is the number of words and 100 is the dimension of each word vector.

The self-attention mechanism learns by using Query (Q), Key (K), and Value (V) matrices. These Query, Key, and Value matrices are created by multiplying the input matrix X by the weight matrices WQ, WK, and WV. The weight matrices WQ, WK, and WV are randomly initialized, and their optimal values are learned during training.

This is how we compute the Query, Key, and Value matrices. Next we will see how Q, K, and V are used in the self-attention mechanism, which includes four steps.

Step 1 :

Compute the dot product between the Query and Key matrices.

Thus, computing the dot product between the Query matrix Q and the transpose of the Key matrix, KT, essentially gives us a similarity score, which helps us understand how similar each word in the sentence is to all the other words.

Step 2 :

The 2nd step of the self-attention mechanism is to divide the Q.KT matrix by the square root of the dimension of the Key vector. We do this to obtain a stable gradient.

Q.KT / √dk

Step 3 :

We convert Q.KT / √dk from its unnormalized form to a normalized form by applying the softmax function, which brings each score into the range 0 to 1 and makes the scores in each row sum to 1.

Score matrix = softmax(Q.KT / √dk)

Step 4 :

We compute the attention matrix Z by multiplying the score matrix by the Value matrix.

Thus, the value of Z(How) will contain 98% of the value from the value vector (How), 1% from the value vector (you), and 1% from the value vector (doing). Refer to fig 9 above.

Likewise, in the example “The animal didn’t cross the street because it was too long”, the value of Z(it) can be computed by the 4 steps mentioned above. Then Z(it) will be:

The self-attention value of the word “it” contains 81% of the value from the value vector V6 (street). This helps the model understand that the word “it” actually refers to “street” and not “animal” in the above sentence. Thus, by using the self-attention mechanism, we can understand how a word is related to all the other words in the sentence.
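To make the four steps concrete, here is a minimal numpy sketch of self-attention. The 64-dimensional Q/K/V projection and the random weights are illustrative assumptions; in a real model the weight matrices are learned during training.

```python
import numpy as np

def softmax(x):
    # Softmax along the last axis, with max-subtraction for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: the four steps described above."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # create Query, Key, Value matrices
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # steps 1 and 2: similarity scores, scaled
    weights = softmax(scores)                  # step 3: normalize each row to sum to 1
    return weights @ V                         # step 4: attention matrix Z

# Toy example for "How you doing": 3 words, 100-dim embeddings, 64-dim Q/K/V.
X = np.random.randn(3, 100)
W_q, W_k, W_v = (np.random.randn(100, 64) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)   # (3, 64): one attention vector per word
```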

2.3. Multi-head Attention Mechanism

Instead of computing a single attention matrix, we compute multiple attention matrices and concatenate their results. By using several heads, the model can attend to different relationships between words instead of relying on a single attention pattern, which makes the attention model more accurate.

So for the phrase “How you doing”, we compute the first attention matrix by creating Query (Q1), Key (K1), and Value (V1) matrices, obtained by multiplying the input matrix X by the first head's weight matrices W1Q, W1K, and W1V. Then our first attention matrix will be,

Z1 = Softmax(Q1.K1T / √dk) . V1

Then, we compute the second attention matrix by creating Query (Q2), Key (K2), and Value (V2) matrices, obtained by multiplying the input matrix X by the second head's weight matrices W2Q, W2K, and W2V. Then our second attention matrix will be,

Z2 = Softmax(Q2.K2T / √dk) . V2

Likewise, we compute n attention matrices (Z1, Z2, …, Zn) and then concatenate all of them. So our multi-head attention output is:

Multi-head attention= Concatenate(Z1,Z2,….Zn)*W0

Where W0 is the weight matrix.
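Below is a minimal numpy sketch of multi-head attention, assuming 8 heads of dimension 64 and a model dimension of 512 (the sizes used in the original paper). The helper names and random weights are illustrative stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, as in section 2.2.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    """Each head i has its own (W_qi, W_ki, W_vi); concatenate all Z_i and project with W_o."""
    Z = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(Z, axis=-1) @ W_o

d_model, d_head, n_heads = 512, 64, 8
X = np.random.randn(3, d_model)                    # 3 words, illustrative model dimension
heads = [tuple(np.random.randn(d_model, d_head) for _ in range(3)) for _ in range(n_heads)]
W_o = np.random.randn(n_heads * d_head, d_model)   # the W0 projection from the formula above
print(multi_head_attention(X, heads, W_o).shape)   # (3, 512)
```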

2.4. Feedforward Network

The feedforward network layer consists of two dense layers with a ReLU activation in between. It is applied to every attention vector independently, so that the output is in a form that is acceptable to the next encoder's and decoder's attention layers.
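A minimal numpy sketch of this sublayer, assuming the 512/2048 layer sizes used in the paper; the weights here are random stand-ins for learned parameters.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: two dense layers with a ReLU in between,
    applied independently to each position's attention vector."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
out = feed_forward(np.random.randn(3, d_model), W1, b1, W2, b2)   # shape (3, 512)
```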

2.5. Add and Norm component

The Add and Norm component is basically a residual connection followed by layer normalization. It connects the input and output of each sublayer.

It connects the input of the multi-head attention sublayer to its output, and then connects the input of the feedforward sublayer to its output.
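A minimal numpy sketch of this component; the learned scale and shift parameters of layer normalization are omitted for brevity, and the commented usage assumes the attention and feed-forward sketches above.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    # (learned scale and shift parameters omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(sublayer_input + sublayer_output)

# Usage inside one encoder block (using the sketches from the previous sections):
# x1 = add_and_norm(x, multi_head_attention(x, heads, W_o))
# x2 = add_and_norm(x1, feed_forward(x1, W1, b1, W2, b2))
```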

3. Decoder Architecture

The decoder is also a stack of decoder units. Each unit takes the encoder's representation as one input and the output of the previous decoder unit as the other, so each decoder receives two inputs. With these, it predicts an output at each time step t.

The GIF above, from Jay Alammar's blog, illustrates this process.

The decoder takes <sos> as the first token. At time step t=2, the decoder receives two inputs: one is the prediction from the previous time step and the other is the encoder representation, and with these it predicts “am”. At time step t=3, the decoder again receives the previous output and the encoder representation, and predicts “a”. Likewise, it keeps predicting until it reaches the end token <eos>.

3.1. Masked Multi-head Attention

The input must be given to the bottom (first) decoder. Instead of feeding the input directly to the decoder, we convert it into an output embedding, add positional encoding, and feed that to the decoder.

X = Output embedding + positional encoding

X is given as input to the first decoder. Now we create Query (Q), Key (K), and Value (V) matrices by multiplying X by the weight matrices WQ, WK, and WV, as we did in the encoders.

Then we compute Qi.KiT / √dk, which is equal to the matrix given below.

Before normalizing the matrix above, we need to mask the words to the right of each target word with -∞, so that only the previous words in the sentence are used and the later words are hidden. This is what allows the transformer to learn to predict the next word.

We then apply softmax to the masked Qi.KiT / √dk matrix and multiply by the Value matrix to get the output of this masked attention block, which is added and normalized before being passed to the next attention block.
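A minimal numpy sketch of the masking step, assuming a toy 4-token score matrix; the function name is just for illustration.

```python
import numpy as np

def masked_scores(scores):
    """Set every entry above the diagonal (words to the right of the target word)
    to -inf before softmax, so those positions receive zero attention weight."""
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal
    return np.where(mask, -np.inf, scores)

scores = np.random.randn(4, 4)          # a toy Q.K^T / sqrt(dk) matrix for 4 target tokens
print(masked_scores(scores))            # upper triangle is -inf; softmax will zero it out
```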

3.2. Multi-head Attention

Each decoder receives two inputs: one is from the previous sublayer, masked multi-head attention, and the other is the encoder representation.

Let's denote the encoder representation by R and the attention matrix obtained from the masked multi-head attention sublayer by M. Since this layer has an interaction between the encoder and decoder, it is called the encoder-decoder attention layer.

The Query matrix essentially holds the representation of the target sentence, since it is obtained from M, while the Key and Value matrices hold the representation of the source sentence, since they are obtained from R.

Then we calculate the score matrices as we did in the encoders, but this time the Q matrix and the K and V matrices come from two different sources.

Z1 = Softmax(Q1.K1T / √dk) . V1 , Z2 = Softmax(Q2.K2T / √dk) . V2 , …

Then we calculate multi-head attention using Z1, Z2, …, Zn from above.

Multi-head attention = Concatenate(Z1,Z2,….Zn)*W0
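A minimal numpy sketch of this encoder-decoder (cross) attention for a single head, with illustrative sizes; M and R stand for the masked-attention output and the encoder representation defined above, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(M, R, W_q, W_k, W_v):
    """Cross-attention: Query comes from the decoder's masked-attention output M,
    while Key and Value come from the encoder representation R."""
    Q, K, V = M @ W_q, R @ W_k, R @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

# 2 target-side tokens attending over 3 source-side tokens (illustrative sizes).
d_model, d_head = 512, 64
M, R = np.random.randn(2, d_model), np.random.randn(3, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
print(encoder_decoder_attention(M, R, W_q, W_k, W_v).shape)   # (2, 64)
```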

3.3. Feed-Forward Network and Add and Norm

These work the same way as in the encoders.

4. Linear and Softmax layer

The decoder learns the representation of the target sentence (or target class, depending on the problem). We feed the representation from the topmost decoder to the Linear and Softmax layers.

The linear layer generates logits whose size is equal to the vocabulary size. Suppose our vocabulary has only 3 words: “How”, “you”, “doing”. Then the logits returned by the linear layer will be of size 3. We then convert the logits into probabilities using the softmax function, and the decoder outputs the word whose index has the highest probability value.
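A minimal numpy sketch of this final step, assuming the toy 3-word vocabulary above and an illustrative 512-dimensional decoder output; the projection matrix would be learned in practice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

vocab = ["How", "you", "doing"]                    # toy 3-word vocabulary from the example
d_model = 512
W_vocab = np.random.randn(d_model, len(vocab))     # linear layer projecting to vocab-size logits

decoder_output = np.random.randn(d_model)          # representation for the current time step
logits = decoder_output @ W_vocab                  # one score per word in the vocabulary
probs = softmax(logits)                            # convert logits to probabilities
print(vocab[int(np.argmax(probs))])                # output the word with the highest probability
```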

5. Putting Encoder-Decoder together

References:

1. Inspired by the book Getting Started with Google BERT by Sudharsan Ravichandiran.

2. "Visualizing A Neural Machine Translation Model" blog by Jay Alammar.

3. "Illustrated Guide to Transformer" blog.

4. Positional encoding blog.

5. StackExchange thread on positional encoding in the transformer.

6. Colah's blog.
