Transformer
Positional Encoding
Absolute Positional Encoding (APE)
Consider an input feature matrix \(\rmX \in \sR^{n\times d}\), where \(d\) is the dimension of each token's feature vector and \(n\) is the number of tokens in the input sequence. For any position \(t\) in the input sequence, the corresponding positional encoding \(\rvp_t \in \sR^d\) is computed as follows:
\[ \rvp_{t, 2i} = \sin\left(\omega_i t\right), \qquad \rvp_{t, 2i+1} = \cos\left(\omega_i t\right), \qquad i = 0, \dots, d/2 - 1 \]
Here \(\rvp_{t, 2i}\) represents the \(2i\)-th element of the positional encoding vector \(\rvp_t\). The frequency \(\omega_i\) is computed as follows:
\[ \omega_i = \frac{1}{10000^{2i/d}} \]
We can also write the positional encoding as a vector of interleaved sine/cosine pairs:
\[ \rvp_t = \begin{bmatrix} \sin(\omega_0 t) & \cos(\omega_0 t) & \sin(\omega_1 t) & \cos(\omega_1 t) & \cdots & \sin(\omega_{d/2-1} t) & \cos(\omega_{d/2-1} t) \end{bmatrix}^T \]
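As a concrete illustration, here is a minimal NumPy sketch of the sinusoidal encoding above; the function name `sinusoidal_positional_encoding` and the use of NumPy are my own choices, not from any reference implementation.

```python
import numpy as np

def sinusoidal_positional_encoding(n: int, d: int) -> np.ndarray:
    """Return the (n, d) matrix whose t-th row is p_t (d must be even)."""
    t = np.arange(n)[:, None]                # positions t = 0 .. n-1, shape (n, 1)
    i = np.arange(d // 2)[None, :]           # frequency index i = 0 .. d/2-1
    omega = 1.0 / (10000 ** (2 * i / d))     # omega_i = 1 / 10000^(2i/d)
    pe = np.empty((n, d))
    pe[:, 0::2] = np.sin(omega * t)          # even dims: sin(omega_i * t)
    pe[:, 1::2] = np.cos(omega * t)          # odd dims:  cos(omega_i * t)
    return pe

pe = sinusoidal_positional_encoding(n=128, d=64)
print(pe.shape)  # (128, 64)
```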
Nice Properties
- Dot product of two positional encodings depends only on \(\Delta t\):
  \[ \begin{aligned} \rvp_t^T \rvp_{t+\Delta t} &= \sum_{i=0}^{d/2-1} \sin\left(\omega_i t\right) \sin\left(\omega_i (t+\Delta t)\right) + \cos\left(\omega_i t\right) \cos\left(\omega_i (t+\Delta t)\right) \\ &= \sum_{i=0}^{d/2-1} \cos\left(\omega_i \Delta t\right) \end{aligned} \]
- The dot product is symmetric in the offset, \(\rvp_t^T \rvp_{t+\Delta t} = \rvp_t^T \rvp_{t-\Delta t}\), so it cannot distinguish a forward offset from a backward one.
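Both properties are easy to verify numerically. The sketch below (with arbitrarily chosen \(d = 64\) and offset \(\Delta t = 7\)) rebuilds the encoding and checks that the dot product is independent of \(t\) and symmetric in \(\Delta t\).

```python
import numpy as np

d, n = 64, 256
t = np.arange(n)[:, None]
omega = 1.0 / (10000 ** (2 * np.arange(d // 2)[None, :] / d))
pe = np.empty((n, d))
pe[:, 0::2], pe[:, 1::2] = np.sin(omega * t), np.cos(omega * t)

dt = 7
# Same Delta t, different t: the dot products coincide.
print(np.isclose(pe[10] @ pe[10 + dt], pe[100] @ pe[100 + dt]))    # True
# Symmetry: p_t . p_{t+dt} == p_t . p_{t-dt}
print(np.isclose(pe[50] @ pe[50 + dt], pe[50] @ pe[50 - dt]))      # True
# Both equal sum_i cos(omega_i * dt)
print(np.isclose(pe[50] @ pe[50 + dt], np.cos(omega * dt).sum()))  # True
```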
Relative Positional Encoding (RPE)
Relative positional encoding is applied to the self-attention inputs \(\rvq, \rvk, \rvv\) by making them functions of both the token embedding and its position:
\[ \rvq_m = f_q(\rvx_m, m), \qquad \rvk_n = f_k(\rvx_n, n), \qquad \rvv_n = f_v(\rvx_n, n) \]
Let us assume that the inner product of \(\rvq_m\) and \(\rvk_n\) can be represented as a function \(g\) that takes \(\rvx_m\), \(\rvx_n\), and the relative position \(m-n\) as input:
\[ \langle f_q(\rvx_m, m),\, f_k(\rvx_n, n) \rangle = g(\rvx_m, \rvx_n, m-n) \]
We know that the rotation matrix \(\rmR_m = \begin{bmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{bmatrix}\) (a 2D rotation by angle \(m\theta\)) naturally satisfies the above property:
\[ (\rmR_m \rvq)^T (\rmR_n \rvk) = \rvq^T \rmR_m^T \rmR_n \rvk = \rvq^T \rmR_{n-m} \rvk \]
Thus, we can use rotation matrices to encode relative position. For a \(d\)-dimensional vector \(\rvx\) (with \(d\) even), the rotation is applied block-wise to consecutive pairs of dimensions, each pair with its own frequency \(\theta_i\):
\[ \rmR_m = \begin{bmatrix} \cos m\theta_0 & -\sin m\theta_0 & \cdots & 0 & 0 \\ \sin m\theta_0 & \cos m\theta_0 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\ 0 & 0 & \cdots & \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1} \end{bmatrix}, \qquad \theta_i = \frac{1}{10000^{2i/d}} \]
Here we have \(\rmR_{n-m} = \rmR_{m}^T \rmR_{n}\). Note that since \(\rmR_m\) is sparse (block-diagonal with \(2\times 2\) blocks), we can apply it in \(O(d)\) time using element-wise products instead of a full matrix multiplication:
\[ \rmR_m \rvx = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{bmatrix} \odot \begin{bmatrix} \cos m\theta_0 \\ \cos m\theta_0 \\ \cos m\theta_1 \\ \cos m\theta_1 \\ \vdots \\ \cos m\theta_{d/2-1} \\ \cos m\theta_{d/2-1} \end{bmatrix} + \begin{bmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{bmatrix} \odot \begin{bmatrix} \sin m\theta_0 \\ \sin m\theta_0 \\ \sin m\theta_1 \\ \sin m\theta_1 \\ \vdots \\ \sin m\theta_{d/2-1} \\ \sin m\theta_{d/2-1} \end{bmatrix} \]
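A minimal NumPy sketch of this \(O(d)\) element-wise rotation, in the rotary (RoPE-style) formulation; the helper name `apply_rotary` and the pairing of dimensions \((2i, 2i+1)\) are my own choices for illustration.

```python
import numpy as np

def apply_rotary(x: np.ndarray, m: int) -> np.ndarray:
    """Rotate vector x (even length d) by position m: returns R_m @ x in O(d)."""
    d = x.shape[-1]
    theta = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))   # theta_i = 1 / 10000^(2i/d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x_even, x_odd = x[0::2], x[1::2]                       # the (2i, 2i+1) pairs
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin                 # rotate each 2-D pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
m, n = 3, 10
score = apply_rotary(q, m) @ apply_rotary(k, n)
# Shifting both positions by the same offset leaves the score unchanged,
# i.e. the attention score depends only on the relative position n - m.
print(np.isclose(score, apply_rotary(q, m + 5) @ apply_rotary(k, n + 5)))  # True
```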
Self-attention
Scaled Dot-Product Attention
Given queries, keys, and values stacked into matrices \(\rmQ, \rmK, \rmV \in \sR^{n\times d}\), where each row is a query vector \(\rvq \in \sR^{d}\), key vector \(\rvk \in \sR^{d}\), or value vector \(\rvv \in \sR^{d}\), the scaled dot-product attention is computed as follows:
\[ \mathrm{Attention}(\rmQ, \rmK, \rmV) = \mathrm{softmax}\left(\frac{\rmQ \rmK^T}{\sqrt{d}}\right) \rmV \]
Why divide by \(\sqrt{d}\)? The dot product between two independent random vectors \(\rvq, \rvk \in \mathbb{R}^d\), with entries having zero mean and unit variance, is:
\[ \rvq^T \rvk = \sum_{i=1}^{d} q_i k_i \]
Each term \(q_i k_i\) has zero mean and unit variance, so the variance of the dot product is \(d\). Dividing by \(\sqrt{d}\) rescales it back to unit variance, which keeps the softmax inputs in a range where it does not saturate.
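A minimal NumPy sketch of scaled dot-product attention as defined above (the row-wise `softmax` helper is my own):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n, d) arrays; returns the (n, d) array of attention outputs."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # (n, n) scaled similarity scores
    return softmax(scores, axis=-1) @ V         # weighted average of the value rows

rng = np.random.default_rng(0)
n, d = 5, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 16)
```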
Multi-Head Attention
Multi-head attention runs several scaled dot-product attentions in parallel, each on its own learned projection of the queries, keys, and values; the head outputs are concatenated and passed through a final output projection.
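A sketch of this with randomly initialized projection matrices; the names `W_q`, `W_k`, `W_v`, `W_o`, the per-head scaling by \(\sqrt{d/h}\), and the reshape-based head split are illustrative assumptions, not tied to any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d); W_q/W_k/W_v/W_o: (d, d); h heads, each of size d // h."""
    n, d = X.shape
    d_h = d // h
    def split(Y):                                       # (n, d) -> (h, n, d_h)
        return Y.reshape(n, h, d_h).transpose(1, 0, 2)
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)    # (h, n, n), one attention map per head
    heads = softmax(scores) @ V                         # (h, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)     # concatenate the head outputs
    return concat @ W_o                                 # final output projection

rng = np.random.default_rng(0)
n, d, h = 6, 32, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (6, 32)
```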
Encoder-Decoder Attention
The encoder provides \(\rvk\) and \(\rvv\) to the decoder, and the decoder provides \(\rvq\); attention is then computed exactly as above, so every decoder position can attend to every encoder position.
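The only change from self-attention is where the inputs come from; the shape check below (illustrative variable names, and in practice \(\rmK\), \(\rmV\), \(\rmQ\) would be learned projections rather than the raw states) makes the resulting \(n_{\text{dec}} \times n_{\text{enc}}\) attention map explicit.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_enc, n_dec = 16, 10, 4
enc = rng.normal(size=(n_enc, d))       # encoder output
dec = rng.normal(size=(n_dec, d))       # decoder hidden states
K, V = enc, enc                         # keys/values come from the encoder
Q = dec                                 # queries come from the decoder
attn = softmax(Q @ K.T / np.sqrt(d))    # (n_dec, n_enc): decoder tokens attend to encoder tokens
out = attn @ V                          # (n_dec, d)
print(attn.shape, out.shape)            # (4, 10) (4, 16)
```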
Add Normalization
Batch Normalization (BN)
Given a mini-batch of data \(\rmX\in\sR^{n\times d}\) with rows \(\rvx_i\), batch normalization normalizes each feature across the batch dimension (all operations below are element-wise):
\[ \boldsymbol{\mu} = \frac{1}{n} \sum_{i=1}^{n} \rvx_i, \qquad \boldsymbol{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (\rvx_i - \boldsymbol{\mu})^2, \qquad \mathrm{BN}(\rvx_i) = \boldsymbol{\gamma} \odot \frac{\rvx_i - \boldsymbol{\mu}}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}} + \boldsymbol{\beta} \]
Here \(\boldsymbol{\gamma}\) is the re-scaling parameter and \(\boldsymbol{\beta}\) is the re-shifting parameter; together they can map the normalized data to any desired mean and scale. The \(\epsilon\) is a small constant that avoids division by zero. At test time, we use the moving averages of \(\boldsymbol{\mu}\) and \(\boldsymbol{\sigma}^2\) accumulated during training to normalize the data.
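A minimal sketch of training-time batch normalization over \(\rmX \in \sR^{n\times d}\); the exponential-moving-average update and the `momentum` value are common conventions assumed here, not prescribed by the text.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, run_mu, run_var, momentum=0.9, eps=1e-5):
    """X: (n, d). Normalize each feature over the batch dimension (axis 0)."""
    mu = X.mean(axis=0)                          # per-feature batch mean, shape (d,)
    var = X.var(axis=0)                          # per-feature batch variance, shape (d,)
    X_hat = (X - mu) / np.sqrt(var + eps)        # normalize to zero mean, unit variance
    # Update running statistics for use at test time.
    run_mu = momentum * run_mu + (1 - momentum) * mu
    run_var = momentum * run_var + (1 - momentum) * var
    return gamma * X_hat + beta, run_mu, run_var

rng = np.random.default_rng(0)
n, d = 32, 8
X = rng.normal(loc=3.0, scale=2.0, size=(n, d))
Y, run_mu, run_var = batch_norm_train(X, np.ones(d), np.zeros(d), np.zeros(d), np.ones(d))
print(Y.mean(axis=0).round(6), Y.std(axis=0).round(3))   # approx 0 and 1 per feature
```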
Layer Normalization (LN)
Layer normalization is similar to batch normalization, but it normalizes each sample across the feature dimension instead of across the batch:
\[ \mu_i = \frac{1}{d} \sum_{j=1}^{d} X_{ij}, \qquad \sigma_i^2 = \frac{1}{d} \sum_{j=1}^{d} (X_{ij} - \mu_i)^2, \qquad \mathrm{LN}(\rvx_i) = \boldsymbol{\gamma} \odot \frac{\rvx_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \boldsymbol{\beta} \]
Because the statistics are computed per sample, no running averages are needed at test time, which is why layer normalization works well for variable-length sequences.
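For comparison with the batch-norm sketch above, here is a layer-norm sketch that normalizes each row (token) over its \(d\) features:

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """X: (n, d). Normalize each sample over the feature dimension (axis -1)."""
    mu = X.mean(axis=-1, keepdims=True)          # per-token mean, shape (n, 1)
    var = X.var(axis=-1, keepdims=True)          # per-token variance, shape (n, 1)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(4, 8))
Y = layer_norm(X, np.ones(8), np.zeros(8))
print(Y.mean(axis=-1).round(6))                  # approx 0 for every token
```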
Add & Norm
The add & norm layer is computed as follows:
- Post-Layer Normalization (original implementation):
  \[ \rmY = \mathrm{LayerNorm}(\rmX + \mathrm{FFN}(\rmX)) \]
- Pre-Layer Normalization:
  \[ \rmY = \rmX + \mathrm{FFN}(\mathrm{LayerNorm}(\rmX)) \]
Here \(\mathrm{FFN}\) is a position-wise feed-forward network; the same add & norm pattern wraps the attention sub-layer as well. With pre-layer normalization, training is more stable, so the learning-rate warm-up is less important than it is with post-layer normalization.
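A minimal sketch contrasting the two arrangements. The learnable scale/shift of LayerNorm is omitted for brevity, and the two-layer ReLU FFN is an assumed standard form, not something specified above.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with a ReLU in between."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def post_ln_block(x, params):
    return layer_norm(x + ffn(x, *params))       # Y = LayerNorm(X + FFN(X))

def pre_ln_block(x, params):
    return x + ffn(layer_norm(x), *params)       # Y = X + FFN(LayerNorm(X))

rng = np.random.default_rng(0)
d, d_ff, n = 16, 64, 4
params = (rng.normal(size=(d, d_ff)) / np.sqrt(d), np.zeros(d_ff),
          rng.normal(size=(d_ff, d)) / np.sqrt(d_ff), np.zeros(d))
X = rng.normal(size=(n, d))
print(post_ln_block(X, params).shape, pre_ln_block(X, params).shape)  # (4, 16) (4, 16)
```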