Transformer

Positional Encoding

Absolute Positional Encoding (APE)

Consider an input \(\rmX \in \sR^{n\times d}\), where \(d\) is the dimension of each token's feature vector and \(n\) is the number of tokens in the input sequence. For any position \(t\) in the input sequence, the corresponding positional encoding \(\rvp_t \in \sR^d\) is computed as follows:

\[ \begin{cases} \rvp_{t, 2i} = \sin\left(\omega_i t\right) \\ \rvp_{t, 2i+1} = \cos\left(\omega_i t\right) \\ \end{cases} \]

Here \(\rvp_{t, 2i}\) represents the \(2i\)-th element of the positional encoding vector \(\rvp_t\). The frequency \(\omega_i\) is computed as follows:

\[ \omega_i = \frac{1}{10000^{2i/d}} \]

We can also write the positional encoding as follows:

\[ \rvp_t = \begin{bmatrix} \sin\left(\omega_0 t\right) \\ \cos\left(\omega_0 t\right) \\ \vdotswithin{\sin\left(\omega_0 t\right)} \\ \sin\left(\omega_{d/2-1} t\right) \\ \cos\left(\omega_{d/2-1} t\right) \\ \end{bmatrix} \]
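
As a concrete reference, here is a minimal NumPy sketch that builds this encoding for a whole sequence (the function name `sinusoidal_encoding` is just illustrative):

```python
import numpy as np

def sinusoidal_encoding(n: int, d: int) -> np.ndarray:
    """Return the (n, d) matrix whose t-th row is p_t; d is assumed even."""
    t = np.arange(n)[:, None]                # positions 0 .. n-1
    i = np.arange(d // 2)[None, :]           # frequency index 0 .. d/2-1
    omega = 1.0 / (10000 ** (2 * i / d))     # omega_i = 1 / 10000^(2i/d)
    pe = np.empty((n, d))
    pe[:, 0::2] = np.sin(omega * t)          # even dimensions: sin(omega_i t)
    pe[:, 1::2] = np.cos(omega * t)          # odd dimensions:  cos(omega_i t)
    return pe
```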

Nice Properties

  1. The dot product of two positional encodings depends only on \(\Delta t\) (a numerical check appears after this list):

    \[ \begin{aligned} \rvp_t^T \rvp_{t+\Delta t} &= \sum_{i=0}^{d/2-1} \sin\left(\omega_i t\right) \sin\left(\omega_i (t+\Delta t)\right) + \cos\left(\omega_i t\right) \cos\left(\omega_i (t+\Delta t)\right) \\ &= \sum_{i=0}^{d/2-1} \cos\left(\omega_i \Delta t\right) \\ \end{aligned} \]
  2. \(\rvp_t^T \rvp_{t+\Delta t} = \rvp_t^T \rvp_{t-\Delta t}\)
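
Both properties can be verified numerically; a small sanity check, reusing the `sinusoidal_encoding` sketch from above:

```python
import numpy as np

pe = sinusoidal_encoding(n=128, d=64)
t, dt = 50, 7

# Property 1: p_t . p_{t+dt} equals sum_i cos(omega_i * dt), independent of t.
omega = 1.0 / (10000 ** (2 * np.arange(32) / 64))
assert np.allclose(pe[t] @ pe[t + dt], np.cos(omega * dt).sum())

# Property 2: the dot product is symmetric in +dt and -dt.
assert np.allclose(pe[t] @ pe[t + dt], pe[t] @ pe[t - dt])
```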

Relative Positional Encoding (RPE)

Relative positional encoding is applied inside self-attention, i.e., to the query, key, and value projections \(\rvq,\rvk,\rvv\):

\[ \begin{aligned} \rvq_m &= f_q(\rvx_m, m) \\ \rvk_n &= f_k(\rvx_n, n) \\ \rvv_n &= f_v(\rvx_n, n) \\ \end{aligned} \]

Let us assume that the inner product of \(\rvq_m\) and \(\rvk_n\) can be written as a function \(g\) of \(\rvx_m\), \(\rvx_n\), and the relative position \(m-n\):

\[ \langle f_q(\rvx_m, m), f_k(\rvx_n, n)\rangle = g(\rvx_m,\rvx_n,m-n) \]

A rotation matrix \(\rmR\) naturally satisfies this property:

\[ \begin{aligned} \langle \rmR_m\rvx_m, \rmR_n\rvx_n\rangle &= \rvx_m^T \rmR_m^T \rmR_n \rvx_n \\ &= \rvx_m^T \rmR_{n-m} \rvx_n \\ &=\langle \rvx_m, \rmR_{n-m}\rvx_n\rangle \end{aligned} \]

Thus, we can use rotation matrices to encode relative positions. For a high-dimensional vector \(\rvx\), the rotation matrix takes a block-diagonal form:

\[ \rmR_{t} = \begin{bmatrix} \cos\left(\omega_0 t\right) & -\sin\left(\omega_0 t\right) & 0 & 0 & \cdots & 0 & 0 \\ \sin\left(\omega_0 t\right) & \cos\left(\omega_0 t\right) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos\left(\omega_1 t\right) & -\sin\left(\omega_1 t\right) & \cdots & 0 & 0 \\ 0 & 0 & \sin\left(\omega_1 t\right) & \cos\left(\omega_1 t\right) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos\left(\omega_{d/2-1} t\right) & -\sin\left(\omega_{d/2-1} t\right) \\ 0 & 0 & 0 & 0 & \cdots & \sin\left(\omega_{d/2-1} t\right) & \cos\left(\omega_{d/2-1} t\right) \\ \end{bmatrix} \]

Here we have \(\rmR_{n-m} = \rmR_{m}^T \rmR_{n}\). Since \(\rmR_t\) is sparse, the product \(\rmR_t\rvx\) can be computed in \(O(d)\) time:

\[ \rmR_t \rvx = \begin{bmatrix} x_0\\ x_1\\ \vdots\\ x_{d-2} \\ x_{d-1} \end{bmatrix} \odot \begin{bmatrix} \cos\left(\omega_0 t\right)\\ \cos\left(\omega_0 t\right)\\ \vdots\\ \cos\left(\omega_{d/2-1} t\right)\\ \cos\left(\omega_{d/2-1} t\right) \end{bmatrix} + \begin{bmatrix} -x_1\\ x_0\\ \vdots\\ -x_{d-1} \\ x_{d-2} \end{bmatrix} \odot \begin{bmatrix} \sin\left(\omega_0 t\right)\\ \sin\left(\omega_0 t\right)\\ \vdots\\ \sin\left(\omega_{d/2-1} t\right)\\ \sin\left(\omega_{d/2-1} t\right) \end{bmatrix} \]
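
A minimal sketch of this \(O(d)\) computation, together with a check that the inner product of two rotated vectors depends only on the position offset (the function name `apply_rotary` is illustrative):

```python
import numpy as np

def apply_rotary(x: np.ndarray, t: int) -> np.ndarray:
    """Compute R_t x for a vector x of even dimension d, in O(d) time."""
    d = x.shape[-1]
    omega = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))
    cos = np.repeat(np.cos(omega * t), 2)    # [cos w_0 t, cos w_0 t, cos w_1 t, ...]
    sin = np.repeat(np.sin(omega * t), 2)    # [sin w_0 t, sin w_0 t, sin w_1 t, ...]
    x_rot = np.empty_like(x)
    x_rot[0::2] = -x[1::2]                   # (-x_1, -x_3, ...)
    x_rot[1::2] = x[0::2]                    # ( x_0,  x_2, ...)
    return x * cos + x_rot * sin

# The inner product of two rotated vectors depends only on the offset n - m.
q, k = np.random.randn(64), np.random.randn(64)
a = apply_rotary(q, 3) @ apply_rotary(k, 10)    # offset 7
b = apply_rotary(q, 20) @ apply_rotary(k, 27)   # offset 7
assert np.allclose(a, b)
```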

Self-attention

Scaled Dot-Product Attention

Given a query matrix \(\rmQ \in \sR^{n\times d}\), a key matrix \(\rmK \in \sR^{n\times d}\), and a value matrix \(\rmV \in \sR^{n\times d}\), the scaled dot-product attention is computed as follows (with the softmax applied row-wise):

\[ \rmZ = \mathrm{softmax}\left(\frac{\rmQ\rmK^T}{\sqrt{d}}\right)\rmV \]
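
A minimal NumPy sketch of this computation (masking is omitted and the function name is illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d), K: (m, d), V: (m, d) -> output of shape (n, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, m) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```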

Why divide by \(\sqrt{d}\)? Each attention score is a dot product between a query row \(\rvq\) and a key row \(\rvk\). For two independent random vectors \(\rvq, \rvk\in\sR^{d}\) with zero-mean, unit-variance entries, the dot product is:

\[ \rvq^T\rvk = \sum_{i=1}^d q_i k_i \]

Since the terms \(q_i k_i\) are independent with zero mean and unit variance, the variance of the dot product is \(d\). Dividing by \(\sqrt{d}\) rescales the scores to unit variance, which keeps the softmax inputs in a moderate range and prevents it from saturating.
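
The effect of the scaling can be illustrated empirically; a small sketch sampling i.i.d. standard-normal vectors:

```python
import numpy as np

d, trials = 256, 20_000
q = np.random.randn(trials, d)
k = np.random.randn(trials, d)
dots = (q * k).sum(axis=-1)          # one dot product per trial

print(dots.var())                    # roughly d (= 256)
print((dots / np.sqrt(d)).var())     # roughly 1 after scaling
```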

Multi-Head Attention

Multi-head attention runs several scaled dot-product attentions in parallel, each with its own learned projections of the queries, keys, and values, then concatenates the per-head outputs and applies an output projection.
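
A sketch of multi-head attention built on top of the `scaled_dot_product_attention` function above (the projection matrices stand in for learned weights, and splitting the channels after one joint projection is one common implementation choice):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (n, d); Wq, Wk, Wv, Wo: (d, d); num_heads must divide d."""
    n, d = X.shape
    dh = d // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)       # this head's slice of channels
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo  # concatenate heads, then project
```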

Encoder-Decoder Attention

The encoder provides \(\rvk\) and \(\rvv\) to the decoder, while the decoder provides \(\rvq\); attention is then computed as usual (this is cross-attention rather than self-attention).
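
In code, this is the same scaled dot-product attention with queries taken from the decoder and keys/values taken from the encoder; a small sketch with illustrative shapes, reusing the function above:

```python
import numpy as np

enc_out = np.random.randn(20, 64)   # encoder output: 20 source tokens, d = 64
dec_in  = np.random.randn(10, 64)   # decoder states: 10 target tokens

# Queries come from the decoder; keys and values come from the encoder.
context = scaled_dot_product_attention(Q=dec_in, K=enc_out, V=enc_out)
print(context.shape)                # (10, 64): one context vector per target token
```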

Add & Normalization

Batch Normalization (BN)

Given a mini-batch of data \(\rmX\in\sR^{n\times d}\), the batch normalization is computed as follows:

\[ \begin{aligned} \boldsymbol{\mu} &= \frac{1}{n}\sum_{i=1}^n \rmX_{i:} \\ \boldsymbol{\sigma}^2 &= \frac{1}{n}\sum_{i=1}^n (\rmX_{i:} - \boldsymbol{\mu})^2 \\ \rmY &= \frac{\rmX - \boldsymbol{\mu}}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}} \boldsymbol{\gamma} + \boldsymbol{\beta} \end{aligned} \]

Here \(\boldsymbol{\gamma}\) is a learnable scaling parameter and \(\boldsymbol{\beta}\) is a learnable shifting parameter; together they map the normalized data to a desired distribution. The \(\epsilon\) is a small constant that avoids division by zero. At test time, moving averages of \(\boldsymbol{\mu}\) and \(\boldsymbol{\sigma}^2\) collected during training are used to normalize the data.
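
A minimal NumPy sketch of the training-time computation (the moving-average bookkeeping for test time is omitted):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """X: (n, d); gamma, beta: (d,). Normalize each feature over the batch."""
    mu = X.mean(axis=0)                      # per-feature mean, shape (d,)
    var = X.var(axis=0)                      # per-feature variance, shape (d,)
    X_hat = (X - mu) / np.sqrt(var + eps)    # standardize each feature
    return X_hat * gamma + beta              # re-scale and re-shift
```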

Layer Normalization (LN)

Layer normalization is similar to batch normalization, but it normalizes the data across the feature dimension:

\[ \begin{aligned} \boldsymbol{\mu} &= \frac{1}{d}\sum_{i=1}^d \rmX_{:i} \\ \boldsymbol{\sigma}^2 &= \frac{1}{d}\sum_{i=1}^d (\rmX_{:i} - \boldsymbol{\mu})^2 \\ \rmY &= \frac{\rmX - \boldsymbol{\mu}}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}} \boldsymbol{\gamma} + \boldsymbol{\beta} \end{aligned} \]
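
The corresponding sketch for layer normalization, which normalizes each row (token) over its \(d\) features:

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """X: (n, d); gamma, beta: (d,). Normalize each sample over its features."""
    mu = X.mean(axis=-1, keepdims=True)      # per-sample mean, shape (n, 1)
    var = X.var(axis=-1, keepdims=True)      # per-sample variance, shape (n, 1)
    X_hat = (X - mu) / np.sqrt(var + eps)    # standardize each sample
    return X_hat * gamma + beta              # re-scale and re-shift
```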

Add & Norm

The add & norm layer is computed as follows:

  1. Post-Layer Normalization (original implementation):

    \[ \rmY = \mathrm{LayerNorm}(\rmX + \mathrm{FFN}(\rmX)) \]
  2. Pre-Layer Normalization:

    \[ \rmY = \rmX + \mathrm{FFN}(\mathrm{LayerNorm}(\rmX)) \]

Here \(\mathrm{FFN}\) is a feed-forward network (the same pattern applies to the attention sub-layer). With pre-layer normalization, learning-rate warm-up is less important than with post-layer normalization.
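
The two variants differ only in where the normalization sits relative to the residual connection; a sketch reusing the `layer_norm` function above, with `ffn` standing in for any sub-layer:

```python
def post_ln_block(X, ffn, gamma, beta):
    # Original Transformer: residual add first, then normalize.
    return layer_norm(X + ffn(X), gamma, beta)

def pre_ln_block(X, ffn, gamma, beta):
    # Pre-LN: normalize the sub-layer input and keep the residual path untouched.
    return X + ffn(layer_norm(X, gamma, beta))
```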