Transformer

Positional Encoding

Absolute Positional Encoding (APE)

Consider an input \(\rmX \in \sR^{n\times d}\), where \(d\) is the dimension of each token's feature vector and \(n\) is the number of tokens in the input sequence. For any position \(t\) in the input sequence, the corresponding positional encoding \(\rvp_t \in \sR^d\) is computed as follows:

\[ \begin{cases} \rvp_{t, 2i} = \sin\left(\omega_i t\right) \\ \rvp_{t, 2i+1} = \cos\left(\omega_i t\right) \\ \end{cases} \]

Here \(\rvp_{t, 2i}\) represents the \(2i\)-th element of the positional encoding vector \(\rvp_t\). The frequency \(\omega_i\) is computed as follows:

\[ \omega_i = \frac{1}{10000^{2i/d}} \]

We can also write the positional encoding as follows:

\[ \rvp_t = \begin{bmatrix} \sin\left(\omega_0 t\right) \\ \cos\left(\omega_0 t\right) \\ \vdotswithin{\sin\left(\omega_0 t\right)} \\ \sin\left(\omega_{d/2-1} t\right) \\ \cos\left(\omega_{d/2-1} t\right) \\ \end{bmatrix} \]
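
As a concrete reference, here is a minimal NumPy sketch that builds this encoding for a whole sequence (the function name `sinusoidal_encoding` is just illustrative):

```python
import numpy as np

def sinusoidal_encoding(n: int, d: int) -> np.ndarray:
    """Return the (n, d) matrix whose t-th row is p_t; d is assumed even."""
    t = np.arange(n)[:, None]                # positions 0 .. n-1
    i = np.arange(d // 2)[None, :]           # frequency index 0 .. d/2-1
    omega = 1.0 / (10000 ** (2 * i / d))     # omega_i = 1 / 10000^(2i/d)
    pe = np.empty((n, d))
    pe[:, 0::2] = np.sin(omega * t)          # even dimensions: sin(omega_i t)
    pe[:, 1::2] = np.cos(omega * t)          # odd dimensions:  cos(omega_i t)
    return pe
```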

Nice Properties

  1. The dot product of two positional encodings depends only on \(\Delta t\) (a numerical check appears after this list):

    \[ \begin{aligned} \rvp_t^T \rvp_{t+\Delta t} &= \sum_{i=0}^{d/2-1} \sin\left(\omega_i t\right) \sin\left(\omega_i (t+\Delta t)\right) + \cos\left(\omega_i t\right) \cos\left(\omega_i (t+\Delta t)\right) \\ &= \sum_{i=0}^{d/2-1} \cos\left(\omega_i \Delta t\right) \\ \end{aligned} \]
  2. \(\rvp_t^T \rvp_{t+\Delta t} = \rvp_t^T \rvp_{t-\Delta t}\)
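
Both properties can be verified numerically; a small sanity check, reusing the `sinusoidal_encoding` sketch from above:

```python
import numpy as np

pe = sinusoidal_encoding(n=128, d=64)
t, dt = 50, 7

# Property 1: p_t . p_{t+dt} equals sum_i cos(omega_i * dt), independent of t.
omega = 1.0 / (10000 ** (2 * np.arange(32) / 64))
assert np.allclose(pe[t] @ pe[t + dt], np.cos(omega * dt).sum())

# Property 2: the dot product is symmetric in +dt and -dt.
assert np.allclose(pe[t] @ pe[t + dt], pe[t] @ pe[t - dt])
```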

Relative Positional Encoding (RPE)

Relative positional encoding is applied inside self-attention, i.e., to the query, key, and value projections \(\rvq,\rvk,\rvv\):

\[ \begin{aligned} \rvq_m &= f_q(\rvx_m, m) \\ \rvk_n &= f_k(\rvx_n, n) \\ \rvv_n &= f_v(\rvx_n, n) \\ \end{aligned} \]

Let us assume that the inner product of \(\rvq_m\) and \(\rvk_n\) can be written as a function \(g\) of \(\rvx_m\), \(\rvx_n\), and the relative position \(m-n\):

\[ \langle f_q(\rvx_m, m), f_k(\rvx_n, n)\rangle = g(\rvx_m,\rvx_n,m-n) \]

A rotation matrix \(\rmR\) naturally satisfies this property:

\[ \begin{aligned} \langle \rmR_m\rvx_m, \rmR_n\rvx_n\rangle &= \rvx_m^T \rmR_m^T \rmR_n \rvx_n \\ &= \rvx_m^T \rmR_{n-m} \rvx_n \\ &=\langle \rvx_m, \rmR_{n-m}\rvx_n\rangle \end{aligned} \]

Thus, we can use rotation matrices to encode relative positions. For a high-dimensional vector \(\rvx\), the rotation matrix takes a block-diagonal form:

\[ \rmR_{t} = \begin{bmatrix} \cos\left(\omega_0 t\right) & -\sin\left(\omega_0 t\right) & 0 & 0 & \cdots & 0 & 0 \\ \sin\left(\omega_0 t\right) & \cos\left(\omega_0 t\right) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos\left(\omega_1 t\right) & -\sin\left(\omega_1 t\right) & \cdots & 0 & 0 \\ 0 & 0 & \sin\left(\omega_1 t\right) & \cos\left(\omega_1 t\right) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos\left(\omega_{d/2-1} t\right) & -\sin\left(\omega_{d/2-1} t\right) \\ 0 & 0 & 0 & 0 & \cdots & \sin\left(\omega_{d/2-1} t\right) & \cos\left(\omega_{d/2-1} t\right) \\ \end{bmatrix} \]

Here we have \(\rmR_{n-m} = \rmR_{m}^T \rmR_{n}\). Since \(\rmR_t\) is sparse, the product \(\rmR_t\rvx\) can be computed in \(O(d)\) time:

\[ \rmR_t \rvx = \begin{bmatrix} x_0\\ x_1\\ \vdots\\ x_{d-2} \\ x_{d-1} \end{bmatrix} \odot \begin{bmatrix} \cos\left(\omega_0 t\right)\\ \cos\left(\omega_0 t\right)\\ \vdots\\ \cos\left(\omega_{d/2-1} t\right)\\ \cos\left(\omega_{d/2-1} t\right) \end{bmatrix} + \begin{bmatrix} -x_1\\ x_0\\ \vdots\\ -x_{d-1} \\ x_{d-2} \end{bmatrix} \odot \begin{bmatrix} \sin\left(\omega_0 t\right)\\ \sin\left(\omega_0 t\right)\\ \vdots\\ \sin\left(\omega_{d/2-1} t\right)\\ \sin\left(\omega_{d/2-1} t\right) \end{bmatrix} \]
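
A minimal sketch of this \(O(d)\) computation, together with a check that the inner product of two rotated vectors depends only on the position offset (the function name `apply_rotary` is illustrative):

```python
import numpy as np

def apply_rotary(x: np.ndarray, t: int) -> np.ndarray:
    """Compute R_t x for a vector x of even dimension d, in O(d) time."""
    d = x.shape[-1]
    omega = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))
    cos = np.repeat(np.cos(omega * t), 2)    # [cos w_0 t, cos w_0 t, cos w_1 t, ...]
    sin = np.repeat(np.sin(omega * t), 2)    # [sin w_0 t, sin w_0 t, sin w_1 t, ...]
    x_rot = np.empty_like(x)
    x_rot[0::2] = -x[1::2]                   # (-x_1, -x_3, ...)
    x_rot[1::2] = x[0::2]                    # ( x_0,  x_2, ...)
    return x * cos + x_rot * sin

# The inner product of two rotated vectors depends only on the offset n - m.
q, k = np.random.randn(64), np.random.randn(64)
a = apply_rotary(q, 3) @ apply_rotary(k, 10)    # offset 7
b = apply_rotary(q, 20) @ apply_rotary(k, 27)   # offset 7
assert np.allclose(a, b)
```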

Self-attention

Scaled Dot-Product Attention

Given a query matrix \(\rmQ \in \sR^{n\times d}\), a key matrix \(\rmK \in \sR^{n\times d}\), and a value matrix \(\rmV \in \sR^{n\times d}\), the scaled dot-product attention is computed as follows (with the softmax applied row-wise):

\[ \rmZ = \mathrm{softmax}\left(\frac{\rmQ\rmK^T}{\sqrt{d}}\right)\rmV \]
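
A minimal NumPy sketch of this computation (masking is omitted and the function name is illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d), K: (m, d), V: (m, d) -> output of shape (n, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, m) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```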

Why divide by \(\sqrt{d}\)? Each attention score is a dot product between a query row \(\rvq\) and a key row \(\rvk\). For two independent random vectors \(\rvq, \rvk\in\sR^{d}\) with zero-mean, unit-variance entries, the dot product is:

\[ \rvq^T\rvk = \sum_{i=1}^d q_i k_i \]

Since the terms \(q_i k_i\) are independent with zero mean and unit variance, the variance of the dot product is \(d\). Dividing by \(\sqrt{d}\) rescales the scores to unit variance, which keeps the softmax inputs in a moderate range and prevents it from saturating.
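
The effect of the scaling can be illustrated empirically; a small sketch sampling i.i.d. standard-normal vectors:

```python
import numpy as np

d, trials = 256, 20_000
q = np.random.randn(trials, d)
k = np.random.randn(trials, d)
dots = (q * k).sum(axis=-1)          # one dot product per trial

print(dots.var())                    # roughly d (= 256)
print((dots / np.sqrt(d)).var())     # roughly 1 after scaling
```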

Multi-Head Attention

Multi-head attention runs several scaled dot-product attentions in parallel, each with its own learned projections of the queries, keys, and values, then concatenates the per-head outputs and applies an output projection.
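
A sketch of multi-head attention built on top of the `scaled_dot_product_attention` function above (the projection matrices stand in for learned weights, and splitting the channels after one joint projection is one common implementation choice):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (n, d); Wq, Wk, Wv, Wo: (d, d); num_heads must divide d."""
    n, d = X.shape
    dh = d // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)       # this head's slice of channels
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo  # concatenate heads, then project
```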

Encoder-Decoder Attention

The encoder provides \(\rvk\) and \(\rvv\) to the decoder, while the decoder provides \(\rvq\); attention is then computed as usual (this is cross-attention rather than self-attention).
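
In code, this is the same scaled dot-product attention with queries taken from the decoder and keys/values taken from the encoder; a small sketch with illustrative shapes, reusing the function above:

```python
import numpy as np

enc_out = np.random.randn(20, 64)   # encoder output: 20 source tokens, d = 64
dec_in  = np.random.randn(10, 64)   # decoder states: 10 target tokens

# Queries come from the decoder; keys and values come from the encoder.
context = scaled_dot_product_attention(Q=dec_in, K=enc_out, V=enc_out)
print(context.shape)                # (10, 64): one context vector per target token
```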

Add & Normalization

Batch Normalization (BN)

Given a mini-batch of data \(\rmX\in\sR^{n\times d}\), the batch normalization is computed as follows:

\[ \begin{aligned} \boldsymbol{\mu} &= \frac{1}{n}\sum_{i=1}^n \rmX_{i:} \\ \boldsymbol{\sigma}^2 &= \frac{1}{n}\sum_{i=1}^n (\rmX_{i:} - \boldsymbol{\mu})^2 \\ \rmY &= \frac{\rmX - \boldsymbol{\mu}}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}} \boldsymbol{\gamma} + \boldsymbol{\beta} \end{aligned} \]

Here \(\boldsymbol{\gamma}\) is a learnable scaling parameter and \(\boldsymbol{\beta}\) is a learnable shifting parameter; together they map the normalized data to a desired distribution. The \(\epsilon\) is a small constant that avoids division by zero. At test time, moving averages of \(\boldsymbol{\mu}\) and \(\boldsymbol{\sigma}^2\) collected during training are used to normalize the data.
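
A minimal NumPy sketch of the training-time computation (the moving-average bookkeeping for test time is omitted):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """X: (n, d); gamma, beta: (d,). Normalize each feature over the batch."""
    mu = X.mean(axis=0)                      # per-feature mean, shape (d,)
    var = X.var(axis=0)                      # per-feature variance, shape (d,)
    X_hat = (X - mu) / np.sqrt(var + eps)    # standardize each feature
    return X_hat * gamma + beta              # re-scale and re-shift
```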

Layer Normalization (LN)

Layer normalization is similar to batch normalization, but it normalizes the data across the feature dimension:

\[ \begin{aligned} \boldsymbol{\mu} &= \frac{1}{d}\sum_{i=1}^d \rmX_{:i} \\ \boldsymbol{\sigma}^2 &= \frac{1}{d}\sum_{i=1}^d (\rmX_{:i} - \boldsymbol{\mu})^2 \\ \rmY &= \frac{\rmX - \boldsymbol{\mu}}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}} \boldsymbol{\gamma} + \boldsymbol{\beta} \end{aligned} \]
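
The corresponding sketch for layer normalization, which normalizes each row (token) over its \(d\) features:

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """X: (n, d); gamma, beta: (d,). Normalize each sample over its features."""
    mu = X.mean(axis=-1, keepdims=True)      # per-sample mean, shape (n, 1)
    var = X.var(axis=-1, keepdims=True)      # per-sample variance, shape (n, 1)
    X_hat = (X - mu) / np.sqrt(var + eps)    # standardize each sample
    return X_hat * gamma + beta              # re-scale and re-shift
```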

Add & Norm

The add & norm layer is computed as follows:

  1. Post-Layer Normalization (original implementation):

    \[ \rmY = \mathrm{LayerNorm}(\rmX + \mathrm{FFN}(\rmX)) \]
  2. Pre-Layer Normalization:

    \[ \rmY = \rmX + \mathrm{FFN}(\mathrm{LayerNorm}(\rmX)) \]

Here \(\mathrm{FFN}\) is a feed-forward network (the same pattern applies to the attention sub-layer). With pre-layer normalization, learning-rate warm-up is less important than with post-layer normalization.
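
The two variants differ only in where the normalization sits relative to the residual connection; a sketch reusing the `layer_norm` function above, with `ffn` standing in for any sub-layer:

```python
def post_ln_block(X, ffn, gamma, beta):
    # Original Transformer: residual add first, then normalize.
    return layer_norm(X + ffn(X), gamma, beta)

def pre_ln_block(X, ffn, gamma, beta):
    # Pre-LN: normalize the sub-layer input and keep the residual path untouched.
    return X + ffn(layer_norm(X, gamma, beta))
```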