Backpropagation

Vectorized Gradient

Let us suppose we have a function \(\rvf: \sR^n \to \sR^m\) that maps a vector of length \(n\) to a vector of length \(m\): \(\rvf(\rvx) = [f_1(x_1,\dots, x_n), \dots, f_m(x_1,\dots, x_n)]\). Then its Jacobian matrix is defined as:

\[ \frac{\partial \rvf}{\partial \rvx}= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\in \sR^{m \times n} \]

That is, \((\frac{\partial \rvf}{\partial \rvx})_{ij} = \frac{\partial f_i}{\partial x_j}\).
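As a quick sanity check of this definition, the sketch below approximates a Jacobian with central finite differences, assuming NumPy; the helper `numerical_jacobian` and the example function are illustrative and not part of the original notes.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the (m x n) Jacobian of f at x with central differences."""
    x = np.asarray(x, dtype=float)
    m = f(x).shape[0]
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# f(x1, x2) = [x1 * x2, sin(x1)] has Jacobian [[x2, x1], [cos(x1), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(numerical_jacobian(f, np.array([1.0, 2.0])))
# approximately [[2.0, 1.0], [0.5403, 0.0]]
```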

Useful identities

  1. Matrix times column vector with respect to the column vector (\(\rvz=\rmW\rvx\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)

Let us suppose \(\rmW\in\sR^{n\times m}\). Then we can think of \(\rvz\) as a function of \(\rvx\), taking an \(m\)-dimensional vector to an \(n\)-dimensional vector, so its Jacobian matrix will be \(n\times m\). Note that

\[ z_i = \sum_{k=1}^m W_{ik} x_k \]

So an entry \((\frac{\partial z}{\partial x})_{ij}\) of the Jacobian is

\[ (\frac{\partial z}{\partial x})_{ij}=\frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^m W_{ik} x_k = \sum_{k=1}^m W_{ik}\frac{\partial x_k}{\partial x_j} = W_{ij} \Longleftrightarrow \frac{\partial \rvz}{\partial \rvx} = \rmW \]
  2. Row vector times matrix with respect to the row vector (\(\rvz=\rvx\rmW\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)
\[ (\frac{\partial z}{\partial x})_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^n x_k W_{ki} = W_{ji} \Longleftrightarrow \frac{\partial \rvz}{\partial \rvx} = \rmW^T \]

Here, we can see that the Jacobian matrix is \(m\times n\) as input \(\rvx\) is \(n\)-dimensional and output \(\rvz\) is \(m\)-dimensional.

  3. A vector with itself (\(\rvz=\rvx\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)
\[ (\frac{\partial z}{\partial x})_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial x_i}{\partial x_j} =\begin{cases} 1 & \text{if } i=j \\ 0 & \text{otherwise} \end{cases} \Longleftrightarrow \frac{\partial \rvz}{\partial \rvx} = \rmI \]
  4. An element-wise function applied to a vector (\(\rvz = f(\rvx)\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)

Since \(z_i = f(x_i)\), we have:

\[ (\frac{\partial z}{\partial x})_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial f(x_i)}{\partial x_j} = \begin{cases} f'(x_i) & \text{if } i=j \\ 0 & \text{otherwise} \end{cases} \Longleftrightarrow \frac{\partial \rvz}{\partial \rvx} = \text{diag}(f'(\rvx)) \]
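The first few identities are easy to confirm numerically. Below is a minimal NumPy sketch that checks identities 1, 3, and 4 on random data; the finite-difference helper and the chosen sizes are illustrative only.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of f at x."""
    out = f(x)
    J = np.zeros((out.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # n = 3, m = 4
x = rng.standard_normal(4)

print(np.allclose(numerical_jacobian(lambda v: W @ v, x), W))        # identity 1: W
print(np.allclose(numerical_jacobian(lambda v: v, x), np.eye(4)))    # identity 3: I
print(np.allclose(numerical_jacobian(np.tanh, x),                    # identity 4: diag(f'(x))
                  np.diag(1 - np.tanh(x) ** 2)))
```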
  5. Matrix times column vector with respect to the matrix (\(\rvz=\rmW\rvx\), what is \(\frac{\partial \rvz}{\partial \rmW}\)?)

Here \(\rvz\in \sR^n\) and \(\rmW\in \sR^{n\times m}\), so the Jacobian would be an \(n\times n\times m\) tensor, which is too large to work with directly. However, if we include a scalar loss function \(J\) and compute its gradient with respect to \(\rmW\), it becomes a \(1\times nm\) row vector, which we can reshape into an \(n\times m\) matrix:

\[ \frac{\partial J}{\partial \rmW} = \begin{bmatrix} \frac{\partial J}{\partial W_{11}} & \cdots & \frac{\partial J}{\partial W_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial J}{\partial W_{n1}} & \cdots & \frac{\partial J}{\partial W_{nm}} \end{bmatrix} \]

Since this matrix has the same shape as \(\rmW\), we can update \(\rmW\) by \(\rmW = \rmW - \mu \frac{\partial J}{\partial \rmW}\), where \(\mu\) is the learning rate. To compute \(\frac{\partial J}{\partial \rmW}\), we can use the chain rule:

\[ \frac{\partial J}{\partial \rmW} = \frac{\partial J}{\partial \rvz} \frac{\partial \rvz}{\partial \rmW} \]

Let us assume that \(\frac{\partial J}{\partial \rvz} = \boldsymbol{\delta}\in \sR^n\). Then we only need to compute \(\frac{\partial \rvz}{\partial \rmW}\). Though the full Jacobian is an \(n\times n\times m\) tensor, we can avoid forming it by taking the gradient with respect to a single weight \(W_{ij}\), i.e., \(\frac{\partial \rvz}{\partial W_{ij}}\in \sR^n\). We can compute it as follows:

\[ \begin{aligned} z_k &= \sum_{l=1}^m W_{kl} x_l\\ \frac{\partial z_k}{\partial W_{ij}} &= \sum_{l=1}^m x_l \frac{\partial W_{kl}}{\partial W_{ij}} = \begin{cases} x_j & \text{if } k=i \\ 0 & \text{otherwise} \end{cases} \end{aligned} \]

since \(\frac{\partial W_{kl}}{\partial W_{ij}}\) is \(1\) only when \(k=i\) and \(l=j\), and \(0\) otherwise.

That is to say:

\[ \frac{\partial z_k}{\partial W_{ij}} = \begin{bmatrix} 0\\ \vdots\\ x_j\\ \vdots\\ 0 \end{bmatrix} \leftarrow i\text{-th entry} \]

Now let us compute \(\frac{\partial J}{\partial W_{ij}}\):

\[ \frac{\partial J}{\partial W_{ij}} = \sum_{k=1}^n \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial W_{ij}} = \sum_{k=1}^n \delta_k \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j \]

That means, to get \(\frac{\partial J}{\partial \rmW}\), we need a matrix where entry \((i,j)\) is \(\delta_i x_j\). This is the outer product of \(\boldsymbol{\delta}\) and \(\rvx\):

\[ \frac{\partial J}{\partial \rmW} = \boldsymbol{\delta} \rvx^T \]
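To see identity 5 in action, the sketch below builds the toy scalar loss \(J = \boldsymbol{\delta}\cdot(\rmW\rvx)\), so that \(\frac{\partial J}{\partial \rvz} = \boldsymbol{\delta}\) by construction, and confirms numerically that \(\frac{\partial J}{\partial \rmW}\) equals the outer product \(\boldsymbol{\delta}\rvx^T\). The loss, sizes, and random values are illustrative; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)
delta = rng.standard_normal(n)           # plays the role of dJ/dz

J = lambda W: delta @ (W @ x)            # toy scalar loss

# Finite-difference gradient with respect to every entry of W.
eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        num_grad[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

print(np.allclose(num_grad, np.outer(delta, x)))  # True: delta x^T
```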
  6. Row vector times matrix with respect to the matrix (\(\rvz=\rvx\rmW\), what is \(\frac{\partial \rvz}{\partial \rmW}\)?)

A similar computation shows that \(\frac{\partial J}{\partial \rmW} = \rvx^T \boldsymbol{\delta}\), where \(\boldsymbol{\delta} = \frac{\partial J}{\partial \rvz}\) is now a row vector.
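The same kind of numerical check works for this row-vector convention; here the toy loss \(J = (\rvx\rmW)\cdot\boldsymbol{\delta}\) and the random values are illustrative, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.standard_normal((n, m))
x = rng.standard_normal(n)               # row-vector input
delta = rng.standard_normal(m)           # plays the role of dJ/dz

J = lambda W: (x @ W) @ delta            # toy scalar loss

eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        num_grad[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

print(np.allclose(num_grad, np.outer(x, delta)))  # True: x^T delta
```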

Gradient Layout

Simply follow this convention: the gradient has the same shape as the parameter it corresponds to, so that the SGD update can be applied directly.

MLP example

Let us consider a Multilayer Perceptron (MLP) with one hidden layer, using ReLU activation, and a Softmax output layer with Cross-Entropy loss.

Notation:

  • Input: \(\rvx \in \sR^n\)
  • Hidden weights: \(\rmW_1 \in \sR^{h \times n}\)
  • Hidden bias: \(\rvb_1 \in \sR^h\)
  • Output weights: \(\rmW_2 \in \sR^{m \times h}\)
  • Output bias: \(\rvb_2 \in \sR^m\)
  • Target labels (one-hot): \(\rvy \in \sR^m\)

1. Forward Pass

\[ \begin{aligned} \rvz_1 &= \rmW_1 \rvx + \rvb_1 \\ \rvh &= \text{ReLU}(\rvz_1) = \max(0, \rvz_1) \\ \rvz_2 &= \rmW_2 \rvh + \rvb_2 \\ \hat{\rvy} &= \text{Softmax}(\rvz_2) \quad \text{where } \hat{y}_i = \frac{e^{z_{2,i}}}{\sum_j e^{z_{2,j}}} \\ J &= \text{CrossEntropy}(\rvy, \hat{\rvy}) = -\sum_{k=1}^m y_k \log(\hat{y}_k) \end{aligned} \]
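Here is a minimal NumPy sketch of this forward pass; the dimensions, random initialization, and target class are illustrative and not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 4, 5, 3                      # input, hidden, output sizes

x  = rng.standard_normal(n)
W1 = rng.standard_normal((h, n)); b1 = np.zeros(h)
W2 = rng.standard_normal((m, h)); b2 = np.zeros(m)
y  = np.eye(m)[1]                      # one-hot target (class 1)

z1 = W1 @ x + b1
hid = np.maximum(0.0, z1)              # ReLU
z2 = W2 @ hid + b2
y_hat = np.exp(z2 - z2.max())          # softmax (shifted for numerical stability)
y_hat /= y_hat.sum()
J = -np.sum(y * np.log(y_hat))         # cross-entropy loss
print(J)
```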

2. Backward Pass

We compute gradients starting from the loss \(J\) and propagating backward.

Step A: Gradient w.r.t. Logits (\(\rvz_2\))

The derivative of the Softmax function combined with the Cross-Entropy loss simplifies to the difference between prediction and target:

\[ \boldsymbol{\delta}_2 = \frac{\partial J}{\partial \rvz_2} = \hat{\rvy} - \rvy \in \sR^m \]

Step B: Gradient w.r.t. Output Parameters (\(\rmW_2, \rvb_2\))

Using Identity 5 (outer product of error and input):

\[ \frac{\partial J}{\partial \rmW_2} = \boldsymbol{\delta}_2 \rvh^T \]

For the bias:

\[ \frac{\partial J}{\partial \rvb_2} = \boldsymbol{\delta}_2 \]

Step C: Gradient w.r.t. Hidden Activation (\(\rvh\))

Using Identity 1 (backpropagating through the matrix multiplication):

\[ \frac{\partial J}{\partial \rvh} = \frac{\partial J}{\partial \rvz_2} \frac{\partial \rvz_2}{\partial \rvh} = \rmW_2^T \boldsymbol{\delta}_2 \]

Step D: Gradient w.r.t. Hidden Pre-activation (\(\rvz_1\))

Using Identity 4 (element-wise activation derivative): the derivative of ReLU is the indicator function \(\mathbb{I}(\rvz_1 > 0)\).

\[ \boldsymbol{\delta}_1 = \frac{\partial J}{\partial \rvz_1} = \frac{\partial J}{\partial \rvh} \odot \mathbb{I}(\rvz_1 > 0) = (\rmW_2^T \boldsymbol{\delta}_2) \odot \mathbb{I}(\rvz_1 > 0) \]

(Note: \(\odot\) denotes element-wise multiplication)

Step E: Gradient w.r.t. Hidden Parameters (\(\rmW_1, \rvb_1\))

Applying Identity 5 again:

\[ \frac{\partial J}{\partial \rmW_1} = \boldsymbol{\delta}_1 \rvx^T \]
\[ \frac{\partial J}{\partial \rvb_1} = \boldsymbol{\delta}_1 \]
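Putting Steps A through E together, the following self-contained NumPy sketch repeats the forward pass, computes all five gradients, and spot-checks one entry of \(\frac{\partial J}{\partial \rmW_1}\) against finite differences. Sizes, the random seed, and the target class are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 4, 5, 3
x  = rng.standard_normal(n)
W1 = rng.standard_normal((h, n)); b1 = np.zeros(h)
W2 = rng.standard_normal((m, h)); b2 = np.zeros(m)
y  = np.eye(m)[1]                      # one-hot target

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    hid = np.maximum(0.0, z1)          # ReLU
    z2 = W2 @ hid + b2
    y_hat = np.exp(z2 - z2.max()); y_hat /= y_hat.sum()
    J = -np.sum(y * np.log(y_hat))
    return J, z1, hid, y_hat

J, z1, hid, y_hat = forward(W1, b1, W2, b2)

delta2 = y_hat - y                     # Step A
dW2 = np.outer(delta2, hid)            # Step B
db2 = delta2
dh  = W2.T @ delta2                    # Step C
delta1 = dh * (z1 > 0)                 # Step D
dW1 = np.outer(delta1, x)              # Step E
db1 = delta1

# Finite-difference check on one entry of W1.
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
num = (forward(W1 + E, b1, W2, b2)[0] - forward(W1 - E, b1, W2, b2)[0]) / (2 * eps)
print(np.isclose(num, dW1[0, 0]))      # True (barring an unlucky ReLU kink)
```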

CNN example

Notation:

  • Input feature map: \(\rmX \in \sR^{H \times W}\)
  • Filter (kernel): \(\rmK \in \sR^{k \times k}\)
  • Output feature map: \(\rmZ = \rmX * \rmK\) (valid convolution, stride 1)
  • Loss gradient w.r.t output map: \(\boldsymbol{\delta} = \frac{\partial J}{\partial \rmZ} \in \sR^{(H-k+1) \times (W-k+1)}\)

1. Gradient w.r.t. Filter (\(\rmK\))

We are looking for \(\frac{\partial J}{\partial \rmK}\). Each weight in the filter contributes to every position in the output map, so its gradient accumulates contributions from all of them. This results in a convolution between the Input \(\rmX\) and the Gradient \(\boldsymbol{\delta}\):

\[ \frac{\partial J}{\partial \rmK} = \rmX * \boldsymbol{\delta}\in \sR^{k\times k} \]

(Note: Depending on convention, this is technically a cross-correlation).

2. Gradient w.r.t. Input (\(\rmX\))

We are looking for \(\frac{\partial J}{\partial \rmX}\) to propagate the error to previous layers. This operation turns out to be a convolution between the Gradient \(\boldsymbol{\delta}\) (zero-padded by \(k-1\) on each side, i.e. a "full" convolution) and the 180-degree rotated Filter \(\rmK\):

\[ \frac{\partial J}{\partial \rmX} = \boldsymbol{\delta} * \text{rot180}(\rmK) \]

This maintains the pattern:

  1. Weight gradient: the outer-product analog (correlation of the input with the delta).
  2. Input gradient: the transpose analog (convolution of the delta with the rotated weights).
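To make these conventions concrete, here is a loop-based NumPy sketch of both gradients, verified entrywise with finite differences against the toy loss \(J = \sum_{u,v} \delta_{uv} Z_{uv}\) (so that \(\frac{\partial J}{\partial \rmZ} = \boldsymbol{\delta}\) by construction). The helper `corr2d`, the shapes, and the random values are illustrative, not part of the original notes.

```python
import numpy as np

def corr2d(A, B):
    """Valid 2-D cross-correlation of A with kernel B."""
    Ha, Wa = A.shape
    Hb, Wb = B.shape
    out = np.zeros((Ha - Hb + 1, Wa - Wb + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            out[u, v] = np.sum(A[u:u + Hb, v:v + Wb] * B)
    return out

rng = np.random.default_rng(0)
H, W, k = 6, 6, 3
X = rng.standard_normal((H, W))
K = rng.standard_normal((k, k))
delta = rng.standard_normal((H - k + 1, W - k + 1))    # pretend dJ/dZ

loss = lambda X, K: np.sum(delta * corr2d(X, K))       # toy scalar loss

dK = corr2d(X, delta)                                  # gradient w.r.t. the filter
pad = np.pad(delta, k - 1)                             # zero-pad for the "full" pass
dX = corr2d(pad, np.rot90(K, 2))                       # gradient w.r.t. the input

# Finite-difference spot checks on single entries.
eps = 1e-6
EK = np.zeros_like(K); EK[1, 2] = eps
EX = np.zeros_like(X); EX[2, 3] = eps
print(np.isclose((loss(X, K + EK) - loss(X, K - EK)) / (2 * eps), dK[1, 2]))  # True
print(np.isclose((loss(X + EX, K) - loss(X - EX, K)) / (2 * eps), dX[2, 3]))  # True
```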

3. Pooling Layers

Pooling layers have no learnable weights, so we only compute the gradient with respect to the input \(\rmX\). Let \(\rmZ = \text{Pool}(\rmX)\) and \(\boldsymbol{\delta} = \frac{\partial J}{\partial \rmZ}\). A short sketch of both backward passes follows the list below.

  • Max Pooling: The forward pass selects the maximum value in each window. The backward pass acts as a "router": it passes the gradient only to the pixel that was the max, and 0 to others.

    \[ \frac{\partial J}{\partial x_{ij}} = \begin{cases} \delta_{uv} & \text{if } x_{ij} = \max(\text{window}_{uv}) \\ 0 & \text{otherwise} \end{cases} \]

    (This requires caching the indices of the max values during the forward pass).

  • Average Pooling: The forward pass computes the mean: \(z_{uv} = \frac{1}{k^2} \sum x_{ij}\). The backward pass distributes the gradient equally to all inputs in the window.

    \[ \frac{\partial J}{\partial x_{ij}} = \frac{1}{k^2} \delta_{uv} \]

    (This acts like a "reverse" average filter, upsampling the gradient map).
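Below is a minimal NumPy sketch of both backward passes for a single \(4\times 4\) input with \(2\times 2\) windows and stride 2; the input and the incoming gradient \(\boldsymbol{\delta}\) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
delta = rng.standard_normal((2, 2))          # dJ/dZ from the layer above
p = 2                                        # pooling window size and stride

dX_max = np.zeros_like(X)
dX_avg = np.zeros_like(X)
for u in range(2):
    for v in range(2):
        win = X[p * u:p * u + p, p * v:p * v + p]
        i, j = np.unravel_index(np.argmax(win), win.shape)
        dX_max[p * u + i, p * v + j] = delta[u, v]                       # route to the max
        dX_avg[p * u:p * u + p, p * v:p * v + p] = delta[u, v] / p ** 2  # spread evenly

print(dX_max)   # nonzero only at the argmax of each window
print(dX_avg)   # each window filled with delta[u, v] / 4
```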

References

gradient-notes.pdf