Backpropagation
Vectorized Gradient
Let us suppose we have a function \(\rvf: \sR^n \to \sR^m\) that maps a vector of length \(n\) to a vector of length \(m\): \(\rvf(\rvx) = [f_1(x_1,\dots, x_n), \dots, f_m(x_1,\dots, x_n)]\). Then its Jacobian matrix is defined as:
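\[
\frac{\partial \rvf}{\partial \rvx} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
\]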
That is, \((\frac{\partial \rvf}{\partial \rvx})_{i,j} = \frac{\partial f_i}{\partial x_j}\).
Useful identities
- Matrix times column vector with respect to the column vector (\(\rvz=\rmW\rvx\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)
Suppose \(\rmW\in\sR^{n\times m}\). Then we can think of \(\rvz\) as a function of \(\rvx\), taking an \(m\)-dimensional vector to an \(n\)-dimensional vector, so the Jacobian matrix will be \(n\times m\). Note that
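\[
z_i = \sum_{k=1}^{m} W_{ik} x_k
\]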
So an entry \(\left(\frac{\partial \rvz}{\partial \rvx}\right)_{i,j}\) of the Jacobian is
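\[
\left(\frac{\partial \rvz}{\partial \rvx}\right)_{i,j} = \frac{\partial z_i}{\partial x_j}
= \frac{\partial}{\partial x_j}\sum_{k=1}^{m} W_{ik} x_k
= \sum_{k=1}^{m} W_{ik} \frac{\partial x_k}{\partial x_j}
= W_{ij}
\]
so \(\frac{\partial \rvz}{\partial \rvx} = \rmW\).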
- Row vector times matrix with respect to the row vector (\(\rvz=\rvx\rmW\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)
Here, we can see that the Jacobian matrix is \(m\times n\), since the input \(\rvx\) is \(n\)-dimensional and the output \(\rvz\) is \(m\)-dimensional.
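Since \(z_j = \sum_{k=1}^{n} x_k W_{kj}\), an entry of the Jacobian is
\[
\left(\frac{\partial \rvz}{\partial \rvx}\right)_{i,j} = \frac{\partial z_i}{\partial x_j} = W_{ji},
\]
so \(\frac{\partial \rvz}{\partial \rvx} = \rmW^T\).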
- A vector with itself (\(\rvz=\rvx\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)
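Since \(z_i = x_i\), the entry \(\left(\frac{\partial \rvz}{\partial \rvx}\right)_{i,j} = \frac{\partial x_i}{\partial x_j}\) is \(1\) if \(i = j\) and \(0\) otherwise, so \(\frac{\partial \rvz}{\partial \rvx}\) is the \(n\times n\) identity matrix \(\mathbf{I}\).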
- An element-wise function applied to a vector (\(\rvz = f(\rvx)\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)
Since \(z_i = f(x_i)\), we have:
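\[
\left(\frac{\partial \rvz}{\partial \rvx}\right)_{i,j} = \frac{\partial z_i}{\partial x_j} =
\begin{cases}
f'(x_i) & \text{if } i = j \\
0 & \text{otherwise}
\end{cases}
\]
so \(\frac{\partial \rvz}{\partial \rvx} = \mathrm{diag}(f'(\rvx))\).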
- Matrix times column vector with respect to the matrix (\(\rvz=\rmW\rvx\), what is \(\frac{\partial \rvz}{\partial \rmW}\)?)
Here \(\rvz\in \sR^n\) and \(\rmW\in \sR^{n\times m}\), so the full Jacobian would be an \(n\times n\times m\) tensor, which is too large to work with directly. However, if we introduce a scalar loss function \(J\) and compute its gradient with respect to \(\rmW\), it becomes a \(1\times nm\) row vector, which we can reshape into an \(n\times m\) matrix:
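\[
\frac{\partial J}{\partial \rmW} =
\begin{bmatrix}
\frac{\partial J}{\partial W_{11}} & \cdots & \frac{\partial J}{\partial W_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial J}{\partial W_{n1}} & \cdots & \frac{\partial J}{\partial W_{nm}}
\end{bmatrix}
\]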
Since this matrix has the same shape as \(\rmW\), we can update \(\rmW\) by \(\rmW = \rmW - \mu \frac{\partial J}{\partial \rmW}\), where \(\mu\) is the learning rate. To compute \(\frac{\partial J}{\partial \rmW}\), we can use the chain rule:
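\[
\frac{\partial J}{\partial \rmW} = \frac{\partial J}{\partial \rvz}\,\frac{\partial \rvz}{\partial \rmW}
\]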
Let us assume that \(\frac{\partial J}{\partial \rvz} = \boldsymbol{\delta}\in \sR^n\). Then we only need to compute \(\frac{\partial \rvz}{\partial \rmW}\). Though this Jacobian is an \(n\times n\times m\) tensor, we can avoid forming it by taking the gradient with respect to a single weight \(W_{ij}\), i.e., \(\frac{\partial \rvz}{\partial W_{ij}}\in \sR^n\). We can compute it as follows:
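\[
z_k = \sum_{l=1}^{m} W_{kl} x_l, \qquad
\frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^{m} x_l\,\frac{\partial W_{kl}}{\partial W_{ij}} =
\begin{cases}
x_j & \text{if } k = i \\
0 & \text{otherwise}
\end{cases}
\]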
That is to say:
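\[
\frac{\partial \rvz}{\partial W_{ij}} =
\begin{bmatrix}
0 & \cdots & 0 & x_j & 0 & \cdots & 0
\end{bmatrix}^T,
\]
where the single nonzero entry \(x_j\) sits in the \(i\)-th position.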
Now let us compute \(\frac{\partial J}{\partial W_{ij}}\):
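\[
\frac{\partial J}{\partial W_{ij}} = \frac{\partial J}{\partial \rvz}\,\frac{\partial \rvz}{\partial W_{ij}}
= \sum_{k=1}^{n} \delta_k\,\frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j
\]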
That means, to get \(\frac{\partial J}{\partial \rmW}\), we need a matrix where entry \((i,j)\) is \(\delta_i x_j\). This is the outer product of \(\boldsymbol{\delta}\) and \(\rvx\):
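\[
\frac{\partial J}{\partial \rmW} = \boldsymbol{\delta}\,\rvx^T
\]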
- Row vector times matrix with respect to the matrix (\(\rvz=\rvx\rmW\), what is \(\frac{\partial \rvz}{\partial \rmW}\)?)
A similar computation shows that \(\frac{\partial J}{\partial \rmW} = \rvx^T \boldsymbol{\delta}\), where here \(\boldsymbol{\delta} = \frac{\partial J}{\partial \rvz}\) is a \(1\times m\) row vector, so that \(\rvx^T \boldsymbol{\delta}\) has shape \(n\times m\), matching \(\rmW\).
Gradient Layout
Simply follow the convention: the shape of the gradient equals the shape of the parameters when doing SGD.
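As a quick sanity check of this convention and of the identity \(\frac{\partial J}{\partial \rmW} = \boldsymbol{\delta}\,\rvx^T\) above, here is a minimal NumPy sketch (not part of the original derivation; the toy loss \(J = \sum_i \tanh(z_i)\) and all variable names are illustrative choices) comparing the analytic gradient with finite differences and checking that it has the same shape as \(\rmW\):

```python
import numpy as np

# Toy setup: z = W x with W in R^{n x m}, x in R^m, and an arbitrary
# scalar loss J = sum(tanh(z)) chosen only for illustration.
rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.normal(size=(n, m))
x = rng.normal(size=m)

def loss(W):
    z = W @ x
    return np.sum(np.tanh(z))

# Analytic gradient: delta = dJ/dz, and dJ/dW = delta x^T (an outer product).
z = W @ x
delta = 1.0 - np.tanh(z) ** 2          # dJ/dz_i = 1 - tanh(z_i)^2 for this loss
analytic = np.outer(delta, x)          # shape (n, m), same as W

# Numerical gradient via central finite differences on each W_ij.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(analytic.shape == W.shape)        # True: gradient shape matches parameter shape
print(np.allclose(analytic, numeric))   # True: delta x^T matches finite differences
```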