Backpropagation

Vectorized Gradient

Let us suppose we have a function \(\rvf: \sR^n \to \sR^m\) that maps a vector of length \(n\) to a vector of length \(m\): \(\rvf(\rvx) = [f_1(x_1,\dots, x_n), \dots, f_m(x_1,\dots, x_n)]\). Then its Jacobian matrix is defined as:

\[ \frac{\partial \rvf}{\partial \rvx}= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\in \sR^{m \times n} \]

That is, \((\frac{\partial \rvf}{\partial \rvx})_{ij} = \frac{\partial f_i}{\partial x_j}\).
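As a quick sanity check of this definition, the sketch below approximates a Jacobian with central finite differences, assuming NumPy; the helper `numerical_jacobian` and the example function are illustrative and not part of the original notes.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the (m x n) Jacobian of f at x with central differences."""
    x = np.asarray(x, dtype=float)
    m = f(x).shape[0]
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# f(x1, x2) = [x1 * x2, sin(x1)] has Jacobian [[x2, x1], [cos(x1), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(numerical_jacobian(f, np.array([1.0, 2.0])))
# approximately [[2.0, 1.0], [0.5403, 0.0]]
```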

Useful identities

  1. Matrix times column vector with respect to the column vector (\(\rvz=\rmW\rvx\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)

Let us suppose \(\rmW\in\sR^{n\times m}\). Then we can think of \(\rvz\) as a function of \(\rvx\), taking an \(m\)-dimensional vector to an \(n\)-dimensional vector, so its Jacobian matrix will be \(n\times m\). Note that

\[ z_i = \sum_{k=1}^m W_{ik} x_k \]

So an entry \((\frac{\partial z}{\partial x})_{ij}\) of the Jacobian is

\[ (\frac{\partial z}{\partial x})_{ij}=\frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^m W_{ik} x_k = \sum_{k=1}^m W_{ik}\frac{\partial x_k}{\partial x_j} = W_{ij} \Longleftrightarrow \frac{\partial \rvz}{\partial \rvx} = \rmW \]
  2. Row vector times matrix with respect to the row vector (\(\rvz=\rvx\rmW\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)
\[ (\frac{\partial z}{\partial x})_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^n x_k W_{ki} = W_{ji} \Longleftrightarrow \frac{\partial \rvz}{\partial \rvx} = \rmW^T \]

Here, we can see that the Jacobian matrix is \(m\times n\) as input \(\rvx\) is \(n\)-dimensional and output \(\rvz\) is \(m\)-dimensional.

  3. A vector with itself (\(\rvz=\rvx\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)
\[ (\frac{\partial z}{\partial x})_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial x_i}{\partial x_j} =\begin{cases} 1 & \text{if } i=j \\ 0 & \text{otherwise} \end{cases} \Longleftrightarrow \frac{\partial \rvz}{\partial \rvx} = \rmI \]
  4. An element-wise function applied to a vector (\(\rvz = f(\rvx)\), what is \(\frac{\partial \rvz}{\partial \rvx}\)?)

Since \(z_i = f(x_i)\), we have:

\[ (\frac{\partial z}{\partial x})_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial f(x_i)}{\partial x_j} = \begin{cases} f'(x_i) & \text{if } i=j \\ 0 & \text{otherwise} \end{cases} \Longleftrightarrow \frac{\partial \rvz}{\partial \rvx} = \text{diag}(f'(\rvx)) \]
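The first few identities are easy to confirm numerically. Below is a minimal NumPy sketch that checks identities 1, 3, and 4 on random data; the finite-difference helper and the chosen sizes are illustrative only.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of f at x."""
    out = f(x)
    J = np.zeros((out.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # n = 3, m = 4
x = rng.standard_normal(4)

print(np.allclose(numerical_jacobian(lambda v: W @ v, x), W))        # identity 1: W
print(np.allclose(numerical_jacobian(lambda v: v, x), np.eye(4)))    # identity 3: I
print(np.allclose(numerical_jacobian(np.tanh, x),                    # identity 4: diag(f'(x))
                  np.diag(1 - np.tanh(x) ** 2)))
```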
  5. Matrix times column vector with respect to the matrix (\(\rvz=\rmW\rvx\), what is \(\frac{\partial \rvz}{\partial \rmW}\)?)

Here \(\rvz\in \sR^n\) and \(\rmW\in \sR^{n\times m}\), so the Jacobian would be an \(n\times n\times m\) tensor, which is too large to work with directly. However, if we include a scalar loss function \(J\) and compute its gradient with respect to \(\rmW\), it becomes a \(1\times nm\) row vector, which we can reshape into an \(n\times m\) matrix:

\[ \frac{\partial J}{\partial \rmW} = \begin{bmatrix} \frac{\partial J}{\partial W_{11}} & \cdots & \frac{\partial J}{\partial W_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial J}{\partial W_{n1}} & \cdots & \frac{\partial J}{\partial W_{nm}} \end{bmatrix} \]

Since this matrix has the same shape as \(\rmW\), we can update \(\rmW\) by \(\rmW = \rmW - \mu \frac{\partial J}{\partial \rmW}\), where \(\mu\) is the learning rate. To compute \(\frac{\partial J}{\partial \rmW}\), we can use the chain rule:

\[ \frac{\partial J}{\partial \rmW} = \frac{\partial J}{\partial \rvz} \frac{\partial \rvz}{\partial \rmW} \]

Let us assume that \(\frac{\partial J}{\partial \rvz} = \boldsymbol{\delta}\in \sR^n\). Then we only need to compute \(\frac{\partial \rvz}{\partial \rmW}\). Though the full Jacobian is an \(n\times n\times m\) tensor, we can avoid forming it by taking the gradient with respect to a single weight \(W_{ij}\), i.e., \(\frac{\partial \rvz}{\partial W_{ij}}\in \sR^n\). We can compute it as follows:

\[ \begin{aligned} z_k &= \sum_{l=1}^m W_{kl} x_l\\ \frac{\partial z_k}{\partial W_{ij}} &= \sum_{l=1}^m x_l \frac{\partial W_{kl}}{\partial W_{ij}} = \begin{cases} x_j & \text{if } k=i \\ 0 & \text{otherwise} \end{cases} \end{aligned} \]

since \(\frac{\partial W_{kl}}{\partial W_{ij}}\) is \(1\) only when \(k=i\) and \(l=j\), and \(0\) otherwise.

That is to say:

\[ \frac{\partial z_k}{\partial W_{ij}} = \begin{bmatrix} 0\\ \vdots\\ x_j\\ \vdots\\ 0 \end{bmatrix} \leftarrow i\text{-th entry} \]

Now let us compute \(\frac{\partial J}{\partial W_{ij}}\):

\[ \frac{\partial J}{\partial W_{ij}} = \sum_{k=1}^n \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial W_{ij}} = \sum_{k=1}^n \delta_k \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j \]

That means, to get \(\frac{\partial J}{\partial \rmW}\), we need a matrix where entry \((i,j)\) is \(\delta_i x_j\). This is the outer product of \(\boldsymbol{\delta}\) and \(\rvx\):

\[ \frac{\partial J}{\partial \rmW} = \boldsymbol{\delta} \rvx^T \]
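To see identity 5 in action, the sketch below builds the toy scalar loss \(J = \boldsymbol{\delta}\cdot(\rmW\rvx)\), so that \(\frac{\partial J}{\partial \rvz} = \boldsymbol{\delta}\) by construction, and confirms numerically that \(\frac{\partial J}{\partial \rmW}\) equals the outer product \(\boldsymbol{\delta}\rvx^T\). The loss, sizes, and random values are illustrative; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)
delta = rng.standard_normal(n)           # plays the role of dJ/dz

J = lambda W: delta @ (W @ x)            # toy scalar loss

# Finite-difference gradient with respect to every entry of W.
eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        num_grad[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

print(np.allclose(num_grad, np.outer(delta, x)))  # True: delta x^T
```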
  6. Row vector times matrix with respect to the matrix (\(\rvz=\rvx\rmW\), what is \(\frac{\partial \rvz}{\partial \rmW}\)?)

A similar computation shows that \(\frac{\partial J}{\partial \rmW} = \rvx^T \boldsymbol{\delta}\), where \(\boldsymbol{\delta} = \frac{\partial J}{\partial \rvz}\) is now a row vector.
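The same kind of numerical check works for this row-vector convention; here the toy loss \(J = (\rvx\rmW)\cdot\boldsymbol{\delta}\) and the random values are illustrative, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.standard_normal((n, m))
x = rng.standard_normal(n)               # row-vector input
delta = rng.standard_normal(m)           # plays the role of dJ/dz

J = lambda W: (x @ W) @ delta            # toy scalar loss

eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        num_grad[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

print(np.allclose(num_grad, np.outer(x, delta)))  # True: x^T delta
```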

Gradient Layout

Simply follow this convention: the gradient has the same shape as the parameter it corresponds to, so that the SGD update can be applied directly.

MLP example

Let us consider a Multilayer Perceptron (MLP) with one hidden layer, using ReLU activation, and a Softmax output layer with Cross-Entropy loss.

Notation:

  • Input: \(\rvx \in \sR^n\)
  • Hidden weights: \(\rmW_1 \in \sR^{h \times n}\)
  • Hidden bias: \(\rvb_1 \in \sR^h\)
  • Output weights: \(\rmW_2 \in \sR^{m \times h}\)
  • Output bias: \(\rvb_2 \in \sR^m\)
  • Target labels (one-hot): \(\rvy \in \sR^m\)

1. Forward Pass

\[ \begin{aligned} \rvz_1 &= \rmW_1 \rvx + \rvb_1 \\ \rvh &= \text{ReLU}(\rvz_1) = \max(0, \rvz_1) \\ \rvz_2 &= \rmW_2 \rvh + \rvb_2 \\ \hat{\rvy} &= \text{Softmax}(\rvz_2) \quad \text{where } \hat{y}_i = \frac{e^{z_{2,i}}}{\sum_j e^{z_{2,j}}} \\ J &= \text{CrossEntropy}(\rvy, \hat{\rvy}) = -\sum_{k=1}^m y_k \log(\hat{y}_k) \end{aligned} \]
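Here is a minimal NumPy sketch of this forward pass; the dimensions, random initialization, and target class are illustrative and not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 4, 5, 3                      # input, hidden, output sizes

x  = rng.standard_normal(n)
W1 = rng.standard_normal((h, n)); b1 = np.zeros(h)
W2 = rng.standard_normal((m, h)); b2 = np.zeros(m)
y  = np.eye(m)[1]                      # one-hot target (class 1)

z1 = W1 @ x + b1
hid = np.maximum(0.0, z1)              # ReLU
z2 = W2 @ hid + b2
y_hat = np.exp(z2 - z2.max())          # softmax (shifted for numerical stability)
y_hat /= y_hat.sum()
J = -np.sum(y * np.log(y_hat))         # cross-entropy loss
print(J)
```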

2. Backward Pass

We compute gradients starting from the loss \(J\) and propagating backward.

Step A: Gradient w.r.t. Logits (\(\rvz_2\))

The derivative of the Softmax function combined with the Cross-Entropy loss simplifies to the difference between prediction and target:

\[ \boldsymbol{\delta}_2 = \frac{\partial J}{\partial \rvz_2} = \hat{\rvy} - \rvy \in \sR^m \]

Step B: Gradient w.r.t. Output Parameters (\(\rmW_2, \rvb_2\))

Using Identity 5 (outer product of error and input):

\[ \frac{\partial J}{\partial \rmW_2} = \boldsymbol{\delta}_2 \rvh^T \]

For the bias:

\[ \frac{\partial J}{\partial \rvb_2} = \boldsymbol{\delta}_2 \]

Step C: Gradient w.r.t. Hidden Activation (\(\rvh\))

Using Identity 1 (backpropagating through the matrix multiplication):

\[ \frac{\partial J}{\partial \rvh} = \frac{\partial J}{\partial \rvz_2} \frac{\partial \rvz_2}{\partial \rvh} = \rmW_2^T \boldsymbol{\delta}_2 \]

Step D: Gradient w.r.t. Hidden Pre-activation (\(\rvz_1\))

Using Identity 4 (element-wise activation derivative): the derivative of ReLU is the indicator function \(\mathbb{I}(\rvz_1 > 0)\).

\[ \boldsymbol{\delta}_1 = \frac{\partial J}{\partial \rvz_1} = \frac{\partial J}{\partial \rvh} \odot \mathbb{I}(\rvz_1 > 0) = (\rmW_2^T \boldsymbol{\delta}_2) \odot \mathbb{I}(\rvz_1 > 0) \]

(Note: \(\odot\) denotes element-wise multiplication)

Step E: Gradient w.r.t. Hidden Parameters (\(\rmW_1, \rvb_1\))

Applying Identity 5 again:

\[ \frac{\partial J}{\partial \rmW_1} = \boldsymbol{\delta}_1 \rvx^T \]
\[ \frac{\partial J}{\partial \rvb_1} = \boldsymbol{\delta}_1 \]
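Putting Steps A through E together, the following self-contained NumPy sketch repeats the forward pass, computes all five gradients, and spot-checks one entry of \(\frac{\partial J}{\partial \rmW_1}\) against finite differences. Sizes, the random seed, and the target class are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 4, 5, 3
x  = rng.standard_normal(n)
W1 = rng.standard_normal((h, n)); b1 = np.zeros(h)
W2 = rng.standard_normal((m, h)); b2 = np.zeros(m)
y  = np.eye(m)[1]                      # one-hot target

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    hid = np.maximum(0.0, z1)          # ReLU
    z2 = W2 @ hid + b2
    y_hat = np.exp(z2 - z2.max()); y_hat /= y_hat.sum()
    J = -np.sum(y * np.log(y_hat))
    return J, z1, hid, y_hat

J, z1, hid, y_hat = forward(W1, b1, W2, b2)

delta2 = y_hat - y                     # Step A
dW2 = np.outer(delta2, hid)            # Step B
db2 = delta2
dh  = W2.T @ delta2                    # Step C
delta1 = dh * (z1 > 0)                 # Step D
dW1 = np.outer(delta1, x)              # Step E
db1 = delta1

# Finite-difference check on one entry of W1.
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
num = (forward(W1 + E, b1, W2, b2)[0] - forward(W1 - E, b1, W2, b2)[0]) / (2 * eps)
print(np.isclose(num, dW1[0, 0]))      # True (barring an unlucky ReLU kink)
```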

CNN example

Notation:

  • Input feature map: \(\rmX \in \sR^{H \times W}\)
  • Filter (kernel): \(\rmK \in \sR^{k \times k}\)
  • Output feature map: \(\rmZ = \rmX * \rmK\) (valid convolution, stride 1)
  • Loss gradient w.r.t output map: \(\boldsymbol{\delta} = \frac{\partial J}{\partial \rmZ} \in \sR^{(H-k+1) \times (W-k+1)}\)

1. Gradient w.r.t. Filter (\(\rmK\))

We are looking for \(\frac{\partial J}{\partial \rmK}\). Each weight in the filter contributes to every position in the output map, so its gradient accumulates contributions from all of them. This results in a convolution between the Input \(\rmX\) and the Gradient \(\boldsymbol{\delta}\):

\[ \frac{\partial J}{\partial \rmK} = \rmX * \boldsymbol{\delta}\in \sR^{k\times k} \]

(Note: Depending on convention, this is technically a cross-correlation).

2. Gradient w.r.t. Input (\(\rmX\))

We are looking for \(\frac{\partial J}{\partial \rmX}\) to propagate the error to previous layers. This operation turns out to be a convolution between the Gradient \(\boldsymbol{\delta}\) (zero-padded by \(k-1\) on each side, i.e. a "full" convolution) and the 180-degree rotated Filter \(\rmK\):

\[ \frac{\partial J}{\partial \rmX} = \boldsymbol{\delta} * \text{rot180}(\rmK) \]

This maintains the pattern:

  1. Weight gradient: the outer-product analog (correlation of the input with the delta).
  2. Input gradient: the transpose analog (convolution of the delta with the rotated weights).
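To make these conventions concrete, here is a loop-based NumPy sketch of both gradients, verified entrywise with finite differences against the toy loss \(J = \sum_{u,v} \delta_{uv} Z_{uv}\) (so that \(\frac{\partial J}{\partial \rmZ} = \boldsymbol{\delta}\) by construction). The helper `corr2d`, the shapes, and the random values are illustrative, not part of the original notes.

```python
import numpy as np

def corr2d(A, B):
    """Valid 2-D cross-correlation of A with kernel B."""
    Ha, Wa = A.shape
    Hb, Wb = B.shape
    out = np.zeros((Ha - Hb + 1, Wa - Wb + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            out[u, v] = np.sum(A[u:u + Hb, v:v + Wb] * B)
    return out

rng = np.random.default_rng(0)
H, W, k = 6, 6, 3
X = rng.standard_normal((H, W))
K = rng.standard_normal((k, k))
delta = rng.standard_normal((H - k + 1, W - k + 1))    # pretend dJ/dZ

loss = lambda X, K: np.sum(delta * corr2d(X, K))       # toy scalar loss

dK = corr2d(X, delta)                                  # gradient w.r.t. the filter
pad = np.pad(delta, k - 1)                             # zero-pad for the "full" pass
dX = corr2d(pad, np.rot90(K, 2))                       # gradient w.r.t. the input

# Finite-difference spot checks on single entries.
eps = 1e-6
EK = np.zeros_like(K); EK[1, 2] = eps
EX = np.zeros_like(X); EX[2, 3] = eps
print(np.isclose((loss(X, K + EK) - loss(X, K - EK)) / (2 * eps), dK[1, 2]))  # True
print(np.isclose((loss(X + EX, K) - loss(X - EX, K)) / (2 * eps), dX[2, 3]))  # True
```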

3. Pooling Layers

Pooling layers have no learnable weights, so we only compute the gradient with respect to the input \(\rmX\). Let \(\rmZ = \text{Pool}(\rmX)\) and \(\boldsymbol{\delta} = \frac{\partial J}{\partial \rmZ}\). A short sketch of both backward passes follows the list below.

  • Max Pooling: The forward pass selects the maximum value in each window. The backward pass acts as a "router": it passes the gradient only to the pixel that was the max, and 0 to others.

    \[ \frac{\partial J}{\partial x_{ij}} = \begin{cases} \delta_{uv} & \text{if } x_{ij} = \max(\text{window}_{uv}) \\ 0 & \text{otherwise} \end{cases} \]

    (This requires caching the indices of the max values during the forward pass).

  • Average Pooling: The forward pass computes the mean: \(z_{uv} = \frac{1}{k^2} \sum x_{ij}\). The backward pass distributes the gradient equally to all inputs in the window.

    \[ \frac{\partial J}{\partial x_{ij}} = \frac{1}{k^2} \delta_{uv} \]

    (This acts like a "reverse" average filter, upsampling the gradient map).
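Below is a minimal NumPy sketch of both backward passes for a single \(4\times 4\) input with \(2\times 2\) windows and stride 2; the input and the incoming gradient \(\boldsymbol{\delta}\) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
delta = rng.standard_normal((2, 2))          # dJ/dZ from the layer above
p = 2                                        # pooling window size and stride

dX_max = np.zeros_like(X)
dX_avg = np.zeros_like(X)
for u in range(2):
    for v in range(2):
        win = X[p * u:p * u + p, p * v:p * v + p]
        i, j = np.unravel_index(np.argmax(win), win.shape)
        dX_max[p * u + i, p * v + j] = delta[u, v]                       # route to the max
        dX_avg[p * u:p * u + p, p * v:p * v + p] = delta[u, v] / p ** 2  # spread evenly

print(dX_max)   # nonzero only at the argmax of each window
print(dX_avg)   # each window filled with delta[u, v] / 4
```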

References

gradient-notes.pdf