Monocular Depth Estimation

MsDepth - NeurIPS'14

Scale invariant error

For a predicted depth map \(y\) and ground-truth depth map \(y^\star\), each with \(n\) pixels indexed by \(i\), the scale-invariant error in log space is defined as:

\[ D(y, y^\star) = \frac{1}{n}\sum_{i=1}^n (\log y_i - \log y_i^\star + \alpha(y, y^\star))^2, \]

where \(\alpha(y, y^\star) = \frac{1}{n}\sum_{j=1}^n(\log y_j^\star - \log y_j)\). In other words, \(\alpha(y, y^\star)\) is the scale correction in log space: \(e^\alpha\) is the multiplicative scale that best aligns the predictions with the ground truth.
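As a quick sanity check, here is a small NumPy sketch (depth values are made up for illustration) showing that rescaling the predictions leaves the error unchanged:

```python
import numpy as np

def scale_invariant_error(y, y_star):
    """D(y, y*): mean squared log-residual after removing
    the best log-space shift alpha."""
    log_diff = np.log(y) - np.log(y_star)
    alpha = -log_diff.mean()          # alpha(y, y*) = mean(log y* - log y)
    return np.mean((log_diff + alpha) ** 2)

y = np.array([1.0, 2.0, 4.0])         # hypothetical predicted depths
y_star = np.array([1.5, 2.5, 3.5])    # hypothetical ground truth

d1 = scale_invariant_error(y, y_star)
d2 = scale_invariant_error(10.0 * y, y_star)  # rescale predictions by 10x
assert np.isclose(d1, d2)             # error unchanged: scale invariant
```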

Since \(\alpha\) is minus the mean of the log-residuals, \(D\) is exactly the variance of \(\log y_i - \log y_i^\star\). Using the identity \(\frac{1}{n}\sum_i (a_i - \bar{a})^2 = \frac{1}{2n^2}\sum_{i,j}(a_i - a_j)^2\), we can rewrite it as a sum over pixel pairs:

\[ \begin{align} D(y, y^\star) &= \frac{1}{n}\sum_{i=1}^n \Big(\log y_i - \log y_i^\star + \frac{1}{n}\sum_{j=1}^n(\log y_j^\star - \log y_j)\Big)^2\\ &= \frac{1}{2n^2} \sum_{i, j} (\log y_i - \log y_i^\star + \log y_j^\star - \log y_j)^2\\ &= \frac{1}{2n^2} \sum_{i, j} \big((\log y_i - \log y_j) - (\log y_i^\star -\log y_j^\star)\big)^2 \end{align} \]

According to this pairwise form, intuitively, we are forcing the difference in log depth between every pair of pixels — i.e. their depth ratio — to match the ground truth. By defining \(d_i = \log y_i - \log y_i^\star\), we have

\[ \begin{align} D(y, y^\star) &= \frac{1}{2n^2} \sum_{i, j} (d_i - d_j)^2\\ &= \frac{1}{2n^2} \sum_{i, j} (d_i^2 + d_j^2 - 2 d_i d_j)\\ &= \frac{1}{n}\sum_{i} d_i^2 - \frac{1}{n^2} \sum_{i, j} d_i d_j\\ &= \frac{1}{n}\sum_{i} d_i^2 - \frac{1}{n^2} \Big(\sum_{i}d_i\Big)^2 \end{align} \]

That is, when the per-pixel errors \(d_i\) share the same sign, the cross term \(-\frac{1}{n^2} \sum_{i, j} d_i d_j\) reduces the loss: a consistent over- or under-estimation (a pure scale error) is penalized less than an inconsistent one. In practice, MsDepth uses:

\[ L(y, y^\star) = \frac{1}{n}\sum_{i} d_i^2 - \frac{\lambda}{n^2} (\sum_{i}d_i)^2 \]

with \(\lambda=0.5\). Here \(\lambda=0\) gives the ordinary L2 loss in log space, while \(\lambda=1\) recovers the scale-invariant error \(D\). Setting \(\lambda=0.5\) is thus a trade-off: the absolute term keeps the prediction at the correct global scale, while the scale-invariant term encourages better relative structure.
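A minimal NumPy sketch of this loss (function name and test values are mine, not from the paper); at \(\lambda=1\) it matches the pairwise form of \(D\) derived above:

```python
import numpy as np

def msdepth_loss(y, y_star, lam=0.5):
    """L(y, y*) = (1/n) sum_i d_i^2 - (lam/n^2) (sum_i d_i)^2,
    with d_i = log y_i - log y_i*.
    lam=0: plain L2 in log space; lam=1: fully scale-invariant."""
    d = np.log(y) - np.log(y_star)
    return np.mean(d ** 2) - lam * d.sum() ** 2 / d.size ** 2

y = np.array([1.0, 2.0, 4.0])       # hypothetical predictions
y_star = np.array([1.5, 2.5, 3.5])  # hypothetical ground truth

# lam=1 agrees with the pairwise form (1/2n^2) sum_{i,j} (d_i - d_j)^2
d = np.log(y) - np.log(y_star)
pairwise = np.mean((d[:, None] - d[None, :]) ** 2) / 2
assert np.isclose(msdepth_loss(y, y_star, lam=1.0), pairwise)
```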

MsDepth-v2 - ICCV'15

Depth loss

\[ L(y, y^\star) = \frac{1}{n}\sum_{i} d_i^2 - \frac{\lambda}{n^2} (\sum_{i}d_i)^2 +\frac{1}{n}\sum_{i} [(\nabla_x d_i)^2 + (\nabla_y d_i)^2] \]

The extra gradient terms encourage the predictions to have local structure (e.g. depth edges) similar to the ground truth: \((\nabla_x d_i)^2\) and \((\nabla_y d_i)^2\) vanish exactly where the predicted and ground-truth log-depth gradients agree.
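A NumPy sketch of the full loss (my own naming; finite differences stand in for \(\nabla_x, \nabla_y\), and averaging over the boundary-trimmed gradient maps is a simplifying assumption):

```python
import numpy as np

def msdepth_v2_loss(y, y_star, lam=0.5):
    """Scale-invariant loss plus first-order gradient matching,
    for 2D depth maps y, y_star of equal shape."""
    d = np.log(y) - np.log(y_star)          # log-residual map
    scale_inv = np.mean(d ** 2) - lam * d.sum() ** 2 / d.size ** 2
    grad_x = np.diff(d, axis=1)             # horizontal finite differences
    grad_y = np.diff(d, axis=0)             # vertical finite differences
    grad_match = np.mean(grad_x ** 2) + np.mean(grad_y ** 2)
    return scale_inv + grad_match

y_star = np.array([[1.0, 2.0], [3.0, 4.0]])  # hypothetical ground truth
# a perfect prediction gives zero loss
assert np.isclose(msdepth_v2_loss(y_star, y_star), 0.0)
# with lam=1, a globally rescaled prediction also gives zero loss
assert np.isclose(msdepth_v2_loss(2.0 * y_star, y_star, lam=1.0), 0.0)
```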