Monocular Depth Estimation
MsDepth - NeurIPS'14
Scale-invariant error
For a predicted depth map \(y\) and ground-truth depth map \(y^\star\), each with \(n\) pixels indexed by \(i\), the scale-invariant error in log space is defined as:
\[
D(y, y^\star) = \frac{1}{n}\sum_{i=1}^n \left(\log y_i - \log y_i^\star + \alpha(y, y^\star)\right)^2,
\]
where \(\alpha(y, y^\star) = \frac{1}{n}\sum_{j=1}^n(\log y_j^\star - \log y_j)\). In other words, \(\alpha(y, y^\star)\) is the optimal offset in log space, i.e. \(e^\alpha\) is the scale that best aligns the predictions with the ground truth.
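A minimal NumPy sketch of this error (the function name and the `eps` clamp are my own; the paper evaluates only over pixels with valid ground-truth depth):

```python
import numpy as np

def scale_invariant_error(y, y_star, eps=1e-8):
    """Scale-invariant log-space error between predicted and ground-truth depth."""
    d = np.log(y + eps) - np.log(y_star + eps)  # per-pixel log difference
    alpha = -d.mean()                           # optimal log-space offset alpha(y, y*)
    return np.mean((d + alpha) ** 2)            # (1/n) sum_i (log y_i - log y_i* + alpha)^2
```

Because \(\alpha\) absorbs the best global scale, multiplying the prediction by any constant leaves the error unchanged; the error equals the variance of the per-pixel log differences.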
This can be rewritten as
\[
D(y, y^\star) = \frac{1}{2n^2}\sum_{i,j}\left((\log y_i - \log y_j) - (\log y_i^\star - \log y_j^\star)\right)^2.
\]
Intuitively, this form penalizes every pair of pixels whose predicted log-depth difference deviates from the ground-truth one, i.e. it forces the relative depth between pixel pairs to match the ground truth. Defining \(d_i = \log y_i - \log y_i^\star\), we have
\[
D(y, y^\star) = \frac{1}{n}\sum_i d_i^2 - \frac{1}{n^2}\sum_{i,j} d_i d_j = \frac{1}{n}\sum_i d_i^2 - \frac{1}{n^2}\Big(\sum_i d_i\Big)^2.
\]
That is, errors pointing in the same direction partially cancel through the cross term \(-\frac{1}{n^2}\sum_{i,j} d_i d_j\), yielding a smaller loss. In practice, MsDepth trains with:
\[
L(y, y^\star) = \frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\Big(\sum_i d_i\Big)^2
\]
with \(\lambda = 0.5\). Here \(\lambda = 0\) recovers the plain \(\ell_2\) loss in log space, while \(\lambda = 1\) gives the fully scale-invariant loss. Setting \(\lambda = 0.5\) is thus a trade-off: the prediction is still tied to the absolute scale, while the scale-invariant term encourages better relative detail.
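As a sketch of the training loss with the \(\lambda\) knob (the function name and `eps` clamp are mine; the paper restricts the sums to pixels with valid ground truth):

```python
import numpy as np

def msdepth_loss(y, y_star, lam=0.5, eps=1e-8):
    """L = (1/n) sum_i d_i^2 - (lam/n^2) (sum_i d_i)^2, with d = log y - log y*.

    lam = 0 -> plain L2 loss in log space; lam = 1 -> fully scale-invariant.
    """
    d = np.log(y + eps) - np.log(y_star + eps)
    # (1/n^2)(sum_i d_i)^2 equals mean(d)^2, so the loss reduces to:
    return np.mean(d ** 2) - lam * d.mean() ** 2
```

For a prediction off by a constant factor \(c\), the loss is \((1-\lambda)(\log c)^2\): zero at \(\lambda = 1\), the full \((\log c)^2\) penalty at \(\lambda = 0\).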
MsDepth-v2 - ICCV'15
Depth loss
An extra first-order gradient-matching term encourages the prediction to have local structure similar to the ground truth:
\[
L(y, y^\star) = \frac{1}{n}\sum_i d_i^2 - \frac{1}{2n^2}\Big(\sum_i d_i\Big)^2 + \frac{1}{n}\sum_i\left[(\nabla_x d_i)^2 + (\nabla_y d_i)^2\right],
\]
where \(d_i = \log y_i - \log y_i^\star\) and \(\nabla_x d_i\), \(\nabla_y d_i\) are the horizontal and vertical image gradients of \(d\). Matching the gradients of \(d\) pushes depth edges in the prediction to coincide with depth edges in the ground truth.
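A minimal NumPy sketch of this depth loss (the function name is mine, and I approximate the gradients with simple forward differences, whereas the paper applies difference filters over valid pixels only):

```python
import numpy as np

def depth_loss_v2(y, y_star, eps=1e-8):
    """MsDepth-v2-style depth loss: scale-invariant term (lambda = 0.5)
    plus first-order gradient matching on the log-depth difference d."""
    d = np.log(y + eps) - np.log(y_star + eps)       # (H, W) log-depth difference
    si_term = np.mean(d ** 2) - 0.5 * d.mean() ** 2  # scale-invariant part
    gx = np.diff(d, axis=1)                          # horizontal forward difference
    gy = np.diff(d, axis=0)                          # vertical forward difference
    grad_term = np.mean(gx ** 2) + np.mean(gy ** 2)  # gradient-matching part
    return si_term + grad_term
```

For a prediction off by a constant factor, \(d\) is constant, so the gradient term vanishes and only the residual \(\tfrac{1}{2}(\log c)^2\) scale penalty remains.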