Scene Coordinate Regression
Scene coordinate regression is the problem of estimating the 3D world coordinate corresponding to each 2D image pixel. It is a common approach to camera localization, since the camera pose can be estimated from the resulting 2D-3D correspondences with the PnP-RANSAC algorithm.
Formally, let us define a set of mapped images \(I^m_i\) and their corresponding 3D coordinates \(Y^m_i\). Our goal is to train a model \(f\), with parameters \(w\), to memorize the mapping between \(I^m_i\) and \(Y^m_i\):
\[
Y^m_i = f(I^m_i; w)
\]
Once trained, we can estimate the 3D coordinates of a new query image \(I^q_j\):
\[
\hat{Y}^q_j = f(I^q_j; w)
\]
We then estimate the camera pose \(H\) (rotation \(R\) and translation \(t\)) from the predicted 2D-3D correspondences using the PnP-RANSAC algorithm:
\[
H = [R \mid t] = \mathrm{PnP\text{-}RANSAC}\big(\hat{Y}^q_j, K\big)
\]
where \(K\) is the camera intrinsic matrix.
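To make the 2D-3D relationship concrete, here is a minimal NumPy sketch of the pinhole projection that PnP-RANSAC inverts: a world point \(X\) projects to pixel \(u \sim K(RX + t)\). All numeric values (intrinsics, pose, point) are illustrative, not from the text.

```python
import numpy as np

# Illustrative intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                      # identity rotation for simplicity
t = np.array([0.0, 0.0, 2.0])      # camera 2 units in front of the origin

X = np.array([0.5, -0.25, 1.0])    # a 3D scene coordinate (world frame)
x_cam = R @ X + t                  # world frame -> camera frame
u_hom = K @ x_cam                  # camera frame -> homogeneous pixel coords
u = u_hom[:2] / u_hom[2]           # perspective division -> 2D pixel
print(u)
```

PnP-RANSAC solves the inverse problem: given many such \((u, X)\) pairs (with outliers), recover \(R\) and \(t\).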
DSAC: Differentiable RANSAC
Previous methods are usually trained with a coordinate regression loss, which does not necessarily correlate with the final pose error. Moreover, RANSAC is a non-differentiable algorithm, which prevents training the pipeline end-to-end.
In standard RANSAC, hypotheses are scored using a function \(s(h_J, Y^w; v)\) (parameterized by a Score CNN with weights \(v\)), where \(Y^w\) denotes the scene coordinates predicted with weights \(w\). The algorithm selects the hypothesis \(h_{\mathrm{AM}}\) that maximizes this score:
\[
h_{\mathrm{AM}} = \underset{h_J}{\arg\max}\; s(h_J, Y^w; v)
\]
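A minimal sketch of the hard argmax selection, with made-up pose-hypothesis vectors and scores standing in for the Score CNN output; the discrete choice itself carries no gradient, which is the problem DSAC addresses:

```python
import numpy as np

# Toy pose hypotheses (rows) and their learned scores s(h_J, Y^w; v).
hypotheses = np.array([[0.1, 0.2],
                       [0.4, 0.1],
                       [0.3, 0.3]])
scores = np.array([0.2, 1.5, 0.7])

# Hard selection: argmax is piecewise constant, so no gradient flows
# through the choice of hypothesis.
h_AM = hypotheses[np.argmax(scores)]
print(h_AM)                         # -> [0.4 0.1]
```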
To enable end-to-end training, DSAC replaces the deterministic argmax selection with 1) soft argmax selection (SoftAM) and 2) probabilistic selection (DSAC).
SoftAM
SoftAM replaces the argmax with a soft argmax, i.e., a weighted average of the hypotheses:
\[
h_{\mathrm{SoftAM}} = \sum_J P(J \mid v, w)\, h_J
\]
where \(P(J \mid v, w)\) is the probability of selecting hypothesis \(h_J\):
\[
P(J \mid v, w) = \frac{\exp\big(s(h_J, Y^w; v)\big)}{\sum_{J'} \exp\big(s(h_{J'}, Y^w; v)\big)}
\]
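The weighted average can be sketched in NumPy as follows (toy hypotheses and scores; note that averaging pose parameters componentwise is only a reasonable approximation when the hypotheses are close to one another):

```python
import numpy as np

def soft_argmax_hypothesis(hypotheses, scores):
    """Softmax-weighted average of pose hypotheses (SoftAM).
    hypotheses: (N, D) array of pose parameter vectors h_J.
    scores: (N,) array of scores s(h_J, Y^w; v)."""
    s = scores - scores.max()               # subtract max for numerical stability
    p = np.exp(s) / np.exp(s).sum()         # P(J | v, w)
    return p @ hypotheses                   # sum_J P(J | v, w) * h_J

hyps = np.array([[0.0, 0.0], [1.0, 1.0]])   # two toy hypotheses
scores = np.array([0.0, 0.0])               # equal scores -> plain average
print(soft_argmax_hypothesis(hyps, scores)) # -> [0.5 0.5]
```

Because the average is a smooth function of the scores, gradients flow into the Score CNN weights \(v\).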
DSAC
Instead of a hard selection, DSAC chooses a hypothesis probabilistically according to the softmax distribution over scores:
\[
h_{\mathrm{DSAC}} = h_J \quad \text{with} \quad J \sim P(J \mid v, w)
\]
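A minimal NumPy sketch of the probabilistic selection, with toy hypotheses and scores:

```python
import numpy as np

def sample_hypothesis(hypotheses, scores, rng):
    """Sample index J ~ P(J | v, w) = softmax(scores), return h_J."""
    s = scores - scores.max()                # numerical stability
    p = np.exp(s) / np.exp(s).sum()          # P(J | v, w)
    J = rng.choice(len(hypotheses), p=p)     # stochastic hypothesis selection
    return hypotheses[J]

rng = np.random.default_rng(0)
hyps = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
scores = np.array([0.0, 50.0, 0.0])          # strongly favors the middle one
print(sample_hypothesis(hyps, scores, rng))  # almost surely [1. 1.]
```

The sampling step itself is still non-differentiable; what makes end-to-end training possible is taking gradients of the *expectation* over this distribution, as shown next.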
Since the output is now stochastic, we minimize the expected loss over the training set \(\mathcal{I}\):
\[
\mathcal{L}(v, w) = \sum_{I \in \mathcal{I}} \mathbb{E}_{J \sim P(J \mid v, w)}\Big[ \ell\big(R(h_J, Y^w), h^*\big) \Big]
\]
where \(R\) is the refinement function and \(h^*\) is the ground truth pose.
The derivative of the expected loss with respect to the parameters (e.g., the coordinate weights \(w\)) can be derived using the log-likelihood ratio trick (as in policy gradient methods):
\[
\frac{\partial}{\partial w}\, \mathbb{E}_{J}\big[\ell(\cdot)\big] = \mathbb{E}_{J}\Big[ \ell(\cdot)\, \frac{\partial}{\partial w} \log P(J \mid v, w) + \frac{\partial}{\partial w}\, \ell(\cdot) \Big]
\]
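The trick can be checked numerically on a toy discrete distribution: for a softmax over logits \(\theta\), the analytic gradient of the expected loss matches the score-function form exactly. The losses and logits below are made up, and the per-hypothesis loss is treated as independent of \(\theta\), so only the first (score-function) term of the derivative appears:

```python
import numpy as np

logits = np.array([1.0, 0.5, -0.2])   # toy scores acting as the parameters theta
losses = np.array([2.0, 1.0, 3.0])    # toy per-hypothesis losses ell(h_J)

p = np.exp(logits - logits.max())
p /= p.sum()                          # P(J) = softmax(theta)

# Analytic gradient of E[ell] = sum_J p_J * ell_J w.r.t. theta_k:
#   d/d theta_k = p_k * (ell_k - E[ell])
grad_analytic = p * (losses - p @ losses)

# Score-function form: E_J[ ell_J * d/d theta_k log P(J) ], with
#   d/d theta_k log P(J) = 1{J=k} - p_k, evaluated exactly over all J.
dlogp = np.eye(len(p)) - p            # row J, column k: 1{J=k} - p_k
grad_reinforce = (p * losses) @ dlogp

print(np.allclose(grad_analytic, grad_reinforce))  # -> True
```

In DSAC the expectation is instead estimated by sampling hypotheses, which makes the gradient estimate unbiased but high-variance, the usual trade-off of policy-gradient-style training.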
This formulation allows gradients to flow through both the selection probabilities (updating the Score CNN \(v\)) and the hypothesis generation quality (updating the Coordinate CNN \(w\)).