Scene Coordinate Regression

Scene coordinate regression is the problem of estimating the 3D world coordinates corresponding to 2D image pixels. It is a common approach to camera localization, since the camera pose can be estimated from 2D-3D correspondences using the PnP-RANSAC algorithm.

Formally, let us define a set of mapping images \(I^m_i\) and their corresponding 3D scene coordinates \(Y^m_i\). Our goal is to train a model \(f_\theta\) to memorize the mapping from \(I^m_i\) to \(Y^m_i\):

\[ f_\theta(I^m_i) \rightarrow \hat{Y}^m_i \]

Once learned, we can estimate the 3D coordinates from a new query image \(I^q_j\):

\[ f_\theta(I^q_j) \rightarrow \hat{Y}^q_j \]

We then estimate the camera pose \(H\) (rotation \(R\) and translation \(t\)) from the predicted 2D-3D correspondences using the PnP-RANSAC algorithm:

\[ H = \text{PnP-RANSAC}(I^q_j, \hat{Y}^q_j, K) \]

where \(K\) is the camera intrinsic matrix.
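To make the pipeline concrete, here is a minimal NumPy sketch of the geometry behind these equations: synthetic scene coordinates are projected into an image under a known pose, some correspondences are corrupted, and candidate poses are scored RANSAC-style by inlier count. All numbers (intrinsics, poses, thresholds) are made up for illustration; a real system would obtain \(\hat{Y}^q_j\) from the network and run a full PnP-RANSAC solver such as OpenCV's cv2.solvePnPRansac rather than scoring a fixed hypothesis list.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pinhole intrinsics K (fx = fy = 500, principal point (320, 240))
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])

def rot_y(a):
    """Rotation about the y-axis by angle a (radians)."""
    return np.array([[np.cos(a), 0., np.sin(a)],
                     [0., 1., 0.],
                     [-np.sin(a), 0., np.cos(a)]])

def project(Y, R, t):
    """Project world points Y into pixels: x ~ K (R Y + t)."""
    cam = Y @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

# Ground-truth pose and scene coordinates Y (stand-in for network predictions)
R_true, t_true = rot_y(0.1), np.array([0.2, -0.1, 2.0])
Y = rng.uniform(-1., 1., size=(100, 3))
pixels = project(Y, R_true, t_true)

# Corrupt 20% of the 2D-3D matches to simulate bad coordinate predictions
bad = rng.choice(100, 20, replace=False)
pixels[bad] += rng.uniform(20., 50., size=(20, 2))

def inlier_count(R, t, thresh=2.0):
    """RANSAC-style score: number of points within a reprojection threshold."""
    err = np.linalg.norm(project(Y, R, t) - pixels, axis=1)
    return int((err < thresh).sum())

# Candidate hypotheses: the consistent pose plus two perturbed ones
hypotheses = [(R_true, t_true),
              (rot_y(0.15), t_true + np.array([0.1, 0., 0.])),
              (rot_y(0.05), t_true + np.array([0., 0.1, 0.]))]
scores = [inlier_count(R, t) for R, t in hypotheses]
best = int(np.argmax(scores))  # the consistent pose collects the most inliers
```

The argmax over inlier counts is the hard selection that the DSAC section below replaces with differentiable alternatives.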

DSAC: Differentiable RANSAC

Previous methods train the scene coordinate network with a coordinate regression loss, which does not necessarily correlate with the final pose error. Moreover, RANSAC is a non-differentiable algorithm, so the full pipeline cannot be trained end-to-end.

In standard RANSAC with learned scoring, each pose hypothesis \(h_J\) is scored by a function \(s(h_J, Y^w; v)\), where \(Y^w\) denotes the scene coordinates predicted by the Coordinate CNN with weights \(w\), and the score is produced by a Score CNN with weights \(v\). The algorithm selects the hypothesis \(h_{\mathrm{AM}}\) that maximizes this score:

\[ h_{\mathrm{AM}}^{w,v} = \operatorname*{argmax}_{h_J} s(h_J, Y^w; v) \]

To enable end-to-end training, DSAC replaces the deterministic argmax selection with either 1) soft argmax selection (SoftAM) or 2) probabilistic selection (DSAC).

SoftAM

SoftAM replaces the argmax selection with a soft argmax based on the weighted average of the hypotheses:

\[ h_{\mathrm{SoftAM}}^{w,v} = \sum_{J} P(J|v,w) h_J^w \]

where \(P(J|v,w)\) is the probability of selecting hypothesis \(J\), given by a softmax over the scores:

\[ P(J|v,w) = \frac{\exp(s(h_J^w, Y^w; v))}{\sum_{J'} \exp(s(h_{J'}^w, Y^w; v))} \]
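A minimal NumPy sketch of SoftAM, assuming hypotheses are represented as fixed-length pose vectors and their scores have already been computed (both made up here):

```python
import numpy as np

def softam(hypotheses, scores):
    """Soft argmax: probability-weighted average of pose hypotheses.

    hypotheses: (N, D) array, one pose vector h_J per row
    scores:     (N,) array of scores s(h_J, Y; v)
    """
    s = scores - scores.max()            # numerically stabilized softmax
    p = np.exp(s) / np.exp(s).sum()      # P(J | v, w)
    return p @ hypotheses                # sum_J P(J | v, w) h_J

# Toy example: three 6-DoF pose vectors (axis-angle + translation), made-up scores
h = np.array([[0.10, 0., 0., 0.2, -0.1, 2.0],
              [0.12, 0., 0., 0.2, -0.1, 2.1],
              [0.50, 0., 0., 1.0,  0.5, 3.0]])  # an outlier hypothesis
s = np.array([4.0, 3.5, -2.0])                  # outlier scored low
h_soft = softam(h, s)                           # dominated by the two good poses
```

One caveat: averaging pose vectors is only meaningful when the hypotheses are close together; the weighted average can blend incompatible poses, which is part of the motivation for probabilistic selection.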

DSAC

Instead of hard selection, a hypothesis is chosen probabilistically based on a softmax distribution of the score. The selected hypothesis is sampled as:

\[ h_{\mathrm{DSAC}}^{w,v} = h_J, \quad \text{with} \quad J \sim P(J|v,w) \]
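This sampling step can be sketched in a few lines of NumPy; the scores are made-up numbers standing in for Score CNN outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_hypothesis(scores, rng):
    """Sample a hypothesis index J ~ P(J | v, w) = softmax(scores)."""
    s = scores - scores.max()            # numerically stabilized softmax
    p = np.exp(s) / np.exp(s).sum()
    return rng.choice(len(scores), p=p), p

scores = np.array([2.0, 1.0, 0.0])
J, p = select_hypothesis(scores, rng)

# Over many draws, the selection frequencies match the softmax distribution
draws = rng.choice(len(scores), size=100_000, p=p)
freq = np.bincount(draws, minlength=3) / draws.size
```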

Since the output is now stochastic, we minimize the expected loss over the training set \(\mathcal{I}\):

\[ \tilde{w}, \tilde{v} = \operatorname*{argmin}_{w,v} \sum_{I \in \mathcal{I}} \mathbb{E}_{J \sim P(J|v,w)} \left[ \ell(R(h_J^w, Y^w), h^*) \right] \]

where \(R\) is the refinement function and \(h^*\) is the ground truth pose.

The derivative of the expected loss with respect to the parameters (e.g., \(w\)) can be derived using the log-likelihood ratio trick (similar to Policy Gradient):

\[ \frac{\partial}{\partial w}\mathbb{E}_{J\sim P(J|v,w)}[\ell(\cdot)] = \mathbb{E}_{J\sim P(J|v,w)} \left[ \ell(\cdot) \frac{\partial}{\partial w}\log P(J|v,w) + \frac{\partial}{\partial w}\ell(\cdot) \right] \]

This formulation allows gradients to flow through both the selection probabilities (affecting the Score CNN \(v\)) and the hypothesis generation quality (affecting the Coordinate CNN \(w\)).
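Because \(J\) ranges over a small discrete set, the log-likelihood ratio identity can be checked exactly by enumeration. The sketch below treats the scores themselves as the parameters and keeps the per-hypothesis losses fixed (so the second, pathwise term vanishes), then compares the log-derivative form of the gradient against a finite-difference gradient of the expected loss; all numbers are made up:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# Scores are the parameters here; losses ell_J are fixed per hypothesis,
# so only the log-derivative term of the estimator is active.
w = np.array([0.5, -1.0, 2.0])
ell = np.array([3.0, 1.0, 0.2])      # made-up pose losses l(R(h_J), h*)

p = softmax(w)
expected_loss = p @ ell              # E_{J ~ P}[ l_J ]

# Direct gradient of the expected loss via forward finite differences
eps = 1e-6
fd = np.array([
    (softmax(w + eps * np.eye(3)[k]) @ ell - expected_loss) / eps
    for k in range(3)
])

# Log-derivative (REINFORCE) form: E_J[ l_J * d/dw log P(J) ],
# computed exactly by summing over J. For a softmax,
# d log p_J / d w_k = delta_{Jk} - p_k.
grad_logp = np.eye(3) - p            # row J, column k
reinforce = (p * ell) @ grad_logp    # matches fd up to discretization error
```

The two gradients agree, which is exactly why sampling \(J\) and weighting by \(\ell \, \partial_w \log P(J|v,w)\) yields an unbiased estimate of the true gradient during training.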