Skip to content

SLAM

Lie Group and Lie Algebra

Check here for more details.

Non-linear Optimization

State Estimation

Consider the motion & observation model:

\[ \left\{\begin{array}{l} {\mathbf{x}_k} = \mathbf{f}\left( {{\mathbf{x}_{k - 1}},{\mathbf{u}_k}} \right) + \mathbf{w}_k\\ {\mathbf{z}_{k,j}} = \mathbf{h}\left( {{ \mathbf{y}_j},{ \mathbf{x}_k}} \right)+ \mathbf{v}_{k,j} \end{array} \right. \]

where \(\rvx_k\) is camera pose, \(\rvu_k\) is senor reading, \(\rvz_{k,j}\) is the observation of the \(j\)-th feature point \(\rvy_j\) at time \(k\) using observation model \(\rvh\). The noise \(\rvw_k\) and \(\rvv_{k,j}\) are assumed to be Gaussian:

\[ \begin{array}{l} {\mathbf{w}_k} \sim \mathcal{N}\left( {0,{\mathbf{R}_k}} \right)\\ {\mathbf{v}_{k,j}} \sim \mathcal{N}\left( {0,{\mathbf{Q}_{k,j}}} \right) \end{array} \]

We can use \(\rmT_k\) to represent \(\rvx_k\), and the observation equation can be written as:

\[ s \mathbf{z}_{k,j}= \mathbf{K}\exp(\boldsymbol{\xi}_k^\wedge)\rvy_j \]

For visual SLAM (i.e., no sensor reading), we can formulate the problem using Bayes rule:

\[ P(\rvx | \rvz) = \frac{P(\rvz | \rvx)P(\rvx)}{P(\rvz)} \propto P(\rvz | \rvx)P(\rvx) \]

Least Square

Let us consider the likelihood function:

\[ \begin{aligned} P( \mathbf{z}_{j,k} | \mathbf{x}_k, \mathbf{y}_j) &= N\left( h(\mathbf{y}_j, \mathbf{x}_k), \mathbf{Q}_{k,j} \right)\\ &=\frac{1}{{\sqrt {{{(2\pi )}^N}\det (\boldsymbol{\Sigma} )} }}\exp \left( {-\frac{1}{2}{{\left( {\rvz_{j, k} - \mu_{j, k}} \right)}^T}{\boldsymbol{\Sigma} ^{ - 1}}\left( {\rvz_{j, k} - \mu_{j, k}} \right)} \right) \end{aligned} \]

The negative log-likelihood function can be written as:

\[ -\ln P( \mathbf{z}_{j,k} | \mathbf{x}_k, \mathbf{y}_j) = \frac{1}{2}(\mathbf{z}_{j,k} - h(\mathbf{y}_j, \mathbf{x}_k))^\top \mathbf{Q}_{k,j}^{-1}(\mathbf{z}_{j,k} - h(\mathbf{y}_j, \mathbf{x}_k)) + \underbrace{\frac{1}{2}\ln \left( {{{\left( {2\pi} \right )}^N}\det \left( \boldsymbol{\Sigma} \right)} \right)}_{\text{can be discarded}} \]

Thus, MLE can be formulated as:

\[ \begin{aligned} (\mathbf{x}_k,\mathbf{y}_j)^* &= \arg \max \mathcal{N}(h(\mathbf{y}_j, \mathbf{x}_k), \mathbf{Q }_{k,j}) \\ &= \arg \min \left( {{{\left( {{ \mathbf{z}_{k,j}}-h\left( {{\mathbf{x }_k},{\mathbf{y}_j}} \right)} \right)}^T} \mathbf{Q}_{k,j}^{-1}\left( {{\mathbf{z}_{k,j}}-h\left( {{\mathbf{x}_k},{\mathbf{y}_j}} \right)} \right)} \right) \end{aligned} \]

Let us put observation together and define the error term:

\[ \begin{array}{l} {\mathbf{e}_{u,k}} = {\mathbf{x}_k}-f\left( {{\mathbf{x}_{k-1}},{\mathbf{u}_k} } \right)\\ {\mathbf{e}_{y,j,k}} = {\mathbf{z}_{k,j}}-h\left( {{\mathbf{x}_k},{\mathbf{y} _j}} \right) \end{array} \]

Thus, the objective function can be written as:

\[ \min J (\mathbf{x},\mathbf{y}) = \sum\limits_k {\mathbf{e}_{v,k}^T \mathbf{R}_k^{-1} {\mathbf{e}_{v,k}}} + \sum\limits_k {\sum\limits_j {\mathbf{e}_{y,k,j}^T \mathbf{Q}_ {k,j}^{-1}{\mathbf{e}_{y,k,j}}}} \]

Non-linear Least Square

Consider the non-linear optimization problem:

\[ \mathop {\min }\limits_{\mathbf{x}} \frac{1}{2}{\left\| {f\left( \mathbf{x} \right) } \right\|^2_2} \]

In practice, \(f\) is usually a non-linear function, we need to use iterative method to solve this problem instead of directly solving the equation by setting the gradient to zero. We can write the Taylor expansion of \(f\) at \(\rvx\):

\[ \| f(\rvx + \Delta \rvx) \|^2 \approx \| f(\rvx) \|^2 + \rmJ(\rvx)\Delta\rvx + \frac{1}{2}\Delta\rvx^\top \rmH(\rvx)\Delta\rvx \]

If we use the first-order approximation, we can get the update direction:

\[ \Delta\rvx^\star = -\rmJ(\rvx) \]

If we use the second-order approximation, we can represent the objective function as the function of \(\Delta\rvx\) and set the gradient to zero. We will have

\[ \rmH\Delta\rvx = -\rmJ^\top \]

Gauss-Newton

Gauss-Newton performs the first-order approximation on \(f(\rvx)\), not objective function:

\[ f(\rvx + \Delta\rvx) \approx f(\rvx) + \rmJ(\rvx)\Delta\rvx \]

Note this \(\rmJ(\rvx)\) is the Jacobian of \(f\) at \(\rvx\). Thus, the objective function can be written as:

\[ \begin{aligned} \Delta \mathbf{x}^* &= \arg \mathop {\min }\limits_{\Delta \mathbf{x}} \frac{1}{2}{\left\| {f\left( \mathbf{x} \right) + \mathbf{J} \left( \mathbf{x} \right)\Delta \mathbf{x} } \right\|^2}\\ &= \arg \mathop {\min }\limits_{\Delta \mathbf{x}} \frac{1}{2}{\left\| {f\left( \mathbf{x} \right) } \right\|^2} + \left( f\left( \mathbf{x} \right) \right)^\top \mathbf{J} \left( \mathbf{x} \right)\Delta \mathbf{x} + \frac{1}{2}\Delta \mathbf{x}^\top \mathbf{J}^\top \left( \mathbf{x} \right)\mathbf{J} \left( \mathbf{x} \right)\Delta \mathbf{x} \end{aligned} \]

We can think of it as a function of \(\Delta \mathbf{x}\) and set the gradient to zero:

\[ \rmJ(\rvx)^\top \rmJ(\rvx)\Delta\rvx = -\rmJ(\rvx)^\top f(\rvx) \]

Consider \(\rmH\Delta\rvx = -\rmJ^\top\) of Newton method, Gauss-Newton can be seen as approximating \(\rmH\) as \(\rmJ^\top\rmJ\). To use Gauss-Newton, \(\rmJ^\top\rmJ\) need to be positive definite. However, in practice, \(\rmJ^\top\rmJ\) is positive semi-definite, which means that Gauss-Newton may not converge. To solve this problem, we can use Levenberg-Marquardt method, which adds a damping term to the diagonal of \(\rmJ^\top\rmJ\):

Levenberg-Marquardt

Lazy:) Just add a damping term to the diagonal of \(\rmJ^\top\rmJ\)x

Visual Odometry I (Indirect methods)

Features

Feature points

keypoint (position) + Descriptor (feature vector)

OBR feature (Oriented FAST and Rotated BRIEF)

Fast corner detection + pyramid (scale invariant) + Intesity centroid (rotation invariant)

Intesity centroid

Given a image patch B, we can define the moment of this image:

\[ m_{pq} = \sum_{x, y \in B} x^p y^q I(x, y), \quad p, q= \{0, 1\} \]

The centroid is then determined as:

\[ C = (\frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}}) \]

The orientation of the feature point can be thus defined as:

\[ \theta = \arctan (m_{01}/m_{10}) \]

Feature matching

todo

2D-2D Epipolar Geometry

epipolar geometry

Given point \(p_1\) and its matching point \(p_2\), we have:

\[ p_1 = KP, \quad p_2 = K(RP+t) \]

(Why \((RP+t)\)? Recall that the rotation and translation defines the mapping between two coordinates!).

Let us assume

\[ x_1 = K^{-1}p_1, \quad x_2 = K^{-1}p_2 \]

Then, we get

\[ x_2 = Rx_1 + t \]

Multiply \(t^{\wedge}\) - equivalent to do outer product with t:

\[ t^{\wedge}x_2 = t^{\wedge}Rx_1 + t^{\wedge}t = t^{\wedge}Rx_1 \]

Multiply \(x^{\top}_2\):

\[ x^{\top}_2 t^{\wedge}x_2 = x^{\top}_2 t^{\wedge}Rx_1 \]

We know that \(t^{\wedge}x_2\) is orthogonal to \(t\) and \(x_2\), thus:

\[ 0 = x^{\top}_2 t^{\wedge}Rx_1 \]

Substitute \(p_1\) and \(p_2\) back:

\[ p_2^\top K^{-\top} t^{\wedge}R K^{-1}p_1 = 0 \]

The above two equations are called epipolar constraint. Essential matrix and Fundamental matrix are as following:

\[ E = t^{\wedge}R, \quad F = K^{-\top} t^{\wedge}R K^{-1} \]

Essential Matrix

  1. Essential matrix and hence the translation are up to an arbitrary scale (Recall \(x^{\top}_2 Ex_1=0\), multiply \(E\) by a constant will not affect the results).
  2. The singular value of E must be \([\sigma, \sigma, 0]^\top\)
  3. Essential matrix has 5 (3+3-1) degree of freedom (up to scale)

Solving essential matrix

Ideally, you only need 5 points to solve essential matrix. However, due to the properties of essential matrix, we usually use Eight-point-algorithm. (More than 8 points will be a least square problem, where we can use RANSAC to make it robust)

Consider one pair of the correpsonding point, we have:

\[ \begin{bmatrix} u_1 & v_1 & 1\\ \end{bmatrix} \begin{bmatrix} e_1 & e_2 & e_3\\ e_4 & e_5 & e_6\\ e_7 & e_8 & e_9 \end{bmatrix}\begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} \]

we can rewrite it into:

\[ \begin{bmatrix} u_1 u_2, u_1 v_2, u_1, v_1 u_2, v_1 v_2, v_1, u_2, v_2, 1 \end{bmatrix} \cdot \mathbf{e} = 0 \]

And this linear equation can be solved, and we can obtain \(E\). However, the solution may not satisfy the constraint of \(E\), thus, we usually use SVD to decompose \(E\) and then obtain the new \(E\) via:

\[ E = U \mathrm{diag}(\frac{\sigma_1+\sigma_2}{2}, \frac{\sigma_1+\sigma_2}{2}, 0)V^\top \]

Recovering \(R\) and \(t\) from essential matrix

For any skew-symmetric matrix \(S\), we can decompose it into:

\[ S = kUZU^\top \]

where \(U^\top\) is an orthogonal matrix, \(Z=\mathrm{diag}(1, 1, 0)WU^\top\):

\[ Z = \begin{bmatrix} 0 & 1 & 0\\ -1 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix}, \quad W = \begin{bmatrix} 0 & -1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1 \end{bmatrix} \]

Here \(W\) is also an orthogonal matrix. Since the essential matrix is up-to-scale, we could ignore the scale factor \(k\):

\[ E = SR = U \mathrm{diag}(1, 1, 0) (WU^\top R) \]

Now, we could find that the above equation is just the SVD decomposition of E since \((WU^\top R)\) is also an orthogonal matrix. Then, we know

\[ E = U\Sigma V^\top = U \mathrm{diag}(k, k, 0) (WU^\top R) \]

Since we can also have

\[ Z = \begin{bmatrix} 0 & -1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix}, \quad W = \begin{bmatrix} 0 & 1 & 0\\ -1 & 0 & 0\\ 0 & 0 & 1 \end{bmatrix} \]

For any \(E\), there are two possible \(R,t\):

\[ t_1^{\wedge} = U R_z(\frac{\pi}{2})\Sigma U^\top, \quad R_1 = U R_z^\top(\frac{\pi}{2})V^\top) \]
\[ t_2^{\wedge} = U R_z(-\frac{\pi}{2})\Sigma U^\top, \quad R_2 = U R_z^\top(-\frac{\pi}{2})V^\top) \]

epipolar geometry

Luckily, only one of them is valid, we just need to test the depth and the problem is solved.

Homography

todo

Scale ambiguity

Due to the scale ambiguity, the initialization of monocular SLAM requires that the translation of the camera cannot be zero, i.e. the camera motion cannot be pure rotation.

Triangulation

epipolar geometry

After estimating camera pose, we need to find the position of the 3D points. Due to the presence of the noise, we cannot find the exact solution. Consider the equation

\[ s_1x_1 = s_2 R x_2 + t \]
\[ s_1x_1^{\wedge}x_1 = 0 = s_2 x_1^{\wedge}R x_2 + x_1^{\wedge}t \]

Solving \(s_2\), we will get the position of the target 3D point

epipolar geometry

From the above figure, we can see when the relative translation is small, the uncertainy of the measurement will be larger. This means that a larger translation would be improve the accuracy of triangulation. However, large translation will also significantly change the viewpoint, result in some of new observation being involved.

3D-2D: PnP (Perspective-n-point)

When we know the position of 3D points and its projection in 2D, we could use PnP to estimate the relative camera motion. Thus, for stereo or RGB-D VO, we could use PnP directly while for monocular case, we need initialization before using PnP.

Consider a point \(P=(X, Y, Z, 1)^\top\) in 3D and the corresponding feature point \(x_1=(u_1, v_1, 1)^\top\) in 2D, we want to solve \([R|t]\), a \(3\times 4\) matrix:

\[ s\begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix}= \begin{bmatrix} t_1 & t_2 & t_3 & t_4 \\ t_5 & t_6 & t_7 & t_8 \\ t_9 & t_{10} & t_{11} & t_{12} \\ \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

Each pair of point can provide us two equations:

\[ u_1 = \frac{t_1X+t_2Y+t_3Z+t_4}{t_9X + t_{10}Y+t_{11}Z+t_{12}}, \quad v_1 = \frac{t_5X+t_6Y+t_7Z+t_8}{t_9X + t_{10}Y+t_{11}Z+t_{12}} \]

Thus, we get

\[ \mathbf{t_1}^\top P - \mathbf{t_3}^\top P u_1 = 0 \]
\[ \mathbf{t_2}^\top P - \mathbf{t_3}^\top P v_1 = 0 \]

We can formulate the linear equation as follows:

\[ \left(\begin{array}{ccc} \boldsymbol{P}_1^T & 0 & -u_1 \boldsymbol{P}_1^T \\ 0 & \boldsymbol{P}_1^T & -v_1 \boldsymbol{P}_1^T \\ \vdots & \vdots & \vdots \\ \boldsymbol{P}_N^T & 0 & -u_N \boldsymbol{P}_N^T \\ 0 & \boldsymbol{P}_N^T & -v_N \boldsymbol{P}_N^T \end{array}\right)\left(\begin{array}{l} \boldsymbol{t}_1 \\ \boldsymbol{t}_2 \\ \boldsymbol{t}_3 \end{array}\right)=0 \]

Since \(\mathbf{t}\) has 12 unknowns, we need 6 points to solve this equation. Note that here we did not consider the constraint of \(R\in SO(3)\). Thus, after obtaining \(R\), we need to use QR decomposition to get an approximated \(R\).

P3P

todo epipolar geometry

Bundle adjustment

Consider a 3D point \(P\) and its projection \(p\), we want to estimate the camera pose \(R, t\), and the corresponding Lie algebra is \(\xi\), we could get:

\[ s\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix}= K\exp(\xi^\wedge) \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix} \]
\[ s_i\mathbf{u}_i=K\exp(\xi^\wedge)\mathbf{P}_i \]

We can formulate our objective function, which aims to minimize the reprojection error.

\[ \xi^\star = \arg\min_{\xi, P} \frac{1}{2}\sum_{i=1}^n \|\mathbf{u}_i - \frac{1}{s_i}K\exp(\xi^\wedge)\mathbf{P}_i\|^2_2 \]

We can solve this equation using Gauss-Newton or LM. However, we need to know the gradient of residual with respect to our optimizable params:

\[ e(x+\Delta x) \approx e(x) + J\Delta x \]

The shape of \(J\) would be \(2\times 6\). Now we need to derive the Jacobian. Let us firstly extract first 3 dimension of \(P\):

\[ P^\prime = (\exp(\xi^\wedge) P)_{1:3} = \begin{bmatrix}X^\prime & Y^\prime & Z^\prime \end{bmatrix}^\top \]

Then, we get:

\[ s\mathbf{u} = K P^\prime \]
\[ s\begin{bmatrix} su \\ sv \\ s \end{bmatrix}= \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X^\prime \\ Y^\prime \\ Z^\prime \end{bmatrix} \]

Thus,

\[ u = f_x \frac{X^\prime}{Z^\prime} + c_x, \quad v = f_y \frac{Y^\prime}{Z^\prime} + c_y \]

Now, we could apply a left perturbation \(\delta\xi\) to \(\xi^\wedge\) (\(\delta\xi \oplus \xi\)):

\[ \frac{\partial e}{\partial \delta\xi}=\lim_{\delta\xi\rightarrow 0} \frac{e(\delta\xi \oplus \xi)}{\delta\xi} = \frac{\partial e}{\partial P^\prime} \frac{\partial P^\prime}{\partial \delta\xi} \]

Then,

\[ \frac{\partial e}{\partial P^\prime}=- \begin{bmatrix} \frac{\partial u}{\partial X^\prime} & \frac{\partial u}{\partial Y^\prime} & \frac{\partial u}{\partial Z^\prime} \\ \frac{\partial v}{\partial X^\prime} & \frac{\partial v}{\partial Y^\prime} & \frac{\partial v}{\partial Z^\prime} \end{bmatrix} = \begin{bmatrix} \frac{f_x}{Z^\prime} & 0 & -\frac{f_x X^\prime}{Z^{\prime 2}} \\ 0 & \frac{f_y}{Z^\prime} & -\frac{f_y Y^\prime}{Z^{\prime 2}} \end{bmatrix} \]

The second term is

\[ \frac{\partial P^\prime}{\partial \delta\xi}= \frac{\partial TP}{\partial \delta\xi}= \begin{bmatrix} I & -P^{\prime\wedge} \end{bmatrix} \]

The overall gradient is then:

\[ \frac{\partial \boldsymbol{e}}{\partial \delta \boldsymbol{\xi}}=-\left[\begin{array}{cccccc} \frac{f_x}{Z^{\prime}} & 0 & -\frac{f_x X^{\prime}}{Z^{\prime 2}} & -\frac{f_x X^{\prime} Y^{\prime}}{Z^{\prime 2}} & f_x+\frac{f_x X^2}{Z^{\prime 2}} & -\frac{f_x Y^{\prime}}{Z^{\prime}} \\ 0 & \frac{f_y}{Z^{\prime}} & -\frac{f_y Y^{\prime}}{Z^{\prime 2}} & -f_y-\frac{f_y Y^{\prime 2}}{Z^{\prime 2}} & \frac{f_y X^{\prime} Y^{\prime}}{Z^{\prime 2}} & \frac{f_y X^{\prime}}{Z^{\prime}} \end{array}\right] \]

And the gradient with respect to position should be:

\[ \frac{\partial e}{\partial P}= \frac{\partial e}{\partial P^\prime} \frac{\partial P^\prime}{\partial P}= \begin{bmatrix} \frac{f_x}{Z^\prime} & 0 & -\frac{f_x X^\prime}{Z^{\prime 2}} \\ 0 & \frac{f_y}{Z^\prime} & -\frac{f_y Y^\prime}{Z^{\prime 2}} \end{bmatrix} R \]

3D-3D: ICP (Iterative Closest Point)

Consider a pair of point cloud:

\[ P = \{p_1, \dots, p_n\}, \quad P^\prime = \{p_1^\prime, \dots, p_n^\prime\} \]

We aim to find an Euclidean transformation \(R, t\) so as to

\[ \forall_i p_i = R p_i^\prime + t \]

Solving ICP (SVD)

The error residual can be written as:

\[ e_i = p_i - (Rp_i^\prime + t) \]

Then, we could formulate a least-square problem:

\[ \min_{R, t} \frac{1}{2}\sum_{i=1}^n\|p_i - (Rp_i^\prime + t)|^2_2 \]

We can define the centroid of two point cloud:

\[ p = \frac{1}{n}\sum_{i=1}^np_i, \quad p^\prime = \frac{1}{n}\sum_{i=1}^np_i^\prime \]

Our objective function can then be formulated as:

\[ \begin{aligned} \frac{1}{2} \sum_{i=1}^n\left\|\boldsymbol{p}_i-\left(\boldsymbol{R} \boldsymbol{p}_i{ }^{\prime}+\boldsymbol{t}\right)\right\|^2= & \frac{1}{2} \sum_{i=1}^n\left\|\boldsymbol{p}_i-\boldsymbol{R} \boldsymbol{p}_i{ }^{\prime}-\boldsymbol{t}-\boldsymbol{p}+\boldsymbol{R} \boldsymbol{p}^{\prime}+\boldsymbol{p}-\boldsymbol{R} \boldsymbol{p}^{\prime}\right\|^2 \\ = & \frac{1}{2} \sum_{i=1}^n\left\|\left(\boldsymbol{p}_i-\boldsymbol{p}-\boldsymbol{R}\left(\boldsymbol{p}_i{ }^{\prime}-\boldsymbol{p}^{\prime}\right)\right)+\left(\boldsymbol{p}-\boldsymbol{R} \boldsymbol{p}^{\prime}-\boldsymbol{t}\right)\right\|^2 \\ = & \frac{1}{2} \sum_{i=1}^n\left(\left\|\boldsymbol{p}_i-\boldsymbol{p}-\boldsymbol{R}\left(\boldsymbol{p}_i{ }^{\prime}-\boldsymbol{p}^{\prime}\right)\right\|^2+\left\|\boldsymbol{p}-\boldsymbol{R} \boldsymbol{p}^{\prime}-\boldsymbol{t}\right\|^2+\right. \\ & \left.2\left(\boldsymbol{p}_i-\boldsymbol{p}-\boldsymbol{R}\left(\boldsymbol{p}_i{ }^{\prime}-\boldsymbol{p}^{\prime}\right)\right)^T\left(\boldsymbol{p}-\boldsymbol{R} \boldsymbol{p}^{\prime}-\boldsymbol{t}\right)\right) . \end{aligned} \]

We know that \(2\left(\boldsymbol{p}_i-\boldsymbol{p}-\boldsymbol{R}\left(\boldsymbol{p}_i{ }^{\prime}-\boldsymbol{p}^{\prime}\right)\right)\) would be zero. Thus, the objective funtion can be:

\[ \min_{R, t} \frac{1}{2}\sum_{i=1}^n\|p_i - p - R(p_i^\prime - p^\prime)|^2_2 + \|p-Rp^\prime-t\|^2 \]

We can find that the left part only contains \(R\) while right part contains \(R, t\). Thus, solving ICP can be divided into 3 steps:

  1. Obtain the centroid and obtain relative coordinate

    \[ q_i = p_i - p, \quad q_i^\prime = p_i^\prime - p^\prime \]
  2. Estimate rotation matrix \(R\)

    \[ R^\star = \arg\min_R \frac{1}{2}\sum_{i=1}^n\|p_i - p - R(p_i^\prime - p^\prime)|^2_2 \]
  3. Estimate \(t\) from estimated rotation matrix \(R\)

    \[ t^\star = p - Rp^\prime \]

Visual Odometry II (Direct methods)

Key ideas: Keypoint + Feature descriptor extraction takes quite a long time (Mostly because of the descriptor)

Solution:

  1. Keypoint + optical flow
  2. Keypoint + direct optimization
  3. Direct optimization

Optical flow (Lucas-Kanade)

Assumption: Brightness constancy, i.e., for pixel at \((x, y)\) in time \(t\), if it goes to \((x+dx, y+dy)\) in time \(t+dt\), we have

\[ I(x+dx, y+dy, t+dt) = I(x, y, t) \]

Using Taylor expansion and keep the first-order term, we have

\[ I(x+dx, y+dy, t+dt) \approx I(x, y, t) + \frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt \]

Using brightness constancy assumption, we have

\[ \frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt = 0 \]

Then

\[ \frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} = -\frac{\partial I}{\partial t} \]

where $\frac{dx}{dt} $ and \(\frac{dy}{dt}\) are velocity along x dimension and y dimension, denote as \(u\) and \(v\). \(\frac{\partial I}{\partial x}\) and \(\frac{\partial I}{\partial y}\) are gradients, denote as \(I_x\) and \(I_y\). Then, we have

\[ \begin{bmatrix} I_x & I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}= -I_t \]

However, we cannot solve this equation due to the fact that there are two unknowns. What to do next? ASSUMPTION! i.e., optical flow field changes smoothly in a small neighborhood. Consider a window of \(k\times k\), we can have \(k^2\) equations:

\[ \begin{bmatrix} I_x & I_y \end{bmatrix}_k \begin{bmatrix} u \\ v \end{bmatrix}= -I_{tk}, \quad k=1,\dots, w^2 \]

Let us say

\[ \boldsymbol{A}=\left[\begin{array}{c} {\left[\boldsymbol{I}_x, \boldsymbol{I}_y\right]_1} \\ \vdots \\ {\left[\boldsymbol{I}_x, \boldsymbol{I}_y\right]_k} \end{array}\right], \boldsymbol{b}=\left[\begin{array}{c} \boldsymbol{I}_{t 1} \\ \vdots \\ \boldsymbol{I}_{t k} \end{array}\right] \]

Thus, the equation can be written as

\[ A\begin{bmatrix} u \\ v \end{bmatrix} = -b \]

Since now it is an over-determined problem, the solution would be

\[ A\begin{bmatrix} u \\ v \end{bmatrix}^\star = - (A^\top A)^{-1} A^\top b \]

Direct Methods

Direct method

The projection of \(P\) can be written as:

\[ \begin{align} p_1 &= \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}_1= \frac{1}{Z_1}KP\\ \\ p_2 &= \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}_2= \frac{1}{Z_2}K(RP+t)= \frac{1}{Z_2}K(\exp(\xi^\wedge)P)_{1:3} \end{align} \]

Now we do not know the correspondence between \(p_1\) and \(p_2\). To solve this equation, we need minimize the photometric error:

\[ e = I_1(p_1) - I_2(p_2) \]

The minimization can be formulated as:

\[ \min_{\xi} J(\xi) = \sum_{i=1}^N \|e_i\|^2 = \sum_{i=1}^N e_i^\top e_i, \quad e_i = I_1(p_{1, i}) - I_2(p_{2, i}) \]

Apply a small left perturbation \(\exp(\delta\xi)\) to \(\exp(\xi)\):

\[ \begin{align} e(\xi\oplus\delta\xi) &= I_1(\frac{1}{Z_1}KP) - I_2(\frac{1}{Z_2}K\exp(\delta\xi^\wedge)\exp(\xi^\wedge)P) \\ &\approx I_1(\frac{1}{Z_1}KP) - I_2(\frac{1}{Z_2}K(1+\delta\xi^\wedge)\exp(\xi^\wedge)P)\\ &= I_1(\frac{1}{Z_1}KP) - I_2(\frac{1}{Z_2}K\exp(\xi^\wedge)P + \frac{1}{Z_2}K\delta\xi^\wedge\exp(\xi^\wedge)P) \end{align} \]

Let

\[ q = \delta\xi^\wedge\exp(\xi^\wedge)P, \]
\[ u=\frac{1}{Z_2}Kq \]

Here \(P\) is the coordinate in the second camera coordinate system after applying the perturbation to \(P\). And \(u\) is its pixel coordinate.

We can rewrite the above equation using Taylor expansion:

\[ \begin{align} e(\xi\oplus\delta\xi) &= I_1(\frac{1}{Z_1}KP) - I_2(\frac{1}{Z_2}K\exp(\xi^\wedge)P + u)\\ &\approx I_1(\frac{1}{Z_1}KP) - I_2(\frac{1}{Z_2}K\exp(\xi^\wedge)P) - \frac{\partial I_2}{\partial u}\frac{\partial u}{\partial q} \frac{\partial q}{\partial \delta \xi} \delta\xi\\ &= e(\xi) - \frac{\partial I_2}{\partial u}\frac{\partial u}{\partial q} \frac{\partial q}{\partial \delta \xi} \delta\xi \end{align} \]

We know that:

  1. \(\frac{\partial I_2}{\partial u}\) is the gradient of the image at pixel \(u\)
  2. \(\frac{\partial u}{\partial q}\) is the gradient of projection with respect to the 3D coordinate in camera coordinate:

    \[ {\partial u}{\partial q}= \begin{bmatrix} \frac{\partial u}{\partial X} & \frac{\partial u}{\partial Y} & \frac{\partial u}{\partial Z}\\ \frac{\partial v}{\partial X} & \frac{\partial v}{\partial Y} & \frac{\partial v}{\partial Z} \end{bmatrix}= \begin{bmatrix} \frac{\partial f_x}{\partial Z} & 0 & -\frac{\partial f_x X}{\partial Z^2} \\ 0 & \frac{\partial f_y}{\partial Z} & -\frac{\partial f_y Y}{\partial Z^2} \end{bmatrix} \]
  3. \(\frac{\partial q}{\partial \delta \xi} = [I, \quad -q^\wedge]\)

Discussion

The derivation above assumes that \(P\) is known. For Monocular sequence, \(P\) comes from:

  1. Sparse keypoint matching
  2. Semi-dense direct methods: Consider pixel with non-zero gradient only
  3. Dense direct methods: All pixels

The cons of direct methods are:

  1. Non-convex
  2. Brightness consistency is a strong assumption

Reference

  1. 14 lectures on visual SLAM