Lecture 2: Image Formation
Primitives and Transformations
Primitives
Homogeneous coordinates: a 2D point \(\mathbf{x} = (x, y)^\top \in \mathbb{R}^2\) can be represented in homogeneous coordinates by \(\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{w})^\top \in \mathbb{P}^2\), where \(\mathbb{P}^2 = \mathbb{R}^3 \setminus \{(0, 0, 0)\}\) is called the projective space. Homogeneous vectors that differ only by scale represent the same 2D point.
Augmented vector \(\bar{\mathbf{x}}\): dividing a homogeneous point by its last entry \(\tilde{w}\) yields the augmented vector \(\bar{\mathbf{x}} = \tilde{\mathbf{x}}/\tilde{w} = (x, y, 1)^\top\).
Homogeneous points with last element \(\tilde{w}=0\) are called ideal points or points at infinity.
2D lines: can also be expressed in homogeneous coordinates as \(\tilde{\mathbf{l}} = (a, b, c)^\top\); a point lies on the line iff \(\bar{\mathbf{x}}^\top \tilde{\mathbf{l}} = ax + by + c = 0\). We can normalize \(\tilde{\mathbf{l}}\) so that \(\tilde{\mathbf{l}} = (n_x, n_y, -d)^\top = (\mathbf{n}, -d)^\top\) with \(\|\mathbf{n}\|_2 = 1\). Then \(\mathbf{n}\) is the unit normal perpendicular to the line and \(d\) is its distance to the origin.
Line at infinity: \(\tilde{\mathbf{l}}_{\infty} = (0, 0, 1)^\top\) passes through all ideal points (\(\tilde{w}=0\)).
Intersection of two lines: \(\tilde{\mathbf{x}} = \tilde{\mathbf{l}}_1 \times \tilde{\mathbf{l}}_2\). Proof: \(\tilde{\mathbf{l}}_1^\top \tilde{\mathbf{x}} = \tilde{\mathbf{l}}_1^\top (\tilde{\mathbf{l}}_1 \times \tilde{\mathbf{l}}_2) = 0\), and likewise \(\tilde{\mathbf{l}}_2^\top \tilde{\mathbf{x}} = 0\), so \(\tilde{\mathbf{x}}\) lies on both lines.
Line joining two points: \(\tilde{\mathbf{l}} = \tilde{\mathbf{x}}_1 \times \tilde{\mathbf{x}}_2\)
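Both cross-product identities are easy to verify numerically; a minimal numpy sketch (the line and point values below are arbitrary illustrations, not from the lecture). Note how parallel lines intersect in an ideal point:

```python
import numpy as np

# Two parallel vertical lines in homogeneous coordinates: x = 1 and x = 2.
l1 = np.array([1.0, 0.0, -1.0])   # x - 1 = 0
l2 = np.array([1.0, 0.0, -2.0])   # x - 2 = 0

# Intersection of two lines is their cross product.
x = np.cross(l1, l2)
print(x)  # [0. 1. 0.]: w = 0, an ideal point (parallel lines meet at infinity)

# Line joining two points is also a cross product.
p1 = np.array([0.0, 0.0, 1.0])    # the origin
p2 = np.array([1.0, 1.0, 1.0])    # the point (1, 1)
l = np.cross(p1, p2)
print(l)  # [-1. 1. 0.]: the line -x + y = 0, i.e. y = x

# Both points lie on the joining line: the dot products vanish.
assert np.isclose(l @ p1, 0.0) and np.isclose(l @ p2, 0.0)
```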
Transformations
Overview of 3D transformations:

| Transformation | Matrix | # DoF | Preserves |
| --- | --- | --- | --- |
| Translation | \([\mathbf{I} \,\vert\, \mathbf{t}]_{3\times4}\) | 3 | orientation |
| Rigid (Euclidean) | \([\mathbf{R} \,\vert\, \mathbf{t}]_{3\times4}\) | 6 | lengths |
| Similarity | \([s\mathbf{R} \,\vert\, \mathbf{t}]_{3\times4}\) | 7 | angles |
| Affine | \([\mathbf{A}]_{3\times4}\) | 12 | parallelism |
| Projective | \([\tilde{\mathbf{H}}]_{4\times4}\) | 15 | straight lines |

Each transformation also preserves the properties of the rows below it.
Direct Linear Transformation:
Let \(\mathcal{X}=\{\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}^{\prime}_i\}_{i=1}^N\) denote a set of \(N\) 2D-to-2D correspondences related by a homography: \(\tilde{\mathbf{x}}_i^\prime = \tilde{\mathbf{H}} \tilde{\mathbf{x}}_i\). As the points are homogeneous, both sides are parallel, i.e. \(\tilde{\mathbf{x}}_i^\prime \times \tilde{\mathbf{H}} \tilde{\mathbf{x}}_i = \mathbf{0}\).
Each point correspondence yields two equations (the third row is linearly dependent on the first two). Stacking all equations into a \(2N\times 9\) dimensional matrix \(\mathbf{A}\) leads to the following constrained least squares problem:
$$ \tilde{\mathbf{h}}^* = \underset{\tilde{\mathbf{h}}}{\operatorname{argmin}} \ \|\mathbf{A}\tilde{\mathbf{h}}\|_2^2 \quad \text{s.t.} \quad \|\tilde{\mathbf{h}}\|_2^2 = 1 $$
where \(\tilde{\mathbf{h}} \in \mathbb{R}^9\) stacks the rows of \(\tilde{\mathbf{H}}\).
Here we have fixed \(\|\tilde{\mathbf{h}}\|_2^2 = 1\) as \(\tilde{\mathbf{H}}\) is homogeneous (i.e., defined only up to scale). We can solve the above problem using the SVD \(\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^\top\), where \(\mathbf{U}^\top \mathbf{U} = \mathbf{I}\) and \(\mathbf{V}^\top \mathbf{V} = \mathbf{I}\). The derivation is as follows:
As \(\mathbf{U}\) and \(\mathbf{V}\) are orthogonal matrices, they preserve the Euclidean norm:
$$ \|\mathbf{A}\tilde{\mathbf{h}}\|_2^2 = \|\mathbf{U}\mathbf{S}\mathbf{V}^\top\tilde{\mathbf{h}}\|_2^2 = \|\mathbf{S}\mathbf{V}^\top\tilde{\mathbf{h}}\|_2^2, \qquad \|\tilde{\mathbf{h}}\|_2 = \|\mathbf{V}^\top\tilde{\mathbf{h}}\|_2 $$
If we set \(\tilde{\mathbf{h}}\) to the \(i\)-th column vector \(\mathbf{v}_i\) of \(\mathbf{V}\), then \(\mathbf{V}^\top\tilde{\mathbf{h}} = \mathbf{e}_i\) satisfies the constraint and
$$ \|\mathbf{A}\tilde{\mathbf{h}}\|_2^2 = \|\mathbf{S}\mathbf{e}_i\|_2^2 = s_i^2 $$
where \(s_i\) is the \(i\)-th singular value of \(\mathbf{A}\).
Thus, to minimize the error, we choose the right singular vector corresponding to the smallest singular value of \(\mathbf{A}\) (the last column of \(\mathbf{V}\)) and reshape it into the homography matrix \(\tilde{\mathbf{H}}\).
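A minimal numpy sketch of this procedure (assuming exact, noise-free correspondences for brevity; the coordinate normalization recommended in practice is omitted):

```python
import numpy as np

def dlt_homography(x, x_prime):
    """Estimate H from N >= 4 correspondences.
    x, x_prime: (N, 2) arrays of inhomogeneous 2D points with x' ~ H x."""
    N = x.shape[0]
    A = np.zeros((2 * N, 9))
    for i in range(N):
        X = np.array([x[i, 0], x[i, 1], 1.0])   # homogeneous source point
        u, v = x_prime[i]
        # Two independent rows of the cross-product constraint x' x (H x) = 0.
        A[2 * i]     = np.concatenate([np.zeros(3), -X, v * X])
        A[2 * i + 1] = np.concatenate([X, np.zeros(3), -u * X])
    # h is the right singular vector for the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)

# Sanity check: a known homography is recovered up to scale.
H_true = np.array([[1.0, 0.2, 3.0], [0.0, 1.1, -2.0], [0.001, 0.0, 1.0]])
pts = np.random.rand(6, 2) * 10
pts_h = np.column_stack([pts, np.ones(6)]) @ H_true.T
pts_prime = pts_h[:, :2] / pts_h[:, 2:]
H = dlt_homography(pts, pts_prime)
print(H / H[2, 2])  # approximately H_true
```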
Geometric Image Formation
Projection Models
Orthographic projection: the z-coordinate is dropped during projection; x and y remain unchanged. Scaled orthographic projection additionally applies a uniform scale factor to x and y.
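In homogeneous coordinates, the two models can be written as follows (standard matrix forms, added here for completeness):
$$ \tilde{\mathbf{x}}_s = \left(\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right) \bar{\mathbf{x}}_c \qquad \tilde{\mathbf{x}}_s = \left(\begin{array}{cccc} s & 0 & 0 & 0 \\ 0 & s & 0 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right) \bar{\mathbf{x}}_c $$
where the right variant is the scaled orthographic projection with scale factor \(s\).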
Perspective projection: 3D points in camera coordinates are mapped to the image plane by dividing them by their z component and multiplying with the focal length:
$$ \left(\begin{array}{c} x_s \\ y_s \end{array}\right) = \left(\begin{array}{c} f x_c / z_c \\ f y_c / z_c \end{array}\right) $$
After this projection, it is not possible to recover the distance of the 3D point from the image. In practice, the model additionally includes a principal point offset \((c_x, c_y)\) and a skew factor \(s\):
$$ \left(\begin{array}{c} x_s \\ y_s \end{array}\right) = \left(\begin{array}{c} f_x x_c / z_c + s\, y_c / z_c + c_x \\ f_y y_c / z_c + c_y \end{array}\right) \quad\Leftrightarrow\quad \tilde{\mathbf{x}}_s = \left(\begin{array}{cccc} f_x & s & c_x & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{array}\right) \bar{\mathbf{x}}_c $$ The left \(3\times3\) part of this projection matrix is called the calibration matrix \(\mathbf{K}\).
World space -> Image space:
Given the calibration matrix \(\mathbf{K}\) and the camera pose \([\mathbf{R}|\mathbf{t}]\), we can project a point from world coordinates to image coordinates:
$$ \tilde{\mathbf{x}}_s = \mathbf{K} \left[\mathbf{R}|\mathbf{t}\right] \bar{\mathbf{x}}_w = \mathbf{P}\, \bar{\mathbf{x}}_w $$
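A minimal numpy sketch of this projection; the intrinsics \(\mathbf{K}\) and pose \([\mathbf{R}|\mathbf{t}]\) below are made-up values for illustration:

```python
import numpy as np

# Hypothetical intrinsics: focal lengths 500, zero skew, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Hypothetical pose: identity rotation, world origin 2 units in front of the camera.
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])

x_w = np.array([0.5, -0.25, 3.0])   # 3D point in world coordinates
x_c = R @ x_w + t                   # world -> camera coordinates
x_tilde = K @ x_c                   # homogeneous image point: K [R|t] x_w
x_s = x_tilde[:2] / x_tilde[2]      # normalize by the z component to get pixels
print(x_s)                          # [370. 215.]
```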
Full rank representation: sometimes it is preferable to use a full-rank \(4\times4\) projection matrix:
$$ \tilde{\mathbf{P}} = \left(\begin{array}{cc} \mathbf{K} & \mathbf{0} \\ \mathbf{0}^\top & 1 \end{array}\right) \left(\begin{array}{cc} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{array}\right) $$
Now the homogeneous image vector \(\tilde{\mathbf{x}}_s = \tilde{\mathbf{P}}\bar{\mathbf{x}}_w\) is a 4D vector and must be normalized w.r.t. its 3rd entry to obtain inhomogeneous image pixels:
$$ \bar{\mathbf{x}}_s = \tilde{\mathbf{x}}_s / z_c = (x_s, y_s, 1, 1/z_c)^\top $$
The 4th component of this normalized vector is called the inverse depth. If the inverse depth is known, we can recover the 3D point by inverting \(\tilde{\mathbf{P}}\); see the sketch below.
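Continuing the sketch above (reusing the hypothetical `K`, `R`, `t` and `x_w`), the full-rank form makes the projection invertible when the inverse depth is known:

```python
# Full-rank 4x4 projection matrix built from K and [R|t] as defined above.
K4 = np.eye(4); K4[:3, :3] = K                  # [[K, 0], [0^T, 1]]
T = np.eye(4); T[:3, :3] = R; T[:3, 3] = t      # [[R, t], [0^T, 1]]
P = K4 @ T

x_tilde = P @ np.append(x_w, 1.0)   # 4D homogeneous image point
x_tilde /= x_tilde[2]               # normalize w.r.t. the 3rd entry
# x_tilde is now (x_s, y_s, 1, 1/z_c): the 4th component is the inverse depth.

# With known inverse depth, the 3D point is recovered by inverting P.
x_w_rec = (np.linalg.inv(P) @ (x_tilde / x_tilde[3]))[:3]
assert np.allclose(x_w_rec, x_w)
```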
Lens distortion: due to the camera optics, straight lines in the world may appear curved in the image; this distortion is typically modeled with a low-order polynomial in the distance from the image center and can be removed by warping the image.
Photometric Image Formation
Rendering Equation:
$$ L_{\text{out}}(\mathbf{p}, \mathbf{v}, \lambda) = L_{\text{emit}}(\mathbf{p}, \mathbf{v}, \lambda) + \int_{\Omega} \text{BRDF}(\mathbf{p}, \mathbf{s}, \mathbf{v}, \lambda) \cdot L_{\text{in}}(\mathbf{p}, \mathbf{s}, \lambda) \cdot (\mathbf{n}^\top \mathbf{s}) \, d\mathbf{s} $$
where \(\mathbf{p}\in \mathbb{R}^3\) denotes a 3D surface point, \(\mathbf{v}\in \mathbb{R}^3\) the viewing direction, \(\mathbf{s}\in \mathbb{R}^3\) the incoming light direction, and \(\lambda\) the wavelength. \(\Omega\) is the unit hemisphere centered around the surface normal \(\mathbf{n}\). \(L_{\text{emit}}>0\) only for light-emitting surfaces.
Camera lenses: cameras use lenses to accumulate light on the sensor plane; if a 3D point is in focus, all light rays passing through the lens arrive at the same 2D pixel. For many applications a pinhole camera model is sufficient, but to model focus, vignetting and aberration we need to model lenses.
Thin lens model: a 3D point at distance \(z_c\) from the lens is in focus when \(1/z_s + 1/z_c = 1/f\), where \(z_s\) is the distance from the lens to the image plane and \(f\) is the focal length.
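For example, with a hypothetical focal length \(f = 50\,\text{mm}\) and an object at distance \(z_c = 1\,\text{m}\):
$$ \frac{1}{z_s} = \frac{1}{f} - \frac{1}{z_c} = \frac{1}{50\,\text{mm}} - \frac{1}{1000\,\text{mm}} \quad\Rightarrow\quad z_s \approx 52.6\,\text{mm} $$
i.e., the image plane must sit roughly 52.6 mm behind the lens for this point to be in focus.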
Depth of Field (DOF): if the image plane is out of focus, a 3D point projects to a circle of confusion \(c\).
To control the size of circle of confusion, we can change the lens aperture.
Vignetting: Tendency for brightness to fall off towards the image edge.