Lecture 2: Image Formation
Primitives and Transformations
Primitives
Homogeneous coordinates: a 2D point \(\mathbf{x} = (x, y)^\top \in \mathbb{R}^2\) can be represented in homogeneous coordinates by \(\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{w})^\top \in \mathbb{P}^2\), where \(\mathbb{P}^2 = \mathbb{R}^3 \setminus \{(0, 0, 0)\}\) is called the projective space. Homogeneous vectors that differ only by scale represent the same 2D point.
Augmented vector \(\bar{\mathbf{x}}\): dividing a homogeneous point by its last entry \(\tilde{w}\) yields the augmented vector \(\bar{\mathbf{x}} = \tilde{\mathbf{x}}/\tilde{w} = (x, y, 1)^\top\).
Homogeneous points with last element \(\tilde{w}=0\) are called ideal points or points at infinity.
2D lines: can also be expressed in homogeneous coordinates as \(\tilde{\mathbf{l}} = (a, b, c)^\top\); a point lies on the line iff \(\bar{\mathbf{x}}^\top \tilde{\mathbf{l}} = ax + by + c = 0\). We can normalize \(\tilde{\mathbf{l}}\) so that \(\tilde{\mathbf{l}} = (n_x, n_y, -d)^\top = (\mathbf{n}, -d)^\top\) with \(\|\mathbf{n}\|_2 = 1\). Then \(\mathbf{n}\) is the unit normal perpendicular to the line and \(d\) is its distance to the origin.
Line at infinity: \(\tilde{\mathbf{l}}_{\infty} = (0, 0, 1)^\top\) passes through all ideal points (\(\tilde{w}=0\)).
Intersection of two lines: \(\tilde{\mathbf{x}} = \tilde{\mathbf{l}}_1 \times \tilde{\mathbf{l}}_2\). Proof: \(\tilde{\mathbf{l}}_1^\top \tilde{\mathbf{x}} = \tilde{\mathbf{l}}_1^\top (\tilde{\mathbf{l}}_1 \times \tilde{\mathbf{l}}_2) = 0\), and likewise \(\tilde{\mathbf{l}}_2^\top \tilde{\mathbf{x}} = 0\), so \(\tilde{\mathbf{x}}\) lies on both lines.
Line joining two points: \(\tilde{\mathbf{l}} = \tilde{\mathbf{x}}_1 \times \tilde{\mathbf{x}}_2\)
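Both cross-product identities are easy to verify numerically; a minimal numpy sketch (the line and point values below are arbitrary illustrations, not from the lecture). Note how parallel lines intersect in an ideal point:

```python
import numpy as np

# Two parallel vertical lines in homogeneous coordinates: x = 1 and x = 2.
l1 = np.array([1.0, 0.0, -1.0])   # x - 1 = 0
l2 = np.array([1.0, 0.0, -2.0])   # x - 2 = 0

# Intersection of two lines is their cross product.
x = np.cross(l1, l2)
print(x)  # [0. 1. 0.]: w = 0, an ideal point (parallel lines meet at infinity)

# Line joining two points is also a cross product.
p1 = np.array([0.0, 0.0, 1.0])    # the origin
p2 = np.array([1.0, 1.0, 1.0])    # the point (1, 1)
l = np.cross(p1, p2)
print(l)  # [-1. 1. 0.]: the line -x + y = 0, i.e. y = x

# Both points lie on the joining line: the dot products vanish.
assert np.isclose(l @ p1, 0.0) and np.isclose(l @ p2, 0.0)
```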
Transformations
Overview of 3D transformations:

| Transformation | Matrix | # DoF | Preserves |
| --- | --- | --- | --- |
| Translation | \([\mathbf{I} \,\vert\, \mathbf{t}]_{3\times4}\) | 3 | orientation |
| Rigid (Euclidean) | \([\mathbf{R} \,\vert\, \mathbf{t}]_{3\times4}\) | 6 | lengths |
| Similarity | \([s\mathbf{R} \,\vert\, \mathbf{t}]_{3\times4}\) | 7 | angles |
| Affine | \([\mathbf{A}]_{3\times4}\) | 12 | parallelism |
| Projective | \([\tilde{\mathbf{H}}]_{4\times4}\) | 15 | straight lines |

Each transformation also preserves the properties of the rows below it.
Direct Linear Transformation:
Let \(\mathcal{X}=\{\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}^{\prime}_i\}_{i=1}^N\) denote a set of \(N\) 2D-to-2D correspondences related by a homography: \(\tilde{\mathbf{x}}_i^\prime = \tilde{\mathbf{H}} \tilde{\mathbf{x}}_i\). As the points are homogeneous, both sides are parallel, i.e. \(\tilde{\mathbf{x}}_i^\prime \times \tilde{\mathbf{H}} \tilde{\mathbf{x}}_i = \mathbf{0}\).
Each point correspondence yields two equations (the third row is linearly dependent on the first two). Stacking all equations into a \(2N\times 9\) dimensional matrix \(\mathbf{A}\) leads to the following constrained least squares problem:
$$ \tilde{\mathbf{h}}^* = \underset{\tilde{\mathbf{h}}}{\operatorname{argmin}} \ \|\mathbf{A}\tilde{\mathbf{h}}\|_2^2 \quad \text{s.t.} \quad \|\tilde{\mathbf{h}}\|_2^2 = 1 $$
where \(\tilde{\mathbf{h}} \in \mathbb{R}^9\) stacks the rows of \(\tilde{\mathbf{H}}\).
Here we have fixed \(\|\tilde{\mathbf{h}}\|_2^2 = 1\) as \(\tilde{\mathbf{H}}\) is homogeneous (i.e., defined only up to scale). We can solve the above problem using the SVD \(\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^\top\), where \(\mathbf{U}^\top \mathbf{U} = \mathbf{I}\) and \(\mathbf{V}^\top \mathbf{V} = \mathbf{I}\). The derivation is as follows:
As \(\mathbf{U}\) and \(\mathbf{V}\) are orthogonal matrices, they preserve the Euclidean norm:
$$ \|\mathbf{A}\tilde{\mathbf{h}}\|_2^2 = \|\mathbf{U}\mathbf{S}\mathbf{V}^\top\tilde{\mathbf{h}}\|_2^2 = \|\mathbf{S}\mathbf{V}^\top\tilde{\mathbf{h}}\|_2^2, \qquad \|\tilde{\mathbf{h}}\|_2 = \|\mathbf{V}^\top\tilde{\mathbf{h}}\|_2 $$
If we set \(\tilde{\mathbf{h}}\) to the \(i\)-th column vector \(\mathbf{v}_i\) of \(\mathbf{V}\), then \(\mathbf{V}^\top\tilde{\mathbf{h}} = \mathbf{e}_i\) satisfies the constraint and
$$ \|\mathbf{A}\tilde{\mathbf{h}}\|_2^2 = \|\mathbf{S}\mathbf{e}_i\|_2^2 = s_i^2 $$
where \(s_i\) is the \(i\)-th singular value of \(\mathbf{A}\).
Thus, to minimize the error, we choose the right singular vector corresponding to the smallest singular value of \(\mathbf{A}\) (the last column of \(\mathbf{V}\)) and reshape it into the homography matrix \(\tilde{\mathbf{H}}\).
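A minimal numpy sketch of this procedure (assuming exact, noise-free correspondences for brevity; the coordinate normalization recommended in practice is omitted):

```python
import numpy as np

def dlt_homography(x, x_prime):
    """Estimate H from N >= 4 correspondences.
    x, x_prime: (N, 2) arrays of inhomogeneous 2D points with x' ~ H x."""
    N = x.shape[0]
    A = np.zeros((2 * N, 9))
    for i in range(N):
        X = np.array([x[i, 0], x[i, 1], 1.0])   # homogeneous source point
        u, v = x_prime[i]
        # Two independent rows of the cross-product constraint x' x (H x) = 0.
        A[2 * i]     = np.concatenate([np.zeros(3), -X, v * X])
        A[2 * i + 1] = np.concatenate([X, np.zeros(3), -u * X])
    # h is the right singular vector for the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)

# Sanity check: a known homography is recovered up to scale.
H_true = np.array([[1.0, 0.2, 3.0], [0.0, 1.1, -2.0], [0.001, 0.0, 1.0]])
pts = np.random.rand(6, 2) * 10
pts_h = np.column_stack([pts, np.ones(6)]) @ H_true.T
pts_prime = pts_h[:, :2] / pts_h[:, 2:]
H = dlt_homography(pts, pts_prime)
print(H / H[2, 2])  # approximately H_true
```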
Geometric Image Formation
Projection Models
Orthographic projection: the z-coordinate is dropped during projection; x and y remain unchanged. Scaled orthographic projection additionally applies a uniform scale factor to x and y.
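In homogeneous coordinates, the two models can be written as follows (standard matrix forms, added here for completeness):
$$ \tilde{\mathbf{x}}_s = \left(\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right) \bar{\mathbf{x}}_c \qquad \tilde{\mathbf{x}}_s = \left(\begin{array}{cccc} s & 0 & 0 & 0 \\ 0 & s & 0 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right) \bar{\mathbf{x}}_c $$
where the right variant is the scaled orthographic projection with scale factor \(s\).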
Perspective projection: 3D points in camera coordinates are mapped to the image plane by dividing them by their z component and multiplying with the focal length:
$$ \left(\begin{array}{c} x_s \\ y_s \end{array}\right) = \left(\begin{array}{c} f x_c / z_c \\ f y_c / z_c \end{array}\right) $$
After this projection, it is not possible to recover the distance of the 3D point from the image. In practice, the model additionally includes a principal point offset \((c_x, c_y)\) and a skew factor \(s\):
$$ \left(\begin{array}{c} x_s \\ y_s \end{array}\right) = \left(\begin{array}{c} f_x x_c / z_c + s\, y_c / z_c + c_x \\ f_y y_c / z_c + c_y \end{array}\right) \quad\Leftrightarrow\quad \tilde{\mathbf{x}}_s = \left(\begin{array}{cccc} f_x & s & c_x & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{array}\right) \bar{\mathbf{x}}_c $$ The left \(3\times3\) part of this projection matrix is called the calibration matrix \(\mathbf{K}\).
World space -> Image space:
Given the calibration matrix \(\mathbf{K}\) and the camera pose \([\mathbf{R}|\mathbf{t}]\), we can project a point from world coordinates to image coordinates:
$$ \tilde{\mathbf{x}}_s = \mathbf{K} \left[\mathbf{R}|\mathbf{t}\right] \bar{\mathbf{x}}_w = \mathbf{P}\, \bar{\mathbf{x}}_w $$
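A minimal numpy sketch of this projection; the intrinsics \(\mathbf{K}\) and pose \([\mathbf{R}|\mathbf{t}]\) below are made-up values for illustration:

```python
import numpy as np

# Hypothetical intrinsics: focal lengths 500, zero skew, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Hypothetical pose: identity rotation, world origin 2 units in front of the camera.
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])

x_w = np.array([0.5, -0.25, 3.0])   # 3D point in world coordinates
x_c = R @ x_w + t                   # world -> camera coordinates
x_tilde = K @ x_c                   # homogeneous image point: K [R|t] x_w
x_s = x_tilde[:2] / x_tilde[2]      # normalize by the z component to get pixels
print(x_s)                          # [370. 215.]
```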
Full rank representation: sometimes it is preferable to use a full-rank \(4\times4\) projection matrix:
$$ \tilde{\mathbf{P}} = \left(\begin{array}{cc} \mathbf{K} & \mathbf{0} \\ \mathbf{0}^\top & 1 \end{array}\right) \left(\begin{array}{cc} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{array}\right) $$
Now the homogeneous image vector \(\tilde{\mathbf{x}}_s = \tilde{\mathbf{P}}\bar{\mathbf{x}}_w\) is a 4D vector and must be normalized w.r.t. its 3rd entry to obtain inhomogeneous image pixels:
$$ \bar{\mathbf{x}}_s = \tilde{\mathbf{x}}_s / z_c = (x_s, y_s, 1, 1/z_c)^\top $$
The 4th component of this normalized vector is called the inverse depth. If the inverse depth is known, we can recover the 3D point by inverting \(\tilde{\mathbf{P}}\); see the sketch below.
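Continuing the sketch above (reusing the hypothetical `K`, `R`, `t` and `x_w`), the full-rank form makes the projection invertible when the inverse depth is known:

```python
# Full-rank 4x4 projection matrix built from K and [R|t] as defined above.
K4 = np.eye(4); K4[:3, :3] = K                  # [[K, 0], [0^T, 1]]
T = np.eye(4); T[:3, :3] = R; T[:3, 3] = t      # [[R, t], [0^T, 1]]
P = K4 @ T

x_tilde = P @ np.append(x_w, 1.0)   # 4D homogeneous image point
x_tilde /= x_tilde[2]               # normalize w.r.t. the 3rd entry
# x_tilde is now (x_s, y_s, 1, 1/z_c): the 4th component is the inverse depth.

# With known inverse depth, the 3D point is recovered by inverting P.
x_w_rec = (np.linalg.inv(P) @ (x_tilde / x_tilde[3]))[:3]
assert np.allclose(x_w_rec, x_w)
```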
Lens distortion: due to the camera optics, straight lines in the world may appear curved in the image; this distortion is typically modeled with a low-order polynomial in the distance from the image center and can be removed by warping the image.
Photometric Image Formation
Rendering Equation:
$$ L_{\text{out}}(\mathbf{p}, \mathbf{v}, \lambda) = L_{\text{emit}}(\mathbf{p}, \mathbf{v}, \lambda) + \int_{\Omega} \text{BRDF}(\mathbf{p}, \mathbf{s}, \mathbf{v}, \lambda) \cdot L_{\text{in}}(\mathbf{p}, \mathbf{s}, \lambda) \cdot (\mathbf{n}^\top \mathbf{s}) \, d\mathbf{s} $$
where \(\mathbf{p}\in \mathbb{R}^3\) denotes a 3D surface point, \(\mathbf{v}\in \mathbb{R}^3\) the viewing direction, \(\mathbf{s}\in \mathbb{R}^3\) the incoming light direction, and \(\lambda\) the wavelength. \(\Omega\) is the unit hemisphere centered around the surface normal \(\mathbf{n}\). \(L_{\text{emit}}>0\) only for light-emitting surfaces.
Camera lenses: cameras use lenses to accumulate light on the sensor plane; if a 3D point is in focus, all light rays passing through the lens arrive at the same 2D pixel. For many applications a pinhole camera model is sufficient, but to model focus, vignetting and aberration we need to model lenses.
Thin lens model: a 3D point at distance \(z_c\) from the lens is in focus when \(1/z_s + 1/z_c = 1/f\), where \(z_s\) is the distance from the lens to the image plane and \(f\) is the focal length.
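For example, with a hypothetical focal length \(f = 50\,\text{mm}\) and an object at distance \(z_c = 1\,\text{m}\):
$$ \frac{1}{z_s} = \frac{1}{f} - \frac{1}{z_c} = \frac{1}{50\,\text{mm}} - \frac{1}{1000\,\text{mm}} \quad\Rightarrow\quad z_s \approx 52.6\,\text{mm} $$
i.e., the image plane must sit roughly 52.6 mm behind the lens for this point to be in focus.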
Depth of Field (DOF): if the image plane is out of focus, a 3D point projects to a circle of confusion \(c\).
To control the size of circle of confusion, we can change the lens aperture.
Vignetting: Tendency for brightness to fall off towards the image edge.