Camera Calibration
Overview
The Pinhole Camera Model
The projection from 3D space to a 2D image is modeled by:
\[ s\,\mathbf{x} = \mathbf{K}\,[\mathbf{R} \mid \mathbf{t}]\,\mathbf{X} \]
with \(\mathbf{x}\) and \(\mathbf{X}\) in homogeneous coordinates and \(s\) an arbitrary scale factor.
Where:
- Known:
- World points \(\mathbf{X}\): by measuring the calibration object
- Image points \(\mathbf{x}\): by corner detection
- Unknown:
- Intrinsics \(\mathbf{K} = \begin{bmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\)
- Extrinsics \(\mathbf{R}, \mathbf{t}\)
- Distortion Coefficients: \(k_1, k_2, p_1, p_2\)
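To make the model concrete, here is a minimal numpy sketch of the forward projection (ignoring distortion for now). All numeric values (focal length, principal point, pose) are made-up placeholders:

```python
import numpy as np

# Toy intrinsics/extrinsics; every number here is a placeholder.
K = np.array([[800.0,   0.0, 320.0],   # f_x, gamma = 0, c_x
              [  0.0, 800.0, 240.0],   # f_y, c_y
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # camera axes aligned with the world
t = np.array([0.0, 0.0, 5.0])          # world origin 5 units in front

X = np.array([0.1, -0.2, 0.0])         # a 3D point on the calibration plane

x_cam = R @ X + t                      # world -> camera coordinates
x_hom = K @ x_cam                      # homogeneous pixels (s*u, s*v, s)
u, v = x_hom[:2] / x_hom[2]            # perspective division by the scale s
```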
Skew Coefficient \(\gamma\)
In analog cameras, a synchronization error introduced by the frame grabber can make the two pixel axes non-orthogonal:
[NOTE] This is different from the rolling shutter effect, which causes no artifacts when both the camera and the scene are static.

Here, let \(\theta\) be the angle between the two pixel axes (ideally \(\theta = 90^\circ\)). A point \((x, y)\) on the normalized image plane then maps to pixel coordinates:
\[ \begin{aligned} u &= f_x\, x - f_x \cot\theta\, y + c_x \\ v &= \frac{f_y}{\sin\theta}\, y + c_y \end{aligned} \]
We can write \(f_y^\prime = \frac{f_y}{\sin\theta}\) and the skew factor \(\gamma = -f_x\cot\theta\), and obtain the final intrinsic matrix \(\mathbf{K}\) shown above.
[NOTE] For modern digital cameras, the skew coefficient is usually 0. Note that the skew coefficient is different from the distortion coefficients, which model non-linear lens imperfections.
Distortion Coefficients \(k_1, k_2, p_1, p_2\)
Unlike the skew coefficient \(\gamma\), which models a linear geometric shear within the intrinsic matrix \(\mathbf{K}\), distortion coefficients model non-linear aberrations caused by the physical lens optics. This step occurs on the normalized image plane (Z=1), before the projection from camera space to pixel space.
- Radial Distortion:
\[ \begin{aligned} x_{\text{rad}} &= x (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \\ y_{\text{rad}} &= y (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \end{aligned} \]
- Tangential Distortion:
\[ \begin{aligned} x_{\text{tan}} &= 2p_1 xy + p_2(r^2 + 2x^2) \\ y_{\text{tan}} &= p_1(r^2 + 2y^2) + 2p_2 xy \end{aligned} \]
Here \(r^2 = x^2 + y^2\) is the squared distance from the center of the normalized image plane.
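As a direct transcription of the two formulas, the following sketch applies both distortion terms to a point on the normalized plane; \(k_3\) defaults to zero to match the four-coefficient model used in this note:

```python
import numpy as np

def distort(x, y, k1, k2, p1, p2, k3=0.0):
    """Apply radial + tangential distortion to a point on the
    normalized image plane (Z = 1), following the formulas above."""
    r2 = x * x + y * y                       # r^2 = x^2 + y^2
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_d, y_d
```

The distorted coordinate is the sum of the radial and tangential terms (the code folds both into one expression); the result is then mapped to pixel coordinates by \(\mathbf{K}\).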
1. Direct Linear Transformation (DLT)
Core Concept: DLT is a "brute force" approach that solves for the combined projection matrix \(\mathbf{P}\) (\(3\times4\)) directly, without explicitly separating rotation, translation, or intrinsics initially.
Pipeline
- We ignore the internal structure of \(\mathbf{K}, \mathbf{R}, \mathbf{t}\) and treat \(\mathbf{P}\) as a matrix with 12 elements and \(12-1=11\) unknowns (it is only defined up to scale).
- Each correspondence \(\mathbf{x}_i \leftrightarrow \mathbf{X}_i\) provides 2 independent linear equations.
- With \(N \ge 6\) points, we can solve the system \(\mathbf{A}\mathbf{p} = 0\) using SVD (Singular Value Decomposition).
- Once \(\mathbf{P}\) is found, we use RQ Decomposition on its left \(3\times3\) block to factorize it back into an upper-triangular \(\mathbf{K}\) and an orthogonal \(\mathbf{R}\); \(\mathbf{t}\) then follows from the last column. (See the sketch below.)
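A minimal numpy sketch of this pipeline; the point arrays are assumed inputs from a measured 3D rig:

```python
import numpy as np

def dlt(X_world, x_pix):
    """Estimate the 3x4 projection matrix P from N >= 6 point pairs.
    X_world: (N, 3) world points; x_pix: (N, 2) pixel points."""
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X_world, x_pix):
        Xh = [Xw, Yw, Zw, 1.0]                   # homogeneous world point
        # Each correspondence contributes 2 rows of the system A p = 0.
        A.append([*Xh, 0, 0, 0, 0, *(-u * c for c in Xh)])
        A.append([0, 0, 0, 0, *Xh, *(-v * c for c in Xh)])
    # p is the right singular vector of the smallest singular value
    # (the null-space direction of A), reshaped into P.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)
```

The left \(3\times3\) block of the returned \(\mathbf{P}\) can then be passed to scipy.linalg.rq to obtain the upper-triangular \(\mathbf{K}\) and the orthogonal \(\mathbf{R}\).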
Pros & Cons
- Pros: Linear, simple to solve.
- Cons: Requires a 3D calibration rig (points cannot be coplanar). If points are on a flat plane (like a single chessboard photo), the system becomes degenerate and cannot be solved.
2. Zhang's Method
Core Concept: Use a planar pattern (chessboard) and exploit geometric constraints on the rotation matrix to solve for intrinsics, even though the depth (\(Z\)) is unknown.
Step 1: The Homography (\(\mathbf{H}\))
Since the calibration object is planar, we assume \(Z=0\) for all world points. The projection equation simplifies:
\[ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} \mathbf{r}_1 & \mathbf{r}_2 & \mathbf{r}_3 & \mathbf{t} \end{bmatrix} \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} \mathbf{r}_1 & \mathbf{r}_2 & \mathbf{t} \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \]
We define the Homography Matrix \(\mathbf{H} = [\mathbf{h}_1 \quad \mathbf{h}_2 \quad \mathbf{h}_3]\) such that:
\[ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{H} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}, \qquad \mathbf{H} = \lambda\, \mathbf{K} \begin{bmatrix} \mathbf{r}_1 & \mathbf{r}_2 & \mathbf{t} \end{bmatrix} \]
where \(\lambda\) is an arbitrary scale (\(\mathbf{H}\) is only defined up to scale).
We can calculate \(\mathbf{H}\) for each image using the known point correspondences (similar to DLT, but 2D-to-2D).
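In practice this step is a single library call; the sketch below uses OpenCV's cv2.findHomography on made-up correspondences (in a real pipeline the pixel points would come from cv2.findChessboardCorners):

```python
import numpy as np
import cv2

# Chessboard corners in the board's own frame (Z = 0, so only X, Y are
# kept) and their detected pixel locations; all numbers are placeholders.
obj_pts = np.array([[0, 0], [1, 0], [2, 0], [0, 1], [1, 1], [2, 1]],
                   dtype=np.float32)
img_pts = np.array([[102, 98], [201, 101], [299, 105],
                    [100, 200], [198, 203], [297, 207]], dtype=np.float32)

H, _ = cv2.findHomography(obj_pts, img_pts)  # 2D-to-2D DLT under the hood
```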
Step 2: Extracting Geometric Constraints
From the equation above, we can isolate the rotation vectors:
\[ \mathbf{r}_1 = \frac{1}{\lambda} \mathbf{K}^{-1} \mathbf{h}_1, \qquad \mathbf{r}_2 = \frac{1}{\lambda} \mathbf{K}^{-1} \mathbf{h}_2, \qquad \mathbf{t} = \frac{1}{\lambda} \mathbf{K}^{-1} \mathbf{h}_3 \]
Since \(\mathbf{R}\) is a rotation matrix, its columns \(\mathbf{r}_1\) and \(\mathbf{r}_2\) must be orthonormal:
- Orthogonal: \(\mathbf{r}_1^T \mathbf{r}_2 = 0\)
- Unit Length: \(||\mathbf{r}_1|| = ||\mathbf{r}_2|| \implies \mathbf{r}_1^T \mathbf{r}_1 = \mathbf{r}_2^T \mathbf{r}_2\)
Substituting \(\mathbf{r}\) with terms of \(\mathbf{K}\) and \(\mathbf{h}\) (the scale \(\lambda\) cancels):
\[ \begin{aligned} \mathbf{h}_1^T \mathbf{K}^{-T} \mathbf{K}^{-1} \mathbf{h}_2 &= 0 \\ \mathbf{h}_1^T \mathbf{K}^{-T} \mathbf{K}^{-1} \mathbf{h}_1 &= \mathbf{h}_2^T \mathbf{K}^{-T} \mathbf{K}^{-1} \mathbf{h}_2 \end{aligned} \]
Step 3: Solving for Intrinsics (\(\mathbf{K}\))
Let \(\mathbf{B} = \mathbf{K}^{-T} \mathbf{K}^{-1}\). Since \(\mathbf{B}\) is symmetric, it has \(5+1=6\) unknowns: the 5 parameters of \(\mathbf{K}\) plus an overall scale factor. We rewrite the constraints as a linear system:
\[ \mathbf{A}\,\mathbf{b} = \mathbf{0} \]
Where \(\mathbf{b}\) is the vector of the 6 unknowns in \(\mathbf{B}\).
- Each image provides 2 equations.
- With \(\ge 3\) images, we can solve for \(\mathbf{b}\).
- Once \(\mathbf{B}\) is known, use Cholesky decomposition to recover \(\mathbf{K}\) (sketched below).
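A sketch of the recovery step, assuming \(\mathbf{B}\) has already been assembled from the solved \(\mathbf{b}\):

```python
import numpy as np

def intrinsics_from_B(B):
    """Recover K from B = K^{-T} K^{-1}.
    b is only defined up to scale, so if the estimate came out with the
    wrong overall sign (Cholesky fails), negate B before calling this."""
    L = np.linalg.cholesky(B)        # B = L L^T with L lower triangular
    K = np.linalg.inv(L).T           # L = K^{-T}  =>  K = inv(L)^T
    return K / K[2, 2]               # normalize so that K[2, 2] = 1
```

The trick is that \(\mathbf{K}^{-T}\) is lower triangular, so it is exactly the Cholesky factor of \(\mathbf{B}\); inverting and transposing it gives \(\mathbf{K}\).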
Step 4: Recovering Extrinsics
Once \(\mathbf{K}\) is known, the extrinsic parameters for each image are recovered from its homography. Since \(\|\mathbf{r}_1\| = 1\), the scale is fixed:
\[ \lambda = \|\mathbf{K}^{-1}\mathbf{h}_1\|, \quad \mathbf{r}_1 = \frac{1}{\lambda}\mathbf{K}^{-1}\mathbf{h}_1, \quad \mathbf{r}_2 = \frac{1}{\lambda}\mathbf{K}^{-1}\mathbf{h}_2, \quad \mathbf{r}_3 = \mathbf{r}_1 \times \mathbf{r}_2, \quad \mathbf{t} = \frac{1}{\lambda}\mathbf{K}^{-1}\mathbf{h}_3 \]
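A numpy sketch of this recovery for a single image; the SVD re-orthogonalization at the end is a standard cleanup step, not part of the closed-form solution:

```python
import numpy as np

def extrinsics_from_H(K, H):
    """Recover R, t for one image from its homography H = [h1 h2 h3]."""
    K_inv = np.linalg.inv(K)
    h1, h2, h3 = H[:, 0], H[:, 1], H[:, 2]
    lam = np.linalg.norm(K_inv @ h1)       # scale fixed by ||r1|| = 1
    r1 = (K_inv @ h1) / lam
    r2 = (K_inv @ h2) / lam
    r3 = np.cross(r1, r2)                  # complete the right-handed basis
    t = (K_inv @ h3) / lam
    R = np.column_stack([r1, r2, r3])
    # Noise makes R only approximately orthogonal; project onto the
    # nearest rotation matrix via SVD.
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt, t
```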
3. Non-Linear Refinement (Optimization)
Why needed? DLT and Zhang's method minimize an algebraic error and assume a perfect pinhole model. Real lenses, however, exhibit distortion, and what we actually care about is the reprojection error (a geometric distance in pixels).
The Objective Function
We define the Reprojection Error \(E\) as the Euclidean distance between the detected pixel points \(\mathbf{m}_{ij}\) and the projected points \(\pi(\Theta)\):
\[ E = \sum_{i=1}^{N} \sum_{j=1}^{M} \left\| \mathbf{m}_{ij} - \pi(\mathbf{K}, \mathbf{D}, \mathbf{R}_i, \mathbf{t}_i, \mathbf{X}_j) \right\|^2 \]
where \(\Theta\) contains all optimizing variables:
- Intrinsics: \(\mathbf{K} = \{f_x, f_y, c_x, c_y\}\)
- Distortion: \(\mathbf{D} = \{k_1, k_2, p_1, p_2\}\)
- Extrinsics: \(\mathbf{T}_i = \{\mathbf{r}_i, \mathbf{t}_i\}\) for every image \(i=1...N\).
To minimize this, we define the residual vector \(\mathbf{e}(\Theta) = \mathbf{m} - \pi(\Theta)\) and minimize the cost function:
\[ F(\Theta) = \frac{1}{2}\, \mathbf{e}(\Theta)^T \mathbf{e}(\Theta) \]
Optimization
1. Linearization (Taylor Expansion) Suppose we are at iteration \(k\) with parameters \(\Theta_k\). We seek a step \(\Delta\) such that \(\Theta_{k+1} = \Theta_k + \Delta\) reduces the error. We approximate the non-linear residual function \(\mathbf{e}\) using a first-order Taylor expansion:
\[ \mathbf{e}(\Theta_k + \Delta) \approx \mathbf{e}(\Theta_k) + \mathbf{J}\Delta \]
Where \(\mathbf{J} = \frac{\partial \mathbf{e}}{\partial \Theta}\) is the Jacobian Matrix.
2. Approximating the Cost Function Substitute this into the cost function \(F(\Theta)\):
\[ F(\Theta_k + \Delta) \approx \frac{1}{2}(\mathbf{e} + \mathbf{J}\Delta)^T(\mathbf{e} + \mathbf{J}\Delta) = \frac{1}{2}\mathbf{e}^T\mathbf{e} + \mathbf{e}^T\mathbf{J}\Delta + \frac{1}{2}\Delta^T\mathbf{J}^T\mathbf{J}\Delta \]
3. Finding the Minimum (Gauss-Newton) To find the optimal \(\Delta\), we take the derivative with respect to \(\Delta\) and set it to 0:
\[ \frac{\partial F}{\partial \Delta} = \mathbf{J}^T\mathbf{e} + \mathbf{J}^T\mathbf{J}\Delta = 0 \]
Rearranging gives the Normal Equations:
\[ \mathbf{J}^T\mathbf{J}\,\Delta = -\mathbf{J}^T\mathbf{e} \]
Some intuition here:
- The Jacobian \(\mathbf{J}\in \mathbb{R}^{N_{obs}\times N_{params}}\) maps the parameter space (\(N_{params}\)) to the residual space (\(N_{obs}\)), where \(N_{obs} > N_{params}\).
- \(\mathbf{J}^T \mathbf{e}\) projects the high-dimensional residual back onto the parameter space. For instance, 100 observations project back to \(N_{params} \ll 100\) parameters, yielding an \(N_{params}\)-dimensional vector that points in the steepest direction to reduce the error.
- \(\mathbf{J}^T \mathbf{J}\) captures the correlation between parameters (off-diagonal elements) and the curvature along each parameter (diagonal elements). It tells the optimizer how large a step to take along each direction.
4. The Levenberg-Marquardt Modification If \(\mathbf{J}^T\mathbf{J}\) is ill-conditioned (nearly singular), Gauss-Newton fails. LM adds a damping factor \(\lambda\):
\[ (\mathbf{J}^T\mathbf{J} + \lambda \mathbf{I})\,\Delta = -\mathbf{J}^T\mathbf{e} \]
- Large \(\lambda\): Behaves like Gradient Descent (Robust, slow).
- Small \(\lambda\): Behaves like Gauss-Newton (Fast, precise).
[NOTE] Marquardt's Insight: In practice, we replace identity matrix \(\mathbf{I}\) with the diagonal of the Hessian approximation:
\[(\mathbf{J}^T\mathbf{J} + \lambda \text{diag}(\mathbf{J}^T\mathbf{J}))\Delta = -\mathbf{J}^T\mathbf{e}\]This scales the step size based on the curvature of each parameter, handling the massive scale difference between focal length (\(f \approx 1000\)) and distortion coefficients (\(k \approx 0.001\)).
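The whole LM loop is compact enough to sketch generically; here residual and jacobian are assumed callables supplied by the calibration model, and the schedule constants (0.5, 10) are conventional choices, not prescribed values:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, theta, n_iters=50, lam=1e-3):
    """Generic LM loop: residual(theta) -> e, jacobian(theta) -> J."""
    cost = 0.5 * residual(theta) @ residual(theta)
    for _ in range(n_iters):
        e, J = residual(theta), jacobian(theta)
        JtJ = J.T @ J
        # Marquardt's scaling: damp with diag(J^T J) instead of I.
        A = JtJ + lam * np.diag(np.diag(JtJ))
        delta = np.linalg.solve(A, -J.T @ e)
        new_cost = 0.5 * residual(theta + delta) @ residual(theta + delta)
        if new_cost < cost:          # step accepted: trust the model more
            theta, cost = theta + delta, new_cost
            lam *= 0.5               # behave more like Gauss-Newton
        else:
            lam *= 10.0              # behave more like gradient descent
    return theta
```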
4. Bundle Adjustment - Exploiting Sparsity
Solving the normal equations directly requires inverting \(\mathbf{H} \approx \mathbf{J}^T\mathbf{J}\). If we have \(N\) images, the naive inversion is \(O(N^3)\). We thus need to exploit the sparsity pattern of the problem.
A. Structure of the Jacobian
The residuals for Image \(i\) depend ONLY on:
- Global Intrinsics: \(\mathbf{K}\)
- Distortion Coefficients: \(\mathbf{D}\)
- Extrinsics of Image \(i\): \(\mathbf{R}_i, \mathbf{t}_i\)
They do not depend on the extrinsics of any other image \(j\). This creates a sparse block structure:
\[ \mathbf{J} = \begin{bmatrix} \mathbf{A}_1 & \mathbf{B}_1 & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{A}_2 & \mathbf{0} & \mathbf{B}_2 & \cdots & \mathbf{0} \\ \vdots & \vdots & & \ddots & \\ \mathbf{A}_N & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{B}_N \end{bmatrix} \]
where \(\mathbf{A}_i\) holds the derivatives of image \(i\)'s residuals w.r.t. the shared intrinsics and distortion, and \(\mathbf{B}_i\) the derivatives w.r.t. its own extrinsics.
[NOTE] Here the computation of derivatives w.r.t. the rotation matrix requires Lie algebra.
B. Structure of the Hessian ("Arrowhead Matrix")
When we compute \(\mathbf{H} = \mathbf{J}^T\mathbf{J}\), the "zero blocks" result in a specific shape:
\[ \mathbf{H} = \begin{bmatrix} \mathbf{U} & \mathbf{W} \\ \mathbf{W}^T & \mathbf{V} \end{bmatrix} \]
- \(\mathbf{U}\): Intrinsics self-correlation (Small, Dense).
- \(\mathbf{V}\): Extrinsics self-correlation. Crucially, this is Block Diagonal because Extrinsics \(i\) and \(j\) are uncorrelated.
- \(\mathbf{W}\): Coupling between Intrinsics and Extrinsics.
An insight about \(\mathbf{W}\): if you change the focal length \(f\), you can partially compensate by changing \(\mathbf{R}\). This coupling between intrinsics and extrinsics is exactly what the structure of \(\mathbf{W}\) captures.
C. Schur Complement (Block Elimination)
Since \(\mathbf{V}\) is block diagonal, it is computationally cheap to invert. We can eliminate \(\Delta_{\text{ext}}\) to create a reduced system for \(\Delta_{\text{int}}\).
We start with the two linear equations from the block matrix:
- \(\mathbf{U}\Delta_{\text{int}} + \mathbf{W}\Delta_{\text{ext}} = \mathbf{g}_{\text{int}}\)
- \(\mathbf{W}^T\Delta_{\text{int}} + \mathbf{V}\Delta_{\text{ext}} = \mathbf{g}_{\text{ext}}\)
Step 1: Solve equation (2) for \(\Delta_{\text{ext}}\):
\[ \Delta_{\text{ext}} = \mathbf{V}^{-1}\left(\mathbf{g}_{\text{ext}} - \mathbf{W}^T\Delta_{\text{int}}\right) \]
Step 2: Substitute this expression into equation (1):
\[ \mathbf{U}\Delta_{\text{int}} + \mathbf{W}\mathbf{V}^{-1}\left(\mathbf{g}_{\text{ext}} - \mathbf{W}^T\Delta_{\text{int}}\right) = \mathbf{g}_{\text{int}} \]
Step 3: Expand and rearrange to group \(\Delta_{\text{int}}\) terms:
\[ \left(\mathbf{U} - \mathbf{W}\mathbf{V}^{-1}\mathbf{W}^T\right)\Delta_{\text{int}} = \mathbf{g}_{\text{int}} - \mathbf{W}\mathbf{V}^{-1}\mathbf{g}_{\text{ext}} \]
Algorithm:
- Solve Reduced System: Compute the Schur Complement matrix and solve for \(\Delta_{\text{int}}\).
- Back-Substitution: Plug \(\Delta_{\text{int}}\) back into the formula from Step 1 to compute \(\Delta_{\text{ext}}\).
This reduces complexity from \(O(N^3)\) to \(O(N)\) (linear in the number of images): inverting the block-diagonal \(\mathbf{V}\) costs only \(N\) small \(6\times6\) inversions, and the reduced system for \(\Delta_{\text{int}}\) stays tiny regardless of \(N\).
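A numpy sketch of the elimination, assuming the blocks have already been assembled (6 extrinsic parameters per image, so \(\mathbf{V}\) is a list of \(6\times6\) blocks):

```python
import numpy as np

def schur_solve(U, W, V_blocks, g_int, g_ext_blocks):
    """Solve the arrowhead normal equations via the Schur complement.
    U: (n_int, n_int); W: (n_int, 6N) coupling matrix;
    V_blocks / g_ext_blocks: per-image 6x6 blocks and 6-vectors."""
    V_inv = [np.linalg.inv(Vi) for Vi in V_blocks]     # O(N) tiny inversions
    S, rhs = U.copy(), g_int.copy()
    for i, (Vi_inv, gi) in enumerate(zip(V_inv, g_ext_blocks)):
        Wi = W[:, 6 * i: 6 * (i + 1)]                  # coupling with image i
        S -= Wi @ Vi_inv @ Wi.T                        # U - W V^{-1} W^T
        rhs -= Wi @ Vi_inv @ gi                        # g_int - W V^{-1} g_ext
    d_int = np.linalg.solve(S, rhs)                    # reduced (small) system
    # Back-substitution, one image at a time.
    d_ext = [Vi_inv @ (gi - W[:, 6 * i: 6 * (i + 1)].T @ d_int)
             for i, (Vi_inv, gi) in enumerate(zip(V_inv, g_ext_blocks))]
    return d_int, d_ext
```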