1.3. Camera Calibration

1.3.1. Estimating the Camera Matrix using the DLT

Our camera model projects a 3D point \(\hv X\) on a 2D retina resulting in point \(\hv x\):

\[\hv x \sim P \hv X\]

In this section we introduce the Direct Linear Transform (DLT) to estimate \(P\) when a number of correspondences \(\hv x_i \sim P \hv X_i\) are known (the calibration points). The crux of this method is to rewrite the above relation in the form \(A \v p = \v 0\), where \(\v p\) is the vector formed from all elements of the matrix \(P\) and the matrix \(A\) is built from all our calibration points. The homogeneous system \(A\v p=\v 0\) can then be solved using the SVD ‘trick’. Let’s see how to construct the matrix \(A\) for this problem and what the SVD trick entails.

Note that mathematically \(\hv x = s P\hv X\) for some nonzero scalar \(s\), i.e. the vectors \(\hv x\) and \(P\hv X\) are parallel. Because both are 3-component vectors we can use the cross product, and therefore \(\hv x \times P \hv X=\v 0\).
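The parallelism argument can be checked numerically. The sketch below is only an illustration: the camera matrix `P`, the 3D point and the helper name `cross_residual` are made-up values and names, not part of the method itself.

```python
import numpy as np

def cross_residual(x_h, P, X_h):
    """Return the cross product of x_h with P @ X_h (zero when they are parallel)."""
    return np.cross(x_h, P @ X_h)

# Hypothetical camera matrix and homogeneous 3D point, chosen for illustration.
P = np.array([[800.0,   0.0, 320.0, 10.0],
              [  0.0, 800.0, 240.0, 20.0],
              [  0.0,   0.0,   1.0,  2.0]])
X_h = np.array([0.5, -0.25, 3.0, 1.0])

PX = P @ X_h            # homogeneous image point, known only up to scale
x_h = PX / PX[2]        # normalized so that the third component is 1

on_ray = cross_residual(x_h, P, X_h)                           # parallel vectors
off_ray = cross_residual(x_h + np.array([1.0, 0, 0]), P, X_h)  # shifted by one pixel
```

Here `on_ray` vanishes because `x_h` is a scaled copy of `P @ X_h`, while `off_ray` does not, because the shifted image point no longer lies on the same ray.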

Using \(\hv p_1\T\), \(\hv p_2\T\) and \(\hv p_3\T\) to denote the 3 row vectors of \(P\) and using \(\hv x = (x\;y\;1)\T\), calculating the cross product leads to:

\[\begin{split}\hv x \times P \hv X = \matvec{c}{ - \hv X\T\hv p_2 + y \hv X\T \hv p_3 \\ \hv X\T \hv p_1 - x \hv X\T \hv p_3\\ x \hv X\T \hv p_2 - y \hv X\T \hv p_1} = \matvec{c}{0\\0\\0}\end{split}\]

or equivalently:

\[\begin{split}\matvec{ccc}{ 0 & -\hv X\T & y \hv X\T \\ \hv X\T & 0 & -x\hv X\T \\ -y \hv X\T & x\hv X\T & 0 } \matvec{c}{\hv p_1 \\ \hv p_2 \\ \hv p_3} = \matvec{c}{0\\0\\0}\end{split}\]

Note that the third row of the matrix is a linear combination of the first two rows (it equals \(-x\) times the first row plus \(-y\) times the second row), so we need to consider the first two rows only. Interchanging the first and second row and negating the row that is then second (allowed because we are looking for a null vector), we can rewrite this as:

\[\begin{split}\matvec{ccc}{ \hv X\T & \v 0\T & -x \hv X\T \\ \v 0\T & \hv X\T & -y \hv X\T } \matvec{c}{\hv p_1\\ \hv p_2 \\ \hv p_3} = \matvec{ccc}{ \hv X\T & \v 0\T & -x \hv X\T \\ \v 0\T & \hv X\T & -y \hv X\T }\v p = \matvec{c}{0\\0}\end{split}\]

Note that \(\v p\) stacks the 3 rows of \(P\) on top of each other, forming a 12-component vector.

The above equation is a homogeneous set of equations. It is based on just one point correspondence \((X,Y,Z)\rightarrow(x,y)\), which contributes two equations. For \(n\) point correspondences we may stack the resulting \(2n\) equations on top of each other:

\[\begin{split}\matvec{ccc}{ \hv X_1\T & \v 0\T & -x_1 \hv X_1\T \\ \v 0\T & \hv X_1\T & -y_1 \hv X_1\T \\ \vdots & \vdots & \vdots \\ \hv X_n\T & \v 0\T & -x_n \hv X_n\T \\ \v 0\T & \hv X_n\T & -y_n \hv X_n\T \\ } \v p = \v 0 \\ A \v p = \v 0\end{split}\]
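As an illustration, the stacked matrix \(A\) could be assembled with NumPy as follows. This is a minimal sketch: the function name `build_dlt_matrix` and the array layout (3D points as an `(n, 3)` array, image points as an `(n, 2)` array) are assumptions made for this example.

```python
import numpy as np

def build_dlt_matrix(X, x):
    """Stack the 2n x 12 matrix A from n point correspondences.

    X : (n, 3) array of 3D calibration points (X_i, Y_i, Z_i)
    x : (n, 2) array of measured image points (x_i, y_i)
    """
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    n = X.shape[0]
    Xh = np.hstack([X, np.ones((n, 1))])   # homogeneous 3D points, shape (n, 4)

    A = np.zeros((2 * n, 12))
    # Even rows:  ( X^T   0^T   -x X^T )
    A[0::2, 0:4] = Xh
    A[0::2, 8:12] = -x[:, [0]] * Xh
    # Odd rows:   ( 0^T   X^T   -y X^T )
    A[1::2, 4:8] = Xh
    A[1::2, 8:12] = -x[:, [1]] * Xh
    return A
```

With this column ordering, the unknown vector corresponds to `P.ravel()`, i.e. the three rows of \(P\) stacked on top of each other, so for noise-free correspondences `A @ P.ravel()` should be numerically zero.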

For 6 point correspondences we can find a non-trivial null vector \(\v p^\ast\). This only works when there are no measurement errors, and because the 2D points are measured in images there will always be noise. In practical situations using more than the minimal number of point correspondences is therefore to be preferred, and we search for the vector \(\v p\) that minimizes \(\|A\v p\|\) subject to the constraint \(\|\v p\|=1\). The ‘SVD trick’ can be used here: the minimizer is the right singular vector of \(A\) that belongs to its smallest singular value.
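The constrained minimization can be sketched in a few lines of NumPy. The helper name `smallest_singular_vector` is an assumption made for this example; `np.linalg.svd` returns the singular values in descending order, so the last row of the returned \(V^T\) belongs to the smallest one.

```python
import numpy as np

def smallest_singular_vector(A):
    """Return the unit vector p that minimizes ||A p|| subject to ||p|| = 1."""
    # Full SVD, so that V^T is square even when A has fewer rows than columns.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]

# Tiny illustration: the null space of this matrix is spanned by (1, 1, 1).
A_demo = np.array([[1.0, 0.0, -1.0],
                   [0.0, 1.0, -1.0]])
p_demo = smallest_singular_vector(A_demo)
```

The sign of the returned vector is arbitrary, which is harmless here because \(\v p\) is only defined up to scale. For the \(2n\times12\) calibration matrix the result is a 12-component vector, and `p.reshape(3, 4)` arranges it back into the rows of \(P\).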

Given the optimal vector \(\v p\) we can reshape it into a \(3\times4\) camera matrix \(P\). Given the calibrated camera matrix we may take each 3D model point \((X_i,Y_i,Z_i)\) and project it onto the camera plane to obtain \((a_i, b_i)\). When we compare this ‘reprojected’ point \((a_i,b_i)\) with the ‘true’ projected point \((x_i,y_i)\), we expect the reprojection distance \(d_i=\sqrt{(x_i-a_i)^2 + (y_i-b_i)^2}\) to be small.
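The reprojection check can be written down directly. The sketch below assumes the same array layout as the text (3D points as rows \((X_i,Y_i,Z_i)\), image points as rows \((x_i,y_i)\)); the function name `reprojection_distances` is chosen for this example only.

```python
import numpy as np

def reprojection_distances(P, X, x):
    """Return d_i = sqrt((x_i - a_i)^2 + (y_i - b_i)^2) for every correspondence.

    P : (3, 4) camera matrix
    X : (n, 3) array of 3D model points
    x : (n, 2) array of measured image points
    """
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])   # homogeneous 3D points
    proj = Xh @ np.asarray(P, dtype=float).T        # rows are P X_i
    ab = proj[:, :2] / proj[:, 2:3]                 # divide by third component
    return np.linalg.norm(x - ab, axis=1)
```

A summary statistic such as the mean or maximum of the returned distances, measured in pixels, gives a quick impression of the calibration quality.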

This, however, we can only hope for: finding the camera matrix \(P\) by minimizing \(\|A\v p\|\) subject to \(\|\v p\|=1\) does not guarantee that the sum of squared reprojection distances is minimal. Minimizing the sum of squared reprojection errors is a difficult (and computationally complex) nonlinear optimization problem that can only be solved approximately. For this we refer to the literature.