Least Squares Estimators

This section provides a practical introduction to least squares estimation in a linear algebra framework that is generic enough to be useful in many applications.

Fitting a straight line

The classical example of a least squares estimator is fitting a straight line \(f(x)=p_1 + p_2 x\) to a set of \(N\) measurements \(\{x_i, f_i\}\) for \(i=1,\ldots,N\). In case all points lie exactly on the straight line we have:

\[f_i = p_1 + p_2 x_i\]

for all \(i=1,\ldots,N\). For data obtained through measurements there will be some unavoidable deviation from this ideal situation; instead we will have:

\[f_i - p_1 - p_2 x_i = e_i\]

where \(e_i\) is a (hopefully) small deviation (error) from the model. The goal of the least squares (LSQ) estimation procedure is to find the values of the parameters \(p_1\) and \(p_2\) for which the sum of the squared errors is minimal. The total squared error is:

\[\epsilon(p_1,p_2) = \sum_{i=1}^N e_i^2 =\sum_{i=1}^N ( f_i - p_1 - p_2 x_i )^2\]

We are thus looking for the values of \(p_1\) and \(p_2\) that make the total squared error minimal. A necessary condition for a function \(\epsilon\) of two parameters to have an extreme value (either a local maximum or a local minimum) at \((p_1,p_2)\) is that the derivatives of \(\epsilon\) with respect to both parameters are zero:

\[\frac{\partial}{\partial p_1}\, \epsilon = \frac{\partial}{\partial p_2}\, \epsilon = 0\]

Or equivalently:

\[\begin{split}- 2\sum_{i=1}^N ( f_i - p_1 - p_2 x_i ) &=& 0\\ - 2\sum_{i=1}^N x_i ( f_i - p_1 - p_2 x_i ) &=& 0\end{split}\]

This can be rewritten as:

\[\begin{split}p_1 \sum_{i=1}^N 1 + p_2 \sum_{i=1}^N x_i &=& \sum_{i=1}^N f_i \\ p_1 \sum_{i=1}^N x_i + p_2 \sum_{i=1}^N x_i^2 &=& \sum_{i=1}^N x_i f_i\end{split}\]

In a matrix vector notation we can write:

\[\begin{split}\matvec{cc}{ \sum_{i=1}^N 1 & \sum_{i=1}^N x_i \\ \sum_{i=1}^N x_i & \sum_{i=1}^N x_i^2 } \matvec{c}{p_1 \\ p_2} = \matvec{c}{ \sum_{i=1}^N f_i\\ \sum_{i=1}^N x_i f_i }\end{split}\]

The optimal parameters are thus found as:

\[\begin{split}\matvec{c}{p_1\\ p_2} &=& \matvec{cc}{ \sum_{i=1}^N 1 & \sum_{i=1}^N x_i \\ \sum_{i=1}^N x_i & \sum_{i=1}^N x_i^2 }\inv \matvec{c}{ \sum_{i=1}^N f_i\\ \sum_{i=1}^N x_i f_i }\\ &=& \frac{1}{ -\left(\sum_{i=1}^N x_i \right)^2 + N \sum_{i=1}^N x_i^2 } \matvec{cc}{ \sum_{i=1}^N x_i^2 & - \sum_{i=1}^N x_i\\ - \sum_{i=1}^N x_i & N } \matvec{c}{ \sum_{i=1}^N f_i\\ \sum_{i=1}^N x_i f_i }\end{split}\]
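As a quick numerical illustration, the following sketch (Python with NumPy, using made-up data) evaluates the sums above and applies the closed-form inverse of the \(2\times 2\) matrix; the result is cross-checked against NumPy's own polynomial fit:

    import numpy as np

    # Made-up data: noisy samples of a line roughly equal to f(x) = 1 + 2x
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    f = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
    N = len(x)

    Sx, Sxx = x.sum(), (x**2).sum()
    Sf, Sxf = f.sum(), (x * f).sum()

    # Closed-form inverse of the 2x2 matrix, exactly as in the formula above
    det = N * Sxx - Sx**2
    p1 = ( Sxx * Sf - Sx * Sxf) / det
    p2 = (-Sx  * Sf + N  * Sxf) / det
    print(p1, p2)

    # Cross-check with NumPy's built-in fit (coefficients: highest degree first)
    print(np.polyfit(x, f, 1))   # approximately [p2, p1]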

Now consider the problem of estimating the second order polynomial \(f(x)=p_1 + p_2 x + p_3 x^2\) given a set of observations \(\{x_i, f_i\}\). It is not too difficult to go through the entire process again: calculating the total squared error, differentiating with respect to the parameters, setting all derivatives equal to zero and solving the resulting set of equations. It is, however, a tedious and error-prone exercise that we would like to generalize to all sorts of least squares estimation problems. That is what we will do in the next section.

Least Squares Estimators

Let us look again at the problem of fitting a straight line. For each of the observations \(\{x_i, f_i\}\) the deviation from the model equals

\[e_i = f_i - p_1 - p_2 x_i\]

or equivalently:

\[p_1 + p_2 x_i = f_i - e_i .\]

In a matrix vector equation we write:

\[\begin{split}\matvec{cc}{1 & x_i}\matvec{c}{p_1\\ p_2} = f_i-e_i\end{split}\]

Note that we have such an equation for each \(i=1,\ldots,N\) and that we may combine all these observations in one matrix vector equation:

\[\begin{split}\matvec{cc}{1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_N} \matvec{c}{p_1\\ p_2} = \matvec{c}{f_1\\ f_2 \\ \vdots \\ f_N}- \matvec{c}{e_1\\ e_2 \\ \vdots \\ e_N}\end{split}\]

Let us denote the matrix and the vectors in the above equation by \(\v X\), \(\v p\), \(\v f\) and \(\v e\) respectively:

\[\v X \v p = \v f - \v e\]

or equivalently:

\[\v e = \v f - \v X\v p\]

The sum of the squared errors is equal to:

\[\epsilon = \sum_{i=1}^N e_i^2 = \| \v e \|^2 = \v e\T\v e\]

i.e.

\[\epsilon = \v e\T\v e = (\v f-\v X\v p)\T(\v f-\v X\v p)= \v f\T\v f - 2\v p\T\v X\T\v f+\v p\T\v X\T\v X\v p\]
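A small numerical sanity check (NumPy, made-up numbers and an arbitrary trial parameter vector) confirms that the elementwise sum of squared errors equals the expanded matrix expression:

    import numpy as np

    # Made-up data and a trial parameter vector (not the optimal one)
    x = np.array([0.0, 1.0, 2.0, 3.0])
    f = np.array([1.0, 3.2, 4.9, 7.1])
    p = np.array([1.0, 2.0])

    X = np.column_stack([np.ones_like(x), x])   # rows (1, x_i)
    e = f - X @ p

    eps_sum  = np.sum(e**2)                               # sum of squared errors
    eps_quad = f @ f - 2 * p @ X.T @ f + p @ X.T @ X @ p  # expanded matrix form
    print(np.isclose(eps_sum, eps_quad))                  # True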

Again we can calculate the parameter values, collected in the vector \(\v p = (p_1\;p_2\cdots p_n)\T\), by differentiating \(\epsilon(\v p)\) with respect to all \(p_i\) and solving for the parameter values that make all derivatives equal to zero. It is of course possible to expand the vectorial error equation in its elements and carry out the differentiation explicitly (see vectorial differentiation).

It is easier to make use of the following vectorial derivatives. First let \(f\) be a scalar function of \(n\) parameters \(\v v[1],\ldots,\v v[n]\); then we define \(\D_{\v v} f\) as the vector of all partial derivatives:

\[\begin{split}\frac{\D f}{\D \v v} \equiv \D_{\v v} f = \matvec{c}{ \frac{\D f}{\D \v v[1]}\\ \frac{\D f}{\D \v v[2]}\\ \vdots\\ \frac{\D f}{\D \v v[n]} }\end{split}\]

The following ‘vector derivative’ expressions are easy to prove by calculating the derivatives separately for all components:

\[\begin{split}&\D_{\v v}(\v v\T \v w) = \D_{\v v}(\v w\T\v v) = \v w&\\ &\D_{\v v}(\v v\T \v A \v v) = (\v A+\v A\T) \v v&\end{split}\]
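These identities can also be checked numerically; the sketch below (NumPy, with random \(\v v\), \(\v w\) and \(\v A\), and a small helper num_grad introduced only for this check) compares a central finite-difference gradient with the closed-form expressions:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    v = rng.normal(size=n)
    w = rng.normal(size=n)
    A = rng.normal(size=(n, n))

    def num_grad(f, v, h=1e-6):
        """Central finite-difference approximation of the gradient of f at v."""
        g = np.zeros_like(v)
        for k in range(len(v)):
            dv = np.zeros_like(v)
            dv[k] = h
            g[k] = (f(v + dv) - f(v - dv)) / (2 * h)
        return g

    print(np.allclose(num_grad(lambda u: u @ w, v), w))                  # d(v'w)/dv = w
    print(np.allclose(num_grad(lambda u: u @ A @ u, v), (A + A.T) @ v))  # d(v'Av)/dv = (A+A')v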

With these derivatives it is straightforward to show that:

\[\D_{\v p}\epsilon = -2 \v X\T \v f + 2 \v X\T\v X\v p\]

Solving \(\D_{\v p} \epsilon = \v 0\) we arrive at:

\[\v X\T\v X \v p = \v X\T \v f\]

In case \(\v X\T \v X\) is non-singular we get:

\[\v p = \left(\v X\T\v X\right)\inv \v X\T \v f\]

If you read your linear algebra notes again, you will undoubtedly find a remark somewhere that it is in general a bad idea to compute a matrix inverse when the goal is to solve a system of linear equations. But that is exactly what we are doing here... Your linear algebra teacher was right nevertheless: in practice you should use a dedicated method for solving a system of linear equations rather than computing the matrix inverse explicitly.
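In NumPy terms the difference looks as follows; the sketch uses made-up straight-line data and contrasts the explicit inverse with np.linalg.solve applied to the normal equations and with np.linalg.lstsq working on \(\v X\) directly:

    import numpy as np

    # Made-up straight-line data
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    f = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

    # Design matrix X with rows (1, x_i)
    X = np.column_stack([np.ones_like(x), x])

    # (1) Explicit inverse: works, but numerically the least advisable option
    p_inv = np.linalg.inv(X.T @ X) @ X.T @ f

    # (2) Solve the normal equations X'X p = X'f without forming the inverse
    p_solve = np.linalg.solve(X.T @ X, X.T @ f)

    # (3) Let a dedicated least squares routine work on X directly
    p_lstsq, *_ = np.linalg.lstsq(X, f, rcond=None)

    print(p_inv, p_solve, p_lstsq)   # all three agree up to rounding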

Examples of least squares estimators

In this section we use the expressions from the previous section to tackle some least squares estimation problems.

2nd-order polynomial

Consider the observations \(\{x_i,f_i\}\) where the assumed functional relation between \(x\) and \(f(x)\) is given by a second order polynomial:

\[f(x) = p_1 + p_2 x + p_3 x^2\]

The error vector in this case is:

\[\begin{split}\v e = \v f - \v X \v p = \matvec{c}{f_1 \\ f_2 \\ \vdots \\ f_N }- \matvec{ccc}{1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots\\ 1 & x_N & x_N^2} \matvec{c}{p_1 \\ p_2 \\ p_3 }\end{split}\]

The above scheme for a second order polynomial can be extended to work for a polynomial function of any order.
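A minimal sketch of such a general polynomial fit (NumPy, with a hypothetical helper fit_polynomial and made-up data) builds the design matrix with columns \(1, x, x^2, \ldots\) and hands it to a least squares solver:

    import numpy as np

    def fit_polynomial(x, f, order):
        """Least squares fit of p_1 + p_2 x + ... + p_{order+1} x^order to the data."""
        # Design matrix with columns 1, x, x^2, ..., x^order (lowest power first)
        X = np.vander(x, order + 1, increasing=True)
        p, *_ = np.linalg.lstsq(X, f, rcond=None)
        return p

    # Made-up data roughly following 1 + 2x + 0.5 x^2
    x = np.linspace(-2.0, 2.0, 11)
    f = 1 + 2 * x + 0.5 * x**2 + 0.05 * np.sin(7 * x)   # small deterministic "noise"
    print(fit_polynomial(x, f, 2))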

Multivariate polynomials

The facet model used in image processing and computer vision is an example of a least squares estimator of a multivariate polynomial function.

Consider the discrete image \(F\), only known through its samples \(F(i,j)\) for \(i=1,\ldots,M\) and \(j=1,\ldots,N\). In many situations we need to estimate the image function values in between the sample points, say at a point \((x_0,y_0)\). What is then done is to first come up with a function \(f(x,y)\), defined on the continuous plane, that is consistent with \(F\), and then take \(f(x_0,y_0)\) as the value we are looking for. The function \(f\) can either be defined as a function that interpolates \(F\), i.e. \(f(i,j)=F(i,j)\) for all \(i\) and \(j\), or as a function that approximates the samples. Such an approximation is what we are after here.

Consider the point \((i_0,j_0)\) and define \(F_0(i,j)=F(i_0+i,\,j_0+j)\), i.e. we shift the origin to the point of interest. Our task is now to approximate \(F_0\) with a polynomial function

\[f_0(x,y) = p_1 + p_2 x + p_3 y + p_4 x^2 + p_5 x y + p_6 y^2.\]

Such a simple function evidently cannot hope to approximate the image data over the entire domain. We restrict the region in which to approximate \(F_0\) to a small neighborhood of the origin (say a \(3\times3\) neighborhood). For one point \((i,j)\) with image value \(F_0(i,j)\) we have:

\[\begin{split}F_0(i,j) - e(i,j) = \matvec{cccccc}{1 & i & j & i^2 & i j & j^2} \matvec{c}{p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6}\end{split}\]

In case we have \(K\) sample points in the considered neighborhood we can stack up the rows to obtain a \(K\times 6\) matrix \(\v X\). Likewise we obtain the image value vector \(\v F_0\), the error vector \(\v e\) and the parameter vector \(\v p\), leading to:

\[\v F_0 - \v e = \v X \v p\]

and the least squares estimator:

\[\v p^\star = (\v X\T\v X)\inv \v X\T \v F_0\]

For the 9 points \((i,j)\) in the \(3\times3\) neighborhood centered around the origin we obtain:

\[\begin{split}\v X = \matvec{cccccc}{ % 1 i j ii ij jj 1 & -1 & -1 & 1 & 1 & 1 \\ 1 & 0 & -1 & 0 & 0 & 1 \\ 1 & 1 & -1 & 1 & -1 & 1 \\ 1 & -1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & -1 & 1 & 1 & -1 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ } \qquad \v F_0 = \matvec{c}{F_0(-1,-1)\\ F_0(0,-1)\\ F_0(1,-1)\\ F_0(-1,0)\\ F_0(0,0)\\ F_0(1,0)\\ F_0(-1,1)\\ F_0(0,1)\\ F_0(1,1)\\ }\end{split}\]

leading to

\[\begin{split}(\v X\T\v X)\inv \v X\T = \frac{1}{36}\matvec{ccccccccc}{ -4 & 8 & -4 & 8 & 20 & 8 & -4 & 8 & -4 \\ -6 & 0 & 6 & -6 & 0 & 6 & -6 & 0 & 6 \\ -6 & -6 & -6 & 0 & 0 & 0 & 6 & 6 & 6 \\ 6 & -12 & 6 & 6 & -12 & 6 & 6 & -12 & 6 \\ 9 & 0 & -9 & 0 & 0 & 0 & -9 & 0 & 9 \\ 6 & 6 & 6 & -12 & -12 & -12 & 6 & 6 & 6 }\end{split}\]
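This matrix can be reproduced numerically; the sketch below (NumPy) builds \(\v X\) row by row in the same point ordering as above and prints \(36\,(\v X\T\v X)\inv \v X\T\), which should show the integer entries listed:

    import numpy as np

    # Rows (1, i, j, i^2, ij, j^2) for the 3x3 neighborhood,
    # in the same order as the vector F_0 above (j slowest, i fastest)
    points = [(i, j) for j in (-1, 0, 1) for i in (-1, 0, 1)]
    X = np.array([[1, i, j, i * i, i * j, j * j] for (i, j) in points], dtype=float)

    pinv = np.linalg.inv(X.T @ X) @ X.T
    print(np.round(36 * pinv).astype(int))   # reproduces the matrix above (times 36)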

The first row of the matrix \((\v X\T\v X)\inv \v X\T\) gives the weights used to calculate the zeroth order coefficient \(p_1\) of the 2nd order polynomial that approximates the image function:

\[\begin{split}p_1 = \frac{1}{36}(-4 F_0(-1,-1) + 8 F_0(0,-1) -4 F_0(1,-1) \\ + 8 F_0(-1,0) + 20 F_0(0,0) + 8 F_0(1,0) \\ -4 F_0(-1,1) + 8 F_0(0,1) -4 F_0(1,1) )\end{split}\]

The weights can be arranged in a spatial layout as:

\[\begin{split}\frac{1}{36} \left\{ \begin{array}{ccc} -4 & 8 & -4\\ 8 & 20 & 8\\ -4 & 8 & -4 \end{array} \right\}\end{split}\]

The above template or kernel is interpreted as follows: multiply the image values at the corresponding positions with the kernel values and sum all resulting values; this gives the value of \(p_1\).
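Applied at every pixel this amounts to a correlation of the image with the kernel; below is a sketch assuming NumPy and SciPy's ndimage module are available, with a made-up test image:

    import numpy as np
    from scipy import ndimage

    # The p_1 kernel derived above
    kernel_p1 = np.array([[-4,  8, -4],
                          [ 8, 20,  8],
                          [-4,  8, -4]]) / 36.0

    # Made-up test image
    F = np.random.default_rng(0).normal(size=(64, 64))

    # At every pixel: multiply the 3x3 neighborhood with the kernel and sum,
    # giving the zeroth order facet coefficient p_1 at each position
    # (border handling via 'nearest' is an arbitrary choice here)
    P1 = ndimage.correlate(F, kernel_p1, mode='nearest')
    print(P1.shape)   # same shape as F

Because this particular kernel is symmetric, correlation and convolution give the same result here.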

The same analysis can be made for the other coefficients \(p_i\) in the polynomial approximation of the image function.