Least Squares Estimators
========================

This section provides a practical introduction to least squares estimation
procedures in a linear algebra framework that is hopefully general enough to
be useful in practical applications.

.. RVDB: many Python/NumPy examples should (again) be added here.

Fitting a straight line
-----------------------

The classical example of a least squares estimator is fitting a straight line
$f(x)=p_1 + p_2 x$ to a set of $N$ measurements $\{x_i, f_i\}$ for
$i=1,\ldots,N$. In case all points lie exactly on the straight line we have:

.. math:: f_i = p_1 + p_2 x_i

for all $i=1,\ldots,N$. For data obtained through measurements there will be
some unavoidable deviation from this ideal situation; instead we have:

.. math:: f_i - p_1 - p_2 x_i = e_i

where $e_i$ is some (hopefully) small deviation (error) from the model. The
goal of the least squares (LSQ) estimation procedure now is to find the values
of the parameters $p_1$ and $p_2$ such that the sum of the squared errors is
minimal. The total squared error is:

.. math::

   \epsilon(p_1,p_2) = \sum_{i=1}^N e_i^2
   = \sum_{i=1}^N ( f_i - p_1 - p_2 x_i )^2

.. sidebar:: **Extrema**

   A function $f$ of one variable $x$ can have a local maximum or minimum
   (both are called *extrema*) at a point $x_0$ only if $f'(x_0)=0$. Whether
   it is a maximum or a minimum depends on the sign of the second order
   derivative at that same point: for $f''(x_0)<0$ we have a local maximum and
   for $f''(x_0)>0$ we have a local minimum at $x_0$.

   .. RVDB: FIGURE OF A 1D EXAMPLE HERE

   For a function $f$ of several variables $x_1,\ldots,x_n$ a *necessary
   condition* for an extremum is that all the partial derivatives of $f$ with
   respect to $x_i$ (for $i=1,\ldots,n$) are equal to zero:

   .. math::

      \frac{\partial f}{\partial x_1} =
      \frac{\partial f}{\partial x_2} = \cdots =
      \frac{\partial f}{\partial x_n} = 0

   Any point where all derivatives are zero is called a *stationary* or
   *critical* point. A stationary point need not be an extremum. To check
   whether a stationary point is a local maximum, a local minimum or a
   *saddle point* (in some directions the function has a maximum whereas in
   other directions it has a minimum) we need to look at the eigenvalues of
   the Hessian matrix:

   .. math::

      H(f) = \begin{bmatrix}
      \dfrac{\partial^2 f}{\partial x_1^2} &
      \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots &
      \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex]
      \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} &
      \dfrac{\partial^2 f}{\partial x_2^2} & \cdots &
      \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex]
      \vdots & \vdots & \ddots & \vdots \\[2.2ex]
      \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} &
      \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots &
      \dfrac{\partial^2 f}{\partial x_n^2}
      \end{bmatrix}

   In case all eigenvalues are negative we have a local maximum and in case
   they are all positive we have a local minimum. In case there are both
   positive and negative eigenvalues the stationary point is a saddle point.

   .. RVDB: FIGURE OF A 2D EXAMPLE WITH MAX, MIN AND SADDLE HERE
      (the peaks function from MATLAB?)

We are now looking for those values of the parameters $p_1$ and $p_2$ that
make the total squared error minimal. A necessary condition for the function
$\epsilon$ in the two parameters $(p_1,p_2)$ to have an extreme value (either
a local maximum or a local minimum) is that the derivatives of $\epsilon$ with
respect to both parameters are zero:

.. math::

   \frac{\partial}{\partial p_1}\, \epsilon =
   \frac{\partial}{\partial p_2}\, \epsilon = 0

Or equivalently:

.. math::

   - 2\sum_{i=1}^N ( f_i - p_1 - p_2 x_i ) &=& 0\\
   - 2\sum_{i=1}^N x_i ( f_i - p_1 - p_2 x_i ) &=& 0

This can be rewritten as:

.. math::

   p_1 \sum_{i=1}^N 1 + p_2 \sum_{i=1}^N x_i &=& \sum_{i=1}^N f_i \\
   p_1 \sum_{i=1}^N x_i + p_2 \sum_{i=1}^N x_i^2 &=& \sum_{i=1}^N x_i f_i

In a matrix vector notation we can write:

.. math::

   \matvec{cc}{ \sum_{i=1}^N 1 & \sum_{i=1}^N x_i \\
                \sum_{i=1}^N x_i & \sum_{i=1}^N x_i^2 }
   \matvec{c}{p_1 \\ p_2} =
   \matvec{c}{ \sum_{i=1}^N f_i\\ \sum_{i=1}^N x_i f_i }

The optimal parameters are thus found as:

.. math::

   \matvec{c}{p_1\\ p_2}
   &=& \matvec{cc}{ \sum_{i=1}^N 1 & \sum_{i=1}^N x_i \\
                    \sum_{i=1}^N x_i & \sum_{i=1}^N x_i^2 }\inv
       \matvec{c}{ \sum_{i=1}^N f_i\\ \sum_{i=1}^N x_i f_i }\\
   &=& \frac{1}{ -\left(\sum_{i=1}^N x_i \right)^2 + N \sum_{i=1}^N x_i^2 }
       \matvec{cc}{ \sum_{i=1}^N x_i^2 & - \sum_{i=1}^N x_i\\
                    - \sum_{i=1}^N x_i & N }
       \matvec{c}{ \sum_{i=1}^N f_i\\ \sum_{i=1}^N x_i f_i }

.. a Python example program + figure should be added here
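A minimal NumPy sketch of this estimator is given below. The data are
synthetic (a noisy straight line, chosen only for illustration), the
$2\times2$ system is exactly the one derived above, and the result is
cross-checked against NumPy's own ``np.polyfit``.

.. code-block:: python

   import numpy as np

   # Synthetic, purely illustrative data: noisy samples of f(x) = 1 + 2x.
   rng = np.random.default_rng(0)
   x = np.linspace(0.0, 10.0, 50)
   f = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
   N = x.size

   # The 2x2 system derived above:
   #   [[ N        sum(x)   ]   [p1]   [ sum(f)   ]
   #    [ sum(x)   sum(x^2) ]] @ [p2] = [ sum(x*f) ]
   A = np.array([[N,       x.sum()],
                 [x.sum(), (x ** 2).sum()]])
   b = np.array([f.sum(), (x * f).sum()])

   p1, p2 = np.linalg.solve(A, b)   # solve the system, no explicit inverse
   print("estimated p1, p2:", p1, p2)

   # Cross-check: np.polyfit returns the coefficients highest degree first.
   print("np.polyfit      :", np.polyfit(x, f, 1)[::-1])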
Now consider the problem of estimating the second order polynomial
$f(x)=p_1 + p_2 x + p_3 x^2$ given a set of observations $\{x_i, f_i\}$. It is
not too difficult to go through the entire process again of calculating the
total squared error, differentiating with respect to the parameters, setting
all derivatives equal to zero and solving the resulting set of equations. It
is however a tedious and error-prone exercise that we would like to
generalize to all sorts of least squares estimation problems. That is what we
will do in the next section.

Least Squares Estimators
------------------------

Let us look again at the problem of fitting a straight line. For each of the
observations $\{x_i, f_i\}$ the deviation from the model equals

.. math:: e_i = f_i - p_1 - p_2 x_i

or equivalently:

.. math:: p_1 + p_2 x_i = f_i - e_i .

In a matrix vector equation we write:

.. math:: \matvec{cc}{1 & x_i}\matvec{c}{p_1\\ p_2} = f_i - e_i

Note that we have such an equation for each $i=1,\ldots,N$ and that we may
combine all these observations in one matrix vector equation:

.. math::

   \matvec{cc}{1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_N}
   \matvec{c}{p_1\\ p_2} =
   \matvec{c}{f_1\\ f_2 \\ \vdots \\ f_N} -
   \matvec{c}{e_1\\ e_2 \\ \vdots \\ e_N}

Let us denote the matrices and vectors in the above equation with
$\v X, \v p, \v f$ and $\v v$ respectively:

.. math:: \v X \v p = \v f - \v v

or equivalently:

.. math:: \v v = \v f - \v X\v p

The sum of the squared errors is equal to:

.. math:: \epsilon = \sum_{i=1}^N e_i^2 = \| \v v \|^2 = \v v\T\v v

i.e.

.. math::

   \epsilon = \v v\T\v v = (\v f-\v X\v p)\T(\v f-\v X\v p)
   = \v f\T\v f - 2\v p\T\v X\T\v f + \v p\T\v X\T\v X\v p

Again we can calculate the parameter values, collected in the vector
$\v p = (p_1\;p_2\cdots p_n)\T$, by differentiating $\epsilon(\v p)$ with
respect to all $p_i$ and solving for the parameter values that make all
derivatives equal to zero. It is of course possible to expand the vectorial
error equation in its elements and perform the derivatives explicitly (see
:doc:`vectorial differentiation`). It is easier to make use of the following
*vectorial derivatives*. First let $f$ be a scalar function in $N$ parameters
$\v v[1],\ldots,\v v[N]$; then we define $\D_{\v v} f$ as the vector of all
partial derivatives:

.. math::

   \frac{\D f}{\D \v v} \equiv \D_{\v v} f =
   \matvec{c}{ \frac{\D f}{\D \v v[1]}\\ \frac{\D f}{\D \v v[2]}\\
   \vdots\\ \frac{\D f}{\D \v v[N]} }

The following *vector derivative* expressions are easy to prove

.. math::

   &\D_{\v v}(\v v\T \v w) = \D_{\v v}(\v w\T\v v) = \v w&\\
   &\D_{\v v}(\v v\T \v A \v v) = (\v A+\v A\T) \v v&

by calculating the derivatives separately for all components. With these
derivatives it is straightforward to show that:

.. math:: \D_{\v p}\epsilon = -2 \v X\T \v f + 2 \v X\T\v X\v p

Solving $\D_{\v p} \epsilon = \v 0$ we arrive at the *normal equations*:

.. math:: \v X\T\v X \v p = \v X\T \v f

In case $\v X\T \v X$ is non-singular we get:

.. math:: \v p = \left(\v X\T\v X\right)\inv \v X\T \v f

.. a Python example program should be added here

If you read the notes from your linear algebra class again, you will
undoubtedly find a remark somewhere that it is, in general, a bad idea to
compute a matrix inverse when the goal is to solve a system of linear
equations. But that is exactly what we are doing here... Your linear algebra
teacher was right nevertheless: in practice you should use a method that
solves the system of linear equations directly, instead of computing the
matrix inverse explicitly.
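Below is a sketch of this generic recipe in NumPy, again on synthetic
straight-line data: the normal equations are solved with a linear solver, and
``np.linalg.lstsq`` is shown as the numerically preferable alternative that
never forms $\v X\T\v X$ explicitly.

.. code-block:: python

   import numpy as np

   # Synthetic data for illustration: the straight-line model once more,
   # but now written in the generic form  X p = f - v.
   rng = np.random.default_rng(1)
   x = np.linspace(0.0, 10.0, 50)
   f = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

   # Design matrix with one row (1, x_i) per observation.
   X = np.column_stack([np.ones_like(x), x])

   # Solve the normal equations  X^T X p = X^T f  with a linear solver
   # (no explicit matrix inverse).
   p_normal = np.linalg.solve(X.T @ X, X.T @ f)

   # Numerically preferable: solve the least squares problem directly.
   p_lstsq, *_ = np.linalg.lstsq(X, f, rcond=None)

   print(p_normal)   # approximately [1, 2]
   print(p_lstsq)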
Examples of least squares estimators
------------------------------------

In this section we use the expressions from the previous section to tackle
some least squares estimation problems.

2nd-order polynomial
....................

Consider the observations $\{x_i,f_i\}$ where the assumed functional relation
between $x$ and $f(x)$ is given by a second order polynomial:

.. math:: f(x) = p_1 + p_2 x + p_3 x^2

The error vector in this case is:

.. math::

   \v e = \v f - \v X \v p =
   \matvec{c}{f_1 \\ f_2 \\ \vdots \\ f_N } -
   \matvec{ccc}{1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\\
                \vdots & \vdots & \vdots\\ 1 & x_N & x_N^2}
   \matvec{c}{p_1 \\ p_2 \\ p_3 }

The above scheme for a second order polynomial can be extended to work for a
polynomial function of any order.
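A brief NumPy sketch of this fit is given below (synthetic data again, chosen
only to illustrate the mechanics); the only change with respect to the
straight-line fit is the extra column $x_i^2$ in the design matrix.

.. code-block:: python

   import numpy as np

   # Synthetic data: noisy samples of f(x) = 0.5 - 1.0 x + 0.25 x^2.
   rng = np.random.default_rng(2)
   x = np.linspace(-3.0, 3.0, 40)
   f = 0.5 - 1.0 * x + 0.25 * x ** 2 + rng.normal(scale=0.2, size=x.size)

   # Design matrix with rows (1, x_i, x_i^2).
   X = np.column_stack([np.ones_like(x), x, x ** 2])

   # Least squares estimate of (p1, p2, p3).
   p, *_ = np.linalg.lstsq(X, f, rcond=None)
   print(p)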
Multivariate polynomials
........................

The facet model used in image processing and computer vision is an example of
a least squares estimator of a multivariate polynomial function. Consider the
discrete image $F$, known only through its samples $F(i,j)$ for
$i=1,\ldots,M$ and $j=1,\ldots,N$. In many situations we need to estimate the
image function values *in between* the sample points, say at the point
$(x_0,y_0)$. What is then done is the following: first come up with a
function $f(x,y)$, defined on the continuous plane, that is consistent with
$F$, and then take $f(x_0,y_0)$ as the value we were looking for.

The function $f$ can either be defined as a function that *interpolates* $F$,
i.e. $f(i,j)=F(i,j)$ (for all $i$ and $j$), or it can be defined as a
function that approximates the samples. Such an approximation is what we are
after here.

Consider the point $(i_0,j_0)$ and define $F_0(i,j)=F(i-i_0,j-j_0)$, i.e. we
shift the origin to the point of interest. The task we now set ourselves is
to approximate $F_0$ with a polynomial function

.. math:: f_0(x,y) = p_1 + p_2 x + p_3 y + p_4 x^2 + p_5 x y + p_6 y^2.

Such a simple function evidently cannot hope to approximate the image data
over the entire domain. We therefore restrict the region in which to
approximate $F_0$ to a small neighborhood of the origin (say a $3\times3$
neighborhood). For one point $(i,j)$ with image value $F_0(i,j)$ we have:

.. math::

   F_0(i,j) - e(i,j) = \matvec{cccccc}{1 & i & j & i^2 & i j & j^2}
   \matvec{c}{p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6}

In case we have $K$ sample points in the considered neighborhood we can stack
up the rows and obtain an $\v X$ matrix. Equivalently we obtain the image
value vector $\v F_0$, error vector $\v e$ and parameter vector $\v p$,
leading to:

.. math:: \v F_0 - \v e = \v X \v p

and the least squares estimator:

.. math:: \v p^\star = (\v X\T\v X)\inv \v X\T \v F_0

For the 9 points $(i,j)$ in the $3\times3$ neighborhood centered around the
origin we obtain:

.. math::

   \v X = \matvec{cccccc}{
   % 1    i    j    ii   ij   jj
   1 & -1 & -1 &  1 &  1 & 1 \\
   1 &  0 & -1 &  0 &  0 & 1 \\
   1 &  1 & -1 &  1 & -1 & 1 \\
   1 & -1 &  0 &  1 &  0 & 0 \\
   1 &  0 &  0 &  0 &  0 & 0 \\
   1 &  1 &  0 &  1 &  0 & 0 \\
   1 & -1 &  1 &  1 & -1 & 1 \\
   1 &  0 &  1 &  0 &  0 & 1 \\
   1 &  1 &  1 &  1 &  1 & 1 \\
   }
   \qquad
   \v F_0 = \matvec{c}{F_0(-1,-1)\\ F_0(0,-1)\\ F_0(1,-1)\\
   F_0(-1,0)\\ F_0(0,0)\\ F_0(1,0)\\
   F_0(-1,1)\\ F_0(0,1)\\ F_0(1,1)\\ }

leading to

.. math::

   (\v X\T\v X)\inv \v X\T = \frac{1}{36}\matvec{ccccccccc}{
   -4 &   8 & -4 &   8 &  20 &   8 & -4 &   8 & -4 \\
   -6 &   0 &  6 &  -6 &   0 &   6 & -6 &   0 &  6 \\
   -6 &  -6 & -6 &   0 &   0 &   0 &  6 &   6 &  6 \\
    6 & -12 &  6 &   6 & -12 &   6 &  6 & -12 &  6 \\
    9 &   0 & -9 &   0 &   0 &   0 & -9 &   0 &  9 \\
    6 &   6 &  6 & -12 & -12 & -12 &  6 &   6 &  6 }

The first row of the above matrix gives the weights to calculate the zero
order coefficient of the 2nd order polynomial that approximates the image
function:

.. math::

   p_1 = \frac{1}{36}( -4 F_0(-1,-1) + 8 F_0(0,-1) - 4 F_0(1,-1) \\
   + 8 F_0(-1,0) + 20 F_0(0,0) + 8 F_0(1,0) \\
   - 4 F_0(-1,1) + 8 F_0(0,1) - 4 F_0(1,1) )

The weights can be arranged in a spatial layout as:

.. math::

   \frac{1}{36} \left\{ \begin{array}{ccc}
   -4 &  8 & -4\\
    8 & 20 &  8\\
   -4 &  8 & -4 \end{array} \right\}

The above *template* or *kernel* can be interpreted as follows: multiply the
image values at the corresponding positions with the kernel values and sum
all the resulting values to obtain the value of $p_1$. The same analysis can
be made for the other coefficients $p_i$ in the polynomial approximation of
the image function.
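The matrix above is easy to verify numerically. The following NumPy sketch
builds $\v X$ for the nine neighborhood points, in the same order as the
vector $\v F_0$, and reshapes the first row of $(\v X\T\v X)\inv \v X\T$ into
the $3\times3$ kernel shown above; the other rows can be checked in the same
way.

.. code-block:: python

   import numpy as np

   # The 9 neighborhood points (i, j), in the same order as the vector F_0.
   i = np.array([-1, 0, 1, -1, 0, 1, -1, 0, 1])
   j = np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1])

   # Design matrix with rows (1, i, j, i^2, i*j, j^2).
   X = np.column_stack([np.ones(9), i, j, i ** 2, i * j, j ** 2])

   # The matrix that maps the 9 image values onto the 6 polynomial
   # coefficients; the explicit inverse mirrors the formula in the text.
   M = np.linalg.inv(X.T @ X) @ X.T

   # First row, times 36 and reshaped to 3x3, reproduces the kernel
   # [[-4, 8, -4], [8, 20, 8], [-4, 8, -4]].
   print(np.round(36 * M[0]).reshape(3, 3))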