Correlation Tracking
====================

$\DeclareMathOperator{\ssd}{ssd}$
$\DeclareMathOperator{\nssd}{nssd}$
$\DeclareMathOperator{\ncc}{ncc}$
$\newcommand{\one}{\mathbb{I}}$
$\DeclareMathOperator{\nom}{nom}$
$\DeclareMathOperator{\den}{den}$

.. sidebar:: New version

   A :doc:`new version of this page <correlation_tracking>` is written
   (hopefully without the error). Please use that new page for
   learning.


Assume that we are watching a video and at time $t_0$ we are given a
rectangular patch positioned at $\v x_0$. This rectangular patch is
called the template or mask. Let the set $\set M$ be the set of all
vectors $\v y$ such that $\v x_0 + \v y$ is a point in the template
centered at point $\v x_0$. The image values in the template (mask)
are denoted as $m(\v y)$ for all $\v y\in M$. We assume that there are
$M$ elements in $\set M$.

We can compare the mask $m$ at every position $\v x$ in the image at
time $t>t_0$ to see where it fits best. An obvious measure to choose
is the average sum of squared distances:

.. math::
   \ssd(\v x) = \frac{1}{M} \sum_{\v y \in \set M} \left( f(\v
   x + \v y) - m(\v y) \right)^2

The template fits best at position $\v x^\star$ where $\ssd$ is minimal:

.. math::
   \v x^\star = \arg\min_{\v x} \ssd(\v x)

To make the notation somewhat simpler we will only consider the
comparison of the mask with the image at position $\v x=\v 0$ and we
write:

.. math::
   \ssd = \ssd(\v 0) = \frac{1}{M} \sum_{\v y \in \set M} \left( f(\v y) - m(\v y) \right)^2

The problem with this simple difference measure is that changes in
illumination may change the measure enormously. Therefore a normalized
measure is used. Instead of comparing $f$ with $m$ we will compare $\hat f$ and $\hat m$:

.. math::
   \hat f(\v y) = \frac{f(\v y) - \bar{f}}{s_f}, \quad
   \hat m(\v y) = \frac{m(\v y) - \bar{m}}{m_f}

where 

.. math::
   \bar{f} = \frac{1}{M} \sum_{\v y \in \set M} f(\v y), \quad
   \bar{m} = \frac{1}{M} \sum_{\v y \in \set M} m(\v y)

and

.. math::
   s_f = \sqrt{ \frac{1}{M} \sum_{\v y \in \set M} \left( f(\v y)-\bar f\right)^2 }, \quad
   m_f = \sqrt{ \frac{1}{M} \sum_{\v y \in \set M} \left( m(\v y)-\bar m\right)^2 }

i.e. we normalize by subtracting the mean (in the neighborhood
defined by the mask) and dividing by the standard deviation (again in
the neighborhood defined by the mask).

Note that the normalization of the mask has to be done only once but
normalization of $f$ will depend on the origin $\v x$ we have chosen
for the mask. Then both $\bar f$ and $s_f$ will become functions of
$\v x$. But for now we continue with $\v x = \v 0$.

Using the sum of squared distances for the normalized functions we
arrive at:

.. math::
   \nssd &= \frac{1}{M} \sum_{\v y \in \set M} \left( \hat f(\v y) - \hat m(\v y) \right)^2 \\
         &= \frac{1}{M}\sum_{\v y \in \set M} {\hat f}^2(\v y)
           - \frac{2}{M} \sum_{\v y \in \set M} \hat f(\v y) \hat m(\v y)
	   + \frac{1}{M} \sum_{\v y \in \set M} {\hat m}^2(\v y)


We leave it as a :doc:`proof <proof>` to the reader to show that the first and last
term in the above expression for $\nssd$ are equal to 1, leading to:

.. math::
   \nssd = 2 \left(1 - \frac{1}{M} \sum_{\v y \in \set M} \hat f(\v y) \hat m(\v y)\right)


Minimizing the $\nssd$ thus amounts to maximizing the sum of products
in the above expression. This term is known as the **normalized cross correlation**:

.. math::
   \ncc &= \frac{1}{M} \sum_{\v y \in \set M} \hat f(\v y) \hat m(\v y) \\
        &= \frac{1}{M}\frac{ \sum_{\v y \in \set M} (f(\v y)-\bar f)(m(\v y) - \bar m) }{
	          \sqrt{ \frac{1}{M} \sum_{\v y \in \set M} (f(\v y)-\bar f)^2 }
                  \sqrt{ \frac{1}{M}\sum_{\v y \in \set M} (m(\v y)-\bar m)^2 }}


Using the Cauchy-Schwarz inequality it is easy to prove that
$|ncc|<1$. Normalization did not only solves the problem of
illumination changes but it also results in a measure range (from -1
to +1) that is independent of the image or template content.

Returning to the situation where we have to calculate the $\ncc$ value
for each translation of the mask (not just $\v x=\v 0$) we get:


.. math::
   \ncc(\v x) = \frac{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))(m(\v y) - \bar m) }{
   \sqrt{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))^2 } \sqrt{ \sum_{\v y \in \set M} (m(\v y)-\bar m)^2 }}


This seems like a daunting expression to calculate for each $\v x$
position in the image. Fortunately we can rewrite the expression as a
rational function based on convolutions. First observe that the part
containing the mask function $m$ does not contain $\v x$, i.e. $\hat
m$ is not dependent on $\v x$. So we reintroduce $\hat m$ in the above
expression and obtain:

.. math::
   \ncc(\v x) &= \frac{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))\hat m(\v y) }{
   \sqrt{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))^2 } } \\
   &= \frac{ \sum_{\v y \in \set M} f(\v x + \v y)\hat m(\v y) - \bar f(\v x) \sum_{\v y \in \set M} \hat m(\v y) }{
   \sqrt{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))^2 } } \\
   &= \frac{ \sum_{\v y \in \set M} f(\v x + \v y)\hat m(\v y) }{
   \sqrt{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))^2 } } \\
   &= \frac{ \sum_{\v y \in \set M} f(\v x + \v y)\hat m(\v y) }{
   \sqrt{ \sum_{\v y \in \set M} f^2(\v x + \v y) - 
   2 \bar f(\v x) \sum_{\v y \in \set M} f(\v x + \v y) + \sum_{\v y \in \set M} \bar f^2(\v x) } } \\
   &= \frac{ \sum_{\v y \in \set M} f(\v x + \v y)\hat m(\v y) }{
   \sqrt{ M\bar{f^2}(\v x) - M {\bar f(\v x)}^2 }}

Observe that:

.. math::
   \bar{f^2} = \frac{1}{M} (f^2 \ast [\set M]^\star), \quad \bar{f} = \frac{1}{M} (f \ast [\set M]^\star)
   
   
Using this the function $\ncc$ can be rewritten as:

.. math::
   \ncc = \frac{f \ast \hat m^\star}{
   \sqrt{f^2\ast[\set M]^\star - \frac{1}{M}(f\ast[\set M]^\star)^2 }}

.. this should be:
   \ncc = \frac{f \ast \hat m^\star}{
   \sqrt{ M\,f^2\ast[\set M]^\star - (f\ast[\set M]^\star)^2 }}

   
where $[\set M]$ is the indicator function of the set $\set M$ (using
the Iverson bracket notation). In case $\set M$ is an axis aligned
rectangle the uniform correlations with $[\set M]$ can be done
fast using an integral image. The correlation with $\hat m$ can
be implemented using the FFT or the normalized mask can be
approximated as a weighted summation of indicator functions.

So the recipe for tracking based on normalized cross correlation is:




1. in the first frame where the object is visible select a region
   $\set M$ to track in subsequent frames. This gives the mask
   function $m$ and mask set $\set M$.

2. Normalize the mask, i.e. subtract the mean and divide by the
   standard deviation:

   .. math::
      \hat m(\v x) = \frac{m(\v x) - \bar m}{s_m}


3. Let $f$ be the image where to search for the mask. Calculate the
   following convolutions:

   .. math::
      \nom &= f \ast \hat m^\star \\
      \den_1 &= f^2 \ast [\set M]^\star \\
      \den_2 &= f \ast [\set M]^\star \\

4. Calculate the ncc image:

   .. math::
      ncc = \frac{\nom}{\sqrt{\den_1-\den_2^2/M}}
      

5. Find the position $\v x^\star$ where $\ncc$ is maximal

   .. math::
      \v x^\star = \arg\max_{\v x} \ncc(\v x)


The actual code for tracking based on normalized cross correlation is
not much more complicated then shown in these steps. Seasoned
programmers immediately spot two possible sources of problems:

- The square root function requires a positive argument. Due to
  numerical round off errors this need not be true in general.

- A division by zero (or a very small number is problematic). A simple
  way to deal with this is to set the result of division to zero in
  case the denonimator is really small. Another, easier, way to deal
  with this is to add a very small value to the denominator.

In skimage the normalized cross correlation is available as the
function ``match_template``. In one of the lab exercises you are asked
to use this function to track a ball.