Correlation Tracking

\(\DeclareMathOperator{\ssd}{ssd}\) \(\DeclareMathOperator{\nssd}{nssd}\) \(\DeclareMathOperator{\ncc}{ncc}\) \(\newcommand{\one}{\mathbb{I}}\) \(\DeclareMathOperator{\nom}{nom}\) \(\DeclareMathOperator{\den}{den}\)

Assume that we are watching a video and at time \(t_0\) we are given a rectangular patch positioned at \(\v x_0\). This rectangular patch is called the template or mask. Let the set \(\set M\) be the set of all vectors \(\v y\) such that \(\v x_0 + \v y\) is a point in the template centered at point \(\v x_0\). The image values in the template (mask) are denoted as \(m(\v y)\) for all \(\v y\in M\). We assume that there are \(M\) elements in \(\set M\).

We can compare the mask \(m\) at every position \(\v x\) in the image at time \(t>t_0\) to see where it fits best. An obvious measure to choose is the average sum of squared distances:

\[\ssd(\v x) = \frac{1}{M} \sum_{\v y \in \set M} \left( f(\v x + \v y) - m(\v y) \right)^2\]

The template fits best at position \(\v x^\star\) where \(\ssd\) is minimal:

\[\v x^\star = \arg\min_{\v x} \ssd(\v x)\]

To make the notation somewhat simpler we will only consider the comparison of the mask with the image at position \(\v x=\v 0\) and we write:

\[\ssd = \ssd(\v 0) = \frac{1}{M} \sum_{\v y \in \set M} \left( f(\v y) - m(\v y) \right)^2\]

The problem with this simple difference measure is that changes in illumination may change the measure enormously. Therefore a normalized measure is used. Instead of comparing \(f\) with \(m\) we will compare \(\hat f\) and \(\hat m\):

\[\hat f(\v y) = \frac{f(\v y) - \bar{f}}{s_f}, \quad \hat m(\v y) = \frac{m(\v y) - \bar{m}}{m_f}\]

where

\[\bar{f} = \frac{1}{M} \sum_{\v y \in \set M} f(\v y), \quad \bar{m} = \frac{1}{M} \sum_{\v y \in \set M} m(\v y)\]

and

\[s_f = \sqrt{ \frac{1}{M} \sum_{\v y \in \set M} \left( f(\v y)-\bar f\right)^2 }, \quad m_f = \sqrt{ \frac{1}{M} \sum_{\v y \in \set M} \left( m(\v y)-\bar m\right)^2 }\]

i.e. we normalize by subtracting the mean (in the neighborhood defined by the mask) and dividing by the standard deviation (again in the neighborhood defined by the mask).

Note that the normalization of the mask has to be done only once but normalization of \(f\) will depend on the origin \(\v x\) we have chosen for the mask. Then both \(\bar f\) and \(s_f\) will become functions of \(\v x\). But for now we continue with \(\v x = \v 0\).

Using the sum of squared distances for the normalized functions we arrive at:

\[\begin{split}\nssd &= \frac{1}{M} \sum_{\v y \in \set M} \left( \hat f(\v y) - \hat m(\v y) \right)^2 \\ &= \frac{1}{M}\sum_{\v y \in \set M} {\hat f}^2(\v y) - \frac{2}{M} \sum_{\v y \in \set M} \hat f(\v y) \hat m(\v y) + \frac{1}{M} \sum_{\v y \in \set M} {\hat m}^2(\v y)\end{split}\]

We leave it as a proof to the reader to show that the first and last term in the above expression for \(\nssd\) are equal to 1, leading to:

\[\nssd = 2 \left(1 - \frac{1}{M} \sum_{\v y \in \set M} \hat f(\v y) \hat m(\v y)\right)\]

Minimizing the \(\nssd\) thus amounts to maximizing the sum of products in the above expression. This term is known as the normalized cross correlation:

\[\begin{split}\ncc &= \frac{1}{M} \sum_{\v y \in \set M} \hat f(\v y) \hat m(\v y) \\ &= \frac{1}{M}\frac{ \sum_{\v y \in \set M} (f(\v y)-\bar f)(m(\v y) - \bar m) }{ \sqrt{ \frac{1}{M} \sum_{\v y \in \set M} (f(\v y)-\bar f)^2 } \sqrt{ \frac{1}{M}\sum_{\v y \in \set M} (m(\v y)-\bar m)^2 }}\end{split}\]

Using the Cauchy-Schwarz inequality it is easy to prove that \(|ncc|<1\). Normalization did not only solves the problem of illumination changes but it also results in a measure range (from -1 to +1) that is independent of the image or template content.

Returning to the situation where we have to calculate the \(\ncc\) value for each translation of the mask (not just \(\v x=\v 0\)) we get:

\[\ncc(\v x) = \frac{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))(m(\v y) - \bar m) }{ \sqrt{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))^2 } \sqrt{ \sum_{\v y \in \set M} (m(\v y)-\bar m)^2 }}\]

This seems like a daunting expression to calculate for each \(\v x\) position in the image. Fortunately we can rewrite the expression as a rational function based on convolutions. First observe that the part containing the mask function \(m\) does not contain \(\v x\), i.e. \(\hat m\) is not dependent on \(\v x\). So we reintroduce \(\hat m\) in the above expression and obtain:

\[\begin{split}\ncc(\v x) &= \frac{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))\hat m(\v y) }{ \sqrt{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))^2 } } \\ &= \frac{ \sum_{\v y \in \set M} f(\v x + \v y)\hat m(\v y) - \bar f(\v x) \sum_{\v y \in \set M} \hat m(\v y) }{ \sqrt{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))^2 } } \\ &= \frac{ \sum_{\v y \in \set M} f(\v x + \v y)\hat m(\v y) }{ \sqrt{ \sum_{\v y \in \set M} (f(\v x + \v y)-\bar f(\v x))^2 } } \\ &= \frac{ \sum_{\v y \in \set M} f(\v x + \v y)\hat m(\v y) }{ \sqrt{ \sum_{\v y \in \set M} f^2(\v x + \v y) - 2 \bar f(\v x) \sum_{\v y \in \set M} f(\v x + \v y) + \sum_{\v y \in \set M} \bar f^2(\v x) } } \\ &= \frac{ \sum_{\v y \in \set M} f(\v x + \v y)\hat m(\v y) }{ \sqrt{ M\bar{f^2}(\v x) - M {\bar f(\v x)}^2 }}\end{split}\]

Observe that:

\[\bar{f^2} = \frac{1}{M} (f^2 \ast [\set M]^\star), \quad \bar{f} = \frac{1}{M} (f \ast [\set M]^\star)\]

Using this the function \(\ncc\) can be rewritten as:

\[\ncc = \frac{f \ast \hat m^\star}{ \sqrt{f^2\ast[\set M]^\star - \frac{1}{M}(f\ast[\set M]^\star)^2 }}\]

where \([\set M]\) is the indicator function of the set \(\set M\) (using the Iverson bracket notation). In case \(\set M\) is an axis aligned rectangle the uniform correlations with \([\set M]\) can be done fast using an integral image. The correlation with \(\hat m\) can be implemented using the FFT or the normalized mask can be approximated as a weighted summation of indicator functions.

So the recipe for tracking based on normalized cross correlation is:

  1. in the first frame where the object is visible select a region \(\set M\) to track in subsequent frames. This gives the mask function \(m\) and mask set \(\set M\).

  2. Normalize the mask, i.e. subtract the mean and divide by the standard deviation:

    \[\hat m(\v x) = \frac{m(\v x) - \bar m}{s_m}\]
  3. Let \(f\) be the image where to search for the mask. Calculate the following convolutions:

    \[\begin{split}\nom &= f \ast \hat m^\star \\ \den_1 &= f^2 \ast [\set M]^\star \\ \den_2 &= f \ast [\set M]^\star \\\end{split}\]
  4. Calculate the ncc image:

    \[ncc = \frac{\nom}{\sqrt{\den_1-\den_2^2/M}}\]
  5. Find the position \(\v x^\star\) where \(\ncc\) is maximal

    \[\v x^\star = \arg\max_{\v x} \ncc(\v x)\]

The actual code for tracking based on normalized cross correlation is not much more complicated then shown in these steps. Seasoned programmers immediately spot two possible sources of problems:

  • The square root function requires a positive argument. Due to numerical round off errors this need not be true in general.

  • A division by zero (or a very small number is problematic). A simple way to deal with this is to set the result of division to zero in case the denonimator is really small. Another, easier, way to deal with this is to add a very small value to the denominator.

In skimage the normalized cross correlation is available as the function match_template. In one of the lab exercises you are asked to use this function to track a ball.