3.2.1. Correlation Tracking

\(\DeclareMathOperator{\ssd}{ssd}\) \(\DeclareMathOperator{\nssd}{nssd}\) \(\DeclareMathOperator{\ncc}{ncc}\) \(\newcommand{\one}{\mathbb{I}}\) \(\DeclareMathOperator{\nom}{nom}\) \(\DeclareMathOperator{\den}{den}\) \(\newcommand{\wh}[1]{\widehat #1}\)

Assume that we are watching a video and at time \(t_0\) we are given a rectangular patch positioned at \(\v x_0\). This rectangular patch is called the template or mask. Let \(\set M\) be the set of all vectors \(\v y\) such that \(\v x + \v y\) is a point in the template centered at point \(\v x\). The image values in the template (mask) are denoted as \(m(\v y)\) for all \(\v y\in\set M\). We assume that there are \(M\) elements in \(\set M\). Note that the elements of \(\set M\) are positions in space (the pixels) and that \([\set M]\) is the indicator function of the set \(\set M\).

../../../_images/motion.svg

Fig. 3.11 Correlation Matching. On the left a sketch of a frame (image) at time \(t_0\). The orange ball is nicely within the set \(\set M\) with its origin in \(\v x^\star(t_0)\). The detail of the image \(f\) within the shape of \(\set M\) is used as a mask (template) function \(m\). In the image at time \(t>t_0\) we look for the ball. We shift the mask to every position \(\v x\) in the image and compare the image within that shifted mask with the mask function. At \(\v x^\star(t)\) the difference between mask and image is smallest.

We can compare the mask \(m\) at every position \(\v x\) in the image at time \(t>t_0\) to see where it fits best. An obvious measure to choose is the (mean) sum of squared differences:

(3.1)\[\ssd(\v x, t) = \frac{1}{M} \sum_{\v y \in \set M} \left( f(\v x + \v y, t) - m(\v y) \right)^2\]

The template fits best at position \(\v x^\star\) where \(\ssd\) is minimal:

(3.2)\[\v x^\star(t) = \arg\min_{\v x} \ssd(\v x, t)\]

With the scipy.ndimage.generic_filter function it is easy to implement the calculation of the (mean) sum of squared differences. You are asked to do just that in one of the exercises.
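A minimal sketch of such an implementation, assuming a made-up random image and a template cut out of it (in the exercise you will work with real frames):

```python
import numpy as np
from scipy.ndimage import generic_filter

rng = np.random.default_rng(0)
f = rng.random((32, 32))        # made-up grey value image (stand-in for a frame)
m = f[10:15, 12:17].copy()      # 5x5 template cut from the frame, center at (12, 14)
mflat = m.ravel()

def ssd_window(window):
    # window contains the raveled image values under the shifted mask
    return np.mean((window - mflat) ** 2)

ssd = generic_filter(f, ssd_window, size=m.shape, mode='nearest')
i, j = np.unravel_index(np.argmin(ssd), ssd.shape)
# (i, j) is the position where the template fits best: the place it was cut from
```

Note that `generic_filter` calls the Python function once per pixel, which is convenient but slow; the next section shows how to do much better.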

But we can do better than that. We can implement the \(\ssd\) computation using convolutions/correlations. Expanding the quadratic term we get

(3.3)\[\ssd(\v x, t) = \frac{1}{M} \sum_{\v y \in \set M} f^2(\v x + \v y, t) - 2 \frac{1}{M} \sum_{\v y \in \set M} f(\v x + \v y, t) m(\v y) + \frac{1}{M} \sum_{\v y \in \set M} m^2(\v y)\]

In the second sum we immediately recognize a correlation in case we extend the definition of \(m\) to be zero outside of \(\set M\) (i.e. for \(\v y\not\in\set M\) we set \(m(\v y)=0\)). Then we have:

\[2 \frac{1}{M} \sum_{\v y \in \set M} f(\v x + \v y, t) m(\v y) = 2 \frac{1}{M} \sum_{\v y} f(\v x + \v y, t) m(\v y) = \frac{2}{M}\left( f \star_c m \right)(\v x)\]

The third term in (3.3) is easy. It does not depend on \(\v x\) and we are left with a constant:

\[\frac{1}{M} \sum_{\v y \in \set M} m^2(\v y) = \overline{m^2}\]

the mean of the squared values \(m^2(\v y)\) for all \(\v y \in \set M\).

The first term is also a correlation. To see that we use the Iverson symbol (see sidebar).

\[\begin{split}\frac{1}{M} \sum_{\v y \in \set M} f^2(\v x + \v y, t) &= \frac{1}{M} \sum_{\v y} f^2(\v x + \v y, t) [\set M](\v y) \\ &= \frac{1}{M} \left( f^2 \star_c [\set M]\right)(\v x)\end{split}\]

or equivalently:

\[\frac{1}{M} \sum_{\v y \in \set M} f^2(\v x + \v y, t) = \left( f^2 \star_c \frac{1}{M} [\set M]\right)(\v x)\]

In this last correlation you should recognize a uniform filter (which can be implemented in constant time per pixel irrespective of the size of \(\set M\)).

Using all these expressions for the terms in the \(\ssd\) equation we arrive at

\[\ssd(\v x, t) = \left(f^2 \star_c \frac{1}{M}[\set M]\right)(\v x) - \frac{2}{M}\left( f\star_c m \right)(\v x) + \overline{m^2}\]

The uniform correlation is very efficient to compute. The correlation with \(m\) is of course less efficient (there are ways to speed things up, such as implementing the correlation in the Fourier domain, but those are outside the scope of these lecture notes).
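The three terms translate directly into scipy.ndimage calls; a sketch with made-up data (the names `term1` etc. are just illustrative):

```python
import numpy as np
from scipy.ndimage import correlate, uniform_filter

rng = np.random.default_rng(1)
f = rng.random((32, 32))            # made-up grey value image
m = f[10:15, 12:17].copy()          # 5x5 template, center at (12, 14)
M = m.size

term1 = uniform_filter(f * f, size=m.shape, mode='nearest')  # f^2 correlated with (1/M)[M]
term2 = (2.0 / M) * correlate(f, m, mode='nearest')          # (2/M)(f correlated with m)
term3 = np.mean(m * m)                                       # the constant: mean of m^2

ssd = term1 - term2 + term3
i, j = np.unravel_index(np.argmin(ssd), ssd.shape)           # best match position
```

At any interior pixel this agrees with the direct (per-window) computation, while the uniform term costs the same regardless of the template size.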

A disadvantage of the \(\ssd\) tracking method is that it is not invariant under (affine) changes of the grey values in the image sequence. After the transformation \(f\rightarrow\alpha f + \beta\) we have:

\[f\rightarrow\alpha f + \beta:\qquad \ssd(\v x, t)\rightarrow \alpha^2 \ssd(\v x, t)\]

assuming that the mask \(m\) is transformed in the same way. If the image intensities change after the mask was obtained from a frame, the differences may get even larger.

To mitigate the influence of changing lighting conditions we will use the normalized versions of \(f\) and \(m\). Normalization of the mask function \(m\) giving \(\hat m\) is straightforward:

\[\wh m(\v y) = \frac{m(\v y) - \overline{m}}{s_m}\]

where

\[\overline{m} = \frac{1}{M} \sum_{\v y \in\set M} m(\v y)\]

is the mean value of the mask and

\[s_m^2 = \frac{1}{M} \sum_{\v y \in\set M} \left(m(\v y)-\overline{m}\right)^2\]

is the variance of the values in the mask.
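In code (numpy, made-up mask data) this normalization is a one-liner; note that numpy's default standard deviation matches the population definition used here:

```python
import numpy as np

def normalize_mask(m):
    # subtract the mean, divide by the (population) standard deviation
    return (m - m.mean()) / m.std()

rng = np.random.default_rng(2)
m = rng.random((5, 5))          # made-up mask values
mhat = normalize_mask(m)
# by construction mhat has zero mean and unit variance
```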

The normalization of the function \(f\) is a bit more difficult. The mean and variance are calculated over all positions \(\v x + \v y\) for \(\v y\in\set M\). Thus the mean and variance are dependent on the position \(\v x\):

\[\overline f(\v x) = \frac{1}{M}\sum_{\v y \in\set M} f(\v x + \v y)\]

and

\[s_f^2(\v x) = \frac{1}{M}\sum_{\v y \in\set M} \left( f(\v x + \v y) - \overline f (\v x)\right)^2\]

As the normalization is dependent on the position \(\v x\) we will write \(\hat f(\v x, \v y)\) and then:

\[\wh f(\v x, \v y) = \frac{f(\v x + \v y) - \overline f(\v x)}{s_f(\v x)}\]

This all leads to the following expression for the normalized (mean) sum of squared differences:

\[\nssd(\v x) = \frac{1}{M} \sum_{\v y \in \set M} \left( \wh f(\v x, \v y) - \wh m(\v y) \right)^2\]

Expanding the square of the difference leads to

\[\nssd(\v x) = \underbrace{\frac{1}{M}\sum_{\v y\in\set M} \wh f^2(\v x, \v y)}_{(1)} - \underbrace{\frac{2}{M} \sum_{\v y\in\set M} \wh f(\v x, \v y)\wh m(\v y)}_{(2)} + \underbrace{\frac{1}{M}\sum_{\v y\in\set M} \wh m^2(\v y)}_{(3)}\]

We leave it as an exercise to the reader to prove that the first and third term in the above equation both are equal to one. Thus:

\[\nssd(\v x) = 2\left( 1 - \frac{1}{M} \sum_{\v y\in\set M} \wh f(\v x, \v y)\wh m(\v y)\right)\]

Note that minimizing the \(\nssd\) thus amounts to maximizing the sum in the equation above, i.e. maximizing the normalized cross correlation:

\[\ncc(\v x) = \frac{1}{M} \sum_{\v y\in\set M} \wh f(\v x, \v y)\wh m(\v y)\]

Substituting the expression for \(\wh f(\v x, \v y)\) we get

\[\begin{split}\ncc(\v x) &= \frac{1}{M s_f(\v x)} \sum_{\v y \in \set M} \left( f(\v x + \v y) - \overline{f}(\v x) \right) \wh m(\v y)\\ &= \frac{1}{M s_f(\v x)}\left( \sum_{\v y \in \set M} f(\v x + \v y)\wh m(\v y) - \overline{f}(\v x) \sum_{\v y \in \set M} \wh m(\v y) \right)\end{split}\]

Evidently the sum of \(\wh m\) is zero (can you explain and prove this?) and thus

\[\ncc(\v x) = \frac{\sum_{\v y \in \set M} f(\v x + \v y)\wh m(\v y)}{M s_f(\v x)}\]

In the numerator we recognize the correlation \((f\star_c\wh m)(\v x)\). For the denominator we have to rewrite the expression for the standard deviation \(s_f(\v x)\). To avoid the square root for now we look at the variance:

\[s_f^2(\v x) = \frac{1}{M}\sum_{\v y\in\set M} \left(f(\v x+\v y) - \overline{f}(\v x)\right)^2\]

Expanding the quadratic term we get:

\[s_f^2(\v x) = \frac{1}{M}\sum_{\v y\in\set M} f^2(\v x + \v y) - \frac{2}{M} \overline f(\v x) \sum_{\v y\in\set M} f(\v x + \v y) + \frac{1}{M} \sum_{\v y\in\set M} \overline f^2(\v x)\]

We leave it as an exercise to prove that this leads to

\[s_f^2(\v x) = \frac{1}{M}\sum_{\v y\in\set M} f^2(\v x + \v y) - \overline f^2(\v x)\]

where the first term is a correlation \(f^2\star_c \frac{1}{M}[\set M]\) and the second term is a correlation squared leading to

\[s_f^2 = f^2\star_c \frac{1}{M}[\set M] - \left(f\star_c \frac{1}{M}[\set M]\right)^2\]
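This identity can be checked numerically with two uniform filters; a sketch assuming made-up data:

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(3)
f = rng.random((32, 32))                                    # made-up image
size = (5, 5)                                               # shape of the set M

mean_f = uniform_filter(f, size=size, mode='nearest')       # f correlated with (1/M)[M]
mean_f2 = uniform_filter(f * f, size=size, mode='nearest')  # f^2 correlated with (1/M)[M]
var_f = mean_f2 - mean_f ** 2                               # s_f^2 at every position
```

At an interior pixel `var_f` equals the variance computed directly on the window centered there.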

Substituting the square root into the expression for the \(\ncc\) we finally arrive at:

\[\ncc = \frac{f\star_c\wh m}{M \sqrt{f^2\star_c \frac{1}{M}[\set M] - \left(f\star_c \frac{1}{M}[\set M]\right)^2}}\]

We have thus replaced a position dependent local operator with a nonlinear combination of three simple correlation operators, two of which are uniform correlations and thus can be implemented very efficiently. Note that in scipy.ndimage there is no uniform_correlation; instead you can use the uniform_filter (after, of course, mirroring the uniform kernel in case the origin is not in the center…)
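Putting it all together, a sketch of the complete \(\ncc\) computation (the function name `ncc_track` and the small `eps` guard against division by zero in flat image regions are my additions, not part of the derivation):

```python
import numpy as np
from scipy.ndimage import correlate, uniform_filter

def ncc_track(f, m, eps=1e-12):
    # normalized cross correlation of image f with template m,
    # built from the three correlations derived above
    M = m.size
    mhat = (m - m.mean()) / m.std()                          # normalized template
    num = correlate(f, mhat, mode='nearest')                 # f correlated with mhat
    mean_f = uniform_filter(f, size=m.shape, mode='nearest')
    mean_f2 = uniform_filter(f * f, size=m.shape, mode='nearest')
    s_f = np.sqrt(np.maximum(mean_f2 - mean_f ** 2, 0.0))    # s_f at every position
    return num / (M * s_f + eps)                             # eps guards flat regions

rng = np.random.default_rng(4)
f = rng.random((32, 32))
m = f[10:15, 12:17].copy()                                   # template, center (12, 14)
score = ncc_track(f, m)
i, j = np.unravel_index(np.argmax(score), score.shape)       # best match position
```

By the Cauchy-Schwarz inequality the score is bounded by one, and it reaches (nearly) one exactly where the template was cut out, regardless of any affine grey value change applied to \(f\).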