3.2.1. Correlation Tracking
\(\DeclareMathOperator{\ssd}{ssd}\) \(\DeclareMathOperator{\nssd}{nssd}\) \(\DeclareMathOperator{\ncc}{ncc}\) \(\newcommand{\one}{\mathbb{I}}\) \(\DeclareMathOperator{\nom}{nom}\) \(\DeclareMathOperator{\den}{den}\) \(\newcommand{\wh}[1]{\widehat #1}\)
Assume that we are watching a video and at time \(t_0\) we are given a rectangular patch positioned at \(\v x_0\). This rectangular patch is called the template or mask. Let the set \(\set M\) be the set of all vectors \(\v y\) such that \(\v x + \v y\) is a point in the template centered at point \(\v x\). The image values in the template (mask) are denoted as \(m(\v y)\) for all \(\v y\in\set M\). We assume that there are \(M\) elements in \(\set M\). Note that the elements of \(\set M\) are positions in space (the pixels) and that \([\set M]\) is the indicator function of the set \(\set M\).
We can compare the mask \(m\) at every position \(\v x\) in the image at time \(t>t_0\) to see where it fits best. An obvious measure to choose is the mean sum of squared differences:
\[\ssd(\v x) = \frac{1}{M}\sum_{\v y\in\set M}\left(f(\v x+\v y)-m(\v y)\right)^2\]
The template fits best at the position \(\v x^\star\) where \(\ssd\) is minimal:
\[\v x^\star = \arg\min_{\v x}\,\ssd(\v x)\]
With the scipy.ndimage.generic_filter function it is easy to implement the calculation of the (mean) sum of squared differences. You are asked to do just that in one of the exercises.
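A minimal sketch of how such an implementation could look (the helper name `ssd_map` and the toy image are our own, not part of the exercise):

```python
import numpy as np
from scipy.ndimage import generic_filter

def ssd_map(f, m):
    """Mean sum of squared differences between image f and template m
    at every position (borders use scipy's default 'reflect' mode)."""
    m_flat = m.ravel()

    def mean_ssd(window):
        # generic_filter hands us the window as a flat array, in the
        # same (row-major) order as m.ravel()
        return np.mean((window - m_flat) ** 2)

    return generic_filter(f.astype(float), mean_ssd, size=m.shape)

# toy example: a template cut from the image must match perfectly
rng = np.random.default_rng(0)
f = rng.random((32, 32))
m = f[10:15, 12:17].copy()   # 5x5 template centered at (12, 14)
ssd = ssd_map(f, m)
```

Note that this does \(O(M)\) work per pixel; the point of the derivation that follows is to do better.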
But we can do better than that. We can implement the \(\ssd\) filter using convolutions/correlations. Expanding the quadratic term we get
\[\ssd(\v x) = \frac{1}{M}\sum_{\v y\in\set M} f^2(\v x+\v y) - \frac{2}{M}\sum_{\v y\in\set M} f(\v x+\v y)\,m(\v y) + \frac{1}{M}\sum_{\v y\in\set M} m^2(\v y)\]
In the second sum we immediately recognize a correlation in case we extend the definition of \(m\) to be zero outside of \(\set M\) (i.e. for \(\v y\not\in\set M\) we set \(m(\v y)=0\)). Then we have:
\[\frac{1}{M}\sum_{\v y\in\set M} f(\v x+\v y)\,m(\v y) = \frac{1}{M}\,(f\star_c m)(\v x)\]
The third term in the expanded expression is easy: it does not depend on \(\v x\), so we are left with a constant, the mean of the squared values \(m^2(\v y)\) over all \(\v y \in \set M\).
The first term is also a correlation. To see that we use the Iverson symbol (see sidebar):
\[\frac{1}{M}\sum_{\v y\in\set M} f^2(\v x+\v y) = \sum_{\v y} \frac{1}{M}\,[\v y\in\set M]\; f^2(\v x+\v y)\]
or equivalently:
\[\frac{1}{M}\sum_{\v y\in\set M} f^2(\v x+\v y) = \left(f^2 \star_c \frac{1}{M}[\set M]\right)(\v x)\]
In this last correlation you should recognize a uniform filter (which can be implemented in constant time per pixel irrespective of the size of \(\set M\)).
Using all these expressions for the terms in the \(\ssd\) equation we arrive at
\[\ssd(\v x) = \left(f^2\star_c \frac{1}{M}[\set M]\right)(\v x) - \frac{2}{M}\,(f\star_c m)(\v x) + \frac{1}{M}\sum_{\v y\in\set M} m^2(\v y)\]
The uniform correlation is very efficient to compute. The correlation with \(m\) is less efficient of course (although there are ways to speed things up like implementing the correlation in the Fourier domain, that are outside the scope of these lecture notes).
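A sketch of this correlation-based version, assuming a template with odd sizes so that its origin is the center pixel (function and variable names are ours):

```python
import numpy as np
from scipy.ndimage import correlate, uniform_filter

def ssd_map_fast(f, m):
    """Mean SSD written as (f^2 * (1/M)[M]) - (2/M)(f * m) + mean(m^2),
    where * denotes correlation. Assumes an odd-sized template."""
    f = f.astype(float)
    m = m.astype(float)
    term1 = uniform_filter(f ** 2, size=m.shape)   # (f^2 star_c (1/M)[M])(x)
    term2 = (2.0 / m.size) * correlate(f, m)       # (2/M)(f star_c m)(x)
    term3 = np.mean(m ** 2)                        # constant, independent of x
    return term1 - term2 + term3

rng = np.random.default_rng(0)
f = rng.random((32, 32))
m = f[10:15, 12:17].copy()   # 5x5 template centered at (12, 14)
ssd_fast = ssd_map_fast(f, m)
```

In the interior of the image this agrees with the direct per-pixel computation, while the uniform term costs only constant time per pixel.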
A disadvantage of the \(\ssd\) tracking method is that it is not invariant under (affine) changes in the image sequence. After the transformation \(f\rightarrow\alpha f + \beta\) we have:
\[\ssd(\v x) \rightarrow \alpha^2\,\ssd(\v x)\]
assuming that the mask \(m\) is transformed in the same way. In case the image changes after the mask was obtained from a frame, the differences may get even larger.
To mitigate the influence of changing lighting conditions we will use the normalized versions of \(f\) and \(m\). Normalization of the mask function \(m\), giving \(\wh m\), is straightforward:
\[\wh m(\v y) = \frac{m(\v y) - \mu_m}{\sigma_m}\]
where
\[\mu_m = \frac{1}{M}\sum_{\v y\in\set M} m(\v y)\]
is the mean value of the mask and
\[\sigma_m^2 = \frac{1}{M}\sum_{\v y\in\set M} \left(m(\v y)-\mu_m\right)^2\]
is the variance of the values in the mask.
The normalization of the function \(f\) is a bit more involved. The mean and variance are calculated over all positions \(\v x + \v y\) for \(\v y\in\set M\). Thus the mean and variance are dependent on the position \(\v x\):
\[\mu_f(\v x) = \frac{1}{M}\sum_{\v y\in\set M} f(\v x+\v y)\]
and
\[\sigma_f^2(\v x) = \frac{1}{M}\sum_{\v y\in\set M}\left(f(\v x+\v y)-\mu_f(\v x)\right)^2\]
As the normalization is dependent on the position \(\v x\) we will write \(\wh f(\v x, \v y)\) and then:
\[\wh f(\v x, \v y) = \frac{f(\v x+\v y)-\mu_f(\v x)}{\sigma_f(\v x)}\]
This all leads to the following expression for the normalized (mean) sum of squared differences:
\[\nssd(\v x) = \frac{1}{M}\sum_{\v y\in\set M}\left(\wh f(\v x,\v y) - \wh m(\v y)\right)^2\]
Expanding the square of the difference leads to
\[\nssd(\v x) = \frac{1}{M}\sum_{\v y\in\set M}\wh f^2(\v x,\v y) - \frac{2}{M}\sum_{\v y\in\set M}\wh f(\v x,\v y)\,\wh m(\v y) + \frac{1}{M}\sum_{\v y\in\set M}\wh m^2(\v y)\]
We leave it as an exercise to the reader to prove that the first and third term in the above equation are both equal to one. Thus:
\[\nssd(\v x) = 2 - \frac{2}{M}\sum_{\v y\in\set M}\wh f(\v x,\v y)\,\wh m(\v y)\]
Note that minimizing the \(\nssd\) thus amounts to maximizing the sum in the equation above, i.e. maximizing the normalized cross correlation:
\[\ncc(\v x) = \frac{1}{M}\sum_{\v y\in\set M}\wh f(\v x,\v y)\,\wh m(\v y)\]
Substituting the expression for \(\wh f(\v x, \v y)\) we get
\[\ncc(\v x) = \frac{1}{M\,\sigma_f(\v x)}\sum_{\v y\in\set M}\left(f(\v x+\v y)-\mu_f(\v x)\right)\wh m(\v y)
= \frac{1}{M\,\sigma_f(\v x)}\left(\sum_{\v y\in\set M} f(\v x+\v y)\,\wh m(\v y) - \mu_f(\v x)\sum_{\v y\in\set M}\wh m(\v y)\right)\]
Evidently the sum of \(\wh m\) is zero (can you explain and prove this?) and thus
\[\ncc(\v x) = \frac{(f\star_c \wh m)(\v x)}{M\,\sigma_f(\v x)}\]
In the numerator we recognize the correlation \((f\star_c\wh m)(\v x)\). For the denominator we have to rewrite the expression for the standard deviation \(\sigma_f(\v x)\). To avoid the square root for now we look at the variance:
\[\sigma_f^2(\v x) = \frac{1}{M}\sum_{\v y\in\set M}\left(f(\v x+\v y)-\mu_f(\v x)\right)^2\]
Expanding the quadratic term we get:
\[\sigma_f^2(\v x) = \frac{1}{M}\sum_{\v y\in\set M} f^2(\v x+\v y) - \frac{2\,\mu_f(\v x)}{M}\sum_{\v y\in\set M} f(\v x+\v y) + \mu_f^2(\v x)\]
We leave it to an exercise to prove that this leads to
\[\sigma_f^2(\v x) = \frac{1}{M}\sum_{\v y\in\set M} f^2(\v x+\v y) - \mu_f^2(\v x)\]
where the first term is a correlation \(f^2\star_c \frac{1}{M}[\set M]\) and the second term is a squared correlation, leading to
\[\sigma_f^2(\v x) = \left(f^2\star_c \frac{1}{M}[\set M]\right)(\v x) - \left(\left(f\star_c \frac{1}{M}[\set M]\right)(\v x)\right)^2\]
Substituting the square root into the expression for the \(\ncc\) we finally arrive at:
\[\ncc(\v x) = \frac{(f\star_c \wh m)(\v x)}{M\sqrt{\left(f^2\star_c \frac{1}{M}[\set M]\right)(\v x) - \left(\left(f\star_c \frac{1}{M}[\set M]\right)(\v x)\right)^2}}\]
We have thus replaced a position dependent local operator with a nonlinear combination of three simple correlation operators, two of which are uniform correlations and thus can be implemented very efficiently.
Note that in scipy.ndimage there is no uniform_correlation; instead you can use the uniform_filter (after of course mirroring the uniform kernel in case the origin is not in the center…)
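Putting everything together, a sketch of an \(\ncc\) implementation along these lines (again assuming an odd-sized template so the origin is its center; the small variance floor guarding against constant windows is our own addition):

```python
import numpy as np
from scipy.ndimage import correlate, uniform_filter

def ncc_map(f, m):
    """Normalized cross correlation via three correlations:
    ncc(x) = (f star_c m_hat)(x) / (M * sigma_f(x))."""
    f = f.astype(float)
    m = m.astype(float)
    M = m.size
    m_hat = (m - m.mean()) / m.std()             # normalized mask, sums to zero
    num = correlate(f, m_hat)                    # (f star_c m_hat)(x)
    mean_f = uniform_filter(f, size=m.shape)     # (f star_c (1/M)[M])(x)
    var_f = uniform_filter(f ** 2, size=m.shape) - mean_f ** 2
    var_f = np.maximum(var_f, 1e-12)             # guard: rounding can push this below 0
    return num / (M * np.sqrt(var_f))

rng = np.random.default_rng(0)
f = rng.random((32, 32))
m = f[10:15, 12:17].copy()   # 5x5 template centered at (12, 14)
ncc = ncc_map(f, m)
```

By the Cauchy–Schwarz inequality the \(\ncc\) values lie in \([-1, 1]\), with the value \(1\) attained exactly where the window matches the (affinely transformed) mask.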