4.3. From fully connected network to convolutional network

In the Machine Learning course we considered fully connected neural networks. For the first processing layer in such a network, with \(m\) nodes in the input and \(n\) nodes in the second layer (i.e. the output of the first processing layer), we need \(n\times m\) weight parameters and \(n\) bias parameters.

For small images like those in the MNIST data set, using a fully connected neural network is quite feasible (as we have seen in one of the lab exercises of the Machine Learning course). But when we deal with substantially larger images, a fully connected network easily needs millions of parameters per processing layer.
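To get a feeling for the numbers, here is a small back-of-the-envelope calculation in Python (the helper fc_params and the chosen layer sizes are made up for this illustration):

```python
# Parameter count of one fully connected layer with m input nodes and
# n output nodes: n*m weights plus n biases.
def fc_params(m, n):
    return n * m + n

# MNIST sized input (28 x 28 = 784 pixels) mapped to 100 nodes:
print(fc_params(28 * 28, 100))      # 78500 parameters

# A modest 512 x 512 gray value image mapped to the same 100 nodes:
print(fc_params(512 * 512, 100))    # 26214500 parameters
```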

Let \(f\) be an input image and let \(g\) be the output image. In a fully connected neural network the linear combination part of one processing step (from layer to layer in the neural network) can be written as:

(4.2)\[g(\v x) = \sum_{\v y} f(\v y) w(\v x, \v y)\]

Carefully observe that whereas in machine learning it is customary to denote inputs and outputs with the variables \(\v x\) and \(\v y\), when working with images we reserve those symbols for the spatial coordinates (coordinate vectors). From a machine learning point of view \(\v x\) and \(\v y\) are simply indices into the data structures (\(f\), \(g\) and \(w\)). Writing \(g_i\) instead of \(g(\v x)\) and \(f_j\) instead of \(f(\v y)\) the above expression becomes:

\[g_i = \sum_j w_{ij} f_j\]

showing that \(w\) is indeed the equivalent of the weight matrix \(W\) that we have seen in the formulation of a fully connected neural network. The image \(g\) (interpreted as a vector \(\v g\) with elements \(g_i\)) is then given as

(4.3)\[\v g = W \v f\]
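In code this is nothing more than flattening the image into a vector and multiplying with the weight matrix. A minimal NumPy sketch (the image size and the random weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

f = rng.standard_normal((4, 4))      # a tiny 'image' f
W = rng.standard_normal((16, 16))    # dense weight matrix of one fully connected layer

f_vec = f.reshape(-1)                # vector representation of the image f
g_vec = W @ f_vec                    # g = W f
g = g_vec.reshape(4, 4)              # interpret the output vector as an image again
```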

where \(\v f\) is the vector representation of image \(f\). Remember from your machine learning class that if \(\partial\ell/\partial \v g\), the derivative of the loss of a neural network with respect to all elements in \(\v g\), is known, the derivative of the loss with respect to the input image \(f\) is given as:

\[\pfrac{\ell}{\v f} = W\T \pfrac{\ell}{\v g}\]

and the derivative of the loss with respect to the weight matrix \(W\) is given as:

\[\pfrac{\ell}{W} = \pfrac{\ell}{\v g} \v f\T\]
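Both backward formulas translate directly into NumPy as well (again a sketch, with random numbers standing in for a real loss gradient):

```python
import numpy as np

rng = np.random.default_rng(1)

f_vec = rng.standard_normal(16)      # flattened input image
W = rng.standard_normal((16, 16))    # weight matrix
g_vec = W @ f_vec                    # forward pass: g = W f

dl_dg = rng.standard_normal(16)      # assumed known: derivative of the loss w.r.t. g

dl_df = W.T @ dl_dg                  # dl/df = W^T dl/dg
dl_dW = np.outer(dl_dg, f_vec)       # dl/dW = (dl/dg) f^T

print(dl_df.shape, dl_dW.shape)      # (16,) (16, 16)
```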

Such a fully connected layer in a neural network (containing several of those layers) is bound to result in way too many parameters to be learned when the images are larger than the images in ‘toy problems’ like the classification of the digits in the MNIST dataset.

The use of a convolution/correlation architecture for a neural network taking images as input was pioneered by Fukushima (in the neocognitron) in the 1980s and later by LeCun (in LeNet) in the 1990s. The fully connected linear processing step is replaced by a correlation:

\[g = f \star_c w\]

where \(w\) is a (usually small) correlation kernel. Depicting this as a classical neural network we see that to calculate \(g(\v x)\) we only need to know the values \(f(\v x+\v y)\) for \(\v y\) in the support of the kernel function \(w\) centered at \(\v x\). The support of a function is the set of pixels where the function is not equal to zero.
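A direct translation of this correlation into code, assuming an odd sized kernel and zero padding outside the image domain (a sketch, not necessarily the implementation used elsewhere in these notes):

```python
import numpy as np

def correlate2d(f, w):
    """Correlation of image f with kernel w, zero padded; w must have odd sizes."""
    K, L = w.shape[0] // 2, w.shape[1] // 2
    fp = np.pad(f, ((K, K), (L, L)))               # zero padding at the borders
    g = np.zeros(f.shape)
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            # only the pixels in the support of w centered at (i, j) are used
            g[i, j] = np.sum(fp[i:i + 2 * K + 1, j:j + 2 * L + 1] * w)
    return g
```

For odd sized kernels this gives the same result as scipy.ndimage.correlate(f, w, mode='constant').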

Note that for the correlation equation above we can also construct a weight matrix \(W\) such that \(\v g = W \v f\). Not only is this weight matrix very sparse, but the weights that calculate \(g_i\) from the \(f_j\) are also shared for all \(i\).

As an example consider a \(4\times4\) image \(f\) and the correlation with kernel

\[\begin{split}w = \begin{Bmatrix} & a&\\ b& c& d\\ & e& \end{Bmatrix}\end{split}\]

Writing \(f\) in vector notation (row major form) we have:

(4.4)\[\begin{split}\v g = \pmatrix{g_{1}\\\vdots\\g_{16}} = W \pmatrix{f_{1}\\\vdots\\f_{16}} = W \v f\end{split}\]

where:

\(\newcommand{\z}{\,\cdot\,}\) \(\newcommand{\block}[3]{\begin{array}{cccc}#2& #3& \z& \z\\ #1& #2& #3& \z\\ \z& #1& #2& #3\\ \z& \z& #1& #2\end{array}}\)

(4.5)\[\begin{split}W &= \begin{pmatrix} \block{b}{c}{d}& \block{\z}{e}{\z}& \block{\z}{\z}{\z}& \block{\z}{\z}{\z}\\ \block{\z}{a}{\z}& \block{b}{c}{d}& \block{\z}{e}{\z}& \block{\z}{\z}{\z}\\ \block{\z}{\z}{\z}& \block{\z}{a}{\z}& \block{b}{c}{d}& \block{\z}{e}{\z}\\ \block{\z}{\z}{\z}& \block{\z}{\z}{\z}& \block{\z}{a}{\z}& \block{b}{c}{d} \end{pmatrix}\end{split}\]
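The matrix in Eq. (4.5) can also be built programmatically for any kernel and image size. The sketch below (the function correlation_matrix and the numerical values chosen for \(a\) to \(e\) are made up for this check) constructs \(W\) for zero padded correlation in row major order and compares it with scipy.ndimage.correlate:

```python
import numpy as np
from scipy.ndimage import correlate

def correlation_matrix(w, shape):
    """Matrix W such that (W @ f.flatten()).reshape(shape) equals the zero
    padded correlation of f with w (row major order, odd sized kernel w)."""
    M, N = shape
    K, L = w.shape[0] // 2, w.shape[1] // 2
    W = np.zeros((M * N, M * N))
    for i in range(M):
        for j in range(N):
            for r in range(-K, K + 1):
                for c in range(-L, L + 1):
                    if 0 <= i + r < M and 0 <= j + c < N:
                        W[i * N + j, (i + r) * N + (j + c)] = w[r + K, c + L]
    return W

# the kernel of the example, with arbitrary numbers for a, b, c, d and e
a, b, c, d, e = 1.0, 2.0, 3.0, 4.0, 5.0
w = np.array([[0, a, 0],
              [b, c, d],
              [0, e, 0]])

rng = np.random.default_rng(2)
f = rng.standard_normal((4, 4))

W = correlation_matrix(w, f.shape)
g_matrix = (W @ f.flatten()).reshape(f.shape)
g_kernel = correlate(f, w, mode='constant')
print(np.allclose(g_matrix, g_kernel))     # True
```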

The expression in Eq. (4.4) is just to convince you that convolutional neural networks used in vision tasks are really the kind of neural networks that we have seen before. Note that for the backward pass of a linear unit described with a matrix we also have a matrix multiplication, but now with the transposed matrix:

\[\pfrac{\ell}{\v f} = W\T \pfrac{\ell}{\v g}\]

If you transpose the matrix \(W\) as given for our simple correlation kernel \(w\):

\[\begin{split}w = \begin{Bmatrix} & a&\\ b& c& d\\ & e& \end{Bmatrix}\end{split}\]

you will see that \(W\T\) corresponds with the mirrored kernel \(w^\star\):

\[\begin{split}w^\star = \begin{Bmatrix} & e&\\ d& c& b\\ & a& \end{Bmatrix}\end{split}\]

So, representing the images as functions instead of vectors, we get:

\[\pfrac{\ell}{f} = \pfrac{\ell}{g} \star_c w^\star\]
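Continuing the sketch below Eq. (4.5) (reusing correlation_matrix, correlate, w, W and rng from there), both claims can be checked numerically: \(W\T\) equals the correlation matrix of the mirrored kernel, and multiplying with \(W\T\) is the same as correlating the loss gradient with \(w^\star\):

```python
w_star = w[::-1, ::-1]                       # the mirrored kernel

# W^T is exactly the correlation matrix belonging to the mirrored kernel
print(np.allclose(W.T, correlation_matrix(w_star, (4, 4))))            # True

# and so backpropagation through a correlation is itself a correlation
dl_dg = rng.standard_normal((4, 4))          # stand-in for the real loss gradient
dl_df = (W.T @ dl_dg.flatten()).reshape(4, 4)
print(np.allclose(dl_df, correlate(dl_dg, w_star, mode='constant')))   # True
```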

In a subsequent section we will prove this along a different route (more akin to how we derived the backpropagation rules for a fully connected module).