4.2. Recap Machine Learning and Neural Networks
It is assumed that you have some understanding of neural networks. In this section we give a brief recap and introduce the notions needed for the math behind convolutional neural networks. For a more comprehensive introduction to neural networks used for classification we refer to these lecture notes.
4.2.1. Supervised Machine Learning
A machine learning system transforms the input into the output using a fixed (hypothesis) function \(h_{\v p}\) that is parameterized by (a lot of) parameters collected in the vector \(\v p\):
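\[
\hat{\v y} = h_{\v p}(\v x)
\]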
As an example, consider a system to classify small images of handwritten digits (collected in the MNIST data set). All images are of size \(28\times28\); flattened into a vector \(\v x\) we have \(\v x\in\setR^{784}\). The output vector in this case is 10 dimensional, one scalar for each of the digits. The final classification is the index of the element with the largest value in the output vector \(\hat{\v y}\) (starting with index 0).
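For instance, with NumPy the final argmax step could look like this (the output values below are made up, just to illustrate):

```python
import numpy as np

def classify(y_hat):
    """Return the predicted digit: the index (0..9) of the largest output."""
    return int(np.argmax(y_hat))

# A made-up 10-dimensional network output for one image:
y_hat = np.array([0.01, 0.02, 0.05, 0.10, 0.02, 0.03, 0.60, 0.07, 0.05, 0.05])
print(classify(y_hat))   # prints 6
```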
The parameter vector \(\v p\) is learned from examples \((\v x\ls i, \v y\ls i)\) with \(i\in[1,m]\). The vector \(\v x\ls i\) is an example for which the required output \(\v y\ls i\)—the target value, or ground truth—is known. Learning the parameter vector \(\v p\) from these examples is called supervised learning and aims at finding the parameter vector that will minimize the total loss (error) made for all examples in the training set.
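\[
L(\v p) = \sum_{i=1}^{m} \ell\left(\hat{\v y}\ls i, \v y\ls i\right),
\qquad \hat{\v y}\ls i = h_{\v p}(\v x\ls i)
\]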
where \(\ell\) is the loss contribution for one example-target pair. There are several loss functions \(\ell\) that can be used in machine learning. The simplest one is the quadratic loss function:
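\[
\ell(\hat{\v y}, \v y) = \|\hat{\v y} - \v y\|^2 = \sum_j \left(\hat y_j - y_j\right)^2
\]

Sometimes a factor \(\tfrac{1}{2}\) is included to simplify the derivative; this does not change the minimizer.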
Machine learning in this context is as ‘simple’ as finding the vector \(\v p^\star\) that minimizes the total loss function:
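\[
\v p^\star = \arg\min_{\v p} L(\v p)
\]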
For almost all machine learning systems the hypothesis function and hence the loss function is so complex that we have to resort to a numerical optimization technique to find the optimal parameter vector \(\v p^\star\). The simplest of these techniques is an iterative gradient descent procedure:
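\[
\v p \leftarrow \v p - \alpha\, \frac{\partial L}{\partial \v p}
\]

where \(\alpha\) is the learning rate (step size).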
Most often a fixed number of iterations of this gradient descent procedure is done to approximate the required parameter vector \(\v p^\star\).
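In code this procedure is only a few lines. A minimal NumPy sketch, assuming a (hypothetical) function `grad_loss(p, X, Y)` that returns \(\partial L/\partial \v p\) for the training set:

```python
import numpy as np

def gradient_descent(p0, grad_loss, X, Y, alpha=0.01, n_iters=1000):
    """Plain gradient descent with a fixed number of iterations.

    p0        : initial parameter vector
    grad_loss : function returning the gradient of the total loss at p
    alpha     : learning rate (step size)
    """
    p = np.array(p0, dtype=float)
    for _ in range(n_iters):
        p = p - alpha * grad_loss(p, X, Y)   # the gradient descent update
    return p
```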
4.2.2. Fully Connected Neural Network
A fully connected neural network is a sequence of processing modules, each taking an input vector \(\v{a}_\text{in}\) and producing an output vector \(\v{a}_\text{out}\) according to the formula:
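\[
\v a_\text{out} = \eta\aew\left( W \v a_\text{in} + \v b \right)
\]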
where the bias vector \(\v b\) and the weight matrix \(W\) are the parameters of this processing module. The function \(\eta\) is the activation function. Note that the notation \(\eta\aew(\cdot)\) indicates that the function \(\eta\) is applied to all elements of its (vector/tensor/matrix) argument. Well-known activation functions are the sigmoid function and the ReLU function (which is often used in deep convolutional networks).
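As a minimal sketch (assuming NumPy; the function names are only illustrative), these two activation functions are:

```python
import numpy as np

def sigmoid(u):
    """Sigmoid activation, applied elementwise: 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def relu(u):
    """ReLU activation, applied elementwise: max(0, u)."""
    return np.maximum(0.0, u)
```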
An FC neural network is a sequence of these modules. The derivatives of the loss with respect to all parameters in the system (i.e. \(W\ls i\), \(\v b\ls i\) for \(i=1, 2, 3\)) can be calculated using the chain rule of differentiation. Starting at the end of the sequence, the derivatives are calculated in the backpropagation pass (see the machine learning course notes). The steps in the backpropagation pass for one module are indicated in purple.
For one module (see Fig. 4.16) the forward pass is given by Eq. (4.1) and for the backward pass (the backpropagation) we have:
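\[
\frac{\partial \ell}{\partial \v a_\text{in}}
= W^T \left( \frac{\partial \ell}{\partial \v a_\text{out}} \odot \eta'\aew(\v u + \v b) \right)
\]

with \(\v u = W \v a_\text{in}\) the output of the linear unit,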
where \(\odot\) is the elementwise multiplication of vectors or matrices. For the parameters \(W\) and \(\v b\) we have:
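\[
\frac{\partial \ell}{\partial W}
= \left( \frac{\partial \ell}{\partial \v a_\text{out}} \odot \eta'\aew(\v u + \v b) \right) \v a_\text{in}^T
\qquad\text{and}\qquad
\frac{\partial \ell}{\partial \v b}
= \frac{\partial \ell}{\partial \v a_\text{out}} \odot \eta'\aew(\v u + \v b)
\]

Note that \(\frac{\partial \ell}{\partial W}\) is an outer product of two vectors and thus has the same shape as \(W\), and \(\frac{\partial \ell}{\partial \v b}\) has the same shape as \(\v b\).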
Once we know what one processing block ‘does’ in both the forward and the backward pass, we can do forward and backward passes for an entire feedforward network of these modules. Note that in a sequence of two modules the \(\v a_{\text{out}}\) of the first module is the \(\v a_{\text{in}}\) of the second module.
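To make this concrete, here is a minimal NumPy sketch of one such module with a ReLU activation; the class and attribute names are only illustrative and not taken from these notes:

```python
import numpy as np

class FCModule:
    """One fully connected module: a_out = relu(W a_in + b)."""

    def __init__(self, n_in, n_out):
        rng = np.random.default_rng()
        self.W = 0.01 * rng.standard_normal((n_out, n_in))
        self.b = np.zeros(n_out)

    def forward(self, a_in):
        self.a_in = a_in                            # cached for the backward pass
        self.u = self.W @ a_in                      # the linear ('dot') unit: u = W a_in
        return np.maximum(0.0, self.u + self.b)     # ReLU activation of u + b

    def backward(self, dl_da_out):
        delta = dl_da_out * (self.u + self.b > 0)   # dl/da_out (odot) relu'(u + b)
        self.dl_dW = np.outer(delta, self.a_in)     # same shape as W
        self.dl_db = delta                          # same shape as b
        return self.W.T @ delta                     # dl/da_in, fed to the previous module
```

A feedforward network is then just a list of such modules: forward is applied from the first module to the last, and backward from the last to the first, each module passing its returned gradient on to its predecessor.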
It should be noted that when using a framework like TensorFlow or PyTorch the programmer only has to specify the forward pass; the backward pass is automagically inferred from that.
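For example, in PyTorch a quadratic loss on the output of a single fully connected module can be differentiated with one call to `backward()` (a minimal sketch with arbitrary input and target):

```python
import torch

layer = torch.nn.Linear(784, 10)        # W and b of one fully connected module
x = torch.randn(1, 784)                 # a (random) flattened 28x28 input
y = torch.zeros(1, 10)
y[0, 3] = 1.0                           # one-hot target for the digit 3

y_hat = torch.relu(layer(x))            # forward pass (only this is specified)
loss = ((y_hat - y) ** 2).sum()         # quadratic loss
loss.backward()                         # backward pass is derived automatically

print(layer.weight.grad.shape)          # torch.Size([10, 784]), i.e. dl/dW
print(layer.bias.grad.shape)            # torch.Size([10]),      i.e. dl/db
```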
When we look at convolutional neural networks we will replace the fully connected linear unit (indicated with the ‘dot’ in the figures above) with a convolution using kernel \(w\). We are then only changing the relation \(\v{u} = W\v{a}_\text{in}\) in the forward pass into a convolution, and in the backward pass we will also get a convolution. That is the topic of the subsequent sections.