4. Convolutional Neural Networks
We have not yet discussed how computer vision systems can be used to link visual observations with semantic labels. Given an image, tell me what I am looking at: a cat, a house, the digit 7, and so on.
In this chapter we take on that ‘simple’ task: decide whether a particular object or concept is visible in an image. Examples are:
- Recognizing digits (MNIST dataset)
Below are a number of images from the MNIST data set: small images of handwritten digits.
- Recognizing objects (CIFAR-10/CIFAR-100 data sets)
Below are some examples from the CIFAR-10 image data set. Each image comes with a semantic label for what is visible in it; the task is to produce the same label using a computer vision system.
- Recognizing objects (ImageNet data set)
ImageNet is a collection of labeled images depicting objects from 1000 categories. This set has been used in many image classification challenges in the past. A small sketch of how such labeled data sets can be loaded in practice is given below.
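As an aside, here is a minimal sketch of how such labeled data can be obtained in practice. It assumes the torchvision package is installed; the './data' download directory is arbitrary.

    # Minimal sketch: obtaining MNIST and CIFAR-10 as labeled (image, class) pairs.
    # Assumes the torchvision package; './data' is an arbitrary download directory.
    from torchvision import datasets, transforms

    to_tensor = transforms.ToTensor()
    mnist = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)
    cifar = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)

    image, label = mnist[0]
    print(image.shape, label)  # torch.Size([1, 28, 28]) and a digit class in 0-9

    image, label = cifar[0]
    print(image.shape, label)  # torch.Size([3, 32, 32]) and an object class in 0-9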
Many more data sets for all kinds of challenges are available on the web. Nowadays CNNs are used for tasks such as:
- recognizing AND detecting an object (with a bounding box)
- recognizing and detecting multiple objects in an image
- recognizing and segmenting objects (not only a bounding box but a labeling at the pixel level)
Furthermore, CNNs, often in combination with networks specialized for (time) sequences, are used to analyze videos and to detect and analyze motion.
And there are many applications that were never thought of before: a neural network to transfer Monet's painting style onto a photograph, or the other way around, making a photograph out of a painting; automatic colorization of old black-and-white pictures and movies; transferring the movements and expressions from one face to another, leading to the so-called /deep fakes/.
The last ten years have shown that CNNs form the basis of applications in image processing and computer vision that were previously thought to be attainable only in the far future. The same decade has shown that CNN-based systems are very good at pattern recognition (in human terms: /pre-attentive vision/), even much better than we thought possible, but that there is still a long way to go before we reach real understanding (whatever that may be...) of images. As far as I know, we have yet to come up with a CNN-based vision system that can distinguish between the two spirals on the front cover of Minsky and Papert's (in)famous book on perceptrons.
In this chapter we lay the foundation for understanding what makes a CNN work: how do we do backpropagation for layers in a neural network that are convolution operators, and what about max-pooling layers? We will look at these networks from a computer vision point of view. Using CNNs in practice requires more knowledge than we provide in this chapter; for that you will need an advanced machine learning (deep learning) course.
We start with a section on the history of (convolutional) neural networks for image classification, a history that already dates back to the 1950s.
Then we take a brief look at fully connected neural networks as a starting point for our discussion of convolutional nets. This is not only a pedagogical choice: most classification CNNs start with (a lot of) convolutional layers but end in several fully connected layers that do the actual classification.
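As a minimal sketch of this typical structure, assuming PyTorch and purely illustrative layer sizes (chosen here for 32x32 RGB input such as CIFAR-10), such a network might look like this:

    import torch
    import torch.nn as nn

    # Sketch of a classification CNN: convolutional layers extract features,
    # fully connected layers at the end do the actual classification.
    # All layer sizes are illustrative, not prescribed by the text.
    class SmallCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x32x32 -> 16x32x32
                nn.ReLU(),
                nn.MaxPool2d(2),                              # -> 16x16x16
                nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x16x16
                nn.ReLU(),
                nn.MaxPool2d(2),                              # -> 32x8x8
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 8 * 8, 128),
                nn.ReLU(),
                nn.Linear(128, num_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    # A batch of 4 CIFAR-10 sized images yields 4 vectors of class scores.
    scores = SmallCNN()(torch.randn(4, 3, 32, 32))
    print(scores.shape)  # torch.Size([4, 10])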
In a subsequent section we will consider a convolutional processing block in a CNN. We will describe the forward pass as a (set of) convolutions and will show that the backward pass also consists of convolutions (albeit with mirrored kernels).
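As a preview of that result, here is a small sanity check in NumPy for the 1D, single-channel case (the forward pass is the 'valid' cross-correlation that CNN layers typically implement; the 2D multi-channel case treated in the section follows the same pattern):

    import numpy as np

    # Claim: the gradient of a convolution layer with respect to its input is
    # again a convolution, with the mirrored (flipped) kernel.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(10)              # input signal
    w = rng.standard_normal(3)               # kernel
    y = np.correlate(x, w, mode="valid")     # forward: y[i] = sum_k w[k] * x[i + k]

    # Loss L = sum(g * y), so dL/dy = g (the gradient arriving from above).
    g = rng.standard_normal(y.size)

    # Backward for the input: 'full' convolution of g with w flips the kernel,
    # dL/dx[j] = sum_i g[i] * w[j - i].
    dx_conv = np.convolve(g, w, mode="full")

    # Brute-force gradient straight from the chain rule, for comparison.
    dx_brute = np.zeros_like(x)
    for i in range(y.size):
        for k in range(w.size):
            dx_brute[i + k] += g[i] * w[k]

    print(np.allclose(dx_conv, dx_brute))    # True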