4.1. A Brief History of Object Classification and CNNs
4.1.1. Old AI and CV Paradigm: “Model the world”
Understand and model the world and, based on that knowledge, analyse the world as sensed with a camera. An example is the blocks world and the interpretation of edge junctions.
Note that the ‘blocks world’ was/is not only about image interpretation but more about reasoning/thinking/planning, a problem not (yet?) solved in modern AI.
Recall the story about the Marvin Minsky assignment in the 1960s.
Recognizing a chair in a picture would have gone along steps like:
Model a chair (3D model)
Model how that chair is projected on the retina
Recognize the basic 2D structural building blocks (edges, corners, etc.) in an image
Verify which of those recognized building blocks can be modelled as the projection of a chair
Determine the most likely projection
Infer the 3D model of the chair
So basically, ‘old-fashioned’ AI is bottom-up processing of the visual data, guided by a model of not only the object in the 3D world but also of how that object appears in an image taken with a camera (or observed through our eyes).
David Marr was an important proponent of this approach to vision. His book “Vision” was one of the first principled books on computer vision. A lot of what he wrote is no longer central to the field today (but it is still an important book to read if you are interested in the history).
4.1.2. Statistics and machine learning: the bag of visual words model
Instead of a structural model of how a chair is depicted in an image, a statistical model is used. Local neighborhoods in an image are characterized as points in a high-dimensional feature space (think of the SIFT descriptor). This feature space is discretized into a few thousand discrete regions corresponding with characteristic image details \(D_1,\ldots,D_N\). All (well, a lot of...) local neighborhoods in an image are discretized this way, and a histogram is made giving the relative occurrence of detail \(D_i\) in the image. The histogram does not account for the spatial structure of the observations; it is a bag into which all details are thrown in random order. Surprisingly enough, you can then use a classical machine learning algorithm to classify these histograms (the SVM was the classifier of choice in those days; an SVM is like a logistic regression classifier that can represent its feature vectors in a very high, even infinite, dimensional space through the kernel trick, a bit like adding augmented features through polynomials).
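A minimal sketch of such a bag-of-visual-words pipeline is shown below (my illustration, assuming scikit-learn and NumPy; the SIFT descriptors are replaced by random placeholder vectors, and the vocabulary size is kept small for readability).

```python
# Bag-of-visual-words sketch: cluster local descriptors into a vocabulary,
# describe each image by a histogram of visual words, classify with an SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder descriptors: a list of (n_keypoints, 128) arrays, one per image.
# In a real pipeline these would be SIFT descriptors extracted from the images.
train_descriptors = [rng.normal(size=(rng.integers(50, 200), 128)) for _ in range(20)]
train_labels = rng.integers(0, 2, size=20)          # e.g. chair vs. no chair

# 1. Build the visual vocabulary D_1..D_N by clustering all descriptors.
N = 100                                             # a few thousand in practice
kmeans = KMeans(n_clusters=N, n_init=10).fit(np.vstack(train_descriptors))

def bow_histogram(descriptors):
    """Map each descriptor to its nearest visual word and count occurrences."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=N).astype(float)
    return hist / hist.sum()                        # relative occurrence of each D_i

# 2. Represent every image as a histogram; all spatial order is discarded.
X_train = np.array([bow_histogram(d) for d in train_descriptors])

# 3. Train a classical classifier (here an SVM with a kernel) on the histograms.
clf = SVC(kernel="rbf").fit(X_train, train_labels)
```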
Note that a BOW model does not follow the route set out in the classical AI paradigm of modelling the world first (through knowledge models) and interpreting (visual) observations guided by the model. Instead, in its bottom-up processing it relies on the statistical correspondence between local descriptors as seen in different views of the same object. It never tries to label these local descriptors in terms of a knowledge model of the world.
4.1.3. The CNN comes in: “Model the brain”
In 2012 (yes, only 10 years ago) AlexNet won the ImageNet Large Scale Visual Recognition Challenge, and this CNN improved the state-of-the-art results by a large margin. Cees Snoek (a professor at the UvA leading the Vision group), who had won the challenge a few times in the years before with the BOW model, came back from that conference and declared: “throw out our software, we have to switch to CNNs”.
A CNN is a neural network that utilizes a priori knowledge about images, namely the local structure assumption. Compared with an FC network, a CNN is not more expressive; in one layer it is actually less expressive (a convolution layer is an FC layer constrained to shared, local weights). But making a deep network out of these simple layers proved to lead to superior results compared with the ‘old’ BOW models.
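To make the local structure assumption concrete, here is a small sketch (assuming PyTorch; the layer sizes are illustrative) comparing the parameter counts of a fully connected layer and a convolution layer that produce the same output shape.

```python
# An FC layer connects every input pixel to every output unit; a conv layer
# uses small local filters whose weights are shared over the whole image.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                       # one 32x32 RGB image

fc = nn.Linear(3 * 32 * 32, 64 * 32 * 32)           # dense: all-to-all connections
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # 3x3 local filters, shared weights

print(sum(p.numel() for p in fc.parameters()))      # ~201 million parameters
print(sum(p.numel() for p in conv.parameters()))    # 1,792 parameters

y_fc = fc(x.flatten(1)).view(1, 64, 32, 32)         # same output shape either way
y_conv = conv(x)
```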
A CNN completely bypasses the model-based AI approach of the past. Knowledge and representation are ‘hidden’ in the weights of the network. Neural networks cannot explain why they reach a specific conclusion (and even that sentence is misleading: a CNN just cranks out a number which we have taught the system to correspond with a particular semantic concept). What a CNN-based vision system does to recognize a bicycle in an image is really hard to explain. Only for the very first few layers of the network do we now have some idea of what the network is doing and representing in its weights. But that is at the level of small image details (we would call it local structure); what model is used to combine these details (and the missing details...) into the notion of a bike is far from evident.
Explainable AI is therefore a part of AI that will be very influential for the acceptance of AI in society.
4.1.4. Success has many fathers
The use of NNs for visual tasks has a long history. It started with the perceptron (Rosenblatt, 1957). The perceptron took a 20x20 image as input and could classify it as one of the digits from 0 to 9.
With hindsight, the perceptron is just ‘logistic regression with the wrong activation function’. Instead of a differentiable activation function, a binary threshold function was used: if the weighted sum of the inputs is above the threshold, the perceptron outputs +1, else -1 (the binary inputs were encoded with -1 and +1 as well).
Surprisingly, given the non-differentiable activation function, there was a learning method for the perceptron; see the sketch below. If the data set (the set of examples) was linearly separable, the learning rule would indeed lead to a set of weights that classified the dataset correctly.
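A minimal sketch of the perceptron and its error-driven learning rule (my illustration, with inputs and labels encoded as -1/+1 as described above):

```python
# Perceptron: hard threshold activation, weights updated only on mistakes.
import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    """X: (n_samples, n_features) with entries in {-1, +1}; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else -1   # non-differentiable threshold
            if pred != yi:                       # misclassified: nudge weights
                w += lr * yi * xi                # towards the correct side
                b += lr * yi
                errors += 1
        if errors == 0:                          # converges if linearly separable
            break
    return w, b
```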
The non-differentiable activation function, however, does not allow for slightly overlapping data sets, nor for a multi-layer perceptron learning algorithm. The principle of backpropagation in networks was known around the same time, but it was not used in the realm of neural networks.
These and other observations led Minsky and Papert to write a very influential book, “Perceptrons” (1969), that criticized the perceptron for its shortcomings. Unfortunately, this book led to a decrease in interest in neural networks, and the focus shifted to the symbolic (“model the world”) AI paradigm.
Neural networks returned to the spotlight in AI research with the development of the backpropagation algorithm. It was Werbos in 1974 who foresaw the use of backpropagation to learn neural networks. It was not until the mid-80s that the BP algorithm was used in practice to learn NNs (Rumelhart and Hinton).