3. Overview of Machine Learning

Machine learning is all about interpreting data, lots of data. Observing and characterizing one object or situation in the real world through sensors or a formal description results in what is called a data (or feature) vector \(\v x\in\setR^n\). Such a vector can be low dimensional, say \(n=3\) for an RGB vector describing the color of a pixel in an image, or very high dimensional, like a video sequence of a few seconds (amounting to millions of elements in the data vector).

In our restricted view of machine learning we want to assign a result vector \(\hat{\v y}\) to each possible observation \(\v x\), where \(\hat{\v y}\) is either a label characterizing the object or situation associated with that feature vector, or a (set of) numerical values.

In supervised machine learning a set of examples \((\v x\ls i, \v y\ls i)\) is available for learning a function \(f\) that maps a feature vector \(\v x\) onto a prediction \(\hat{\v y}\) of the target value. We assume that the ‘true’ target \(\v y\) is given in the learning set (hence the term supervised learning). In practice the learned mapping \(f\) will not always produce the correct target value.

In these lecture notes we restrict ourselves to the principal categories of machine learning systems, namely:

Dimensionality Reduction.

The dimension of the feature vector representing one object or situation can be quite high. Consider an image of several megapixels or, worse, a video sequence. Let’s take an image of size \(1000\times 1000\) pixels. The feature vector \(\v x\) then is a vector in \(\setR^{3000000}\), assuming we have a red, green and blue value for each pixel. Now assume that each R, G and B value takes a byte to encode; then there are \((256\times256\times256)^{1000000}\) possible images that are distinct from each other. But most of these images are semantically indistinguishable from other images (a horse is a horse of course of course), depict total nonsense to the human eye, or are simply noise. The point is that the effective dimension (the number of degrees of freedom) is much lower than the dimension of the raw data.

In machine learning several methods are known to reduce the dimensionality of the data without changing its interpretation. As a preprocessing step this can help to reduce the time complexity of machine learning algorithms. Evidently dimensionality reduction is also central to the (lossy) compression of data.

In these notes we will look at principal component analysis (PCA) as a way to reduce the dimensionality of our data.
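As a concrete preview, here is a minimal PCA sketch using only NumPy. The function name pca_reduce and the synthetic data are illustrative choices, not part of the material treated later in these notes.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (one feature vector per row) onto the
    k directions of largest variance (the first k principal components)."""
    X_centered = X - X.mean(axis=0)            # center the data
    # SVD of the centered data matrix; rows of Vt are principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:k].T                               # basis of the k-dim subspace
    return X_centered @ W                      # reduced representation

# Synthetic example: 500 points in R^10 that really live near a 2D subspace
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))                  # 2 underlying degrees of freedom
A = rng.normal(size=(2, 10))
X = Z @ A + 0.01 * rng.normal(size=(500, 10))  # embed in R^10, add small noise
X_reduced = pca_reduce(X, k=2)                 # shape (500, 2)
```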

Regression.

Now consider the case that our feature vector gives a lot of numerical data describing a house (area code, number of bedrooms, floor area, etc.). Given that feature vector we would like to predict the house price. So \(\hat y = f(\v x)\) is the predicted house price.

In general the regression task is to come up with a function \(f\), to be learned from the learning set, that minimizes the error \(\v y - \hat{\v y}\), i.e. the differences between the targets \(\v y\ls i\) and the predicted values \(f(\v x\ls i)\).

As with most supervised machine learning methods, regression doesn’t search for an arbitrary function to match the data. Instead we use a parameterized function for the prediction, \(\hat{\v y} = f(\v x, \v p)\), where the form of \(f\) is fixed (chosen a priori) and all that is left to do is find the parameter vector \(\v p\) that best fits the data.

In these lecture notes we look at linear regression methods. The ‘hello world’ example of a linear regression problem is the task of fitting a straight line through a set of data points. It will become clear that linear regression does not owe its name to straight line fitting; much more complex functions can be fitted to the data.
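To make the ‘hello world’ example concrete, below is a minimal least-squares line fit in NumPy; the variable names and the synthetic data are illustrative only.

```python
import numpy as np

# Synthetic noisy data points around the line y = 2x + 1
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + 0.1 * rng.normal(size=x.shape)

# Design matrix: each row is (1, x_i), so p = (intercept, slope)
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of the parameter vector p
p, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ p                       # the predictions f(x, p)
print(p)                            # close to [1.0, 2.0]
```

Note that replacing the column of \(x\)-values with columns of nonlinear features such as \(x^2\) or \(\sin(x)\) keeps the problem linear in the parameters \(\v p\); that is the sense in which the method is ‘linear’.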

Regression is not restricted to mappings from \(\setR^n\) to \(\setR\), i.e. resulting in scalar values. An example where the result \(\v y\) is a vector as well is predicting where in an image an object is visible and at what size. The input feature vector is then a vector representation of the input image and the result is a vector \(\v y\) representing the bounding box encompassing the object visible in the image (something like \((y_1,y_2)\) for the position of the bounding box and \((y_3,y_4)\) for its width and height).

We also take a brief look at using a neural network for regression. Reading that section is best postponed until after learning about the use of neural networks for classification.

Classification.

In classification the feature vector \(\v x\) is mapped onto a label \(y\) (or a set of probabilistic labels \(\v y\)) characterizing the object that is represented by the feature vector. An example is classifying a piece of fruit as apple, pear, grape, etc., given a picture of the piece of fruit.

Each point in the feature space \(\setR^n\) is assigned one of the possible labels. With this the feature space is divided into regions, each corresponding to one of the labels. The boundaries of these regions are called decision boundaries.

In these notes we will discuss several methods for classification: naive Bayes classification, logistic regression and neural networks.
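To make decision boundaries tangible before the detailed chapters, here is a minimal nearest-mean classifier in NumPy. It is not one of the three methods listed above, just about the simplest classifier imaginable: the decision boundary between two classes is the set of points equidistant from the two class means.

```python
import numpy as np

def nearest_mean_fit(X, y):
    """The 'model' is simply the mean feature vector of each class."""
    labels = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in labels])
    return labels, means

def nearest_mean_predict(X, labels, means):
    """Assign each point to the class whose mean is closest; the decision
    boundaries lie where two class means are equidistant."""
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return labels[np.argmin(d, axis=1)]

# Two Gaussian blobs in R^2 as a toy learning set
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([2, 2], 0.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

labels, means = nearest_mean_fit(X, y)
print(nearest_mean_predict(np.array([[0.1, 0.2], [1.9, 2.1]]), labels, means))
# -> [0 1]
```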

Clustering.

In clustering you are given a dataset with only feature vectors and no target values. The task then is to divide these feature vectors (i.e. points in feature space) into clusters such that points in one cluster share a common interpretation and points in different clusters are semantically different.

Clustering is a useful technique for several reasons. It is used in exploratory data analysis to look for patterns in the data (which articles in a supermarket tend to end up in the same cart and what does that tell us about these customers). It is also of more technical interest. For instance, when considering all the colors in an image (a lot of points in 3D space) we can often find a limited number of clusters in color space. The cluster centers can then serve as a limited set of colors to paint the image without too much visual degradation of the image appearance.

In these lecture notes we consider a simple clustering algorithm: k-means clustering.
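Below is a minimal k-means sketch in NumPy, applied to the color-quantization use case described above. The random initialization and the synthetic colors are illustrative choices, and the sketch does not handle clusters that end up empty.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate between assigning each point to its
    nearest center and moving each center to the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Distances from all points to all centers, shape (n_points, k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)                # nearest-center labels
        new_centers = np.array([X[assign == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):        # converged
            break
        centers = new_centers
    return centers, assign

# Toy color quantization: cluster 1000 RGB colors (points in R^3) into 8
rng = np.random.default_rng(3)
colors = rng.random((1000, 3))
centers, assign = kmeans(colors, k=8)
quantized = centers[assign]     # each color replaced by its cluster center
```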

Besides the four basic categories of machine learning methods, we will look at the methodology for applying machine learning in practice: how to utilize the data for learning and testing, and how to quantify the performance of your machine learning system.

It is important to note that this introductory course barely scratches the surface of the field of machine learning. Not all classical methods are discussed: the support vector machine and the other kernel methods in learning, the state of the art in machine learning only two decades ago, are skipped. Methods like decision trees are also not discussed in these notes. Still, these methods are not without their special merits and uses even today. For instance, decision trees are useful when dealing with categorical data, without the need to transform categorical data into numerical data as is needed for most other methods.

And besides not covering these methods from the past, we also don’t look at state of the art modern methods. Autoencoders, transformer networks, LSTM networks, generative adversarial networks and diffusion models are only a few of the modern developments that we do not discuss. These will be looked at in master level courses on machine learning and deep learning.