7. Clustering

Clustering is the process to group closely related feature vectors. We will consider clustering of real valued feature vectors \(\v x\ls i\in\setR^n\). Clustering is an unsupervised learning algorithm: only the set of feature vectors is known, there are no target values.

../../_images/clustering_examples.png

Fig. 7.1 Clusters in 2D data sets. In every column the data set is shown in the top row, in the second row the same data set but then the ‘true’ cluster label is used to distinguish the clusters. Note a ‘true’ cluster label in practice is not known. (the clusters are generated with standard sklearn code).

In Fig. 7.1 some 2D datasets are shown. In all cases it is quite obvious what the clusters should approximately be (with the possible exception of the clusters in column (b)).

The clusters in the first three columns are relatively simple. A cluster is well represented with its center and membership to a cluster can be expressed in terms of distance to the cluster center. Note that in columns (b) en (c) the distances should take the shape/size of the clusters into account.

The clusters in the first three columns are the types of clusters that can be found using (variants of) the k-Means algorithm. In this chapter we only discuss these algoritms in depth.

The clusters in column (d) en (e) are more difficult. The exists a variety of algorithms that can deal with these types of clusters. We refer to the manual of sklearn for this.