7. Clustering
Clustering is the process to group closely related feature vectors. We will consider clustering of real valued feature vectors \(\v x\ls i\in\setR^n\). Clustering is an unsupervised learning algorithm: only the set of feature vectors is known, there are no target values.
In Fig. 7.1 some 2D datasets are shown. In all cases it is quite obvious what the clusters should approximately be (with the possible exception of the clusters in column (b)).
The clusters in the first three columns are relatively simple. A cluster is well represented with its center and membership to a cluster can be expressed in terms of distance to the cluster center. Note that in columns (b) en (c) the distances should take the shape/size of the clusters into account.
The clusters in the first three columns are the types of clusters that can be found using (variants of) the k-Means algorithm. In this chapter we only discuss these algoritms in depth.
The clusters in column (d) en (e) are more difficult. The exists a variety of algorithms that can deal with these types of clusters. We refer to the manual of sklearn for this.