Clustering methods group sets of data objects with the aim of finding relationships among the objects that compose the data. Such a process allows the data scientist to find similarities within the data, draw inferences, discover hidden patterns and also reconstruct a possibly underlying data structure, for example the relative connectivity within the data. In machine learning (ML) literature, clustering is one of the methods normally used in unsupervised learning, with the aim of learning the hidden structures underlying the data and its categorization.
Therefore, there is great interest in carrying out a clustering task in an exploratory analysis to find new insights. There are many different clustering methods, and each has proven to provide accurate clustering results in different fields. However, how can we choose the best clustering algorithm for our data analysis task? And how can we assess the best number of clusters?
With these two questions in mind, I will briefly summarize some of the most common clustering methods that have shown relevant applicability. Additionally, I will comment on the criteria that should be taken into account when performing clustering and on the most well-known methods for assessing how good your clustering is.
To begin with, it is good to bear in mind that clustering methods produce different types of output, known as hard and soft clustering. The difference between the two is that in soft clustering one object can belong to more than one cluster, whereas in hard clustering an object belongs to only one cluster. Additionally, there are different ways to approach the clustering problem: while some methods try to find the possible underlying distributions of the data or its density, other methods focus on the distance between neighboring points and their connectivity.
The list of algorithms for carrying out clustering is quite extensive. Thus, I have picked only some of them to comment on in this article, bearing in mind the different ways they perform the clustering and, therefore, the different situations in which they can be very useful.
K-means: it would be hard to find a clustering article with no mention of the well-known K-means algorithm. K-means is a hard clustering algorithm (there are also implementations for soft clustering) that assigns objects to an initial k number of clusters; the initial clusters are then iteratively re-organized by assigning each object to its closest cluster centroid and re-calculating the cluster centroids until no further changes take place. The optimization criterion in the clustering process is the sum-of-squared-error E between the objects in the clusters and their respective cluster centroids. This method is commonly used as a first approach for clustering, and it has the particularity that it tends to find compact clouds of points within a distance range of their centroid. As a drawback, it can be conditioned by the initial position of its centroids (the initialization problem), and it tends not to work well with data that lies in non-convex geometric shapes. In exploratory analysis, K-means can also be used to detect possible outliers in the data. Below in figure 1 you can find an example of how K-means learns the positions of the centroids. (This example has been borrowed from r-bloggers.com.)
Fig 1. K-means centroids learning process
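As a minimal sketch of the iterative procedure described above, you can run K-means with scikit-learn. The blob data and parameter choices below are illustrative assumptions for the sketch, not data from this article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative toy data: three well-separated Gaussian blobs (an assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 0), (0, 5)]])

# k=3 clusters; multiple restarts (n_init) mitigate the initialization problem.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # the learned centroid positions
print(km.inertia_)          # the sum-of-squared-error E being minimized
```

Note that `n_init` re-runs the algorithm from several random centroid positions and keeps the best result, which directly addresses the initialization drawback mentioned above.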
Hierarchical clustering is a different type of clustering that is based on the concept of proximity. The algorithm builds nested clusters by merging or splitting them successively. This hierarchy of clusters is represented with a tree diagram called a dendrogram. The linkage criterion determines the distance between sets of objects as a function of the pairwise distances between the objects. Therefore, choosing a different linkage criterion can lead to a different clustering output. This method finds hard clusters and allows you to study the connectivity properties of the clustered objects. As a drawback, hierarchical clustering methods are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge. In exploratory analysis, hierarchical clustering can be used not only for clustering but also to find underlying connectivity properties. In contrast to K-means, it can work well with non-convex geometric data shapes, which makes it an interesting clustering method for image segmentation in image processing. Below in figure 2 you can find an example of a dendrogram in hierarchical clustering.
Fig 2. Dendrogram example in hierarchical clustering
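A small sketch of agglomerative hierarchical clustering with SciPy follows; the two-blob data and the choice of Ward linkage are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative toy data: two small groups of points (an assumption).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in [(0, 0), (3, 3)]])

# Build the hierarchy; "ward" is one possible linkage criterion, and
# swapping it ("single", "complete", "average") changes the clustering output.
Z = linkage(X, method="ward")

# Cut the dendrogram so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` would draw the full tree with matplotlib, which is the usual way to inspect the merge hierarchy before deciding where to cut it.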
Gaussian Mixture Models (GMM) are another type of clustering based on distribution models, where objects in the same cluster are very likely to belong to the same distribution. Thus, a GMM is a probabilistic model that represents the presence of subpopulations within an overall population, without requiring that an observed data set identify the subpopulation to which an individual observation belongs. Formally, a GMM corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. GMMs are used to make statistical inferences about the properties of the subpopulations given only observations on the pooled population, without subpopulation identity information. This is a soft clustering method that in many scenarios outperforms K-means and works very well for data with implicit Gaussian distributions. As a drawback, this distribution-based clustering tends to suffer from over-fitting when not initialized with a fixed number of Gaussian distributions (a given k clusters). In exploratory analysis, GMM can be a very interesting clustering method, especially when the data might follow a combination of Gaussian distributions. Below in figure 3 you can find an example of GMM using different types of Gaussian distributions.
Fig 3. Gaussian Mixture Example (scikit-learn library)
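To illustrate the soft assignments a GMM produces, here is a short sketch with scikit-learn; the two-component data and covariance settings are assumptions made for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative toy data drawn from two Gaussians (an assumption).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in [(0, 0), (4, 4)]])

# Fit a mixture of 2 Gaussians; covariance_type controls the shape of each
# component ("full", "tied", "diag", "spherical"), as in the scikit-learn
# example referenced in figure 3.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)

# Soft clustering: each row gives one point's membership probability
# in each of the two components, and the rows sum to 1.
probs = gmm.predict_proba(X)
print(probs[:3])
```

Unlike K-means' hard labels, these membership probabilities let you see which points sit ambiguously between subpopulations.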
Other very interesting clustering algorithms not mentioned in this post are:
- Spectral clustering
- Density-based spatial clustering (DBSCAN)
- Self-organizing maps (SOM)
Validity criteria are a way to assess how good our clustering is. Put another way, we know our algorithm will always return some clustering, so in a purely exploratory analysis scenario these are the means by which we make sense of it. They help to avoid finding patterns in noise, to compare clustering algorithms and to compare two sets of clusters.
The two types of validity criteria that are most used are external and internal indexes.
External indexes are used to assess whether a given clustering correlates with previous information about categories in the data. This means that there is prior knowledge about the categories present in the data, and the goal is therefore to verify whether our clustering output correlates with these categories.
Put another way, external indexes can be used to:
- Measure the extent to which cluster labels match externally supplied class categories.
- Compare the results of a cluster analysis to externally known results.
Known external index methods: the purity index, the F-measure and other measures borrowed from supervised learning to measure accuracy.
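As a sketch of how an external index works, here is a minimal implementation of the purity index (the helper function name and toy labels are my own assumptions, not from a library):

```python
import numpy as np

def purity(true_labels, cluster_labels):
    """Purity index (illustrative sketch): for each cluster, count the most
    frequent true class among its members, then divide the total by the
    number of objects. 1.0 means every cluster is class-homogeneous."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

# A clustering that matches the known classes up to label renaming scores 1.0.
print(purity([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

Note that purity only rewards homogeneity; it says nothing about splitting one true class across many clusters, which is why it is usually reported alongside measures like the F-measure.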
In a purely blind exploratory data analysis, internal indexes are used to evaluate the “goodness” of the resulting clusters; they also try to determine the ‘correct’ number of clusters for those clustering methods that require a desired number of clusters K.
A good clustering algorithm should generate clusters with high intra-cluster homogeneity, good inter-cluster separation and high connectedness between neighboring data objects. To achieve this aim, internal indexes measure:
- The cluster cohesion (how close the points belonging to the same cluster are)
- The cluster separation (how distinct or well separated a cluster is from others)
- The goodness of a clustering structure without respect to external information.
- How well the results of a cluster analysis fit the data without reference to external information
Known internal index methods: sum of squared errors, silhouette coefficient, Bayesian information criterion (BIC), explained variance (the elbow method).
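The silhouette coefficient, for example, can be used to pick K: fit the clustering for several candidate values and keep the one with the highest average score. The sketch below uses K-means and assumed toy data with three blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative toy data with three blobs (an assumption for the sketch).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in [(0, 0), (4, 0), (0, 4)]])

# Compare the average silhouette coefficient (in [-1, 1], higher is better)
# over candidate values of k; the best-scoring k is a reasonable choice.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)
```

The silhouette combines the cohesion and separation criteria listed above in a single number, which is why it is a common default for this model-selection step.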
Different methods for clustering have been presented together with the validity criteria to be taken into account when assessing the quality of the clustering. So how can we choose the best clustering algorithm for our data analysis task? How can we assess the best number of clusters ?
Well, at this point it comes down to the data you are dealing with. Each clustering algorithm will produce different clustering results, which might be very similar or totally different depending on the underlying data structure. So having a clear idea of the desired goals of the clustering analysis, and of the sources of the data you are dealing with, will definitely help you to choose the right algorithm and the right validity measure to assess the right number of clusters.
Last but not least, data visualization is a very powerful tool to assess how close you are to your clustering goals and to understand the kind of data you might be dealing with. However, choosing the right way to visualize your data, in the case of having to perform dimensionality reduction, will play a major role when interpreting it. In a further post, I will address other clustering methods and some of the dimensionality reduction techniques.