Machine learning in practice – Let the machine find the optimal number of clusters from your data

 

What is Clustering?

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Where clustering is used in real life:

Clustering is used almost everywhere: search engines, marketing campaigns, biological analysis, cancer research. Even your favorite phone provider runs a cluster analysis to see which group of customers you belong to before deciding whether to give you an additional discount or a special offer. The applications are countless.


How can I find the optimal number of clusters?

One fundamental question is: if the data is clusterable, how do we choose the right number of expected clusters (k)?

 

· Direct methods consist of optimizing a criterion, such as the within-cluster sums of squares or the average silhouette. The corresponding methods are named the elbow and silhouette methods, respectively.
· Testing methods consist of comparing evidence against the null hypothesis. An example is the gap statistic.
 
In addition to the elbow, silhouette, and gap statistic methods, some thirty other indices and methods have been published for identifying the optimal number of clusters.
I have developed a KNIME model that computes 30 of these indices and decides the best number of clusters using the “majority rule”.
 
Please note, since I use KNIME and R in combination, you need to have R installed and set up on your machine to run this model.
 

Three popular methods for determining the optimal number of clusters

There are three popular methods for determining the optimal number of clusters:

  • Elbow method
  • Average silhouette method
  • Gap statistic method


Elbow method for defining the optimal number of clusters

 

Concept:
The basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation or the total within-cluster sum of squares) is minimized.
 
The total within-cluster sum of squares (wss) measures the compactness of the clustering, and we want it to be as small as possible.
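In symbols, with C_k denoting the k-th cluster and \mu_k its centroid, this is the quantity that k-means minimizes:

wss = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2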

Algorithm:

  • Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
  • For each k, calculate the total within-cluster sum of squares (wss).
  • Plot the curve of wss according to the number of clusters k. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters (see the R sketch below).
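As a minimal R sketch of these steps (using the scaled iris measurements as stand-in data, since the data set behind the plots in this post is not included here):

# Elbow method: total within-cluster sum of squares for k = 1..10
data(iris)
iris.scaled <- scale(iris[, -5])  # drop the species label, standardize

set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(iris.scaled, centers = k, nstart = 25)$tot.withinss
})

# The bend (knee) of this curve indicates the appropriate k
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares (wss)")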
 

 

After running the algorithm in R, we can see the following graphical representation of the final calculations:
The elbow method suggests a 3-cluster solution.
 
 
The elbow method is sometimes ambiguous. An alternative is the average silhouette method (Kaufman and Rousseeuw, 1990), which can also be used with any clustering approach.
 

 

Average silhouette method for defining the optimal number of clusters



 
This method measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values of k (Kaufman and Rousseeuw, 1990).
 
Algorithm:

The algorithm is similar to the elbow method and can be computed as follows:
  • Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
  • For each k, calculate the average silhouette of observations (avg.sil).
  • Plot the curve of avg.sil according to the number of clusters k.
  • The location of the maximum is considered the appropriate number of clusters (see the R sketch below).
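A minimal R sketch of these steps, assuming the cluster package is installed and reusing iris.scaled from the elbow sketch above (note that the silhouette is only defined for k >= 2):

library(cluster)

set.seed(123)
k.values <- 2:10
avg.sil <- sapply(k.values, function(k) {
  km <- kmeans(iris.scaled, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(iris.scaled))
  mean(sil[, "sil_width"])  # average silhouette width for this k
})

# The maximum of this curve indicates the appropriate k
plot(k.values, avg.sil, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Average silhouette width (avg.sil)")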
After running the algorithm in R, we can see the following graphical representation: three clusters are suggested.

 

 

 

Observation

 
I ran the following algorithms in order to find the suggested optimal number of clusters:
  • k-means, PAM, and hierarchical clustering in combination with the elbow method
  • the average silhouette method using the k-means and PAM algorithms
 

Solutions:

·       Three-cluster solutions are suggested using k-means, PAM, and hierarchical clustering in combination with the elbow method.
·       The average silhouette method gives two-cluster solutions using the k-means and PAM algorithms. Combining hierarchical clustering and the silhouette method returns three clusters.
 
According to these observations, it’s possible to define k = 3 as the optimal number of clusters in the data.
  

Gap statistic method for defining the optimal number of clusters

 
The disadvantage of the elbow and average silhouette methods is that they measure only a global clustering characteristic. A more sophisticated method is the gap statistic, which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.
 
The gap statistic was published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method.
 
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data, i.e. a distribution with no obvious clustering.
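A minimal R sketch, assuming the cluster package (whose clusGap() function implements the gap statistic) and reusing iris.scaled from above:

library(cluster)

set.seed(123)
gap <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
               K.max = 10, B = 50)  # B = number of reference data sets
print(gap, method = "firstmax")    # reports the estimated optimal k
plot(gap, xlab = "Number of clusters k")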
  
More details on this method can be found in the original paper by Tibshirani et al.

The Machine Learning way for defining the optimal number of clusters

 
Searching the web, I found a package that you can use in R. It computes 30 different indices and determines the optimal number of clusters according to the majority rule.

NbClust: A Package providing 30 indices for determining the best number of clusters 

Many indices have been proposed in the literature for determining the optimal number of clusters in a partitioning of a data set during the clustering process.
NbClust is an R package, published by Charrad et al. (2014), that provides 30 indices for determining the relevant number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
 
An important advantage of NbClust is that the user can simultaneously compute multiple indices and determine the number of clusters in a single function call.
The indices provided in the NbClust package include the gap statistic, the silhouette method, and 28 other indices described comprehensively in the original paper of Charrad et al. (2014).
 
In KNIME we use only 28 indices, because the other two are graphical indices and cannot be automated.
The following indices are used: "kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "sdindex", "sdbw".
 
Examples of usage in R:

library("NbClust")

set.seed(123)
# iris.scaled as defined in the elbow sketch above
res.nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
                  max.nc = 10, method = "complete", index = "gap")
res.nb  # print the results
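To apply the "majority rule" over many indices in a single call, the index argument can be set accordingly; a sketch (per the NbClust documentation, index = "all" computes all indices except gap, gamma, gplus, and tau, while "alllong" includes all 30):

set.seed(123)
res.all <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
                   max.nc = 10, method = "kmeans", index = "all")
res.all$Best.nc  # the number of clusters proposed by each index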

 

I have implemented this model fully automated in KNIME. Stay tuned for more instructions on how to do it yourself in KNIME too.
