The elbow method suggests a 3-cluster solution.
The elbow method is sometimes ambiguous. An alternative is the average silhouette method (Kaufman and Rousseeuw [1990]), which can also be used with any clustering approach.
Average silhouette method for defining the optimal number of clusters
Algorithm
This method measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values of k (Kaufman and Rousseeuw [1990]).
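For reference, the silhouette width of a single observation i is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other members of its own cluster and b(i) is the lowest average distance from i to the members of any other cluster. s(i) ranges from −1 to 1, and values close to 1 indicate a well-clustered observation.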
The algorithm is similar to the elbow method and can be computed as follows:
- Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 2 to 10 clusters (the silhouette is not defined for k = 1)
- For each k, calculate the average silhouette of observations (avg.sil)
- Plot the curve of avg.sil according to the number of clusters k.
- The location of the maximum is considered the appropriate number of clusters.
After running the algorithm in R, the resulting silhouette plot suggests three clusters.
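If you want to reproduce something similar, here is a minimal sketch of these steps in R using the cluster package and the scaled iris data from the NbClust example further below (an illustration only, not necessarily the exact code behind the plot described above):

library(cluster)

iris.scaled <- scale(iris[, -5])   # assumed input: standardized iris data without the Species column

# Average silhouette width for a given k (k-means used as the clustering algorithm)
avg_sil <- function(k, data) {
  km <- kmeans(data, centers = k, nstart = 25)
  ss <- silhouette(km$cluster, dist(data))
  mean(ss[, "sil_width"])
}

set.seed(123)
k.values <- 2:10                                   # the silhouette is not defined for k = 1
avg.sil <- sapply(k.values, avg_sil, data = iris.scaled)

plot(k.values, avg.sil, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")
k.values[which.max(avg.sil)]                       # k maximizing the average silhouette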
Observation
To find the suggested optimal number of clusters, I ran the following combinations:
k-means, PAM, and hierarchical clustering with the elbow method, and the average silhouette method with the k-means, PAM, and hierarchical clustering algorithms.
Solutions:
· A 3-cluster solution is suggested by k-means, PAM, and hierarchical clustering in combination with the elbow method.
· The average silhouette method gives a 2-cluster solution with the k-means and PAM algorithms; combining hierarchical clustering with the silhouette method returns 3 clusters.
Based on these observations, it is reasonable to define k = 3 as the optimal number of clusters in the data.
Gap statistic method for defining the optimal number of clusters
The disadvantage of the elbow and average silhouette methods is that they measure only a global clustering characteristic. A more sophisticated approach is the gap statistic, which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data, i.e. a distribution with no obvious clustering.
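Formally, Gap(k) = E*[log(W_k)] − log(W_k), where W_k is the total within-cluster dispersion for k clusters and E* denotes the expectation under the reference distribution. As an illustration (not part of the original workflow), here is a minimal sketch in R using clusGap() from the cluster package, again with the scaled iris data as an assumed input:

library(cluster)

iris.scaled <- scale(iris[, -5])   # assumed input: standardized iris data

set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)       # B = number of bootstrap reference data sets
print(gap_stat, method = "firstmax")          # suggested k under the "first max" rule
plot(gap_stat, xlab = "Number of clusters k")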
More info on this method here:
The Machine Learning way for defining the optimal number of clusters
Searching the web, I found a package that you can use in R. It basically computes 30 different indices and determines the optimal number of clusters according to the majority rule.
NbClust: A Package providing 30 indices for determining the best number of clusters
Many indices have been proposed in the literature for determining the optimal number of clusters in a partitioning of a data set during the clustering process.
NbClust is an R package, published by Charrad et al. (2014), that provides 30 indices for determining the relevant number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
An important advantage of NbClust is that the user can simultaneously compute multiple indices and determine the number of clusters in a single function call.
The indices provided in the NbClust package include the gap statistic, the silhouette index, and 28 others, described comprehensively in the original paper by Charrad et al. (2014).
In KNIME we use only 28 indices, because the other two are graphical indices and cannot be automated.
The following indices are used: "kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "sdindex", "sdbw"
Examples of usage in R
library("NbClust")
iris.scaled <- scale(iris[, -5])   # assumed: standardized iris data, Species column removed
set.seed(123)
res.nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index = "gap")
res.nb   # print the results
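To mimic the majority rule over the 28 indices listed above, one could loop over them and tabulate the proposed k. This is only a rough sketch of the idea, not the actual KNIME workflow, and some indices (e.g. the gap statistic) can take a while to compute:

indices <- c("kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw",
             "tracew", "friedman", "rubin", "cindex", "db", "silhouette",
             "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial",
             "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn",
             "sdindex", "sdbw")

set.seed(123)
best_k <- sapply(indices, function(idx) {
  NbClust(iris.scaled, distance = "euclidean", min.nc = 2, max.nc = 10,
          method = "complete", index = idx)$Best.nc["Number_clusters"]
})

table(best_k)   # majority rule: the k proposed by the largest number of indices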
I have implemented this model, fully automated, in KNIME. Stay tuned for more instructions on how to do it yourself in KNIME too.