#### What is Clustering?

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

#### Where Clustering is used in real life:

Clustering is used almost everywhere: search engines, marketing campaigns, biological analysis, and cancer analysis. Your phone provider, for example, runs a cluster analysis to see which group of customers you belong to before deciding whether to give you an additional discount or a special offer. The applications are countless.

#### How can I find the optimal number of clusters?

One fundamental question is: if the data is clusterable, how do we choose the right number of expected clusters (k)?

**Three popular methods for determining the optimal number of clusters:**

- Elbow method
- Average silhouette method
- Gap statistic method

### Elbow method for defining the optimal number of clusters

#### Algorithm

- Compute a clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
- For each k, calculate the total within-cluster sum of squares (wss).
- Plot the curve of wss against the number of clusters k.
- The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
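The elbow steps can be sketched in base R as follows. This is a minimal illustration, not the original analysis: the built-in `iris` measurements, the scaling step, the seed, and `nstart = 25` are all assumptions chosen for the example.

```r
# Elbow method sketch on the built-in iris measurements (illustrative data choice).
set.seed(123)                      # k-means starts from random centers
x <- scale(iris[, 1:4])            # standardize the four numeric columns

k_values <- 1:10

# Total within-cluster sum of squares (wss) for each k
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Plot wss against k; the bend (elbow) suggests a candidate number of clusters
plot(k_values, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow off the plot remains a judgment call; when the curve bends gradually, different readers may reasonably pick different values of k.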

### Average silhouette method for defining the optimal number of clusters

#### Algorithm

- Compute a clustering algorithm (e.g., k-means clustering) for different values of k.
- For each k, calculate the average silhouette of observations (avg.sil).
- Plot the curve of avg.sil against the number of clusters k.
- The location of the maximum is considered the appropriate number of clusters.

After running the algorithm in R, we can inspect the result graphically.
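A possible R sketch of the average silhouette procedure is shown here. It assumes the `cluster` package (distributed with standard R installations) for `silhouette()`, and again uses the scaled `iris` measurements purely as example data; `avg_sil` and `best_k` are names introduced for this illustration.

```r
library(cluster)                   # provides silhouette()

set.seed(123)
x <- scale(iris[, 1:4])            # illustrative data, standardized
d <- dist(x)                       # Euclidean distances between observations

k_values <- 2:10                   # silhouette widths need at least two clusters

# Average silhouette width of the k-means partition for each k
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

plot(k_values, avg_sil, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Average silhouette width")

# The k with the largest average silhouette width
best_k <- k_values[which.max(avg_sil)]
best_k
```

Swapping `kmeans()` for `pam()` from the same package would give the PAM variant; only the line that builds the partition changes.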

#### Observation

**k-means**, **PAM** and **hierarchical** clustering in combination with the **elbow method**, and the average silhouette method using the **k-means** and **PAM** algorithms.

#### Solutions:

**k-means**, **PAM** and **hierarchical** clustering in combination with the **elbow method**.

**k-means** and **PAM** algorithms. Combining hierarchical clustering and the silhouette method returns 3 clusters.

### Gap statistic method for defining the optimal number of clusters

The **gap statistic** was published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method.
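As a rough sketch, the gap statistic can be computed in R with `clusGap()` from the `cluster` package, which compares the within-cluster dispersion for each k with its expectation under a reference distribution generated by simulation. The data set, `K.max = 10`, `B = 50` reference samples, and the `"firstSEmax"` selection rule (the package default) are assumptions made for this example.

```r
library(cluster)                   # provides clusGap() and maxSE()

set.seed(123)
x <- scale(iris[, 1:4])            # illustrative data, standardized

# Gap statistic for k = 1..10 using k-means; nstart is passed on to kmeans()
gap <- clusGap(x, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 10)

# gap$Tab holds logW, E.logW, gap and SE.sim for every k
print(gap, method = "firstSEmax")

# Pick k using the gap values and their simulation standard errors
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
best_k

plot(gap, main = "Gap statistic")
```

Because the reference samples are random, the suggested k can shift slightly with the seed and with `B`; a larger `B` gives more stable standard errors at the cost of run time.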

### The Machine Learning way for defining the optimal number of clusters

### NbClust: A Package providing 30 indices for determining the best number of clusters

**NbClust** is an R package, published by Charrad et al. (2014), that provides 30 indices for determining the relevant number of clusters. It proposes the best clustering scheme from the results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

The **NbClust** package includes the gap statistic, the silhouette method, and 28 other indices described comprehensively in the original paper by Charrad et al. (2014).

**Examples of usage in R**

`library("NbClust")`
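A short usage sketch might look like the following. It assumes the NbClust package is installed, and the data set, the Euclidean distance, the range of 2 to 10 clusters, and the choice of a single index (`"silhouette"`) are illustrative assumptions rather than settings taken from the original analysis.

```r
library("NbClust")

set.seed(123)
x <- scale(iris[, 1:4])            # illustrative data, standardized

# Evaluate k = 2..10 with k-means and the silhouette index only
res <- NbClust(data = x, distance = "euclidean",
               min.nc = 2, max.nc = 10,
               method = "kmeans", index = "silhouette")

res$All.index                      # index value for each candidate k
res$Best.nc                        # proposed number of clusters and its index value
table(res$Best.partition)          # cluster sizes of the proposed partition
```

Setting `index = "all"` instead asks the package to compute its full set of indices and report the number of clusters that most of them agree on, which takes noticeably longer.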