Machine learning – Clustering algorithms

In machine learning, no single algorithm works best for every problem; this is especially relevant for supervised learning.

For example, you can’t say that neural networks are always better than decision trees, or vice versa. There are many factors to consider before choosing the machine learning algorithm you will use to solve your problem.

As a result, you should try many different algorithms for your problem, while using a hold-out “test set” of data to evaluate performance and select the winner.
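
As a rough illustration of this workflow, here is a minimal sketch, assuming scikit-learn is available and using a toy dataset and two arbitrary candidate models (none of which come from the original article): fit each candidate on a training split and score it on the held-out test set.

```python
# A minimal sketch of comparing candidate algorithms on a hold-out test set
# (illustrative only; dataset and models are assumptions, not from the article).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
}

# Fit every candidate on the training split and score it on the held-out test set.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```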

Of course, the algorithms you try must be appropriate for your problem, which is where picking the right machine learning task comes in. As an analogy, if you need to clean your house, you might use a vacuum, a broom, or a mop, but you wouldn’t bust out a shovel and start digging.

What is Clustering?


Clustering can be considered the most important unsupervised machine learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.
A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.
A cluster is, therefore, a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
We can show this with a simple graphical example:
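
As an illustration, the following sketch (assuming scikit-learn and matplotlib are available; none of this code is from the original article) generates unlabeled two-dimensional points and recovers the groups purely from the distances between points:

```python
# A minimal sketch: unlabeled points falling into visibly separated groups,
# recovered by a clustering algorithm (here k-means, chosen only for illustration).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate unlabeled 2-D points drawn from three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Recover the groups purely from the geometry of the points.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("Objects grouped by similarity (distance)")
plt.show()
```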


The Goals of Clustering

The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. Such grouping is pervasive in the way humans process information, and one of the motivations for using clustering algorithms is to provide automated tools to help in constructing categories or taxonomies. The methods may also be used to minimize the effects of human factors in the process.

Clustering methods can be divided into two basic types: 

  • hierarchical clustering, and
  • partitioned clustering.


Within each of the types, there exists a wealth of subtypes and different algorithms for finding the clusters.

Algorithms for clustering:

Connectivity-based clustering (hierarchical clustering)

Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect “objects” to form “clusters” based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram, which explains where the common name “hierarchical clustering” comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don’t mix.
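
A minimal sketch of this idea, assuming SciPy and matplotlib are available (the toy data and the chosen cut distance are illustrative assumptions): build the merge hierarchy from pairwise distances, plot the dendrogram, and cut it at a chosen distance to obtain one particular partition.

```python
# Connectivity-based (hierarchical) clustering: the dendrogram's y-axis shows
# the distances at which clusters merge, as described above.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# Toy data: two loose groups of 2-D points.
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])

# Agglomerative linkage: repeatedly merge the two closest clusters.
Z = linkage(X, method="single")   # single-link = minimum pairwise distance

# Cutting the hierarchy at a chosen distance yields one particular partition.
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)

dendrogram(Z)
plt.ylabel("merge distance")
plt.show()
```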

Centroid-based clustering

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.
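
As a small illustration, assuming scikit-learn (the toy data below is an assumption), a fitted k-means model exposes exactly these pieces: the k central vectors, the assignment of each object to its nearest center, and the minimized sum of squared distances (called inertia_ in scikit-learn).

```python
# A minimal k-means sketch showing the centers, the assignments, and the
# objective value (within-cluster sum of squared distances).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # the k central vectors (not necessarily data points)
print(km.labels_[:10])       # each object is assigned to its nearest center
print(km.inertia_)           # the minimized sum of squared distances
```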

Distribution-based clustering

The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution. A convenient property of this approach is that this closely resembles the way artificial data sets are generated: by sampling random objects from a distribution.

While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult.
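
A minimal sketch of distribution-based clustering with a Gaussian mixture model, assuming scikit-learn is available (the data and parameter choices are illustrative assumptions). Fixing the number of components and the covariance structure is one simple way of constraining model complexity, as discussed above.

```python
# Distribution-based clustering: each cluster is a Gaussian component, and an
# object is assigned to the component it most likely belongs to.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=1).fit(X)

labels = gmm.predict(X)            # hard assignment: most likely component
probs = gmm.predict_proba(X)[:5]   # soft assignment: membership probabilities
print(labels[:10])
print(probs)
```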

Density-based clustering

In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that separate clusters are usually considered to be noise or border points. The most popular density-based method is DBSCAN: it connects points that lie within a given distance of each other, as long as each point has enough neighbors for the region to count as dense.

Unlike centroid-based methods, density-based algorithms can discover clusters of arbitrary shape and do not need the number of clusters as an input, but they can struggle when clusters differ strongly in density.
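
A minimal DBSCAN sketch, assuming scikit-learn is available; the dataset and the eps and min_samples values below are illustrative, not tuned.

```python
# Density-based clustering: clusters are dense regions, and points in sparse
# areas are labeled -1 (noise).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape centroid-based methods handle poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise points, if any
```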

Possible Applications of clustering

Clustering algorithms can be applied in many fields, for instance:

  • Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
  • Biology: classification of plants and animals given their features;
  • Libraries: book ordering;
  • Insurance: identifying groups of motor insurance policyholders with a high average claim cost; identifying frauds;
  • City-planning: identifying groups of houses according to their house type, value and geographical location;
  • Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
  • WWW: document classification; clustering weblog data to discover groups of similar access patterns; recommendation engines.

Defining the optimal number of clusters

Determining the number of clusters in a data set, a quantity often labeled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
For a certain class of clustering algorithms (in particular k-means, k-medoids, and the expectation–maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms, such as DBSCAN and OPTICS, do not require this parameter to be specified; hierarchical clustering avoids the problem altogether.
The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the dataset, it must be chosen somehow.
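
One common heuristic, shown here purely as an illustration and not necessarily one of the methods referenced below, is to pick the k that maximizes the mean silhouette score (assuming scikit-learn; the dataset and candidate range are assumptions).

```python
# Choosing k automatically with the silhouette score: higher means clusters
# are more compact and better separated.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("chosen k:", best_k)
```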

An automated way of defining the optimal number of clusters:

There are several categories of methods for making this decision. I have described some of them, along with an automated way of defining the optimal number of clusters, in this article:

Requirements for Clustering

The main requirements that a clustering algorithm should satisfy are:

  • scalability;
  • dealing with different types of attributes;
  • discovering clusters with arbitrary shape;
  • minimal requirements for domain knowledge to determine input parameters;
  • ability to deal with noise and outliers;
  • insensitivity to order of input records;
  • ability to handle high dimensionality;
  • interpretability and usability.
Problems

There are a number of problems with clustering. Among them:

  • current clustering techniques do not address all the requirements adequately (and concurrently);
  • dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;
  • the effectiveness of the method depends on the definition of “distance” (for distance-based clustering);
  • if an obvious distance measure doesn’t exist, we must “define” it, which is not always easy, especially in multi-dimensional spaces;
  • the result of the clustering algorithm (which in many cases can itself be arbitrary) can be interpreted in different ways.
