Clustering analysis, or simply clustering, is an unsupervised learning approach that splits data points into a number of distinct batches or groups, so that data points in the same group have similar properties and data points in different groups have dissimilar properties in some sense. It comprises many different methods, which differ mainly in how they measure similarity and how they form the groups.
Fundamentally, all clustering methods follow the same approach: we first calculate similarities and then use them to group the data points into clusters. In this section we will concentrate on the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. Clusters are dense regions of the data space, separated by regions of lower point density. The DBSCAN algorithm is built around this intuitive notion of “clusters” and “noise.” The basic principle is that, for every point of a cluster, the neighborhood of a given radius must contain at least a minimum number of points.
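To make this radius-and-minimum-points idea concrete, here is a minimal sketch that counts how many points fall within a given radius of each point using sklearn's NearestNeighbors. The toy blob data and the eps value of 0.3 are illustrative assumptions, not part of the original example.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import NearestNeighbors

    # Toy data: four Gaussian blobs (an illustrative assumption).
    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)

    eps = 0.3  # illustrative radius
    nn = NearestNeighbors(radius=eps).fit(X)
    # For each point, count the points (including itself) within the eps radius.
    counts = np.array([len(idx) for idx in nn.radius_neighbors(X, return_distance=False)])

    # Points inside dense regions have many neighbors within eps; isolated points have few.
    print("min / median / max neighbors within eps:",
          counts.min(), int(np.median(counts)), counts.max())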
Why DBSCAN?
Partitioning algorithms (such as k-means and PAM) and hierarchical clustering can be used to find spherical or convex clusters. In other words, they are suitable only for compact and well-separated clusters. Moreover, they are strongly affected by the presence of noise and outliers in the data.
Real-life data may contain irregularities, such as:
- Clusters can be of arbitrary shape
- Data may contain noise.
The k-means algorithm has difficulty identifying clusters with such arbitrary shapes, as the sketch below illustrates.
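The following is a minimal sketch comparing k-means and DBSCAN on the classic two-moons dataset; the make_moons data and the eps/min_samples values are illustrative assumptions.

    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_moons

    # Two interleaving half-moons: non-convex clusters (illustrative data).
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # k-means looks for convex clusters, so it cuts across the two moons.
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # DBSCAN follows the dense, curved shape of each moon.
    db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

    print("k-means cluster sizes:", [list(km_labels).count(k) for k in set(km_labels)])
    print("DBSCAN cluster sizes:", [list(db_labels).count(k) for k in set(db_labels)])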
The DBSCAN algorithm requires two parameters:
- eps: It specifies the neighborhood of a data point: if the distance between two points is less than or equal to eps, they are considered neighbors. If the eps value is chosen too small, a large portion of the data will be considered outliers. If it is chosen too large, clusters will merge and the majority of the data points will end up in the same cluster. The k-distance graph can be used to choose a good eps value (see the sketch after this list).
- MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. The minimum value of MinPts should be at least 3.
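Here is a minimal sketch of the k-distance graph mentioned above: for a chosen MinPts, sort each point's distance to its MinPts-th nearest neighbor and look for the "elbow" of the curve. MinPts = 5 and the toy blob data are illustrative assumptions.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import NearestNeighbors

    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)

    min_pts = 5  # illustrative choice of MinPts
    # kneighbors on the training data includes each point itself as its nearest
    # neighbor, so the last column is the distance to the (min_pts - 1)-th other point.
    distances, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
    k_distances = np.sort(distances[:, -1])

    # The "elbow" of this curve is a reasonable value for eps.
    plt.plot(k_distances)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel("Distance to %d-th nearest neighbor" % min_pts)
    plt.title("k-distance graph for choosing eps")
    plt.show()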
In this algorithm, we have 3 types of data points.
Core point: A point is a core point if it has at least MinPts points (including itself) within eps.
Border point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise or outlier: A point that is neither a core point nor a border point.
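With sklearn, these three point types can be read off a fitted DBSCAN model, as in the minimal sketch below; it assumes the same toy blob data and parameter values used later in the example.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)
    db = DBSCAN(eps=0.3, min_samples=10).fit(X)

    core_mask = np.zeros_like(db.labels_, dtype=bool)
    core_mask[db.core_sample_indices_] = True   # core points
    noise_mask = db.labels_ == -1               # noise / outliers
    border_mask = ~core_mask & ~noise_mask      # in a cluster, but not core

    print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())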
The DBSCAN algorithm can be abstracted into the following steps:
Find all the neighboring points within eps of every point, and identify the core points, i.e., the points that have at least MinPts neighbors within eps.
Create a new cluster for each core point that is not currently assigned to one.
Recursively find all points that are density-connected to the core point and assign them to the same cluster.
Two points a and b are density-connected if there is a point c that has at least MinPts points within eps and both a and b can be reached from c within the eps distance. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a. Finally, iterate through the remaining unvisited points in the dataset; points that do not belong to any cluster are noise.
DBSCAN clustering algorithm in pseudocode:
    DBSCAN(dataset, eps, MinPts) {
        C = 1                                      # cluster index
        for each unvisited point p in dataset {
            mark p as visited
            N = find the neighboring points of p   # all points within eps of p
            if |N| < MinPts:
                mark p as noise
            else {
                add p to cluster C
                for each point p' in N {
                    if p' is not visited {
                        mark p' as visited
                        N' = find the neighboring points of p'
                        if |N'| >= MinPts:
                            N = N U N'             # p' is a core point: expand the search
                    }
                    if p' is not a member of any cluster:
                        add p' to cluster C
                }
                C = C + 1
            }
        }
    }

Implementation of the above algorithm in Python: here, we'll use the Python library sklearn to compute DBSCAN. We'll also use the matplotlib.pyplot library for visualizing the clusters.
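For intuition, the pseudocode also translates fairly directly into plain Python. The following is a minimal from-scratch sketch, not the implementation sklearn uses; the function name dbscan_sketch and the brute-force neighbor query are illustrative choices.

    import numpy as np

    def dbscan_sketch(X, eps, min_pts):
        """Minimal from-scratch sketch of the pseudocode above.
        Returns an array of labels; cluster indices start at 0 and noise is -1."""
        n = len(X)
        labels = np.full(n, -1)                 # -1 = noise / not yet assigned
        visited = np.zeros(n, dtype=bool)
        cluster = -1

        def neighbors_of(i):
            # Brute-force neighborhood query: all points within eps of point i.
            return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

        for i in range(n):
            if visited[i]:
                continue
            visited[i] = True
            seeds = neighbors_of(i)
            if len(seeds) < min_pts:
                continue                        # not a core point; stays noise unless reached later
            cluster += 1
            labels[i] = cluster
            # Chain outwards through density-connected points.
            while seeds:
                j = seeds.pop()
                if not visited[j]:
                    visited[j] = True
                    j_neighbors = neighbors_of(j)
                    if len(j_neighbors) >= min_pts:   # j is itself a core point
                        seeds.extend(j_neighbors)
                if labels[j] == -1:             # border (or previously noise) point joins the cluster
                    labels[j] = cluster
        return labels

In practice you would use sklearn's DBSCAN, as in the example below, which replaces the quadratic brute-force neighbor query with an indexed nearest-neighbor search.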
Evaluation Metrics
Furthermore, we will evaluate the clustering using the Silhouette score and the Adjusted Rand score.
The Silhouette score ranges from -1 to 1. A score close to 1 indicates that a data point is compactly grouped within the cluster it belongs to and far away from the other clusters. The worst possible value is -1. Values near 0 indicate overlapping clusters.
The Adjusted Rand score ranges from roughly 0 (random labeling) to 1 (perfect agreement). A score above 0.9 indicates excellent cluster recovery, above 0.8 fair recovery, and below 0.5 poor recovery.
Example
- Python3
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn import metrics
    from sklearn.datasets import make_blobs

    # Load data in X
    X, y_true = make_blobs(n_samples=300, centers=4,
                           cluster_std=0.50, random_state=0)

    db = DBSCAN(eps=0.3, min_samples=10).fit(X)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    print(labels)

    # Plot result
    # Black removed and is used for noise instead.
    unique_labels = set(labels)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = 'k'

        class_member_mask = (labels == k)

        # Core points are drawn larger than border points.
        xy = X[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=10)

        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)

    plt.title('number of clusters: %d' % n_clusters_)
    plt.show()

    # Evaluation metrics
    sc = metrics.silhouette_score(X, labels)
    print("Silhouette Coefficient:%0.2f" % sc)
    ari = metrics.adjusted_rand_score(y_true, labels)
    print("Adjusted Rand Index: %0.2f" % ari)
Output:
Silhouette Coefficient:0.13
Adjusted Rand Index: 0.31
Black points represent outliers. By changing eps and MinPts, we can change the cluster configuration, as the sketch below illustrates.
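For example, a minimal parameter sweep (the grid of eps and min_samples values is an illustrative assumption) shows how the number of clusters and noise points changes:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)

    # Sweep a small, illustrative grid of parameter values.
    for eps in (0.1, 0.3, 0.5):
        for min_samples in (5, 10, 20):
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = list(labels).count(-1)
            print("eps=%.1f, min_samples=%d -> %d clusters, %d noise points"
                  % (eps, min_samples, n_clusters, n_noise))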
Disadvantages of K-Means:
K-Means forms spherical clusters only, so the algorithm fails when the data is not spherical (i.e., does not have the same variance in all directions). K-Means is also sensitive to outliers: outliers can skew the clusters to a very large extent, as the sketch below illustrates.
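The following minimal sketch injects a single extreme outlier: k-means shifts its centroids (or even devotes a centroid to the outlier), while DBSCAN simply labels it as noise. The outlier coordinates and parameter values are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.50, random_state=0)
    X_out = np.vstack([X, [[50.0, 50.0]]])   # inject one extreme outlier (illustrative coordinates)

    # k-means: the centroids move once the outlier is added
    # (it either drags one centroid toward it or captures a centroid of its own).
    km_clean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    km_out = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_out)
    print("centroids without outlier:\n", km_clean.cluster_centers_)
    print("centroids with outlier:\n", km_out.cluster_centers_)

    # DBSCAN: the outlier is simply labeled as noise (-1).
    db = DBSCAN(eps=0.5, min_samples=10).fit(X_out)
    print("DBSCAN label of the outlier:", db.labels_[-1])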