Unsupervised learning is category of machine learning approach which deals with finding a pattern in the data under observation. In this approach input variables “X” are specified without actually providing corresponding mapped output variables “Y”
Supervised Vs Unsupervised Learning.
In supervised learning, the system tries to learn from the previous observations that are given. On contrary, in unsupervised learning, the system attempts to find the patterns directly in the given observations. Thus, labelled datasets falls into supervised problem, whereas unlabelled datasets falls into unsupervised problem. This can be explained using scatter plot mentioned below
Here, the scatter plot to the left is an example for supervised learning where we use regression techniques to find best fit line between the features to classify or differentiate them. Whereas, in the case of unsupervised learning(right) the inputs are sequestered – prediction is done based on various features to determine the cluster to which the current given input should belong.
Clustering is a type of unsupervised learning approach in which entire data set is divided into various groups or clusters. In simple terms, crux of this approach is to segregate input data with similar traits into clusters
This can be explained with an example mentioned below.
Here, scatter plot to the left is data where the clustering isn’t done yet. Whereas, scatter plot to the right is clustered i.e. the data is classified based on various features. When a particular input is fed into clustering algorithm, a prediction is done by checking which cluster should it belong to based on its features.
K-Means Clustering in Python
This is simplest clustering algorithm. The “K” in the k-means refers to the fact that the algorithm is look for “K” different clusters. Which means that a when a k-mean algorithm is applied to a data set then the algorithm will split he data set into “K” different clusters i.e. when we specify value of k=3, then the algorithm will the data set into 3 clusters.
- Segregate the data set into “k” groups or cluster
- Select k points at random as cluster centroids or seed points.
- Assign objects to their closest cluster on the basis of Euclidean distance function between centroid and the object.
- Determine the centroid (seed point) or mean of all objects in each cluster.
- Repeat steps number 2, 3 and 4 until the same data objects are assigned to each cluster in consecutive rounds.
Hierarchical clustering is bit different from K means clustering here data is assigned to cluster of their own. There are two approaches in hierarchical clustering they are bottom up approach and top down approach. In bottom up approach each data point is regarded as a cluster and then the two cluster which are closest to each other are merged to form cluster of clusters. The algorithm goes on till one cluster is left. Whereas, in top-down approach all the data points are regarded as one big cluster which is broken down into various small clusters.
Hierarchical clustering can be illustrated using a dendrogram which is mentioned below.
Algorithm for both the approaches is mentioned below
Agglomerative hierarchical algorithm (Bottom-Up approach):
- Let us begin by considering each data point as a single cluster. So, if we have ”N” data points in our data set. Thus, we have “N” different clusters
- In this step we will join two closely related cluster to form one one big cluster. Hence, in the end of this step we will be left with “N-1” cluster.
- Next, to form more big clusters we need to join two closest clusters. Hence , the result of this step will be total of “N-2” clusters.
- Repeat step 1,2,3 until we have one big cluster.
Divisive hierarchical algorithm /DIANA algorithm (Top-Down approach):
- In this step we regard all the points in the data set as one big cluster. We split this cluster into multiple clusters using flat clustering method.
- Choose the best cluster among all the newly created clusters to split.
- Now, split this newly selected cluster using flat clustering method.
- Repeat step 2,3 unit each data point is in its own singleton cluster.
Divisive algorithm is also more complex and accurate than agglomerative clustering. As agglomerative clustering makes decisions by considering the local patterns or neighbor points without initially taking into account the global distribution of data unlike divisive algorithm. These early decisions cannot be undone. whereas divisive clustering takes into consideration the global distribution of data when making top-level partitioning decisions.
DBSCAN Clustering (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN algorithm as the name suggests is a density based clustering algorithm. it tends to groups together data points from a particular dataset that are closely packed together (points with many nearby neighbours),and also marking as outliers points that lie alone in low-density regions.
Before starting on with the algorithm we need to highlight few parameters and the terminologies used. So, let us consider a set of data points that need to be clustered. Let ε (epsilon) be parameter which denotes the radius of the neighborhood with respect some point “p”. In case DBSCAN algorithm points are classified into core points, reachable points(boundary point) and outlier.
- A point is called core point if there are minimum points (MinPoint) within the ε distance of it by including that particular point.
- A point “X” is directly reachable from point “Y” if it is within epsilon distance from “Y”.
- A point “X” is reachable from point “Y” if there is path from Y1,…Yn with Y1=Y and Yn=X, where each Yi+1 is directly reachable from We have to make sure that initial point and all points on the path must be core points, with the possible exception of X.
- Any points which are not reachable from any other point are outliers or noise points.
NOTE: Only core points can reach non-core points. The opposite is not true
- For each data point form n dimensional shape of radius of “ε” around that data point.
- Count the number of data points that fall into that shape for a particular data point “p”. Repeat this step for all the data points in the data set.
- Check for a particular data point “p”, if the count >= MinPts then mark that particular data point as core point.
- Check for a particular data point “p”, if the count < MinPts and point “p” is within “ε” radius of any core point then mark point “p” as boundary point.
- Check for particular data point “p”, if the count<MinPts and point “p” is not in “ε” radius of any core point then mark point “p” as outlier.
- Repeat steps for 3,4,5 for all the points.
That’s a quick overview regarding important clustering algorithms.
Thanks for reading, Follow our website to learn the latest technologies, and concepts.
You can also check out our post on: Loss Function and Optimization Function