Understand Clustering Algorithms

The process of identifying groups of similar data points in a data set is known as clustering. Clustering, or cluster analysis, is an unsupervised learning process. It is usually used as a data analysis technique for identifying interesting patterns in data, such as grouping users based on their reviews.

There are different types of clustering algorithms, and which one fits depends on the problem statement. It is a good idea to explore a range of clustering algorithms and learn how to configure each one. As a basic overview, clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. First, we shall go through the different types of clustering.

Types of clustering:

  1. Hard clustering
  2. Soft clustering
  3. Hierarchical clustering
  4. Flat clustering
  5. Model-based clustering

1. Hard clustering: In this type of clustering, each data point is assigned to exactly one cluster. It is also known as exclusive clustering. The k-means algorithm is the best-known example.

2. Soft clustering: In soft clustering, a data point can belong to more than one cluster, usually with a degree of membership in each. This is also known as overlapping clustering. The fuzzy k-means algorithm is an example of soft clustering.

3. Hierarchical clustering: In hierarchical clustering, a hierarchy of clusters is built using either a top-down (divisive) or a bottom-up (agglomerative) approach.

4. Flat clustering: This is a simpler technique in which the clusters have no hierarchy; the data is divided directly into a single set of groups.

5. Model-based clustering: In the model-based technique, the data is modeled using a standard statistical model, such as a mixture of probability distributions. The idea is to find the model that best fits the data.
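
To make the difference between hard and soft clustering concrete, here is a minimal sketch for one-dimensional points and fixed cluster centers. The function names, the sample centers, and the fuzzifier parameter `m` are assumptions made for illustration; the membership formula is the one commonly used in fuzzy k-means (fuzzy c-means).

```python
def hard_assign(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    return min(range(len(centers)), key=lambda j: abs(point - centers[j]))


def soft_assign(point, centers, m=2.0):
    """Soft clustering: return a membership degree for every center.

    The degrees sum to 1. `m` is the fuzzifier (assumed value 2.0);
    larger values make the memberships more evenly spread.
    """
    dists = [abs(point - c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if 0.0 in dists:
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]


centers = [0.0, 10.0]             # two hypothetical cluster centers
hard = hard_assign(4.0, centers)  # one cluster index
soft = soft_assign(4.0, centers)  # one degree per cluster
print(hard, [round(u, 3) for u in soft])  # prints: 0 [0.692, 0.308]
```

The point 4.0 is closer to the first center, so hard clustering places it entirely in cluster 0, while soft clustering gives it about 69% membership in cluster 0 and 31% in cluster 1.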

Clustering algorithms:

  1. k-Means
  2. Mean Shift Clustering
  3. DBSCAN
  4. Expectation Maximization Clustering
  5. Agglomerative Hierarchical Clustering

1. k-Means Clustering:

k-Means is a basic technique used in machine learning. It was developed in 1967 by the researcher James MacQueen. k-Means works on unlabeled data. It is a type of unsupervised learning, and it is one of the most popular unsupervised learning techniques, known for its simplicity and efficiency. It is used when the data has no predefined groups or categories.

It determines the centroid of each cluster and then assigns each data point to the cluster with the nearest centroid. In simple words, each centroid is the average of the data points in its cluster.

Working of the k-Means algorithm:
  1. First, define the number of clusters (k).
  2. Initialize the centroids by shuffling the dataset and then selecting k data points at random, without replacement.
  3. Keep iterating until the centroids no longer change:
    1. Calculate the squared distance between each data point and every centroid.
    2. Assign each point to the cluster with the nearest centroid.
    3. Recompute each centroid as the average of all data points assigned to its cluster.
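
The steps above can be sketched in plain Python roughly as follows. This is a minimal illustration and not a production implementation: the function names, the `seed` and `max_iter` parameters, and the sample points are all assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Steps 3.1 and 3.2: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3.3: move each centroid to the average of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # Step 3: stop once nothing changes.
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

On this small data set the two centroids end up at the averages of the two groups, roughly (1.33, 1.33) and (8.33, 8.33). On less cleanly separated data, the result can depend on the random initialization.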

2. Mean Shift Clustering:

In the previous algorithm, the number of clusters has to be defined before processing begins, which is a drawback of k-means. The mean shift algorithm is used to overcome this problem.

It is a non-parametric algorithm: it makes no assumptions about the number of clusters or their shape. It assigns data points to clusters by iteratively shifting candidate points toward the regions of highest data density, which become the cluster centroids. Outliers cause few problems for it. It is used to find clusters with complex shapes.

Simple steps of the mean shift algorithm:
  1. In the first step, every data point starts as its own cluster candidate.
  2. Then the algorithm computes the centroid (mean) of the points in the neighborhood of each candidate.
  3. It moves each candidate to the location of the newly computed centroid.
  4. Each move shifts the candidate toward a region of higher density.
  5. It checks whether the candidates have moved: if they have not moved any further, the process stops; otherwise it repeats from step 2.
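
As a rough sketch of these steps, the code below uses a flat (uniform) window of radius `bandwidth` around each candidate. The function names, the `bandwidth`, `tol`, and `max_iter` parameters, and the sample data are assumptions for illustration; real implementations often use other kernels and faster neighbor searches.

```python
import math


def mean_of(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def shift_to_mode(start, data, bandwidth, tol=1e-6, max_iter=100):
    """Move a candidate to the mean of its window until it stops moving."""
    current = start
    for _ in range(max_iter):
        window = [q for q in data if math.dist(current, q) <= bandwidth]
        shifted = mean_of(window)
        if math.dist(shifted, current) < tol:  # candidate no longer moves
            return shifted
        current = shifted
    return current


def mean_shift(data, bandwidth):
    # Every data point starts as its own candidate and climbs to a mode.
    modes = [shift_to_mode(p, data, bandwidth) for p in data]
    # Candidates that end up close together form one cluster.
    centers, labels = [], []
    for mode in modes:
        for idx, center in enumerate(centers):
            if math.dist(mode, center) < bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return centers, labels


# Hypothetical sample data: two well-separated groups of three points.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = mean_shift(data, bandwidth=3.0)
print(len(centers), labels)  # prints: 2 [0, 0, 0, 1, 1, 1]
```

Note that the number of clusters is never passed in; it follows from the data and the chosen bandwidth.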

3. DBSCAN:

DBSCAN (density-based spatial clustering of applications with noise) is an algorithm that detects areas of high density that are separated by regions of low density. It can discover clusters of different shapes, and it works well on large amounts of data that contain noise and outliers.

The two major concepts of DBSCAN are listed below:

  1. Reachability: a point is reachable from a core point if it lies within a given distance (eps) of it, either directly or through a chain of core points.
  2. Connectivity: two points are connected if both are reachable from a common core point.

DBSCAN Algorithm Steps:
  1. Initially, an arbitrary point that has not been visited is picked, and its neighborhood is retrieved using the distance parameter (eps).
  2. If the neighborhood contains at least minPts points, a cluster is started; otherwise the point is labeled as noise (reachability and connectivity play a vital role in this step).
  3. If a point is a core point, all the points in its neighborhood become part of its cluster, and the neighborhoods of those that are themselves core points are added as well.
  4. The above steps repeat until the full cluster is found.
  5. Once a full cluster has been found, the process starts again from a new unvisited point, which either begins a new cluster or is labeled as noise.
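
A compact sketch of these steps in plain Python is shown below. The names `region_query`, `eps`, and `min_pts`, the convention that a point counts as its own neighbor, and the sample data are assumptions for illustration. The neighbor search here is a simple linear scan, which would be too slow for large data sets.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later turn out to be a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point too
                queue.extend(j_neighbors)
    return labels


# Hypothetical sample data: two dense groups and one isolated outlier.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (20, 0)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # prints: [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (20, 0) has no neighbors within `eps`, so it is reported as noise instead of being forced into a cluster.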

4. Expectation Maximization Clustering:

This algorithm is used to find maximum-likelihood estimates of a model's parameters when the dataset is incomplete, that is, when some values are missing or cannot be observed directly. In clustering, the unobserved value is typically the cluster that each data point belongs to. It is an iterative method for approximating the maximum of the likelihood function. In simple terms, it makes it possible to estimate the parameters even though some of the data is missing. It starts from an initial guess of the parameters and uses each round of newly estimated values to produce a better guess. The given data is used to estimate the missing data, and then the parameters are updated from the completed data.

The following steps are used to find the model parameters:

  1. In the first step, a set of starting parameters is chosen for the incomplete data.
  2. Expectation (E-step): the current parameters are used to estimate the missing values in the given data.
  3. Maximization (M-step): the completed data is used to update the parameters.
  4. If the estimates have converged, the process stops; otherwise steps 2 and 3 are repeated.

An easy way to learn the EM clustering algorithm:

  1. In the initial stage, the system is given an incomplete dataset with unobserved values, under the assumption that the data was generated by a specific model.
  2. The expectation step estimates the unobserved values from the observed data and updates those variables.
  3. The maximization step uses the data completed in the previous step (step 2) to update the model parameters.
  4. The last step checks whether the values have converged; if not, steps 2 and 3 are repeated until they do.
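
As an illustration of the expectation and maximization steps in a clustering setting, the sketch below fits a mixture of one-dimensional Gaussians. Using a Gaussian mixture as the model is an assumption made here, as are the function names, the starting parameters, the variance floor, and the sample data. The unobserved value is the cluster that each point belongs to.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gaussian_mixture(data, means, variances, weights, max_iter=100, tol=1e-9):
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(max_iter):
        # E-step: estimate how strongly each point belongs to each component.
        resp = []
        for x in data:
            scores = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                      for j in range(k)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: update the parameters from the soft assignments.
        old_means = list(means)
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            weights[j] = n_j / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing variance
        # Convergence check: stop once the means barely change.
        if max(abs(a - b) for a, b in zip(means, old_means)) < tol:
            break
    return means, variances, weights


# Hypothetical sample data drawn around 1.0 and around 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_gaussian_mixture(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print([round(m, 2) for m in means])
```

Starting from the rough guesses 0.0 and 6.0, the estimated means move to approximately 1.0 and 5.0, the centers of the two groups in the sample data.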

5. Agglomerative Hierarchical Clustering:

This clustering algorithm is very popular for grouping objects based on their similarity. It starts by treating each object as a single cluster. It then keeps merging the most similar pair of clusters until all objects are grouped into one single cluster.

Steps of agglomerative hierarchical clustering, the easy way:
  1. In the initial stage, the data is gathered and prepared for processing.
  2. The dissimilarity (distance) between each pair of clusters is calculated.
  3. A linkage method is used to group the objects into a hierarchical cluster tree: the clusters that are most similar are linked into a single cluster.
  4. Finally, it is necessary to decide where to cut the hierarchical tree into clusters in order to separate the data.
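
The bottom-up merging can be sketched as follows. Single linkage (the distance between the two closest members) is used here as one possible linkage method; that choice, the function names, and the sample data are assumptions for illustration. Stopping at `n_clusters` plays the role of cutting the tree at a chosen level.

```python
import math


def single_linkage(a, b):
    """Distance between two clusters: the gap between their closest members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, n_clusters, linkage=single_linkage):
    # Every object starts as its own cluster.
    clusters = [[p] for p in points]
    merge_distances = []  # linkage distance of each merge, in order
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merge_distances.append(linkage(clusters[i], clusters[j]))
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_distances


# Hypothetical sample data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, merge_distances = agglomerative(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

With `n_clusters=1` the loop keeps merging until a single cluster remains, and the recorded merge distances show where a large jump occurs, which is a common hint for where to cut the tree.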

Conclusion:

That's all about these algorithms. I hope it was a good read and that you now understand how each algorithm works. You can also check this post on unsupervised learning.


Divya Kalwani

Final year student in computer engineering
