{"id":25524158,"date":"2022-05-13T20:00:31","date_gmt":"2022-05-13T14:30:31","guid":{"rendered":"https:\/\/entri.app\/blog\/?p=25524158"},"modified":"2022-05-13T18:28:57","modified_gmt":"2022-05-13T12:58:57","slug":"top-clustering-algorithms-data-scientists-should-know","status":"publish","type":"post","link":"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/","title":{"rendered":"Top Clustering Algorithms Data Scientists Should Know"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_79_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-69e781062341d\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-69e781062341d\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" 
href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Introduction\" >Introduction<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Hierarchical_Clustering\" >Hierarchical Clustering<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#K-means_Clustering\" >K-means Clustering<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Gaussian_Mixture_Models\" >Gaussian Mixture Models<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Mean_Shift_Clustering\" >Mean Shift Clustering<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Density-based_Spatial_Clustering_of_Applications_With_Noise_DBSCAN\" >Density-based Spatial Clustering of Applications With Noise (DBSCAN)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Expectation%E2%80%93Maximization_Algorithm_EM\" >Expectation\u2013Maximization Algorithm (EM)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Principal_Component_Analysis_PCA\" >Principal Component Analysis (PCA)<\/a><\/li><li class='ez-toc-page-1 
ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Self-Organizing_Maps_SOM\" >Self-Organizing Maps (SOM)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Label_Propagation_LP\" >Label Propagation (LP)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/#Spectral_Clustering\" >Spectral Clustering<\/a><\/li><\/ul><\/nav><\/div>\n<p><span>Clustering is one of the most common techniques used by data scientists to organize data into groups with similar features. 
Clustering algorithms are commonly used in image analysis, speech recognition, artificial intelligence, and almost any other field that requires grouping large amounts of unstructured data. In this article, we will go over 10 different clustering algorithms you should know as a data scientist, including examples of how they can be applied and when they should be avoided. <\/span><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">Clustering algorithms<\/a> are one of the most popular topics when it comes to data science, and rightly so \u2014 they\u2019re often used in areas where the goal is to group objects together on some level, such as customer segmentation and market segmentation, or in data mining tasks like anomaly detection. Given their popularity, it\u2019s useful to learn what other people consider the best clustering algorithms out there, so here are my top 10 favorite ones and why I like them! <span>When analyzing data, clustering algorithms group elements that are similar and separate those that are different. 
They do this by comparing the features of each data point and grouping together the points whose features look most alike.<\/span><\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">Get the latest updates on data science in the Entri app<\/a><\/p>\n<h2><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-25520997 size-full\" src=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle.png\" alt=\"\" width=\"970\" height=\"250\" srcset=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle.png 970w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-300x77.png 300w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-768x198.png 768w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-750x193.png 750w\" sizes=\"auto, (max-width: 970px) 100vw, 970px\" \/><\/a><\/h2>\n<h2><span class=\"ez-toc-section\" id=\"Introduction\"><\/span><strong>Introduction<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Machine learning is widely used in data science to predict future trends based on historical data. A collection of algorithms is applied to large data sets to identify common patterns, which can be used to forecast future outcomes. The objective of clustering algorithms is to group similar data points together so that they become easier to analyze and manage. In marketing research, for example, customers might be categorized into different groups depending on their demographic profiles. These groups can then be used as a basis for market segmentation strategies. 
There are a number of approaches that can be used when designing a cluster analysis model, including k-means and hierarchical <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">clustering algorithms<\/a>\u2014let\u2019s start with those two. The k-means algorithm: This algorithm has been popularized by its use in market segmentation applications. It works by identifying centroids (which represent clusters) from inputted data, after which each observation is assigned to its nearest centroid using an iterative approach until all observations have been allocated to one of these clusters. Hierarchical clustering: This type of algorithm also identifies clusters but does so using an agglomerative approach, meaning that it begins with each point representing its own cluster and combines them into larger clusters until there is only one left.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">To know more about data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Hierarchical_Clustering\"><\/span><strong>Hierarchical Clustering<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Because clustering is a classic unsupervised learning task, I&#8217;ll start with one of my favorites: hierarchical clustering. Hierarchical clustering builds a hierarchy of clusters from your data using a single distance metric, such as Euclidean distance or Manhattan distance. It also uses a linkage criterion to define how two clusters are combined into a parent cluster. In most cases, hierarchical clustering produces several levels of cluster groups that form what&#8217;s called a dendrogram. 
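<\/p>
<p>The bottom-up merging just described can be sketched in plain Python. This is a toy single-linkage example on made-up 1-D data (a real project would use a library implementation), not a production version:<\/p>

```python
# Toy agglomerative (bottom-up) hierarchical clustering with single linkage,
# in plain Python. The data values are made up for illustration.

def single_linkage(points):
    # Each point starts as its own cluster.
    clusters = [[p] for p in points]
    merges = []  # history of merges, from the leaves toward the root
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest minimum pairwise distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# 1-D data with two obvious groups; the final merge joins the two
# top-level groups at the largest distance.
merges = single_linkage([1.0, 1.5, 2.0, 8.0, 8.5])
```

<p>Each entry in merges records which two groups were joined and at what distance; plotting those merge heights against the groups is exactly what a dendrogram visualizes.<\/p>
<p>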
The bottom-most level of the tree consists of individual records (the leaf nodes); higher levels represent increasingly <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">abstract groupings<\/a>, all the way up to the root node, which represents your entire dataset. A nice feature of hierarchical clustering is that you can cut the tree at whatever level you like, depending on whether you want one coarse grouping or many fine-grained ones. Although it may not always result in perfect clusters\u2014there&#8217;s no guarantee all members of a given cluster are more similar than they are different\u2014it does produce meaningful groupings for many datasets. And because it doesn&#8217;t require ground-truth labels for each record, it&#8217;s often used for exploratory analysis where you&#8217;re trying to understand your data better without necessarily having specific questions about it.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">Enroll in our latest data science course in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"K-means_Clustering\"><\/span><strong>K-means Clustering<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In short, K-means is a type of clustering algorithm that attempts to group data into a pre-defined number of clusters. What is interesting about K-means is that it represents each cluster by an additional point (called a centroid) against which data points are compared in order to determine their appropriate cluster. In its most basic form, K-means takes n observations and partitions them into k clusters, where each observation belongs to the cluster with the <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">centroid closest<\/a> to it. For example, given 10 observations at (1,2), (3,4), (5,6),&#8230;,(9,8), setting k=2 would mean partitioning them into two clusters, each summarized by its own centroid. 
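<\/p>
<p>In its simplest form, the whole procedure fits in a few lines of plain Python. The points, the value of k and the initialization below are made-up illustrative choices, not a production implementation:<\/p>

```python
# Toy k-means in plain Python (illustrative sketch only).

def kmeans(points, k, iters=20):
    # Spread the initial centroids across the input (real implementations
    # usually pick random points or use k-means++ seeding).
    step = max(1, (len(points) - 1) // max(1, k - 1))
    centroids = [points[min(i * step, len(points) - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest
        # centroid, by squared Euclidean distance ||x - c||^2.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        # Update step: each centroid moves to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, clusters

# Two well-separated groups of 2-D points.
pts = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
centroids, clusters = kmeans(pts, k=2)
```

<p>After the loop converges, each centroid sits at the mean of its cluster, which is exactly the "means" the algorithm is named after.<\/p>
<p>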
The algorithm then iterates over all observations and recalculates which cluster they belong to until convergence. The distance between a point x and a centroid c is the squared Euclidean distance, d(x, c) = ||x &#8211; c||^2; at each iteration, this value determines which cluster an observation should belong to.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">Get the latest updates on data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Gaussian_Mixture_Models\"><\/span><strong>Gaussian Mixture Models<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>A Gaussian mixture model (GMM) assumes your data were generated from a mixture of several Gaussian distributions, each with its own mean and covariance. Instead of assigning every point to exactly one cluster, a GMM computes the probability that each point belongs to each component, so cluster boundaries are soft rather than hard. That makes GMMs a good fit when clusters overlap or have elongated, elliptical shapes that centroid-based methods handle poorly. As with k-means, you choose the number of components up front: too few and distinct groups get merged; too many and the extra clusters will <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">become meaningless<\/a>. If you need more detail, check out our deep dive into Gaussian mixture models. Is there anything faster? Turns out yes: k-means, but with a catch\u2014you lose flexibility. K-means makes hard assignments and implicitly assumes spherical, similarly sized clusters, whereas a mixture model can adapt each component\u2019s shape and size to the data. 
Like mean shift, k-means has many variations for different needs, such as k-medoids for skewed data or hierarchical k-means for very large datasets containing many distinct clusters.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">To know more about data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Mean_Shift_Clustering\"><\/span><strong>Mean Shift Clustering<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Mean shift is a non-parametric clustering algorithm that can be used when you have limited knowledge of your data. It was introduced by Fukunaga and Hostetler in 1975 and later popularized by Comaniciu and Meer, and it is often used as an alternative to K-means. Unlike K-means, mean shift does not require you to specify how many clusters you want at the outset; it repeatedly shifts each point toward the densest nearby region, and the number of clusters emerges from the data. If you\u2019re interested in learning more about mean shift, check out our post on what it is and how to use it. Also check out Fast Forward Labs for some great visualization tools for mean shift analysis. Phase-based algorithms: There are a number of different phase-based algorithms that we could discuss here, but let\u2019s focus on one: SOMs (Self Organizing Maps). SOMs are useful because they allow you to map multiple dimensions onto one another so that similar things tend to cluster together. This makes them especially useful for visualizing relationships between concepts. Like many other clustering algorithms, SOMs were developed in response to two important questions: How do I cluster my data? 
How do I know when I should stop?<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">Get the latest updates on data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Density-based_Spatial_Clustering_of_Applications_With_Noise_DBSCAN\"><\/span><strong>Density-based Spatial Clustering of Applications With Noise (DBSCAN)<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>DBSCAN (pronounced dee-bee-scan) is a density-based clustering algorithm, which means it clusters nearby similar data points. It was developed by Martin Ester, Hans-Peter Kriegel, J\u00f6rg Sander and Xiaowei Xu in 1996 as an alternative to centroid-based algorithms such as k-means. DBSCAN is not an exhaustive search algorithm but rather relies on local information; thus it performs well even when clusters have irregular shapes or when only small amounts of data are available. The algorithm groups together nearby data <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">points based<\/a> on their density &#8211; i.e., how many other points lie close to each point. Regions where many points lie close together become clusters, while points in sparse regions do not. Points that do not appear to be part of any cluster can be removed from further consideration since they are probably anomalies. DBSCAN uses a parameter called epsilon (\u03b5), which determines how far away from another data point a given one can be while still being considered part of its cluster, together with a minimum number of points (minPts) that an \u03b5-neighborhood must contain. A point with at least minPts neighbors within \u03b5 is a core point, and clusters grow outward from core points. 
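<\/p>
<p>This neighborhood-growing procedure can be sketched in plain Python. The example below works on made-up 1-D data with illustrative eps and min_pts values; a real project would use a library implementation with spatial indexing:<\/p>

```python
# A minimal DBSCAN sketch in plain Python on 1-D data (toy example).

def dbscan(points, eps, min_pts):
    labels = {}      # point index -> cluster id, or -1 for noise
    cluster_id = 0

    def neighbors(i):
        # Indices within eps of point i (includes i itself).
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1               # not a core point: provisionally noise
            continue
        labels[i] = cluster_id           # start a new cluster at this core point
        queue = [j for j in nbrs if j != i]
        while queue:                     # grow the cluster outward
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id   # noise reachable from a core point: border
            if j in labels:
                continue
            labels[j] = cluster_id
            jn = neighbors(j)
            if len(jn) >= min_pts:       # j is itself a core point: keep expanding
                queue.extend(k for k in jn if k not in labels)
        cluster_id += 1
    return labels

# Two dense groups plus one isolated point, which ends up labeled -1 (noise).
pts = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.9]
labels = dbscan(pts, eps=0.3, min_pts=2)
```

<p>Note that the isolated point at 9.9 never has enough neighbors to become a core point, so it stays labeled as noise rather than being forced into a cluster.<\/p>
<p>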
The final set of clusters is returned once every point has either been assigned to a cluster or labeled as noise.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">To know more about data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Expectation%E2%80%93Maximization_Algorithm_EM\"><\/span><strong>Expectation\u2013Maximization Algorithm (EM)<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The expectation\u2013maximization algorithm (EM), introduced by Dempster, Laird and Rubin in 1977, is a general-purpose estimation algorithm used in statistics and data mining. It has several useful applications in machine learning, such as density estimation for Gaussian mixture models and handling missing data. EM alternates between two steps. The Expectation (E) step uses the current parameter estimates to compute, for each observation, the posterior probability\u2014the responsibility\u2014that it was generated by each latent cluster; this is an application of Bayes&#8217; rule. The Maximization (M) step then re-estimates the parameters (for a Gaussian mixture: each component\u2019s mean, covariance and mixing weight) so as to maximize the expected log-likelihood computed in the E step. Writing X for the vector of observed variables, y for the vector of latent variables or clusters (to which we have no direct access) and \u03b8 for the parameters, the E step computes Q(\u03b8 | \u03b8(t)) = E[log L(\u03b8; X, y) | X, \u03b8(t)], and the M step sets \u03b8(t+1) to the value of \u03b8 that maximizes Q. 
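<\/p>
<p>Concretely, for a one-dimensional two-component Gaussian mixture the two steps look like this in plain Python. This is a bare-bones sketch with fixed unit variances and equal mixing weights (a full implementation would update those in the M step too); the data and starting means are made up:<\/p>

```python
# EM for a two-component 1-D Gaussian mixture, in plain Python.
import math

def em_gmm(xs, mu, iters=50):
    # Unit variances and equal weights are fixed to keep the sketch short.
    for _ in range(iters):
        # E step: responsibility of component 0 for each point (Bayes' rule).
        r = []
        for x in xs:
            p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            r.append(p0 / (p0 + p1))
        # M step: re-estimate each mean as a responsibility-weighted average.
        mu = [sum(ri * x for ri, x in zip(r, xs)) / sum(r),
              sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)]
    return mu

# Two groups centered near 0.2 and 4.2; EM recovers both means even
# though we start from a poor guess.
xs = [0.0, 0.2, 0.4, 4.0, 4.2, 4.4]
mu = em_gmm(xs, mu=[0.0, 1.0])
```

<p>Each pass through the loop performs one E step and one M step; after a few dozen iterations the two means settle near the centers of the two groups.<\/p>
<p>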
Iterating these two steps is guaranteed never to decrease the likelihood, which is why EM converges so reliably in practice.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">Get the latest updates on data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Principal_Component_Analysis_PCA\"><\/span><strong>Principal Component Analysis (PCA)<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The most common use of PCA is to summarize data sets that are too large to visualize by projecting them onto a smaller number of dimensions, much as a map summarizes a three-dimensional landscape in two dimensions. PCA can also reveal structure in data that you may not have thought was there. (In fact, it&#8217;s an essential ingredient in modern machine learning methods like neural networks and deep learning.) In some cases, PCA can tell you which combinations of variables are actually telling you something about your data\u2014something that exploratory factor analysis would miss. For instance, if a single combination of factors A and B explains 90% of the variance in your results even though neither factor explains much on its own, it could be valuable information for <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">scientists trying<\/a> to understand their experiments. Another potential use of PCA is dimensionality reduction. Let\u2019s say you have one million examples, each with 1,000 features; PCA might help reduce each example to just 10 principal components. It won&#8217;t necessarily uncover hidden insights in your data, but it will let you work with less complexity so you can spend more time on other tasks without missing out on key details. 
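<\/p>
<p>As a small illustration, here is the first principal component of some 2-D data computed in plain Python from the covariance matrix, using the closed-form eigenvector of a 2x2 symmetric matrix. The points are made up, and real code would use a linear-algebra library:<\/p>

```python
# First principal component of 2-D data in plain Python (toy sketch).
import math

def first_pc(data):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in data) / n
    c = sum((y - my) ** 2 for _, y in data) / n
    b = sum((x - mx) * (y - my) for x, y in data) / n
    # Angle of the leading eigenvector, via the 2x2 closed form
    # tan(2*theta) = 2b / (a - c).
    theta = 0.5 * math.atan2(2 * b, a - c)
    return (math.cos(theta), math.sin(theta))

# Points lying (noisily) along the line y = x: the first component
# should point along that diagonal, roughly (0.707, 0.707).
pts = [(0, 0.1), (1, 0.9), (2, 2.1), (3, 2.9), (4, 4.1)]
v = first_pc(pts)
```

<p>Projecting each point onto v gives its one-dimensional summary; with more dimensions you would keep the top few eigenvectors instead of just one.<\/p>
<p>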
This isn&#8217;t always practical or possible (if the variance is spread evenly across all of your features, the first few components won&#8217;t capture much), but if dimensionality reduction does make sense for your problem, PCA should be at least part of your toolkit.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">To know more about data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Self-Organizing_Maps_SOM\"><\/span><strong>Self-Organizing Maps (SOM)<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>SOM is a type of artificial neural network (ANN) used in unsupervised machine learning, which is employed when unlabeled data are inputted. In these cases, SOM learns to group data according to similarity automatically through an iterative process called self-organization. The goal of SOM is to create a low-dimensional representation of high-dimensional data. Usually, SOMs are trained by an iterative algorithm: each input vector is presented to a grid of nodes, the node whose weights best match the input (the best matching unit) is found, and that node and its <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">neighboring groups<\/a> of nodes are nudged toward the input. Over many iterations, nearby nodes come to represent similar inputs, so the grid becomes a smooth low-dimensional map of the data. The k-means clustering algorithm: K-means clustering is another popular method for cluster analysis, which has its origins in signal processing and statistics. It&#8217;s often used as a first approach when data need to be clustered into two or more groups, but it can also be applied to many other situations. 
As its name suggests, k-means clustering involves defining k different centers that act as cluster prototypes; after that point, every new observation gets assigned to whichever prototype it is closest to.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">Get the latest updates on data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Label_Propagation_LP\"><\/span><strong>Label Propagation (LP)<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>LP is a clustering algorithm that works by creating clusters in an iterative manner. Starting from a small set of labeled objects, each iteration goes over the unlabeled objects (data points) and assigns each one a label based on which cluster it is closer to. It does this in a way that keeps between-cluster distances large for labeled objects (i.e., it tries to place nearby data points into one cluster and distant ones into another). The <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">main advantage<\/a> of LP is its simplicity and speed. A disadvantage of LP is that it doesn\u2019t always produce optimal results, especially when there are few data points or they don\u2019t form clear clusters. Another problem is that labeling can be arbitrary: if you have many clusters, your data point could be assigned to any of them, leading to random assignment instead of optimizing for similarity. Finally, LP tends to get stuck in local minima\u2014it may not find a globally optimal solution. 
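<\/p>
<p>A bare-bones version of the propagation idea fits in a few lines of plain Python: labels spread outward from a couple of seed nodes across graph edges until assignments stop changing. The graph, seeds and function name below are made up for illustration:<\/p>

```python
# Minimal label propagation sketch on an undirected graph (toy example).

def propagate(edges, seeds, nodes, iters=10):
    labels = dict(seeds)
    # Build an adjacency list for the undirected graph.
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    for _ in range(iters):
        for n in nodes:
            if n in seeds:
                continue  # seed labels stay fixed
            # Adopt the most common label among already-labeled neighbors.
            votes = {}
            for m in adj[n]:
                if m in labels:
                    votes[labels[m]] = votes.get(labels[m], 0) + 1
            if votes:
                labels[n] = max(votes, key=votes.get)
    return labels

# Two chains seeded at opposite ends: label 0 spreads over a-b-c,
# label 1 over d-e-f.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")]
labels = propagate(edges, {"a": 0, "d": 1}, nodes)
```

<p>Because each unlabeled node simply copies the majority label of its neighbors, connected regions of the graph end up sharing a label, which is exactly the clustering behavior described above.<\/p>
<p>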
The k-means algorithm: K-means is probably one of the most well known algorithms out there because it has been in constant use since it was introduced in the 1960s.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">To know more about data science in the Entri app<\/a><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Spectral_Clustering\"><\/span><strong>Spectral Clustering<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>A method of unsupervised machine learning that is based on spectral graph theory. It can be used to discover hidden structures within a data set (and is particularly good at finding natural groupings, or clusters, in multivariate data). It works by building a similarity graph over the data points, computing eigenvectors of the graph\u2019s Laplacian matrix, and then clustering the points in that low-dimensional spectral embedding (often with k-means). Although it\u2019s not widely used, it\u2019s one of my favorite clustering algorithms because I find its results very aesthetically pleasing (I like pretty graphs!). If you want to learn more about how it works, I highly recommend reading Shi and Malik\u2019s classic <a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">paper on<\/a> normalized cuts, which introduced one of the most popular variants of the approach. K-means: Given a set of n observations, k-means will attempt to identify k clusters within those observations using an iterative algorithm that calculates centroids for each cluster and updates those values after every iteration until convergence has been reached. If you are interested in learning new coding skills, the Entri app will help you acquire them very easily. The Entri app follows a structured study plan so that students can learn very easily. If you don&#8217;t have a coding background, that won&#8217;t be a problem. 
You can download the Entri app from the google play store and enroll in your favorite course.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/w75k\/zvbw\">Get the latest updates on data science in the Entri app<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Clustering is one of the most common techniques used by data scientists to organize data into groups with similar features. Clustering algorithms are commonly used in image analysis, speech recognition, artificial intelligence, and almost any other field that requires grouping large amounts of unstructured data. In this article, we will go over 10 different clustering [&hellip;]<\/p>\n","protected":false},"author":93,"featured_media":25524165,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[802,1864,1882,1883,1881],"tags":[],"class_list":["post-25524158","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-articles","category-data-science-ml","category-java-programming","category-react-native","category-web-android-development"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Top Clustering Algorithms Data Scientists Should Know - Entri Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/entri.app\/blog\/top-clustering-algorithms-data-scientists-should-know\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Top Clustering Algorithms Data Scientists Should Know - Entri Blog\" \/>\n<meta property=\"og:description\" content=\"Clustering is one of the most common techniques used by data scientists to organize data into 
Top Clustering Algorithms Data Scientists Should Know

By Akhil M G · Entri Blog · Published 2022-05-13 · Est. reading time: 13 minutes

Clustering is one of the most common techniques used by data scientists to organize data into groups with similar features. Clustering algorithms are commonly used in image analysis, speech recognition, artificial intelligence, and almost any other field that requires grouping large amounts of unstructured data. In this article, we will go over 10 different clustering […]