Unsupervised Learning in Machine Learning: Intuition Behind Clustering in Unsupervised Machine Learning
When it comes to analyzing and making sense of data from the past, and understanding the future world based on that data, we rely on machine learning methodologies. This field of machine learning, as I have discussed in my past articles on machine learning fundamentals, is broadly categorized into:
- Supervised Machine Learning
- Unsupervised Machine Learning
To understand supervised ML, please visit my earlier article.
Clustering: The World of Unsupervised Machine Learning
Today, we will dig deeper into the world of unsupervised learning. To help you grasp the concept, let me use the example of e-commerce portals like Flipkart, Amazon, etc.
"Do you know how the e-commerce giants you use every day manage to segment a huge list of products into various categories, with an intelligence that customizes your browsing experience based on how you navigate their portal?"
This tailor-made intelligence for categorizing products is made possible by one of the popular unsupervised learning techniques called clustering, where a set of customers is grouped based on their behavior, and the data points generated by those user segments are analyzed to offer tailor-made services.
So, some of the popular examples are:
- Market segmentation
- Product segmentation
- User segmentation
- Organizing system files into groups of folders
- Organizing emails into different folder categories, etc.
Why Is It Called Unsupervised?
Because in this field of machine learning, the data set provided to train the ML models does not come with any pre-defined set of labels or outcomes defined within the data, the model itself has to predict or segment the data, grouping the people, products, or data points into clusters.
For example:
Consider a problem where you are given a set of past data from a bank, containing a list of user attributes along with one target column that labels each user as:
- Defaulter
- Non-Defaulter
Now our model has to be trained on this data with a known target to achieve, which is to predict whether any user who comes into the loan disbursal system will default or not; this is a kind of supervised machine learning model.
But what if your data had no such target column available, and your model had to group the customers into sets of defaulters and non-defaulters on its own? When your model is trained to perform this kind of segmentation, it is known as an unsupervised learning model.
So, with this basic understanding of unsupervised learning, it is time to get into the fundamentals of clustering, which is a kind of unsupervised learning. Here we will cover:
- What Is Clustering in Unsupervised ML?
- What Are the Types of Clustering?
- What Is K-Means Clustering?
What Is Clustering?
It is a mechanism of grouping a given set of data points into segments based on the concept of similarity among those data points. The intuition behind the concept of similarity comes from the notion of distance.
What Is a Cluster?
It is a collection of data objects which are similar.
So, it is important here to understand the two highlighted words in the definition above:
- Similarity
- Distance
The Concept of Similarity in Clustering:
In cluster analysis, we stress the concept of data point similarity, where similarity is a measure of the distance between the given data points.
These distances, which measure how close the given data points are, are used to infer how similar those data points are. Some of the popular distance measures are:
- Manhattan Distance
- Euclidean Distance
- Chebyshev Distance
- Minkowski Distance
Euclidean Distance:
It is probably the most common measure of distance, one we are all very familiar with in the data science and mathematical world.
As per wiki,
In the field of mathematics, the Euclidean distance or Euclidean metric is the “ordinary” straight-line distance between two points in Euclidean space.
The Euclidean distance between points X and Y is the length of the line segment connecting them. In Cartesian coordinates, the Euclidean distance $d$ from X to Y (or from Y to X) is given by the Pythagorean formula:

$d(X, Y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}$
Euclidean Distance: 2 Dimensions, 3 Dimensions, and N Dimensions:
Euclidean distance, as discussed, uses the popular Pythagorean theorem to calculate the distance between a given pair of vectors/points in n-dimensional space. Below are the formulas in 2-, 3-, and n-dimensional space:

2-D: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

3-D: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$

n-D: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
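To make these formulas concrete, here is a minimal Python sketch; the helper names euclidean_2d and euclidean_nd are illustrative, not from any particular library:

```python
# A minimal sketch of the Euclidean distance in 2-D and in n dimensions.
import math
import numpy as np

def euclidean_2d(x1, y1, x2, y2):
    # Pythagorean formula for two points in the plane
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

def euclidean_nd(p, q):
    # The same formula generalized to n dimensions:
    # square root of the sum of squared coordinate differences
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sqrt(np.sum((p - q) ** 2)))

print(euclidean_2d(0, 0, 3, 4))            # 5.0
print(euclidean_nd([1, 2, 3], [4, 6, 8]))  # ~7.07
```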
Manhattan Distance:
Unlike the Euclidean distance, where we take the square root of the sum of the squared coordinate differences, here the distance between two points is the sum of the absolute differences of their Cartesian coordinates.
This metric of distance is also known as snake distance, city block distance, or Manhattan length. These names take inspiration from the grid layout of most streets on the island of Manhattan, which causes the shortest path a car can take between two intersections in the borough to have a length equal to the intersections' distance in taxicab geometry.
Manhattan distance, which is also called taxicab distance, can be defined by the formula below:

$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$
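A minimal Python sketch of the same idea (the helper name manhattan is illustrative):

```python
# A minimal sketch of the Manhattan (taxicab) distance:
# the sum of the absolute coordinate differences.
import numpy as np

def manhattan(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return int(np.sum(np.abs(p - q)))

# Walking the city grid from (0, 0) to (3, 4): 3 blocks + 4 blocks
print(manhattan([0, 0], [3, 4]))  # 7
```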
Chebyshev Distance:
Also popularly called the chessboard distance:
It takes the maximum of the absolute coordinate differences, whereas the Manhattan distance takes their sum.
As per wiki,
In mathematics, Chebyshev distance (or Tchebychev distance), maximum metric, is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. It is named after Pafnuty Chebyshev.
It is also known as chessboard distance, since in the game of chess the minimum number of moves needed by a king to go from one square on a chessboard to another equals the Chebyshev distance between the centers of the squares, if the squares have side length one, as represented in 2-D spatial coordinates with axes aligned to the edges of the board.
So, for two vectors or points x and y, with standard coordinates $x_i$ and $y_i$ respectively, the Chebyshev distance is:

$D(x, y) = \max_i |x_i - y_i|$

For the 2-dimensional plane, this becomes $D = \max(|x_2 - x_1|, |y_2 - y_1|)$.
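And a matching Python sketch (again, the helper name chebyshev is illustrative):

```python
# A minimal sketch of the Chebyshev (chessboard) distance:
# the maximum absolute coordinate difference.
import numpy as np

def chebyshev(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return int(np.max(np.abs(p - q)))

# A king moving from (0, 0) to (3, 4) needs max(3, 4) = 4 moves
print(chebyshev([0, 0], [3, 4]))  # 4
```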
So now that we have understood the fundamentals of similarity based on measures of distance, it is time to learn the types of clustering and how they make use of the distance metrics discussed above to cluster given vectors of data or objects.
Types of Clustering in Unsupervised Learning:
There are basically two major categorizations of clustering in the field of unsupervised learning:
- Connectivity-based clustering: also known as hierarchical clustering
- Centroid-based clustering: K-means being the most popular kind
Connectivity-Based Clustering:
For a tabular dataframe with some number of columns and N rows, if we calculate the distance between every pair of row objects to find which of them are closely related or similar, so that they can be clustered together, we call this expensive mechanism of clustering connectivity-based clustering. The intuition behind this exhaustive approach is:
That objects are more related to nearby objects than to objects which are farther away.
When the size of the data set is not very large, this kind of clustering is very effective, but if the data set is too big, it can be really resource-intensive. For example, a data set with 1,000 rows leads to 499,500 (roughly half a million) pairs of data to be analysed for similarity, which can be extremely costly to process. Imagine if the number of rows becomes 10,000; the sketch below shows how fast this grows.
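The pairwise workload grows quadratically with the number of rows: n rows give n(n - 1)/2 unique pairs. A quick arithmetic sketch:

```python
# Pairwise-distance workload: n rows -> n * (n - 1) / 2 unique pairs.
from math import comb

for n in (1_000, 10_000):
    print(f"{n} rows -> {comb(n, 2):,} pairs")
# 1000 rows  -> 499,500 pairs (roughly half a million)
# 10000 rows -> 49,995,000 pairs (roughly fifty million)
```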
So, to sum up:
These connectivity-based algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters form, which can be represented using a dendrogram; this explains where the common name "hierarchical clustering" comes from. These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.
I have covered hierarchy-based connectivity clustering in detail in one of my articles linked below; do take some time to understand it in more depth.
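In the meantime, here is a minimal sketch of the idea, assuming SciPy is available; the tiny 2-D dataset and the choice of single linkage are illustrative:

```python
# A minimal sketch of connectivity-based (hierarchical) clustering
# on a tiny 2-D dataset.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one tight group
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])  # another, far away

# linkage() builds the full merge hierarchy from pairwise distances;
# plotting it with scipy's dendrogram() gives the tree discussed above.
Z = linkage(X, method="single", metric="euclidean")

# Cut the hierarchy to obtain a flat partition into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```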
Centroid-Based Clustering:
Unlike hierarchical/connectivity-based clustering, centroid-based clustering organizes the data into non-hierarchical clusters.
Intuition Behind Centroid-Based Clustering:
Here, the number of clusters is fixed at the outset. So, instead of visiting each and every pair of objects across the n rows to calculate distances, this algorithm requires you to define how many clusters you want to obtain; based on that, the centroids of those clusters are identified, and the distances of the data points are calculated with respect to those identified centroids.
This algorithm is very cheap compared to hierarchical clustering, which can be understood with an example. If you have 1,000 rows and 5 clusters are defined at the outset, the algorithm has to process only 5 × 1,000 = 5,000 distance computations per pass, compared with roughly half a million pairs in the case of a connectivity-based clustering algorithm.
How Do We Come Up With the Number of Clusters?
We will answer this question when we uncover k-means clustering, but to ponder: it is related to a popular method known as the Elbow Method. A quick sketch of it follows.
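As a small taste of it, here is a minimal sketch assuming scikit-learn is available: fit k-means for several values of k and look for the "elbow" where the within-cluster sum of squares (inertia) stops dropping sharply. The toy blobs are illustrative:

```python
# A minimal sketch of the Elbow Method with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia "elbows" around k = 3
```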
K-Means Clustering:
K-means is the most widely used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. We will get into the details of k-means clustering in the next part of this series on unsupervised learning (a short preview sketch follows the list below), where we will cover:
- What Is K-Means Clustering?
- How Does It Work?
- Implementing the k-means clustering algorithm in a hands-on Python lab
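Until then, here is a minimal preview sketch of k-means with scikit-learn, on an illustrative toy dataset; the details behind each step are the subject of the next part:

```python
# A minimal preview of k-means clustering with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],      # points near one centroid
              [10, 2], [10, 4], [10, 0]])  # points near another

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                     # cluster index assigned to each point
print(km.cluster_centers_)            # the two learned centroids
print(km.predict([[0, 0], [12, 3]]))  # assign new points to clusters
```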
Translated from: https://medium.com/predict/intuition-behind-clustering-in-unsupervised-machinelearning-ff8567fb7841