Machine Learning | Clustering Evaluation Metrics
Contents
- 1. Clustering Evaluation Metrics
  - 1.1 External Evaluation Metrics
    - RI (Rand Index)
    - ARI (Adjusted Rand Index)
    - JC (Jaccard Coefficient)
    - FMI (Fowlkes-Mallows Index)
    - MI (Mutual Information)
    - NMI (Normalized Mutual Information)
    - AMI (Adjusted Mutual Information)
  - 1.2 Internal Evaluation Metrics
    - DBI (Davies-Bouldin Index)
    - DI (Dunn Index)
    - SC (Silhouette Coefficient)
- References
Related posts:
Machine Learning | Index
Machine Learning | Distance Measures
Unsupervised Learning | K-Means and K-Means++ Explained
Unsupervised Learning | K-Means in Sklearn: Clustering Movie Ratings
1. Clustering Evaluation Metrics
Clustering performance evaluation
Clustering performance measures are also called clustering "validity indices". As with performance measures in supervised learning, we need some measure to evaluate how good a clustering result is; conversely, once the performance measure to be used has been fixed, it can serve directly as the optimization objective of the clustering process, helping us obtain results that better match the requirement.
Clustering partitions the sample set D into a number of disjoint subsets, i.e. sample clusters. We want the clustering result to have high "intra-cluster similarity" and low "inter-cluster similarity".
Clustering performance measures fall roughly into two groups. One group compares the clustering result against a "reference model" (a labeled partition of the samples); these are called "external indices". The other group inspects the clustering result directly without any reference model; these are called "internal indices".
For a data set $D=\{x_1,x_2,\cdots,x_n\}$, suppose clustering produces $k$ clusters $C=\{C_1,C_2,\cdots,C_k\}$ and the reference model gives $s$ clusters $C^*=\{C_1^*,C_2^*,\cdots,C_s^*\}$. Let $\lambda$ and $\lambda^*$ denote the cluster label vectors corresponding to $C$ and $C^*$, respectively. Considering the samples in pairs, define:
$$a=|SS|,\quad SS=\{(x_i,x_j)\mid \lambda_i=\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\} \tag{1}$$
$$b=|SD|,\quad SD=\{(x_i,x_j)\mid \lambda_i=\lambda_j,\ \lambda_i^*\neq\lambda_j^*,\ i<j\} \tag{2}$$
$$c=|DS|,\quad DS=\{(x_i,x_j)\mid \lambda_i\neq\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\} \tag{3}$$
$$d=|DD|,\quad DD=\{(x_i,x_j)\mid \lambda_i\neq\lambda_j,\ \lambda_i^*\neq\lambda_j^*,\ i<j\} \tag{4}$$
The set $SS$ collects the sample pairs whose two points fall in the same cluster in the clustering result and also in the same cluster in the reference model, analogous to TP in a confusion matrix; the set $SD$ collects the pairs whose two points fall in the same cluster in the clustering result but in different clusters in the reference model, analogous to FP; $DS$ and $DD$ are defined analogously.
Since every sample pair $(x_i,x_j)$ with $i<j$ appears in exactly one of these sets, $a+b+c+d=C_n^2=n(n-1)/2$. [1]
1.1 External Evaluation Metrics
Based on Equations (1)-(4), the following commonly used external indices of clustering performance can be derived:
RI (Rand Index)
Rand Index
$$RI=\frac{a+d}{C_n^2}$$
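As a sanity check, $a$, $b$, $c$, $d$ and RI can be computed directly from the two label vectors; the same counts also feed Eqs. (5)-(7) below. The following is a brute-force sketch; the helper name pair_counts and the toy labels are assumptions, not part of the original post.

```python
from itertools import combinations

def pair_counts(labels_pred, labels_true):
    """Count the pair sets SS, SD, DS, DD of Eqs. (1)-(4) by brute force."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            a += 1   # SS
        elif same_pred:
            b += 1   # SD
        elif same_true:
            c += 1   # DS
        else:
            d += 1   # DD
    return a, b, c, d

labels_pred = [0, 0, 1, 1, 2, 2]   # clustering result
labels_true = [0, 0, 1, 1, 1, 2]   # reference model
a, b, c, d = pair_counts(labels_pred, labels_true)
print(a, b, c, d, (a + d) / (a + b + c + d))   # the last value is RI
```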
ARI (Adjusted Rand Index)
Adjusted Rand Index
sklearn.metrics.adjusted_rand_score
$$ARI=\frac{RI-E[RI]}{\max(RI)-E[RI]}\tag{5}$$
Advantages
- Random (uniform) label assignments have an ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for the raw Rand index or the V-measure, for instance).
- Bounded range [-1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI, and 1.0 is the perfect match score.
- No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with "folded" shapes.
Drawbacks
- Contrary to inertia, ARI requires knowledge of the ground-truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting). However, ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
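A minimal usage sketch with sklearn.metrics.adjusted_rand_score (the toy label vectors are illustrative assumptions; sklearn.metrics.rand_score additionally requires scikit-learn >= 0.24):

```python
from sklearn.metrics import adjusted_rand_score, rand_score

labels_true = [0, 0, 1, 1, 1, 2]   # reference model
labels_pred = [0, 0, 1, 1, 2, 2]   # clustering result

print(rand_score(labels_true, labels_pred))           # raw Rand Index
print(adjusted_rand_score(labels_true, labels_pred))  # ARI, adjusted for chance
# ARI is symmetric and invariant to permutation of the label values:
print(adjusted_rand_score(labels_pred, labels_true))
```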
JC (Jaccard Coefficient)
Jaccard Coefficient
$$JC=\frac{a}{a+b+c}\tag{6}$$
FMI (Fowlkes-Mallows Index)
Fowlkes and Mallows Index
sklearn.metrics.fowlkes_mallows_score
$$FMI=\sqrt{\frac{a}{a+b}\cdot\frac{a}{a+c}}\tag{7}$$
Advantages
- Random (uniform) label assignments have an FMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure, for instance).
- Upper-bounded at 1: values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, a value of exactly 0 indicates purely independent label assignments, and an FMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
- No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with "folded" shapes.
Drawbacks
- Contrary to inertia, FMI-based measures require knowledge of the ground-truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
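A usage sketch with sklearn.metrics.fowlkes_mallows_score (the toy labels are illustrative assumptions); the result agrees with Eq. (7) evaluated on the pair counts from the sketch above:

```python
from sklearn.metrics import fowlkes_mallows_score

labels_true = [0, 0, 1, 1, 1, 2]
labels_pred = [0, 0, 1, 1, 2, 2]

# Equivalent to Eq. (7): sqrt(a/(a+b) * a/(a+c)) with a, b, c from pair_counts above.
print(fowlkes_mallows_score(labels_true, labels_pred))
```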
MI (Mutual Information)
Mutual Information
sklearn.metrics.mutual_info_score
For a data set $D=\{x_1,x_2,\cdots,x_n\}$, again suppose clustering produces $k$ clusters $C=\{C_1,C_2,\cdots,C_k\}$ and the reference model gives $s$ clusters $C^*=\{C_1^*,C_2^*,\cdots,C_s^*\}$. Then
$$\begin{aligned} MI(C,C^*) &= \sum_{i=1}^k \sum_{j=1}^s P(C_i\cap C_j^*)\log \frac{P(C_i\cap C_j^*)}{P(C_i)P(C_j^*)} \\ &= \sum_{i=1}^k \sum_{j=1}^s \frac{|C_i\cap C_j^*|}{n}\log\frac{n\,|C_i\cap C_j^*|}{|C_i|\,|C_j^*|} \end{aligned}\tag{8}$$
where $P(C_i)$, $P(C_j^*)$ and $P(C_i\cap C_j^*)$ can be read as the probabilities that a sample belongs to cluster $C_i$, to class $C_j^*$, and to both, respectively.
Define the entropy $H$:
$$\begin{aligned} H(C) &= -\sum_{i=1}^k P(C_i)\log P(C_i) \\ &= -\sum_{i=1}^k \frac{|C_i|}{n}\log\frac{|C_i|}{n} \end{aligned}\tag{9}$$
Intuitively, mutual information is the gain in information about the class labels $C$ (equivalently, the reduction in their uncertainty) once the cluster assignment $C^*$ is given, which can be written as
$$MI(C,C^*)=H(C)-H(C\mid C^*)\tag{10}$$
- The minimum value of MI is 0. When the clusters are random with respect to the classes, i.e. the two labelings are independent, $C$ carries no useful information about $C^*$.
- The more closely $C$ and $C^*$ are related, the larger $MI(C,C^*)$ becomes. If $C$ exactly reproduces $C^*$, the mutual information is maximal.
- When $k=n$, i.e. the number of clusters equals the number of samples, MI also attains its maximum. MI therefore suffers from the same problem as purity: it does not penalize clusterings with a large number of clusters, so, all else being equal, it cannot formalize the preference for fewer clusters.
NMI (Normalized Mutual Information)
Normalized Mutual Information
sklearn.metrics.normalized_mutual_info_score
NMI addresses this problem, because entropy grows with the number of clusters: when $k=n$, $H(C)$ reaches its maximum value $\log(n)$, which keeps the NMI value low. The quantity $\frac{1}{2}(H(C)+H(C^*))$ is used as the normalizer because it is a tight upper bound on $MI(C,C^*)$, which guarantees $NMI\in[0,1]$.
$$NMI(C,C^*)=\frac{2\times MI(C,C^*)}{H(C)+H(C^*)}\tag{11}$$
AMI (Adjusted Mutual Information)
Adjusted Mutual Information
sklearn.metrics.adjusted_mutual_info_score
$$AMI(C,C^*)=\frac{MI(C,C^*)-E[MI(C,C^*)]}{\operatorname{avg}(H(C),H(C^*))-E[MI(C,C^*)]}\tag{12}$$
Let $a_i=|C_i|$ and $b_j=|C_j^*|$; then $E[MI(C,C^*)]$ is
$$E[MI(C,C^*)]=\sum_{i=1}^{|C|}\sum_{j=1}^{|C^*|}\sum_{n_{ij}=(a_i+b_j-n)^+}^{\min(a_i,b_j)} \frac{n_{ij}}{n}\log\!\left(\frac{n\cdot n_{ij}}{a_i b_j}\right) \frac{a_i!\,b_j!\,(n-a_i)!\,(n-b_j)!}{n!\,n_{ij}!\,(a_i-n_{ij})!\,(b_j-n_{ij})!\,(n-a_i-b_j+n_{ij})!}\tag{13}$$
When the logarithm is taken base 2 the unit is bits; with base $e$ it is nats. [2]
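A usage sketch for the three MI-based scores in scikit-learn (the toy labels are illustrative assumptions; in recent versions the default average_method='arithmetic' matches the denominators of Eqs. (11) and (12)):

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             mutual_info_score,
                             normalized_mutual_info_score)

labels_true = [0, 0, 1, 1, 1, 2]
labels_pred = [0, 0, 1, 1, 2, 2]

print(mutual_info_score(labels_true, labels_pred))             # MI, in nats
print(normalized_mutual_info_score(labels_true, labels_pred))  # NMI in [0, 1]
print(adjusted_mutual_info_score(labels_true, labels_pred))    # AMI, adjusted for chance
```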
1.2 Internal Evaluation Metrics
Consider a clustering result with $k$ clusters $C=\{C_1,C_2,\cdots,C_k\}$. Let $dist(\cdot,\cdot)$ denote the distance between two samples (see Machine Learning | Distance Measures), and let $\mu$ denote the centroid of a cluster $C$, $\mu=\frac{1}{|C|}\sum_{1\leq i\leq |C|} x_i$.
Define:
The average distance between samples within cluster $C$, $avg(C)$:
$$avg(C)=\frac{2}{|C|(|C|-1)}\sum_{1\leq i<j\leq |C|} dist(x_i,x_j)\tag{14}$$
The maximum distance between samples within cluster $C$, $diam(C)$:
$$diam(C)=\max_{1\leq i<j\leq |C|} dist(x_i,x_j)\tag{15}$$
The distance between the closest samples of clusters $C_i$ and $C_j$, $d_{min}(C_i,C_j)$:
$$d_{min}(C_i,C_j)=\min_{x_i\in C_i,\,x_j\in C_j} dist(x_i,x_j)\tag{16}$$
The distance between the centroids of clusters $C_i$ and $C_j$, $d_{cen}(C_i,C_j)$:
$$d_{cen}(C_i,C_j)=dist(\mu_i,\mu_j)\tag{17}$$
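A small NumPy/SciPy sketch of Eqs. (14)-(17) for two illustrative clusters (the arrays are assumptions used only to make the definitions concrete):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

C1 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
C2 = np.array([[5.0, 5.0], [6.0, 5.0]])

avg_C1  = pdist(C1).mean()   # Eq. (14): mean pairwise distance within C1
diam_C1 = pdist(C1).max()    # Eq. (15): diameter of C1
d_min   = cdist(C1, C2).min()                                 # Eq. (16): closest pair between clusters
d_cen   = np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0))   # Eq. (17): centroid distance
print(avg_C1, diam_C1, d_min, d_cen)
```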
DBI (Davies-Bouldin Index)
Davies-Bouldin Index; smaller is better.
sklearn.metrics.davies_bouldin_score
$$DBI=\frac{1}{k}\sum_{i=1}^k \max_{j\neq i}\left(\frac{avg(C_i)+avg(C_j)}{d_{cen}(\mu_i,\mu_j)}\right)\tag{18}$$
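A usage sketch with sklearn.metrics.davies_bouldin_score (the blob data and the K-Means setup are illustrative assumptions; note that scikit-learn measures within-cluster scatter as the mean distance of samples to their centroid, which differs slightly from the pairwise $avg(C)$ of Eq. (18)):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(davies_bouldin_score(X, labels))  # lower is better
```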
DI (Dunn Index)
Dunn Index; larger is better.
$$DI=\min_{1\leq i\leq k}\left\{\min_{j\neq i}\left(\frac{d_{min}(C_i,C_j)}{\max_{1\leq l\leq k} diam(C_l)}\right)\right\}\tag{19}$$
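The Dunn index has no implementation in scikit-learn; below is a minimal NumPy/SciPy sketch of Eq. (19) under the definitions above (the function name dunn_index is an assumption, not a library API):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """X: (n_samples, n_features) array; labels: cluster assignment per sample."""
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    # largest within-cluster diameter, Eq. (15)
    max_diam = max(pdist(c).max() if len(c) > 1 else 0.0 for c in clusters)
    # smallest between-cluster single-linkage distance, Eq. (16)
    min_sep = min(cdist(ci, cj).min()
                  for i, ci in enumerate(clusters)
                  for j, cj in enumerate(clusters) if i != j)
    return min_sep / max_diam  # Eq. (19); larger is better
```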
SC (Silhouette Coefficient)
Silhouette Coefficient
sklearn.metrics.silhouette_score
For a single sample, let $a$ be its mean distance to the other samples in the same cluster and $b$ its mean distance to the samples in the nearest other cluster. The silhouette of the sample is
$$S=\frac{b-a}{\max(a,b)}\tag{20}$$
and the Silhouette Coefficient of a clustering is the mean of $S$ over all samples.
Advantages
- The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
- The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
Drawbacks
- The Silhouette Coefficient is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN (it is therefore a poor fit for the irregular, noise-containing clusters that DBSCAN produces).
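A usage sketch with sklearn.metrics.silhouette_score, comparing several candidate values of k (the blob data and settings are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # higher is better; on three well-separated blobs the score typically peaks at k=3
    print(k, silhouette_score(X, labels))
```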
References
[1] Zhou Zhihua. Machine Learning [M]. Beijing: Tsinghua University Press, 2016: 197-199.
[2] Gan Pan. [ML] Clustering Evaluation Metrics [EB/OL]. https://zhuanlan.zhihu.com/p/53840697, 2019-06-28.