Bits and Pieces of scikit-learn
scikit-learn is a beautiful machine learning library. Used at the right moments, it can save you a great deal of time; at the very least, writing Python ourselves, we would be hard pressed to produce code anywhere near as fast.
scikit-learn ships with official documentation, but personally I find it leaves a lot unexplained. When it covers an algorithm's principles, it only sketches them; unless you already know the algorithm inside out, the description won't mean much to you. When it covers the API, it often shows only the common usage and stays vague about the more advanced options. Plenty of people say these docs are well written, but I find them full of traps. So this post records some of the pitfalls I ran into while using the library, and how to get past them. I'll update it bit by bit; of course, if I ever stop using the library, the post will probably stop growing too, and I have no illusions about how many people will read it. So be it.
Clustering
Pitfall 1: How to use a custom distance function?
scikit-learn implements plenty of clustering algorithms, but most of them measure distance with the Euclidean or Minkowski metrics. As any textbook will tell you, those are far from the only distances: depending on your goal, you can measure the distance between two vectors in many different ways. Unfortunately, I have not found an option in scikit-learn for plugging in a custom distance, and a long search online turned up nothing either.
But don't worry: we can get there indirectly. Take DBSCAN as an example; here is the class constructor:
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean',
                             algorithm='auto', leaf_size=30, p=None, n_jobs=1)
# eps: the maximum distance at which two vectors may still be placed in the same cluster
# min_samples: the minimum number of points a cluster must contain;
#              with fewer points than this, no cluster is formed
Pay particular attention to the metric parameter. Here is its description:
metric : string, or callable. The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN. New in version 0.17: metric precomputed to accept precomputed sparse matrix.
This description gives away something important: you can compute the pairwise distances between your vectors ahead of time, assemble them into a matrix, and simply set metric='precomputed'. So how do you call it?
Let's look at the fit function.
fit(X, y=None, sample_weight=None)
# X : array or sparse (CSR) matrix of shape (n_samples, n_features), or
#     array of shape (n_samples, n_samples)
#     A feature array, or array of distances between samples if metric='precomputed'.
In plain words: if you set metric to 'precomputed', then the X you pass to fit should be the matrix of pairwise distances between your vectors, and fit will work directly from that matrix. Otherwise, you still have to hand it a plain (n_samples, n_features) feature array.
Think about what that means, comrades: we can compute the pairwise distances with whatever custom metric we like, ahead of time, and then call this function to get the clustering. Pretty sweet, isn't it?
As for how to code it up, here is a small example to get things started.
import numpy as np
from sklearn.cluster import DBSCAN

if __name__ == '__main__':
    # Precomputed distance matrix; the smaller the value,
    # the closer the two vectors are.
    Y = np.array([[0, 1, 2],
                  [1, 0, 3],
                  [2, 3, 0]])
    db = DBSCAN(eps=0.13, metric='precomputed', min_samples=3).fit(Y)
    labels = db.labels_
    # Now look at the clustering result!
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)  # number of clusters
    print('Number of clusters: %d' % n_clusters_)
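The matrix above is hardcoded just to show the mechanics. In practice you would fill it in from a distance function of your own. Here is a minimal sketch of that step, under assumptions of my own: the metric my_distance (a weighted L1 distance) and the weights w are invented for illustration, and scipy's pdist/squareform handle the pairwise bookkeeping:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

def my_distance(a, b, w):
    # A made-up weighted L1 distance between two vectors; swap in any
    # metric you like, as long as it returns a non-negative float.
    return np.sum(w * np.abs(a - b))

X = np.random.rand(100, 3)        # toy data: 100 vectors of dimension 3
w = np.array([1.0, 0.5, 2.0])     # invented feature weights

# pdist applies the callable to every pair of rows; squareform expands the
# condensed result into the square (n_samples, n_samples) matrix that
# metric='precomputed' expects.
D = squareform(pdist(X, lambda a, b: my_distance(a, b, w)))

db = DBSCAN(eps=0.3, min_samples=5, metric='precomputed').fit(D)
print(db.labels_)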
Now let's look at AP (Affinity Propagation) clustering; the story is much the same:
class sklearn.cluster.AffinityPropagation(damping=0.5, max_iter=200, convergence_iter=15,
                                          copy=True, preference=None, affinity='euclidean',
                                          verbose=False)
The key lies in the affinity parameter:
affinity : string, optional, default=``euclidean``. Which affinity to use. At the moment precomputed and euclidean are supported. euclidean uses the negative squared euclidean distance between points.
So this one supports 'precomputed' as well. Now for its fit function:
fit(X, y=None)
# Create affinity matrix from negative euclidean distances,
# then apply affinity propagation clustering.
# Parameters:
# X : array-like, shape (n_samples, n_features) or (n_samples, n_samples)
#     Data matrix or, if affinity is precomputed, matrix of similarities / affinities.
X here behaves just like before: if you set affinity to 'precomputed', the X you pass in should be the matrix of pairwise similarities between your vectors, and fit will work directly from that matrix. Otherwise, you must pass in a plain (n_samples, n_features) feature array.
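One point that is easy to trip over: for AP, bigger means more similar, which is the opposite of a distance. So if you start from a distance matrix, you must flip its sign first; that is exactly why the documentation above speaks of the negative squared Euclidean distance. A tiny sketch, with toy points of my own choosing:

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import euclidean_distances

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]])  # two obvious groups
S = -euclidean_distances(X, squared=True)  # similarity = negated squared distance
af = AffinityPropagation(affinity='precomputed').fit(S)
print(af.labels_)  # the first three points should share one label, the rest another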
Example 1
"""目標(biāo):~~~~~~~~~~~~~~~~在這個(gè)文件里面,我最想測(cè)試一下的是,我前面的那些聚類算法是否是正確的.首先要測(cè)試的是AP聚類. """ from sklearn.cluster import AffinityPropagation from sklearn import metrics from sklearn.datasets.samples_generator import make_blobs from sklearn.metrics.pairwise import euclidean_distances import matplotlib.pyplot as plt from itertools import cycle def draw_pic(n_clusters, cluster_centers_indices, labels, X): ''' 口說(shuō)無(wú)憑,繪制一張圖就一目了然. ''' colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk') for k, col in zip(range(n_clusters), colors): class_members = labels == k cluster_center = X[cluster_centers_indices[k]] # 得到聚類的中心 plt.plot(X[class_members, 0], X[class_members, 1], col + '.') plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=14) for x in X[class_members]: plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col) plt.title('Estimated number of clusters: %d' % n_clusters) plt.show() if __name__ == '__main__': centers = [[1, 1], [-1, -1], [1, -1]] # 接下來(lái)要生成300個(gè)點(diǎn),并且每個(gè)點(diǎn)屬于哪一個(gè)中心都要標(biāo)記下來(lái),記錄到labels_true中. X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0) af = AffinityPropagation(preference=-50).fit(X) # 開(kāi)始用AP聚類 cluster_centers_indices = af.cluster_centers_indices_ # 得到聚類的中心點(diǎn) labels = af.labels_ # 得到label n_clusters = len(cluster_centers_indices) # 類的數(shù)目 draw_pic(n_clusters, cluster_centers_indices, labels, X) #===========接下來(lái)的話提前計(jì)算好距離=================# distance_matrix = -euclidean_distances(X, squared=True) # 提前計(jì)算好歐幾里德距離,需要注意的是,這里使用的是歐幾里德距離的平方 af1 = AffinityPropagation(affinity='precomputed', preference=-50).fit(distance_matrix) cluster_centers_indices1 = af1.cluster_centers_indices_ # 得到聚類的中心 labels1 = af1.labels_ # 得到label n_clusters1 = len(cluster_centers_indices1) # 類的數(shù)目 draw_pic(n_clusters1, cluster_centers_indices1, labels1, X)?
Both approaches produce the same clustering figure.
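If a picture feels like weak evidence, you can also check the agreement programmatically. A small sanity check you could append at the end of the script above; adjusted_rand_score equals 1.0 exactly when two labelings are identical up to renaming of the clusters:

from sklearn.metrics import adjusted_rand_score

# labels and labels1 come from the two AP runs in the script above
print(adjusted_rand_score(labels, labels1))  # 1.0 => identical clusterings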
Example 2
While we're at it, let's put DBSCAN to the test as well.
"""目標(biāo):~~~~~~~~~~~~~~前面已經(jīng)測(cè)試過(guò)了ap聚類,接下來(lái)測(cè)試DBSACN. """ import numpy as np from sklearn.cluster import DBSCAN from sklearn import metrics from sklearn.datasets.samples_generator import make_blobs from sklearn.preprocessing import StandardScaler import matplotlib.pyplot as plt from sklearn.metrics.pairwise import euclidean_distances def draw_pic(n_clusters, core_samples_mask, labels, X): ''' 開(kāi)始繪制圖片 ''' # Black removed and is used for noise instead. unique_labels = set(labels) colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels))) for k, col in zip(unique_labels, colors): if k == -1: # Black used for noise. col = 'k' class_member_mask = (labels == k) xy = X[class_member_mask & core_samples_mask] plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=14) xy = X[class_member_mask & ~core_samples_mask] plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6) plt.title('Estimated number of clusters: %d' % n_clusters) plt.show() if __name__ == '__main__': #=========首先產(chǎn)生數(shù)據(jù)===========# centers = [[1, 1], [-1, -1], [1, -1]] X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0) X = StandardScaler().fit_transform(X) #=========接下來(lái)開(kāi)始聚類==========# db = DBSCAN(eps=0.3, min_samples=10).fit(X) labels = db.labels_ # 每個(gè)點(diǎn)的標(biāo)簽 core_samples_mask = np.zeros_like(db.labels_, dtype=bool) core_samples_mask[db.core_sample_indices_] = True n_clusters = len(set(labels)) - (1 if -1 in labels else 0) # 類的數(shù)目 draw_pic(n_clusters, core_samples_mask, labels, X) #==========接下來(lái)我們提前計(jì)算好距離============# distance_matrix = euclidean_distances(X) db1 = DBSCAN(eps=0.3, min_samples=10, metric='precomputed').fit(distance_matrix) labels1 = db1.labels_ # 每個(gè)點(diǎn)的標(biāo)簽 core_samples_mask1 = np.zeros_like(db1.labels_, dtype=bool) core_samples_mask1[db1.core_sample_indices_] = True n_clusters1 = len(set(labels1)) - (1 if -1 in labels1 else 0) # 類的數(shù)目 draw_pic(n_clusters1, core_samples_mask1, labels1, X)?
Again, both approaches produce the same clustering figure.
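One more trick worth knowing, taken straight from the metric description quoted earlier: DBSCAN also accepts a sparse precomputed matrix, in which case only the stored (nonzero) entries are considered as candidate neighbors. On large datasets that can save a lot of memory, since distances beyond eps are irrelevant anyway. A sketch of the idea; it reuses X from the script above, builds the sparse matrix with sklearn's radius_neighbors_graph, and the parameter values are only illustrative:

from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN

# Keep only pairwise distances up to the radius; everything else stays
# unstored and is implicitly treated as "too far".
D_sparse = radius_neighbors_graph(X, radius=0.3, mode='distance', include_self=False)
db2 = DBSCAN(eps=0.3, min_samples=10, metric='precomputed').fit(D_sparse)
print(db2.labels_)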
All right, that's as far as I'll take it for now. Amusingly, the simplest algorithm of the bunch, KMeans, does not support this kind of thing at all.