【Machine Learning】The K-means Algorithm and a C Implementation
Clustering is an unsupervised classification method: samples carry no class membership or labels in advance and must be grouped automatically according to the distances or degrees of similarity between them. Clustering algorithms fall into partition-based, connectivity-based, density-based, and probabilistic-model-based families; K-means belongs to the partition-based family.
1. Basic Principle

Partition-based clustering divides the vector space formed by the sample set into several regions, each of which has an associated representative, the region center. Every sample can then be mapped to a region center:

q(x) = \sum_{i=1}^{k} i \cdot l(x \in S_i)

where l(\cdot) is the indicator function (1 when its argument holds, 0 otherwise). The mapping q(x) assigns each sample to its corresponding center, which yields the final partition.

Partition-based algorithms differ mainly in how the mapping q(x) is constructed. In the classic K-means algorithm, the mapping is determined by the criterion of minimizing the sum of squared distances between the samples and the centers.
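As an illustration, the nearest-center mapping under the squared-distance criterion can be sketched in C as follows (the function and variable names are my own, not part of the original article):

```c
/* q(x): index of the nearest center under squared Euclidean distance.
   dim = dimensionality, k = number of centers,
   centers = k*dim array, center i at centers[i*dim .. i*dim+dim-1]. */
static int nearest_center(int dim, int k, const double *x, const double *centers)
{
    int best = 0;
    double best_d = -1.0;            /* sentinel: no distance seen yet */
    for (int i = 0; i < k; i++) {
        double d = 0.0;
        for (int j = 0; j < dim; j++) {
            double diff = x[j] - centers[i * dim + j];
            d += diff * diff;        /* squared distance, no sqrt needed */
        }
        if (best_d < 0.0 || d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}
```

Since the square root is monotonic, comparing squared distances gives the same winner and avoids the `sqrt` call.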
Given a sample set X = \{x_1, x_2, \dots, x_n\}, the goal of K-means is to partition the data set into k (k < n) classes S = \{S_1, S_2, \dots, S_k\} so that the within-class sum of squared distances is minimized:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

where \mu_i is the mean of the samples in S_i:

\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
Minimizing this objective exactly is an NP-hard problem, and no stable global optimum can be guaranteed. The classic algorithm therefore adopts an iterative optimization strategy that efficiently finds a local optimum of the objective.
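For a given partition, the objective itself is cheap to evaluate. A minimal sketch in C (the function and variable names are my own, not part of the original article) that computes the within-class sum of squared distances for row-major flattened arrays:

```c
#include <math.h>

/* Within-cluster sum of squares:
   J = sum over points of the squared distance to the center of their cluster.
   X is n*dim (point i occupies X[i*dim .. i*dim+dim-1]),
   assign[i] is the cluster index of point i, centers is k*dim. */
static double wcss(int dim, int n, int k, const double *X,
                   const int *assign, const double *centers)
{
    double J = 0.0;
    for (int i = 0; i < n; i++) {
        const double *x = &X[i * dim];
        const double *c = &centers[assign[i] * dim];
        for (int j = 0; j < dim; j++) {
            double t = x[j] - c[j];
            J += t * t;
        }
    }
    return J;
}
```

Each K-means iteration can only keep this value the same or lower it, which is why the procedure converges to a local optimum.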
The algorithm proceeds as follows:

Step 1. Initialize the cluster centers, e.g. take the first k samples of the set, or pick k samples at random.
Step 2. Assign each sample to the nearest cluster: x is assigned to S_j when

\lVert x - \mu_j \rVert^2 \le \lVert x - \mu_p \rVert^2, \quad p = 1, 2, \dots, k, \ p \ne j.

Step 3. Using the assignments from Step 2, update each cluster center:

\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x

Step 4. If the maximum number of iterations has been reached, or the change between two consecutive iterations falls below a preset threshold \varepsilon, i.e. \lVert \mu^{(t+1)} - \mu^{(t)} \rVert < \varepsilon, terminate; otherwise repeat from Step 2.

Steps 2 and 3 respectively reassign the samples and recompute the cluster centers; iterating them optimizes the objective and minimizes the within-class sum of squared distances.
2. Optimizing the K-means Algorithm
2.1 Better initialization of the cluster centers

K-means is sensitive to the initialization of the cluster centers: different initial values lead to different clustering results. This is because K-means only finds an approximate local optimum of the objective and cannot guarantee the global optimum, so under certain data distributions the result can deviate substantially depending on the initialization.
K-means++ is an improved variant of K-means that generates effective initial centers. First, pick one cluster center at random from the samples. Then, for every sample x compute D(x), its distance to the nearest center already chosen, and select the next center with probability proportional to D(x)^2:

P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}

Repeat this selection until k centers have been chosen.
The seeding step of K-means++ has computational complexity O(knd), so it adds little overhead, while making the algorithm approximate the optimal solution more effectively.
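The seeding procedure described above can be sketched in C as follows (function and variable names are my own; `rand()` is used only for simplicity, and a real implementation would want a better random source):

```c
#include <stdlib.h>

/* Squared Euclidean distance between two dim-vectors. */
static double sq_dist(int dim, const double *a, const double *b)
{
    double d = 0.0;
    for (int j = 0; j < dim; j++) {
        double t = a[j] - b[j];
        d += t * t;
    }
    return d;
}

/* K-means++ seeding: pick the first center at random, then repeatedly
   pick the next center with probability proportional to D(x)^2, the
   squared distance to the nearest already-chosen center.
   X is n*dim, centers is k*dim (output), d2 is length-n scratch. */
static void kmeanspp_init(int dim, int n, int k,
                          const double *X, double *centers, double *d2)
{
    int first = rand() % n;
    for (int j = 0; j < dim; j++)
        centers[j] = X[first * dim + j];

    for (int c = 1; c < k; c++) {
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            /* D(x)^2: squared distance to the nearest chosen center */
            double best = sq_dist(dim, &X[i * dim], &centers[0]);
            for (int m = 1; m < c; m++) {
                double d = sq_dist(dim, &X[i * dim], &centers[m * dim]);
                if (d < best)
                    best = d;
            }
            d2[i] = best;
            total += best;
        }
        /* sample an index with probability d2[i] / total */
        double r = ((double)rand() / ((double)RAND_MAX + 1.0)) * total;
        int pick = n - 1;
        double acc = 0.0;
        for (int i = 0; i < n; i++) {
            acc += d2[i];
            if (acc > r) { pick = i; break; }
        }
        for (int j = 0; j < dim; j++)
            centers[c * dim + j] = X[pick * dim + j];
    }
}
```

The strict `acc > r` comparison ensures that a point coinciding with an existing center (D(x) = 0) is never selected as long as some point has positive mass.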
2.2 Choosing the number of clusters adaptively

In the classic K-means algorithm the number of clusters k is fixed in advance; the algorithm cannot select the number of classes adaptively, yet this setting largely determines the quality of the clustering. The ISODATA algorithm follows the same basic principle as K-means, clustering by minimizing the sum of squared distances, but introduces merge and split operations during the iterations. In each iteration, ISODATA first clusters with the number of classes held fixed, then merges clusters whose separation falls below a preset distance threshold, and uses the covariance of the samples in each class S_i to decide whether that class should be split. The price is that ISODATA is considerably less efficient than K-means.
Appendix: a C implementation of K-means

```c
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

#define sqr(x) ((x)*(x))
#define MAX_CLUSTERS 16
#define MAX_ITERATIONS 100
#define BIG_double (INFINITY)

void fail(char *str)
{
    printf("%s\n", str);  /* was printf(str): unsafe if str contains '%' */
    exit(-1);
}

double calc_distance(int dim, double *p1, double *p2)
{
    double distance_sq_sum = 0;

    for (int ii = 0; ii < dim; ii++)
        distance_sq_sum += sqr(p1[ii] - p2[ii]);

    return distance_sq_sum;
}

void calc_all_distances(int dim, int n, int k, double *X, double *centroid, double *distance_output)
{
    for (int ii = 0; ii < n; ii++)       // for each point
        for (int jj = 0; jj < k; jj++)   // for each cluster
        {
            // calculate distance between point and cluster centroid
            distance_output[ii*k + jj] = calc_distance(dim, &X[ii*dim], &centroid[jj*dim]);
        }
}

double calc_total_distance(int dim, int n, int k, double *X, double *centroids, int *cluster_assignment_index)
// NOTE: a point with cluster assignment -1 is ignored
{
    double tot_D = 0;

    // for every point
    for (int ii = 0; ii < n; ii++)
    {
        // which cluster is it in?
        int active_cluster = cluster_assignment_index[ii];

        // sum distance
        if (active_cluster != -1)
            tot_D += calc_distance(dim, &X[ii*dim], &centroids[active_cluster*dim]);
    }

    return tot_D;
}

void choose_all_clusters_from_distances(int dim, int n, int k, double *distance_array, int *cluster_assignment_index)
{
    // for each point
    for (int ii = 0; ii < n; ii++)
    {
        int best_index = -1;
        double closest_distance = BIG_double;

        // for each cluster
        for (int jj = 0; jj < k; jj++)
        {
            // distance between point and cluster centroid
            double cur_distance = distance_array[ii*k + jj];
            if (cur_distance < closest_distance)
            {
                best_index = jj;
                closest_distance = cur_distance;
            }
        }

        // record in array
        cluster_assignment_index[ii] = best_index;
    }
}

void calc_cluster_centroids(int dim, int n, int k, double *X, int *cluster_assignment_index, double *new_cluster_centroid)
{
    int cluster_member_count[MAX_CLUSTERS];

    // initialize cluster centroid coordinate sums to zero
    for (int ii = 0; ii < k; ii++)
    {
        cluster_member_count[ii] = 0;
        for (int jj = 0; jj < dim; jj++)
            new_cluster_centroid[ii*dim + jj] = 0;
    }

    // sum all points: for every point
    for (int ii = 0; ii < n; ii++)
    {
        // which cluster is it in?
        int active_cluster = cluster_assignment_index[ii];

        // update count of members in that cluster
        cluster_member_count[active_cluster]++;

        // sum point coordinates for finding centroid
        for (int jj = 0; jj < dim; jj++)
            new_cluster_centroid[active_cluster*dim + jj] += X[ii*dim + jj];
    }

    // now divide each coordinate sum by number of members to find mean/centroid
    for (int ii = 0; ii < k; ii++)
    {
        if (cluster_member_count[ii] == 0)
        {
            printf("WARNING: Empty cluster %d! \n", ii);
            continue;  // skip the division: an empty cluster keeps the zero centroid
        }

        // for each dimension
        for (int jj = 0; jj < dim; jj++)
            new_cluster_centroid[ii*dim + jj] /= cluster_member_count[ii];
    }
}

void get_cluster_member_count(int n, int k, int *cluster_assignment_index, int *cluster_member_count)
{
    // initialize cluster member counts
    for (int ii = 0; ii < k; ii++)
        cluster_member_count[ii] = 0;

    // count members of each cluster
    for (int ii = 0; ii < n; ii++)
        cluster_member_count[cluster_assignment_index[ii]]++;
}

void update_delta_score_table(int dim, int n, int k, double *X, int *cluster_assignment_cur, double *cluster_centroid, int *cluster_member_count, double *point_move_score_table, int cc)
{
    // for every point (both in and not in the cluster)
    for (int ii = 0; ii < n; ii++)
    {
        double dist_sum = 0;
        for (int kk = 0; kk < dim; kk++)
        {
            double axis_dist = X[ii*dim + kk] - cluster_centroid[cc*dim + kk];
            dist_sum += sqr(axis_dist);
        }

        double mult = ((double)cluster_member_count[cc] /
                      (cluster_member_count[cc] + ((cluster_assignment_cur[ii] == cc) ? -1 : +1)));

        // the score table is n*k; the original indexed it with ii*dim + cc,
        // which is wrong whenever dim != k
        point_move_score_table[ii*k + cc] = dist_sum * mult;
    }
}

void perform_move(int dim, int n, int k, double *X, int *cluster_assignment, double *cluster_centroid, int *cluster_member_count, int move_point, int move_target_cluster)
{
    int cluster_old = cluster_assignment[move_point];
    int cluster_new = move_target_cluster;

    // update cluster assignment array
    cluster_assignment[move_point] = cluster_new;

    // update cluster count array
    cluster_member_count[cluster_old]--;
    cluster_member_count[cluster_new]++;

    if (cluster_member_count[cluster_old] <= 1)
        printf("WARNING: Can't handle single-member clusters! \n");

    // update centroid array
    for (int ii = 0; ii < dim; ii++)
    {
        cluster_centroid[cluster_old*dim + ii] -= (X[move_point*dim + ii] - cluster_centroid[cluster_old*dim + ii]) / cluster_member_count[cluster_old];
        cluster_centroid[cluster_new*dim + ii] += (X[move_point*dim + ii] - cluster_centroid[cluster_new*dim + ii]) / cluster_member_count[cluster_new];
    }
}

void cluster_diag(int dim, int n, int k, double *X, int *cluster_assignment_index, double *cluster_centroid)
{
    int cluster_member_count[MAX_CLUSTERS];

    get_cluster_member_count(n, k, cluster_assignment_index, cluster_member_count);

    printf(" Final clusters \n");
    for (int ii = 0; ii < k; ii++)
        printf(" cluster %d: members: %8d, centroid (%.1f %.1f) \n",
               ii, cluster_member_count[ii],
               cluster_centroid[ii*dim + 0], cluster_centroid[ii*dim + 1]);
}

void copy_assignment_array(int n, int *src, int *tgt)
{
    for (int ii = 0; ii < n; ii++)
        tgt[ii] = src[ii];
}

int assignment_change_count(int n, int a[], int b[])
{
    int change_count = 0;

    for (int ii = 0; ii < n; ii++)
        if (a[ii] != b[ii])
            change_count++;

    return change_count;
}

void kmeans(int dim,                      // dimension of data
            double *X,                    // pointer to data
            int n,                        // number of elements
            int k,                        // number of clusters
            double *cluster_centroid,     // initial cluster centroids
            int *cluster_assignment_final // output
           )
{
    double *dist = (double *)malloc(sizeof(double) * n * k);
    int *cluster_assignment_cur = (int *)malloc(sizeof(int) * n);
    int *cluster_assignment_prev = (int *)malloc(sizeof(int) * n);
    double *point_move_score = (double *)malloc(sizeof(double) * n * k);

    if (!dist || !cluster_assignment_cur || !cluster_assignment_prev || !point_move_score)
        fail("Error allocating dist arrays");

    // initial setup
    calc_all_distances(dim, n, k, X, cluster_centroid, dist);
    choose_all_clusters_from_distances(dim, n, k, dist, cluster_assignment_cur);
    copy_assignment_array(n, cluster_assignment_cur, cluster_assignment_prev);

    // BATCH UPDATE
    double prev_totD = BIG_double;
    int batch_iteration = 0;

    while (batch_iteration < MAX_ITERATIONS)
    {
        // printf("batch iteration %d \n", batch_iteration);
        // cluster_diag(dim, n, k, X, cluster_assignment_cur, cluster_centroid);

        // update cluster centroids
        calc_cluster_centroids(dim, n, k, X, cluster_assignment_cur, cluster_centroid);

        // see if we've failed to improve
        double totD = calc_total_distance(dim, n, k, X, cluster_centroid, cluster_assignment_cur);
        if (totD > prev_totD)
        // failed to improve - current solution worse than previous
        {
            // restore old assignments
            copy_assignment_array(n, cluster_assignment_prev, cluster_assignment_cur);

            // recalc centroids
            calc_cluster_centroids(dim, n, k, X, cluster_assignment_cur, cluster_centroid);

            printf(" negative progress made on this step - iteration completed (%.2f) \n", totD - prev_totD);

            // done with this phase
            break;
        }

        // save previous step
        copy_assignment_array(n, cluster_assignment_cur, cluster_assignment_prev);

        // move all points to nearest cluster
        calc_all_distances(dim, n, k, X, cluster_centroid, dist);
        choose_all_clusters_from_distances(dim, n, k, dist, cluster_assignment_cur);

        int change_count = assignment_change_count(n, cluster_assignment_cur, cluster_assignment_prev);

        printf("%3d %u %9d %16.2f %17.2f\n", batch_iteration, 1, change_count, totD, totD - prev_totD);
        fflush(stdout);

        // done with this phase if nothing has changed
        if (change_count == 0)
        {
            printf(" no change made on this step - iteration completed \n");
            break;
        }

        prev_totD = totD;
        batch_iteration++;
    }

    // write to output array
    copy_assignment_array(n, cluster_assignment_cur, cluster_assignment_final);

    free(dist);
    free(cluster_assignment_cur);
    free(cluster_assignment_prev);
    free(point_move_score);
}
```
2017.11.17