Machine Learning Algorithms 3

Published: 2024/7/5

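Contents

Transformers and Estimators
Classification Algorithms: K-Nearest Neighbors
- Understanding k-nearest neighbors through an example
- Distance formula
- sklearn.neighbors
- Method
- A k-nearest neighbors example
- k-nearest neighbors case study
- Splitting the Iris data set
- Standardizing the feature data
Naive Bayes
- Probability basics
- Joint and conditional probability
- Joint probability
- Conditional probability
- If the events are mutually independent
- Laplace smoothing
- Naive Bayes API in sklearn
- Pros and cons of naive Bayes
- Bag-of-words feature computation
- Case study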

Transformers and Estimators

Classification Algorithms: K-Nearest Neighbors

Understanding k-nearest neighbors through an example

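Films can be grouped by genre, but how is each genre defined? Take two genres, action films and romance films. What do action films have in common, and what clearly sets romance films apart? Action films contain more fight scenes, while romance films contain more kissing scenes. Of course, action films also contain some kissing scenes and romance films some fight scenes, so the mere presence of a fight scene or a kissing scene cannot decide the genre. Suppose we have six films whose genre is already known, together with their counts of fight scenes and kissing scenes, and one more film whose genre is unknown.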

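We use the k-nearest neighbors algorithm to separate romance films from action films. There is a sample data set, also called the training set, with M samples, and for each sample we know both its features and its class. There is also one sample of unknown class. We compute the distance from this test sample to each of the M training samples, sort the distances, and pick the K nearest samples; K is usually no larger than 20. The class that occurs most often among those K nearest samples is the predicted class. We apply this rule to decide the genre of the unknown film.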
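The procedure described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the scene counts, the `TRAINING` list, the `knn_classify` helper and the `unknown_film` sample are all assumptions introduced for the example and are not taken from this article.

```python
import math
from collections import Counter

# Illustrative (fight scenes, kissing scenes) counts for six labelled films.
# These values are assumptions for the sketch, not this article's table.
TRAINING = [
    ((3, 104), "romance"),
    ((2, 100), "romance"),
    ((1, 81), "romance"),
    ((101, 10), "action"),
    ((99, 5), "action"),
    ((98, 2), "action"),
]


def knn_classify(sample, training, k=3):
    """Return the majority label among the k training points nearest to sample."""
    # Distance from the unknown sample to every one of the M training samples.
    distances = [(math.dist(sample, features), label) for features, label in training]
    # Sort ascending by distance and keep the k closest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


unknown_film = (18, 90)  # hypothetical film of unknown genre
print(knn_classify(unknown_film, TRAINING, k=3))  # romance
```

With K = 3 the three closest films are all romance films, so the vote is unanimous; an even K with two classes could produce a tie, which this sketch resolves simply by taking the first label `Counter` reports.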

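Suppose K is 3. The three nearest films are all romance films, so we classify the unknown film as a romance film too. How is the distance computed?

Distance formula

Euclidean distance: for two points a = (a_1, ..., a_n) and b = (b_1, ..., b_n), the distance between them is

$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$

Because every feature enters this sum on its own scale, the k-nearest neighbors algorithm needs the features to be standardized first.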
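A minimal sketch of the Euclidean distance, and of why unscaled features distort it. The two features (height and income) and all sample values are hypothetical and serve only to show one feature swamping the other.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical samples: (height in metres, income in currency units).
p = (1.60, 30000.0)
q = (1.90, 30010.0)  # very different height, almost the same income
r = (1.61, 30500.0)  # almost the same height, somewhat different income

print(euclidean(p, q))  # roughly 10: the height gap barely registers
print(euclidean(p, r))  # roughly 500: income dominates the distance
```

Here `q` counts as far "closer" to `p` than `r` does, purely because income is measured in larger numbers than height. Standardizing each feature to zero mean and unit variance puts both on a comparable scale before distances are taken.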

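sklearn.neighbors

sklearn.neighbors provides supervised, neighbors-based learning methods, and sklearn.neighbors.KNeighborsClassifier is a nearest-neighbors classifier. KNeighborsClassifier is a class, so let us look at the parameters it takes when instantiated.

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
"""
:param n_neighbors: int, optional (default = 5). Number of neighbors used by default for kneighbors queries.
:param algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional. Algorithm used to compute the nearest neighbors: 'ball_tree' uses BallTree, 'kd_tree' uses KDTree, 'brute' uses a brute-force search, and 'auto' tries to pick the most appropriate algorithm from the values passed to the fit method.
:param n_jobs: int, optional (default = 1 in the version quoted here; newer scikit-learn releases default to None). Number of parallel jobs for the neighbor search. If -1, the number of jobs is set to the number of CPU cores. Does not affect the fit method.
"""

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)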

Method

fit(X, y)
Fits the model using X as the training data and y as the class labels of X. X and y are arrays or matrices.

X = np.array([[1, 1], [1, 1.1], [0, 0], [0, 0.1]])
y = np.array([1, 1, 0, 0])
neigh.fit(X, y)

kneighbors(X=None, n_neighbors=None, return_distance=True)
Finds the n_neighbors nearest neighbors of each point in X. If return_distance is False, the distances are not returned.

neigh.kneighbors(np.array([[1.1, 1.1]]), return_distance=False)
neigh.kneighbors(np.array([[1.1, 1.1]]), return_distance=False, n_neighbors=2)

predict(X)
Predicts the class labels of the supplied data.

neigh.predict(np.array([[0.1, 0.1], [1.1, 1.1]]))

predict_proba(X)
Returns the estimated probability that each test sample in X belongs to each class.

neigh.predict_proba(np.array([[1.1, 1.1]]))
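Gathered into one script, the four methods behave as follows on the same four-point training set. The variable names `idx3`, `idx2`, `labels` and `proba` are introduced here for the sketch; the printed values follow from working out the distances by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 1.1], [0, 0], [0, 0.1]])
y = np.array([1, 1, 0, 0])

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

# Row indices of the 3 training points nearest to (1.1, 1.1), closest first.
idx3 = neigh.kneighbors(np.array([[1.1, 1.1]]), return_distance=False)
# The same query, overriding n_neighbors for this call only.
idx2 = neigh.kneighbors(np.array([[1.1, 1.1]]), n_neighbors=2, return_distance=False)

# Majority vote among the 3 nearest neighbours of each query point.
labels = neigh.predict(np.array([[0.1, 0.1], [1.1, 1.1]]))
# Share of the 3 nearest neighbours that belong to each class.
proba = neigh.predict_proba(np.array([[1.1, 1.1]]))

print(idx3)    # [[1 0 3]]
print(idx2)    # [[1 0]]
print(labels)  # [0 1]
print(proba)   # one third for class 0, two thirds for class 1
```

With the default `weights='uniform'`, `predict_proba` is simply the fraction of neighbours per class: two of the three nearest points to (1.1, 1.1) carry label 1.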

A k-nearest neighbors example

import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knncls():
    """
    Predict users' check-in locations with k-nearest neighbors
    :return: None
    """
    # Read the data
    data = pd.read_csv("./data/FBlocation/train.csv")
    print(data.head(10))

    # Process the data
    # 1. Shrink the data set by filtering on x and y.
    #    .copy() avoids SettingWithCopyWarning on the column assignments below.
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75").copy()

    # Convert the timestamps (in seconds) to datetimes
    time_value = pd.to_datetime(data['time'], unit='s')
    print(time_value)

    # Wrap them in a DatetimeIndex so day, hour and weekday can be read off
    time_value = pd.DatetimeIndex(time_value)

    # Build some features
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday

    # Drop the raw timestamp feature
    data = data.drop(['time'], axis=1)
    print(data)

    # Drop target places with no more than 3 check-ins
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]

    # Separate the feature values from the target values
    y = data['place_id']
    x = data.drop(['place_id'], axis=1)

    # Split into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # Feature engineering (standardization): fit on the training set only
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # Run the algorithm; n_neighbors is the hyperparameter to tune
    knn = KNeighborsClassifier()

    # # fit, predict, score
    # knn.fit(x_train, y_train)
    #
    # # Predicted results
    # y_predict = knn.predict(x_test)
    # print("Predicted check-in locations:", y_predict)
    #
    # # Accuracy
    # print("Prediction accuracy:", knn.score(x_test, y_test))

    # Candidate parameter values to search over
    param = {"n_neighbors": [3, 5, 10]}

    # Grid search with 2-fold cross-validation
    gc = GridSearchCV(knn, param_grid=param, cv=2)
    gc.fit(x_train, y_train)

    # Report the results
    print("Accuracy on the test set:", gc.score(x_test, y_test))
    print("Best result in cross-validation:", gc.best_score_)
    print("Best model:", gc.best_estimator_)
    print("Results of every cross-validation run per hyperparameter:", gc.cv_results_)

    return None


if __name__ == "__main__":
    knncls()

Output of the author's run (long row listings abbreviated):

First ten rows of the raw data:

   row_id       x       y  accuracy    time    place_id
0       0  0.7941  9.0809        54  470702  8523065625
1       1  5.9567  4.7968        13  186555  1757726713
2       2  8.3078  7.0407        74  322648  1137537235
3       3  7.3665  2.5165        65  704587  6567393236
4       4  4.0961  1.1307        31  472130  7440663949
5       5  3.8099  1.9586        75  178065  6289802927
6       6  6.3336  4.3720        13  666829  9931249544
7       7  5.7409  6.7697        85  369002  5662813655
8       8  4.3114  6.9410         3  166384  8471780938
9       9  6.3414  0.0758        65  400060  1253803156

Converted timestamps:

600        1970-01-01 18:09:40
957        1970-01-10 02:11:10
4345       1970-01-05 15:08:02
4735       1970-01-06 23:03:03
5580       1970-01-09 11:26:50
...
29111539   1970-01-04 00:53:41
29112154   1970-01-08 23:01:07
Name: time, Length: 17710, dtype: datetime64[ns]

Filtered data with the new time features:

            row_id       x       y  accuracy    place_id  day  hour  weekday
600            600  1.2214  2.7023        17  6683426742    1    18        3
957            957  1.1832  2.6891        58  6683426742   10     2        5
4345          4345  1.1935  2.6550        11  6889790653    5    15        0
4735          4735  1.1452  2.6074        49  6822359752    6    23        1
5580          5580  1.0089  2.7287        19  1527921905    9    11        4
...
29111539  29111539  1.2032  2.6796        87  3533177779    4     0        6
29112154  29112154  1.1070  2.5419       178  4932578245    8    23        3
[17710 rows x 8 columns]

Warnings raised during the run:
- SettingWithCopyWarning on the three assignments to data['day'], data['hour'] and data['weekday'] ("A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"). The author's script did not call .copy() after query.
- DataConversionWarning from StandardScaler: "Data with input dtype int64, float64 were all converted to float64 by StandardScaler."
- From sklearn/model_selection/_split.py: "The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=2."

Results:

Accuracy on the test set: 0.42174940898345153
Best result in cross-validation: 0.3899747793190416
Best model: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=10, p=2, weights='uniform')

Results of every cross-validation run per hyperparameter (cv_results_):

metric               n_neighbors=3   n_neighbors=5   n_neighbors=10
mean_fit_time        0.01645494      0.01138353      0.01047194
std_fit_time         0.00448656      0.00058341      0.00249255
mean_score_time      0.75878692      0.61585569      0.6273967
std_score_time       0.01371908      0.09724212      0.01204288
split0_test_score    0.33999688      0.36842105      0.38841168
split1_test_score    0.34622116      0.37549722      0.39156722
mean_test_score      0.34308008      0.37192623      0.38997478
std_test_score       0.00311201      0.00353793      0.0015777
rank_test_score      3               2               1
split0_train_score   0.60875099      0.56022275      0.49880668
split1_train_score   0.59253475      0.53959082      0.48633453
mean_train_score     0.60064287      0.54990678      0.49257061
std_train_score      0.00810812      0.01031597      0.00623608

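k-nearest neighbors case study

This case study uses the famous Iris data set. Fisher used it in a classic paper, and it now ships with the Scikit-learn toolkit as a textbook sample data set.

from sklearn.datasets import load_iris

# Load the data with the loader and store it in the variable iris
iris = load_iris()

# Check the size of the data
iris.data.shape

# Read the data set description (a good habit)
print(iris.DESCR)

From the code above and the description of the data, we learn that the Iris data set holds 150 iris samples, evenly distributed over 3 different species (50 each), and that each sample is described by 4 shape features of the petals and sepals. Because no test set is designated, we follow convention and split the data randomly: 25% of the samples are used for testing and the remaining 75% for training the model.
We do not know whether the data set is stored in random order. It may be sorted by class, in which case taking the samples in order would give an unbalanced training set. The split therefore has to be random, and train_test_split shuffles the samples by default.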

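Splitting the Iris data set

# train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=42)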

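Standardizing the feature data

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train = ss.fit_transform(X_train)
# Reuse the training statistics on the test set (transform, not fit_transform)
X_test = ss.transform(X_test)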
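A scaler is normally fitted on the training split and then only applied to the test split, so that both are measured against the same mean and standard deviation. The sketch below mimics that with the standard library; `fit_scaler` and `transform` are hypothetical helpers, not scikit-learn functions, and the one-feature splits are made up. Like StandardScaler, it uses the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Return the (mean, population std) of one training feature column."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Standardise a column with statistics that were learned elsewhere."""
    return [(value - mean) / std for value in column]


# Hypothetical single-feature splits.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 7.0]

mean, std = fit_scaler(train)  # learned from the training split only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(mean, std)
print(test_scaled)
```

The scaled test values stay above every scaled training value, which matches the raw data. Had the test split been standardized with its own mean and standard deviation, it would have been re-centred on zero and that relationship to the training data would be lost.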

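K-nearest neighbors is a very intuitive machine learning model. It has no parameter-training phase: no learning algorithm analyzes the training data, and the classification decision is made directly from the distribution of the training data around the test sample. K-nearest neighbors is therefore one of the simplest non-parametric models.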

from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knniris():
    """
    Classify iris flowers
    :return: None
    """
    # Load and split the data set
    lr = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(lr.data, lr.target, test_size=0.25)

    # Standardize
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # Estimator workflow
    knn = KNeighborsClassifier()

    # # Fit the model
    # knn.fit(x_train, y_train)
    #
    # # Predict, or compute the accuracy
    # y_predict = knn.predict(x_test)
    #
    # # score = knn.score(x_test, y_test)

    # Grid search; n_neighbors is given as a list of candidate values
    param = {"n_neighbors": [3, 5, 7]}
    gs = GridSearchCV(knn, param_grid=param, cv=10)

    # Fit the model
    gs.fit(x_train, y_train)
    # print(gs)

    # Accuracy on the test set
    print(gs.score(x_test, y_test))

    # Precision and recall of the classifier (needs y_predict from the block above)
    # print("Precision and recall per class:", classification_report(y_test, y_predict, target_names=lr.target_names))

    return None


if __name__ == "__main__":
    knniris()

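Naive Bayes

Naive Bayes is a very simple yet highly practical classification model. The naive Bayes classifier is built on Bayes' theorem.

Probability basics

Probability is defined as how likely an event is to occur. It can be estimated from observed data: the probability of an event equals the number of times that event occurred divided by the total number of occurrences of all events. Some examples of events:
A tossed coin lands heads up
A given day is sunny
A given word appears in an unseen document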

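Joint and conditional probability

Joint probability

Joint probability is the probability that two events occur at the same time. Suppose the sample space contains some weather data: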
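The weather table itself is not available here, so the sketch below uses invented observations as a stand-in. It estimates a joint probability by counting how often two outcomes were seen together; the `days` list and the `joint` helper are assumptions made for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Invented observations of (sky, wind); not this article's data.
days = [
    ("sunny", "calm"), ("sunny", "calm"), ("sunny", "windy"),
    ("rainy", "windy"), ("rainy", "windy"), ("rainy", "calm"),
    ("sunny", "calm"), ("rainy", "windy"),
]

pair_counts = Counter(days)
total = len(days)


def joint(sky, wind):
    """P(sky, wind): share of days on which both outcomes occurred together."""
    return Fraction(pair_counts[(sky, wind)], total)


print(joint("sunny", "calm"))   # 3/8
print(joint("rainy", "windy"))  # 3/8
```

Summing `joint` over every (sky, wind) combination gives 1, since each day falls into exactly one combination.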

Conditional probability
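For reference, the standard textbook definition is given below. It is stated from general probability theory rather than recovered from this article, and the event names A and B are placeholders.

```latex
P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B) > 0
```

In words: among the outcomes where B occurred, it is the fraction in which A occurred as well.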

If the events are mutually independent
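Under independence the joint probability factorizes, and this is the "naive" assumption that gives naive Bayes its name. The symbols below (a class C and features F_1, ..., F_n) are notation assumed for this sketch, since the article's own formulas are not available.

```latex
\begin{aligned}
P(A, B) &= P(A)\,P(B) \quad\Longleftrightarrow\quad P(A \mid B) = P(A) \\
P(C \mid F_1, \dots, F_n) &\propto P(C) \prod_{i=1}^{n} P(F_i \mid C)
\end{aligned}
```

The second line assumes the features are independent of one another given the class, so each factor P(F_i | C) can be estimated separately from counts.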

Laplace smoothing

Naive Bayes API in sklearn

class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
"""
:param alpha: float, optional (default = 1.0). Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
"""
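What alpha does can be sketched by hand. Additive smoothing estimates P(word | class) as (N_i + alpha) / (N + alpha * m), where N_i is the count of the word in the class, N is the total count in the class and m is the vocabulary size. The `smoothed_likelihoods` helper and the word counts are hypothetical; this mirrors the smoothed estimate MultinomialNB is documented to use and is not a copy of its code.

```python
def smoothed_likelihoods(word_counts, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing.

    word_counts maps every vocabulary word to its count within one class.
    """
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in word_counts.items()
    }


# Hypothetical counts for one class over a four-word vocabulary.
counts = {"cloud": 8, "film": 0, "goal": 2, "match": 0}

print(smoothed_likelihoods(counts, alpha=0.0))  # unseen words get probability 0
print(smoothed_likelihoods(counts, alpha=1.0))  # every word gets a non-zero share
```

Without smoothing, a single word that never appeared in a class drives the whole product of likelihoods for that class to zero. With alpha = 1 the unseen words "film" and "match" each receive 1/14 instead.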

Pros and cons of naive Bayes

Bag-of-words feature computation
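As a rough illustration of the bag-of-words idea, each document becomes a vector of word counts over a shared vocabulary, and word order is discarded. The `bag_of_words` helper and the two sentences are invented for this sketch; a library vectorizer tokenizes text differently (for instance in how it treats punctuation and very short words).

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for a list of text documents."""
    # Naive tokenisation: lower-case and split on whitespace.
    tokenised = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word, in sorted order.
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["life is short", "life is too long to waste"]  # invented sentences
vocab, rows = bag_of_words(docs)
print(vocab)  # ['is', 'life', 'long', 'short', 'to', 'too', 'waste']
print(rows)   # [[1, 1, 0, 1, 0, 0, 0], [1, 1, 1, 0, 1, 1, 1]]
```

These count vectors are the kind of feature values a multinomial naive Bayes model consumes; TF-IDF weighting replaces the raw counts with values that down-weight words common to most documents.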

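Case study

Classifying internet news

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def naviebayes():
    """
    Text classification with naive Bayes
    :return: None
    """
    news = fetch_20newsgroups(subset='all')

    # Split the data
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)

    # Extract features from the data set
    tf = TfidfVectorizer()

    # Weight each article using the word list built from the training set, e.g. ['a', 'b', 'c', 'd']
    x_train = tf.fit_transform(x_train)
    # print(tf.get_feature_names())
    x_test = tf.transform(x_test)

    # Predict with naive Bayes
    mlt = MultinomialNB(alpha=1.0)
    # print(x_train.toarray())
    mlt.fit(x_train, y_train)
    y_predict = mlt.predict(x_test)
    print("Predicted article classes:", y_predict)

    # Accuracy
    # print("Accuracy:", mlt.score(x_test, y_test))
    # print("Precision and recall per class:", classification_report(y_test, y_predict, target_names=news.target_names))

    return None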
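The same vectorize-then-classify flow can be tried without downloading the 20 newsgroups data. The sketch below is a toy stand-in under stated assumptions: the four training sentences and their "sport" / "tech" labels are invented, and `make_pipeline` is used only to chain the two steps the case study performs separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented miniature corpus standing in for the newsgroup posts.
train_texts = [
    "the team won the football match",
    "the player scored a goal in the match",
    "the new computer chip is fast",
    "the software update fixed the bug",
]
train_labels = ["sport", "sport", "tech", "tech"]

# TF-IDF features feed a multinomial naive Bayes classifier with Laplace smoothing.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_texts, train_labels)

predicted = model.predict(["the team scored a goal", "the software bug"])
print(predicted)
```

Wrapping both steps in one pipeline means the vectorizer's vocabulary is learned only from the texts passed to `fit`, and new texts go through `transform` automatically at prediction time.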
