【Python Notes】sklearn - Dataset splitting methods - random split, K-fold cross-validation, StratifiedKFold, and StratifiedShuffleSplit
1. Random Split
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# 1) Split the raw data (before any normalization)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,      # stratified sampling by label
    shuffle=True,    # shuffle the data before splitting
    random_state=1)  # seed controlling the shuffling

Function signature: train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
Parameters:
test_size: accepts a float, an int, or None. A float must lie between 0.0 and 1.0 and gives the fraction of all samples to put in the test set. An int gives the absolute number of test samples. If None (i.e. test_size is not given), it defaults to the complement of train_size; if train_size is also None (both are None), the test fraction defaults to 0.25.
train_size: analogous to test_size.
random_state: accepts an int, a RandomState instance, or None. It is the seed used by the random number generator; if None, the generator is seeded via np.random.
stratify: accepts an array-like or None. If not None, the data is split in a stratified fashion, using this array as the class labels. (A proper explanation of what a stratified split means took some digging to find; see below.)
shuffle: defaults to True; whether to shuffle the data before splitting. If shuffle=False, then stratify must be None. (A small sketch of how these parameters interact follows this list.)
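A minimal sketch of the test_size/train_size behavior described above; the toy arrays here are made up purely for illustration:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# float test_size: a fraction of all samples
_, X_te, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
print(len(X_te))              # 3

# int test_size: an absolute number of test samples
_, X_te, _, _ = train_test_split(X, y, test_size=4, random_state=0)
print(len(X_te))              # 4

# both sizes None: the test fraction defaults to 0.25
X_tr, X_te, _, _ = train_test_split(X, y, random_state=0)
print(len(X_tr), len(X_te))   # 7 3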
More on the stratify parameter:
stratify preserves the class distribution across the split. Suppose there are 100 samples, 80 of class A and 20 of class B. With train_test_split(... test_size=0.25, stratify=y_all), the split comes out as:
training: 75 samples, of which 60 are class A and 15 are class B.
testing: 25 samples, of which 20 are class A and 5 are class B.
With stratify, the class ratio in both the training set and the testing set is A:B = 4:1, the same as before the split (80:20). stratify is typically used exactly in such imbalanced settings.
stratify=X_data (the data) splits according to the proportions in X;
stratify=y_data (the labels) splits according to the proportions in y.
In practice it is almost always stratify=y, as the sketch below illustrates.
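A quick way to see stratified sampling in action is to compare class counts before and after the split; this sketch simply reuses the iris data loaded above:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
print(np.bincount(y))     # [50 50 50] -- three balanced classes

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
print(np.bincount(y_tr))  # [40 40 40] -- the 1:1:1 ratio is preserved
print(np.bincount(y_te))  # [10 10 10]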
2. K-Fold Cross-Validation Split
Definition and principle: the original data D is randomly divided into K parts; each round, K-1 parts serve as the training set and the remaining part as the test set. The procedure is repeated K times, and the average of the K accuracy scores is taken as the final evaluation metric for the model. This helps guard against both overfitting and underfitting; the value of K is tuned to the situation at hand.
KFold is the simplest K-fold cross-validation splitter: n_splits is the K value, shuffle controls whether the data is shuffled first, and random_state is the random seed.
Note that the following two lines are equivalent, whereas using average='macro' or average='weighted' is not:

print(precision_score(y_test, y_pred, average='micro'))
print(np.sum(y_test == y_pred) / len(y_test))
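For completeness, here is a self-contained check of that equivalence; the label arrays are made up for illustration:

import numpy as np
from sklearn.metrics import precision_score

y_test = np.array([0, 1, 2, 2, 1, 0])  # hypothetical true labels
y_pred = np.array([0, 2, 2, 2, 1, 1])  # hypothetical predictions

# micro-averaging pools all samples before computing precision, so for
# single-label multiclass data it coincides with plain accuracy
print(precision_score(y_test, y_pred, average='micro'))  # 0.666...
print(np.sum(y_test == y_pred) / len(y_test))            # 0.666...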
The choice of K affects bias and variance. The larger K is, the more data each training set contains, so the model's bias shrinks. But a larger K also means the training sets of the different folds overlap more, and that stronger correlation gives the final test error a larger variance. As a rule of thumb, K = 5 or 10 is the usual choice; the sketch below compares the two.
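One way to compare K values empirically is cross_val_score; logistic regression on iris is used here purely as an illustration:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

for k in (5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    # mean accuracy and its spread across the K folds
    print(k, scores.mean().round(3), scores.std().round(3))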
A quick hands-on example with the iris dataset:
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier as GBDT
from sklearn.metrics import precision_score

clf = GBDT(n_estimators=100)
precision_scores = []

# random_state only has an effect when shuffle=True
# (recent sklearn versions raise an error otherwise)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, valid_index in kf.split(X, y):
    x_train, x_valid = X[train_index], X[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_valid)
    precision_scores.append(precision_score(y_valid, y_pred, average='micro'))
print('Precision', np.mean(precision_scores))

To build more intuition for K-fold cross-validation:
Notes:
1. When the for loop unpacks two variables, they are always the training-set and validation-set indices, never x and y (i.e. never the data and the labels)!
2. In for trn_idx, val_idx in kf.split(X): you may pass only X, without y. If all you need are the indices, the labels are irrelevant; the splitter merely partitions the sample positions into training and validation.
If both X and y are passed, their lengths must match, otherwise an error is raised, as in the following example:
Input:

XX = np.array(['A','B','C','D','E','F','G','H','I','J'])
yy = np.array(range(15))
kf = KFold(n_splits=2, shuffle=False)
for trn_idx, val_idx in kf.split(XX, yy):  # y is passed, but its length does not match
    print('Validation indices and validation samples:')
    print(val_idx)
    print(XX[val_idx])

Error:

ValueError: Found input variables with inconsistent numbers of samples: [10, 15]
Input:

XX = np.array(['A','B','C','D','E','F','G','H','I','J'])
kf = KFold(n_splits=3, shuffle=False)
for trn_idx, val_idx in kf.split(XX):
    print('Validation indices and validation samples:')
    print(val_idx)
    print(XX[val_idx])

Output:

Validation indices and validation samples:
[0 1 2 3]
['A' 'B' 'C' 'D']
Validation indices and validation samples:
[4 5 6]
['E' 'F' 'G']
Validation indices and validation samples:
[7 8 9]
['H' 'I' 'J']

Input:

XX = np.array(['A','B','C','D','E','F','G','H','I','J'])
kf = KFold(n_splits=5, shuffle=False)
for trn_idx, val_idx in kf.split(XX):
    # print('Training indices and training samples:')
    # print(trn_idx)
    # print(XX[trn_idx])
    print('Validation indices and validation samples:')
    print(val_idx)
    print(XX[val_idx])

Output:

Validation indices and validation samples:
[0 1]
['A' 'B']
Validation indices and validation samples:
[2 3]
['C' 'D']
Validation indices and validation samples:
[4 5]
['E' 'F']
Validation indices and validation samples:
[6 7]
['G' 'H']
Validation indices and validation samples:
[8 9]
['I' 'J']

Input:

from sklearn.model_selection import KFold
import numpy as np

x = np.array(['B', 'H', 'L', 'O', 'K', 'P', 'W', 'G'])
kf = KFold(n_splits=2)
d = kf.split(x)
for train_idx, test_idx in d:
    train_data = x[train_idx]
    test_data = x[test_idx]
    print('train_idx:{}, train_data:{}'.format(train_idx, train_data))
    print('test_idx:{}, test_data:{}'.format(test_idx, test_data))

Output:

train_idx:[4 5 6 7], train_data:['K' 'P' 'W' 'G']
test_idx:[0 1 2 3], test_data:['B' 'H' 'L' 'O']
train_idx:[0 1 2 3], train_data:['B' 'H' 'L' 'O']
test_idx:[4 5 6 7], test_data:['K' 'P' 'W' 'G']

To make the current fold number explicit, you can modify the loop as follows:
Input:

XX = np.array(['A','B','C','D','E','F','G','H','I','J'])
yy = np.arange(10)
kf = KFold(n_splits=2, shuffle=False)
for fold_, (trn_idx, val_idx) in enumerate(kf.split(XX, yy)):
    print("fold n°{}".format(fold_ + 1))
    print('Training indices and training samples:')
    print(trn_idx)
    print(XX[trn_idx])
    print('Validation indices and validation samples:')
    print(val_idx)
    print(XX[val_idx])

Output:

fold n°1
Training indices and training samples:
[5 6 7 8 9]
['F' 'G' 'H' 'I' 'J']
Validation indices and validation samples:
[0 1 2 3 4]
['A' 'B' 'C' 'D' 'E']
fold n°2
Training indices and training samples:
[0 1 2 3 4]
['A' 'B' 'C' 'D' 'E']
Validation indices and validation samples:
[5 6 7 8 9]
['F' 'G' 'H' 'I' 'J']

Or like this:
Input:

XX = np.array(['A','B','C','D','E','F','G','H','I','J'])
yy = np.arange(10)
num = np.arange(10)
kf = KFold(n_splits=2, shuffle=False)
for fold_, (trn_idx, val_idx) in zip(num, kf.split(XX)):
    print("fold n°{}".format(fold_ + 1))
    print('Training indices and training samples:')
    print(trn_idx)
    print(XX[trn_idx])
    print('Validation indices and validation samples:')
    print(val_idx)
    print(XX[val_idx])

Output:

fold n°1
Training indices and training samples:
[5 6 7 8 9]
['F' 'G' 'H' 'I' 'J']
Validation indices and validation samples:
[0 1 2 3 4]
['A' 'B' 'C' 'D' 'E']
fold n°2
Training indices and training samples:
[0 1 2 3 4]
['A' 'B' 'C' 'D' 'E']
Validation indices and validation samples:
[5 6 7 8 9]
['F' 'G' 'H' 'I' 'J']
3. StratifiedKFold
StratifiedKFold is the stratified version of KFold: each fold preserves the class proportions of the full label array, which matters when the classes are imbalanced.
# coding = utf-8
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier as GBDT
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,      # stratified sampling by label
    shuffle=True,    # shuffle before splitting
    random_state=1)  # seed controlling the shuffling

clf = GBDT(n_estimators=100)
precision_scores = []

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_index, valid_index in kf.split(x_train, y_train):
    # the indices refer to x_train/y_train, not to the full X/y
    x_trn, x_val = x_train[train_index], x_train[valid_index]
    y_trn, y_val = y_train[train_index], y_train[valid_index]
    clf.fit(x_trn, y_trn)
    y_pred = clf.predict(x_val)
    precision_scores.append(precision_score(y_val, y_pred, average='micro'))
print('Precision', np.mean(precision_scores))
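To confirm that every fold really keeps the class ratio, you can count the labels in each validation fold; a small sketch on the full iris labels:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

X, y = datasets.load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    # each validation fold holds 30 samples, 10 per class
    print('fold', fold, np.bincount(y[val_idx]))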
4. StratifiedShuffleSplit
StratifiedShuffleSplit lets you set the proportions of train and valid in each train/valid pair explicitly. Unlike StratifiedKFold, its n_splits random splits are drawn independently, so their validation sets may overlap.
# coding = utf-8
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier as GBDT
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,      # stratified sampling by label
    shuffle=True,    # shuffle before splitting
    random_state=1)  # seed controlling the shuffling

clf = GBDT(n_estimators=100)
precision_scores = []

kf = StratifiedShuffleSplit(n_splits=10, train_size=0.6, test_size=0.4, random_state=0)
for train_index, valid_index in kf.split(x_train, y_train):
    # the indices refer to x_train/y_train, not to the full X/y
    x_trn, x_val = x_train[train_index], x_train[valid_index]
    y_trn, y_val = y_train[train_index], y_train[valid_index]
    clf.fit(x_trn, y_trn)
    y_pred = clf.predict(x_val)
    precision_scores.append(precision_score(y_val, y_pred, average='micro'))
print('Precision', np.mean(precision_scores))

Other splitters such as RepeatedStratifiedKFold and GroupKFold are covered in the sklearn official documentation; a small GroupKFold sketch follows below.
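As a taste of the group-aware variants just mentioned, here is a minimal GroupKFold sketch; the data and group ids are made up. Samples sharing a group id never end up on both sides of the same split:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # hypothetical group ids

gkf = GroupKFold(n_splits=4)
for trn_idx, val_idx in gkf.split(X, y, groups=groups):
    # each group lands entirely in either train or validation
    print('val groups:', np.unique(groups[val_idx]))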
https://www.programcreek.com/python/example/91149/sklearn.model_selection.StratifiedShuffleSplit