A Detailed Look at CV and KFold in Sklearn
I gave a brief introduction to cross-validation in an earlier article; here we go deeper, working through several more detailed examples.
CV
```python
%matplotlib inline
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape
```

```
((150, 4), (150,))
```

The usual approach is a single train-test split, but this approach is not ideal:
```python
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
clf_svc = svm.SVC(kernel='linear').fit(X_train, y_train)
clf_svc.score(X_test, y_test)
```

```
0.9666666666666667
```

- Drawback 1: it wastes data, since the held-out test portion never contributes to training.
- Drawback 2: it makes it easy to overfit to the single fixed test set while tuning, and correcting for that is inconvenient.
This is where another splitting strategy comes in: cross-validation.
```python
from sklearn.model_selection import cross_val_score

clf_svc_cv = svm.SVC(kernel='linear', C=1)
scores_clf_svc_cv = cross_val_score(clf_svc_cv, iris.data, iris.target, cv=5)
print(scores_clf_svc_cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_clf_svc_cv.mean(), scores_clf_svc_cv.std() * 2))
```

```
[ 0.96666667  1.          0.96666667  0.96666667  1.        ]
Accuracy: 0.98 (+/- 0.03)
```

We can also choose a different scoring metric for cross_val_score:
```python
scores_clf_svc_cv_f1 = cross_val_score(clf_svc_cv, iris.data, iris.target, cv=5, scoring='f1_macro')
print("F1: %0.2f (+/- %0.2f)" % (scores_clf_svc_cv_f1.mean(), scores_clf_svc_cv_f1.std() * 2))
```

```
F1: 0.98 (+/- 0.03)
```

These same properties make cross-validation combine naturally with data transformations and sklearn's pipeline mechanism:
```python
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline

clf_pipeline = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
scores_pipeline_cv = cross_val_score(clf_pipeline, iris.data, iris.target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_pipeline_cv.mean(), scores_pipeline_cv.std() * 2))
```

We can also evaluate several metrics at once during cross-validation:
```python
from sklearn.model_selection import cross_validate

scoring = ['precision_macro', 'recall_macro']
clf_cvs = svm.SVC(kernel='linear', C=1, random_state=0)
scores_cvs = cross_validate(clf_cvs, iris.data, iris.target, cv=5,
                            scoring=scoring, return_train_score=False)
sorted(scores_cvs.keys())
```

```
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
```

```python
print(scores_cvs['test_recall_macro'])
print("test_recall_macro: %0.2f (+/- %0.2f)" % (scores_cvs['test_recall_macro'].mean(), scores_cvs['test_recall_macro'].std() * 2))
```

```
[ 0.96666667  1.          0.96666667  0.96666667  1.        ]
test_recall_macro: 0.98 (+/- 0.03)
```

cross_validate can also take a custom metric built with make_scorer,
or a single metric (one scorer, or one metric-name string) instead of a list; a sketch of the custom-scorer case follows below.
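As a hedged sketch (not from the original post), here is one way to wire a custom scorer into cross_validate; the choice of fbeta_score with beta=2 is purely illustrative:

```python
# Sketch: a custom macro-averaged F-beta scorer for cross_validate.
# fbeta_score and beta=2 are illustrative choices, not the author's.
from sklearn import datasets, svm
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# make_scorer forwards keyword arguments to the underlying metric
ftwo_scorer = make_scorer(fbeta_score, beta=2, average='macro')

# a dict of scorers yields result keys like 'test_f2_macro'; passing a single
# string such as scoring='recall_macro' is the single-metric form noted above
scores = cross_validate(clf, iris.data, iris.target, cv=5,
                        scoring={'f2_macro': ftwo_scorer})
print(scores['test_f2_macro'])
```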
Sklearn's CV toolkit also includes cross_val_predict, which returns out-of-fold predictions. Below is an example from the sklearn docs that uses it to visualize prediction errors.
```python
from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt

lr = linear_model.LinearRegression()
boston = datasets.load_boston()
y = boston.target

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, boston.data, y, cv=10)

fig, ax = plt.subplots()
fig.set_size_inches(18.5, 10.5)
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
```

KFold examples
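Before the stratified variant, a minimal sketch (not in the original post) of the plain KFold splitter, for comparison:

```python
# Sketch: plain KFold simply cuts the indices into consecutive folds,
# ignoring the class labels entirely.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(16).reshape(8, 2)
kf = KFold(n_splits=4)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Test:", test_index)
```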
Stratified k-fold: StratifiedKFold performs stratified splits, keeping each fold's class proportions close to those of the full dataset.
```python
from sklearn.model_selection import StratifiedKFold

X = np.array([[ 1,  2,  3,  4],
              [11, 12, 13, 14],
              [21, 22, 23, 24],
              [31, 32, 33, 34],
              [41, 42, 43, 44],
              [51, 52, 53, 54],
              [61, 62, 63, 64],
              [71, 72, 73, 74]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])

stratified_folder = StratifiedKFold(n_splits=4, shuffle=False)
for train_index, test_index in stratified_folder.split(X, y):
    print("Stratified Train Index:", train_index)
    print("Stratified Test Index:", test_index)
    print("Stratified y_train:", y[train_index])
    print("Stratified y_test:", y[test_index], '\n')
```

```
Stratified Train Index: [1 3 4 5 6 7]
Stratified Test Index: [0 2]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 2 4 5 6 7]
Stratified Test Index: [1 3]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 1 2 3 5 7]
Stratified Test Index: [4 6]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 1 2 3 4 6]
Stratified Test Index: [5 7]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]
```

Beyond these KFold-style splitters there are many other strategies. For example, StratifiedShuffleSplit generates shuffled stratified splits whose class proportions stay roughly the same as in the full dataset, and RepeatedStratifiedKFold repeats stratified K-Fold n times with different randomization in each repetition (see the sketch below). With these, the basic KFold family in sklearn is covered.
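As a hedged sketch (not in the original post), RepeatedStratifiedKFold can be run on the same X and y as above; n_repeats=2 is an illustrative choice:

```python
# Sketch: stratified 4-fold, repeated twice, reshuffled between repetitions.
from sklearn.model_selection import RepeatedStratifiedKFold

rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=0)
for train_index, test_index in rskf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
```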
Note
The i.i.d. assumption is common in machine learning theory but rarely holds in practice. If you know the samples were generated by a time-dependent process, it is safer to use a time-series aware cross-validation scheme (see the sketch below). Likewise, if the generating process has group structure (samples collected from different subjects, experiments, or measurement devices), it is safer to use group-wise cross-validation.
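As a hedged sketch (not in the original post), sklearn's TimeSeriesSplit illustrates a time-series aware scheme; the toy data here is purely illustrative:

```python
# Sketch: TimeSeriesSplit always places the test fold strictly after the
# training samples, so the model is never trained on future observations.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_ts = np.arange(12).reshape(6, 2)  # six observations in time order
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X_ts):
    print("Train:", train_index, "Test:", test_index)
```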
Below is an example of grouped KFold:
```python
from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))
```

```
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
```

For more details, see the corresponding sklearn user guide.
Summary

This post walked through the single train-test split and its drawbacks, cross_val_score with different metrics, pipelines under cross-validation, cross_validate with multiple or custom scorers, cross_val_predict, and the KFold family of splitters, including stratified, repeated, time-series aware, and grouped variants.