Supervised Learning | Decision Trees: Grid Search
Contents
- 1. Improving the Model with Grid Search
- 1.1 Data Import
- 1.2 Splitting the Data into Training and Test Sets
- 1.3 Fitting a Decision Tree Model
- 1.4 Improving the Model with Grid Search
- 1.5 Visualizing Cross-Validation Results
- 1.6 Summary
相關(guān)文章:
機(jī)器學(xué)習(xí) | 目錄
監(jiān)督學(xué)習(xí) | ID3 決策樹原理及Python實(shí)現(xiàn)
監(jiān)督學(xué)習(xí) | ID3 & C4.5 決策樹原理
監(jiān)督學(xué)習(xí) | CART 分類回歸樹原理
監(jiān)督學(xué)習(xí) | 決策樹之Sklearn實(shí)現(xiàn)
監(jiān)督學(xué)習(xí) | 決策樹之網(wǎng)絡(luò)搜索
1. Improving the Model with Grid Search
In this article, we will fit a decision tree model to some sample data. This initial model will overfit. We will then use grid search to find better parameters for the model and reduce the overfitting.
First, import the required libraries:
```python
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

1.1 Data Import
First, define a function that reads the csv data and visualizes it:
```python
def load_pts(csv_name):
    data = np.asarray(pd.read_csv(csv_name, header=None))
    X = data[:, 0:2]
    y = data[:, 2]

    # Scatter plot of the two classes
    plt.scatter(X[np.argwhere(y==0).flatten(), 0], X[np.argwhere(y==0).flatten(), 1],
                s=50, color='blue', edgecolor='k')
    plt.scatter(X[np.argwhere(y==1).flatten(), 0], X[np.argwhere(y==1).flatten(), 1],
                s=50, color='red', edgecolor='k')

    plt.xlim(-2.05, 2.05)
    plt.ylim(-2.05, 2.05)
    plt.grid(False)
    plt.tick_params(axis='x', which='both', bottom=False, top=False)
    return X, y

X, y = load_pts('Data/data.csv')
plt.show()
```

1.2 Splitting the Data into Training and Test Sets
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, make_scorer

# Fix the random seed
import random
random.seed(42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

1.3 Fitting a Decision Tree Model
```python
from sklearn.tree import DecisionTreeClassifier

# Define the model (with default hyperparameters)
clf = DecisionTreeClassifier(random_state=42)

# Fit the model
clf.fit(X_train, y_train)

# Make predictions
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)
```

Now let's visualize the model and compute its F1 score. First, define the plotting function:
```python
def plot_model(X, y, clf):
    # Scatter plot of the two classes
    plt.scatter(X[np.argwhere(y==0).flatten(), 0], X[np.argwhere(y==0).flatten(), 1],
                s=50, color='blue', edgecolor='k')
    plt.scatter(X[np.argwhere(y==1).flatten(), 0], X[np.argwhere(y==1).flatten(), 1],
                s=50, color='red', edgecolor='k')

    # Figure settings
    plt.xlim(-2.05, 2.05)
    plt.ylim(-2.05, 2.05)
    plt.grid(False)
    plt.tick_params(axis='x', which='both', bottom=False, top=False)

    # Use np.meshgrid(r, r) to generate the x and y coordinates of a grid over the plane
    r = np.linspace(-2.1, 2.1, 300)
    s, t = np.meshgrid(r, r)

    # Convert the grid coordinates to the same format as the decision tree's training set
    s = np.reshape(s, (np.size(s), 1))
    t = np.reshape(t, (np.size(t), 1))
    h = np.concatenate((s, t), 1)

    # Predict the class of every point on the plane
    z = clf.predict(h)

    # Reshape the coordinates and predicted classes back into matrices
    s = s.reshape((np.size(r), np.size(r)))
    t = t.reshape((np.size(r), np.size(r)))
    z = z.reshape((np.size(r), np.size(r)))

    # Use plt.contourf to fill the decision regions
    plt.contourf(s, t, z, colors=['blue', 'red'], alpha=0.2, levels=range(-1, 2))

    # Draw the decision boundary
    if len(np.unique(z)) > 1:
        plt.contour(s, t, z, colors='k', linewidths=2)
    plt.show()

plot_model(X, y, clf)
print('The Training F1 Score is', f1_score(train_predictions, y_train))
print('The Testing F1 Score is', f1_score(test_predictions, y_test))
```

```
The Training F1 Score is 1.0
The Testing F1 Score is 0.7000000000000001
```

The training score is 1.0 while the test score is only 0.7, so the current model is clearly overfitting. Next, we optimize its parameters with grid search.
1.4 Improving the Model with Grid Search
Now, we will go through the following steps:
1. First, define some parameters to grid-search over: max_depth, min_samples_leaf, and min_samples_split.
2. Make a scorer for the model using f1_score.
3. Using the parameters and the scorer, run grid search on the classifier.
4. Fit the data to the new classifier.
5. Plot the model and find the f1_score.
6. If the model is not good enough, change the parameter ranges and fit again.
```python
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier(random_state=42)

# Parameter grid
parameters = {'max_depth': [2,4,6,8,10],
              'min_samples_leaf': [2,4,6,8,10],
              'min_samples_split': [2,4,6,8,10]}

# Define the scorer
scorer = make_scorer(f1_score)

# Create the grid search object
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# Fit the grid search object
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best decision tree model
best_clf = grid_fit.best_estimator_

# Fit the best model
best_clf.fit(X_train, y_train)

# Predict on the training and test sets
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# Compute the training and test scores
print('The training F1 Score is', f1_score(best_train_predictions, y_train))
print('The testing F1 Score is', f1_score(best_test_predictions, y_test))

# Visualize the model
plot_model(X, y, best_clf)

# Inspect the parameters of the best model
best_clf
```

```
The training F1 Score is 0.8148148148148148
The testing F1 Score is 0.8
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=42, splitter='best')
```

From this we can see that the best parameters are:
max_depth=4
min_samples_leaf=2
min_samples_split=2
And compared with the first plot, the boundary is simpler, which means the model is less likely to overfit.
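As a sanity check, the winning combination can also be read directly off the fitted grid search object instead of the printed estimator. A minimal sketch, assuming the `grid_fit` object from the code above:

```python
# Best parameter combination and its mean cross-validated F1 score
print(grid_fit.best_params_)  # expected: {'max_depth': 4, 'min_samples_leaf': 2, 'min_samples_split': 2}
print(grid_fit.best_score_)
```

Note that `best_estimator_` is already refitted on the whole training set when `refit=True` (the default), so the extra `best_clf.fit(X_train, y_train)` above is harmless but not required.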
1.5 Visualizing Cross-Validation Results
First, let's look at the information for the different parameter combinations:
```python
results = pd.DataFrame(grid_obj.cv_results_)
results.T
```

(Output: the transposed cv_results_ table, 14 rows × 125 columns, one column per parameter combination. The rows are the fit/score timings, param_max_depth, param_min_samples_leaf, param_min_samples_split, params, the three per-split test scores, mean_test_score, std_test_score, and rank_test_score.)
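The full transposed table is too wide to read comfortably. One way to get a quicker overview is to keep only the parameter columns and the aggregate test scores; a minimal sketch using the standard `cv_results_` column names:

```python
# Keep only the hyperparameters and aggregate scores, best-ranked combinations first
summary = results[['param_max_depth', 'param_min_samples_leaf', 'param_min_samples_split',
                   'mean_test_score', 'std_test_score', 'rank_test_score']]
summary.sort_values('rank_test_score').head(10)
```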
Next, let's look at how, under different maximum depths (max_depth), the minimum number of samples per leaf (min_samples_leaf) and the minimum number of samples required to split (min_samples_split) affect the generalization performance of the decision tree model.
First, define a function that draws a heatmap for a given maximum depth (this requires installing mglearn):
```python
import mglearn

def hotmap(max_depth, results):
    # Select the rows for this max_depth and arrange the mean test scores on a 5x5 grid
    filtered = results[results['param_max_depth'] == max_depth]
    scores = np.array(filtered['mean_test_score']).reshape(5, 5)
    mglearn.tools.heatmap(scores, xlabel='min_samples_split',
                          xticklabels=parameters['min_samples_split'],
                          ylabel='min_samples_leaf',
                          yticklabels=parameters['min_samples_leaf'],
                          cmap="viridis")
```

Then plot the heatmaps as subplots:
```python
plt.figure(figsize=(20, 20))
for i in [1, 2, 3, 4, 5]:
    plt.subplot(1, 5, i, title='max_depth={}'.format(2*i))
    hotmap(2*i, results)
```

As the plots show, the minimum number of samples required to split (min_samples_split) has almost no effect on the model, while the score gradually decreases as the maximum depth (max_depth) increases.
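If you would rather not install mglearn, roughly the same heatmaps can be drawn with plain matplotlib. A minimal sketch under the same `results` and `parameters` as above (`hotmap_plt` is a hypothetical helper name):

```python
def hotmap_plt(max_depth, results):
    # Same 5x5 grid of mean test scores, drawn with plt.imshow instead of mglearn
    filtered = results[results['param_max_depth'] == max_depth]
    scores = np.array(filtered['mean_test_score']).reshape(5, 5)
    im = plt.imshow(scores, cmap='viridis')
    plt.xticks(range(5), parameters['min_samples_split'])
    plt.yticks(range(5), parameters['min_samples_leaf'])
    plt.xlabel('min_samples_split')
    plt.ylabel('min_samples_leaf')
    plt.colorbar(im)
```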
1.6 Summary
By using grid search, we improved the F1 score from 0.7 to 0.8 (while giving up some training score, which is fine). Also, if you look at the plots, the second model's boundary is simpler, which means it is less likely to overfit.