Supervised Learning | Decision Trees: Grid Search
Contents
- 1. Improving the Model with Grid Search
- 1.1 Data Import
- 1.2 Splitting the Data into Training and Test Sets
- 1.3 Fitting a Decision Tree Model
- 1.4 Improving the Model with Grid Search
- 1.5 Visualizing Cross-Validation Results
- 1.6 Summary
相關(guān)文章:
機(jī)器學(xué)習(xí) | 目錄
監(jiān)督學(xué)習(xí) | ID3 決策樹原理及Python實(shí)現(xiàn)
監(jiān)督學(xué)習(xí) | ID3 & C4.5 決策樹原理
監(jiān)督學(xué)習(xí) | CART 分類回歸樹原理
監(jiān)督學(xué)習(xí) | 決策樹之Sklearn實(shí)現(xiàn)
監(jiān)督學(xué)習(xí) | 決策樹之網(wǎng)絡(luò)搜索
1. Improving the Model with Grid Search
In this article, we will fit a decision tree model to some sample data. This initial model will overfit. We will then use grid search to find better parameters for the model and reduce the overfitting.
First, import the required libraries:
```python
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

1.1 Data Import
First, define a function that reads the csv data and visualizes it:
```python
def load_pts(csv_name):
    data = np.asarray(pd.read_csv(csv_name, header=None))
    X = data[:, 0:2]
    y = data[:, 2]

    # Scatter plot of the two classes
    plt.scatter(X[np.argwhere(y==0).flatten(), 0], X[np.argwhere(y==0).flatten(), 1],
                s=50, color='blue', edgecolor='k')
    plt.scatter(X[np.argwhere(y==1).flatten(), 0], X[np.argwhere(y==1).flatten(), 1],
                s=50, color='red', edgecolor='k')

    plt.xlim(-2.05, 2.05)
    plt.ylim(-2.05, 2.05)
    plt.grid(False)
    plt.tick_params(axis='x', which='both', bottom=False, top=False)
    return X, y

X, y = load_pts('Data/data.csv')
plt.show()
```

1.2 Splitting the Data into Training and Test Sets
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, make_scorer

# Fix the random seed
import random
random.seed(42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

1.3 Fitting a Decision Tree Model
```python
from sklearn.tree import DecisionTreeClassifier

# Define the model (with default hyperparameters)
clf = DecisionTreeClassifier(random_state=42)

# Fit the model
clf.fit(X_train, y_train)

# Make predictions
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)
```

Now let's visualize the model and compute its F1 score. First, define the plotting function:
```python
def plot_model(X, y, clf):
    # Scatter plot of the two classes
    plt.scatter(X[np.argwhere(y==0).flatten(), 0], X[np.argwhere(y==0).flatten(), 1],
                s=50, color='blue', edgecolor='k')
    plt.scatter(X[np.argwhere(y==1).flatten(), 0], X[np.argwhere(y==1).flatten(), 1],
                s=50, color='red', edgecolor='k')

    # Figure settings
    plt.xlim(-2.05, 2.05)
    plt.ylim(-2.05, 2.05)
    plt.grid(False)
    plt.tick_params(axis='x', which='both', bottom=False, top=False)

    # Use np.meshgrid(r, r) to generate the x and y coordinates of a grid over the plane
    r = np.linspace(-2.1, 2.1, 300)
    s, t = np.meshgrid(r, r)

    # Convert the grid coordinates to the same format as the decision tree's training set
    s = np.reshape(s, (np.size(s), 1))
    t = np.reshape(t, (np.size(t), 1))
    h = np.concatenate((s, t), 1)

    # Predict the class of every point on the plane
    z = clf.predict(h)

    # Reshape the coordinates and predicted classes back into matrices
    s = s.reshape((np.size(r), np.size(r)))
    t = t.reshape((np.size(r), np.size(r)))
    z = z.reshape((np.size(r), np.size(r)))

    # Use plt.contourf to fill the decision regions
    plt.contourf(s, t, z, colors=['blue', 'red'], alpha=0.2, levels=range(-1, 2))

    # Draw the decision boundary
    if len(np.unique(z)) > 1:
        plt.contour(s, t, z, colors='k', linewidths=2)
    plt.show()

plot_model(X, y, clf)
print('The Training F1 Score is', f1_score(train_predictions, y_train))
print('The Testing F1 Score is', f1_score(test_predictions, y_test))
```

```
The Training F1 Score is 1.0
The Testing F1 Score is 0.7000000000000001
```

The training score is 1.0 while the test score is only 0.7, so the current model is clearly overfitting. Next, we optimize its parameters with grid search.
1.4 Improving the Model with Grid Search
Now, we will go through the following steps:
1. First, define some parameters to grid-search over: max_depth, min_samples_leaf, and min_samples_split.
2. Make a scorer for the model using f1_score.
3. Using the parameters and the scorer, run grid search on the classifier.
4. Fit the data to the new classifier.
5. Plot the model and find the f1_score.
6. If the model is not good enough, change the parameter ranges and fit again.
```python
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier(random_state=42)

# Parameter grid
parameters = {'max_depth': [2,4,6,8,10],
              'min_samples_leaf': [2,4,6,8,10],
              'min_samples_split': [2,4,6,8,10]}

# Define the scorer
scorer = make_scorer(f1_score)

# Create the grid search object
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# Fit the grid search object
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best decision tree model
best_clf = grid_fit.best_estimator_

# Fit the best model
best_clf.fit(X_train, y_train)

# Predict on the training and test sets
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# Compute the training and test scores
print('The training F1 Score is', f1_score(best_train_predictions, y_train))
print('The testing F1 Score is', f1_score(best_test_predictions, y_test))

# Visualize the model
plot_model(X, y, best_clf)

# Inspect the parameters of the best model
best_clf
```

```
The training F1 Score is 0.8148148148148148
The testing F1 Score is 0.8
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=42, splitter='best')
```

From this we can see that the best parameters are:
max_depth=4
min_samples_leaf=2
min_samples_split=2
And compared with the first plot, the boundary is simpler, which means the model is less likely to overfit.
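As a sanity check, the winning combination can also be read directly off the fitted grid search object instead of the printed estimator. A minimal sketch, assuming the `grid_fit` object from the code above:

```python
# Best parameter combination and its mean cross-validated F1 score
print(grid_fit.best_params_)  # expected: {'max_depth': 4, 'min_samples_leaf': 2, 'min_samples_split': 2}
print(grid_fit.best_score_)
```

Note that `best_estimator_` is already refitted on the whole training set when `refit=True` (the default), so the extra `best_clf.fit(X_train, y_train)` above is harmless but not required.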
1.5 Visualizing Cross-Validation Results
First, let's look at the information for the different parameter combinations:
```python
results = pd.DataFrame(grid_obj.cv_results_)
results.T
```

(Output: the transposed cv_results_ table, 14 rows × 125 columns, one column per parameter combination. The rows are the fit/score timings, param_max_depth, param_min_samples_leaf, param_min_samples_split, params, the three per-split test scores, mean_test_score, std_test_score, and rank_test_score.)
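The full transposed table is too wide to read comfortably. One way to get a quicker overview is to keep only the parameter columns and the aggregate test scores; a minimal sketch using the standard `cv_results_` column names:

```python
# Keep only the hyperparameters and aggregate scores, best-ranked combinations first
summary = results[['param_max_depth', 'param_min_samples_leaf', 'param_min_samples_split',
                   'mean_test_score', 'std_test_score', 'rank_test_score']]
summary.sort_values('rank_test_score').head(10)
```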
Next, let's look at how, under different maximum depths (max_depth), the minimum number of samples per leaf (min_samples_leaf) and the minimum number of samples required to split (min_samples_split) affect the generalization performance of the decision tree model.
First, define a function that draws a heatmap for a given maximum depth (this requires installing mglearn):
```python
import mglearn

def hotmap(max_depth, results):
    # Select the rows for this max_depth and arrange the mean test scores on a 5x5 grid
    filtered = results[results['param_max_depth'] == max_depth]
    scores = np.array(filtered['mean_test_score']).reshape(5, 5)
    mglearn.tools.heatmap(scores, xlabel='min_samples_split',
                          xticklabels=parameters['min_samples_split'],
                          ylabel='min_samples_leaf',
                          yticklabels=parameters['min_samples_leaf'],
                          cmap="viridis")
```

Then plot the heatmaps as subplots:
```python
plt.figure(figsize=(20, 20))
for i in [1, 2, 3, 4, 5]:
    plt.subplot(1, 5, i, title='max_depth={}'.format(2*i))
    hotmap(2*i, results)
```

As the plots show, the minimum number of samples required to split (min_samples_split) has almost no effect on the model, while the score gradually decreases as the maximum depth (max_depth) increases.
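If you would rather not install mglearn, roughly the same heatmaps can be drawn with plain matplotlib. A minimal sketch under the same `results` and `parameters` as above (`hotmap_plt` is a hypothetical helper name):

```python
def hotmap_plt(max_depth, results):
    # Same 5x5 grid of mean test scores, drawn with plt.imshow instead of mglearn
    filtered = results[results['param_max_depth'] == max_depth]
    scores = np.array(filtered['mean_test_score']).reshape(5, 5)
    im = plt.imshow(scores, cmap='viridis')
    plt.xticks(range(5), parameters['min_samples_split'])
    plt.yticks(range(5), parameters['min_samples_leaf'])
    plt.xlabel('min_samples_split')
    plt.ylabel('min_samples_leaf')
    plt.colorbar(im)
```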
1.6 Summary
By using grid search, we improved the F1 score from 0.7 to 0.8 (while giving up some training score, which is fine). Also, if you look at the plots, the second model's boundary is simpler, which means it is less likely to overfit.