
Ranking Feature Importance with an XGBoost Model

Published: 2023/12/8


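In this post you will learn:

How XGBoost ranks feature importance for a predictive model (that is, why it is able to rank features at all).

How to plot the feature importance scores computed by an XGBoost model as a bar chart.

How to use those feature importance scores to perform feature selection in scikit-learn.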

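How does gradient boosting compute feature importance?

A benefit of gradient boosting is that, once the boosted trees have been built, it is relatively straightforward to retrieve an importance score for each attribute. Generally, the importance score indicates how valuable each feature was in constructing the boosted decision trees within the model. The more an attribute is used to make splits in the decision trees, the higher its relative importance.

Importance is computed explicitly for every attribute in the dataset, which allows the attributes to be ranked and compared with one another. Within a single decision tree, an attribute's importance is the amount by which each of its split points improves the performance measure, weighted by the number of observations the node is responsible for. In other words, the larger the improvement an attribute's splits produce (such splits tend to sit closer to the root node), the larger its weight; and the more boosted trees select it, the more important it is. The performance measure may be the purity (Gini index) used to select the split points, or another more specific error function.

Finally, an attribute's per-tree importances are averaged across all the decision trees in the model to give its overall importance score.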
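The weighting and averaging described above can be illustrated with a small toy calculation. This is only a sketch: the `Split` records, their gain and sample-count values, and the `tree_importance` and `ensemble_importance` helpers are hypothetical, and XGBoost's own importance types (such as split count or average gain) differ in detail.

```python
from collections import defaultdict
from typing import NamedTuple


class Split(NamedTuple):
    """One split point in a tree (hypothetical record for illustration)."""
    feature: int      # index of the attribute used for the split
    gain: float       # improvement in the performance measure
    n_samples: int    # observations the node is responsible for


def tree_importance(splits, n_features):
    """Sum each attribute's sample-weighted gain within a single tree."""
    scores = defaultdict(float)
    for s in splits:
        scores[s.feature] += s.gain * s.n_samples
    return [scores[f] for f in range(n_features)]


def ensemble_importance(trees, n_features):
    """Average per-tree importances over all trees, then normalise to sum to 1."""
    totals = [0.0] * n_features
    for splits in trees:
        for f, value in enumerate(tree_importance(splits, n_features)):
            totals[f] += value / len(trees)
    norm = sum(totals)
    return [t / norm for t in totals]


# Two made-up trees over three attributes.
trees = [
    [Split(0, 0.30, 100), Split(1, 0.10, 60), Split(0, 0.05, 40)],
    [Split(0, 0.20, 100), Split(2, 0.10, 50)],
]
importance = ensemble_importance(trees, n_features=3)

# Attribute indices ordered from most to least important.
ranking = sorted(range(3), key=lambda f: importance[f], reverse=True)
print(importance, ranking)
```

Attribute 0 ranks first here because it is used in both trees and its splits cover the most observations, which is the behaviour the paragraphs above describe.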

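Plotting feature importance

A trained XGBoost model computes feature importance automatically. The scores are available through the feature_importances_ member variable and can be printed as follows: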

print(model.feature_importances_)

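We can plot these scores directly on a bar chart to get a visual indication of the relative importance of each feature in the dataset. For example:

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

We can demonstrate this by training an XGBoost model on the Pima Indians onset of diabetes dataset and drawing a bar chart from the computed feature importances.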

# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on all training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

Running this example first prints the feature importance scores:

[0.089701, 0.17109634, 0.08139535, 0.04651163, 0.10465116, 0.2026578, 0.1627907, 0.14119601]

It then draws a bar chart of the relative importances.

A drawback of this plot is that the features appear in input order rather than sorted by importance. The scores can be sorted before plotting.
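One way to sort the scores before plotting is sketched below. It reuses the scores printed above instead of a fitted model, and the variable names are illustrative; the resulting `labels` and `values` could then be passed to pyplot.bar() in place of the unsorted arrays.

```python
# Importance scores as printed by the example above (index i is feature f{i}).
importances = [0.089701, 0.17109634, 0.08139535, 0.04651163,
               0.10465116, 0.2026578, 0.1627907, 0.14119601]

# Feature indices ordered from most to least important.
order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)

labels = [f"f{i}" for i in order]
values = [importances[i] for i in order]

for label, value in zip(labels, values):
    print(f"{label}: {value:.4f}")
```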

Alternatively, XGBoost has a built-in function, plot_importance(), that plots the features ordered by importance. For example:

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on all training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()

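Running the example produces a bar chart with the features ordered by importance.

Features are automatically named f0 to f7 according to their index in the input array. Mapping these indices by hand to the names in the problem description, we can see that f5 (body mass index) has the highest importance and f3 (skin fold thickness) has the lowest.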
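The manual mapping from the automatic labels to readable names can be written down as a small lookup. The names below are an assumption based on the commonly published column descriptions of the Pima Indians diabetes dataset; only f5 (body mass index) and f3 (skin fold thickness) are confirmed by the text above. The `readable` helper is hypothetical.

```python
# Assumed column names for the eight inputs, in dataset order.
FEATURE_NAMES = [
    "pregnancies",            # f0
    "plasma glucose",         # f1
    "blood pressure",         # f2
    "skin fold thickness",    # f3
    "serum insulin",          # f4
    "body mass index",        # f5
    "diabetes pedigree",      # f6
    "age",                    # f7
]


def readable(label: str) -> str:
    """Translate an automatic label such as 'f5' into a column name."""
    return FEATURE_NAMES[int(label[1:])]


print(readable("f5"))
print(readable("f3"))
```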

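Feature selection with XGBoost feature importance scores

Feature importance scores can be used for feature selection in scikit-learn. This is done with the SelectFromModel class, which takes a model and transforms a dataset into a subset with the selected features. The class can take a pre-trained model, such as one trained on the entire training dataset. It then uses a threshold to decide which features to select. When transform() is called on the SelectFromModel instance, this threshold is applied so that the same features are selected consistently on the training set and the test set.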
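What the threshold does can be sketched with plain NumPy. This mimics the selection rule (keep features whose importance is greater than or equal to the threshold) and does not call SelectFromModel itself; the importance values, the random matrices, and the `select_columns` helper are made up for illustration.

```python
import numpy as np


def select_columns(X, importances, threshold):
    """Keep the columns whose importance is at least `threshold`."""
    mask = np.asarray(importances) >= threshold
    return X[:, mask], mask


importances = [0.05, 0.40, 0.15, 0.40]
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 4))   # 6 rows, 4 features
X_test = rng.normal(size=(3, 4))    # 3 rows, 4 features

# The same threshold, and therefore the same mask, is applied to both sets.
X_train_sel, mask = select_columns(X_train, importances, threshold=0.15)
X_test_sel, _ = select_columns(X_test, importances, threshold=0.15)

print(mask)                                   # which features were kept
print(X_train_sel.shape, X_test_sel.shape)    # both sets keep the same columns
```

Because the mask depends only on the importances and the threshold, the training and test sets always end up with the same columns.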

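In the example below, we first train an XGBoost model on the training set and evaluate it on the test set. Using the feature importances computed from the training data, we then wrap the model in a SelectFromModel instance. We use it to select features on the training set, train a model on the selected subset, and evaluate that model on the test set under the same feature selection.

For example: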

# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)

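We can test multiple thresholds for selecting features by importance. Specifically, using each input variable's importance as a threshold lets us test every subset of features ranked by importance, starting with all features and ending with only the most important one.

The complete example is listed below: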

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

Running the example prints the following output:

Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%

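We can see that the model's performance generally decreases as the number of selected features decreases. On this problem there is a trade-off between test-set accuracy and model complexity: for example, we could choose a simpler model with 4 features and accept a drop in estimated accuracy from 77.95% to 76.38%. On such a small dataset the difference is probably a wash, but the strategy may be more useful on a larger dataset with cross-validation as the model evaluation scheme.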
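The trade-off can be made explicit with a short calculation over the results printed above. The two-percentage-point tolerance and the `simplest_within` helper are arbitrary choices for illustration, not part of the original example.

```python
# (threshold, number of features, test accuracy in %) from the output above.
results = [
    (0.071, 8, 77.95), (0.073, 7, 76.38), (0.084, 6, 77.56), (0.090, 5, 76.38),
    (0.128, 4, 76.38), (0.160, 3, 74.80), (0.186, 2, 71.65), (0.208, 1, 63.78),
]


def simplest_within(results, tolerance):
    """Return the entry with the fewest features within `tolerance` points of the best accuracy."""
    best = max(acc for _, _, acc in results)
    candidates = [r for r in results if best - r[2] <= tolerance]
    return min(candidates, key=lambda r: r[1])


thresh, n, acc = simplest_within(results, tolerance=2.0)
print(f"Thresh={thresh:.3f}, n={n}, Accuracy: {acc:.2f}%")
```

With this tolerance the calculation picks the 4-feature model, matching the example given in the text; a tighter tolerance would keep more features.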
