Kaggle Learning Notes: XGBoost

Published: 2023/12/10

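Introduction
- What is XGBoost
- - Gradient boosting: a method that iteratively adds models to an ensemble
- Advantages of XGBoost
Loading the data
Step 1: Build an XGBoost model
Step 2: Improve the model (1), aiming for a lower MAE
Step 3: Break the model (2), aiming for a higher MAE
Summary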

Original course: https://www.kaggle.com/alexisbcook/xgboost

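Introduction

What is XGBoost

XGBoost is one of the boosting algorithms. The idea behind boosting is to combine many weak learners into a single strong learner. Because XGBoost is a boosted-tree model, it combines many tree models into one strong model, and the trees it uses are CART regression trees. XGBoost builds on GBDT (gradient boosted decision trees) and improves it, making it more powerful and applicable to a wider range of problems.
[Detailed introduction: https://www.cnblogs.com/wj-1314/p/9402324.html]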

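Gradient boosting: a method that iteratively adds models to an ensemble

1. Start by initializing the ensemble with a single model, whose predictions may be quite naive. (Even if its predictions are very inaccurate, later additions to the ensemble will correct those errors.)
2. Then start the loop:
First, use the current ensemble to generate a prediction for every observation in the dataset. To make a prediction, add up the predictions of all the models in the ensemble. These predictions are used to compute a loss function (for example, mean squared error).
Next, use the loss function to fit a new model that will be added to the ensemble. Specifically, the model parameters are chosen so that adding this new model to the ensemble reduces the loss. (Side note: the "gradient" in "gradient boosting" refers to using gradient descent on the loss function to determine the parameters of this new model.)
3. Finally, add the new model to the ensemble and repeat.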
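
The loop above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not how XGBoost is implemented: the weak learner is a one-split "stump" on a single feature, the loss is squared error (so the negative gradient is proportional to the residual), and the names fit_stump, gradient_boost and the toy data are made up for this example.

```python
# Minimal gradient-boosting sketch for squared-error loss (illustrative only).
# Weak learner: a one-split regression "stump" on a single numeric feature.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    # Step 1: initialise the ensemble with a single naive model (the mean).
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, losses = [], [mse(ys, preds)]
    for _ in range(n_rounds):
        # Step 2a: `preds` holds the summed predictions of the current ensemble.
        # Step 2b: for squared error, the negative gradient is proportional
        # to the residual, so the new model is fitted to the residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Step 3: add the new model (scaled by the learning rate) and repeat.
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        losses.append(mse(ys, preds))
    return base, stumps, losses


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
base, stumps, losses = gradient_boost(xs, ys)
print(f"MSE before boosting: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

Each round fits the stump to what the ensemble still gets wrong, so the training loss never increases from one round to the next in this setup.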

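Advantages of XGBoost

1. Regularization
　　XGBoost adds a regularization term to the cost function to control model complexity. The term includes the number of leaf nodes in a tree and the sum of the squared leaf scores (the squared L2 norm of the leaf outputs). From a bias-variance tradeoff perspective, the regularization term lowers the model's variance, which makes the learned model simpler and helps prevent overfitting. This is one way XGBoost improves on traditional GBDT.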
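
Written out, the regularized objective described here looks roughly as follows. The symbols are assumptions borrowed from the usual XGBoost notation and do not appear in the notes themselves: l is the loss, f_k is the k-th tree, T is the number of leaves in a tree, w_j is the score of leaf j, and γ and λ are the penalty weights.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

A larger γ penalizes trees with many leaves, and a larger λ shrinks the leaf scores toward zero.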
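2. Parallel processing
　　The XGBoost tool supports parallelism. But isn't boosting a sequential structure? How can it run in parallel? Note that XGBoost's parallelism is not at the tree level: XGBoost still has to finish one iteration before starting the next (the cost function of iteration t includes the predictions of the previous t-1 iterations). XGBoost's parallelism is at the feature level.
　　The most time-consuming step in learning a decision tree is sorting the feature values (needed to determine the best split point). Before training, XGBoost sorts the data in advance and stores it in a block structure, which is reused in later iterations and greatly reduces the amount of computation. The block structure also makes parallelism possible: when a node is split, the gain of each feature must be computed and the feature with the largest gain is chosen for the split, so the gain computations for the different features can run on multiple threads.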
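
A rough sketch of the idea: sort each feature once, then let each feature's gain scan run as its own task. Everything here is an assumption for illustration (the function names, the toy data, and a simplified gain formula that leaves out the per-leaf penalty), and Python threads only show the structure; XGBoost does this work in native multithreaded code.

```python
from concurrent.futures import ThreadPoolExecutor


def presort_columns(columns):
    """Sort each feature column once and keep the row order (a toy 'block')."""
    return {name: sorted(range(len(values)), key=values.__getitem__)
            for name, values in columns.items()}


def best_split_for_feature(name, values, order, grad, hess, reg_lambda=1.0):
    """Scan one presorted feature and return (gain, feature, threshold)."""
    total_g, total_h = sum(grad), sum(hess)
    parent_score = total_g ** 2 / (total_h + reg_lambda)
    best = (0.0, name, None)
    left_g = left_h = 0.0
    for pos in range(len(order) - 1):
        row, next_row = order[pos], order[pos + 1]
        left_g += grad[row]
        left_h += hess[row]
        if values[row] == values[next_row]:
            continue  # cannot split between equal feature values
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = 0.5 * (left_g ** 2 / (left_h + reg_lambda)
                      + right_g ** 2 / (right_h + reg_lambda)
                      - parent_score)
        if gain > best[0]:
            threshold = (values[row] + values[next_row]) / 2
            best = (gain, name, threshold)
    return best


def find_best_split(columns, order_by_feature, grad, hess, n_threads=2):
    """Evaluate every feature in its own task, then keep the largest gain."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(best_split_for_feature, name, values,
                               order_by_feature[name], grad, hess)
                   for name, values in columns.items()]
        return max((f.result() for f in futures), key=lambda item: item[0])


columns = {
    "noise": [5.0, 1.0, 4.0, 2.0, 3.0, 6.0],
    "signal": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
}
targets = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
predictions = [5.0] * len(targets)                    # current ensemble output
grad = [p - y for p, y in zip(predictions, targets)]  # squared-error gradient
hess = [1.0] * len(targets)                           # squared-error Hessian

order_by_feature = presort_columns(columns)           # done once, reused per split
gain, feature, threshold = find_best_split(columns, order_by_feature, grad, hess)
print(feature, threshold)
```

The presorted order is computed once and could be reused for every later split search, which is the saving the block structure is meant to provide.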
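3. Flexibility
　　XGBoost supports user-defined objective functions and evaluation functions; the only requirement is that the objective function be twice differentiable.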
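
To make "twice differentiable" concrete, here is a small sketch of what a custom objective has to supply: the first and second derivatives of the loss with respect to each prediction. The log-cosh loss and the function names are choices made for this example only. The exact callable signature XGBoost expects for a custom objective should be taken from its documentation.

```python
import math


def logcosh_objective(preds, labels):
    """Return per-sample gradient and Hessian of log(cosh(pred - label))."""
    grad = [math.tanh(p - y) for p, y in zip(preds, labels)]
    hess = [1.0 - g * g for g in grad]  # d/dx tanh(x) = 1 - tanh(x)^2
    return grad, hess


def logcosh_loss(pred, label):
    return math.log(math.cosh(pred - label))


preds, labels = [0.5, 2.0, -1.0], [1.0, 1.5, -1.0]
grad, hess = logcosh_objective(preds, labels)

# Sanity check: compare the analytic gradient with a central finite difference.
eps = 1e-6
numeric_grad = [(logcosh_loss(p + eps, y) - logcosh_loss(p - eps, y)) / (2 * eps)
                for p, y in zip(preds, labels)]
print(grad)
print(numeric_grad)
```

Checking the analytic gradient against a numerical one is a cheap way to catch sign or scaling mistakes before handing an objective to a training library.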
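4. Missing-value handling
　　For samples whose feature values are missing, XGBoost can automatically learn which direction they should take at a split.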
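
One way to picture "learning the direction" is to try both options at a candidate split and keep the one with the larger gain. The sketch below does exactly that for a single threshold. It is a simplified illustration with hypothetical names and toy numbers, missing values are represented as None, and the gain formula leaves out the per-leaf penalty.

```python
def leaf_score(g, h, reg_lambda=1.0):
    """Structure score of a leaf from its summed gradient and Hessian."""
    return g * g / (h + reg_lambda)


def choose_default_direction(values, grad, hess, threshold, reg_lambda=1.0):
    """Send rows with a missing value left, then right; keep the better gain."""
    left_g = left_h = right_g = right_h = miss_g = miss_h = 0.0
    for v, g, h in zip(values, grad, hess):
        if v is None:
            miss_g, miss_h = miss_g + g, miss_h + h
        elif v < threshold:
            left_g, left_h = left_g + g, left_h + h
        else:
            right_g, right_h = right_g + g, right_h + h
    parent = leaf_score(left_g + right_g + miss_g,
                        left_h + right_h + miss_h, reg_lambda)
    gain_left = 0.5 * (leaf_score(left_g + miss_g, left_h + miss_h, reg_lambda)
                       + leaf_score(right_g, right_h, reg_lambda) - parent)
    gain_right = 0.5 * (leaf_score(left_g, left_h, reg_lambda)
                        + leaf_score(right_g + miss_g, right_h + miss_h, reg_lambda)
                        - parent)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


values = [1.0, 2.0, None, 8.0, 9.0, None]
grad = [3.0, 3.0, -3.0, -3.0, -3.0, -3.0]  # missing rows resemble the large-value rows
hess = [1.0] * 6
direction, gain = choose_default_direction(values, grad, hess, threshold=5.0)
print(direction, round(gain, 3))
```

In this toy data the rows with missing values have the same gradients as the rows on the right of the threshold, so sending them right gives the larger gain.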
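5. Pruning
　　XGBoost first builds all the subtrees it can from top to bottom (up to the configured maximum depth), then prunes backwards from bottom to top. Compared with GBM, this makes it less likely to get stuck in a local optimum.
6. Built-in cross-validation
　　XGBoost allows cross-validation to be run at each boosting iteration, so the optimal number of boosting iterations is easy to obtain. GBM relies on grid search and can only check a limited number of values.
[Original link: https://blog.csdn.net/luanpeng825485697/article/details/79907149]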

Loading the data

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('.../train.csv', index_col='Id')
X_test_full = pd.read_csv('.../test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns
                        if X_train_full[cname].nunique() < 10
                        and X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns
                if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)

# Give the validation and test sets the same columns as the training set
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)

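Step 1: Build an XGBoost model

XGBoost provides a scikit-learn style API through the xgboost.XGBRegressor class, so models can be built and fitted just as in scikit-learn. The XGBRegressor class has many tunable parameters.

The steps are:
1. First, set my_model_1 to be an XGBoost model.
2. Use the XGBRegressor class and set the random seed to 0 (random_state=0). Leave all other parameters at their defaults.
3. Fit the model to the training data in X_train and y_train.
4. Finally, use the mean_absolute_error() function to compute the mean absolute error (MAE) of the predictions on the validation set. The validation labels are stored in y_valid.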

from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Define the model
my_model_1 = XGBRegressor(random_state=0)

# Fit the model
my_model_1.fit(X_train, y_train)

# Get predictions
predictions_1 = my_model_1.predict(X_valid)

# Calculate MAE
mae_1 = mean_absolute_error(y_valid, predictions_1)

print("Mean Absolute Error:", mae_1)

Mean Absolute Error: 16803.434690710616
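
For reference, the MAE reported here is just the average of the absolute differences between the true values and the predictions. A minimal plain-Python sketch follows; the function name and the toy prices are made up for illustration, and the notebook itself uses scikit-learn's mean_absolute_error.

```python
def mean_absolute_error_sketch(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy sale prices: every prediction is off by exactly 10,000.
y_valid_toy = [200000.0, 150000.0, 310000.0]
predictions_toy = [210000.0, 140000.0, 300000.0]
mae_toy = mean_absolute_error_sketch(y_valid_toy, predictions_toy)
print("Mean Absolute Error:", mae_toy)  # 10000.0
```

Because the errors are not squared, the MAE is in the same unit as the target, so a value around 16,800 means the predicted sale prices are off by about that amount on average.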

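Step 2: Improve the model (1), aiming for a lower MAE

1. Set my_model_2 to be an XGBoost model, for example by setting the n_estimators and learning_rate parameters to get better results.
2. Fit the model to the training data in X_train and y_train.
3. Set predictions_2 to the model's predictions for the validation data. The validation features are stored in X_valid.
4. Finally, use the mean_absolute_error() function to compute the mean absolute error (MAE) of the predictions on the validation set. The validation labels are stored in y_valid.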

# Define the model
my_model_2 = XGBRegressor(learning_rate=0.1, n_estimators=450, random_state=0)

# Fit the model
my_model_2.fit(X_train, y_train)

# Get predictions
predictions_2 = my_model_2.predict(X_valid)

# Calculate MAE
mae_2 = mean_absolute_error(y_valid, predictions_2)

print("Mean Absolute Error:", mae_2)

Mean Absolute Error: 15875.706670055652

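Step 3: Break the model (2), aiming for a higher MAE

1. Set my_model_3 to be an XGBoost model. Change the n_estimators and learning_rate parameters to get worse results.
2. Fit the model to the training data in X_train and y_train.
3. Set predictions_3 to the model's predictions for the validation data. The validation features are stored in X_valid.
4. Finally, use the mean_absolute_error() function to compute the mean absolute error (MAE) of the predictions on the validation set. The validation labels are stored in y_valid.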

# Define the model
my_model_3 = XGBRegressor(n_estimators=10, learning_rate=0.5)

# Fit the model
my_model_3.fit(X_train, y_train)

# Get predictions
predictions_3 = my_model_3.predict(X_valid)

# Calculate MAE
mae_3 = mean_absolute_error(y_valid, predictions_3)

print("Mean Absolute Error:", mae_3)

Mean Absolute Error: 21031.549991973458

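Summary

Changing the values of n_estimators and learning_rate in XGBRegressor produced better or worse results. In the experiments above, the MAE followed this pattern:
With a larger n_estimators and a smaller learning_rate (450 and 0.1), the MAE fell to about 15,876.
With a smaller n_estimators and a larger learning_rate (10 and 0.5), the MAE rose to about 21,032.
This pattern describes these particular runs rather than a general rule, since adding too many trees can eventually overfit. Trying several combinations in this way helps find a better-performing model.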

[Link]: XGBRegressor parameter tuning
