Machine Learning in Action: Comparing GBDT, XGBoost, and LightGBM
MNIST dataset recognition
GBDT with sklearn
GradientBoostingClassifier
GradientBoostingRegressor
```python
import gzip
import pickle as pkl
from sklearn.model_selection import train_test_split

def load_data(path):
    f = gzip.open(path, 'rb')
    try:
        # Python 3
        train_set, valid_set, test_set = pkl.load(f, encoding='latin1')
    except:
        # Python 2
        train_set, valid_set, test_set = pkl.load(f)
    f.close()
    return (train_set, valid_set, test_set)

path = 'mnist.pkl.gz'
train_set, valid_set, test_set = load_data(path)

# keep only 10% of the training and test data so the comparison runs quickly
Xtrain, _, ytrain, _ = train_test_split(train_set[0], train_set[1], test_size=0.9)
Xtest, _, ytest, _ = train_test_split(test_set[0], test_set[1], test_size=0.9)
print(Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape)
```

(5000, 784) (5000,) (1000, 784) (1000,)
Parameter descriptions:
learning_rate: Shrinks the contribution of each tree, controlling the magnitude of each boosting update. (default=0.1)
n_estimators: The number of sequential trees to be modeled. (default=100)
max_depth: The maximum depth of a tree. (default=3)
min_samples_split: The minimum number of samples (or observations) required in a node for it to be considered for splitting. (default=2)
min_samples_leaf: The minimum number of samples (or observations) required in a terminal node or leaf. (default=1)
min_weight_fraction_leaf: Similar to min_samples_leaf but defined as a fraction of the total number of observations instead of an integer. (default=0.)
subsample: The fraction of observations to be selected for each tree. Selection is done by random sampling. (default=1.0)
max_features: The number of features to consider while searching for a best split. These will be randomly selected. (default=None)
max_leaf_nodes: The maximum number of terminal nodes or leaves in a tree. (default=None)
min_impurity_decrease: A node will be split if the split induces a decrease of the impurity greater than or equal to this value. (default=0.)

```python
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import time

clf = GradientBoostingClassifier(n_estimators=10, learning_rate=0.1, max_depth=3)

# start training
start_time = time.time()
clf.fit(Xtrain, ytrain)
end_time = time.time()
print('The training time = {}'.format(end_time - start_time))

# prediction and evaluation
pred = clf.predict(Xtest)
accuracy = np.sum(pred == ytest) / pred.shape[0]
print('Test accuracy = {}'.format(accuracy))
```
The training time = 11.989675521850586
Test accuracy = 0.825
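The parameters listed above are usually tuned rather than set by hand. As a minimal sketch (the grid values here are hypothetical, not tuned for MNIST), sklearn's `GridSearchCV` can search over a few of them:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# illustrative grid over three of the parameters described above
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
}
grid = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1),
    param_grid, cv=3, n_jobs=-1,
)
grid.fit(Xtrain, ytrain)
print(grid.best_params_, grid.best_score_)
```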
Looking at clf.feature_importances_, the values range from 0.0 (pixels the trees never use) up to about 0.0249.
In general, we can also filter the features by importance:

```python
from collections import OrderedDict
import matplotlib.pyplot as plt

# keep only features whose importance exceeds 0.01
d = {}
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.01:
        d[i] = clf.feature_importances_[i]

sorted_feature_importances = OrderedDict(sorted(d.items(), key=lambda x: x[1], reverse=True))
D = sorted_feature_importances
rects = plt.bar(range(len(D)), D.values(), align='center')
plt.xticks(range(len(D)), D.keys(), rotation=90)
plt.show()
```

Since the features here are individual pixels, the plot is not very intuitive; with ordinary tabular features the result is usually much easier to read.
XGBoost
XGBoost adds more pruning strategies and regularization terms to control the risk of overfitting. Traditional GBDT uses CART trees as base learners, while XGBoost supports a wider range of base learners, including linear models. GBDT uses only first-order gradients, whereas XGBoost performs a second-order Taylor expansion of the loss function and also supports custom loss functions.
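To make the second-order expansion concrete, this is the standard per-round objective from the XGBoost paper (notation from the paper, not from this post):

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^2(x_i) \Big] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big),\quad
h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big)
```

Here $\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$ penalizes the number of leaves $T$ and the leaf weights $w$; the `gamma` and `lambda` entries in the parameter map below correspond to $\gamma$ and $\lambda$.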
```python
import xgboost as xgb
import numpy as np
import time

# read data into XGBoost DMatrix format
dtrain = xgb.DMatrix(Xtrain, label=ytrain)
dtest = xgb.DMatrix(Xtest, label=ytest)

# specify parameters via map
params = {
    'booster': 'gbtree',          # tree-based models
    'objective': 'multi:softmax',
    'num_class': 10,
    'eta': 0.1,                   # same as learning_rate
    'gamma': 0,                   # similar to min_impurity_decrease in GBDT
    'alpha': 0,                   # L1 regularization term on weights (analogous to Lasso regression)
    'lambda': 2,                  # L2 regularization term on weights (analogous to Ridge regression)
    'max_depth': 3,               # same as max_depth of GBDT
    'subsample': 1,               # same as subsample of GBDT
    'colsample_bytree': 1,        # similar to max_features in GBDT
    'min_child_weight': 1,        # minimum sum of instance weight (Hessian) needed in a child
    'nthread': 1,                 # defaults to the maximum number of threads available if not set
}
num_round = 10

# start training
start_time = time.time()
bst = xgb.train(params, dtrain, num_round)
end_time = time.time()
print('The training time = {}'.format(end_time - start_time))

# get prediction and evaluate
ypred = bst.predict(dtest)
accuracy = np.sum(ypred == ytest) / ypred.shape[0]
print('Test accuracy = {}'.format(accuracy))
```

The training time = 13.496984481811523
Test accuracy = 0.821
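A small usage note, separate from the benchmark above: with `multi:softmax` the booster returns class labels directly, while the `multi:softprob` variant returns per-class probabilities that need an argmax, just like LightGBM below. A hedged sketch using the same data and parameters:

```python
# override only the objective; everything else stays as above
params_prob = dict(params, objective='multi:softprob')
bst_prob = xgb.train(params_prob, dtrain, num_round)
yprob = bst_prob.predict(dtest)      # shape: (n_samples, num_class)
ypred_labels = yprob.argmax(axis=1)  # recover hard labels
```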
XGBoost parameters
LightGBM
Saved for last, LightGBM naturally comes with a list of advantages:
- Faster training speed
- Lower memory usage
- Better accuracy
- Support for parallel learning
- Ability to handle large-scale data
LightGBM abandons the level-wise tree growth strategy used by most GBDT implementations in favor of leaf-wise growth with a depth limit. Level-wise growth splits all leaves on the same level in a single pass over the data, which makes multi-threaded optimization straightforward and keeps model complexity under control, so it is not prone to overfitting. In practice, however, level-wise growth is inefficient: it treats all leaves on a level indiscriminately, and since many of those leaves have low split gain, a lot of searching and splitting effort is wasted.
Leaf-wise growth is a more efficient strategy: at each step it picks, among all current leaves, the one with the highest split gain, splits it, and repeats. Compared with level-wise growth at the same number of splits, leaf-wise growth reduces the loss further and achieves better accuracy. Its drawback is that it can grow quite deep trees and overfit, so LightGBM adds a maximum depth limit on top of leaf-wise growth to prevent overfitting while staying efficient. A rough illustration follows below.
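The sketch below is a toy illustration only, not LightGBM's actual implementation: `best_split` and the node methods are hypothetical stand-ins for the histogram-based internals. The point is that leaf-wise growth can be driven by a priority queue keyed on split gain:

```python
import heapq

def grow_leaf_wise(root, max_leaves=31, max_depth=6):
    """Toy leaf-wise growth: always split the leaf with the largest gain."""
    heap = []
    gain, split = best_split(root)               # hypothetical gain search
    heapq.heappush(heap, (-gain, id(root), root, split))
    num_leaves = 1
    while heap and num_leaves < max_leaves:
        neg_gain, _, leaf, split = heapq.heappop(heap)
        if -neg_gain <= 0 or leaf.depth >= max_depth:
            continue                             # low gain or depth limit hit
        left, right = leaf.apply_split(split)    # hypothetical node method
        num_leaves += 1                          # one leaf replaced by two
        for child in (left, right):
            gain, split = best_split(child)
            heapq.heappush(heap, (-gain, id(child), child, split))
    return root
```

Level-wise growth, by contrast, would split every leaf of the current depth before moving on, regardless of gain.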
Installation guide
```python
import lightgbm as lgb
import numpy as np
import time

train_data = lgb.Dataset(Xtrain, label=ytrain)
test_data = lgb.Dataset(Xtest, label=ytest)

# specify parameters via map
params = {
    'num_leaves': 31,                 # same as max_leaf_nodes in GBDT, but GBDT's default is None
    'max_depth': -1,                  # same as max_depth of XGBoost (-1 means no limit)
    'tree_learner': 'serial',
    'application': 'multiclass',      # same as objective of XGBoost
    'num_class': 10,                  # same as num_class of XGBoost
    'learning_rate': 0.1,             # same as eta of XGBoost
    'min_split_gain': 0,              # same as gamma of XGBoost
    'lambda_l1': 0,                   # same as alpha of XGBoost
    'lambda_l2': 0,                   # same as lambda of XGBoost
    'min_data_in_leaf': 20,           # same as min_samples_leaf of GBDT
    'bagging_fraction': 1.0,          # same as subsample of XGBoost
    'bagging_freq': 0,
    'bagging_seed': 0,
    'feature_fraction': 1.0,          # same as colsample_bytree of XGBoost
    'feature_fraction_seed': 2,
    'min_sum_hessian_in_leaf': 1e-3,  # same as min_child_weight of XGBoost
    'num_threads': 1,
}
num_round = 10

# start training
start_time = time.time()
bst = lgb.train(params, train_data, num_round)
end_time = time.time()
print('The training time = {}'.format(end_time - start_time))

# get prediction and evaluate: predict() returns per-class probabilities,
# so take the argmax over classes to recover the predicted label
ypred_prob = bst.predict(Xtest)
ypred = np.argmax(ypred_prob, axis=1)
accuracy = np.sum(ypred == ytest) / len(ypred)
print('Test accuracy = {}'.format(accuracy))
```

The training time = 4.891559839248657
Test accuracy = 0.902
Parameter explanations
Results comparison
|          | time (s) | accuracy |
|----------|----------|----------|
| GBDT     | 11.98    | 0.825    |
| XGBoost  | 13.49    | 0.821    |
| LightGBM | 4.89     | 0.902    |
http://lightgbm.apachecn.org/cn/latest/Parameters-Tuning.html
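Following the tuning guide linked above, here is a hedged example of settings one might try next (all keys are real LightGBM parameters, but the values are illustrative and not tuned on this dataset):

```python
params_tuned = dict(params)
params_tuned.update({
    'num_leaves': 63,           # more leaves: higher capacity, watch for overfitting
    'min_data_in_leaf': 50,     # larger leaves counteract overly deep leaf-wise trees
    'learning_rate': 0.05,      # lower rate, paired with more boosting rounds
    'bagging_fraction': 0.8,    # row subsampling for speed and regularization
    'bagging_freq': 5,          # perform bagging every 5 iterations
    'feature_fraction': 0.8,    # column subsampling
})
bst_tuned = lgb.train(params_tuned, train_data, num_boost_round=100)
```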
Summary
On this small MNIST subsample, LightGBM trains in less than half the time of sklearn's GBDT and XGBoost while reaching the highest accuracy (0.902 vs 0.825 and 0.821), consistent with the advantages listed above.