lightgbm: Introduction, Installation, and Usage (A Detailed Guide)
Contents
Introduction to lightgbm
Installing lightgbm
Using lightgbm
1. class lightgbm.Dataset
2. The LGBMRegressor class
Introduction to lightgbm
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is distributed and efficient, with the following advantages:
- Optimized speed and memory usage
  - Reduced cost of computing the split gain
  - Further speedup via histogram subtraction
  - Reduced memory usage
  - Reduced communication cost for parallel learning
  - Sparse optimization
- Optimized accuracy
  - Leaf-wise (best-first) tree growth strategy
  - Optimal splits for categorical features
- Optimized network communication
- Optimized parallel learning
  - Feature parallelism
  - Data parallelism
  - Voting parallelism
- GPU support, enabling large-scale data
1. Efficiency
For the efficiency comparison, we run only the training process, without any test or metric output, and we do not count I/O time. The timing comparison is as follows (the header row was lost in the original; the three value columns correspond to xgboost, xgboost_hist, and LightGBM, consistent with the conclusion below):

| Data | xgboost | xgboost_hist | LightGBM |
| --- | --- | --- | --- |
| Higgs | 3794.34 s | 551.898 s | 238.505513 s |
| Yahoo LTR | 674.322 s | 265.302 s | 150.18644 s |
| MS LTR | 1251.27 s | 385.201 s | 215.320316 s |
| Expo | 1607.35 s | 588.253 s | 138.504179 s |
| Allstate | 2867.22 s | 1355.71 s | 348.084475 s |

We find that LightGBM is faster than xgboost on all datasets.
2. Accuracy
For the accuracy comparison, we use the metrics on the test-set portion of each dataset for a fair comparison (columns as above: xgboost, xgboost_hist, LightGBM):

| Data | Metric | xgboost | xgboost_hist | LightGBM |
| --- | --- | --- | --- | --- |
| Higgs | AUC | 0.839593 | 0.845605 | 0.845154 |
| Yahoo LTR | NDCG<sub>1</sub> | 0.719748 | 0.720223 | 0.732466 |
|  | NDCG<sub>3</sub> | 0.717813 | 0.721519 | 0.738048 |
|  | NDCG<sub>5</sub> | 0.737849 | 0.739904 | 0.756548 |
|  | NDCG<sub>10</sub> | 0.78089 | 0.783013 | 0.796818 |
| MS LTR | NDCG<sub>1</sub> | 0.483956 | 0.488649 | 0.524255 |
|  | NDCG<sub>3</sub> | 0.467951 | 0.473184 | 0.505327 |
|  | NDCG<sub>5</sub> | 0.472476 | 0.477438 | 0.510007 |
|  | NDCG<sub>10</sub> | 0.492429 | 0.496967 | 0.527371 |
| Expo | AUC | 0.756713 | 0.777777 | 0.777543 |
| Allstate |  |  |  |  |
3. Memory consumption
We monitor RES while running the training task, and set two_round=true in LightGBM (this increases data-loading time but reduces peak memory usage, without affecting training speed or accuracy).

| Data | xgboost | xgboost_hist | LightGBM |
| --- | --- | --- | --- |
| Higgs | 4.853GB | 3.784GB | 0.868GB |
| Yahoo LTR | 1.907GB | 1.468GB | 0.831GB |
| MS LTR | 5.469GB | 3.654GB | 0.886GB |
| Expo | 1.553GB | 1.393GB | 0.543GB |
| Allstate | 6.237GB | 4.990GB |  |
4. Overview
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. It can be used for ranking, classification, regression, and many other machine learning tasks.
GBDT is a popular machine learning algorithm, but when the feature dimension is high or the data volume is large, its effectiveness and scalability become unsatisfactory. LightGBM introduces GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) to address this, and can reach the same accuracy as conventional GBDT about 20x faster.
In competitions, XGBoost is well known as an excellent boosting framework, but in practice its training takes a long time and its memory footprint is large. In January 2017, Microsoft open-sourced a new boosting tool on GitHub: LightGBM. Without reducing accuracy, it is roughly 10x faster and uses about one third of the memory. Because it is based on decision tree algorithms, it grows trees with a leaf-wise strategy, always splitting the leaf with the largest gain, whereas most other boosting algorithms grow trees level-wise (by depth). When grown to the same number of leaves, the leaf-wise strategy reduces the loss more than the level-wise strategy, which leads to higher accuracy than other existing boosting algorithms. At the same time, its speed is striking, which is where the "Light" in its name comes from.
LightGBM Chinese documentation: http://lightgbm.apachecn.org/#/
lightgbm github:https://github.com/Microsoft/LightGBM
lightgbm pypi:https://pypi.org/project/lightgbm/
Installing lightgbm

pip install lightgbm

Using lightgbm
1. class lightgbm.Dataset
class lightgbm.Dataset(data, label=None, max_bin=None, reference=None, weight=None, group=None, init_score=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)
Parameters:
- data (string, numpy array or scipy.sparse) – Data source of Dataset. If string, it represents the path to a txt file.
- label (list, numpy 1-D array or None, optional (default=None)) – Label of the data.
- max_bin (int or None, optional (default=None)) – Max number of discrete bins for features. If None, the default value from the parameters of the CLI version will be used.
- reference (Dataset or None, optional (default=None)) – If this is a Dataset for validation, the training data should be used as reference.
- weight (list, numpy 1-D array or None, optional (default=None)) – Weight for each instance.
- group (list, numpy 1-D array or None, optional (default=None)) – Group/query size for Dataset.
- init_score (list, numpy 1-D array or None, optional (default=None)) – Init score for Dataset.
- silent (bool, optional (default=False)) – Whether to print messages during construction.
- feature_name (list of strings or 'auto', optional (default="auto")) – Feature names. If 'auto' and data is a pandas DataFrame, the data column names are used.
- categorical_feature (list of strings or int, or 'auto', optional (default="auto")) – Categorical features. If a list of int, interpreted as indices. If a list of strings, interpreted as feature names (need to specify feature_name as well). If 'auto' and data is a pandas DataFrame, pandas categorical columns are used.
- params (dict or None, optional (default=None)) – Other parameters.
- free_raw_data (bool, optional (default=True)) – If True, raw data is freed after constructing the inner Dataset.
2. The LGBMRegressor class
https://lightgbm.readthedocs.io/en/latest/Python-API.html?highlight=LGBMRegressor
LGBMRegressor shares its constructor parameters with the base class lightgbm.LGBMModel:
class lightgbm.LGBMModel(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, silent=True, importance_type='split', **kwargs)
- boosting_type (string, optional (default='gbdt')) – 'gbdt', traditional Gradient Boosting Decision Tree. 'dart', Dropouts meet Multiple Additive Regression Trees. 'goss', Gradient-based One-Side Sampling. 'rf', Random Forest.
- num_leaves (int, optional (default=31)) – Maximum tree leaves for base learners.
- max_depth (int, optional (default=-1)) – Maximum tree depth for base learners; -1 means no limit.
- learning_rate (float, optional (default=0.1)) – Boosting learning rate. You can use the callbacks parameter of the fit method to shrink/adapt the learning rate during training via the reset_parameter callback. Note that this will ignore the learning_rate argument in training.
- n_estimators (int, optional (default=100)) – Number of boosted trees to fit.
- subsample_for_bin (int, optional (default=200000)) – Number of samples for constructing bins.
- objective (string, callable or None, optional (default=None)) – Specify the learning task and the corresponding learning objective, or a custom objective function to be used. Default: 'regression' for LGBMRegressor, 'binary' or 'multiclass' for LGBMClassifier, 'lambdarank' for LGBMRanker.
- class_weight (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification tasks; for binary classification you may use the is_unbalance or scale_pos_weight parameters. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
- min_split_gain (float, optional (default=0.)) – Minimum loss reduction required to make a further partition on a leaf node of the tree.
- min_child_weight (float, optional (default=1e-3)) – Minimum sum of instance weight (hessian) needed in a child (leaf).
- min_child_samples (int, optional (default=20)) – Minimum number of data needed in a child (leaf).
- subsample (float, optional (default=1.)) – Subsample ratio of the training instances.
- subsample_freq (int, optional (default=0)) – Frequency of subsampling; <=0 means disabled.
- colsample_bytree (float, optional (default=1.)) – Subsample ratio of columns when constructing each tree.
- reg_alpha (float, optional (default=0.)) – L1 regularization term on weights.
- reg_lambda (float, optional (default=0.)) – L2 regularization term on weights.
- random_state (int or None, optional (default=None)) – Random number seed. If None, default seeds in the C++ code will be used.
- n_jobs (int, optional (default=-1)) – Number of parallel threads.
- silent (bool, optional (default=True)) – Whether to print messages while running boosting.
- importance_type (string, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If 'split', the result contains the number of times the feature is used in the model. If 'gain', the result contains the total gain of splits which use the feature.
- bagging_fraction – native-parameter alias of subsample.
- feature_fraction – native-parameter alias of colsample_bytree.
- min_data_in_leaf – native-parameter alias of min_child_samples.
- min_sum_hessian_in_leaf – native-parameter alias of min_child_weight.