[Machine Learning Basics] LightGBM Operations You Should Know!
LightGBM is a fast, parallelizable tree-model framework that integrates several ensemble-learning ideas. Compared with XGBoost, it reworks how tree nodes are split, so it uses less memory and trains faster.
LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/
Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html
This article covers the topics below; see the end of the article for how to get the original code.
1 Installation
2 Usage
2.1 Defining the Dataset
2.2 Model Training
2.3 Saving and Loading the Model
2.4 Viewing Feature Importance
2.5 Continuing Training
2.6 Adjusting Hyperparameters During Training
2.7 Custom Loss Functions
2.8 Tuning Methods
Manual Tuning
Grid Search
Bayesian Optimization
1 Installation
Installing LightGBM is simple, and on Linux it is easy to enable GPU training. Try pip first and fall back to building from source if that fails. Both routes are sketched below.
Option 1: install via pip
Option 2: build from source
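A minimal sketch of both routes, following the official installation guide (the exact GPU build flag may vary between LightGBM versions):

# install via pip (CPU build)
pip install lightgbm

# build from source on Linux; --recursive pulls in the submodules
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM && mkdir build && cd build
cmake ..        # pass -DUSE_GPU=1 here to enable GPU training
make -j4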
2 Usage
In Python, LightGBM offers two calling styles: the native API and the Scikit-learn API. Both can train and validate a model; the native API is more flexible, so pick whichever suits your habits.
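As a quick side-by-side, here is the same small model in both styles (a minimal sketch on sklearn's built-in breast-cancer data, not from the original article):

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# native API: wrap the data in lgb.Dataset and call lgb.train
dtrain = lgb.Dataset(X_tr, y_tr)
booster = lgb.train({'objective': 'binary', 'verbose': -1}, dtrain, num_boost_round=50)

# Scikit-learn API: a drop-in estimator with fit/predict
clf = lgb.LGBMClassifier(n_estimators=50)
clf.fit(X_tr, y_tr)
print(clf.score(X_va, y_va))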
2.1 Defining the Dataset
import json
import numpy as np
import pandas as pd
import lightgbm as lgb

df_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train', header=None, sep='\t')
df_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test', header=None, sep='\t')
W_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train.weight', header=None)[0]
W_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test.weight', header=None)[0]

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)
num_train, num_feature = X_train.shape

# create dataset for lightgbm
# if you want to re-use data, remember to set free_raw_data=False
lgb_train = lgb.Dataset(X_train, y_train, weight=W_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, weight=W_test, free_raw_data=False)
2.2 Model Training
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# generate feature names
feature_name = ['feature_' + str(col) for col in range(num_feature)]

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # eval training data
                feature_name=feature_name,
                categorical_feature=[21])
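In practice you usually evaluate on a held-out set and stop when the metric plateaus. A minimal sketch (the lgb.early_stopping callback is available in recent LightGBM versions; older versions used an early_stopping_rounds argument to lgb.train instead):

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_eval],                           # evaluate on held-out data
                feature_name=feature_name,
                categorical_feature=[21],
                callbacks=[lgb.early_stopping(stopping_rounds=5)])
print('Best iteration:', gbm.best_iteration)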
2.3 Saving and Loading the Model
# save model to file
gbm.save_model('model.txt')

print('Dumping model to JSON...')
model_json = gbm.dump_model()
with open('model.json', 'w+') as f:
    json.dump(model_json, f, indent=4)
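The snippet above only saves; loading is symmetric. A minimal sketch using the standard lgb.Booster API:

# load the model from file and predict with it
bst = lgb.Booster(model_file='model.txt')
y_pred = bst.predict(X_test)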
2.4 Viewing Feature Importance
# feature names
print('Feature names:', gbm.feature_name())

# feature importances
print('Feature importances:', list(gbm.feature_importance()))
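feature_importance also accepts an importance_type switch, and LightGBM ships a matplotlib helper. A short sketch (requires matplotlib, which is not imported in the original code):

import matplotlib.pyplot as plt

# 'split' (default) counts how often a feature is used; 'gain' sums its split gains
print(gbm.feature_importance(importance_type='gain'))
lgb.plot_importance(gbm, max_num_features=10)
plt.show()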
2.5 Continuing Training
# continue training
# init_model accepts:
# 1. model file name
# 2. Booster()
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model='model.txt',
                valid_sets=lgb_eval)
print('Finished 10 - 20 rounds with model file...')
2.6 Adjusting Hyperparameters During Training
# decay learning rates
# learning_rates accepts:
# 1. list/tuple with length = num_boost_round
# 2. function(curr_iter)
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                learning_rates=lambda iter: 0.05 * (0.99 ** iter),
                valid_sets=lgb_eval)
print('Finished 20 - 30 rounds with decay learning rates...')

# change other parameters during training
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])
print('Finished 30 - 40 rounds with changing bagging_fraction...')
2.7 Custom Loss Functions
# self-defined objective function
# f(preds: array, train_data: Dataset) -> grad: array, hess: array
# log likelihood loss
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess

# self-defined eval metric
# f(preds: array, train_data: Dataset) -> name: string, eval_result: float, is_higher_better: bool
# binary error
# NOTE: with a customized loss function, the default prediction value is the raw margin.
# This may make built-in evaluation metrics compute wrong results.
# For example, with log likelihood loss the prediction is the score before the logistic transformation.
# Keep this in mind when you use the customization.
def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.5)), False

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                fobj=loglikelihood,
                feval=binary_error,
                valid_sets=lgb_eval)
print('Finished 40 - 50 rounds with self-defined objective function and eval metric...')
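Because a custom objective makes predict() return raw margins, remember to apply the sigmoid yourself before thresholding. A minimal sketch (also note that in LightGBM >= 4.0 the fobj argument was removed and a callable is passed via params['objective'] instead; the code above follows the older API):

raw_scores = gbm.predict(X_test)            # raw margins, not probabilities
proba = 1. / (1. + np.exp(-raw_scores))     # apply the logistic transformation
y_pred = (proba > 0.5).astype(int)
print('Test error:', np.mean(y_pred != y_test.values))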
2.8 Tuning Methods
Manual Tuning
The checklists below are adapted from the official parameters-tuning guide; a sketch that puts several of these knobs together follows the lists.
For Faster Speed
Use bagging by setting bagging_fraction and bagging_freq
Use feature sub-sampling by setting feature_fraction
Use small max_bin
Use save_binary to speed up data loading in future learning
Use parallel learning; refer to the Parallel Learning Guide
For Better Accuracy
Use large max_bin (may be slower)
Use small learning_rate with large num_iterations
Use large num_leaves (may cause over-fitting)
Use bigger training data
Try dart
Deal with Over-fitting
Use small max_bin
Use small num_leaves
Use min_data_in_leaf and min_sum_hessian_in_leaf
Use bagging by setting bagging_fraction and bagging_freq
Use feature sub-sampling by setting feature_fraction
Use bigger training data
Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
Try max_depth to avoid growing deep tree
Try extra_trees
Try increasing path_smooth
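As a worked illustration, here is one way these knobs could be combined into a conservative, over-fitting-resistant configuration (the specific values are illustrative assumptions, not recommendations from the original article):

params_regularized = {
    'objective': 'binary',
    'boosting_type': 'gbdt',    # try 'dart' for better accuracy
    'learning_rate': 0.01,      # small learning_rate ...
    'num_iterations': 2000,     # ... with large num_iterations
    'num_leaves': 31,           # keep small to limit over-fitting
    'max_depth': 6,             # avoid growing very deep trees
    'min_data_in_leaf': 50,
    'feature_fraction': 0.8,    # feature sub-sampling
    'bagging_fraction': 0.8,    # bagging ...
    'bagging_freq': 5,          # ... re-sampled every 5 iterations
    'lambda_l1': 0.1,           # L1/L2 regularization
    'lambda_l2': 0.1,
}
gbm = lgb.train(params_regularized, lgb_train, valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=100)])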
Grid Search
from sklearn.model_selection import GridSearchCV

lg = lgb.LGBMClassifier(silent=False)
param_dist = {"max_depth": [4, 5, 7],
              "learning_rate": [0.01, 0.05, 0.1],
              "num_leaves": [300, 900, 1200],
              "n_estimators": [50, 100, 150]}

grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv=5,
                           scoring="roc_auc", verbose=5)
grid_search.fit(X_train, y_train)
grid_search.best_estimator_, grid_search.best_score_
Bayesian Optimization
import warnings
warnings.filterwarnings("ignore")
from bayes_opt import BayesianOptimization

def lgb_eval(max_depth, learning_rate, num_leaves, n_estimators):
    params = {"metric": 'auc'}
    params['max_depth'] = int(max(max_depth, 1))
    params['learning_rate'] = np.clip(learning_rate, 0, 1)
    params['num_leaves'] = int(max(num_leaves, 1))
    params['n_estimators'] = int(max(n_estimators, 1))
    # cross-validate on the Dataset defined in section 2.1
    cv_result = lgb.cv(params, lgb_train, nfold=5, seed=0,
                       verbose_eval=200, stratified=False)
    return np.array(cv_result['auc-mean']).max()

lgbBO = BayesianOptimization(lgb_eval, {'max_depth': (4, 8),
                                        'learning_rate': (0.05, 0.2),
                                        'num_leaves': (20, 1500),
                                        'n_estimators': (5, 200)},
                             random_state=0)
lgbBO.maximize(init_points=5, n_iter=50, acq='ei')
print(lgbBO.max)
To get the notebook with the full code for this article, reply [lgb] to the author's WeChat public account "datawhale".
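bayes_opt searches in continuous space, so the best point needs the same rounding before a final fit. A minimal sketch (note that the acq argument to maximize() was removed in newer bayes_opt releases, where the acquisition function is configured on the optimizer instead):

best = lgbBO.max['params']
final_params = {
    'objective': 'binary',
    'metric': 'auc',
    'max_depth': int(best['max_depth']),        # round continuous proposals back to ints
    'learning_rate': best['learning_rate'],
    'num_leaves': int(best['num_leaves']),
    'n_estimators': int(best['n_estimators']),
}
final_model = lgb.train(final_params, lgb_train)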