Hands-On: Bayesian Optimization for a GBDT (LightGBM) Classification Task, Compared with Random Search
Table of Contents:
- 1. Data Preprocessing
  - 1.1 Load, Clean & Split the Data
  - 1.2 Label Distribution
- 2. Building a Baseline Model
  - 2.1 LightGBM Modeling
  - 2.2 Performance with Default Parameters
- 3. Defining the Parameter Space
  - 3.* Sampling from the Parameter Space
- 4. Random Search
  - 4.1 Cross-Validating LightGBM
  - 4.2 Objective Function
  - 4.3 Running Random Search
  - 4.4 Random Search Results
- 5. Bayesian Optimization
  - 5.1 Objective Function
  - 5.2 Domain Space
    - 5.2.1 Learning Rate Distribution
    - 5.2.2 Number-of-Leaves Distribution
    - 5.2.3 boosting_type
    - 5.2.4 Full Parameter Space
      - 5.2.4.* A Look at a Sampled Parameter Set
  - 5.3 Preparing for Bayesian Optimization
  - 5.4 Bayesian Optimization Results
    - 5.4.1 Saving the Results
    - 5.4.2 Performance on the Test Set
- 6. Random Search vs. Bayesian Optimization
  - 6.1 Visualizing the Tuning Process
  - 6.2 Learning Rate Comparison
  - 6.3 Boosting Type Comparison
  - 6.4 Numeric Parameter Comparison
- 7. How Parameters Evolve During Bayesian Optimization
  - 7.1 Boosting Type over Iterations
  - 7.2 Learning Rate, Number of Leaves, etc. over Iterations
  - 7.3 reg_alpha and reg_lambda over Iterations
  - 7.4 Loss over Iterations: Random vs. Bayesian
  - 7.5 Saving the Results
We use an insurance dataset (the Caravan Insurance Challenge) for a GBDT classification task, tuning LightGBM with both Bayesian optimization and random search and comparing the two approaches.
1. Data Preprocessing
1.1 Load, Clean & Split the Data
import pandas as pd
import numpy as np

data = pd.read_csv('caravan-insurance-challenge.csv')
data.head()

train = data[data['ORIGIN'] == 'train']
test = data[data['ORIGIN'] == 'test']

train_labels = np.array(train['CARAVAN'].astype(np.int32)).reshape((-1,))
test_labels = np.array(test['CARAVAN'].astype(np.int32)).reshape((-1,))

train = train.drop(['ORIGIN', 'CARAVAN'], axis = 1)
test = test.drop(['ORIGIN', 'CARAVAN'], axis = 1)

features = np.array(train)
test_features = np.array(test)
labels = train_labels[:]

print('Train shape:', train.shape)
print('Test shape:', test.shape)
train.head()

1.2 Label Distribution
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.hist(labels, edgecolor = 'k')
plt.xlabel('Label'); plt.ylabel('Count'); plt.title('Count of Labels')
The labels are heavily imbalanced, so we evaluate with the ROC curve; from here on, the goal is to make the AUC as large as possible.
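As a quick sanity check on the imbalance, the positive rate can be computed directly from the label array built above (a minimal sketch reusing the labels variable from the preprocessing cell):

# Only a small fraction of the training labels are positive, which is why
# plain accuracy would be misleading and ROC AUC is used as the metric instead.
positive_rate = labels.mean()
print('Positive rate: {:.2%} ({} of {} samples)'.format(
    positive_rate, labels.sum(), len(labels)))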
2. Building a Baseline Model
2.1 LightGBM Modeling
import lightgbm as lgb

model = lgb.LGBMClassifier()
model

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
2.2 Performance with Default Parameters
This is our baseline model; the goal of tuning is to push its AUC as high above this as possible.
from sklearn.metrics import roc_auc_score
from timeit import default_timer as timer

start = timer()
model.fit(features, labels)
train_time = timer() - start

predictions = model.predict_proba(test_features)[:, 1]
auc = roc_auc_score(test_labels, predictions)

print('The baseline score on the test set is {:.4f}.'.format(auc))
print('The baseline training time is {:.4f} seconds.'.format(train_time))

The baseline score on the test set is 0.7092.
The baseline training time is 0.3402 seconds.
3. Defining the Parameter Space
RandomizedSearchCV has no early-stopping support, so we write the search loop ourselves.
Some parameters, such as the learning rate, are sampled on a log scale: their effect is multiplicative, so common practice is to use a log-uniform distribution (the short sketch below illustrates the difference). Other parameters can only be chosen conditionally on the value of another parameter, e.g. subsample depends on boosting_type.
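To see why the log scale matters, here is a minimal sketch (the 0.005-0.2 range matches the grid defined below) comparing how much of the sampling mass falls below 0.05 under a log-uniform versus a plain uniform distribution:

import numpy as np

low, high = 0.005, 0.2
log_uniform = np.exp(np.random.uniform(np.log(low), np.log(high), 100000))
plain_uniform = np.random.uniform(low, high, 100000)

# Log-uniform spends far more of its budget on small learning rates,
# where a difference of 0.01 matters much more than it does near 0.2.
print('P(lr < 0.05) log-uniform: {:.2f}'.format((log_uniform < 0.05).mean()))
print('P(lr < 0.05) uniform:     {:.2f}'.format((plain_uniform < 0.05).mean()))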
import random

param_grid = {
    'class_weight': [None, 'balanced'],
    'boosting_type': ['gbdt', 'goss', 'dart'],
    'num_leaves': list(range(30, 150)),
    'learning_rate': list(np.logspace(np.log(0.005), np.log(0.2), base=np.exp(1), num=800)),
    'subsample_for_bin': list(range(20000, 300000, 20000)),
    'min_child_samples': list(range(20, 500, 5)),
    'reg_alpha': list(np.linspace(0, 1)),
    'reg_lambda': list(np.linspace(0, 1)),
    'colsample_bytree': list(np.linspace(0.6, 1, 10))
}

# subsample only applies when boosting_type != 'goss', so it is kept in a
# separate distribution and sampled conditionally.
subsample_dist = list(np.linspace(0.5, 1, 100))

# Distribution of the learning rate
plt.hist(param_grid['learning_rate'], color = 'r', edgecolor = 'k')
plt.xlabel('Learning Rate'); plt.ylabel('Count'); plt.title('Learning Rate Distribution', size = 18)

# Distribution of the number of leaves
plt.hist(param_grid['num_leaves'], color = 'm', edgecolor = 'k')
plt.xlabel('Number of Leaves'); plt.ylabel('Count'); plt.title('Number of Leaves Distribution')

3.* Sampling from the Parameter Space
{key: random.sample(value, 2) for key, value in param_grid.items()}

params = {key: random.sample(value, 1)[0] for key, value in param_grid.items()}
params['subsample'] = random.sample(subsample_dist, 1)[0] if params['boosting_type'] != 'goss' else 1.0
params

{'class_weight': 'balanced', 'boosting_type': 'gbdt',
 'num_leaves': 149, 'learning_rate': 0.024474734290096542,
 'subsample_for_bin': 200000, 'min_child_samples': 110,
 'reg_alpha': 0.8163265306122448, 'reg_lambda': 0.26530612244897955,
 'colsample_bytree': 0.6888888888888889, 'subsample': 0.8282828282828283}
4. Random Search
4.1 Cross-Validating LightGBM
# Create a lgb dataset
train_set = lgb.Dataset(features, label = labels)

# early_stopping_rounds = 80: stop if there is no improvement for 80 consecutive boosting rounds
r = lgb.cv(params, train_set, num_boost_round = 10000, nfold = 10, metrics = 'auc',
           early_stopping_rounds = 80, verbose_eval = False, seed = 42)

r_best = np.max(r['auc-mean'])                          # Highest score
r_best_std = r['auc-stdv'][np.argmax(r['auc-mean'])]    # Standard deviation of the best score

print('The maximum ROC AUC on the validation set was {:.5f} with std of {:.5f}.'.format(r_best, r_best_std))
print('The ideal number of iterations was {}.'.format(np.argmax(r['auc-mean']) + 1))

The maximum ROC AUC on the validation set was 0.75553 with std of 0.03082.
The ideal number of iterations was 73.
4.2 Objective Function
ROC AUC is our objective; since the search minimizes a loss, the objective function returns 1 - AUC.
Max_evals = 200
N_folds = 3

def random_objective(params, iteration, n_folds = N_folds):
    start = timer()
    cv_results = lgb.cv(params, train_set, num_boost_round = 10000, nfold = n_folds,
                        early_stopping_rounds = 80, metrics = 'auc', seed = 42)
    end = timer()
    best_score = np.max(cv_results['auc-mean'])
    loss = 1 - best_score
    n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)
    return [loss, params, iteration, n_estimators, end - start]

4.3 Running Random Search
random.seed(42)

# DataFrame to hold the results (columns match the list returned by random_objective)
random_results = pd.DataFrame(columns = ['loss', 'params', 'iteration', 'estimators', 'time'],
                              index = list(range(Max_evals)))

for i in range(Max_evals):
    params = {key: random.sample(value, 1)[0] for key, value in param_grid.items()}
    if params['boosting_type'] == 'goss':
        params['subsample'] = 1.0
    else:
        params['subsample'] = random.sample(subsample_dist, 1)[0]
    results_list = random_objective(params, i)
    random_results.loc[i, :] = results_list

random_results.sort_values('loss', ascending = True, inplace = True)
random_results.reset_index(inplace = True, drop = True)
random_results.head()

4.4 Random Search Results
random_results.loc[0, 'params']

{'class_weight': None, 'boosting_type': 'dart', 'num_leaves': 112,
 'learning_rate': 0.020631460653340816, 'subsample_for_bin': 160000,
 'min_child_samples': 220, 'reg_alpha': 0.9795918367346939,
 'reg_lambda': 0.08163265306, 'colsample_bytree': 0.6, 'subsample': 0.7929292929292929}
The best model from random search scores 0.7179 on the test data.
This was achieved using 38 search iterations.
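The evaluation code behind the two lines above is not shown; a minimal sketch of how it is presumably obtained (refitting a model with the best random-search parameters and their ideal number of estimators, then scoring on the held-out test set) is below. It reuses objects defined earlier and also defines best_random_params, which section 6 refers to.

# Rebuild a classifier from the best random-search trial and score it on the test set.
best_random_params = random_results.loc[0, 'params'].copy()
best_random_estimators = int(random_results.loc[0, 'estimators'])

best_random_model = lgb.LGBMClassifier(n_estimators = best_random_estimators, n_jobs = -1,
                                       objective = 'binary', **best_random_params, random_state = 42)
best_random_model.fit(features, labels)

preds = best_random_model.predict_proba(test_features)[:, 1]
print('The best model from random search scores {:.4f} on the test data.'.format(
    roc_auc_score(test_labels, preds)))
print('This was achieved using {} search iterations.'.format(random_results.loc[0, 'iteration']))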
5. Bayesian Optimization
5.1 Objective Function
import csv
from hyperopt import STATUS_OK
from timeit import default_timer as timer

def objective(params, n_folds = N_folds):
    global ITERATION
    ITERATION += 1

    # Retrieve the conditional subsample value, then flatten the nested boosting_type dict
    subsample = params['boosting_type'].get('subsample', 1.0)
    params['boosting_type'] = params['boosting_type']['boosting_type']
    params['subsample'] = subsample

    # hyperopt returns floats; these parameters must be integers
    for parameter_name in ['num_leaves', 'subsample_for_bin', 'min_child_samples']:
        params[parameter_name] = int(params[parameter_name])

    start = timer()
    cv_results = lgb.cv(params, train_set, num_boost_round = 10000, nfold = n_folds,
                        early_stopping_rounds = 80, metrics = 'auc', seed = 42)
    run_time = timer() - start

    best_score = np.max(cv_results['auc-mean'])
    loss = 1 - best_score
    n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)

    # Append this trial to the csv file
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, params, ITERATION, n_estimators, run_time])
    of_connection.close()

    return {'loss': loss, 'params': params, 'iteration': ITERATION,
            'estimators': n_estimators, 'train_time': run_time, 'status': STATUS_OK}

5.2 Domain Space
5.2.1 Learning Rate Distribution
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

learning_rate = {'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.2))}

learning_rate_dist = []
for _ in range(10000):
    learning_rate_dist.append(sample(learning_rate)['learning_rate'])

plt.figure(figsize = (8, 6))
sns.kdeplot(learning_rate_dist, color = 'r', linewidth = 2, shade = True)
plt.title('Learning Rate Distribution', size = 18)
plt.xlabel('Learning Rate', size = 16)
plt.ylabel('Density', size = 16)

5.2.2 Number-of-Leaves Distribution
What quniform sampling looks like (uniform over the range, quantized to integer steps):
num_leaves = {'num_leaves': hp.quniform('num_leaves', 30, 150, 1)}

num_leaves_dist = []
for _ in range(10000):
    num_leaves_dist.append(sample(num_leaves)['num_leaves'])

plt.figure(figsize = (8, 6))
sns.kdeplot(num_leaves_dist, linewidth = 2, shade = True)
plt.title('Number of Leaves Distribution', size = 18)
plt.xlabel('Number of Leaves', size = 16)
plt.ylabel('Density', size = 16)

5.2.3 boosting_type
boosting_type = {'boosting_type': hp.choice('boosting_type',
                    [{'boosting_type': 'gbdt', 'subsample': hp.uniform('subsample', 0.5, 1)},
                     {'boosting_type': 'dart', 'subsample': hp.uniform('subsample', 0.5, 1)},
                     {'boosting_type': 'goss', 'subsample': 1.0}])}

params = sample(boosting_type)
params

{'boosting_type': {'boosting_type': 'gbdt', 'subsample': 0.659771523544347}}

subsample = params['boosting_type'].get('subsample', 1.0)
params['boosting_type'] = params['boosting_type']['boosting_type']
params['subsample'] = subsample
params

{'boosting_type': 'gbdt', 'subsample': 0.659771523544347}
5.2.4 Full Parameter Space
space = {
    'class_weight': hp.choice('class_weight', [None, 'balanced']),
    'boosting_type': hp.choice('boosting_type',
                        [{'boosting_type': 'gbdt', 'subsample': hp.uniform('gdbt_subsample', 0.5, 1)},
                         {'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1)},
                         {'boosting_type': 'goss', 'subsample': 1.0}]),
    'num_leaves': hp.quniform('num_leaves', 30, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)
}

5.2.4.* A Look at a Sampled Parameter Set
x = sample(space)

subsample = x['boosting_type'].get('subsample', 1.0)
x['boosting_type'] = x['boosting_type']['boosting_type']
x['subsample'] = subsample
x

{'boosting_type': 'goss',
 'class_weight': 'balanced',
 'colsample_bytree': 0.6765996025430209,
 'learning_rate': 0.13232409656402305,
 'min_child_samples': 330.0,
 'num_leaves': 103.0,
 'reg_alpha': 0.5849415659238283,
 'reg_lambda': 0.4787001151843524,
 'subsample_for_bin': 100000.0,
 'subsample': 1.0}

Note that quniform parameters come back as floats (e.g. num_leaves: 103.0), which is why the objective function casts them to int before passing them to LightGBM.
5.3 Preparing for Bayesian Optimization
from hyperopt import tpe
tpe_algorithm = tpe.suggest

from hyperopt import Trials
bayes_trials = Trials()

# Save every trial to a csv file as the search runs
out_file = 'gbm_trials.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'estimators', 'train_time'])
of_connection.close()

5.4 Bayesian Optimization Results
from hyperopt import fmin

# Global variable
global ITERATION
ITERATION = 0

# Run optimization
best = fmin(fn = objective, space = space, algo = tpe.suggest, max_evals = Max_evals,
            trials = bayes_trials, rstate = np.random.RandomState(42))

# Sort the trials with lowest loss (highest AUC) first
bayes_trials_results = sorted(bayes_trials.results, key = lambda x: x['loss'])
bayes_trials_results[:1]

[{'loss': 0.23670902556787576,
  'params': {'boosting_type': 'dart',
   'class_weight': None,
   'colsample_bytree': 0.6777142263201398,
   'learning_rate': 0.10896162558676845,
   'min_child_samples': 200,
   'num_leaves': 50,
   'reg_alpha': 0.75201502515923,
   'reg_lambda': 0.2500317899561674,
   'subsample_for_bin': 220000,
   'subsample': 0.8299430626318801},
  'iteration': 109,
  'estimators': 39,
  'train_time': 135.7437369420004,
  'status': 'ok'}]
5.4.1 Saving the Results
results = pd.read_csv('gbm_trials.csv')
results.sort_values('loss', ascending = True, inplace = True)
results.reset_index(inplace = True, drop = True)
print(results.shape)
results.head()

import ast
# For safety, use ast.literal_eval() rather than eval() when converting the stored string back to a dict
ast.literal_eval(results.loc[0, 'params'])

{'boosting_type': 'dart',
 'class_weight': None,
 'colsample_bytree': 0.6777142263201398,
 'learning_rate': 0.10896162558676845,
 'min_child_samples': 200,
 'num_leaves': 50,
 'reg_alpha': 0.75201502515923,
 'reg_lambda': 0.2500317899561674,
 'subsample_for_bin': 220000,
 'subsample': 0.8299430626318801}
5.4.2 Performance on the Test Set
best_bayes_estimators = int(results.loc[0, 'estimators'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

best_bayes_model = lgb.LGBMClassifier(n_estimators = best_bayes_estimators, n_jobs = -1,
                                      objective = 'binary', **best_bayes_params, random_state = 42)
best_bayes_model.fit(features, labels)

LGBMClassifier(boosting_type='dart', class_weight=None,
               colsample_bytree=0.6777142263201398, importance_type='split',
               learning_rate=0.10896162558676845, max_depth=-1,
               min_child_samples=200, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=39, n_jobs=-1, num_leaves=50, objective='binary',
               random_state=42, reg_alpha=0.75201502515923,
               reg_lambda=0.2500317899561674, silent=True,
               subsample=0.8299430626318801, subsample_for_bin=220000,
               subsample_freq=0)
The best model from Bayes optimization scores 0.7275 AUC ROC on the test set.
This was achieved after 109 search iterations.
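As with random search, the scoring code behind the two lines above is not shown; a minimal sketch of the presumable evaluation step, reusing the fitted best_bayes_model and the test arrays defined earlier, is:

# Score the tuned model on the held-out test set and report which trial produced it.
preds = best_bayes_model.predict_proba(test_features)[:, 1]
print('The best model from Bayes optimization scores {:.4f} AUC ROC on the test set.'.format(
    roc_auc_score(test_labels, preds)))
print('This was achieved after {} search iterations.'.format(results.loc[0, 'iteration']))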
6. Random Search vs. Bayesian Optimization
best_random_params['method'] = 'random search'
best_bayes_params['method'] = 'Bayesian optimization'

best_params = pd.DataFrame(best_bayes_params, index = [0]).append(
                  pd.DataFrame(best_random_params, index = [0]), ignore_index = True)
best_params

6.1 Visualizing the Tuning Process
random_params = pd.DataFrame(columns = list(random_results.loc[0, 'params'].keys()),
                             index = list(range(len(random_results))))
for i, params in enumerate(random_results['params']):
    random_params.loc[i, :] = list(params.values())

random_params['loss'] = random_results['loss']
random_params['iteration'] = random_results['iteration']
random_params.head()

bayes_params = pd.DataFrame(columns = list(ast.literal_eval(results.loc[0, 'params']).keys()),
                            index = list(range(len(results))))
for i, params in enumerate(results['params']):
    bayes_params.loc[i, :] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params.head()

6.2 Learning Rate Comparison
plt.figure(figsize = (20, 8))
plt.rcParams['font.size'] = 18

sns.kdeplot(learning_rate_dist, label = 'Sampling Distribution', linewidth = 2)
sns.kdeplot(random_params['learning_rate'], label = 'Random Search', linewidth = 2)
sns.kdeplot(bayes_params['learning_rate'], label = 'Bayes Optimization', linewidth = 2)
plt.legend()
plt.xlabel('Learning Rate')
plt.ylabel('Density')
plt.title('Learning Rate Distribution')

6.3 Boosting Type Comparison
fig, axs = plt.subplots(1, 2, sharey = True, sharex = True)

random_params['boosting_type'].value_counts().plot.bar(ax = axs[0], figsize = (14, 6),
                                                       color = 'orange', title = 'Random Search Boosting Type')
bayes_params['boosting_type'].value_counts().plot.bar(ax = axs[1], figsize = (14, 6),
                                                      color = 'green', title = 'Bayes Optimization Boosting Type')

print('Random Search boosting type percentages:')
print(100 * random_params['boosting_type'].value_counts() / len(random_params))

print('Bayes Optimization boosting type percentages:')
print(100 * bayes_params['boosting_type'].value_counts() / len(bayes_params))

Random Search boosting type percentages:
dart    36.5
gbdt    33.0
goss    30.5
Name: boosting_type, dtype: float64

Bayes Optimization boosting type percentages:
dart    54.5
gbdt    29.0
goss    16.5
Name: boosting_type, dtype: float64
6.4 Numeric Parameter Comparison
for i, hyper in enumerate(random_params.columns):
    if hyper not in ['class_weight', 'boosting_type', 'iteration', 'subsample', 'metric', 'verbose']:
        plt.figure(figsize = (14, 6))
        if hyper != 'loss':
            sns.kdeplot([sample(space[hyper]) for _ in range(1000)], label = 'Sampling Distribution')
        sns.kdeplot(random_params[hyper], label = 'Random Search')
        sns.kdeplot(bayes_params[hyper], label = 'Bayes Optimization')
        plt.legend(loc = 1)
        plt.title('{} Distribution'.format(hyper))
        plt.xlabel('{}'.format(hyper))
        plt.ylabel('Density')
7. How Parameters Evolve During Bayesian Optimization
7.1 Boosting Type over Iterations
bayes_params['boosting_int'] = bayes_params['boosting_type'].replace({'gbdt': 1, 'goss': 2, 'dart': 3})

plt.plot(bayes_params['iteration'], bayes_params['boosting_int'], 'ro')
plt.yticks([1, 2, 3], ['gbdt', 'goss', 'dart'])
plt.xlabel('Iteration')
plt.title('Boosting Type over Search')

7.2 Learning Rate, Number of Leaves, etc. over Iterations
plt.figure(figsize = (14, 14))
colors = ['red', 'blue', 'orange', 'green']

for i, hyper in enumerate(['colsample_bytree', 'learning_rate', 'min_child_samples', 'num_leaves']):
    plt.subplot(2, 2, i + 1)
    sns.regplot('iteration', hyper, data = bayes_params, color = colors[i])
    plt.title('{} over Search'.format(hyper))

plt.tight_layout()

7.3 reg_alpha and reg_lambda over Iterations
fig, axes = plt.subplots(1, 3, figsize = (18, 6))

for i, hyper in enumerate(['reg_alpha', 'reg_lambda', 'subsample_for_bin']):
    sns.regplot('iteration', hyper, data = bayes_params, ax = axes[i])
    axes[i].set(title = '{} over Search'.format(hyper))

plt.tight_layout()

7.4 Loss over Iterations: Random vs. Bayesian
scores = pd.DataFrame({'ROC AUC': 1 - random_params['loss'],
                       'iteration': random_params['iteration'],
                       'search': 'random'})
scores = scores.append(pd.DataFrame({'ROC AUC': 1 - bayes_params['loss'],
                                     'iteration': bayes_params['iteration'],
                                     'search': 'Bayes'}))

scores['ROC AUC'] = scores['ROC AUC'].astype(np.float32)
scores['iteration'] = scores['iteration'].astype(np.int32)
scores.head()

plt.figure(figsize = (18, 6))

plt.subplot(1, 2, 1)
plt.hist(1 - random_results['loss'].astype(np.float32), label = 'Random Search', edgecolor = 'k')
plt.xlabel('Validation Roc Auc')
plt.ylabel('Count')
plt.title('Random Search Validation Scores')
plt.xlim(0.73, 0.765)

plt.subplot(1, 2, 2)
plt.hist(1 - bayes_params['loss'], label = 'Bayes Optimization', edgecolor = 'k')
plt.xlabel('Validation Roc Auc')
plt.ylabel('Count')
plt.title('Bayes Optimization Validation Scores')
plt.xlim(0.73, 0.765)

sns.lmplot('iteration', 'ROC AUC', hue = 'search', data = scores, height = 8)
plt.xlabel('Iteration')
plt.ylabel('ROC AUC')
plt.title('ROC AUC versus Iteration')

7.5 Saving the Results
import json

with open('trials.json', 'w') as f:
    f.write(json.dumps(bayes_trials.results))

bayes_params.to_csv('bayes_params.csv', index = False)
random_params.to_csv('random_params.csv', index = False)

Summary
Bayesian optimization reached a test AUC of 0.7275, versus 0.7179 for random search and 0.7092 for the untuned baseline. It also concentrated its trials in the more promising regions of the parameter space (for example, over half of its trials used dart boosting) rather than sampling uniformly.