LightGBMError: Length of label is not same with #data
The full error is as follows:
---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-10-1d89cd0dcfc2> in make_predictions(tr_df, tt_df, features_columns, target, lgb_params, NFOLDS)
     39                 tr_data,
     40                 valid_sets = [tr_data, vl_data],
---> 41                 verbose_eval = 200,
     42             )
     43

/opt/conda/lib/python3.6/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    226     # construct booster
    227     try:
--> 228         booster = Booster(params=params, train_set=train_set)
    229         if is_valid_contain_train:
    230             booster.set_train_data_name(train_data_name)

/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py in __init__(self, params, train_set, model_file, model_str, silent)
   1666             self.handle = ctypes.c_void_p()
   1667             _safe_call(_LIB.LGBM_BoosterCreate(
-> 1668                 train_set.construct().handle,
   1669                 c_str(params_str),
   1670                 ctypes.byref(self.handle)))

/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py in construct(self)
   1039                                 init_score=self.init_score, predictor=self._predictor,
   1040                                 silent=self.silent, feature_name=self.feature_name,
-> 1041                                 categorical_feature=self.categorical_feature, params=self.params)
   1042             if self.free_raw_data:
   1043                 self.data = None

/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
    851             raise TypeError('Cannot initialize Dataset from {}'.format(type(data).__name__))
    852         if label is not None:
--> 853             self.set_label(label)
    854         if self.get_label() is None:
    855             raise ValueError("Label should not be None")

/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py in set_label(self, label)
   1339         if self.handle is not None:
   1340             label = list_to_1d_numpy(_label_from_pandas(label), name='label')
-> 1341             self.set_field('label', label)
   1342             self.label = self.get_field('label')  # original values can be modified at cpp side
   1343         return self

/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py in set_field(self, field_name, data)
   1183             ptr_data,
   1184             ctypes.c_int(len(data)),
-> 1185             ctypes.c_int(type_data)))
   1186         return self
   1187

/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py in _safe_call(ret)
     45     """
     46     if ret != 0:
---> 47         raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
     48
     49

LightGBMError: Length of label is not same with #data

The meaning is obvious enough: the raw data X and the class labels y have different lengths. But where exactly did things go wrong?
My original code looked like this:
import gc

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import GroupKFold

# LOCAL_TEST and TARGET are globals defined elsewhere in the notebook.

def make_predictions(tr_df, tt_df, features_columns, target, lgb_params, NFOLDS=2):
    folds = GroupKFold(n_splits=NFOLDS)

    X, y = tr_df[features_columns], tr_df[target]
    P, P_y = tt_df[features_columns], tt_df[target]
    # X = X.reset_index(drop=True)
    # y = y.reset_index(drop=True)
    split_groups = tr_df['DT_M']

    tt_df = tt_df[['TransactionID', target]]
    predictions = np.zeros(len(tt_df))
    oof = np.zeros(len(tr_df))

    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y, groups=split_groups)):
        # print("len(X)=", len(X))  # the data passed in is fine: 103315
        # print("len(y)=", len(y))  # 103315
        # print("len(trn_idx)", len(trn_idx))
        # print("len(val_idx)", len(val_idx))
        tr_x = X.iloc[trn_idx, :]
        tr_y = y[trn_idx]  # training set
        vl_x, vl_y = X.iloc[val_idx, :], y[val_idx]  # validation set
        print("len(tr_y)=", len(tr_y))  # 81048
        print("len(tr_x)=", len(tr_x))  # 80250
        # print(len(tr_x), len(vl_x))
        tr_data = lgb.Dataset(tr_x, label=tr_y)
        vl_data = lgb.Dataset(vl_x, label=vl_y)

        estimator = lgb.train(
            lgb_params,
            tr_data,
            valid_sets = [tr_data, vl_data],
            verbose_eval = 200,
        )

        pp_p = estimator.predict(P)
        predictions += pp_p/NFOLDS

        oof_preds = estimator.predict(vl_x)
        oof[val_idx] = (oof_preds - oof_preds.min())/(oof_preds.max() - oof_preds.min())

        if LOCAL_TEST:
            feature_imp = pd.DataFrame(sorted(zip(estimator.feature_importance(), X.columns)), columns=['Value', 'Feature'])
            print(feature_imp)

        del tr_x, tr_y, vl_x, vl_y, tr_data, vl_data
        gc.collect()

    tt_df['prediction'] = predictions
    print('OOF AUC:', metrics.roc_auc_score(y, oof))
    if LOCAL_TEST:
        print('Holdout AUC:', metrics.roc_auc_score(tt_df[TARGET], tt_df['prediction']))

    return tt_df

The problem lies in:
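tr_x=X.iloc[trn_idx,:]
tr_y=y[trn_idx]  # training set
X.iloc[trn_idx,:] selects rows by position, but y[trn_idx] looks them up by index label, so the tr_y selected here ends up containing NaN.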
So the solution is:
X=X.reset_index(drop=True)
y=y.reset_index(drop=True)
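To see the fix in the context of the fold loop, here is a minimal sketch on made-up data. The toy frame, its column values, and the `fold_sizes` list are all hypothetical stand-ins for the real `tr_df`, and the `lgb.Dataset` calls are left out so that the sketch only needs pandas, NumPy, and scikit-learn. It resets the index before splitting and, as an extra precaution that is my own addition, indexes the labels with `.iloc` as well.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical toy frame standing in for the real training data.
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "feat": rng.normal(size=20),
    "target": rng.integers(0, 2, size=20),
    "DT_M": np.repeat([1, 2, 3, 4], 5),
})

# Simulate downsampling: keep every other row, so the index is 0, 2, 4, ...
tr_df = full.iloc[::2]

# Reset the index so that labels and positions coincide again.
X = tr_df[["feat"]].reset_index(drop=True)
y = tr_df["target"].reset_index(drop=True)
split_groups = tr_df["DT_M"].reset_index(drop=True)

fold_sizes = []
for trn_idx, val_idx in GroupKFold(n_splits=2).split(X, y, groups=split_groups):
    # trn_idx and val_idx are positions, so select both X and y with .iloc.
    tr_x, tr_y = X.iloc[trn_idx], y.iloc[trn_idx]
    vl_x, vl_y = X.iloc[val_idx], y.iloc[val_idx]
    # These are the lengths that lgb.Dataset(tr_x, label=tr_y) would compare.
    fold_sizes.append((len(tr_x), len(tr_y), len(vl_x), len(vl_y)))

print(fold_sizes)
```

With either change in place, the data and label lengths agree in every fold, which is the condition LightGBM checks.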
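Because the dataset had been downsampled, its index was no longer contiguous, but the split_groups used by GroupKFold was still intact, and GroupKFold returns row positions running from 0 to len(X)-1.
With a non-contiguous index, an operation like tr_y=y[trn_idx] treats the positions in trn_idx (produced by the split over split_groups) as index labels, so the lookup fails for every label that no longer exists.
So the habit to remember is: call reset_index(drop=True) every time you cut down a dataset.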
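The mismatch is easy to reproduce on a tiny example. The sketch below uses made-up data (the frame `df`, the column `f`, and the positions in `pos` are all hypothetical) to show which labels a label-based lookup would miss after downsampling, and two ways to get a consistent selection: resetting the index, or using `.iloc` on the labels too.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six rows with a default 0..5 index.
df = pd.DataFrame({
    "f": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target": [0, 1, 0, 1, 0, 1],
})

# Simulate downsampling: the surviving index labels are 0, 2, 3, 5.
sampled = df.iloc[[0, 2, 3, 5]]
X, y = sampled[["f"]], sampled["target"]

# Positional indices, like the ones a splitter such as GroupKFold returns.
pos = np.array([0, 1, 3])

# Positional selection always yields len(pos) rows.
x_part = X.iloc[pos]

# y[pos] would look these values up as labels; label 1 no longer exists.
missing_labels = [int(p) for p in pos if p not in y.index]

# Fix 1: reset the index so that labels and positions coincide.
y_reset = y.reset_index(drop=True)
y_part_reset = y_reset[pos]

# Fix 2: select the labels by position as well.
y_part_iloc = y.iloc[pos]

print(len(x_part), missing_labels)
print(list(y_part_reset), list(y_part_iloc))
```

What a plain `y[pos]` does with the missing label depends on the pandas version: older releases filled in NaN, while recent ones raise a `KeyError`. Either way the selection no longer lines up with `X.iloc[pos]`.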