當(dāng)前位置：首頁(yè) > 编程资源 > 编程问答 >内容正文

编程问答

模型融合(stackingblending)

發(fā)布時(shí)間：2024/1/23 编程问答 25 豆豆

生活随笔收集整理的這篇文章主要介紹了模型融合(stackingblending) 小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

1. blending

需要得到各個(gè)模型結(jié)果集的權(quán)重，然后再線(xiàn)性組合。

"""Kaggle competition: Predicting a Biological Response. Blending {RandomForests, ExtraTrees, GradientBoosting} + stretching to [0,1]. The blending scheme is related to the idea Jose H. Solorzano presented here: http://www.kaggle.com/c/bioresponse/forums/t/1889/question-about-the-process-of-ensemble-learning/10950#post10950 '''You can try this: In one of the 5 folds, train the models, then use the results of the models as 'variables' in logistic regression over the validation data of that fold'''. Or at least this is the implementation of my understanding of that idea :-) The predictions are saved in test.csv. The code below created my best submission to the competition: - public score (25%): 0.43464 - private score (75%): 0.37751 - final rank on the private leaderboard: 17th over 711 teams :-) Note: if you increase the number of estimators of the classifiers, e.g. n_estimators=1000, you get a better score/rank on the private test set. Copyright 2012, Emanuele Olivetti. BSD license, 3 clauses. """from __future__ import division import numpy as np import load_data from sklearn.cross_validation import StratifiedKFold from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.linear_model import LogisticRegressiondef logloss(attempt, actual, epsilon=1.0e-15):"""Logloss, i.e. the score of the bioresponse competition."""attempt = np.clip(attempt, epsilon, 1.0-epsilon)return - np.mean(actual * np.log(attempt) +(1.0 - actual) * np.log(1.0 - attempt))if __name__ == '__main__':np.random.seed(0) # seed to shuffle the train setn_folds = 10verbose = Trueshuffle = FalseX, y, X_submission = load_data.load()if shuffle:idx = np.random.permutation(y.size)X = X[idx]y = y[idx]skf = list(StratifiedKFold(y, n_folds))clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]print "Creating train and test sets for blending."dataset_blend_train = np.zeros((X.shape[0], len(clfs)))dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))for j, clf in enumerate(clfs):print j, clfdataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))for i, (train, test) in enumerate(skf):print "Fold", iX_train = X[train]y_train = y[train]X_test = X[test]y_test = y[test]clf.fit(X_train, y_train)y_submission = clf.predict_proba(X_test)[:, 1]dataset_blend_train[test, j] = y_submissiondataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:, 1]dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)printprint "Blending."clf = LogisticRegression()clf.fit(dataset_blend_train, y)y_submission = clf.predict_proba(dataset_blend_test)[:, 1]print "Linear stretch of predictions to [0,1]"y_submission = (y_submission - y_submission.min()) / (y_submission.max() - y_submission.min())print "Saving Results."tmp = np.vstack([range(1, len(y_submission)+1), y_submission]).Tnp.savetxt(fname='submission.csv', X=tmp, fmt='%d,%0.9f', header='MoleculeId,PredictedProbability', comments='') 2.stacking

stacking的核心：在訓(xùn)練集上進(jìn)行預(yù)測(cè)，從而構(gòu)建更高層的學(xué)習(xí)器。

stacking訓(xùn)練過(guò)程:

1）拆解訓(xùn)練集。將訓(xùn)練數(shù)據(jù)隨機(jī)且大致均勻的拆為m份

2）在拆解后的訓(xùn)練集上訓(xùn)練模型，同時(shí)在測(cè)試集上預(yù)測(cè)。利用m-1份訓(xùn)練數(shù)據(jù)進(jìn)行訓(xùn)練，預(yù)測(cè)剩余一份；在此過(guò)程進(jìn)行的同時(shí)，利用相同的m-1份數(shù)據(jù)訓(xùn)練，在真正的測(cè)試集上預(yù)測(cè)；如此重復(fù)m次，將訓(xùn)練集上m次結(jié)果疊加為1列，將測(cè)試集上m次結(jié)果取均值融合為1列。

3）使用k個(gè)分類(lèi)器重復(fù)2過(guò)程。將分別得到k列訓(xùn)練集的預(yù)測(cè)結(jié)果，k列測(cè)試集預(yù)測(cè)結(jié)果。

4）訓(xùn)練3過(guò)程得到的數(shù)據(jù)。將k列訓(xùn)練集預(yù)測(cè)結(jié)果和訓(xùn)練集真實(shí)label進(jìn)行訓(xùn)練，將k列測(cè)試集預(yù)測(cè)結(jié)果作為測(cè)試集。

# -*- coding: utf-8 -*- import numpy as np from sklearn.model_selection import StratifiedKFold from sklearn.svm import SVC from sklearn.ensemble import RandomForestClassifier from sklearn.neighbors import KNeighborsClassifier import xgboost as xgb from sklearn.ensemble import ExtraTreesClassifier from sklearn.linear_model import LogisticRegressiondef load_data():passdef stacking(train_x, train_y, test):""" stackinginput: train_x, train_y, testoutput: test的預(yù)測(cè)值clfs: 5個(gè)一級(jí)分類(lèi)器dataset_blend_train: 一級(jí)分類(lèi)器的prediction, 二級(jí)分類(lèi)器的train_xdataset_blend_test: 二級(jí)分類(lèi)器的test"""# 5個(gè)一級(jí)分類(lèi)器clfs = [SVC(C = 3, kernel="rbf"),RandomForestClassifier(n_estimators=100, max_features="log2", max_depth=10, min_samples_leaf=1, bootstrap=True, n_jobs=-1, random_state=1),KNeighborsClassifier(n_neighbors=15, n_jobs=-1),xgb.XGBClassifier(n_estimators=100, objective="binary:logistic", gamma=1, max_depth=10, subsample=0.8, nthread=-1, seed=1),ExtraTreesClassifier(n_estimators=100, criterion="gini", max_features="log2", max_depth=10, min_samples_split=2, min_samples_leaf=1,bootstrap=True, n_jobs=-1, random_state=1)]# 二級(jí)分類(lèi)器的train_x, testdataset_blend_train = np.zeros((train_x.shape[0], len(clfs)), dtype=np.int)dataset_blend_test = np.zeros((test.shape[0], len(clfs)), dtype=np.int)# 5個(gè)分類(lèi)器進(jìn)行8_folds預(yù)測(cè)n_folds = 8skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1)for i,clf in enumerate(clfs):dataset_blend_test_j = np.zeros((test.shape[0], n_folds)) # 每個(gè)分類(lèi)器的單次fold預(yù)測(cè)結(jié)果for j,(train_index,test_index) in enumerate(skf.split(train_x, train_y)):tr_x = train_x[train_index]tr_y = train_y[train_index]clf.fit(tr_x, tr_y)dataset_blend_train[test_index, i] = clf.predict(train_x[test_index])dataset_blend_test_j[:, j] = clf.predict(test)dataset_blend_test[:, i] = dataset_blend_test_j.sum(axis=1) // (n_folds//2 + 1)# 二級(jí)分類(lèi)器進(jìn)行預(yù)測(cè)clf = LogisticRegression(penalty="l1", tol=1e-6, C=1.0, random_state=1, n_jobs=-1)clf.fit(dataset_blend_train, train_y)prediction = clf.predict(dataset_blend_test)return predictiondef main():(train_x, train_y, test) = load_data()prediction = stacking(train_x, train_y, test)return predictionif __name__ == "__main__":prediction = main()

總結(jié)

以上是生活随笔為你收集整理的模型融合(stackingblending)的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇： xgboost相比传统gbdt有何不同？
下一篇： xgboost使用自定义的loss fu