Kaggle Titanic Survival Prediction Challenge: Model Building, Hyperparameter Tuning, and Ensembling
Kaggle Titanic Survival Prediction Challenge
This is the Getting Started prediction competition on Kaggle, one of the simpler contests aimed at newcomers. My best result reached roughly the top 8%. To review and consolidate the competition, I am splitting it into three parts:
- Kaggle Titanic Survival Prediction Challenge: Data Analysis
- Kaggle Titanic Survival Prediction Challenge: Simple Feature Engineering
- Kaggle Titanic Survival Prediction Challenge: Model Building, Hyperparameter Tuning, and Ensembling
Prerequisites
- numpy
- pandas
- matplotlib
- seaborn
- sklearn
Competition page: Titanic: Machine Learning from Disaster
The Sinking of the Titanic
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg.
Unfortunately, there were not enough lifeboats for everyone on board, and 1,502 of the 2,224 passengers and crew died. While surviving involved some element of luck, some groups of people were clearly more likely to survive than others.
In this challenge, you are asked to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (name, age, gender, socio-economic class, and so on).
Task analysis: this is a binary classification problem; we build models to predict who survived.
Model Building, Hyperparameter Tuning, and Ensembling
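All the snippets below share the setup sketched here. The imports follow from the names used in the code; the value of `seed` is an assumption (the original fixes a seed but never shows it), and `data_all` is the combined train-plus-test feature table produced in the feature-engineering part.

```python
import numpy as np
import pandas as pd
from tqdm import tqdm
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

seed = 2020  # hypothetical value; any fixed seed keeps the runs reproducible
```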
Split the combined data back into the training set and the test set
```python
train_data = data_all[data_all['train'] == 1]
test_x = data_all[data_all['train'] == 0]
# labels
y = train_data['Survived'].values
train_data.drop(['train', 'Survived'], axis=1, inplace=True)
test_x.drop(['train', 'Survived'], axis=1, inplace=True)
np.save('./result/label', y)
```
Split the training data into training and validation parts at an 8:2 ratio:

```python
from sklearn.model_selection import train_test_split

features = train_data.values
# note: the models below run their own 10-fold CV on the full training set,
# so this hold-out split is shown for completeness and not used further
X_train, X_test, Y_train, Y_test = train_test_split(features, y, test_size=0.2, random_state=seed)
```

Model building
Define a reusable wrapper function: it trains the given classifier with 10-fold cross-validation, collects out-of-fold predictions on the training set, averages the test-set predictions over the folds, reports the out-of-fold accuracy, and writes a submission file.
```python
def Titanicmodel(clf, features, test_data, y, model_name):
    if model_name == 'LinearSVC':
        num_classes = 1  # decision_function returns a single margin column
    else:
        num_classes = 2  # predict_proba returns one column per class
    num_fold = 10                             # 10-fold CV
    fold_len = features.shape[0] // num_fold  # samples per fold
    skf_indices = []
    skf = StratifiedKFold(n_splits=num_fold, shuffle=True, random_state=seed)
    # concatenate the validation indices of each fold so that contiguous
    # slices of skf_indices correspond to the folds
    for i, (train_idx, valid_idx) in enumerate(skf.split(np.ones(features.shape[0]), y)):
        skf_indices.extend(valid_idx.tolist())
    train_pred = np.zeros((features.shape[0], num_classes))  # out-of-fold predictions on the training set
    test_pred = np.zeros((test_data.shape[0], num_classes))  # fold-averaged predictions on the test set
    for fold in tqdm(range(num_fold)):
        fold_start = fold * fold_len
        fold_end = (fold + 1) * fold_len
        if fold == num_fold - 1:
            fold_end = features.shape[0]  # the last fold takes the remainder
        # indices of the 9 training folds
        train_indices = skf_indices[:fold_start] + skf_indices[fold_end:]
        # indices of the 1 held-out fold
        test_indices = skf_indices[fold_start:fold_end]
        train_x = features[train_indices]
        train_y = y[train_indices]
        cv_test_x = features[test_indices]
        clf.fit(train_x, train_y)
        if model_name == 'LinearSVC':
            pred = clf.decision_function(cv_test_x)                 # margins on the held-out fold
            train_pred[test_indices] = pred.reshape(len(pred), 1)   # after the loop this covers the whole training set
            pred = clf.decision_function(test_data)
            test_pred += pred.reshape(len(pred), 1) / num_fold      # average the 10 fold models on the test set
        else:
            pred = clf.predict_proba(cv_test_x)                     # class probabilities on the held-out fold
            train_pred[test_indices] = pred
            pred = clf.predict_proba(test_data)
            test_pred += pred / num_fold
    if model_name == 'LinearSVC':
        # threshold the margin at 0 to get hard labels
        y_pred = (train_pred > 0).astype(np.int32).reshape(len(train_pred))
        pre = (test_pred > 0).astype(np.int32).reshape(len(test_pred))
    else:
        # row-wise argmax turns probabilities into hard labels
        y_pred = np.argmax(train_pred, axis=1)
        pre = np.argmax(test_pred, axis=1)
    score = accuracy_score(y, y_pred)  # out-of-fold accuracy against the true labels
    print('accuracy_score:', score)
    # save this model's out-of-fold and test-set predictions for stacking later
    np.save('./result/{0}train'.format(model_name), train_pred)
    np.save('./result/{0}test'.format(model_name), test_pred)
    submit = pd.DataFrame({'PassengerId': np.array(range(892, 1310)),
                           'Survived': pre.astype(np.int32)})
    submit.to_csv('{0}_submit.csv'.format(model_name), index=False)
    return clf, score
```
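`Titanicmodel` saves its predictions under `./result/`, the same folder `np.save` used in the split step, so the directory has to exist before the first call. A small assumed setup step:

```python
import os

# create the output directory that Titanicmodel and np.save write to
os.makedirs('./result', exist_ok=True)
```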
Logistic Regression (LR)

```python
pipe = Pipeline([('select', PCA(n_components=0.95)),
                 ('classify', LogisticRegression(random_state=seed, solver='liblinear'))])
param = {'classify__penalty': ['l1', 'l2'],
         'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10]}
LR_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=5)
LR_grid.fit(features, y)
print(LR_grid.best_params_, LR_grid.best_score_)

# rebuild the pipeline with the best hyperparameters and run the CV wrapper
C = LR_grid.best_params_['classify__C']
penalty = LR_grid.best_params_['classify__penalty']
LR_classify = LogisticRegression(C=C, penalty=penalty, random_state=seed, solver='liblinear')
LR_select = PCA(n_components=0.95)
LR_pipeline = make_pipeline(LR_select, LR_classify)
lr_model, lr_score = Titanicmodel(LR_pipeline, features, test_x, y, 'LR')
```
Support Vector Machine (SVM)

```python
pipe = Pipeline([('select', SelectKBest(k=20)),
                 ('classify', LinearSVC(random_state=seed, dual=False))])  # dual=False is required for the l1 penalty
param = {'select__k': list(range(20, 40, 2)),
         'classify__penalty': ['l1', 'l2'],
         'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10, 50, 100]}
SVC_grid = GridSearchCV(estimator=pipe, param_grid=param, cv=5, scoring='roc_auc')
SVC_grid.fit(features, y)
print(SVC_grid.best_params_, SVC_grid.best_score_)

C = SVC_grid.best_params_['classify__C']
k = SVC_grid.best_params_['select__k']
penalty = SVC_grid.best_params_['classify__penalty']
SVC_classify = LinearSVC(C=C, penalty=penalty, random_state=seed, dual=False)
SVC_select = SelectKBest(k=k)  # reuse the tuned k so the final pipeline matches the grid search
SVC_pipeline = make_pipeline(SVC_select, SVC_classify)
SVC_model, LinearSVC_score = Titanicmodel(SVC_pipeline, features, test_x, y, 'LinearSVC')
```
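`LinearSVC` has no `predict_proba`, which is why `Titanicmodel` special-cases it and stores the raw `decision_function` margin as a single column. If you would rather have the SVM produce two probability columns like the other models, one option (not used above, just a sketch) is sklearn's `CalibratedClassifierCV`:

```python
from sklearn.calibration import CalibratedClassifierCV

# wraps LinearSVC and fits a sigmoid on held-out folds so predict_proba works;
# with this wrapper the LinearSVC special case in Titanicmodel would be unnecessary
calibrated_svc = make_pipeline(SelectKBest(k=k),
                               CalibratedClassifierCV(LinearSVC(C=C, dual=False, random_state=seed), cv=5))
```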
RandomForestClassifier

```python
pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', RandomForestClassifier(criterion='gini', random_state=seed,
                                                     min_samples_split=4, min_samples_leaf=5,
                                                     max_features='sqrt', n_jobs=-1))])
param = {'classify__n_estimators': list(range(40, 50, 2)),
         'classify__max_depth': list(range(10, 25, 2))}
rfc_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
rfc_grid.fit(features, y)
print(rfc_grid.best_params_, rfc_grid.best_score_)

n_estimators = rfc_grid.best_params_['classify__n_estimators']
max_depth = rfc_grid.best_params_['classify__max_depth']
rfc_classify = RandomForestClassifier(criterion='gini', n_estimators=n_estimators, max_depth=max_depth,
                                      random_state=seed, min_samples_split=4, min_samples_leaf=5,
                                      max_features='sqrt')
rfc_select = SelectKBest(k=34)  # keep the same selector that was used during tuning
rfc_pipeline = make_pipeline(rfc_select, rfc_classify)
rfc_model, rfc_score = Titanicmodel(rfc_pipeline, features, test_x, y, 'rfc')
```
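The pipeline returned by `Titanicmodel` is the one fitted on the last CV fold, so you can peek at which selected features the forest leaned on. A quick sketch; the step names come from `make_pipeline`'s lowercase class-name convention:

```python
# indices refer to the original feature columns kept by SelectKBest
fitted_select = rfc_model.named_steps['selectkbest']
fitted_forest = rfc_model.named_steps['randomforestclassifier']
kept = fitted_select.get_support(indices=True)
top = sorted(zip(fitted_forest.feature_importances_, kept), reverse=True)[:10]
print(top)  # (importance, original column index) pairs
```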
LightGBM

```python
pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', lgb.LGBMClassifier(random_state=seed, learning_rate=0.12, n_estimators=88,
                                                 max_depth=16, min_child_samples=28, min_child_weight=0.0,
                                                 colsample_bytree=0.4, objective='binary'))])
param = {'select__k': [i for i in range(20, 40)],
         # 'classify__learning_rate': [i / 100 for i in range(20)],
         }
lgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
lgb_grid.fit(features, y)
print(lgb_grid.best_params_, lgb_grid.best_score_)

k = lgb_grid.best_params_['select__k']
lgb_classify = lgb.LGBMClassifier(random_state=seed, learning_rate=0.12, n_estimators=88,
                                  max_depth=16, min_child_samples=28, min_child_weight=0.0,
                                  colsample_bytree=0.4, objective='binary')
lgb_select = SelectKBest(k=k)  # reuse the tuned k so the final pipeline matches the grid search
lgb_pipeline = make_pipeline(lgb_select, lgb_classify)
lgb_model, lgb_score = Titanicmodel(lgb_pipeline, features, test_x, y, 'lgb')
```
XGBoost

```python
pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', xgb.XGBClassifier(random_state=seed, learning_rate=0.12, n_estimators=80,
                                                max_depth=8, min_child_weight=3, subsample=0.8,
                                                colsample_bytree=0.8, gamma=0.2, reg_alpha=0.2,
                                                reg_lambda=0.1))])
param = {'select__k': [i for i in range(20, 40)],
         'classify__learning_rate': [i / 100 for i in range(10, 20)],
         }
xgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
xgb_grid.fit(features, y)
print(xgb_grid.best_params_, xgb_grid.best_score_)

xgb_classify = xgb.XGBClassifier(random_state=seed, learning_rate=0.12, n_estimators=80, max_depth=8,
                                 min_child_weight=3, subsample=0.8, colsample_bytree=0.8,
                                 gamma=0.2, reg_alpha=0.2, reg_lambda=0.1)
xgb_select = SelectKBest(k=34)
xgb_pipeline = make_pipeline(xgb_select, xgb_classify)
xgb_model, xgb_score = Titanicmodel(xgb_pipeline, features, test_x, y, 'xgb')
```
Model Ensembling (Stacking)
Load each base model's saved out-of-fold predictions (training set) and fold-averaged test-set predictions, stack them column-wise as meta-features, and train a logistic-regression meta-model on top:

```python
LR_train = np.load('./result/LRtrain.npy')
LR_test = np.load('./result/LRtest.npy')
LinearSVC_train = np.load('./result/LinearSVCtrain.npy')
LinearSVC_test = np.load('./result/LinearSVCtest.npy')
rfc_train = np.load('./result/rfctrain.npy')
rfc_test = np.load('./result/rfctest.npy')
xgb_train = np.load('./result/xgbtrain.npy')
xgb_test = np.load('./result/xgbtest.npy')
lgb_train = np.load('./result/lgbtrain.npy')
lgb_test = np.load('./result/lgbtest.npy')
label = np.load('./result/label.npy')

# stack the base models' predictions column-wise as meta-features
train_data = np.hstack((LR_train, rfc_train, LinearSVC_train, xgb_train, lgb_train))
test_x = np.hstack((LR_test, rfc_test, LinearSVC_test, xgb_test, lgb_test))

# a logistic regression serves as the meta-model
model = LogisticRegression(random_state=seed)
stacking_model, stacking_score = Titanicmodel(model, features=train_data, test_data=test_x,
                                              y=label, model_name='lr_stacking')
```
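A quick sanity check on the stacked design matrix: LR, rfc, xgb, and lgb each contribute two probability columns and LinearSVC one margin column, so with 891 training and 418 test passengers the shapes should come out as below (a sketch):

```python
# 2 + 2 + 1 + 2 + 2 = 9 meta-features per passenger
print(train_data.shape)  # expected: (891, 9)
print(test_x.shape)      # expected: (418, 9)
```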
Summary
That concludes the model building, hyperparameter tuning, and ensembling part of the Titanic survival prediction challenge; the stacked logistic-regression meta-model produces the final submission. Together with the data-analysis and feature-engineering parts, this completes the series. I hope it helps you tackle the competition.