Kaggle Titanic Survival Prediction Challenge: Model Building, Hyperparameter Tuning, and Ensembling
Kaggle Titanic Survival Prediction Challenge
This is the Getting Started prediction competition on Kaggle, one of the simpler contests aimed at newcomers. My best result reached roughly the top 8%. To review and consolidate the competition, I am splitting it into three parts:
- Kaggle Titanic Survival Prediction Challenge: Data Analysis
- Kaggle Titanic Survival Prediction Challenge: Simple Feature Engineering
- Kaggle Titanic Survival Prediction Challenge: Model Building, Hyperparameter Tuning, and Ensembling
Prerequisites
- numpy
- pandas
- matplotlib
- seaborn
- sklearn
Competition page: Titanic: Machine Learning from Disaster
The Sinking of the Titanic
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg.
Unfortunately, there were not enough lifeboats for everyone on board, and 1,502 of the 2,224 passengers and crew died. While surviving involved some element of luck, some groups of people were clearly more likely to survive than others.
In this challenge, you are asked to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (name, age, gender, socio-economic class, and so on).
Task analysis: this is a binary classification problem; we build models to predict who survived.
Model Building, Hyperparameter Tuning, and Ensembling
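All the snippets below share the setup sketched here. The imports follow from the names used in the code; the value of `seed` is an assumption (the original fixes a seed but never shows it), and `data_all` is the combined train-plus-test feature table produced in the feature-engineering part.

```python
import numpy as np
import pandas as pd
from tqdm import tqdm
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

seed = 2020  # hypothetical value; any fixed seed keeps the runs reproducible
```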
Split the combined data back into the training set and the test set
```python
train_data = data_all[data_all['train'] == 1]
test_x = data_all[data_all['train'] == 0]
# labels
y = train_data['Survived'].values
train_data.drop(['train', 'Survived'], axis=1, inplace=True)
test_x.drop(['train', 'Survived'], axis=1, inplace=True)
np.save('./result/label', y)
```
Split the training data into training and validation parts at an 8:2 ratio:

```python
from sklearn.model_selection import train_test_split

features = train_data.values
# note: the models below run their own 10-fold CV on the full training set,
# so this hold-out split is shown for completeness and not used further
X_train, X_test, Y_train, Y_test = train_test_split(features, y, test_size=0.2, random_state=seed)
```

Model building
Define a reusable wrapper function: it trains the given classifier with 10-fold cross-validation, collects out-of-fold predictions on the training set, averages the test-set predictions over the folds, reports the out-of-fold accuracy, and writes a submission file.
```python
def Titanicmodel(clf, features, test_data, y, model_name):
    if model_name == 'LinearSVC':
        num_classes = 1  # decision_function returns a single margin column
    else:
        num_classes = 2  # predict_proba returns one column per class
    num_fold = 10                             # 10-fold CV
    fold_len = features.shape[0] // num_fold  # samples per fold
    skf_indices = []
    skf = StratifiedKFold(n_splits=num_fold, shuffle=True, random_state=seed)
    # concatenate the validation indices of each fold so that contiguous
    # slices of skf_indices correspond to the folds
    for i, (train_idx, valid_idx) in enumerate(skf.split(np.ones(features.shape[0]), y)):
        skf_indices.extend(valid_idx.tolist())
    train_pred = np.zeros((features.shape[0], num_classes))  # out-of-fold predictions on the training set
    test_pred = np.zeros((test_data.shape[0], num_classes))  # fold-averaged predictions on the test set
    for fold in tqdm(range(num_fold)):
        fold_start = fold * fold_len
        fold_end = (fold + 1) * fold_len
        if fold == num_fold - 1:
            fold_end = features.shape[0]  # the last fold takes the remainder
        # indices of the 9 training folds
        train_indices = skf_indices[:fold_start] + skf_indices[fold_end:]
        # indices of the 1 held-out fold
        test_indices = skf_indices[fold_start:fold_end]
        train_x = features[train_indices]
        train_y = y[train_indices]
        cv_test_x = features[test_indices]
        clf.fit(train_x, train_y)
        if model_name == 'LinearSVC':
            pred = clf.decision_function(cv_test_x)                 # margins on the held-out fold
            train_pred[test_indices] = pred.reshape(len(pred), 1)   # after the loop this covers the whole training set
            pred = clf.decision_function(test_data)
            test_pred += pred.reshape(len(pred), 1) / num_fold      # average the 10 fold models on the test set
        else:
            pred = clf.predict_proba(cv_test_x)                     # class probabilities on the held-out fold
            train_pred[test_indices] = pred
            pred = clf.predict_proba(test_data)
            test_pred += pred / num_fold
    if model_name == 'LinearSVC':
        # threshold the margin at 0 to get hard labels
        y_pred = (train_pred > 0).astype(np.int32).reshape(len(train_pred))
        pre = (test_pred > 0).astype(np.int32).reshape(len(test_pred))
    else:
        # row-wise argmax turns probabilities into hard labels
        y_pred = np.argmax(train_pred, axis=1)
        pre = np.argmax(test_pred, axis=1)
    score = accuracy_score(y, y_pred)  # out-of-fold accuracy against the true labels
    print('accuracy_score:', score)
    # save this model's out-of-fold and test-set predictions for stacking later
    np.save('./result/{0}train'.format(model_name), train_pred)
    np.save('./result/{0}test'.format(model_name), test_pred)
    submit = pd.DataFrame({'PassengerId': np.array(range(892, 1310)),
                           'Survived': pre.astype(np.int32)})
    submit.to_csv('{0}_submit.csv'.format(model_name), index=False)
    return clf, score
```
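`Titanicmodel` saves its predictions under `./result/`, the same folder `np.save` used in the split step, so the directory has to exist before the first call. A small assumed setup step:

```python
import os

# create the output directory that Titanicmodel and np.save write to
os.makedirs('./result', exist_ok=True)
```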
Logistic Regression (LR)

```python
pipe = Pipeline([('select', PCA(n_components=0.95)),
                 ('classify', LogisticRegression(random_state=seed, solver='liblinear'))])
param = {'classify__penalty': ['l1', 'l2'],
         'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10]}
LR_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=5)
LR_grid.fit(features, y)
print(LR_grid.best_params_, LR_grid.best_score_)

# rebuild the pipeline with the best hyperparameters and run the CV wrapper
C = LR_grid.best_params_['classify__C']
penalty = LR_grid.best_params_['classify__penalty']
LR_classify = LogisticRegression(C=C, penalty=penalty, random_state=seed, solver='liblinear')
LR_select = PCA(n_components=0.95)
LR_pipeline = make_pipeline(LR_select, LR_classify)
lr_model, lr_score = Titanicmodel(LR_pipeline, features, test_x, y, 'LR')
```
Support Vector Machine (SVM)

```python
pipe = Pipeline([('select', SelectKBest(k=20)),
                 ('classify', LinearSVC(random_state=seed, dual=False))])  # dual=False is required for the l1 penalty
param = {'select__k': list(range(20, 40, 2)),
         'classify__penalty': ['l1', 'l2'],
         'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10, 50, 100]}
SVC_grid = GridSearchCV(estimator=pipe, param_grid=param, cv=5, scoring='roc_auc')
SVC_grid.fit(features, y)
print(SVC_grid.best_params_, SVC_grid.best_score_)

C = SVC_grid.best_params_['classify__C']
k = SVC_grid.best_params_['select__k']
penalty = SVC_grid.best_params_['classify__penalty']
SVC_classify = LinearSVC(C=C, penalty=penalty, random_state=seed, dual=False)
SVC_select = SelectKBest(k=k)  # reuse the tuned k so the final pipeline matches the grid search
SVC_pipeline = make_pipeline(SVC_select, SVC_classify)
SVC_model, LinearSVC_score = Titanicmodel(SVC_pipeline, features, test_x, y, 'LinearSVC')
```
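`LinearSVC` has no `predict_proba`, which is why `Titanicmodel` special-cases it and stores the raw `decision_function` margin as a single column. If you would rather have the SVM produce two probability columns like the other models, one option (not used above, just a sketch) is sklearn's `CalibratedClassifierCV`:

```python
from sklearn.calibration import CalibratedClassifierCV

# wraps LinearSVC and fits a sigmoid on held-out folds so predict_proba works;
# with this wrapper the LinearSVC special case in Titanicmodel would be unnecessary
calibrated_svc = make_pipeline(SelectKBest(k=k),
                               CalibratedClassifierCV(LinearSVC(C=C, dual=False, random_state=seed), cv=5))
```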
RandomForestClassifier

```python
pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', RandomForestClassifier(criterion='gini', random_state=seed,
                                                     min_samples_split=4, min_samples_leaf=5,
                                                     max_features='sqrt', n_jobs=-1))])
param = {'classify__n_estimators': list(range(40, 50, 2)),
         'classify__max_depth': list(range(10, 25, 2))}
rfc_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
rfc_grid.fit(features, y)
print(rfc_grid.best_params_, rfc_grid.best_score_)

n_estimators = rfc_grid.best_params_['classify__n_estimators']
max_depth = rfc_grid.best_params_['classify__max_depth']
rfc_classify = RandomForestClassifier(criterion='gini', n_estimators=n_estimators, max_depth=max_depth,
                                      random_state=seed, min_samples_split=4, min_samples_leaf=5,
                                      max_features='sqrt')
rfc_select = SelectKBest(k=34)  # keep the same selector that was used during tuning
rfc_pipeline = make_pipeline(rfc_select, rfc_classify)
rfc_model, rfc_score = Titanicmodel(rfc_pipeline, features, test_x, y, 'rfc')
```
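The pipeline returned by `Titanicmodel` is the one fitted on the last CV fold, so you can peek at which selected features the forest leaned on. A quick sketch; the step names come from `make_pipeline`'s lowercase class-name convention:

```python
# indices refer to the original feature columns kept by SelectKBest
fitted_select = rfc_model.named_steps['selectkbest']
fitted_forest = rfc_model.named_steps['randomforestclassifier']
kept = fitted_select.get_support(indices=True)
top = sorted(zip(fitted_forest.feature_importances_, kept), reverse=True)[:10]
print(top)  # (importance, original column index) pairs
```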
LightGBM

```python
pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', lgb.LGBMClassifier(random_state=seed, learning_rate=0.12, n_estimators=88,
                                                 max_depth=16, min_child_samples=28, min_child_weight=0.0,
                                                 colsample_bytree=0.4, objective='binary'))])
param = {'select__k': [i for i in range(20, 40)],
         # 'classify__learning_rate': [i / 100 for i in range(20)],
         }
lgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
lgb_grid.fit(features, y)
print(lgb_grid.best_params_, lgb_grid.best_score_)

k = lgb_grid.best_params_['select__k']
lgb_classify = lgb.LGBMClassifier(random_state=seed, learning_rate=0.12, n_estimators=88,
                                  max_depth=16, min_child_samples=28, min_child_weight=0.0,
                                  colsample_bytree=0.4, objective='binary')
lgb_select = SelectKBest(k=k)  # reuse the tuned k so the final pipeline matches the grid search
lgb_pipeline = make_pipeline(lgb_select, lgb_classify)
lgb_model, lgb_score = Titanicmodel(lgb_pipeline, features, test_x, y, 'lgb')
```
XGBoost

```python
pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', xgb.XGBClassifier(random_state=seed, learning_rate=0.12, n_estimators=80,
                                                max_depth=8, min_child_weight=3, subsample=0.8,
                                                colsample_bytree=0.8, gamma=0.2, reg_alpha=0.2,
                                                reg_lambda=0.1))])
param = {'select__k': [i for i in range(20, 40)],
         'classify__learning_rate': [i / 100 for i in range(10, 20)],
         }
xgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
xgb_grid.fit(features, y)
print(xgb_grid.best_params_, xgb_grid.best_score_)

xgb_classify = xgb.XGBClassifier(random_state=seed, learning_rate=0.12, n_estimators=80, max_depth=8,
                                 min_child_weight=3, subsample=0.8, colsample_bytree=0.8,
                                 gamma=0.2, reg_alpha=0.2, reg_lambda=0.1)
xgb_select = SelectKBest(k=34)
xgb_pipeline = make_pipeline(xgb_select, xgb_classify)
xgb_model, xgb_score = Titanicmodel(xgb_pipeline, features, test_x, y, 'xgb')
```
Model Ensembling (Stacking)
Load each base model's saved out-of-fold predictions (training set) and fold-averaged test-set predictions, stack them column-wise as meta-features, and train a logistic-regression meta-model on top:

```python
LR_train = np.load('./result/LRtrain.npy')
LR_test = np.load('./result/LRtest.npy')
LinearSVC_train = np.load('./result/LinearSVCtrain.npy')
LinearSVC_test = np.load('./result/LinearSVCtest.npy')
rfc_train = np.load('./result/rfctrain.npy')
rfc_test = np.load('./result/rfctest.npy')
xgb_train = np.load('./result/xgbtrain.npy')
xgb_test = np.load('./result/xgbtest.npy')
lgb_train = np.load('./result/lgbtrain.npy')
lgb_test = np.load('./result/lgbtest.npy')
label = np.load('./result/label.npy')

# stack the base models' predictions column-wise as meta-features
train_data = np.hstack((LR_train, rfc_train, LinearSVC_train, xgb_train, lgb_train))
test_x = np.hstack((LR_test, rfc_test, LinearSVC_test, xgb_test, lgb_test))

# a logistic regression serves as the meta-model
model = LogisticRegression(random_state=seed)
stacking_model, stacking_score = Titanicmodel(model, features=train_data, test_data=test_x,
                                              y=label, model_name='lr_stacking')
```
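A quick sanity check on the stacked design matrix: LR, rfc, xgb, and lgb each contribute two probability columns and LinearSVC one margin column, so with 891 training and 418 test passengers the shapes should come out as below (a sketch):

```python
# 2 + 2 + 1 + 2 + 2 = 9 meta-features per passenger
print(train_data.shape)  # expected: (891, 9)
print(test_x.shape)      # expected: (418, 9)
```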
Summary
That concludes the model building, hyperparameter tuning, and ensembling part of the Titanic survival prediction challenge; the stacked logistic-regression meta-model produces the final submission. Together with the data-analysis and feature-engineering parts, this completes the series. I hope it helps you tackle the competition.