Predicting Titanic Survivors with sklearn Ensemble Algorithms
Original article: http://ihoge.cn/2018/sklearn-ensemble.html
Predicting Titanic Survivors with Random Forest Classification
    import pandas as pd
    import numpy as np

    def read_dataset(fname):
        # Use the first column of the CSV as the index.
        data = pd.read_csv(fname, index_col=0)
        # Drop free-text columns that are not used as features.
        data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
        # Encode Sex as integer codes.
        labels = data['Sex'].unique().tolist()
        data['Sex'] = data['Sex'].apply(lambda x: labels.index(x))
        # Encode Embarked as integer codes.
        labels = data['Embarked'].unique().tolist()
        data['Embarked'] = data['Embarked'].apply(lambda n: labels.index(n))
        # Fill the remaining missing values with 0.
        data = data.fillna(0)
        return data

    train = read_dataset('code/datasets/titanic/train.csv')

    from sklearn.model_selection import train_test_split

    y = train['Survived'].values
    X = train.drop(['Survived'], axis=1).values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    print("X_train_shape:", X_train.shape, " y_train_shape:", y_train.shape)
    print("X_test_shape:", X_test.shape, " y_test_shape:", y_test.shape)

Output:

    X_train_shape: (712, 7)  y_train_shape: (712,)
    X_test_shape: (179, 7)  y_test_shape: (179,)

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    import time

    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    entropy_thresholds = np.linspace(0, 1, 50)
    gini_thresholds = np.linspace(0, 0.1, 50)
    # Parameter grid: each dict is searched on its own, other parameters stay at their defaults.
    param_grid = [
        {'criterion': ['entropy'], 'min_impurity_decrease': entropy_thresholds},
        {'criterion': ['gini'], 'min_impurity_decrease': gini_thresholds},
        {'max_depth': np.arange(2, 10)},
        {'min_samples_split': np.arange(2, 20)},
        {'n_estimators': np.arange(2, 20)},
    ]
    clf = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
    clf.fit(X, y)
    print("Elapsed time:", time.perf_counter() - start)
    print("best param:{0}\nbest score:{1}".format(clf.best_params_, clf.best_score_))

Output:

    Elapsed time: 13.397480000000002
    best param:{'min_samples_split': 10}
    best score:0.8406285072951739

    from sklearn import metrics

    clf = RandomForestClassifier(min_samples_split=10)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Train score:", clf.score(X_train, y_train))
    print("Test score:", clf.score(X_test, y_test))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print("F1_score:", metrics.f1_score(y_test, y_pred))

Output:

    Train score: 0.8974719101123596
    Test score: 0.7988826815642458
    Precision: 0.8082191780821918
    Recall: 0.7283950617283951
    F1_score: 0.7662337662337663

This run compared four parameters of the model: criterion, max_depth, min_samples_split, and n_estimators.
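Across repeated runs the result is still not very stable; the best parameter is consistently min_samples_split, with the best value landing on 8, 10, or 12.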
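One way to put a number on that instability is to fix the random seeds and score each candidate value with repeated cross-validation, then compare the spread as well as the mean. The sketch below is not from the original article: it uses a synthetic dataset from `make_classification` as a stand-in for the Titanic features, and the seed values, `n_estimators=30`, and the 5x3 fold layout are assumptions chosen only to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the Titanic feature matrix
# (assumption: 7 features and a binary target, like the data above).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_redundant=1, random_state=0
)

# 5 folds repeated 3 times gives 15 scores per candidate value.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

results = {}
for mss in (8, 10, 12):
    # A fixed random_state makes each forest reproducible between runs.
    model = RandomForestClassifier(
        n_estimators=30, min_samples_split=mss, random_state=0
    )
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    results[mss] = (scores.mean(), scores.std())
    print("min_samples_split={0}: mean={1:.3f} std={2:.3f}".format(mss, *results[mss]))
```

If the means differ by less than about one standard deviation, the choice among 8, 10, and 12 is probably within the noise of the data split rather than a real difference.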
Predicting Titanic Survivors with Bagging (Bootstrap Aggregating)
    from sklearn import metrics
    from sklearn.ensemble import BaggingClassifier

    clf = BaggingClassifier(n_estimators=50)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Train score:", clf.score(X_train, y_train))
    print("Test score:", clf.score(X_test, y_test))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print("F1_score:", metrics.f1_score(y_test, y_pred))

Output:

    Train score: 0.9817415730337079
    Test score: 0.7877094972067039
    Precision: 0.7792207792207793
    Recall: 0.7407407407407407
    F1_score: 0.7594936708860759

Predicting Titanic Survivors with Boosting
    from sklearn import metrics
    from sklearn.ensemble import AdaBoostClassifier

    clf = AdaBoostClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Train score:", clf.score(X_train, y_train))
    print("Test score:", clf.score(X_test, y_test))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print("F1_score:", metrics.f1_score(y_test, y_pred))

Output:

    Train score: 0.8300561797752809
    Test score: 0.8156424581005587
    Precision: 0.8076923076923077
    Recall: 0.7777777777777778
    F1_score: 0.7924528301886792

Predicting Titanic Survivors with Extra Trees
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import GridSearchCV
    import time

    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    entropy_thresholds = np.linspace(0, 1, 50)
    gini_thresholds = np.linspace(0, 0.1, 50)
    # Parameter grid: same search space as for the random forest.
    param_grid = [
        {'criterion': ['entropy'], 'min_impurity_decrease': entropy_thresholds},
        {'criterion': ['gini'], 'min_impurity_decrease': gini_thresholds},
        {'max_depth': np.arange(2, 10)},
        {'min_samples_split': np.arange(2, 20)},
        {'n_estimators': np.arange(2, 20)},
    ]
    clf = GridSearchCV(ExtraTreesClassifier(), param_grid, cv=5)
    clf.fit(X, y)
    print("Elapsed time:", time.perf_counter() - start)
    print("best param:{0}\nbest score:{1}".format(clf.best_params_, clf.best_score_))

Output:

    Elapsed time: 16.29516799999999
    best param:{'min_samples_split': 12}
    best score:0.8226711560044894

    from sklearn import metrics
    from sklearn.ensemble import ExtraTreesClassifier

    clf = ExtraTreesClassifier(min_samples_split=12)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Train score:", clf.score(X_train, y_train))
    print("Test score:", clf.score(X_test, y_test))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print("F1_score:", metrics.f1_score(y_test, y_pred))

Output:

    Train score: 0.8932584269662921
    Test score: 0.8100558659217877
    Precision: 0.8405797101449275
    Recall: 0.7160493827160493
    F1_score: 0.7733333333333333

Conclusion:
Comparing the Titanic predictions on this dataset, Boosting (AdaBoost) performed best and was the most stable, followed by the parameter-tuned Extra Trees model.
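The ranking can be checked directly against the scores printed in the sections above. The following sketch only tabulates those reported numbers; the dictionary layout, the helper `f1_from`, and the use of the train/test gap as a rough proxy for stability are assumptions of this sketch, not something the original article computes.

```python
# Scores copied from the outputs reported above (one shared train/test split).
reported = {
    "RandomForest": {"train": 0.8974719101123596, "test": 0.7988826815642458,
                     "precision": 0.8082191780821918, "recall": 0.7283950617283951,
                     "f1": 0.7662337662337663},
    "Bagging": {"train": 0.9817415730337079, "test": 0.7877094972067039,
                "precision": 0.7792207792207793, "recall": 0.7407407407407407,
                "f1": 0.7594936708860759},
    "AdaBoost": {"train": 0.8300561797752809, "test": 0.8156424581005587,
                 "precision": 0.8076923076923077, "recall": 0.7777777777777778,
                 "f1": 0.7924528301886792},
    "ExtraTrees": {"train": 0.8932584269662921, "test": 0.8100558659217877,
                   "precision": 0.8405797101449275, "recall": 0.7160493827160493,
                   "f1": 0.7733333333333333},
}

def f1_from(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Rank the models by test accuracy, best first.
ranking = sorted(reported, key=lambda name: reported[name]["test"], reverse=True)

# Train/test gap per model: a large gap suggests overfitting to the training set.
gaps = {name: r["train"] - r["test"] for name, r in reported.items()}

for name in ranking:
    r = reported[name]
    # Sanity check that the reported F1 matches its precision and recall.
    consistent = abs(f1_from(r["precision"], r["recall"]) - r["f1"]) < 1e-9
    print("{0:<13} test={1:.4f} gap={2:.4f} f1_consistent={3}".format(
        name, r["test"], gaps[name], consistent))
```

On these numbers AdaBoost has both the highest test score and the smallest train/test gap, with Extra Trees second on test score, which agrees with the conclusion stated above.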