[Kaggle] Heart Disease Prediction
Table of Contents
- 1. Data Exploration
- 2. Feature Processing Pipeline
- 3. Model Training
- 4. Prediction
Kaggle project page
1. Data Exploration
```python
import pandas as pd

train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

train.info()
test.info()
abs(train.corr()['target']).sort_values(ascending=False)
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       241 non-null    int64
 1   sex       241 non-null    int64
 2   cp        241 non-null    int64
 3   trestbps  241 non-null    int64
 4   chol      241 non-null    int64
 5   fbs       241 non-null    int64
 6   restecg   241 non-null    int64
 7   thalach   241 non-null    int64
 8   exang     241 non-null    int64
 9   oldpeak   241 non-null    float64
 10  slope     241 non-null    int64
 11  ca        241 non-null    int64
 12  thal      241 non-null    int64
 13  target    241 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 26.5 KB
```

The training set has 241 rows and 13 features (all numeric); the label is `target`.
- Correlation coefficients between each feature and the label
- Inspect the values each feature takes
- Some features are codes rather than magnitudes; convert them to categorical variables (string type, to be one-hot encoded later)
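A minimal sketch of that last step, using a toy frame in place of the real CSVs (the exact set of columns to convert is an assumption; the original doesn't list them):

```python
import pandas as pd

# Toy stand-in for train.csv (the real file has 241 rows, 14 columns)
df = pd.DataFrame({'age': [63, 45], 'cp': [3, 1], 'thal': [1, 2], 'target': [1, 0]})

# Columns that are codes, not magnitudes: cast them to string so the
# pipeline later one-hot encodes them instead of treating them as ordered
# numbers (assumed column list -- adjust after inspecting the values)
cat_cols = ['cp', 'thal']
df[cat_cols] = df[cat_cols].astype(str)
```

The same cast should be applied to both `train` and `test` so the pipeline sees consistent dtypes.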
2. Feature Processing Pipeline
- Separate numeric features from categorical (string) features
- Hold out part of the data for local validation
- Build the data-processing pipeline
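The original post doesn't show this pipeline's code, but the later sections use `full_pipeline`, `train_part`, `valid_part`, `train_part_y`, and `valid_part_y`. A hedged sketch of what those steps might look like, on toy data (the feature lists, imputer strategy, and split ratio are assumptions; the `num_pipeline` name matches the commented-out grid-search parameter below):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# Toy data standing in for the real train set
train = pd.DataFrame({
    'age': [63, 45, 52, 60],          # numeric feature
    'cp': ['3', '1', '2', '3'],       # categorical feature (already cast to str)
    'target': [1, 0, 1, 0],
})
y = train['target']
X = train.drop(['target'], axis=1)

num_attribs = ['age']   # assumed numeric column list
cat_attribs = ['cp']    # assumed categorical column list

# numeric branch: impute missing values, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

# combine numeric and one-hot-encoded categorical branches
full_pipeline = ColumnTransformer([
    ('num_pipeline', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs),
])

# hold out part of the data for local validation
train_part, valid_part, train_part_y, valid_part_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

prepared = full_pipeline.fit_transform(train_part)
```

Wrapping `full_pipeline` together with a model in an outer `Pipeline` (as done below) lets the grid search tune preprocessing and model parameters jointly.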
3. Model Training
```python
# Local validation: grid search over several model families
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rf = RandomForestClassifier()
knn = KNeighborsClassifier()
lr = LogisticRegression()
svc = SVC()
gbdt = GradientBoostingClassifier()
perceptron = Perceptron()

models = [perceptron, knn, lr, svc, rf, gbdt]
param_grid_list = [
    # perceptron
    [{'model__max_iter': [10000, 5000]}],
    # knn
    [{'model__n_neighbors': [3, 5, 10, 15, 35],
      'model__leaf_size': [3, 5, 10, 20, 30, 40, 50]}],
    # lr
    [{'model__penalty': ['l1', 'l2'],
      'model__C': [0.05, 0.1, 0.2, 0.5, 1, 1.2],
      'model__max_iter': [50000]}],
    # svc
    [{'model__degree': [3, 5, 7],
      'model__C': [0.2, 0.5, 1, 1.2, 1.5],
      'model__kernel': ['rbf', 'sigmoid', 'poly']}],
    # rf
    [{  # 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
      'model__n_estimators': [100, 200, 250, 300, 350],
      'model__max_features': [5, 8, 10, 12, 15, 20, 30, 40, 50],
      'model__max_depth': [3, 5, 7]}],
    # gbdt
    [{'model__learning_rate': [0.02, 0.05, 0.1, 0.2],
      'model__n_estimators': [30, 50, 100, 150],
      'model__max_features': [5, 8, 10, 20, 30, 40],
      'model__max_depth': [3, 5, 7],
      'model__min_samples_split': [10, 20, 40],
      'model__min_samples_leaf': [5, 10, 20],
      'model__subsample': [0.5, 0.8, 1]}],
]

for i, model in enumerate(models):
    pipe = Pipeline([
        ('preparation', full_pipeline),
        ('model', model),
    ])
    grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,
                               scoring='accuracy', verbose=2, n_jobs=-1)
    grid_search.fit(train_part, train_part_y)
    print(grid_search.best_params_)
    final_model = grid_search.best_estimator_
    pred = final_model.predict(valid_part)
    print('accuracy score: ', accuracy_score(valid_part_y, pred))
```

```
Fitting 3 folds for each of 2 candidates, totalling 6 fits
{'model__max_iter': 10000}
accuracy score:  0.4489795918367347
Fitting 3 folds for each of 35 candidates, totalling 105 fits
{'model__leaf_size': 3, 'model__n_neighbors': 3}
accuracy score:  0.5306122448979592
Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'model__C': 0.1, 'model__max_iter': 50000, 'model__penalty': 'l2'}
accuracy score:  0.8979591836734694
Fitting 3 folds for each of 45 candidates, totalling 135 fits
{'model__C': 1, 'model__degree': 5, 'model__kernel': 'poly'}
accuracy score:  0.6326530612244898
Fitting 3 folds for each of 135 candidates, totalling 405 fits
{'model__max_depth': 5, 'model__max_features': 5, 'model__n_estimators': 250}
accuracy score:  0.8775510204081632
Fitting 3 folds for each of 7776 candidates, totalling 23328 fits
{'model__learning_rate': 0.05, 'model__max_depth': 7, 'model__max_features': 20, 'model__min_samples_leaf': 10, 'model__min_samples_split': 40, 'model__n_estimators': 150, 'model__subsample': 0.5}
accuracy score:  0.8163265306122449
```

LR, RF, and GBDT perform best.
4. Prediction
```python
# Train on the full data and generate submissions,
# using randomized parameter search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import numpy as np

y_train = train_['target']
X_train = train_.drop(['target'], axis=1)
X_test = test_

select_model = [lr, rf, gbdt]
name = ['lr', 'rf', 'gbdt']
param_distribs = [
    # lr
    [{'model__penalty': ['l1', 'l2'],
      'model__C': np.linspace(0.01, 0.5, 10),
      'model__max_iter': [50000]}],
    # rf
    [{  # 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
      'model__n_estimators': randint(low=50, high=500),
      'model__max_features': randint(low=3, high=30),
      'model__max_depth': randint(low=2, high=20)}],
    # gbdt
    [{'model__learning_rate': np.linspace(0.01, 0.3, 10),
      'model__n_estimators': randint(low=30, high=500),
      'model__max_features': randint(low=5, high=50),
      'model__max_depth': randint(low=3, high=20),
      'model__min_samples_split': randint(low=10, high=100),
      'model__min_samples_leaf': randint(low=3, high=50),
      'model__subsample': np.linspace(0.5, 1.0, 10)}],  # subsample must be <= 1
]

for i, model in enumerate(select_model):
    pipe = Pipeline([
        ('preparation', full_pipeline),
        ('model', model),
    ])
    rand_search = RandomizedSearchCV(pipe, param_distributions=param_distribs[i],
                                     cv=5, n_iter=1000, scoring='accuracy',
                                     verbose=2, n_jobs=-1)
    rand_search.fit(X_train, y_train)
    print(rand_search.best_params_)
    final_model = rand_search.best_estimator_
    pred = final_model.predict(X_test)
    print(model, "\nFINISH !!!")
    res = pd.DataFrame()
    res['Id'] = range(1, 63, 1)
    res['Prediction'] = pred
    res.to_csv('{}_pred.csv'.format(name[i]), index=False)
```

The submission results follow.