Kaggle first steps: Titanic survival prediction
Continuing to learn data mining, I tried the Titanic survival prediction competition on Kaggle.
Titanic for Machine Learning
Imports and loading data
```python
# data processing
import numpy as np
import pandas as pd
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

train = pd.read_csv('D:/data/titanic/train.csv')
test = pd.read_csv('D:/data/titanic/test.csv')
train.head()
```

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
The features are:
- PassengerId: no particular meaning.
- Pclass: cabin class. Does it affect survival? Did the upper classes have a better chance?
- Name: can help us infer sex and approximate age.
- Sex: is the survival rate higher for women?
- Age: do different age groups survive at different rates?
- SibSp and Parch: numbers of siblings/spouses and parents/children aboard. Does having relatives aboard raise or lower the survival rate?
- Fare: does a higher fare buy a better chance?
- Cabin and Embarked: cabin and port of embarkation… intuitively these should have no effect on survival.
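The summary statistics below were presumably produced by a call like the following (a sketch; the original cell was lost in extraction):

```python
# numeric summary of the training set: count, mean, std, min, quartiles, max
train.describe()
```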
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
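The categorical summary below was presumably produced by (again a sketch of the lost cell):

```python
# summary of the object (string) columns: count, unique, top, freq
train.describe(include=['O'])
```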
| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Hippach, Mrs. Louis Albert (Ida Sophia Fischer) | male | 1601 | C23 C25 C27 | S |
| freq | 1 | 577 | 7 | 4 | 644 |
The target feature: Survived
```python
survive_num = train.Survived.value_counts()
survive_num.plot.pie(explode=[0, 0.1], autopct='%1.1f%%',
                     labels=['died', 'survived'], shadow=True)
plt.show()

x = [0, 1]
plt.bar(x, survive_num, width=0.35)
plt.xticks(x, ('died', 'survived'))
plt.show()
```

Feature analysis
```python
num_f = [f for f in train.columns if train.dtypes[f] != 'object']
cat_f = [f for f in train.columns if train.dtypes[f] == 'object']
print('there are %d numerical features:' % len(num_f), num_f)
print('there are %d category features:' % len(cat_f), cat_f)
```
```
there are 7 numerical features: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
there are 5 category features: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
```
Feature types:
- numerical
- categorical: either orderable (ordinal) or unorderable (nominal)
- nominal (unorderable) categorical here: Sex, Embarked
Categorical features
Sex
```python
train.groupby(['Sex'])['Survived'].count()
```
```
Sex
female    314
male      577
Name: Survived, dtype: int64
```
```python
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.countplot(x='Sex', hue='Survived', data=train)
fig.set_title('Sex: Survived vs Dead')
plt.show()

train.groupby(['Sex'])['Survived'].sum() / train.groupby(['Sex'])['Survived'].count()
```
```
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
```

There were far more men aboard than women, yet women survived at about 74%, far above the men's 18-19%. Sex is clearly an important feature.

Embarked
```python
sns.factorplot('Embarked', 'Survived', data=train)
plt.show()

f, ax = plt.subplots(1, 3, figsize=(24, 6))
sns.countplot('Embarked', data=train, ax=ax[0])
ax[0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked', hue='Survived', data=train, ax=ax[1])
ax[1].set_title('Embarked vs Survived')
sns.countplot('Embarked', hue='Pclass', data=train, ax=ax[2])
ax[2].set_title('Embarked vs Pclass')
#plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

#pd.pivot_table(train,index='Embarked',columns='Pclass',values='Fare')
sns.boxplot(x='Embarked', y='Fare', hue='Pclass', data=train)
plt.show()
```

The plots show that most passengers boarded at port S, mostly in class 3, although S also has the most class 1 passengers of the three ports. Port C has the highest survival rate, about 0.55, because a relatively large share of its passengers are in class 1; port Q is almost entirely class 3. The mean class 1 and 2 fares at port C are higher, which may hint that its passengers had higher social status. Logically, though, the port of embarkation should not affect survival by itself, so it can be converted to dummy variables or dropped.
Pclass
```python
train.groupby('Pclass')['Survived'].value_counts()
```
```
Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64
```
```python
plt.subplots(figsize=(8, 6))
f = sns.countplot('Pclass', hue='Survived', data=train)
sns.factorplot('Pclass', 'Survived', hue='Sex', data=train)
plt.show()
```

Classes 1 and 2 survive at clearly higher rates: more than half of class 1 survived, class 2 is roughly even, and women in classes 1 and 2 survive at rates approaching 1. Cabin class therefore has a strong effect on survival.
SibSp
train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False) .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; }| 1 | 0.535885 |
| 2 | 0.464286 |
| 0 | 0.345395 |
| 3 | 0.250000 |
| 4 | 0.166667 |
| 5 | 0.000000 |
| 8 | 0.000000 |
With no companions aboard, the survival rate is roughly 0.3. One companion gives the highest rate, above 0.5, perhaps because such passengers are disproportionately in classes 1 and 2. Beyond that the rate falls as the party grows, probably because parties of more than three are mostly in class 3, where large families survive at very low rates (a quick check below).
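That class 3 explanation can be verified directly; a quick sketch (not in the original post):

```python
# where do the large parties sit, and how do large class 3 families fare?
print(pd.crosstab(train['SibSp'], train['Pclass']))
print(train[(train['Pclass'] == 3) & (train['SibSp'] >= 3)]['Survived'].mean())
```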
Parch
```python
#pd.pivot_table(train,values='Survived',index='Parch',columns='Pclass')
sns.countplot(x='Parch', hue='Pclass', data=train)
plt.show()

sns.factorplot('Parch', 'Survived', data=train)
plt.show()
```

The trend mirrors SibSp: traveling alone lowers the survival rate, having 1-3 parents/children raises it, and beyond that it drops sharply, since most large families are in class 3.
Age
```python
train.groupby('Survived')['Age'].describe()
```

| Survived | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 424.0 | 30.626179 | 14.172110 | 1.00 | 21.0 | 28.0 | 39.0 | 74.0 |
| 1 | 290.0 | 28.343690 | 14.950952 | 0.42 | 19.0 | 28.0 | 36.0 | 80.0 |
In class 1 the rescued skew younger, and survival spans a wide age range; in particular, survival is relatively high from the twenties up to about fifty, which may simply reflect that class 1 passengers are older overall. Children around age 10 survive at clearly higher rates in classes 2 and 3, and the same child bump shows up among males. Surviving women are concentrated in the young-to-middle-aged range, while deaths are heaviest among 20-40-year-olds (see the sketch below).
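The plots behind these observations did not survive extraction; a sketch of the usual view for this dataset (split violin plots of age by class and by sex, assumed rather than recovered from the post):

```python
# age distributions by class and by sex, split by survival
f, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=train, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
sns.violinplot(x='Sex', y='Age', hue='Survived', data=train, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
plt.show()
```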
Name
The main uses of Name are to help identify sex and to fill missing ages using passengers who share the same title.
```python
# use a regular expression to pull the title out of each name
def getTitle(data):
    name_sal = []
    for i in range(len(data['Name'])):
        name_sal.append(re.findall(r'.\w*\.', data.Name[i]))
    Salut = []
    for i in range(len(name_sal)):
        name = str(name_sal[i])
        name = name[1:-1].replace("'", "")
        name = name.replace(".", "").strip()
        name = name.replace(" ", "")
        Salut.append(name)
    data['Title'] = Salut

getTitle(train)
train.head(2)
```

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs |
(output: a crosstab of Title by Sex on train, presumably `pd.crosstab(train['Title'], train['Sex'])`, matching the one computed on all_data below; the row labels were lost in extraction, but the counts are dominated by Mr, Miss, Mrs and Master, with each rare title appearing only once or twice)
A quick English lesson on the rarer titles: Mme, a form of address for a married (or professional) woman of a non-English-speaking people, equivalent to Mrs; Jonkheer, a squire of the Dutch gentry; Capt, a captain; Lady, a noblewoman; Don, a Spanish honorific for nobles and men of standing; the Countess, a countess; Ms (Ms. or Mz), a woman of unspecified marital status; Col, a colonel; Major, a major; Mlle, mademoiselle, i.e. Miss; Rev, a reverend (clergyman).
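Since each of these rare titles appears only a handful of times, a common follow-up (not done in this post, where every title later gets its own dummy column) is to fold them into the four frequent groups; a sketch, assuming the title strings produced by getTitle above:

```python
# map the French forms onto their English equivalents
train['Title'] = train['Title'].replace(['Mlle', 'Ms'], 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
# collapse everything outside the four big groups into a single 'Rare' bucket
common = ['Mr', 'Mrs', 'Miss', 'Master']
train['Title'] = train['Title'].where(train['Title'].isin(common), 'Rare')
```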
Fare
```python
train.groupby('Pclass')['Fare'].mean()
```
```
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
```
```python
sns.distplot(train['Fare'].dropna())
plt.xlim((0, 200))
plt.xticks(np.arange(0, 200, 10))
plt.show()
```

Summary of the initial analysis:
- Women survive at a clearly higher rate than men.
- Class 1 survival is very high and class 3 very low; women in classes 1 and 2 survive at rates approaching 1.
- Children around age 10 show a clear bump in survival.
- SibSp and Parch behave similarly: traveling alone lowers survival, 1-2 siblings/spouses or 1-3 parents/children raise it, and larger parties see a sharp drop.
- Name and Age can be processed across the full data set: extract the title from Name, then fill missing ages from the mean age per title (see the sketch below).
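A minimal sketch of that Age fill. The post itself never fills Age explicitly (missing ages simply fall into Age_band 0 later), so this is an optional step, shown on train, which already has a Title column at this point:

```python
# impute each missing age with the mean age of passengers sharing the same title
title_means = train.groupby('Title')['Age'].transform('mean')
train['Age'] = train['Age'].fillna(title_means)
```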
Data processing
```python
# combine the train and test sets
passID = test['PassengerId']
all_data = pd.concat([train, test], keys=["train", "test"])
all_data.shape
#all_data.head()
```
```
(1309, 13)
```
```python
# tally the missing values
NAs = pd.concat([train.isnull().sum(),
                 train.isnull().sum() / train.isnull().count(),
                 test.isnull().sum(),
                 test.isnull().sum() / test.isnull().count()],
                axis=1, keys=["train", "percent_train", "test", "percent"])
NAs[NAs.sum(axis=1) > 1].sort_values(by="percent", ascending=False)
```

| | train | percent_train | test | percent |
|---|---|---|---|---|
| Cabin | 687 | 0.771044 | 327.0 | 0.782297 |
| Age | 177 | 0.198653 | 86.0 | 0.205742 |
| Fare | 0 | 0.000000 | 1.0 | 0.002392 |
| Embarked | 2 | 0.002245 | 0.0 | 0.000000 |
(output: the first two rows of all_data, the Braund and Cumings records, with the columns now in alphabetical order and the Title column carried over from train; some cells were lost in extraction, so the table is reduced to this placeholder)
Processing Age
```python
# first extract the title from Name across the combined data
getTitle(all_data)
pd.crosstab(all_data['Title'], all_data['Sex'])
```

(output: female/male counts for each extracted title on the combined data; the row labels were lost in extraction, but as before the table is dominated by Mr, Miss, Mrs and Master, with the rare titles appearing only a handful of times each)
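The per-title survival rates below were presumably computed on the training slice with something like the following (a sketch; the displayed table lists only the four common titles, so the rare ones may have been filtered or mapped first):

```python
# mean survival rate per title, training rows only
all_data[:train.shape[0]].groupby('Title')['Survived'].mean()
```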
| Title | Survived |
|---|---|
| Master | 0.575000 |
| Miss | 0.702703 |
| Mr | 0.158192 |
| Mrs | 0.777778 |
- Children up to about 16 survive at higher rates, and the oldest passenger (80) survived.
- Many 16-40-year-olds did not survive.
- Most passengers are between 16 and 40.
- To help the classifiers, bin the ages into bands as a new feature, and add a child feature.
add isChild
```python
def male_female_child(passenger):
    # unpack age and sex
    age, sex = passenger
    # treat children as their own category
    if age < 16:
        return 'child'
    else:
        return sex

# create the new feature
all_data['person'] = all_data[['Age', 'Sex']].apply(male_female_child, axis=1)

# ages span 0-80; split into three bands: young, middle-aged, old
all_data['Age_band'] = 0
all_data.loc[all_data['Age'] <= 16, 'Age_band'] = 0
all_data.loc[(all_data['Age'] > 16) & (all_data['Age'] <= 40), 'Age_band'] = 1
all_data.loc[all_data['Age'] > 40, 'Age_band'] = 2
```

Processing Name
```python
df = pd.get_dummies(all_data['Title'], prefix='Title')
all_data = pd.concat([all_data, df], axis=1)
all_data.drop('Title', axis=1, inplace=True)
# drop Name
all_data.drop('Name', axis=1, inplace=True)
```

Fillna Embarked
```python
all_data.loc[all_data.Embarked.isnull()]
```

(output: the two passengers with a missing Embarked, ages 38 and 62, both female, first class, Fare 80.0, traveling alone on the shared ticket 113572)
With a fare of 80 in first class, they most likely boarded at C.
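A quick way to back up that guess (a sketch, not in the original post): compare first-class fares across the ports.

```python
# median first-class fare at each embarkation port; C should sit near 80
all_data[all_data['Pclass'] == 1].groupby('Embarked')['Fare'].median()
```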
```python
all_data['Embarked'].fillna('C', inplace=True)
all_data.Embarked.isnull().any()
```
```
False
```
```python
embark_dummy = pd.get_dummies(all_data.Embarked)
all_data = pd.concat([all_data, embark_dummy], axis=1)
all_data.head(2)
```

(output: the first two rows, now carrying the C/Q/S dummy columns)
add SibSp and Parch
```python
# create two new features: Family_size and alone
all_data['Family_size'] = all_data['SibSp'] + all_data['Parch']  # total relatives aboard
all_data['alone'] = 0  # default: not alone
all_data.loc[all_data.Family_size == 0, 'alone'] = 1  # flag passengers traveling alone

f, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.factorplot('Family_size', 'Survived', data=all_data[:train.shape[0]], ax=ax[0])
ax[0].set_title('Family_size vs Survived')
sns.factorplot('alone', 'Survived', data=all_data[:train.shape[0]], ax=ax[1])
ax[1].set_title('alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()
```

Passengers traveling alone survive at a low rate, around 0.3; with 1-3 family members the rate rises, but above 4 it drops steeply again.
```python
# then bin Family_size
all_data['Family_size'] = np.where(all_data['Family_size'] == 0, 'solo',
                                   np.where(all_data['Family_size'] <= 3, 'normal', 'big'))
sns.factorplot('alone', 'Survived', hue='Sex', data=all_data[:train.shape[0]], col='Pclass')
plt.show()
```

For women in classes 1 and 2, traveling alone makes little difference; for class 3 women, traveling alone actually raises the survival rate.
```python
all_data['poor_girl'] = 0
all_data.loc[(all_data['Sex'] == 'female') & (all_data['Pclass'] == 3)
             & (all_data['alone'] == 1), 'poor_girl'] = 1
```

Filling and binning the continuous Fare variable
```python
# fill the missing fares
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass == 1), 'Fare'] = 84
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass == 2), 'Fare'] = 21
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass == 3), 'Fare'] = 14

sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived == 0, 'Fare'],
             color='red', label='Not Survived')
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived == 1, 'Fare'],
             color='blue', label='Survived')
plt.xlim((0, 100))

sns.lmplot('Fare', 'Survived', data=all_data[:train.shape[0]])
plt.show()

# split Fare into three equal-frequency bins and look at the mean survival in each
all_data['Fare_band'] = pd.qcut(all_data['Fare'], 3)
all_data[:train.shape[0]].groupby('Fare_band')['Survived'].mean()
```
```
Fare_band
(-0.001, 8.662]    0.198052
(8.662, 26.0]      0.402778
(26.0, 512.329]    0.559322
Name: Survived, dtype: float64
```
```python
# discretize the continuous Fare variable
all_data['Fare_cut'] = 0
all_data.loc[all_data['Fare'] <= 8.662, 'Fare_cut'] = 0
all_data.loc[(all_data['Fare'] > 8.662) & (all_data['Fare'] <= 26), 'Fare_cut'] = 1
#all_data.loc[((all_data['Fare']>14.454) & (all_data['Fare']<=31.275)),'Fare_cut'] = 2
all_data.loc[(all_data['Fare'] > 26) & (all_data['Fare'] < 513), 'Fare_cut'] = 2

sns.factorplot('Fare_cut', 'Survived', hue='Sex', data=all_data[:train.shape[0]])
plt.show()
```

Survival rises with fare, and the effect is especially clear for men.
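The 84/21/14 fills above are hard-coded roundings of the class means computed earlier; a less brittle equivalent (a sketch, not what the post runs) would be:

```python
# fill missing fares with the median fare of the passenger's class
all_data['Fare'] = all_data['Fare'].fillna(
    all_data.groupby('Pclass')['Fare'].transform('median'))
```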
```python
# create a feature flagging rich men
all_data['rich_man'] = 0
all_data.loc[(all_data['Fare'] >= 80) & (all_data['Sex'] == 'male'), 'rich_man'] = 1
```

Converting categorical features to numbers
```python
all_data.head()
```

(output: 5 rows × 24 columns; the engineered frame now carries Age_band, person, Family_size, alone, poor_girl, Fare_band, Fare_cut, rich_man and the Title/Embarked dummies)
Features to drop: Embarked (already one-hot encoded, plus one dummy column, C in the code below, to avoid collinearity), Fare and Fare_band (replaced by Fare_cut), Sex (replaced by person), Age (replaced by Age_band), Ticket, SibSp and Parch.
```python
'''
Drop the features we no longer need:
Age    - replaced by the binned Age_band
Fare   - replaced by the binned Fare_cut (Fare_band served only to find the cut points)
Ticket - carries no information
'''
all_data.drop(['Age', 'Fare', 'Ticket', 'Embarked', 'C', 'Fare_band',
               'SibSp', 'Parch'], axis=1, inplace=True)
all_data.head(2)
```

(output: the first rows of the reduced frame; the original display shows 5 rows × 25 columns)
```python
all_data.drop(['Sex', 'person', 'Age_band', 'Family_size'], axis=1, inplace=True)
all_data.head()
```

(output: 5 rows × 21 columns; every remaining feature is numeric)
Building models
```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix           # array of predictions vs targets
from sklearn.model_selection import cross_val_predict  # cross-validated predictions
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

train_data = all_data[:train.shape[0]]
test_data = all_data[train.shape[0]:]
print('train data:' + str(train_data.shape))
print('test data:' + str(test_data.shape))
```
```
train data:(668, 21)
test data:(641, 21)
```
```python
train, test = train_test_split(train_data, test_size=0.25,
                               random_state=0, stratify=train_data['Survived'])
train_x = train.drop('Survived', axis=1)
train_y = train['Survived']
test_x = test.drop('Survived', axis=1)
test_y = test['Survived']
print(train_x.shape)
print(test_x.shape)
```
```
(668, 20)
(223, 20)
```
```python
# 10-fold cross-validated accuracy on the train and test splits
def cv_score(model):
    cv_result = cross_val_score(model, train_x, train_y, cv=10, scoring="accuracy")
    return cv_result

def cv_score_test(model):
    cv_result_test = cross_val_score(model, test_x, test_y, cv=10, scoring="accuracy")
    return cv_result_test
```

RBF SVM
```python
# RBF SVM model
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]}
clf_svc = GridSearchCV(svm.SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf_svc = clf_svc.fit(train_x, train_y)
print("Best estimator found by grid search:")
print(clf_svc.best_estimator_)
acc_svc_train = cv_score(clf_svc.best_estimator_).mean()
acc_svc_test = cv_score_test(clf_svc.best_estimator_).mean()
print(acc_svc_train)
print(acc_svc_test)
```
```
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
0.826306967835
0.816196122718
```

Decision tree
```python
# a simple tree
clf_tree = DecisionTreeClassifier()
clf_tree.fit(train_x, train_y)
acc_tree_train = cv_score(clf_tree).mean()
acc_tree_test = cv_score_test(clf_tree).mean()
print(acc_tree_train)
print(acc_tree_test)
```
```
0.808216271583
0.811631846414
```

KNN
```python
# scan n_neighbors from 1 to 10
pred = []
for i in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_x, train_y)
    pred.append(cv_score(model).mean())
n = list(range(1, 11))
plt.plot(n, pred)
plt.xticks(range(1, 11))
plt.show()

clf_knn = KNeighborsClassifier(n_neighbors=4)
clf_knn.fit(train_x, train_y)
acc_knn_train = cv_score(clf_knn).mean()
acc_knn_test = cv_score_test(clf_knn).mean()
print(acc_knn_train)
print(acc_knn_test)
```
```
0.826239790353
0.829653679654
```

Logistic regression
```python
# logistic regression
clf_LR = LogisticRegression()
clf_LR.fit(train_x, train_y)
acc_LR_train = cv_score(clf_LR).mean()
acc_LR_test = cv_score_test(clf_LR).mean()
print(acc_LR_train)
print(acc_LR_test)
```
```
0.838226647511
0.811848296631
```

Gaussian naive Bayes
```python
clf_gb = GaussianNB()
clf_gb.fit(train_x, train_y)
acc_gb_train = cv_score(clf_gb).mean()
acc_gb_test = cv_score_test(clf_gb).mean()
print(acc_gb_train)
print(acc_gb_test)
```
```
0.794959693511
0.789695087521
```

Random forest
```python
n_estimators = range(100, 1000, 100)
grid = {'n_estimators': n_estimators}
clf_forest = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid=grid, verbose=True)
clf_forest.fit(train_x, train_y)
print(clf_forest.best_estimator_)
print(clf_forest.best_score_)
#print(cv_score(clf_forest).mean())
#print(cv_score_test(clf_forest).mean())
```
```
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 32.2s finished
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
0.817365269461
```
```python
clf_forest = RandomForestClassifier(n_estimators=200)
clf_forest.fit(train_x, train_y)
acc_forest_train = cv_score(clf_forest).mean()
acc_forest_test = cv_score_test(clf_forest).mean()
print(acc_forest_train)
print(acc_forest_test)
```
```
0.811178066885
0.811434217956
```
```python
pd.Series(clf_forest.feature_importances_, train_x.columns).sort_values(
    ascending=True).plot.barh(width=0.8)
plt.show()

models = pd.DataFrame({
    'model': ['SVM', 'Decision Tree', 'KNN', 'Logistic regression',
              'Gaussion Bayes', 'Random Forest'],
    'score on train': [acc_svc_train, acc_tree_train, acc_knn_train,
                       acc_LR_train, acc_gb_train, acc_forest_train],
    'score on test': [acc_svc_test, acc_tree_test, acc_knn_test,
                      acc_LR_test, acc_gb_test, acc_forest_test]})
models.sort_values(by='score on test', ascending=False)
```

| model | score on test | score on train |
|---|---|---|
| KNN | 0.829654 | 0.826240 |
| SVM | 0.816196 | 0.826307 |
| Logistic regression | 0.811848 | 0.838227 |
| Decision Tree | 0.811632 | 0.808216 |
| Random Forest | 0.811434 | 0.811178 |
| Gaussion Bayes | 0.789695 | 0.794960 |
Ensemble
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

# bagging, with the tuned SVM as the base estimator
bag_tree = BaggingClassifier(base_estimator=clf_svc.best_estimator_,
                             n_estimators=200, random_state=0)
bag_tree.fit(train_x, train_y)
acc_bagtree_train = cv_score(bag_tree).mean()
acc_bagtree_test = cv_score_test(bag_tree).mean()
print(acc_bagtree_train)
print(acc_bagtree_test)
```
```
0.82782211935
0.816196122718
```

AdaBoost
```python
n_estimators = range(100, 1000, 100)
a = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
grid = {'n_estimators': n_estimators, 'learning_rate': a}
ada = GridSearchCV(AdaBoostClassifier(), param_grid=grid, verbose=True)
ada.fit(train_x, train_y)
print(ada.best_estimator_)
print(ada.best_score_)
```
```
Fitting 3 folds for each of 90 candidates, totalling 270 fits
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed: 5.4min finished
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.05, n_estimators=200, random_state=None)
0.835329341317
```
```python
ada = AdaBoostClassifier(n_estimators=200, random_state=0, learning_rate=0.2)
ada.fit(train_x, train_y)
acc_ada_train = cv_score(ada).mean()
acc_ada_test = cv_score_test(ada).mean()
print(acc_ada_train)
print(acc_ada_test)
```
```
0.829248144305
0.825719932242
```
```python
# confusion matrix to inspect the predictions
y_pred = cross_val_predict(ada, test_x, test_y, cv=10)
sns.heatmap(confusion_matrix(test_y, y_pred), cmap='winter', annot=True, fmt='2.0f')
plt.show()
```

GradientBoosting
```python
n_estimators = range(100, 1000, 100)
a = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
grid = {'n_estimators': n_estimators, 'learning_rate': a}
grad = GridSearchCV(GradientBoostingClassifier(), param_grid=grid, verbose=True)
grad.fit(train_x, train_y)
print(grad.best_estimator_)
print(grad.best_score_)
```
```
Fitting 3 folds for each of 90 candidates, totalling 270 fits
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed: 2.4min finished
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.05, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=200, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
0.824850299401
```
```python
# refit with the best gradient boosting settings
clf_grad = GradientBoostingClassifier(n_estimators=200, random_state=0,
                                      learning_rate=0.05)
clf_grad.fit(train_x, train_y)
acc_grad_train = cv_score(clf_grad).mean()
acc_grad_test = cv_score_test(clf_grad).mean()
print(acc_grad_train)
print(acc_grad_test)
```
```
0.818709926304
0.807500470544
```
```python
from sklearn.metrics import precision_score

# a simple stacking ensemble: the base models' predictions feed a logistic regression
class Ensemble(object):
    def __init__(self, estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        # fit each base model, then fit the meta-model on their predictions
        for i in self.estimators:
            i.fit(train_x, train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self, x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        return self.clf.predict(x)

    def score(self, x, y):
        # note: this reports precision, not accuracy
        s = precision_score(y, self.predict(x))
        return s

ensem = Ensemble([('Ada', ada), ('Bag', bag_tree),
                  ('SVM', clf_svc.best_estimator_),
                  ('LR', clf_LR), ('gbdt', clf_grad)])
score = 0
for i in range(0, 10):
    ensem.fit(train_x, train_y)
    sco = round(ensem.score(test_x, test_y) * 100, 2)
    score += sco
print(score / 10)
```
```
89.83
```
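VotingClassifier was imported above but never used; for comparison, a minimal sketch of the off-the-shelf alternative to the hand-rolled stacker, reusing the same fitted base models (an assumption, not part of the original post):

```python
# hard-voting ensemble over the same base models
voting = VotingClassifier(estimators=[('ada', ada),
                                      ('svm', clf_svc.best_estimator_),
                                      ('lr', clf_LR),
                                      ('gbdt', clf_grad)])
voting.fit(train_x, train_y)
print(voting.score(test_x, test_y))
```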
Submission

```python
pre = ensem.predict(test_data.drop('Survived', axis=1))  # drop the empty target column before predicting
submission = pd.DataFrame({'PassengerId': passID, 'Survived': pre})
```

Judging from the submitted result, the ensemble brings no clear improvement over the single models. Likely causes: the base models are strongly correlated, the training data is limited, and the one-hot encoding may introduce collinearity. And although the test and training scores are close, the leaderboard score dropped noticeably, probably because the data is small, training is limited, and the features are few and strongly correlated; bringing in more features would be worth trying.
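The post stops before writing the file; presumably the final step is something like the following (the filename is an assumption):

```python
# write the submission file for upload to Kaggle
submission.to_csv('submission.csv', index=False)
```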
Summary