
TF2.0 structured-data modeling workflow: a Titanic dataset example

Published: 2024/10/8


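Although TensorFlow is designed to be flexible enough for all kinds of complex numerical computation, people mostly use it to implement machine learning models, and neural network models in particular.

In principle you can define a neural network by building a computation graph from tensors and train it through automatic differentiation. For brevity, though, the high-level Keras API of TensorFlow is generally recommended for implementing neural network models.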

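The general workflow for implementing a neural network model with TensorFlow is:

1. Prepare the data
2. Define the model
3. Train the model
4. Evaluate the model
5. Use the model
6. Save the model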

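For beginners, the hardest part is actually preparing the data.

The data types most often met in practice are structured data, image data, and text data.

We will use the Titanic survival prediction problem, the cifar2 image classification problem, and the IMDB movie review classification problem as examples to demonstrate how to model these three kinds of data with TensorFlow.

This article uses the Titanic survival prediction problem to demonstrate how to model structured data with TensorFlow.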

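I. Prepare the data

The goal of the Titanic dataset is to predict, from passenger information, whether a passenger survived after the Titanic struck an iceberg and sank.

Structured data is usually preprocessed with a Pandas DataFrame.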

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import models, layers

dftrain_raw = pd.read_csv('./data/titanic/train.csv')
dftest_raw = pd.read_csv('./data/titanic/test.csv')
dftrain_raw.head(10)

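Field descriptions:

Survived: 0 means died, 1 means survived [the y label]
Pclass: ticket class, three values (1, 2, 3) [convert to one-hot encoding]
Name: passenger name [dropped]
Sex: passenger sex [convert to a boolean feature]
Age: passenger age (has missing values) [numeric feature; add "is age missing" as an auxiliary feature]
SibSp: number of siblings/spouses aboard (integer) [numeric feature]
Parch: number of parents/children aboard (integer) [numeric feature]
Ticket: ticket number (string) [dropped]
Fare: ticket price (float, roughly 0 to 500) [numeric feature]
Cabin: passenger cabin (has missing values) [add "is cabin missing" as an auxiliary feature]
Embarked: port of embarkation: S, C, Q (has missing values) [convert to one-hot encoding, four dimensions: S, C, Q, nan]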
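The two recurring tricks in the field list, a missing-value indicator and a one-hot encoding that keeps a separate column for missing values, can be tried on a toy table first. This is only a sketch: the `toy` rows are made up for illustration and are not taken from train.csv, and `dtype="int32"` is an assumption added here so that the resulting columns are plainly numeric.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real train.csv (illustration only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 4.0],
    "Embarked": ["S", np.nan, "C", "Q"],
})

features = pd.DataFrame()

# Numeric feature: fill missing ages with 0 and record where they were missing.
features["Age"] = toy["Age"].fillna(0)
features["Age_null"] = toy["Age"].isna().astype("int32")

# Categorical feature: one-hot encode, with an extra column for missing values.
embarked = pd.get_dummies(toy["Embarked"], dummy_na=True, dtype="int32")
embarked.columns = ["Embarked_" + str(c) for c in embarked.columns]
features = pd.concat([features, embarked], axis=1)

print(features)
```

The indicator column lets the model tell "age is really 0" apart from "age was unknown", which the filled value alone cannot express.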

With the plotting features built into Pandas we can do some simple exploratory data analysis (EDA).

Label distribution

%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw['Survived'].value_counts().plot(kind = 'bar',
     figsize = (12,8), fontsize = 15, rot = 0)
ax.set_ylabel('Counts', fontsize = 15)
ax.set_xlabel('Survived', fontsize = 15)
plt.show()

Age distribution

%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw['Age'].plot(kind = 'hist', bins = 20, color = 'purple',
                    figsize = (12,8), fontsize = 15)

ax.set_ylabel('Frequency', fontsize = 15)
ax.set_xlabel('Age', fontsize = 15)
plt.show()

Relationship between age and the label

%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw.query('Survived == 0')['Age'].plot(kind = 'density',
                      figsize = (12,8), fontsize = 15)
dftrain_raw.query('Survived == 1')['Age'].plot(kind = 'density',
                      figsize = (12,8), fontsize = 15)
ax.legend(['Survived==0', 'Survived==1'], fontsize = 12)
ax.set_ylabel('Density', fontsize = 15)
ax.set_xlabel('Age', fontsize = 15)
plt.show()

The actual data preprocessing follows.

def preprocessing(dfdata):

    dfresult = pd.DataFrame()

    # Pclass: one-hot encode the ticket class.
    # dtype='int32' keeps the dummy columns numeric (newer pandas defaults to bool).
    dfPclass = pd.get_dummies(dfdata['Pclass'], dtype = 'int32')
    dfPclass.columns = ['Pclass_' + str(x) for x in dfPclass.columns]
    dfresult = pd.concat([dfresult, dfPclass], axis = 1)

    # Sex: one column per value
    dfSex = pd.get_dummies(dfdata['Sex'], dtype = 'int32')
    dfresult = pd.concat([dfresult, dfSex], axis = 1)

    # Age: fill missing values with 0 and add a missing-value indicator
    dfresult['Age'] = dfdata['Age'].fillna(0)
    dfresult['Age_null'] = pd.isna(dfdata['Age']).astype('int32')

    # SibSp, Parch, Fare: used as numeric features
    dfresult['SibSp'] = dfdata['SibSp']
    dfresult['Parch'] = dfdata['Parch']
    dfresult['Fare'] = dfdata['Fare']

    # Cabin: keep only the missing-value indicator
    dfresult['Cabin_null'] = pd.isna(dfdata['Cabin']).astype('int32')

    # Embarked: one-hot encode, with an extra column for missing values
    dfEmbarked = pd.get_dummies(dfdata['Embarked'], dummy_na = True, dtype = 'int32')
    dfEmbarked.columns = ['Embarked_' + str(x) for x in dfEmbarked.columns]
    dfresult = pd.concat([dfresult, dfEmbarked], axis = 1)

    return dfresult

x_train = preprocessing(dftrain_raw)
y_train = dftrain_raw['Survived'].values

x_test = preprocessing(dftest_raw)
y_test = dftest_raw['Survived'].values

print("x_train.shape =", x_train.shape)
print("x_test.shape =", x_test.shape)

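II. Define the model

The Keras API offers three ways to build a model: stacking layers in order with Sequential, building a model of arbitrary structure with the functional API, or subclassing the Model base class to build a custom model.

Here we choose the simplest option, the layer-by-layer Sequential model.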

tf.keras.backend.clear_session()

model = models.Sequential()
model.add(layers.Dense(20, activation = 'relu', input_shape = (15,)))
model.add(layers.Dense(10, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.summary()
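For comparison, the same 15-20-10-1 network can also be written with the functional API, the second of the three construction styles mentioned above. This is a minimal sketch rather than part of the original walkthrough; the variable names are chosen here for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Functional API: wire layers together by calling them on tensors.
inputs = tf.keras.Input(shape=(15,))                 # 15 preprocessed features
x = layers.Dense(20, activation="relu")(inputs)
x = layers.Dense(10, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # survival probability

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()
```

For a plain stack of layers the two styles are equivalent; the functional form mainly pays off once a model needs multiple inputs, multiple outputs, or shared layers.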

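III. Train the model

There are usually three ways to train a model: the built-in fit method, the built-in train_on_batch method, and a custom training loop. Here we choose the built-in fit method, which is the most common and the simplest.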

# Use binary cross-entropy as the loss for a binary classification problem
model.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['AUC'])

history = model.fit(x_train, y_train,
                    batch_size = 64,
                    epochs = 30,
                    validation_split = 0.2  # hold out part of the training data for validation
                   )
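To make the binary cross-entropy loss used in the compile step less of a black box, the sketch below computes it by hand in plain Python. The function name, the clipping constant `eps`, and the example labels and probabilities are all illustrative assumptions, not values from the Titanic run.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]

confident = binary_crossentropy(labels, [0.9, 0.1, 0.6, 0.4])  # mostly right
wrong = binary_crossentropy(labels, [0.1, 0.9, 0.4, 0.6])      # mostly wrong

print("good predictions:", confident)
print("bad predictions: ", wrong)
```

The loss grows quickly as a prediction moves confidently toward the wrong label, which is what pushes the sigmoid output toward calibrated probabilities during training.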

IV. Evaluate the model

We first evaluate how the model performs on the training set and the validation set.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib.pyplot as plt

def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_' + metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation ' + metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_" + metric, 'val_' + metric])
    plt.show()
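plot_metric looks the metric up by its exact key in history.history. Depending on the TensorFlow version, a metric requested as 'AUC' may be recorded under a differently cased key such as 'auc', so it can help to resolve the key before plotting. The helper below is a hypothetical convenience, not part of Keras, and `fake_history` is made-up data that only mimics the shape of history.history.

```python
def find_metric_key(history_dict, metric):
    """Return the key of history_dict that matches metric, ignoring case."""
    for key in history_dict:
        if key.lower() == metric.lower():
            return key
    raise KeyError(metric)

# Made-up stand-in for history.history (illustration only).
fake_history = {
    "loss": [0.70, 0.62],
    "auc": [0.61, 0.72],
    "val_loss": [0.69, 0.64],
    "val_auc": [0.58, 0.69],
}

key = find_metric_key(fake_history, "AUC")
print(key)
```

With a real training run this would be used as plot_metric(history, find_metric_key(history.history, "AUC")), and likewise with "loss".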

Next, let's look at how the model performs on the test set.
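Evaluation on held-out data is typically done with the evaluate method. The sketch below is self-contained, so it uses random arrays in place of x_test and y_test and a freshly built model; the numbers it prints say nothing about the Titanic data. The names `x_demo` and `y_demo` are assumptions made for this example.

```python
import numpy as np
import tensorflow as tf

# Random stand-ins for x_test / y_test: 15 features per row, binary labels.
rng = np.random.default_rng(0)
x_demo = rng.random((32, 15)).astype("float32")
y_demo = rng.integers(0, 2, size=(32,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(15,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.fit(x_demo, y_demo, batch_size=16, epochs=1, verbose=0)

# evaluate returns the loss followed by each compiled metric, in order.
loss, auc = model.evaluate(x_demo, y_demo, verbose=0)
print("loss =", loss, "auc =", auc)
```

In the walkthrough itself the call would be model.evaluate(x = x_test, y = y_test) on the model trained above.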

V. Use the model
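Using the model means calling predict on preprocessed rows. The sigmoid output is a probability, so a class label needs a threshold. The sketch below uses an untrained model and random inputs purely to show the shapes involved; the 0.5 threshold and the name `x_demo` are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(15,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Random stand-in for a few preprocessed passenger rows.
x_demo = np.random.default_rng(0).random((5, 15)).astype("float32")

# One probability per row, shape (5, 1).
probs = model.predict(x_demo, verbose=0)

# Turn probabilities into 0/1 class labels with a 0.5 threshold.
classes = (probs > 0.5).astype("int32")

print(probs.ravel())
print(classes.ravel())
```

In the walkthrough itself the input would be a slice of the preprocessed test set, for example x_test[0:10].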

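VI. Save the model

You can save the model the Keras way or the native TensorFlow way. The former is only suitable for restoring the model in a Python environment, while the latter allows the model to be deployed across platforms.

The latter is the recommended way to save.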

1. Saving the Keras way
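A minimal sketch of a Keras-style save and restore round trip, assuming the HDF5 (.h5) single-file format; newer Keras releases treat this format as legacy and suggest a .keras file instead. The temporary directory and the file name are chosen here for illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(15,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

x_demo = np.random.default_rng(0).random((4, 15)).astype("float32")
before = model.predict(x_demo, verbose=0)

# Save architecture and weights into a single file (hypothetical file name).
save_path = os.path.join(tempfile.mkdtemp(), "keras_model_demo.h5")
model.save(save_path)

# Restore the model and check that it gives the same predictions.
restored = tf.keras.models.load_model(save_path, compile=False)
after = restored.predict(x_demo, verbose=0)

print("predictions match:", np.allclose(before, after))
```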

2. Saving the native TensorFlow way
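The native TensorFlow format is a SavedModel directory, which serving tools outside Python can load. How to write one depends on the installed version, so the sketch below is hedged accordingly: it assumes that recent releases provide model.export, and falls back to model.save with save_format="tf" on older TF 2.x releases. The directory name is hypothetical.

```python
import os
import tempfile

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(15,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

export_dir = os.path.join(tempfile.mkdtemp(), "tf_model_savedmodel_demo")

if hasattr(model, "export"):
    # Recent Keras versions: write an inference-only SavedModel.
    model.export(export_dir)
else:
    # Older TF 2.x versions: save in the SavedModel format.
    model.save(export_dir, save_format="tf")

# A SavedModel is a directory containing saved_model.pb plus a variables folder.
print(sorted(os.listdir(export_dir)))

# The directory can be loaded back without the original Python model code.
reloaded = tf.saved_model.load(export_dir)
```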
