TensorFlow 2.0 - tf.data.Dataset Data Preprocessing: Cat vs. Dog Classification
Table of Contents
- 1. tf.data.Dataset.from_tensor_slices(): creating a dataset
- 2. Dataset.map(f): dataset preprocessing
- 3. Dataset.prefetch(): parallel processing
- 4. Getting data with a for loop
- 5. Example: cat vs. dog classification
Based on: 简单粗暴 TensorFlow 2 (A Concise Handbook of TensorFlow 2)
1. tf.data.Dataset.from_tensor_slices(): Creating a Dataset

tf.data.Dataset.from_tensor_slices() builds a dataset from in-memory tensors, slicing them along the first (sample) dimension.
```python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

(train_data, train_label), (_, _) = tf.keras.datasets.mnist.load_data()
# normalize to [0, 1] and add a channel dimension: [60000, 28, 28, 1]
train_data = np.expand_dims(train_data.astype(np.float32) / 255., axis=-1)
mnistdata = tf.data.Dataset.from_tensor_slices((train_data, train_label))

for img, label in mnistdata:
    plt.title(label.numpy())
    plt.imshow(img.numpy()[:, :, 0])  # drop the channel dimension for imshow
    plt.show()
```

2. Dataset.map(f): Dataset Preprocessing
- Dataset.map(f): apply the transformation f to every element
- Dataset.batch(batch_size): group elements into batches
- Dataset.shuffle(buffer_size): randomly shuffle elements through a buffer

With buffer_size = 1 there is no shuffling effect at all.
If the dataset is already fairly random, buffer_size can be small; otherwise, set it larger.
When I ran the cat vs. dog example, a large shuffle buffer triggered an out-of-memory error, so I recommend shuffling the data in advance instead (see the sketch below; the full example later shuffles the file list before building the dataset).
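A minimal sketch chaining the three calls, continuing from the MNIST dataset built above (the rot90 map function here is just an illustrative transformation, not part of the original post):

```python
def rot90(image, label):
    # example map function: rotate each image by 90 degrees
    return tf.image.rot90(image), label

mnistdata = mnistdata.map(rot90)      # apply the transformation element-wise
mnistdata = mnistdata.shuffle(10000)  # shuffle through a 10000-element buffer
mnistdata = mnistdata.batch(32)       # group into batches of 32
```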
3. Dataset.prefetch(): Parallel Processing
- Dataset.prefetch() preloads upcoming elements, so the CPU can prepare the next batch while the GPU trains on the current one
- num_parallel_calls (an argument of Dataset.map()) spreads the preprocessing across multiple CPU cores
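A minimal sketch of both, reusing the rot90 function from the previous sketch and letting TensorFlow tune the parallelism and buffer size automatically (AUTOTUNE is the same constant the full example below uses):

```python
# run the map function on multiple cores and prefetch batches ahead of training
mnistdata = mnistdata.map(rot90, num_parallel_calls=tf.data.experimental.AUTOTUNE)
mnistdata = mnistdata.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
```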
4. Getting Data with a for Loop
```python
# Option 1: iterate with a for loop
dataset = tf.data.Dataset.from_tensor_slices((A, B, C, ...))
for a, b, c, ... in dataset:
    # operate on the tensors a, b, c, ..., e.g. feed them to a model for training
    ...

# Option 2: create an iterator and fetch elements with next()
dataset = tf.data.Dataset.from_tensor_slices((A, B, C, ...))
it = iter(dataset)
a_0, b_0, c_0, ... = next(it)
a_1, b_1, c_1, ... = next(it)
```

5. Example: Cat vs. Dog Classification
Project and data: https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/overview
The train folder contains 25,000 images of dogs and cats. Each image in this folder has the label as part of the filename. The test folder contains 12,500 images, named according to a numeric id.
For each image in the test set, you should predict a probability that the image is a dog (1 = dog, 0 = cat).
```python
# ---------cat vs dog-------------
# https://michael.blog.csdn.net/
import tensorflow as tf
import pandas as pd
import numpy as np
import random
import os

num_epochs = 10
batch_size = 32
learning_rate = 1e-4
train_data_dir = "./dogs-vs-cats/train/"
test_data_dir = "./dogs-vs-cats/test/"

# Preprocessing: read a file, decode the JPEG, resize to 256x256, scale to [0, 1]
def _decode_and_resize(filename, label=None):
    img_string = tf.io.read_file(filename)
    img_decoded = tf.image.decode_jpeg(img_string)
    img_resized = tf.image.resize(img_decoded, [256, 256]) / 255.
    if label is None:
        return img_resized
    return img_resized, label

# Build a batched, prefetching dataset with tf.data.Dataset
def processData(train_filenames, train_labels):
    train_dataset = tf.data.Dataset.from_tensor_slices((train_filenames, train_labels))
    train_dataset = train_dataset.map(map_func=_decode_and_resize)
    # train_dataset = train_dataset.shuffle(buffer_size=25000)  # too memory-hungry, not used
    train_dataset = train_dataset.batch(batch_size)
    train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return train_dataset

if __name__ == "__main__":
    # Training file paths; filenames start with 'cat' or 'dog', which gives the label
    filenames = os.listdir(train_data_dir)
    file_dir = [train_data_dir + filename for filename in filenames]
    labels = [0 if filename[0] == 'c' else 1 for filename in filenames]

    # Zip paths and labels together and shuffle them up front
    f_l = list(zip(file_dir, labels))
    random.shuffle(f_l)
    file_dir, labels = zip(*f_l)

    # Split into training and validation sets
    valid_ratio = 0.1
    idx = int((1 - valid_ratio) * len(file_dir))
    train_files, valid_files = file_dir[:idx], file_dir[idx:]
    train_labels, valid_labels = labels[:idx], labels[idx:]

    # Build the datasets with tf.data.Dataset
    train_filenames, valid_filenames = tf.constant(train_files), tf.constant(valid_files)
    train_labels, valid_labels = tf.constant(train_labels), tf.constant(valid_labels)
    train_dataset = processData(train_filenames, train_labels)
    valid_dataset = processData(valid_filenames, valid_labels)

    # Model
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(256, 256, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Conv2D(64, 5, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Conv2D(128, 5, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(2, activation='softmax')
    ])

    # Compile
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=[tf.keras.metrics.sparse_categorical_accuracy])

    # Train
    model.fit(train_dataset, epochs=num_epochs, validation_data=valid_dataset)

    # Predict on the test set
    test_filenames = tf.constant([test_data_dir + filename
                                  for filename in os.listdir(test_data_dir)])
    test_data = tf.data.Dataset.from_tensor_slices(test_filenames)
    test_data = test_data.map(map_func=_decode_and_resize)
    test_data = test_data.batch(batch_size)
    ans = model.predict(test_data)  # shape [12500, 2]
    prob = ans[:, 1]  # probability of dog

    # Write the submission file
    ids = list(range(1, 12501))
    output = pd.DataFrame({'id': ids, 'label': prob})
    output.to_csv("submission.csv", index=False)
```

Submission score:
Top leaderboard score (someone else's):

- Switched the model to MobileNetV2 + FC and trained for 2 epochs (a sketch follows below)

Results:
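A hedged sketch of that variant, assuming the Keras-bundled MobileNetV2 as the base; the original post does not show this code, so the pooling and head layers here are my assumptions:

```python
# Assumed MobileNetV2 + FC variant (not the original author's exact code)
base = tf.keras.applications.MobileNetV2(
    input_shape=(256, 256, 3), include_top=False, weights='imagenet')
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation='softmax')  # the FC classification head
])
# Note: MobileNetV2's ImageNet weights expect inputs scaled to [-1, 1]
# (tf.keras.applications.mobilenet_v2.preprocess_input), while the pipeline
# above scales to [0, 1]; adjust the normalization accordingly.
```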
```
704/704 [==============================] - 179s 254ms/step - loss: 0.0741 - sparse_categorical_accuracy: 0.9737 - val_loss: 0.1609 - val_sparse_categorical_accuracy: 0.9744
704/704 [==============================] - 167s 237ms/step - loss: 0.0128 - sparse_categorical_accuracy: 0.9955 - val_loss: 0.0724 - val_sparse_categorical_accuracy: 0.9848
```

The accuracy (about 99% on training, 98% on validation) is much higher than the first model's (roughly 92% on training, 80% on validation).
Surprisingly, the loss at evaluation time is larger than the one above. How can that be explained? The second approach does not seem to overfit either, since training and validation accuracy are nearly the same. (One possible explanation: cross-entropy loss punishes confident wrong predictions heavily, so loss and accuracy do not have to move together.)