
Deep Learning: TensorFlow 2 Basics

Published: 2025/4/5


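Contents

1. Using tensorflow_datasets
- 1.1 Loading a dataset
- 1.2 Inspecting a few samples
- 1.3 Normalizing the samples
- 1.4 Shuffling and batching the samples
- 1.5 Inspecting the final training samples
2. Using an existing CSV file as a dataset
- 2.2 Standardizing the data
- 2.3 Splitting into training and test sets
- 2.4 Separating features and labels
- 2.5 Slicing into a Dataset
3. Using tf.keras.datasets
- 3.1 Importing the dataset
- 3.2 Normalizing the features
- 3.3 Slicing
4. Dataset
- 4.1 Converting a DataFrame to a Dataset
- 4.2 Converting an array to a Dataset
- 4.3 Loading data from CSV files into a Dataset
- 4.4 Creating the Datasets
5. Images
- 5.1 Importing
- 5.2 Converting the images to the required type
- 5.3 Removing grayscale images from the dataset
- 5.4 Adding batch and shuffle
6. Downloading a dataset from the official site with wget.download
- 6.1 Downloading the dataset from the official site
- 6.2 Extracting the archive
- 6.3 Reading the image files
7. Text
- 7.2 Locating the text directory
  - 7.2.1 Downloading the dataset
  - 7.2.2 Finding the directory path
- 7.3 Building the dataset
  - 7.3.1 Building a separate dataset for each class
  - 7.3.2 Merging the three datasets into one
  - 7.3.3 Shuffling the dataset
- 7.4 Encoding the text as numbers
  - 7.4.1 Building the vocabulary
  - 7.4.2 Building the encoder
  - 7.4.3 Encoding every sample
- 7.5 Splitting the dataset into test and training sets
8. Converting labels to numbers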
1. Using tensorflow_datasets

tensorflow_datasets is a very useful library that bundles many ready-made datasets. Running:

tfds.list_builders()

lists every dataset it provides.

1.1 Loading a dataset

import tensorflow_datasets as tfds

(raw_train, raw_validation, raw_test), metadata = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    shuffle_files=False,
    batch_size=None,
    with_info=True,
    as_supervised=True,
)

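Parameters:

Inputs:

name: the dataset name, one of the values returned by tfds.list_builders().
split: which split(s) to load. Here the single train split is cut 80%/10%/10% into training, validation and test sets. If split is omitted, tfds.load returns a dictionary of all splits; cats_vs_dogs only provides train, which holds every sample.
shuffle_files: whether to shuffle the input files.
batch_size: if set, samples are returned in batches of this size; if None, one sample is returned at a time.
with_info: whether to also return the dataset information.
as_supervised: if True, each element is a 2-tuple (input, label) instead of a FeaturesDict.

Outputs:

(raw_train, raw_validation, raw_test): the datasets produced by split.
metadata: the dataset information.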
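
To make the split strings concrete, here is a minimal pure-Python sketch of how the percent boundaries in train[:80%], train[80%:90%] and train[90%:] carve one split into three non-overlapping index ranges. percent_slices is a hypothetical helper written for this illustration, and its floor rounding is an assumption; tensorflow_datasets may round the boundaries differently.

```python
def percent_slices(n_examples, boundaries=(0, 80, 90, 100)):
    """Return (start, stop) index pairs for consecutive percent ranges."""
    # Convert each percent boundary to an absolute index (floor rounding assumed).
    cuts = [n_examples * p // 100 for p in boundaries]
    return list(zip(cuts[:-1], cuts[1:]))


train_range, validation_range, test_range = percent_slices(1000)
print(train_range, validation_range, test_range)
```

Because each range starts where the previous one stops, no sample lands in two subsets and none is lost.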

1.2 Inspecting a few samples

for image, label in raw_train.take(2):
    print(image.shape, label)

"""
Output:
(262, 350, 3) tf.Tensor(0, shape=(), dtype=int64)
(428, 500, 3) tf.Tensor(1, shape=(), dtype=int64)
"""

Getting the class name that a label stands for:

get_label_name = metadata.features['label'].int2str

for image, label in raw_train.take(2):
    print(image.shape)
    print(label)
    print(get_label_name(label))

'''
Output:
(262, 350, 3)
tf.Tensor(0, shape=(), dtype=int64)
cat
(428, 500, 3)
tf.Tensor(1, shape=(), dtype=int64)
dog
'''

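1.3 Normalizing the samples

import tensorflow as tf

IMG_SIZE = 160  # All images will be resized to 160x160

def format_example(image, label):
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1  # scale pixel values from [0, 255] to [-1, 1]
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label

train = raw_train.map(format_example)
validation = raw_validation.map(format_example)
test = raw_test.map(format_example)

The same preprocessing could also be written as an explicit loop:

for image, label in raw_train:
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))

but that is very slow: every image is processed eagerly, one at a time, and the results would still have to be collected into a dataset by hand.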
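
The expression (image/127.5) - 1 used above is plain arithmetic on pixel values. As a quick check, this sketch applies it to single numbers; scale_pixel is a hypothetical name used only here.

```python
def scale_pixel(value):
    """Map an 8-bit pixel value from [0, 255] to [-1, 1]."""
    return value / 127.5 - 1.0


for value in (0, 127.5, 255):
    print(value, "->", scale_pixel(value))
```

The darkest pixel maps to -1, mid-gray to 0 and the brightest pixel to 1, so the inputs are centered around zero.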

1.4 Shuffling and batching the samples

If the dataset was not shuffled or batched when it was loaded, both can be done afterwards.

BATCH_SIZE = 32
SHUFFLE_BUFFER_SIZE = 1000

train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
validation_batches = validation.batch(BATCH_SIZE)
test_batches = test.batch(BATCH_SIZE)
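
The buffer size passed to shuffle controls how far a sample can move from its original position. The following is a simplified pure-Python model of buffer-based shuffling, written for illustration only; buffered_shuffle is a hypothetical function and not TensorFlow's implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a random order drawn from a sliding buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # drain what is left at the end
    yield from buffer


print(list(buffered_shuffle(range(10), buffer_size=3)))
print(list(buffered_shuffle(range(10), buffer_size=1)))
```

With a buffer of 1 the order does not change at all, which is why a buffer at least as large as the dataset is needed for a full shuffle.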

1.5 Inspecting the final training samples

for image_batch, label_batch in train_batches.take(1):
    print(image_batch.shape)
    print(label_batch.shape)

'''
Output:
(32, 160, 160, 3)
(32,)
'''

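2. Using an existing CSV file as a dataset

2.2 Standardizing the data

# dataset_ holds the CSV contents (its loading step is not shown here);
# the slicing in 2.4 treats it as a 2-D NumPy array.
data_mean = dataset_.mean(axis=0)
data_std = dataset_.std(axis=0)

dataset_ = (dataset_ - data_mean) / data_std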
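
The standardization above subtracts each column's mean and divides by its standard deviation. Below is a small pure-Python sketch of the same per-column computation on made-up numbers. It assumes dataset_ is a NumPy array, whose std defaults to the population standard deviation, hence pstdev; standardize_columns and the sample rows are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Give every column zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # illustrative data only
scaled = standardize_columns(rows)
for row in scaled:
    print(row)
```

After scaling, both columns have the same spread even though the second one was originally ten times larger.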

2.3 Splitting into training and test sets

This dataset does not come with a predefined training/test split, so the sklearn library is used to split it here.

from sklearn.model_selection import train_test_split

train, test = train_test_split(dataset_, test_size=0.1)

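2.4 Separating features and labels

import numpy as np

# The last column is the label; all other columns are features.
train_x = train[:, :-1].astype(np.float32)
train_y = train[:, -1].astype(np.float32)
test_x = test[:, :-1].astype(np.float32)
test_y = test[:, -1].astype(np.float32)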

2.5 Slicing into a Dataset

dataset_train = (
    tf.data.Dataset.from_tensor_slices((train_x, train_y))
    .shuffle(train_y.shape[0])
    .batch(32)
)
dataset_test = (
    tf.data.Dataset.from_tensor_slices((test_x, test_y))
    .shuffle(test_y.shape[0])
    .batch(32)
)

These datasets can now be fed to a model for training.
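
Conceptually, from_tensor_slices pairs the i-th feature row with the i-th label, and batch then groups consecutive pairs. A rough pure-Python sketch of that behavior follows; slice_and_batch and its data are hypothetical, and shuffling is left out.

```python
def slice_and_batch(features, labels, batch_size):
    """Pair features with labels row by row, then group the pairs into batches."""
    pairs = list(zip(features, labels))
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
labels = [0, 1, 0, 1, 1]
batches = slice_and_batch(features, labels, batch_size=2)
print([len(batch) for batch in batches])
```

The last batch is smaller when the sample count is not a multiple of the batch size, which is also what batch does by default.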

3. Using tf.keras.datasets

3.1 Importing the dataset

(x, y), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

3.2 Normalizing the features

The features here are images, so dividing by 255 is enough to scale them to [0, 1].

def preprocess(x, y):
    x = tf.cast(x, dtype=tf.float32) / 255.0
    y = tf.cast(y, dtype=tf.int32)
    return x, y

3.3 Slicing

batchsz = 128

db = tf.data.Dataset.from_tensor_slices((x, y))
db = db.map(preprocess).shuffle(10000).batch(batchsz)

db_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))
db_test = db_test.map(preprocess).batch(batchsz)

These datasets can now be fed to a model for training.

4. Dataset

4.1 Converting a DataFrame to a Dataset

# 'target' is the label column: pop removes it from the DataFrame and returns it as labels.
labels = dataframe.pop('target')
ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
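
Passing dict(dataframe) means every dataset element is a dictionary with one value per column, paired with its label. The pure-Python sketch below mimics that slicing on plain lists. slice_columns is a hypothetical helper, and the sex values are made up for illustration.

```python
def slice_columns(columns, labels):
    """Yield ({column_name: value}, label) pairs, one per row."""
    for i, label in enumerate(labels):
        yield {name: values[i] for name, values in columns.items()}, label


columns = {'age': [61, 59, 58], 'sex': [1, 0, 1]}  # illustrative values
labels = [1, 1, 0]
rows = list(slice_columns(columns, labels))
for features, label in rows:
    print(features, label)
```

This is why code that consumes such a dataset accesses features by column name, for example features['age'].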

4.2 Converting an array to a Dataset

# Build a data pipeline from NumPy arrays
import tensorflow as tf
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

ds1 = tf.data.Dataset.from_tensor_slices((iris["data"], iris["target"]))
for features, label in ds1.take(5):
    print(features, label)

'''
Output:
tf.Tensor([5.1 3.5 1.4 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor([4.9 3.  1.4 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor([4.7 3.2 1.3 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor([4.6 3.1 1.5 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor([5.  3.6 1.4 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
'''

4.3 Loading data from CSV files into a Dataset

ds4 = tf.data.experimental.make_csv_dataset(
    file_pattern=["../A.csv", "../B.csv"],
    batch_size=3,
    label_name="Survived",
    na_value="",
    num_epochs=1,
    ignore_errors=True,
)

for data, label in ds4.take(2):
    print(data, label)

4.4 Creating the Datasets

# df_to_dataset is a helper that is not shown in this article; judging by its arguments,
# it presumably wraps the conversion from 4.1 together with shuffle and batch.
batch_size = 5  # a small batch size, used for demonstration
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

The features of every batch are returned as a dictionary keyed by column name.
The dataset contents can be inspected as follows:

for feature_batch, label_batch in train_ds.take(1):
    print('Every feature:', list(feature_batch.keys()))
    print('A batch of ages:', feature_batch['age'])
    print('A batch of targets:', label_batch)

'''
Output:
Every feature: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
A batch of ages: tf.Tensor([61 59 58 42 40], shape=(5,), dtype=int32)
A batch of targets: tf.Tensor([1 1 0 1 0], shape=(5,), dtype=int32)
'''

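5. Images

We use the horse2zebra dataset as an example. It contains four folders: the horse training set, the zebra training set, the horse test set and the zebra test set. Each training set holds more than 1,000 color images of shape (256, 256, 3), mixed with a few grayscale images that are deleted later in the code.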

5.1 Importing

# PATH must end with a path separator because it is concatenated with 'trainA/*.jpg' below.
PATH = 'C:\\Users\\kzb\\'

train_horses = tf.data.Dataset.list_files(PATH + 'trainA/*.jpg')
train_zebras = tf.data.Dataset.list_files(PATH + 'trainB/*.jpg')
test_horses = tf.data.Dataset.list_files(PATH + 'testA/*.jpg')
test_zebras = tf.data.Dataset.list_files(PATH + 'testB/*.jpg')
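
The glob patterns above are built with PATH + 'trainA/*.jpg', so the result is only a valid path if PATH ends with a separator. The sketch below contrasts plain concatenation with ntpath.join, the standard-library join that follows Windows rules on any platform; the variable names are chosen for this illustration.

```python
import ntpath  # Windows path rules, usable on any platform

base = 'C:\\Users\\kzb'  # no trailing separator

concatenated = base + 'trainA/*.jpg'
joined = ntpath.join(base, 'trainA', '*.jpg')

print(concatenated)
print(joined)
```

Concatenation silently glues the folder name onto kzb, while the join inserts the missing separator, so joining path components is the safer habit.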

At this point each dataset holds the file paths as strings.

5.2 Converting the images to the required type

def load(image_file):
    image = tf.io.read_file(image_file)
    image = tf.image.decode_jpeg(image)
    image = tf.cast(image, tf.float32)
    return image

Print out one image to check:

import matplotlib.pyplot as plt

img = load(PATH + 'trainB/n02391049_2.jpg')
# scale to [0, 1] so that matplotlib can display the float image
plt.figure()
plt.imshow(img / 255.0)

5.3 Removing grayscale images from the dataset

import os

for dirname, _, filenames in os.walk(PATH + 'trainB'):
    for filename in filenames:
        img = load(os.path.join(dirname, filename))
        # Anything that is not a 256x256 three-channel image is deleted from disk.
        if img.shape != (256, 256, 3):
            print(filename)
            print(img.shape)
            os.remove(os.path.join(dirname, filename))

5.4 Adding batch and shuffle

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_horses = train_horses.map(load, num_parallel_calls=AUTOTUNE).cache().shuffle(1000).batch(1)
train_zebras = train_zebras.map(load, num_parallel_calls=AUTOTUNE).cache().shuffle(1000).batch(1)
test_horses = test_horses.map(load, num_parallel_calls=AUTOTUNE).cache().shuffle(1000).batch(1)
test_zebras = test_zebras.map(load, num_parallel_calls=AUTOTUNE).cache().shuffle(1000).batch(1)

These datasets can now be fed to a model for training.

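6. Downloading a dataset from the official site with wget.download

The hot dog dataset is used as the example.

6.1 Downloading the dataset from the official site

import os
import wget

data = os.getcwd() + '/data'
base_url = 'https://apache-mxnet.s3-accelerate.amazonaws.com/'
wget.download(base_url + 'gluon/dataset/hotdog.zip', data)

Here os.getcwd() returns the current working directory of the Python process, which is the folder containing the .py file only when the script is started from there. wget.download(url, out) downloads the archive at url to out, which may be an existing directory or a target file name. Since the code in 6.2 opens 'data' as a zip file, out is used as a file name here.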
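
Because os.getcwd() depends on where the process is running rather than on where the script is stored, its value changes with os.chdir. The following standard-library sketch demonstrates this using a temporary directory; the variable names are illustrative.

```python
import os
import tempfile

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)
    inside_dir = os.getcwd()          # now reports the temporary directory
    tmp_real = os.path.realpath(tmp)  # resolve symlinks for a fair comparison
    os.chdir(start_dir)               # restore before the directory is deleted

print(os.path.realpath(inside_dir) == tmp_real)
print(os.getcwd() == start_dir)
```

If the download location must not depend on how the script is launched, building the path from an explicit base directory is the more predictable choice.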

6.2 Extracting the archive

import zipfile

with zipfile.ZipFile('data', 'r') as z:
    z.extractall(os.getcwd())
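
To see what extractall produces without downloading anything, the sketch below builds a tiny archive in a temporary directory and extracts it again. The archive layout and the folder names are assumptions made for this illustration, not the contents of the real hot dog archive.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, 'demo.zip')
    extract_dir = os.path.join(tmp, 'extracted')

    # Build a tiny archive that mimics a <dataset>/<split>/<class>/ layout.
    with zipfile.ZipFile(archive_path, 'w') as z:
        z.writestr('demo/train/hotdog/0.txt', 'placeholder')
        z.writestr('demo/train/not-hotdog/0.txt', 'placeholder')

    # Extract it and list the class folders that were created.
    with zipfile.ZipFile(archive_path, 'r') as z:
        member_names = z.namelist()
        z.extractall(extract_dir)

    extracted_classes = sorted(os.listdir(os.path.join(extract_dir, 'demo', 'train')))

print(member_names)
print(extracted_classes)
```

extractall recreates the folder structure stored in the archive, which is what later directory-based loaders rely on.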

6.3 Reading the image files

Use tf.keras.preprocessing.image.ImageDataGenerator to read all image files in the training and test sets, calling flow_from_directory once for each. The images are resized to 224 pixels in both height and width, and the values of the three RGB (red, green, blue) color channels are rescaled to [0, 1].

import pathlib
import numpy as np
import tensorflow as tf

train_dir = 'hotdog/train'
test_dir = 'hotdog/test'
train_dir = pathlib.Path(train_dir)
train_count = len(list(train_dir.glob('*/*.jpg')))
test_dir = pathlib.Path(test_dir)
test_count = len(list(test_dir.glob('*/*.jpg')))

# Class names are the sub-folder names, skipping the license file and hidden entries.
CLASS_NAMES = np.array([item.name for item in train_dir.glob('*')
                        if item.name != 'LICENSE.txt' and item.name[0] != '.'])
CLASS_NAMES

image_generator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)  # rescale to [0, 1]
BATCH_SIZE = 32
IMG_HEIGHT = 224
IMG_WIDTH = 224

train_data_gen = image_generator.flow_from_directory(
    directory=str(train_dir),
    batch_size=BATCH_SIZE,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    shuffle=True,
    classes=list(CLASS_NAMES),
)

test_data_gen = image_generator.flow_from_directory(
    directory=str(test_dir),
    batch_size=BATCH_SIZE,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    shuffle=True,
    classes=list(CLASS_NAMES),
)
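
The class names and image counts above come purely from the folder layout. This standard-library sketch rebuilds that logic on a throwaway directory tree; the folder names class_a and class_b and the file counts are made up for the example.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    train_dir = pathlib.Path(tmp) / 'train'

    # Create two class folders with empty placeholder .jpg files.
    for class_name, n_images in (('class_a', 3), ('class_b', 2)):
        class_dir = train_dir / class_name
        class_dir.mkdir(parents=True)
        for i in range(n_images):
            (class_dir / f'{i}.jpg').touch()
    (train_dir / 'LICENSE.txt').touch()

    # Same filtering idea as in the text: skip the license file and hidden entries.
    class_names = sorted(
        item.name for item in train_dir.glob('*')
        if item.name != 'LICENSE.txt' and not item.name.startswith('.')
    )
    image_count = len(list(train_dir.glob('*/*.jpg')))

print(class_names, image_count)
```

Each sub-folder becomes one class, so the directory names double as labels.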

7. Text

Use tf.data.TextLineDataset to load text files. TextLineDataset is typically used to build a dataset from a text file in which each line is one sample. This suits most line-based text data, such as poetry, novels or error logs.

7.2 Locating the text directory

7.2.1 Downloading the dataset

This step can be skipped when working with your own dataset.

Here we use three different English translations of the same work (Homer's Iliad) as an example. Each line of text is a sample, and its translator is the label.

import tensorflow as tf

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL + name)

7.2.2 Finding the directory path

import os

parent_dir = os.path.dirname(text_dir)
parent_dir

7.3 Building the dataset

7.3.1 Building a separate dataset for each class

def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

tf.data.TextLineDataset(): takes a file path and returns a dataset in which every element corresponds to one line of the file.
For example:

a = tf.data.TextLineDataset(os.path.join(parent_dir, 'cowper.txt'))
for each in a.take(5):
    print(each)

'''
Output:
tf.Tensor(b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;", shape=(), dtype=string)
tf.Tensor(b'His wrath pernicious, who ten thousand woes', shape=(), dtype=string)
tf.Tensor(b"Caused to Achaia's host, sent many a soul", shape=(), dtype=string)
tf.Tensor(b'Illustrious into Ades premature,', shape=(), dtype=string)
tf.Tensor(b'And Heroes gave (so stood the will of Jove)', shape=(), dtype=string)
'''

We then map the dataset through the labeler function to attach a label to every element:

b = a.map(lambda ex: labeler(ex, 0))
for each in b.take(5):
    print(each)

'''
Output:
(<tf.Tensor: id=95344, shape=(), dtype=string, numpy=b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;">, <tf.Tensor: id=95345, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=95346, shape=(), dtype=string, numpy=b'His wrath pernicious, who ten thousand woes'>, <tf.Tensor: id=95347, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=95348, shape=(), dtype=string, numpy=b"Caused to Achaia's host, sent many a soul">, <tf.Tensor: id=95349, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=95350, shape=(), dtype=string, numpy=b'Illustrious into Ades premature,'>, <tf.Tensor: id=95351, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=95352, shape=(), dtype=string, numpy=b'And Heroes gave (so stood the will of Jove)'>, <tf.Tensor: id=95353, shape=(), dtype=int64, numpy=0>)
'''

7.3.2 Merging the three datasets into one

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)
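
Labeling each file with its index and then concatenating amounts to building one long list of (line, label) pairs. A pure-Python sketch of that idea follows, using a few lines that appear in the outputs of this article; FILE_LINES is a hypothetical stand-in for the three text files.

```python
# One inner list per file, in the order cowper, derby, butler.
FILE_LINES = [
    ["Achilles sing, O Goddess! Peleus' son;",
     "His wrath pernicious, who ten thousand woes"],
    ['"Friends, Grecian Heroes, Ministers of Mars,'],
    ["strode onward. The Argives were elated as they beheld him, but the"],
]

all_labeled = []
for index, lines in enumerate(FILE_LINES):
    # Attach the file index as the label, then append to the combined list.
    all_labeled.extend((line, index) for line in lines)

print([label for _, label in all_labeled])
```

Before shuffling, all samples of one class sit next to each other, which is why the shuffle step that follows matters.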

7.3.3 Shuffling the dataset

BUFFER_SIZE = 50000

all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)

We can print the first 5 elements of the dataset:

for ex in all_labeled_data.take(5):
    print(ex)

'''
Output:
(<tf.Tensor: id=95461, shape=(), dtype=string, numpy=b"Uprear'd, a wonder even in eyes divine.">, <tf.Tensor: id=95462, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=95463, shape=(), dtype=string, numpy=b'hecatombs, but to the daughter of great Jove alone he had made no'>, <tf.Tensor: id=95464, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=95465, shape=(), dtype=string, numpy=b'strode onward. The Argives were elated as they beheld him, but the'>, <tf.Tensor: id=95466, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=95467, shape=(), dtype=string, numpy=b'"Friends, Grecian Heroes, Ministers of Mars,'>, <tf.Tensor: id=95468, shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: id=95469, shape=(), dtype=string, numpy=b'sin against their oaths--of them and their children--may be shed upon'>, <tf.Tensor: id=95470, shape=(), dtype=int64, numpy=2>)
'''

Samples from all three classes are now present in the dataset.

7.4 Encoding the text as numbers

7.4.1 Building the vocabulary

import tensorflow_datasets as tfds
import os

# Newer tensorflow_datasets releases expose these text utilities under tfds.deprecated.text.
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

tokenizer = tfds.features.text.Tokenizer() instantiates a tokenizer, and tokenizer.tokenize splits a sentence into individual words, for example:

for text_tensor, _ in all_labeled_data.take(1):
    print(text_tensor)
    print(text_tensor.numpy())
    print(tokenizer.tokenize(text_tensor.numpy()))

'''
Output:
tf.Tensor(b"Uprear'd, a wonder even in eyes divine.", shape=(), dtype=string)
b"Uprear'd, a wonder even in eyes divine."
['Uprear', 'd', 'a', 'wonder', 'even', 'in', 'eyes', 'divine']
'''
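
The tokenization shown above can be approximated in plain Python with a regular expression that keeps runs of word characters. This is only a sketch that reproduces the example output; tokenize is a hypothetical function and not the library's code, and the two sample lines are taken from outputs earlier in this article.

```python
import re


def tokenize(text):
    """Split text into runs of word characters, dropping punctuation and spaces."""
    return re.findall(r"\w+", text)


lines = [
    "Uprear'd, a wonder even in eyes divine.",
    "And Heroes gave (so stood the will of Jove)",
]

vocabulary = set()
for line in lines:
    vocabulary.update(tokenize(line))

print(tokenize(lines[0]))
print(len(vocabulary))
```

Collecting the tokens in a set removes duplicates, so the size of the set is the vocabulary size.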

7.4.2 Building the encoder

encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

Let's try it on one sample:

example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)

encoded_example = encoder.encode(example_text)
print(encoded_example)

'''
Output:
b'I mean to pound his flesh, and smash his bones.'
[1677, 9644, 1762, 15465, 12945, 9225, 13806, 5555, 12945, 4829]
'''

Next, we wrap the encoder in a function for later use:

def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label
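
The idea behind the encoder is a lookup table from token to integer id. The sketch below is a simplified pure-Python version under stated assumptions: ids start at 1 so that 0 stays free for padding, and every unknown token shares one extra id. build_encoder is hypothetical, and the real encoder may assign ids differently.

```python
import re


def build_encoder(vocabulary):
    """Return a function that maps text to a list of integer token ids."""
    # Ids start at 1 so that 0 remains available as a padding value.
    token_to_id = {token: i for i, token in enumerate(sorted(vocabulary), start=1)}
    unknown_id = len(token_to_id) + 1  # shared by all tokens not in the vocabulary

    def encode(text):
        return [token_to_id.get(token, unknown_id) for token in re.findall(r"\w+", text)]

    return encode


encode_text = build_encoder({'his', 'bones', 'flesh', 'and'})
ids = encode_text('his flesh and his bones, broken')
print(ids)
```

As in the sample output above, where the word "his" is encoded as 12945 both times, a repeated word always receives the same id.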

7.4.3 Encoding every sample

def encode_map_fn(text, label):
    # py_func doesn't set the shape of the returned tensors.
    encoded_text, label = tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))

    # `tf.data.Datasets` work best if all components have a shape set
    # so set the shapes manually:
    encoded_text.set_shape([None])
    label.set_shape([])

    return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)

Here we used the function tf.py_function(func, inp, Tout, name=None):

Purpose: wraps a Python function so that ordinary Python code can interact with TensorFlow.
Parameters:
- func: the user-defined Python function.
- inp: the arguments passed to func, written as a list such as [tensor1, tensor2, tensor3], where every element is a Tensor object.
- Tout: describes the return values of func:
  - when Tout is a list such as [tf.string, tf.int64, tf.float32], func has three return values, i.e. it returns three tensors whose element types match the list;
  - when Tout is a single value such as tf.int64, func returns one integer list or integer tensor;
  - when Tout is empty, func has no return value.

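Note: if encode is passed to dataset.map directly instead of being wrapped in tf.py_function, the program fails with:

AttributeError: 'Tensor' object has no attribute 'numpy'

This is because the arguments that dataset.map(function) passes to function, i.e. text_tensor and label in encode(text_tensor, label) above, are Tensor objects rather than EagerTensor objects. One way to understand it:

dataset.map does not apply function to every sample up front. It only tells the dataset that each sample must go through function before it is used, so the actual computation happens while the dataset is being iterated. The arguments text_tensor and label are merely "containers" that are filled with data during iteration. Even after they are filled, they remain Tensor objects rather than EagerTensor objects, so they have no numpy() method, which causes the error above. tf.py_function runs the wrapped Python function eagerly, which is why encode can call numpy() inside it.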

The samples in the final dataset have now been converted from text to numeric vectors:

for i in all_encoded_data.take(5):
    print(i)

'''
Output:
(<tf.Tensor: id=225383, shape=(8,), dtype=int64, numpy=array([ 1438, 14227, 5791, 16819, 11806, 13990, 10168, 11243], dtype=int64)>, <tf.Tensor: id=225384, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=225388, shape=(13,), dtype=int64, numpy=array([ 3194, 4566, 1762, 15273, 9726, 377, 5972, 556, 11565, 13400, 5594, 5132, 9271], dtype=int64)>, <tf.Tensor: id=225389, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=225393, shape=(12,), dtype=int64, numpy=array([ 9549, 7697, 12367, 901, 7439, 4679, 3366, 11629, 5709, 4866, 4566, 15273], dtype=int64)>, <tf.Tensor: id=225394, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=225398, shape=(6,), dtype=int64, numpy=array([ 88, 12816, 14312, 7786, 377, 10566], dtype=int64)>, <tf.Tensor: id=225399, shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: id=225403, shape=(13,), dtype=int64, numpy=array([ 7888, 8908, 2313, 13645, 377, 12262, 13806, 2313, 7788, 3289, 1718, 12822, 2595], dtype=int64)>, <tf.Tensor: id=225404, shape=(), dtype=int64, numpy=2>)
'''

7.5 Splitting the dataset into test and training sets

BATCH_SIZE = 64
TAKE_SIZE = 5000

train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, ((None,), ()))

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, ((None,), ()))

tf.data.Dataset.take and tf.data.Dataset.skip are used to build a smaller test dataset and a larger training dataset. take(TAKE_SIZE) uses the first TAKE_SIZE samples as the test set; skip(TAKE_SIZE) skips those samples and uses the remaining ones (total sample count minus TAKE_SIZE) as the training set.
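
The take/skip split can be pictured with itertools.islice on an ordinary list. This is an illustrative sketch with made-up data, not TensorFlow code.

```python
from itertools import islice

samples = list(range(10))  # stand-in for the encoded samples
TAKE_SIZE = 3

test_part = list(islice(samples, TAKE_SIZE))         # like take(TAKE_SIZE)
train_part = list(islice(samples, TAKE_SIZE, None))  # like skip(TAKE_SIZE)

print(test_part)
print(train_part)
```

The two parts are only disjoint if the underlying order is the same every time it is traversed, which is what reshuffle_each_iteration=False in 7.3.3 provides.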

Before a dataset is passed to a model it has to be batched, and the samples within a batch normally need the same size and format. The samples here differ in size, because the lines of text contain different numbers of words. We therefore use tf.data.Dataset.padded_batch instead of batch to pad the samples to a common size. With the shape set to (None,), it finds the number of words in the longest sample of each batch and pads every other sample in that batch with zeros up to that length.

sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]

'''
Output:
(<tf.Tensor: id=225755, shape=(15,), dtype=int64, numpy=array([ 1438, 14227, 5791, 16819, 11806, 13990, 10168, 11243, 0, 0, 0, 0, 0, 0, 0], dtype=int64)>,
 <tf.Tensor: id=225759, shape=(), dtype=int64, numpy=0>)
'''
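
Padding to the longest sample of each batch can be written in a few lines of plain Python. The sketch below is a simplified model of that behavior with made-up id sequences; padded_batches is a hypothetical helper.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches and pad each batch to its longest member."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (longest - len(seq)) for seq in batch]


encoded = [[5, 2], [7, 1, 4, 9], [3], [8, 6, 2]]
batches = list(padded_batches(encoded, batch_size=2))
for batch in batches:
    print(batch)
```

The padded length is decided per batch, so different batches can end up with different widths.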

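Because padding introduces a new token (the zero), the vocabulary size grows by one.

vocab_size += 1

During training, train_data can then be fed directly into a word-embedding layer. For training details, see the article "TensorFlow 2.0 text classification: identifying the translator of a text".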

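8. Converting labels to numbers

[Figure: original data]

import pandas as pd

# pd.Categorical finds the distinct categories in list-like data.
# It returns a Categorical object, i.e. the original data with category information attached.
# The integer code of each value and the list of categories are available
# through the codes and categories attributes.
df.airline_sentiment = pd.Categorical(df.airline_sentiment).codes
df

[Figure: data after conversion]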
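
What the codes attribute contains can be reproduced in plain Python: the distinct values are sorted, and each value is replaced by the position of its category. The sketch below assumes string labels, and the three sentiment values are illustrative; categorical_codes is a hypothetical helper.

```python
def categorical_codes(values):
    """Return (codes, categories) with categories sorted and codes as indices."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[value] for value in values], categories


codes, categories = categorical_codes(['neutral', 'positive', 'negative', 'neutral'])
print(categories)
print(codes)
```

Keeping the categories list around makes it possible to map a predicted code back to its original label.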
