
Reading and writing TFRecord files in TensorFlow 2.0, and the difference between tf.io.parse_example and tf.io.parse_single_example

Published: 2023/12/20


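In the article "TFRecord format: content analysis and examples" we already looked at how the content of a tfrecord is laid out. The next step is to learn how to use one, that is, how to write and read tfrecord files.

Generating a tfrecord

Writing a tfrecord file is very simple. Reusing the example from "TFRecord format: content analysis and examples", we first build an example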

import tensorflow as tf

value_city = "北京".encode("utf-8")     # city
value_use_day = 7                       # times Taobao was opened in the last 7 days
value_pay = 289.4                       # amount spent in the last 7 days
value_poi = [b"123", b"456", b"789"]    # shops browsed in the last 7 days

# Build the BytesList, Int64List and FloatList.
# Each constructor takes a list, so scalar values are wrapped in one.
bl_city = tf.train.BytesList(value=[value_city])
il_use_day = tf.train.Int64List(value=[value_use_day])
fl_pay = tf.train.FloatList(value=[value_pay])
bl_poi = tf.train.BytesList(value=value_poi)

# Wrap each list in a tf.train.Feature.
feature_city = tf.train.Feature(bytes_list=bl_city)
feature_use_day = tf.train.Feature(int64_list=il_use_day)
feature_pay = tf.train.Feature(float_list=fl_pay)
feature_poi = tf.train.Feature(bytes_list=bl_poi)

# Collect the features into a tf.train.Features.
feature_dict = {"city": feature_city, "use_day": feature_use_day,
                "pay": feature_pay, "poi": feature_poi}
features = tf.train.Features(feature=feature_dict)

# Build the tf.train.Example.
example = tf.train.Example(features=features)

Then we write this example to a file

path = "./tfrecord"
with tf.io.TFRecordWriter(path) as file_writer:
    file_writer.write(example.SerializeToString())

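That completes writing the tfrecord file.

We are not quite done, though. Are the bytes written through tf.io the same as the bytes written when the serialized example is saved directly with Python? Let's run an experiment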

path = "./tfrecord"
path2 = "./tfrecord2"
with tf.io.TFRecordWriter(path) as file_writer:
    file_writer.write(example.SerializeToString())
with open(path2, "wb") as f:
    f.write(example.SerializeToString())

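The code above writes the serialized example to two files, once through tf.io and once through Python's open. Comparing the file sizes, one is 86 bytes and the other is 99 bytes. The contents are therefore not the same: tf.io.TFRecordWriter wraps each record with a length header and checksums, so Python's built-in open cannot be used as a drop-in replacement for tf.io.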
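To make the size difference concrete, here is a minimal pure-Python sketch of the per-record framing that, as far as I understand the TFRecord format, tf.io.TFRecordWriter adds: an 8-byte length, a 4-byte masked CRC of the length, the payload, and a 4-byte masked CRC of the payload. The helper names (crc32c, masked_crc, frame_record) are made up for this illustration and are not TensorFlow APIs.

```python
import struct


def _build_crc32c_table():
    # 256-entry lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_crc32c_table()


def crc32c(data: bytes) -> int:
    # Plain table-driven CRC-32C over the given bytes.
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    # Assumed masking step: rotate the CRC right by 15 bits, then add a constant (mod 2**32).
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    # length (8 bytes) + CRC of length (4) + payload + CRC of payload (4), little-endian.
    length = struct.pack("<Q", len(payload))
    return (length + struct.pack("<I", masked_crc(length))
            + payload + struct.pack("<I", masked_crc(payload)))


payload = b"serialized example bytes would go here"
framed = frame_record(payload)
print(len(payload), len(framed))  # the framed record is 16 bytes longer
```

Under this layout the framed file is always larger than the raw serialized bytes, which is the direction of the difference observed in the experiment above.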

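Reading a tfrecord

Reading a tfrecord is also simple, but TensorFlow's official documentation on it is really poorly written, so everything below is what I worked out on my own. Continuing from the code above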

path = "./tfrecord"
data = tf.data.TFRecordDataset(path)

That is in fact all it takes to read a tfrecord. Many people will object that, both in everyday use and in production code, a map call is always applied to transform data. That is true: a transformation is needed in practice, because when we saved the tfrecord we first serialized each example to binary and then stored those bytes as a string, so every example sits in the tfrecord as one string. Reading mirrors this: through tf.data.TFRecordDataset, each example's string has already been read in as a tf.Tensor(dtype=string). So we can inspect what was read with the code below

for batch in data:
    print(batch)

result:
tf.Tensor(b'\nQ\n\x18\n\x03poi\x12\x11\n\x0f\n\x03123\n\x03456\n\x03789\n\x12\n\x04city\x12\n\n\x08\n\x06\xe5\x8c\x97\xe4\xba\xac\n\x10\n\x07use_day\x12\x05\x1a\x03\n\x01\x07\n\x0f\n\x03pay\x12\x08\x12\x06\n\x043\xb3\x90C', shape=(), dtype=string)

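There is another big pitfall here. data is an instance of the TFRecordDatasetV2 class, but it is also an iterable. So even if you search through all of its attributes and methods, you will not find a tensor that holds the data; you can only see the data by iterating.

In Python, an iterable is an object that has an __iter__ attribute. Such objects can be traversed by a loop, so they can be placed in a for statement. Other objects, such as int and float values, are not iterable, and putting them in a loop raises a TypeError with the message "object is not iterable".

Of course, the serialized bytes of an example are not usable as they are. We still have to parse the data out of them, and this is where the familiar map method comes in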
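The iterable behaviour described above can be sketched in plain Python. Countdown is a hypothetical class invented for this illustration: like the dataset, it exposes no attribute that stores its items, yet a for loop can still walk through them, while a plain int cannot be looped over.

```python
class Countdown:
    """Hypothetical iterable: defines __iter__ but stores no list of items."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Items are produced on demand while iterating.
        n = self.start
        while n > 0:
            yield n
            n -= 1


items = [n for n in Countdown(3)]
print(items)

try:
    for _ in 5:  # an int has no __iter__
        pass
except TypeError as err:
    message = str(err)
    print(message)
```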

def decode_fn(record_bytes):
    # Parse one serialized example according to the feature spec.
    return tf.io.parse_single_example(
        record_bytes,
        {"city": tf.io.FixedLenFeature([], dtype=tf.string),
         "use_day": tf.io.FixedLenFeature([], dtype=tf.int64),
         "pay": tf.io.FixedLenFeature([], dtype=tf.float32),
         "poi": tf.io.VarLenFeature(dtype=tf.string)})

data2 = data.map(decode_fn)

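tf.io.parse_single_example takes a string tensor as input and returns a dict whose layout follows the feature spec passed in. Note that every key in the spec should be present in the example; otherwise parsing raises an error (this applies to FixedLenFeature entries that have no default_value).

Now that we understand what data contains, we can also call decode_fn directly as follows: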

for batch in data:
    print(decode_fn(batch))

result:
{'poi': <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x000002532FEF7860>, 'city': <tf.Tensor: id=55, shape=(), dtype=string, numpy=b'\xe5\x8c\x97\xe4\xba\xac'>, 'pay': <tf.Tensor: id=56, shape=(), dtype=float32, numpy=289.4>, 'use_day': <tf.Tensor: id=58, shape=(), dtype=int64, numpy=7>}

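As you can see, the two ways of reading (mapping decode_fn over the dataset, or calling it directly on each record) produce the same content.

One more question arises here. In the official TensorFlow tutorials, the io module has two very similar functions: tf.io.parse_single_example and tf.io.parse_example. What is the difference between them?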
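Putting the write and read steps together, the round trip can be sketched end to end as below. This is only a sketch under a few assumptions: it writes to a temporary file (the path is made up for the demo), it parses inside map, and it reads the sparse poi feature through the values attribute of the returned SparseTensor.

```python
import os
import tempfile

import tensorflow as tf

# Same four features as in the article.
example = tf.train.Example(features=tf.train.Features(feature={
    "city": tf.train.Feature(bytes_list=tf.train.BytesList(value=["北京".encode("utf-8")])),
    "use_day": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    "pay": tf.train.Feature(float_list=tf.train.FloatList(value=[289.4])),
    "poi": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"123", b"456", b"789"])),
}))

# Hypothetical temporary location, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

spec = {
    "city": tf.io.FixedLenFeature([], tf.string),
    "use_day": tf.io.FixedLenFeature([], tf.int64),
    "pay": tf.io.FixedLenFeature([], tf.float32),
    "poi": tf.io.VarLenFeature(tf.string),
}

dataset = tf.data.TFRecordDataset(path).map(
    lambda record: tf.io.parse_single_example(record, spec))
parsed = next(iter(dataset))  # the file holds exactly one record

city = parsed["city"].numpy().decode("utf-8")
use_day = int(parsed["use_day"].numpy())
pay = float(parsed["pay"].numpy())
poi = parsed["poi"].values.numpy().tolist()  # non-empty entries of the SparseTensor
print(city, use_day, pay, poi)
```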

1. The number of examples parsed differs.

Let's look at the official documentation first.

The official documentation for tf.io.parse_example reads

Args

serialized	A vector (1-D Tensor) of strings, a batch of binary serialized Example protos.
features	A dict mapping feature keys to FixedLenFeature, VarLenFeature, SparseFeature, and RaggedFeature values.
example_names	A vector (1-D Tensor) of strings (optional), the names of the serialized protos in the batch.
name	A name for this operation (optional).

Returns

A dict mapping feature keys to Tensor, SparseTensor, and RaggedTensor values.

The official documentation for tf.io.parse_single_example reads

Args

serialized	A scalar string Tensor, a single serialized Example.
features	A dict mapping feature keys to FixedLenFeature or VarLenFeature values.
example_names	(Optional) A scalar string Tensor, the associated name.
name	A name for this operation (optional).

Returns

A dict mapping feature keys to Tensor and SparseTensor values.

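From the official definitions and the function names it is clear that tf.io.parse_single_example parses the binary serialization of a single example only, and what it returns is likewise a single example. That is why its first argument must be a scalar string Tensor, which is really just one string. In the example above, each record printed from the dataset looks like a tensor, but its shape is empty (shape=()), so in essence it is a scalar (one string, a rank-0 tensor) rather than a vector or higher-rank tensor.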

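TensorFlow has three related concepts

Scalar (scalar tensor): can be thought of as an ordinary variable; it is a rank-0 tensor and its shape is empty

Vector: a rank-1 tensor

Tensor: the general case of any rank, and the one used most often

If the scalar is reshaped into a vector or a higher-rank tensor, the argument no longer matches the input definition of parse_single_example and an error is raised.

tf.io.parse_example is exactly the opposite. It can parse a batch of examples, so its input is a vector. Even when parsing just one example, the scalar must be reshaped into a vector, which means the code should be written as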

def decode_fn(record_bytes):
    return tf.io.parse_example(
        tf.reshape(record_bytes, [1]),  # note: this line has changed
        {"city": tf.io.FixedLenFeature([], dtype=tf.string),
         "use_day": tf.io.FixedLenFeature([], dtype=tf.int64),
         "pay": tf.io.FixedLenFeature([], dtype=tf.float32),
         "poi": tf.io.VarLenFeature(dtype=tf.string)})

data2 = data.map(decode_fn)

Note that the first argument of tf.io.parse_example can only be a vector. It must never be a tensor of rank 2 or higher, otherwise an error is raised as well.
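The shape difference between the two functions can be checked directly on an in-memory example, without writing a file. The sketch below assumes a cut-down example with only use_day and poi; it passes a scalar to parse_single_example and a length-1 vector to parse_example, then compares the shapes of the results.

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "use_day": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    "poi": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"123", b"456", b"789"])),
}))
spec = {
    "use_day": tf.io.FixedLenFeature([], tf.int64),
    "poi": tf.io.VarLenFeature(tf.string),
}

serialized = tf.constant(example.SerializeToString())  # scalar string tensor, shape ()

single = tf.io.parse_single_example(serialized, spec)
batch = tf.io.parse_example(tf.reshape(serialized, [1]), spec)  # vector of length 1

# Fixed-length feature: scalar for the single parse, one entry per example for the batch parse.
single_use_day_shape = single["use_day"].shape.as_list()
batch_use_day_shape = batch["use_day"].shape.as_list()

# Variable-length feature: the batch parse adds a leading batch dimension.
single_poi_shape = single["poi"].dense_shape.numpy().tolist()
batch_poi_shape = batch["poi"].dense_shape.numpy().tolist()

print(single_use_day_shape, batch_use_day_shape)
print(single_poi_shape, batch_poi_shape)
```

The extra leading dimension on the batch result is the structural difference that the next section builds on.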

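2. Variable-length sparse features are parsed differently

This difference is a very interesting one. Look at the poi feature above: it is a sparse feature. Whether we go through tf.io.parse_example or tf.io.parse_single_example, we parse out the strings and get the three shop ids ["123", "456", "789"]. In practice, however, features like this usually have to be one-hot encoded so that they become numeric input.

With tf.io.parse_example, the resulting one-hot encoding is a single vector. For example, suppose there are 5 shops in total, [a, "123", b, "456", "789"]. After one-hot encoding, tf.io.parse_example gives [0,1,0,1,1], whereas parse_single_example gives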

[[0,1,0,0,0],
 [0,0,0,1,0],
 [0,0,0,0,1]]
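The two encodings can be reproduced in plain Python to see how they relate. This is only an illustration of the two result shapes, not of what TensorFlow does internally; the vocabulary uses "a" and "b" as stand-ins for the two unnamed shops in the text.

```python
vocab = ["a", "123", "b", "456", "789"]  # the 5 shops; "a" and "b" are placeholders
poi = ["123", "456", "789"]              # shops browsed by this user

index = {shop: i for i, shop in enumerate(vocab)}


def one_hot(shop):
    # One row per shop id: a single 1 at that shop's position in the vocabulary.
    row = [0] * len(vocab)
    row[index[shop]] = 1
    return row


# One one-hot row per id (the matrix form).
per_id = [one_hot(shop) for shop in poi]

# Collapsing the rows column-wise gives the single multi-hot vector.
multi_hot = [max(column) for column in zip(*per_id)]

print(per_id)
print(multi_hot)
```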

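This is covered in detail in https://blog.csdn.net/kangshuangzhu/article/details/106851826

Closing remarks

One problem remains. When calling tf.io.parse_single_example, we have to spell out the form of the returned dict. That is fine when there are only a few features, but production systems usually have a great many of them, often more than a thousand, and defining the spec by hand is clearly very inefficient. This is where tf.feature_column becomes a very useful tool. tf.feature_column will be covered in the next article.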
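Until then, one simple way to avoid typing a long spec by hand is to generate the dict from lists of feature names. This is a plain dict-building sketch, not the tf.feature_column approach the next article refers to, and all feature names below are hypothetical.

```python
import tensorflow as tf

# Hypothetical feature names grouped by type; a real project would load these from config.
int_features = [f"int_feat_{i}" for i in range(3)]
float_features = [f"float_feat_{i}" for i in range(2)]
sparse_string_features = [f"tags_{i}" for i in range(2)]

spec = {}
spec.update({name: tf.io.FixedLenFeature([], tf.int64) for name in int_features})
spec.update({name: tf.io.FixedLenFeature([], tf.float32) for name in float_features})
spec.update({name: tf.io.VarLenFeature(tf.string) for name in sparse_string_features})


def decode_fn(record_bytes):
    # Same parsing call as before, now driven by the generated spec.
    return tf.io.parse_single_example(record_bytes, spec)


print(sorted(spec))
```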
