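TensorFlow 2 Quick Start: Word Embeddings
Representing text as numbers
Machine learning models take vectors (arrays of numbers) as input. When working with text, we must first come up with a strategy to convert strings to numbers (that is, to "vectorize" the text) before feeding it to the model. In this section, we look at three strategies for doing so.
One-hot encoding
As a first idea, we could one-hot encode each word in the vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (the set of unique words) of this sentence is (cat, mat, on, sat, the). To represent each word, we create a zero vector whose length equals the vocabulary size, then place a 1 at the index that corresponds to the word. The figure below illustrates this approach.
To create a vector that encodes the whole sentence, we can concatenate the one-hot vectors of its words.
Key point: this approach is inefficient. A one-hot encoded vector is very sparse (most of its entries are zero). Suppose the vocabulary has 10,000 words. To one-hot encode each word, we would create a vector in which 99.99% of the elements are zero.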
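As a rough illustration of the one-hot scheme described above, here is a minimal pure-Python sketch. The alphabetical index order and the helper name one_hot are assumptions made for this example; any fixed word-to-index mapping would work.

```python
# Minimal one-hot encoding sketch for the example sentence.
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Vocabulary: the unique words, sorted so the index order is deterministic.
vocab = sorted(set(tokens))  # ['cat', 'mat', 'on', 'sat', 'the']
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a zero vector of vocabulary length with a single 1 (hypothetical helper)."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


# Encode the sentence by concatenating the per-word one-hot vectors.
sentence_vector = [x for token in tokens for x in one_hot(token)]

print(one_hot("cat"))        # [1, 0, 0, 0, 0]
print(len(sentence_vector))  # 6 words * 5 vocabulary entries = 30

# Fraction of zeros in a single one-hot vector for a 10,000-word vocabulary.
zero_fraction = (10_000 - 1) / 10_000
print(f"{zero_fraction:.2%}")  # 99.99%
```

Only 6 of the 30 entries in sentence_vector are non-zero, and the waste grows with the vocabulary size.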
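Encoding each word with a unique number
A second approach we might try is to encode each word using a unique number. Continuing the example above, we could assign 1 to "cat", 2 to "mat", and so on. We could then encode the sentence "The cat sat on the mat" as a dense vector such as [5, 1, 4, 3, 5, 2]. This approach is efficient: instead of a sparse vector, we now have a dense one (every element carries information).
However, this approach has two drawbacks:
- The integer encoding is arbitrary (it does not capture any relationship between words).
- An integer encoding can be hard for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.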
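The integer scheme above can be sketched in a few lines of pure Python. The alphabetical id assignment (starting at 1) is an assumption chosen so that the result matches the example in the text; the name word_to_id is hypothetical.

```python
# Minimal integer-encoding sketch for the example sentence.
tokens = "The cat sat on the mat".lower().split()

# Assign ids 1..5 in alphabetical order: cat=1, mat=2, on=3, sat=4, the=5.
vocab = sorted(set(tokens))
word_to_id = {word: i for i, word in enumerate(vocab, start=1)}

encoded = [word_to_id[token] for token in tokens]
print(encoded)  # [5, 1, 4, 3, 5, 2]

# The ids are arbitrary: "cat" and "mat" are numerically adjacent,
# but that adjacency says nothing about how related the words are.
print(abs(word_to_id["cat"] - word_to_id["mat"]))  # 1
```

A different sort order would produce an equally valid encoding with entirely different numeric distances, which is why the ids carry no information about word relationships.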
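Word embeddings
Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, we do not have to specify this encoding by hand. An embedding is a dense vector of floating-point values (the length of the vector is a parameter you specify). Rather than being set manually, the values are trainable parameters: weights learned by the model during training, in the same way a model learns the weights of a dense layer. Word embeddings with 8 dimensions are common for small datasets, and they can go up to 1024 dimensions when working with large datasets. A higher-dimensional embedding can capture fine-grained relationships between words, but it takes more data to learn.
Above is a diagram of a word embedding. Each word is represented as a 4-dimensional vector of floating-point values. Another way to think of an embedding is as a "lookup table": once the weights have been learned, we can encode each word by looking up its dense vector in the table.
Now let's look at the code.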
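To make the "lookup table" view concrete, here is a small pure-Python sketch. The 4-dimensional values are made up for illustration and are not learned weights; the names embedding_table and embed are hypothetical.

```python
# A toy embedding "lookup table": one 4-dimensional float vector per word.
# The numbers are illustrative placeholders, not trained values.
embedding_table = {
    "cat": [1.2, -0.1, 4.3, 3.2],
    "mat": [0.4, 2.5, -0.9, 0.5],
    "on": [2.1, 0.3, 0.1, 0.4],
}


def embed(words):
    """Encode each word by looking up its dense vector (hypothetical helper)."""
    return [embedding_table[word] for word in words]


vectors = embed(["cat", "on", "mat"])
for word, vec in zip(["cat", "on", "mat"], vectors):
    print(word, vec)
```

In a real model the table is a weight matrix indexed by integer word ids, and its rows are adjusted during training rather than written by hand.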
# Standard-library helpers used throughout the tutorial
import io
import os
import re
import shutil
import string
import tensorflow as tf

from datetime import datetime
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

print(tf.__version__)
"""
Output: 2.5.0-dev20201226
"""

Download the IMDB dataset
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download and extract the archive into the current directory
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)
"""
Output:
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 138s 2us/step
['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']
"""

The train folder contains two folders of movie reviews, pos and neg, whose reviews are labeled positive and negative respectively. You can use the data in these two folders to train a binary classification model.
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)
"""
Output:
['labeledBow.feat','neg','pos','unsup','unsupBow.feat','urls_neg.txt','urls_pos.txt','urls_unsup.txt']
"""

Before creating the dataset, remove the extra folders, such as unsup.
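# Delete the unlabeled reviews so that only the pos and neg classes remain
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Next, use the tf.keras.preprocessing.text_dataset_from_directory function to create a tf.data.Dataset. Build the training and validation datasets from the data in the train folder, holding out 20% of it for validation (validation_split=0.2).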
batch_size = 1024
seed = 123  # the same seed in both calls keeps the two subsets disjoint

train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)
"""
Output:
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
"""

Inspect a few reviews from the dataset together with their labels.
# Print the first five (label, review) pairs of the first batch
for text_batch, label_batch in train_ds.take(1):
    for i in range(5):
        print(label_batch[i].numpy(), text_batch.numpy()[i])
"""
Output:
0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe"
1 b'This movie is SOOOO funny!!! The acting is WONDERFUL, the Ramones are sexy, the jokes are subtle, and the plot is just what every high schooler dreams of doing to his/her school. I absolutely loved the soundtrack as well as the carefully placed cynicism. If you like monty python, You will love this film. This movie is a tad bit "grease"esk (without all the annoying songs). The songs that are sung are likable; you might even find yourself singing these songs once the movie is through. This musical ranks number two in musicals to me (second next to the blues brothers). But please, do not think of it as a musical per say; seeing as how the songs are so likable, it is hard to tell a carefully choreographed scene is taking place. I think of this movie as more of a comedy with undertones of romance. You will be reminded of what it was like to be a rebellious teenager; needless to say, you will be reminiscing of your old high school days after seeing this film. Highly recommended for both the family (since it is a very youthful but also for adults since there are many jokes that are funnier with age and experience.'
0 b"Alex D. Linz replaces Macaulay Culkin as the central figure in the third movie in the Home Alone empire. Four industrial spies acquire a missile guidance system computer chip and smuggle it through an airport inside a remote controlled toy car. Because of baggage confusion, grouchy Mrs. Hess (Marian Seldes) gets the car. She gives it to her neighbor, Alex (Linz), just before the spies turn up. The spies rent a house in order to burglarize each house in the neighborhood until they locate the car. Home alone with the chicken pox, Alex calls 911 each time he spots a theft in progress, but the spies always manage to elude the police while Alex is accused of making prank calls. The spies finally turn their attentions toward Alex, unaware that he has rigged devices to cleverly booby-trap his entire house. Home Alone 3 wasn't horrible, but probably shouldn't have been made, you can't just replace Macauley Culkin, Joe Pesci, or Daniel Stern. Home Alone 3 had some funny parts, but I don't like when characters are changed in a movie series, view at own risk."
0 b"There's a good movie lurking here, but this isn't it. The basic idea is good: to explore the moral issues that would face a group of young survivors of the apocalypse. But the logic is so muddled that it's impossible to get involved.<br /><br />For example, our four heroes are (understandably) paranoid about catching the mysterious airborne contagion that's wiped out virtually all of mankind. Yet they wear surgical masks some times, not others. Some times they're fanatical about wiping down with bleach any area touched by an infected person. Other times, they seem completely unconcerned.<br /><br />Worse, after apparently surviving some weeks or months in this new kill-or-be-killed world, these people constantly behave like total newbs. They don't bother accumulating proper equipment, or food. They're forever running out of fuel in the middle of nowhere. They don't take elementary precautions when meeting strangers. And after wading through the rotting corpses of the entire human race, they're as squeamish as sheltered debutantes. You have to constantly wonder how they could have survived this long... and even if they did, why anyone would want to make a movie about them.<br /><br />So when these dweebs stop to agonize over the moral dimensions of their actions, it's impossible to take their soul-searching seriously. Their actions would first have to make some kind of minimal sense.<br /><br />On top of all this, we must contend with the dubious acting abilities of Chris Pine. His portrayal of an arrogant young James T Kirk might have seemed shrewd, when viewed in isolation. But in Carriers he plays on exactly that same note: arrogant and boneheaded. It's impossible not to suspect that this constitutes his entire dramatic range.<br /><br />On the positive side, the film *looks* excellent. It's got an over-sharp, saturated look that really suits the southwestern US locale. But that can't save the truly feeble writing nor the paper-thin (and annoying) characters. Even if you're a fan of the end-of-the-world genre, you should save yourself the agony of watching Carriers."
0 b'I saw this movie at an actual movie theater (probably the $2.00 one) with my cousin and uncle. We were around 11 and 12, I guess, and really into scary movies. I remember being so excited to see it because my cool uncle let us pick the movie (and we probably never got to do that again!) and sooo disappointed afterwards!! Just boring and not scary. The only redeeming thing I can remember was Corky Pigeon from Silver Spoons, and that wasn\'t all that great, just someone I recognized. I\'ve seen bad movies before and this one has always stuck out in my mind as the worst. This was from what I can recall, one of the most boring, non-scary, waste of our collective $6, and a waste of film. I have read some of the reviews that say it is worth a watch and I say, "Too each his own", but I wouldn\'t even bother. Not even so bad it\'s good.'
"""

Configure the dataset for performance
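These are two important methods to use when loading data so that I/O does not block:
- .cache(): keeps the data in memory after it has been loaded from disk. This ensures the dataset does not become a bottleneck while training the model. If the dataset is too large to fit in memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
- .prefetch(): overlaps data preprocessing with model execution during training.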
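A minimal sketch of how the two methods are typically chained, assuming a TensorFlow version that provides tf.data.AUTOTUNE (2.4 or later). The dataset toy_ds and the helper configure_for_performance are hypothetical stand-ins; in the tutorial the same calls would be applied to train_ds and val_ds.

```python
import tensorflow as tf

# Let tf.data choose the prefetch buffer size at runtime.
AUTOTUNE = tf.data.AUTOTUNE


def configure_for_performance(ds):
    """Cache elements after the first pass and prefetch upcoming batches."""
    return ds.cache().prefetch(buffer_size=AUTOTUNE)


# Tiny in-memory dataset used only to keep this sketch self-contained.
toy_ds = tf.data.Dataset.from_tensor_slices(
    (["good movie", "bad movie"], [1, 0])
).batch(2)
toy_ds = configure_for_performance(toy_ds)

# The first pass fills the cache; later passes read from it.
for epoch in range(2):
    for text_batch, label_batch in toy_ds:
        print(epoch, label_batch.numpy())
```

With the tutorial's datasets this would read train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE), and likewise for val_ds.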
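Using the Embedding layer
The Embedding layer can be understood as a lookup table that maps integer indices (each standing for a specific word) to dense vectors (their embeddings). The dimensionality of the embedding is a parameter you can experiment with to find what works well for your problem, much as you would experiment with the number of neurons in a Dense layer.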
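# Embed a 1,000-word vocabulary into 5 dimensions
embedding_layer = tf.keras.layers.Embedding(1000, 5)

When you create an Embedding layer, its weights are randomly initialized, just like those of any other layer. During training, they are gradually adjusted via backpropagation. Once trained, the learned embeddings roughly encode similarities between words (as they were learned for the specific problem your model is trained on).
If you pass integers to an embedding layer, the result replaces each integer with the corresponding vector from the embedding table.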
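result = embedding_layer(tf.constant([1, 2, 3]))
result.numpy()
"""
Output:
array([[-0.01827962,  0.033703  ,  0.02065292,  0.00335936, -0.00998179],
       [ 0.00618695, -0.02138543, -0.01288087,  0.03814398, -0.02176479],
       [-0.02900024,  0.03794893, -0.03229412,  0.04951945,  0.03212232]],
      dtype=float32)
"""

For text or sequence problems, the Embedding layer takes a 2D tensor of integers of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths. You could feed the embedding layer above batches with shape (32, 10) (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15).
The returned tensor has one more axis than the input, and the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N).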
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape
"""
Output: TensorShape([2, 3, 5])
"""

When given a batch of sequences as input, an embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality).
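The shape rule can be reproduced without TensorFlow. The following is a NumPy sketch of the row lookup an embedding layer performs, not the Keras implementation; the table values are random placeholders and the names table and embed are hypothetical.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 5

# Stand-in for the layer's weight matrix: one row per vocabulary index.
rng = np.random.default_rng(0)
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim)).astype(np.float32)


def embed(ids):
    """Replace every integer id with its row of the table."""
    return table[np.asarray(ids)]


# A (2, 3) batch of ids gains a new last axis of size embedding_dim.
print(embed([[0, 1, 2], [3, 4, 5]]).shape)  # (2, 3, 5)

# Batches with different sequence lengths go through the same table.
print(embed(np.zeros((32, 10), dtype=np.int64)).shape)  # (32, 10, 5)
print(embed(np.zeros((64, 15), dtype=np.int64)).shape)  # (64, 15, 5)
```

Because the lookup only indexes rows, the same table serves any batch size and any sequence length; only the last axis is fixed by the embedding dimensionality.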
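Note: this article is based on the official TensorFlow documentation, which has been abridged here, with some added comments and modifications.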