Transformers: State-of-the-Art Natural Language Processing
This is a three-part series in which we will go through Transformers, BERT, and a hands-on Kaggle challenge, Google QUEST Q&A Labeling, to see Transformers in action (top 4.4% on the leaderboard). In this part (1/3) we will look at how Transformers became state-of-the-art in various modern natural language processing tasks and how they work.
The Transformer is a deep learning model proposed in the paper Attention Is All You Need by researchers at Google and the University of Toronto in 2017, used primarily in the field of natural language processing (NLP).
Like recurrent neural networks (RNNs), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not require the sequential data to be processed in order. For example, if the input is a natural language sentence, the Transformer does not need to process its beginning before its end. This allows for much more parallelization than RNNs and therefore shorter training times.
Transformers were designed around the attention mechanism, which was originally devised to help neural machine translation models memorize long source sentences.
Sounds cool, right? Let's take a look under the hood and see how things work.
Transformers are based on an encoder-decoder architecture: the encoder consists of a stack of encoding layers that processes the input iteratively, one layer after another, and the decoder consists of a stack of decoding layers that does the same to the encoder's output.
So, when we pass a sentence into a Transformer, it is embedded and passed into a stack of encoders. The output of the final encoder is then passed into each decoder block in the decoder stack, and the decoder stack generates the output.
All the encoder blocks in the Transformer are identical, and similarly, all the decoder blocks are identical.
(source: http://jalammar.github.io/illustrated-transformer/)
This is a very high-level view of a Transformer, and on its own it probably doesn't explain why Transformers are so effective in modern NLP tasks. Don't worry; to make things clearer, we will now go through the internals of an encoder and a decoder cell.
Encoder
The encoder has two parts: a self-attention layer and a feed-forward neural network.
(source: http://jalammar.github.io/illustrated-transformer/)
The encoder's inputs first flow through a self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. Basically, for each input word x the self-attention layer generates a vector Z by taking all the input words (x1, x2, x3, …, xn) into account before generating Z. I'll come to why it considers all the input words' embeddings and how it generates Z later in this blog, but for now, just remember this brief high-level summary of the encoder's subcomponents.
The outputs of the self-attention layer are fed to a feed-forward neural network. The feed-forward network generates an output for each input Z, and that output is passed into the next encoder block's self-attention layer, and so on.
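This feed-forward layer is not revisited later in the article, so for reference: in the original paper it is a position-wise network, two linear transformations with a ReLU in between, applied independently to each position's vector. A minimal numpy sketch, with randomly initialized weights standing in for the learned parameters:

```python
import numpy as np

def feed_forward(z, W1, b1, W2, b2):
    """Position-wise FFN from the paper: FFN(z) = max(0, z W1 + b1) W2 + b2."""
    hidden = np.maximum(0.0, z @ W1 + b1)   # (n, d_ff): ReLU non-linearity
    return hidden @ W2 + b2                 # (n, d_model): back to model width

# Shapes used in the paper: d_model = 512, d_ff = 2048.
rng = np.random.default_rng(0)
n, d_model, d_ff = 11, 512, 2048
z = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.01, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.01, np.zeros(d_model)
out = feed_forward(z, W1, b1, W2, b2)       # (11, 512)
```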
Now that we have an idea of what is inside an encoder, let's understand the tensor operations happening inside each component.
First comes the input:
We know that Transformers are used for NLP tasks, so the data we deal with is usually a corpus of sentences. But since machine learning algorithms are all about matrix operations, we first need to convert the human-readable sentences into a machine-readable format (numbers). To convert the sentences into numbers, we use word embeddings. This step is simple: each word in a sentence is represented as an n-dimensional vector (n is usually 512), and for Transformers we typically use the GloVe embedding representation of words. There is also something called positional encoding that is applied to these embeddings, but I'll come to it later. Once we have the embedding for each input word, we pass these embeddings simultaneously to the self-attention layer.
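A minimal sketch of this step is shown below. Random vectors stand in for pretrained GloVe embeddings here, purely to illustrate the shapes involved, and the tokenization is a naive whitespace split:

```python
import numpy as np

d_model = 512                      # embedding size used in the original Transformer
sentence = "the animal didn't cross the street because it was too tired"
tokens = sentence.split()          # naive whitespace tokenization, 11 tokens
n = len(tokens)

# Stand-in embedding table: in practice these rows would come from
# pretrained GloVe vectors, not random numbers.
rng = np.random.default_rng(0)
embedding_table = {tok: rng.normal(size=d_model) for tok in set(tokens)}

# x is the (n, d_model) matrix fed to the first encoder block
# (the positional encodings discussed later are added on top of this).
x = np.stack([embedding_table[tok] for tok in tokens])
print(x.shape)                     # (11, 512)
```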
The training parameters of the self-attention layer:
Different layers have different learning parameters, e.g. a Dense layer has weights and a bias, and a Convolutional layer has kernels as its learning parameters. Similarly, the self-attention layer has 4 learning parameters:
- Query matrix: Wq
- Key matrix: Wk
- Value matrix: Wv
- Output matrix: Wo (this is not the output itself but a trainable parameter that generates the final output Z of the self-attention layer).
The first 3 trainable parameters have a special purpose; they are used for generating 3 new quantities:
- Query: Q
- Key: K
- Value: V
which are later used for generating the output Z from the input x. Let's see how.
Some points to keep in mind:
- The input tensor x has n rows and m columns, where n is the number of input words and m is the vector size of each word, i.e. 512.
- The output tensors Q, K, V, and Z have n rows and dk columns, where n is the number of input words and dk is 64. The values of m and dk are not random; they were found to work best by the researchers who came up with this architecture.
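In code, these are just three matrix multiplications. The sketch below uses randomly initialized Wq, Wk, and Wv purely to illustrate the shapes; in a trained model they are learned parameters:

```python
import numpy as np

n, d_model, d_k = 11, 512, 64                # sentence length, embedding size, head size

rng = np.random.default_rng(0)
x  = rng.normal(size=(n, d_model))           # input word embeddings from the previous step
Wq = rng.normal(size=(d_model, d_k))         # trainable query projection
Wk = rng.normal(size=(d_model, d_k))         # trainable key projection
Wv = rng.normal(size=(d_model, d_k))         # trainable value projection

Q = x @ Wq    # (n, 64): one query vector per input word
K = x @ Wk    # (n, 64): one key vector per input word
V = x @ Wv    # (n, 64): one value vector per input word
```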
(source: http://jalammar.github.io/illustrated-transformer/)
After calculating the 3 quantities Q, K, and V as mentioned above, the self-attention layer then calculates scores, a vector for each of the input words.
Dot-product attention:
The next step in the self-attention layer is to calculate the score vector corresponding to each input word. This score calculation is one of the most crucial steps, the one that brings the attention mechanism to life (well… not literally). The score vector has a size of n, where n is the number of input words, and each element of this vector is a number that tells how much the word it corresponds to contributes to the current word. Let's consider an example to get the intuition:
"The animal didn't cross the street because it was too tired"
In the above sentence, the word it refers to the animal and not the street. For us, this is pretty simple to grasp, but not for a machine with no attention, because we know how grammar works and we have developed a sense that it refers to animal rather than to words like cross or street. This sense of grammar comes to Transformers after training, but the fact that for a given word it considers all the words in the input and then has the ability to select the ones that it thinks contribute the most is what the attention mechanism is about. For the above sentence, the score vector generated for the word it will have 11 numbers, each corresponding to a word in the input sentence. For a well-trained model, this score vector will have larger numbers at positions 2 and 8 because the words at position 2 (animal) and position 8 (it) contribute the most to it. It may look something like: [2, 60, 4, 5, 3, 8, 5, 90, 7, 6, 3]. Notice that the values at positions 2 and 8 are greater than the values at the other positions.
(source: http://jalammar.github.io/illustrated-transformer/)
Let's see how these scores are generated in the self-attention layer. Until now, for each word we have the Q, K, and V vectors. To generate the score vector, we use something called dot-product attention, where we take a dot product between the Q and K vectors to generate the scores. The value of Q corresponds to the query of the word for which we are calculating the scores (in the above example, the word it), whereas there are n values of K, each corresponding to the key vector of one of the input words. So, if we want to generate the scores for the word it:
- We take the query vector of it: Q.
- We take the key vectors of all the input words: K1, K2, K3, …, Kn.
- We take a dot product between Q and each of the K's and obtain n scores.
After calculating the scores, we normalize them by dividing them by the square root of dk (the column dimension of the vectors Q, K, and V). The creators of the Transformer found that scaling the scores by sqrt(dk) gives better results: it keeps the dot products from growing too large, which would otherwise push the softmax into regions with extremely small gradients.
After scaling the score vector, we pass it through a softmax function so that all the values are positive and sum to 1, with larger scores receiving larger weights.
Once we have the softmaxed scores ready, we simply multiply each score element with the value vector V corresponding to it, so that we get n weighted value vectors after this operation: [V1, V2, V3, …, Vn]. Now, to obtain the output Z of the self-attention layer, we simply add up all n of these value vectors.
(source: http://jalammar.github.io/illustrated-transformer/)
The above diagrams illustrate the steps of the self-attention layer.
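Putting the whole sequence together (dot-product scores, scaling by sqrt(dk), softmax, and the weighted sum of the value vectors), a single attention head can be sketched in a few lines of numpy. This is a simplified illustration, not the exact implementation from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Z = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n, n) word-vs-word scores
    scores = scores - scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (n, d_k): one Z vector per word

# Using the Q, K, V computed in the earlier sketch:
# Z = scaled_dot_product_attention(Q, K, V)                  # (11, 64)
```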
Multi-head Attention:
Now that we know how an attention head works and how amazing it is, there is a catch. A single attention head can sometimes miss some of the words in the input that contribute most to the word in the spotlight; like in the example before, the attention head may fail to pay attention to the word animal while processing the word it, and this may cause problems. To tackle this issue, instead of just a single attention head, we use multiple attention heads, each working in the same manner. This helps reduce the error or miscalculation of any single attention head. This is also referred to as multi-head attention.
The scores from 2 different attention heads are represented in orange and green. We can see how one attention head pays more attention to words like the, animal, and cross, whereas the other pays more attention to words like street, was, and tired. (image source)
In the Transformer, multi-head attention typically uses 8 attention heads. Now notice that the output of a single attention head is 64-dimensional, but if we use multi-head attention, we will get 8 such 64-dimensional vectors as output.
(source: http://jalammar.github.io/illustrated-transformer/)
It turns out there is a final trainable parameter, the output matrix Wo that I mentioned before, which comes into play here. In the final step of the self-attention layer, all the outputs [Z0, Z1, Z2, …, Z7] are concatenated and multiplied with Wo, so that the final output Z is brought back to the model dimension (512) and can be fed to the feed-forward network.
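A sketch of the whole multi-head step, reusing the scaled_dot_product_attention function from the earlier sketch; the per-head projection matrices and Wo are assumed to be given (in practice, learned):

```python
import numpy as np

def multi_head_attention(x, Wq_heads, Wk_heads, Wv_heads, Wo):
    """Run several attention heads in parallel, concatenate them, and mix with Wo.

    Wq_heads, Wk_heads, Wv_heads: lists of per-head (d_model, d_k) matrices.
    Wo: (num_heads * d_k, d_model) output matrix.
    """
    heads = []
    for Wq, Wk, Wv in zip(Wq_heads, Wk_heads, Wv_heads):
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        # scaled_dot_product_attention is the function defined in the earlier sketch.
        heads.append(scaled_dot_product_attention(Q, K, V))  # each (n, d_k)
    concat = np.concatenate(heads, axis=-1)   # (n, num_heads * d_k) = (n, 512)
    return concat @ Wo                        # (n, d_model): back to the model width
```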
Below is the diagram that shows all the steps discussed above:
(source: http://jalammar.github.io/illustrated-transformer/)
Positional encoding:
Remember in the "First comes the input" section I mentioned positional encoding; let's see what it is and how it helps. The problem with our current, otherwise awesome Transformer is that it does not take the position of the input words into account. Unlike an RNN, where the timesteps tell us which word comes before and after, in a Transformer the words are fed in simultaneously, so we need some kind of positional encoding that defines which word comes after which. Positional encoding comes to our rescue because it gives the input embeddings a sense of position: we first generate a position embedding for each of the input words, and these position embeddings are then added to the word embeddings of the respective words to produce embeddings with a time signal.
There were many proposed methods for generating the positional embeddings, such as one-hot encoded vectors or binary encoding, but what the researchers found to work best was using the equations below to generate the embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))
where pos is the position of the word in the sentence and i indexes the embedding dimension. When we plot the 128-dimensional positional encodings for a sentence with a maximum length of 50, each row is the encoding vector for one position, and the matrix looks like a striped pattern of interleaved sine and cosine waves.
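A minimal numpy sketch of these formulas, generating the same 50 × 128 matrix described above; the result is simply added to the word embeddings:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: sine on even indices, cosine on odd ones."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) word positions
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2) dimension index
    angle = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions
    return pe

pe = positional_encoding(50, 128)                      # the matrix described above
# Adding position information to the word embeddings from the earlier sketch:
# x_with_position = x + positional_encoding(n, d_model)
```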
Residual connections:
Finally, there is one more improvement added to the encoders, known as residual connections or skip connections, which allow the output of a previous layer to bypass the layers in between. They help in deep networks with many hidden layers: if a layer in between is not of much use or is not learning much, the skip connection allows it to be bypassed. Another thing to note is that after the residual connection is added, the result is layer-normalized (the "Add & Norm" step).
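A sketch of this Add & Norm step wrapping a sub-layer (the learned scale and bias parameters of layer normalization are omitted for brevity):

```python
import numpy as np

def add_and_norm(x, sublayer_output, eps=1e-6):
    """Residual (skip) connection followed by a simplified layer normalization."""
    out = x + sublayer_output                 # the skip connection
    mean = out.mean(axis=-1, keepdims=True)
    std = out.std(axis=-1, keepdims=True)
    return (out - mean) / (std + eps)         # normalize each position's vector

# Inside one encoder block (sketch), using the functions from the earlier sketches:
#   z   = add_and_norm(x, multi_head_attention(x, Wq_heads, Wk_heads, Wv_heads, Wo))
#   out = add_and_norm(z, feed_forward(z, W1, b1, W2, b2))
```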
Decoder
A decoder is very similar to the encoder. Like the encoder, it has a self-attention layer and a feed-forward network, but it also has an additional block known as encoder-decoder attention sandwiched between the two. The encoder-decoder attention layer works just like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack. The remaining 2 layers work exactly the same as those in the encoder cell.
(image source)
The input to the decoder stack is sequential, unlike the simultaneous input to the encoder stack: the first output word is passed into the decoder as an input, using which it generates the second output; this output is then again passed as an input to the decoder, using which it generates the third output, and so on.
(image source)
The output of the decoder stack is passed into a linear layer with a softmax activation, from which the correct word is predicted.
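To make this inference loop concrete, here is a hedged sketch of greedy decoding. The decoder function, the vocabulary projection W_vocab, and the start/end token ids are hypothetical placeholders, not part of the original article or any real library API:

```python
import numpy as np

def greedy_decode(encoder_output, decoder, W_vocab, start_id, end_id, max_len=50):
    """Hypothetical greedy decoding loop: each prediction is fed back as input."""
    output_ids = [start_id]
    for _ in range(max_len):
        # `decoder` stands in for the whole decoder stack; it attends over the
        # tokens generated so far plus the encoder stack's output.
        dec_out = decoder(output_ids, encoder_output)   # (len(output_ids), d_model)
        logits = dec_out[-1] @ W_vocab                  # final linear layer over the vocabulary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                            # softmax over the vocabulary
        next_id = int(np.argmax(probs))                 # pick the most likely word
        output_ids.append(next_id)
        if next_id == end_id:                           # stop at the end-of-sentence token
            break
    return output_ids
```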
(source: http://jalammar.github.io/illustrated-transformer/)
Once the Transformer predicts a word using forward propagation, the prediction is compared with the actual label using a loss function such as cross-entropy, and then all the trainable parameters are updated using back-propagation. This is one simplified way of understanding how learning happens in Transformers; there are variations, such as taking the complete output sentence into account when calculating the loss. To know more, you can check out this amazing blog on the Transformer by Jay Alammar.
With this, we have come to the end of this blog. Hope the read was pleasant. I would like to thank all the creators of the awesome content I referred to while writing this blog.
Reference links:
Applied AI Course: https://www.appliedaicourse.com/
https://arxiv.org/abs/1706.03762
http://jalammar.github.io/illustrated-transformer/
http://primo.ai/index.php?title=Transformer
https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
Final note
Thank you for reading the blog. I hope it was useful for some of you aspiring to do projects or learn some new concepts in NLP.
In part 2/3 we will go through BERT (Bidirectional Encoder Representations from Transformers).
In part 3/3 we will go through a hands-on Kaggle challenge, Google QUEST Q&A Labeling, to see Transformers in action (top 4.4% on the leaderboard).
Find me on LinkedIn: www.linkedin.com/in/sarthak-vajpayee
Peace!