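Transformer Paper Reading (1): Attention Is All You Need
Highlight legend: orange = purpose, conclusions, advantages; magenta = breakthrough content or conclusions, or knowledge I urgently need; red = especially important content; yellow = important content; green = problems; blue = solutions; gray = unverified personal doubts or assumptions, or outdated, unimportant marked content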
Three core ideas of the Transformer:
- Attention mechanisms model dependencies without regard to their distance in the input and output sequences.
--> dependencies: recurrent models align positions with computation steps, which limits parallelization, and patch-up tricks do not remove that limit ==> attention mechanism ==> Transformer, which allows more parallelization
- The Transformer relies entirely on self-attention.
- The number of operations needed to relate any two input or output positions is reduced to a constant.
--> In earlier models, relating signals from two positions takes more operations the farther apart the positions are, so use the Transformer ==> effective resolution is reduced ==> use Multi-Head Attention
Contents
Abstract
1. Introduction
1.1 background
1.2 Recurrent models
1.3 Attention mechanisms
1.4 Transformer
2. Background
2.1 self-attention
2.2 End-to-End memory networks
2.3 Transformer
3. Model Architecture
Encoder-Decoder Structure
3.1 Encoder and Decoder Stacks
3.2 Attention
3.2.1 Scaled Dot-Product Attention
3.2.2 Multi-Head Attention
3.2.3 Applications of Attention in our Model
3.3 Position-wise Feed-Forward Networks
3.4 Embeddings and Softmax
3.5 positional encoding
4. Why Self-Attention
reasons for choosing self-attention
5. Training
5.4 Regularization
6. Results
7. conclusions
7.2 future research of Transformer
Reference
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
The best performing models also connect the encoder and decoder through an attention mechanism.
The paper proposes a new, simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
1. Introduction
1.1 background
Recurrent neural networks (RNNs), LSTMs and gated recurrent neural networks in particular, have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation.
Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
1.2 Recurrent models
Recurrent models typically factor computation along the symbol positions of the input and output sequences.
Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t.
problem: This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
solve: Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in the case of the latter.
problem: The fundamental constraint of sequential computation, however, remains.
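The sequential constraint is easier to see in code. The sketch below is a plain tanh RNN written with NumPy, not anything from the paper; the function name `rnn_hidden_states`, the weight names, and the sizes are all assumptions chosen for illustration. The loop body for step t needs the hidden state from step t-1, so the steps of one sequence cannot run in parallel.

```python
import numpy as np

def rnn_hidden_states(x, w_h, w_x, h0):
    """Return the hidden states h_1..h_n of a plain tanh RNN (illustrative only)."""
    h = h0
    states = []
    for x_t in x:
        # h_t = f(h_{t-1}, x_t): this step cannot start before the previous one ends
        h = np.tanh(h @ w_h + x_t @ w_x)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n, d_in, d_h = 5, 3, 4                      # hypothetical sizes
x = rng.normal(size=(n, d_in))              # one input sequence of length n
w_h = rng.normal(size=(d_h, d_h))           # hidden-to-hidden weights
w_x = rng.normal(size=(d_in, d_h))          # input-to-hidden weights
states = rnn_hidden_states(x, w_h, w_x, np.zeros(d_h))
print(states.shape)
```

Batching across examples is still possible here; what the loop rules out is parallelism along the sequence dimension of a single example.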
1.3 Attention mechanisms
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.
In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
1.4 Transformer
The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output.
It allows for significantly more parallelization and can reach a new state of the art in translation quality.
2. Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet, and ConvS2S, all of which use convolutional neural networks as the basic building block, computing hidden representations in parallel for all input and output positions.
problem: In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.
This makes it more difficult to learn dependencies between distant positions.
solve: In the Transformer this is reduced to a constant number of operations.
problem: This comes at the cost of reduced effective resolution due to averaging attention-weighted positions.
solve: The paper counteracts this effect with Multi-Head Attention, described in section 3.2.
2.1 self-attention
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
2.2 End-to-End memory networks
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence.
2.3 Transformer
The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.
3. Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure.
Encoder-Decoder Structure
- encoder: an input sequence of symbol representations (x_1, ..., x_n) --> mapped to a sequence of continuous representations z = (z_1, ..., z_n)
- decoder: given z --> generates an output sequence (y_1, ..., y_m) of symbols, one element at a time
- at each step, the model is auto-regressive --> it consumes the previously generated symbols as additional input when generating the next
Figure 1: Transformer model architecture
Further reading:
Questions and answers about the Transformer model: https://www.jianshu.com/p/4064217e1c19
3.1 Encoder and Decoder Stacks
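Encoder: The encoder is composed of a stack of N = 6 identical layers.
==> The 6 layers are connected in series, not in parallel: each layer takes the output of the previous layer as its input.
Reference: https://www.zhihu.com/question/344516091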
Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1].
That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
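The expression LayerNorm(x + Sublayer(x)) can be sketched in a few lines of NumPy. This is a simplified illustration rather than the paper's implementation: the learned gain and bias of layer normalization are left out, and `layer_norm`, `sublayer_connection`, and the toy linear sub-layer are hypothetical names introduced here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position over the feature axis (learned gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d_model = 512                                   # model dimension from the paper
x = rng.normal(size=(10, d_model))              # 10 positions
w = rng.normal(size=(d_model, d_model)) * 0.01  # stand-in for a real sub-layer
out = sublayer_connection(x, lambda t: t @ w)
print(out.shape)
```

The residual sum x + Sublayer(x) only works when both terms have the same shape, which is why every sub-layer keeps the output dimension at d_model.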
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.
This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2).
The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
--> Dot product vs. Hadamard product
Dot product: multiply two vectors of the same length element by element and sum the results, giving a scalar: $a \cdot b = \sum_i a_i b_i$
Hadamard product: multiply two matrices of the same dimensions element by element, giving another matrix of the same dimensions: $(A \circ B)_{ij} = A_{ij} B_{ij}$
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The two most commonly used attention functions are additive attention [2] and dot-product (multiplicative) attention. They differ as follows:
- Dot-product attention is identical to our algorithm, except for the scaling factor of $1/\sqrt{d_k}$.
- Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.
While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
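The matrix form softmax(QK^T / sqrt(d_k)) V maps directly onto two matrix multiplications and a row-wise softmax. The following is a minimal NumPy sketch under that reading; the function names and the toy shapes are assumptions, and no masking or batching is included.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # compatibility of every query with every key
    weights = softmax(scores)         # each row sums to 1
    return weights @ v, weights       # weighted sum of the values

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))           # 3 queries, d_k = 8 (illustrative)
k = rng.normal(size=(5, 8))           # 5 keys
v = rng.normal(size=(5, 16))          # 5 values, d_v = 16
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)
```

Everything here is a dense matrix product, which is the practical reason the notes give for preferring dot-product attention over additive attention.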
3.2.2 Multi-Head Attention
problem: Instead of performing a single attention function with d_model-dimensional keys, values and queries,
solve: we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively.
On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
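The project, attend, concatenate, project sequence can be sketched as follows. This is an illustrative NumPy version, not the paper's code: the weights are random rather than learned, the names `w_q`, `w_k`, `w_v`, `w_o` are assumptions, and the sizes h = 8 and d_k = d_v = 64 are the values the paper uses with d_model = 512.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, w_o):
    """w_q, w_k: (h, d_model, d_k); w_v: (h, d_model, d_v); w_o: (h * d_v, d_model)."""
    q = x_q @ w_q                                  # (h, n_q, d_k), one projection per head
    k = x_kv @ w_k                                 # (h, n_kv, d_k)
    v = x_kv @ w_v                                 # (h, n_kv, d_v)
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    heads = softmax(scores) @ v                    # (h, n_q, d_v), heads computed in parallel
    concat = np.concatenate(list(heads), axis=-1)  # (n_q, h * d_v)
    return concat @ w_o                            # final projection back to d_model

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h                           # 64
w_q = rng.normal(size=(h, d_model, d_k)) * 0.02
w_k = rng.normal(size=(h, d_model, d_k)) * 0.02
w_v = rng.normal(size=(h, d_model, d_v)) * 0.02
w_o = rng.normal(size=(h * d_v, d_model)) * 0.02
x = rng.normal(size=(6, d_model))                  # 6 positions, used for self-attention
out = multi_head_attention(x, x, w_q, w_k, w_v, w_o)
print(out.shape)
```

Because each head works in a d_model / h dimensional subspace, the total cost stays close to that of a single head with full dimensionality.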
3.2.3 Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
- In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
- The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
- Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
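The decoder mask can be sketched as a lower-triangular boolean matrix applied to the attention scores before the softmax. The helper names `causal_mask` and `masked_softmax` are hypothetical, and the all-zero scores are only there to make the resulting weights easy to read.

```python
import numpy as np

def causal_mask(n):
    """mask[i, j] is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Set illegal connections to -inf before the softmax so their weight is exactly 0."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # equal scores for every query/key pair
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

With equal scores, row i spreads its weight evenly over positions 0..i and gives zero weight to every later position. The diagonal is always allowed, so no row is fully masked and the softmax stays well defined.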
3.3 Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
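Two linear transformations with a ReLU in between correspond to FFN(x) = max(0, x W_1 + b_1) W_2 + b_2 in the paper. Below is a small NumPy sketch of that formula; the sizes are shrunk for illustration (the paper uses d_model = 512 and an inner dimension of 2048), and the weights are random placeholders.

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (position) of x."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                    # illustrative sizes
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(5, d_model))         # 5 positions
out = position_wise_ffn(x, w1, b1, w2, b2)
print(out.shape)
```

"Separately and identically" means the same weights are used at every position and no position sees another: running the function on a single row gives the same result as taking that row from the full output.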
3.4 Embeddings and Softmax
The model uses learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model.
It also uses the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.
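One way to picture the shared weight matrix is a single array used both for the embedding lookup and, transposed, for the pre-softmax projection. The sketch below is an assumption-laden illustration: `shared`, `embed`, and `next_token_probs` are hypothetical names, the matrix is random rather than learned, and the vocabulary and model sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16                    # illustrative sizes
shared = rng.normal(size=(vocab_size, d_model))  # one matrix for both uses

def embed(token_ids):
    """Embedding lookup, scaled by sqrt(d_model) as described in the notes."""
    return shared[token_ids] * np.sqrt(d_model)

def next_token_probs(decoder_output):
    """Pre-softmax linear transformation with the same (transposed) matrix, then softmax."""
    logits = decoder_output @ shared.T
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

vectors = embed(np.array([1, 2, 3]))
probs = next_token_probs(rng.normal(size=(3, d_model)))
print(vectors.shape, probs.shape)
```

Sharing the matrix means the model stores one vocab_size x d_model table instead of three.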
3.5 positional encoding
problem: Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence,
solve: we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.
The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
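As one example of a fixed encoding, the paper uses sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). The NumPy sketch below follows that formula; the function name and the toy sizes are assumptions, and it expects an even d_model.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of fixed sinusoidal encodings (d_model even)."""
    pos = np.arange(max_len)[:, None]                 # positions 0..max_len-1
    even_dims = np.arange(0, d_model, 2)[None, :]     # the indices written 2i in the formula
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
embeddings = np.zeros((50, 16))                       # placeholder token embeddings
inputs = embeddings + pe                              # same dimension, so they can be summed
print(inputs.shape)
```

The table depends only on position and dimension, so it can be computed once and reused for every sequence.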
4. Why Self-Attention
Motivating our use of self-attention we consider three desiderata:
- One is the total computational complexity per layer.
- Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
- The third is the path length between long-range dependencies in the network.
problem: Learning long-range dependencies is a key challenge in many sequence transduction tasks.
analysis: One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.
The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12].
Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
reasons for choosing self-attention
- In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d.
A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
- To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.
This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.
- Even with kernel width k = n, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
- Self-attention could yield more interpretable models.
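The "n smaller than d" condition follows from the per-layer complexities listed in the paper: O(n^2 · d) for self-attention and O(n · d^2) for a recurrent layer. The sketch below simply evaluates those two expressions with constant factors ignored; the function names and the chosen sequence lengths are assumptions for illustration.

```python
def self_attention_ops(n, d):
    """Per-layer cost of self-attention, up to a constant: n^2 * d."""
    return n * n * d

def recurrent_ops(n, d):
    """Per-layer cost of a recurrent layer, up to a constant: n * d^2."""
    return n * d * d

d = 512                                   # representation dimensionality
for n in (50, 512, 2000):                 # hypothetical sequence lengths
    cheaper = self_attention_ops(n, d) < recurrent_ops(n, d)
    print(f"n={n}: self-attention cheaper than recurrence? {cheaper}")
```

The ratio of the two costs is n / d, so self-attention wins exactly when n < d. For sentence-level translation with word-piece or byte-pair tokens, the paper notes this is most often the case.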
5. Training
5.4 Regularization
- Residual Dropout
We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized.
In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.
- Label Smoothing
During training, we employed label smoothing of value ε_ls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
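Label smoothing can be illustrated with the common uniform formulation, where the one-hot target keeps 1 - ε of its mass and the remaining ε is spread evenly over the vocabulary. This is an assumption about the exact variant: the notes only give the value 0.1, and the reference implementation may distribute the mass slightly differently. The names `smooth_labels` and `cross_entropy` are hypothetical.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """Uniform label smoothing: (1 - eps) * one_hot + eps / vocab_size."""
    one_hot = np.eye(vocab_size)[target_ids]
    return one_hot * (1.0 - eps) + eps / vocab_size

def cross_entropy(probs, targets):
    """Mean cross-entropy between predicted probabilities and (soft) targets."""
    return float(-(targets * np.log(probs)).sum(axis=-1).mean())

vocab_size = 5                                     # toy vocabulary
targets = smooth_labels(np.array([2, 0]), vocab_size)
confident = np.full((2, vocab_size), 0.001)
confident[0, 2] = confident[1, 0] = 0.996          # near one-hot predictions
print(targets[0])
print(cross_entropy(confident, targets))
```

With smoothed targets, a near one-hot prediction is still penalized for the small mass it assigns to the other tokens, which matches the note that the model "learns to be more unsure".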
6. Results
Bigger models are better, and dropout is very helpful in avoiding over-fitting.
7. conclusions
7.2 future research of Transformer
Reference
The code used to train and evaluate the paper's models is available at https://github.com/tensorflow/tensor2tensor.