Batch Normalization: A Detailed Reading of the Original Paper
This post has two parts:
one is the rough description of BN (short for Batch Normalization) given in the book [3],
and the other is the complete description in the original paper [1].
####################First, the book [3]############################################
Batch Normalization was first proposed in [1]; the original paper does not include the figure shown here.
Let us first summarize the gist of Batch Normalization as presented in [3]:
(The figure above is from [3].)
[1] and [2] discuss whether Batch Normalization should be inserted before or after the activation function; a small sketch contrasting the two placements follows the quoted passage below.
[2] contains a passage pointedly noting that [1] does not make the insertion position clear; it is excerpted here:
5.2.1 WHERE TO PUT BN – BEFORE OR AFTER NON-LINEARITY?
It is not clear from the paper Ioffe & Szegedy (2015) where to put the batch-normalization layer – before input of each layer, as stated in Section 3.1, or before non-linearity, as stated in Section 3.2 – so we have conducted an experiment with FitNet4 on CIFAR-10 to clarify this. Results are shown in Table 5.
Exact numbers vary from run to run, but in the most cases, batch normalization put after non-linearity performs better.
In the next experiment we compare BN-FitNet4, initialized with Xavier and LSUV-initialized FitNet4. Batch-normalization reduces training time in terms of needed number of iterations, but each iteration becomes slower because of extra computations. The accuracy versus wall-clock-time graphs are shown in Figure 3.
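To make the two placements concrete, here is a minimal NumPy sketch of "BN before the nonlinearity" versus "BN after the nonlinearity" for a single fully connected layer (my own illustration with made-up shapes, not code from [1], [2], or [3]):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 100))          # mini-batch of 32 examples
W = rng.normal(size=(100, 50)) * 0.1
gamma, beta = np.ones(50), np.zeros(50)

# Placement described in [1], Sec. 3.2: normalize the pre-activation Wx, then apply the nonlinearity.
y_bn_before = relu(batch_norm(x @ W, gamma, beta))

# Placement favored by the experiment in [2]: apply the nonlinearity first, then normalize.
y_bn_after = batch_norm(relu(x @ W), gamma, beta)
```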
Note:
The code accompanying [3] does not implement the Batch Normalization transformation for convolutional layers described in [1].
The first Batch Normalization algorithm (this is Algorithm 1 in [1]):
In plain words, what does it mean?
The Batch Norm layer in the figure above takes the $m$ examples of a mini-batch as input and produces outputs $y_i$;
the mapping from the mini-batch to $y_i$ is exactly what Algorithm 1 above computes.
Note that the book [3] does not discuss the backward pass of BN and its code contains no such implementation; it also does not implement the BN transformation for convolutional layers.
For the complete Batch Normalization code you still have to read the TensorFlow source.
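Until you do read that source, the following minimal NumPy sketch shows what Algorithm 1's forward pass computes (my own illustration; the function and variable names are mine, not from [1] or [3]):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Algorithm 1 of [1] for a mini-batch x of shape (m, d): one mean/variance per feature."""
    mu_B = x.mean(axis=0)                       # mini-batch mean
    var_B = x.var(axis=0)                       # mini-batch variance (the biased 1/m version)
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)   # normalize
    y = gamma * x_hat + beta                    # scale and shift with the learned gamma, beta
    cache = (x, x_hat, mu_B, var_B, gamma, eps) # kept for the backward pass sketched later
    return y, cache

# Usage: a mini-batch of m = 4 examples with d = 3 features.
x = np.array([[1., 2., 3.], [2., 4., 6.], [0., 1., 2.], [3., 3., 3.]])
y, _ = bn_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))            # ~0 mean and ~1 variance per feature
```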
#######################Next, the structure of the original paper [1]#############################
Paper structure:
$$\text{Batch Normalization} = \begin{cases} \text{Abstract} \\ \text{1. Introduction} \\ \text{2. Towards Reducing Internal Covariate Shift} \\ \text{3. Normalization via Mini-Batch Statistics} \\ \text{4. Experiments} \end{cases}$$
∴ the key parts of the paper are Sections 2 and 3.
#########What does the Introduction cover?####################
Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters Θ of the network, so as to minimize the loss
$$\Theta = \arg\min_{\Theta} \frac{1}{N}\sum_{i=1}^{N} \ell(x_i, \Theta)$$
where $x_{1 \ldots N}$ is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch $x_{1 \ldots m}$ of size $m$. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing
$$\frac{1}{m}\frac{\partial \ell(x_i, \Theta)}{\partial \Theta}$$
(Here $\Theta$ stands for the weights of the network. Most of the text above is standard preamble.)
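These are standard formulas, but a tiny sketch makes them concrete (entirely illustrative: the quadratic per-example loss $\ell(x, \Theta) = \frac{1}{2}(\Theta - x)^2$ below is made up):

```python
import numpy as np

def loss_grad(theta, x_batch):
    """Gradient of the made-up per-example loss l(x, theta) = 0.5 * (theta - x)^2."""
    return theta - x_batch

theta = 5.0
alpha = 0.1                                                   # learning rate
data = np.random.default_rng(1).normal(loc=2.0, size=1000)    # training set x_{1..N}

for step in range(100):
    batch = np.random.default_rng(step).choice(data, size=32)  # mini-batch x_{1..m}, m = 32
    theta -= alpha * loss_grad(theta, batch).mean()             # (1/m) * sum of per-example gradients
print(theta)   # approaches the minimizer of the average loss (about 2.0)
```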
Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by the modern computing platforms. (In other words, the gradient estimate gets better as more data is used per update, while mini-batches keep each update fast. This paragraph is mostly a review of basics.)
While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.
(The last sentence is the key point: small changes in the shallow layers become large changes in the deep layers.)
The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift (this is where the paper first brings up covariate shift) can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing
$$\ell = F_2(F_1(u, \Theta_1), \Theta_2)$$
where $F_1$ and $F_2$ are arbitrary transformations (in practice, think of each as a layer: an affine map followed by an activation), and the parameters $\Theta_1, \Theta_2$ are to be learned so as to minimize the loss $\ell$. Learning $\Theta_2$ can be viewed as if the inputs $x = F_1(u, \Theta_1)$ are fed into the sub-network
$$\ell = F_2(x, \Theta_2)$$
For example, a gradient descent step
$$\Theta_2 \leftarrow \Theta_2 - \frac{\alpha}{m}\sum_{i=1}^{m} \frac{\partial F_2(x_i, \Theta_2)}{\partial \Theta_2}$$
(for batch size $m$ and learning rate $\alpha$) is exactly equivalent to that for a stand-alone network $F_2$ with input $x$. Therefore, the input distribution properties that make training more efficient – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such, it is advantageous for the distribution of $x$ to remain fixed over time. Then, $\Theta_2$ does not have to readjust to compensate for the change in the distribution of $x$.
(This part is still reviewing basics – how the weights are updated. $F_1$ and $F_2$ above are the transformations computed by the two sub-networks, i.e. the layers.)
Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function $z = g(Wu + b)$ where $u$ is the layer input, the weight matrix $W$ and bias vector $b$ are the layer parameters to be learned, and $g(x) = \frac{1}{1+\exp(-x)}$. As $|x|$ increases, $g'(x)$ tends to zero. This means that for all dimensions of $x = Wu + b$ except those with small absolute values, the gradient flowing down to $u$ will vanish and the model will train slowly. However, since $x$ is affected by $W$, $b$ and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of $x$ into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010), $\mathrm{ReLU}(x) = \max(x, 0)$, careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
The paragraph above says: during training, keep the inputs out of the activation function's saturation region; once they enter it, the gradients become tiny, it is hard to move back out, and training slows down.
In other words, stable weights can mean one of two things: 1. the network has converged, or 2. the gradients have vanished.
For the concepts of exploding and vanishing gradients, see [7], which explains them very well.
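A quick numeric illustration of the saturation argument (my own sketch): the sigmoid derivative $g'(x) = g(x)(1 - g(x))$ collapses toward zero as $|x|$ grows, which is exactly the vanishing-gradient regime described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"|x| = {x:4.1f}   g'(x) = {sigmoid_grad(x):.6f}")
# |x| =  0.0   g'(x) = 0.250000
# |x| =  2.0   g'(x) = 0.104994
# |x| =  5.0   g'(x) = 0.006648
# |x| = 10.0   g'(x) = 0.000045
```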
We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. (This is where the paper first introduces the term internal covariate shift.) Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
Note: the difference between covariate shift and internal covariate shift is that the former refers to the learning system as a whole, while the latter applies to its parts, i.e. the individual layers (and different kinds of layers are handled differently).
This paragraph lists a pile of benefits, but most of them were discovered by the authors along the way rather than planned from the start.
In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve the top-5 error rate that improves upon the best known results on ImageNet classification.
The key takeaway: with BN, matching the baseline's performance takes only 7% of the original number of training steps.
###################End of Introduction######################################
#################Section 2 begins####################
This section describes an exploration the authors went through: what if the normalization is applied outside of the gradient descent step, i.e. the gradient computation ignores it? The authors argue this is a bad idea, because the parameter update gets cancelled and the work is wasted. The analysis goes as follows:
For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data:
$$\hat{x} = x - E[x]$$
where $x = u + b$, $\mathcal{X} = \{x_{1 \ldots N}\}$ is the set of values of $x$ over the training set, and $E[x] = \frac{1}{N}\sum_{i=1}^{N} x_i$.
If a gradient descent step ignores the dependence of $E[x]$ on $b$, then it will update $b \leftarrow b + \Delta b$, where $\Delta b \propto -\partial\ell/\partial\hat{x}$. Then
$$u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b]$$
Thus, the combination of the update to b and subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss.
What does this mean? Suppose the normalization sits outside the gradient step: after backprop every $b$ grows by $\Delta b$, so every $E[u+b]$ also grows by $\Delta b$; as soon as the mean is subtracted, the two $\Delta b$'s cancel, and the backprop update has accomplished nothing.
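A tiny numeric check of this cancellation (my own sketch, with made-up numbers): adding $\Delta b$ to the bias changes nothing once the mean is subtracted.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0, 4.0])      # layer input over a batch
b, delta_b = 0.5, 10.0                   # bias and the update backprop would apply

x_before = u + b
x_after = u + (b + delta_b)

# Mean subtraction (the normalization applied outside the gradient step):
print(x_before - x_before.mean())        # [-1.5 -0.5  0.5  1.5]
print(x_after - x_after.mean())          # [-1.5 -0.5  0.5  1.5]  -> identical: delta_b is cancelled
```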
A small aside on terminology in this paper:
inference step: the forward pass – at inference time the network has to be evaluated from input to output to produce a prediction.
gradient step: the backpropagation / parameter-update phase.
One weakness of the paper is that such terms are not used consistently throughout.
####################End of Section 2#################
###################Section 3################################
Section 3 presents Algorithm 1 shown above, plus how BN is handled during backpropagation.
Section 3.1 presents Algorithm 2, which is essentially a description of how the whole network operates once Algorithm 1 has been plugged in.
Section 3.2 covers BN for convolutional layers (only briefly, in plain text – see the sketch right after this outline).
Sections 3.3 and 3.4 discuss the benefits of BN.
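Section 3.2 only states the rule in words: for convolutional layers, normalize jointly over the batch and the spatial locations, with one pair of $\gamma$, $\beta$ per feature map (channel). Since neither [3] nor this post has shown it in code, here is a minimal NumPy sketch of that rule (my own illustration, not the paper's or TensorFlow's code):

```python
import numpy as np

def bn_conv_forward(x, gamma, beta, eps=1e-5):
    """x has shape (N, C, H, W); statistics are taken over N, H, W for each channel C,
    so the 'effective mini-batch' per channel has m' = N * H * W values (Sec. 3.2 of [1])."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)            # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.default_rng(0).normal(size=(8, 16, 32, 32))    # N=8 images, C=16 channels
y = bn_conv_forward(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=(0, 2, 3))[:3], y.var(axis=(0, 2, 3))[:3]) # each channel ~0 mean, ~1 variance
```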
Section 3.1 is covered further below; here let us focus on how Section 3 handles backpropagation.
$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma$$
$$\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^{2}} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_{\mathcal{B}}) \cdot \frac{-1}{2}\left(\sigma_{\mathcal{B}}^{2} + \epsilon\right)^{-3/2}$$
$$\frac{\partial \ell}{\partial \mu_{\mathcal{B}}} = \left(\sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}\right) + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^{2}} \cdot \frac{\sum_{i=1}^{m} -2(x_i - \mu_{\mathcal{B}})}{m}$$
$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}} + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^{2}} \cdot \frac{2(x_i - \mu_{\mathcal{B}})}{m} + \frac{\partial \ell}{\partial \mu_{\mathcal{B}}} \cdot \frac{1}{m}$$
$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i$$
$$\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}$$
Although the paper gives the derivatives of the loss $\ell$ with respect to $\gamma$ and $\beta$, it does not spell out the update rule for $\gamma$ and $\beta$ (in practice they are simply trained by gradient descent like any other parameter). The paper comes from Google and TensorFlow is Google's framework, so for the concrete details you still have to read the TensorFlow source.
The derivations above follow a clear pattern: each line treats the partial derivatives obtained in the previous lines as known constants and reuses them in its own chain-rule step.
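To make that chain rule concrete, the following is a direct NumPy transcription of the six equations (a sketch of my own, reusing the cache produced by the `bn_forward` sketch given earlier; in a real framework autodiff produces these gradients for you). $\gamma$ and $\beta$ are then updated by ordinary gradient descent, e.g. $\gamma \leftarrow \gamma - \alpha\,\partial\ell/\partial\gamma$.

```python
import numpy as np

def bn_backward(dy, cache):
    """Backward pass of BN, transcribing the six equations above.
    dy has shape (m, d) and holds dl/dy_i; cache = (x, x_hat, mu_B, var_B, gamma, eps)."""
    x, x_hat, mu_B, var_B, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var_B + eps)

    dx_hat = dy * gamma                                                      # dl/dx_hat_i
    dvar = np.sum(dx_hat * (x - mu_B) * (-0.5) * inv_std**3, axis=0)         # dl/dsigma_B^2
    dmu = np.sum(-dx_hat * inv_std, axis=0) \
          + dvar * np.sum(-2.0 * (x - mu_B), axis=0) / m                     # dl/dmu_B
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu_B) / m + dmu / m            # dl/dx_i
    dgamma = np.sum(dy * x_hat, axis=0)                                      # dl/dgamma
    dbeta = np.sum(dy, axis=0)                                               # dl/dbeta
    return dx, dgamma, dbeta
```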
###################End of Section 3################################
The rest of the paper is experiments (omitted here).
############################################################
The idea of the paper is to take the whitening operation from [5][6] and apply it inside the layers of a neural network.
The paper proposes two algorithms. Algorithm 1 is the one shown above; Algorithm 2 is the following:
First, where does the factor $\frac{m}{m-1}$ above come from?
It arises because Algorithm 2 uses the unbiased estimate of the variance.
From basic probability and statistics, the sample variance with the $\frac{1}{m-1}$ factor is an unbiased estimator of the population variance:
$$\because \quad E\!\left[\frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})^2\right] = \mathrm{Var}[x]$$
The batch variance computed in Algorithm 1 uses the $\frac{1}{m}$ factor instead, so its expectation over mini-batches is biased low:
$$E_{\mathcal{B}}\left[\sigma_{\mathcal{B}}^{2}\right] = E_{\mathcal{B}}\!\left[\frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_{\mathcal{B}})^2\right] = \frac{m-1}{m}\,\mathrm{Var}[x]$$
Here $m$ is the number of examples in a mini-batch.
At inference time we need an unbiased estimate of the population variance $\mathrm{Var}[x]$, not the biased per-batch statistic, so Algorithm 2 rescales:
$$\therefore \quad \mathrm{Var}[x] = \frac{m}{m-1}\,E_{\mathcal{B}}\left[\sigma_{\mathcal{B}}^{2}\right]$$
Note:
The paper does not make clear how $\gamma$ and $\beta$ in Algorithm 2 are updated.
Algorithm 2 calls Algorithm 1.
The structure of Algorithm 2 corresponds to the network diagram shown earlier.
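For completeness, here is a minimal sketch of what Algorithm 2 does at inference time, using the $\frac{m}{m-1}$ correction derived above (my own illustration; the real TensorFlow implementation typically maintains running/moving averages instead of storing all per-batch statistics):

```python
import numpy as np

def bn_inference_stats(batch_means, batch_vars, m):
    """Population statistics estimated from per-batch statistics, as in Algorithm 2 of [1]:
    E[x] = E_B[mu_B],  Var[x] = m/(m-1) * E_B[sigma_B^2]."""
    E_x = np.mean(batch_means, axis=0)
    Var_x = m / (m - 1) * np.mean(batch_vars, axis=0)
    return E_x, Var_x

def bn_inference(x, E_x, Var_x, gamma, beta, eps=1e-5):
    """The fixed affine transform applied at test time:
    y = gamma / sqrt(Var[x] + eps) * x + (beta - gamma * E[x] / sqrt(Var[x] + eps))."""
    scale = gamma / np.sqrt(Var_x + eps)
    return scale * x + (beta - scale * E_x)

# Usage with statistics collected from two mini-batches of m = 4 examples each (made-up numbers):
E_x, Var_x = bn_inference_stats(batch_means=np.array([[0.1], [0.3]]),
                                batch_vars=np.array([[0.9], [1.1]]), m=4)
y = bn_inference(np.array([[0.5], [2.0]]), E_x, Var_x, gamma=np.ones(1), beta=np.zeros(1))
```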
######################################################################
Finally, for interview purposes, what are the benefits of BN? Summarized from [8]:
① It greatly speeds up training and makes convergence much faster;
② it also improves classification accuracy – one explanation is that it acts as a Dropout-like regularizer against overfitting, so comparable results can be obtained even without Dropout;
③ in addition, hyper-parameter tuning becomes much easier: initialization is far less critical and larger learning rates can be used.
######################################################################
References:
[1] Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
[2] Dmytro Mishkin, Jiri Matas. All You Need Is a Good Init.
[3] 深度学习入门——基于Python的理论与实现 (Deep Learning from Scratch).
[4] Understanding the Backward Pass Through Batch Normalization Layer.
[5] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, et al. Efficient BackProp.
[6] Simon Wiesler, Hermann Ney. A Convergence Analysis of Log-Linear Training.
[7] 深度学习中 Batch Normalization为什么效果好?
[8] 【深度学习】深入理解Batch Normalization批标准化