Paper: Interpreting "Adaptive Gradient Methods With Dynamic Bound Of Learning Rate", the AdaBound neural network optimization algorithm proposed by Chinese undergraduates
Contents
Highlights
Paper Overview
Experimental Results
1. FEEDFORWARD NEURAL NETWORK
2. CONVOLUTIONAL NEURAL NETWORK
3. RECURRENT NEURAL NETWORK
Analysis of Results
"Adaptive Gradient Methods With Dynamic Bound Of Learning Rate"
Paper: https://openreview.net/pdf?id=Bkg3g2R9FX
OpenReview discussion: https://openreview.net/forum?id=Bkg3g2R9FX
GitHub: https://github.com/Luolc/AdaBound
Highlights
1. AdaBound makes rapid progress in the early stage of training.
2. AdaBound is not very sensitive to its hyperparameters, which saves a great deal of tuning time.
3. It is well suited to computer vision and NLP applications and can be used to build deep learning models for a variety of popular tasks.
We investigate existing adaptive algorithms and find that extremely large or small learning rates can result in poor convergence behavior. A rigorous proof of non-convergence for ADAM is provided to demonstrate the above problem.

Motivated by the strong generalization ability of SGD, we design a strategy to constrain the learning rates of ADAM and AMSGRAD to avoid a violent oscillation. Our proposed algorithms, ADABOUND and AMSBOUND, which employ dynamic bounds on their learning rates, achieve a smooth transition to SGD. They show great efficacy on several standard benchmarks while maintaining advantageous properties of adaptive methods such as rapid initial progress and hyperparameter insensitivity.
Paper Overview

Adaptive optimization methods such as ADAGRAD, RMSPROP, and ADAM have been proposed to achieve a fast training process with an element-wise scaling term on learning rates. Though prevailing, they generalize worse than SGD, or even fail to converge, due to unstable and extreme learning rates. Recent work has proposed algorithms such as AMSGRAD to tackle this issue, but without much improvement over existing methods. In our paper, we demonstrate that extreme learning rates can lead to poor performance. We provide new variants of ADAM and AMSGRAD, called ADABOUND and AMSBOUND respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD, and we give a theoretical proof of convergence. We further conduct experiments on various popular tasks and models, which is often insufficient in previous work. Experimental results show that the new variants can eliminate the generalization gap between adaptive methods and SGD while maintaining a higher learning speed early in training. Moreover, they can bring significant improvement over their prototypes, especially on complex deep networks. The implementation is available at https://github.com/Luolc/AdaBound.
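The core mechanism is a per-parameter clipping of the Adam-style step size between a lower and an upper bound that both converge to a single final learning rate, so the update gradually degenerates into SGD. Below is a minimal NumPy sketch of one such update; it omits Adam's bias correction, and the bound schedule follows the example form given in the paper (a final rate of 0.1 with convergence speed tied to 1 - beta2), so it should be read as an illustration rather than the exact released implementation.

```python
import numpy as np

def adabound_step(x, g, m, v, t, alpha=1e-3, final_lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified AdaBound update for a parameter array x.

    m and v are the exponential moving averages of the gradient and its
    square (as in Adam); t is the 1-based step counter.
    """
    m = beta1 * m + (1 - beta1) * g        # first moment (Adam)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (Adam)

    # Dynamic bounds: the lower bound starts near 0, the upper bound starts
    # very large, and both converge to final_lr as t grows, so the step
    # size smoothly approaches a plain SGD step.
    lower = final_lr * (1 - 1 / ((1 - beta2) * t + 1))
    upper = final_lr * (1 + 1 / ((1 - beta2) * t))

    # Clip the element-wise adaptive rate into [lower, upper], then step.
    step_size = np.clip(alpha / (np.sqrt(v) + eps), lower, upper)
    x = x - step_size * m
    return x, m, v
```

For practical use, the linked repository ships a PyTorch optimizer; its README shows usage along the lines of optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1). The released code additionally applies bias correction and exposes a tunable convergence-speed parameter.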
Experimental Results

In this section, we turn to an empirical study of different models to compare the new variants with popular optimization methods including SGD(M), ADAGRAD, ADAM, and AMSGRAD. We focus on three tasks: the MNIST image classification task (LeCun et al., 1998), the CIFAR-10 image classification task (Krizhevsky & Hinton, 2009), and the language modeling task on Penn Treebank (Marcus et al., 1993). We choose them due to their broad importance and the availability of their architectures for reproducibility. The setup for each task is detailed in Table 2. We run each experiment three times with the specified initialization method from random starting points. A fixed budget on the number of epochs is assigned for training, and the decay strategy is introduced in the following parts. We choose the settings that achieve the lowest training loss at the end.
1. FEEDFORWARD NEURAL NETWORK

We train a simple fully connected neural network with one hidden layer for the multiclass classification problem on the MNIST dataset. We run 100 epochs and omit the decay scheme for this experiment.

Figure 2 shows the learning curve for each optimization method on both the training and test sets. We find that for training, all algorithms can achieve accuracy approaching 100%. For the test part, SGD performs slightly better than the adaptive methods ADAM and AMSGRAD. Our two proposed methods, ADABOUND and AMSBOUND, display slight improvement, but compared with their prototypes there are still visible increases in test accuracy.
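As an illustration of the kind of setup being compared here, the sketch below builds a one-hidden-layer MNIST classifier in PyTorch and swaps in the optimizer under test. The hidden size, batch size, and optimizer hyperparameters are assumptions made for the example, not values reported in the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# One-hidden-layer fully connected classifier (hidden size is an assumption).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256),
                      nn.ReLU(), nn.Linear(256, 10))

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True)

# Swap in the optimizer under comparison, e.g. SGD(M), Adam, or AdaBound.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):              # fixed budget of 100 epochs, no decay
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```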
2. CONVOLUTIONAL NEURAL NETWORK

Using DenseNet-121 (Huang et al., 2017) and ResNet-34 (He et al., 2016), we then consider the task of image classification on the standard CIFAR-10 dataset. In this experiment, we employ a fixed budget of 200 epochs and reduce the learning rates by a factor of 10 after 150 epochs.

DenseNet: We first run a DenseNet-121 model on CIFAR-10 and our results are shown in Figure 3. We can see that adaptive methods such as ADAGRAD, ADAM and AMSGRAD appear to perform better than the non-adaptive ones early in training. But by epoch 150, when the learning rates are decayed, SGDM begins to outperform those adaptive methods. As for our methods, ADABOUND and AMSBOUND, they converge as fast as the adaptive ones and achieve slightly higher accuracy than SGDM on the test set at the end of training. In addition, compared with their prototypes, their performance is clearly enhanced, with approximately 2% improvement in test accuracy.

ResNet: Results for this experiment are reported in Figure 3. As expected, the overall performance of each algorithm on ResNet-34 is similar to that on DenseNet-121. ADABOUND and AMSBOUND even surpass SGDM by 1%. Despite the relatively poor generalization ability of adaptive methods, our proposed methods overcome this drawback by placing bounds on their learning rates, and they obtain almost the best test accuracy for both DenseNet and ResNet on CIFAR-10.
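The decay schedule described above (a fixed 200-epoch budget with the learning rate divided by 10 after epoch 150) maps directly onto a standard PyTorch step scheduler. The sketch below is illustrative only: the SGDM hyperparameters and the use of torchvision's ResNet-34 as a stand-in are assumptions, and a hypothetical train_one_epoch helper stands in for the training loop.

```python
import torch
from torchvision.models import resnet34

# Stand-in model: torchvision's ResNet-34 with a 10-class head. (CIFAR-10
# variants of ResNet usually also shrink the first conv layer; omitted here.)
model = resnet34(num_classes=10)

# Illustrative SGDM settings, not necessarily the paper's exact values.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Divide the learning rate by 10 once, after epoch 150.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150], gamma=0.1)

for epoch in range(200):                  # fixed budget of 200 epochs
    # train_one_epoch(model, optimizer)   # hypothetical training helper
    scheduler.step()
```

The repository's AdaBound optimizer is also a standard torch.optim.Optimizer, so the same scheduler pattern should apply when it replaces SGDM here.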
3. RECURRENT NEURAL NETWORK

Finally, we conduct an experiment on the language modeling task with a Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997). From the two experiments above, we observe that our methods show much more improvement in deep convolutional neural networks than in perceptrons. Therefore, we suppose that the enhancement is related to the complexity of the architecture and run three models with (L1) 1-layer, (L2) 2-layer and (L3) 3-layer LSTMs respectively. We train them on Penn Treebank, running for a fixed budget of 200 epochs. We use perplexity as the metric to evaluate performance and report results in Figure 4.

We find that in all models, ADAM has the fastest initial progress but stagnates at worse performance than SGD and our methods. Different from the phenomena in the previous experiments on the image classification tasks, ADABOUND and AMSBOUND do not display rapid speed at the early training stage, but their curves are smoother than that of SGD.

Comparing L1, L2 and L3, we can easily notice a distinct difference in the degree of improvement. In L1, the simplest model, our methods perform slightly better than ADAM (by 1.1%), while in L3, the most complex model, they show an evident improvement of over 2.8% in terms of perplexity. This serves as evidence for the relationship between the model's complexity and the degree of improvement.
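Since perplexity is the evaluation metric here, a small helper clarifies how it is computed: it is the exponential of the average per-token cross-entropy over the corpus. The sketch below assumes a model that maps token-id batches directly to logits; an actual LSTM language model would also carry hidden state between batches.

```python
import math
import torch
import torch.nn.functional as F

def evaluate_perplexity(model, data_loader):
    """Perplexity = exp(mean per-token cross-entropy) over the whole set."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for inputs, targets in data_loader:      # batches of token ids
            logits = model(inputs)               # (batch, seq_len, vocab)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1), reduction="sum")
            total_loss += loss.item()
            total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)
```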
Analysis of Results

To investigate the efficacy of our proposed algorithms, we select popular tasks from computer vision and natural language processing. Based on the results shown above, it is easy to find that ADAM and AMSGRAD usually perform similarly and the latter does not show much improvement in most cases. Their variants, ADABOUND and AMSBOUND, on the other hand, demonstrate a fast speed of convergence compared with SGD, while they also greatly exceed the two original methods with respect to test accuracy at the end of training. This phenomenon exactly confirms our view mentioned in Section 3 that both large and small learning rates can influence convergence.

Besides, we implement our experiments on models with different complexities, consisting of a perceptron, two deep convolutional neural networks and a recurrent neural network. The perceptron used on MNIST is the simplest, and our methods perform slightly better than the others. As for DenseNet and ResNet, obvious increases in test accuracy can be observed. We attribute this difference to the complexity of the models. Specifically, for deep CNN models, convolutional and fully connected layers play different parts in the task. Also, different convolutional layers are likely to be responsible for different roles (Lee et al., 2009), which may lead to distinct variation in the gradients of parameters. In other words, extreme learning rates (huge or tiny) may appear more frequently in complex models such as ResNet. As our algorithms are proposed to avoid them, the greater enhancement of performance in complex architectures can be explained intuitively. The higher degree of improvement on LSTMs with more layers in the language modeling task is also consistent with the above analysis.
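The claim about extreme per-parameter rates can be checked empirically. The helper below is not from the paper; it is an illustrative probe that reads the state of a PyTorch Adam optimizer after some training and reports the smallest and largest element-wise learning rates alpha / (sqrt(v_hat) + eps), which is where values far outside a sensible range would show up.

```python
import torch

def adam_rate_extremes(optimizer, eps=1e-8):
    """Min/max of the element-wise rates alpha / (sqrt(v_hat) + eps)
    implied by a torch.optim.Adam optimizer's current state."""
    rates = []
    for group in optimizer.param_groups:
        alpha, (_, beta2) = group["lr"], group["betas"]
        for p in group["params"]:
            state = optimizer.state.get(p)
            if not state or "exp_avg_sq" not in state:
                continue                     # parameter not yet updated
            v_hat = state["exp_avg_sq"] / (1 - beta2 ** state["step"])
            rates.append((alpha / (v_hat.sqrt() + eps)).flatten())
    all_rates = torch.cat(rates)
    return all_rates.min().item(), all_rates.max().item()
```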
PS: This write-up was done in a hurry, so the translation may not be perfect. If you spot any mistakes, please point them out. Thanks!