New Ways of Optimizing Gradient Descent
The new era of machine learning and artificial intelligence is the era of deep learning. It not only delivers remarkable accuracy but also has an enormous appetite for data. Using neural networks, functions of far greater complexity can be fitted to a given set of data points.
But there are a few specific techniques that make working with neural networks considerably more effective and insightful.
Xavier Initialization
Let us assume that we have trained a very deep neural network. For simplicity, the bias term is zero and the activation function is the identity.
Under these conditions, we can write the gradient descent update equations and express the predicted target variable in terms of the weights of all the layers and the input a[0].
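A plausible reconstruction of those expressions, assuming a zero bias and identity activations (the original figures are not reproduced here):

```latex
% Gradient descent update for the weights of layer l, with learning rate \alpha:
W^{[l]} := W^{[l]} - \alpha \,\frac{\partial J}{\partial W^{[l]}}

% With identity activations and zero bias, the prediction reduces to a
% product of all the layer weights applied to the input a^{[0]}:
\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} a^{[0]}
```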
For ease of understanding, let us consider all of these weights to be equal, i.e. W[l] = W for every layer except the last.
Here we treat the last layer's weight differently because it produces the output value, and in the case of binary classification it may be followed by a sigmoid or ReLU function.
When we substitute these equal weights into the expression for the target variable, we obtain a new expression for y, the prediction of the target variable.
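With all hidden-layer weights set to the same matrix W, a rough reconstruction of that expression (my notation, not the article's original figure) is:

```latex
\hat{y} = W^{[L]} \, W^{\,L-1} \, a^{[0]}
```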
Let us consider two different situations for the weights.
In case 1, when we raise the weight to the power of L−1, assuming a very deep neural network, the value of y becomes very large. Likewise, in case 2, the value of y becomes exponentially small. These are called exploding and vanishing gradients. Both conditions hurt the accuracy of gradient descent and demand more time for training on the data.
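A minimal sketch of the effect, assuming identity activations and a single scalar weight w repeated across L − 1 layers (the values 1.5 and 0.5 are illustrative, not taken from the article):

```python
import numpy as np

L = 50          # number of layers (illustrative)
a0 = 1.0        # scalar input

# Case 1: weights slightly greater than 1 -> the prediction explodes
w_large = 1.5
y_explode = (w_large ** (L - 1)) * a0

# Case 2: weights slightly smaller than 1 -> the prediction vanishes
w_small = 0.5
y_vanish = (w_small ** (L - 1)) * a0

print(f"w = {w_large}: y ≈ {y_explode:.3e}")   # ~4.3e+08, exploding
print(f"w = {w_small}: y ≈ {y_vanish:.3e}")    # ~1.8e-15, vanishing
```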
To avoid these situations, we need to initialize our weights more carefully and more systematically. One way of doing this is Xavier initialization.
If we consider a single neuron, as in logistic regression, the dimension of the weight matrix is determined by the dimension of a single input example. Hence we can set the variance of the weights to 1/n, where n is the number of input features. As the dimension of an input example grows, the dimension of the weight matrix must grow with it to train the model.
Once we apply this technique to deeper neural networks, the weight initialization for each layer can be expressed as scaling randomly initialized weights by the square root of 1/n[l−1], where n[l−1] is the number of units in the previous layer.
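A minimal NumPy sketch of this per-layer initialization (layer_dims and the helper name are illustrative, not from the article):

```python
import numpy as np

def initialize_weights_xavier(layer_dims, seed=0):
    """Xavier-style initialization: scale random weights by sqrt(1 / n_prev)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Each weight has variance ~1/n_prev, keeping layer outputs well scaled
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(1.0 / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

# Example: a network with 1024 input features and three further layers
params = initialize_weights_xavier([1024, 256, 64, 1])
```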
Similarly, there are various other ways to define the variance that multiplies the randomly initialized weights; a common variant, He initialization, uses a variance of 2/n[l−1] and tends to pair well with ReLU activations.
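As a sketch of that variant (a standard choice for ReLU layers, though the article does not spell it out), the scaling simply changes to sqrt(2 / n_prev):

```python
import numpy as np

def initialize_layer_he(n_curr, n_prev, seed=0):
    """He-style variant: variance 2 / n_prev, often paired with ReLU activations."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)

W1 = initialize_layer_he(256, 1024)  # e.g. 1024 input features feeding 256 units
```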
改進(jìn)梯度計(jì)算 (Improvising Gradient Computation)
Let us consider the function f(x) = x³ and compute its gradient at x = 1. There is a reason for using such a simple function: it makes the concept easy to understand and appreciate. By differentiation, we know that the slope of the function at x = 1 is exactly 3.
Now, let us approximate the slope at x = 1 numerically. We evaluate the function at x = 1 + δ, where δ is a very small quantity (say 0.001). The slope of the function is then approximated by the slope of the hypotenuse of the small triangle between x = 1 and x = 1 + δ, i.e. (f(1 + δ) − f(1)) / δ.
Hence the slope comes out to about 3.003, an error of roughly 0.003. Now, let us define the difference differently and calculate the slope again.
Now we calculate the slope of the bigger triangle bounded by 1 − δ and 1 + δ, i.e. the two-sided difference (f(1 + δ) − f(1 − δ)) / (2δ). Calculating the slope in this manner reduces the error dramatically, to about 0.000001.
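A small sketch comparing the two approximations, with δ = 0.001 as in the text:

```python
def f(x):
    return x ** 3

x, delta = 1.0, 1e-3
true_slope = 3.0  # derivative of x**3 at x = 1 is 3*x**2 = 3

# One-sided difference: slope of the small triangle from x to x + delta
one_sided = (f(x + delta) - f(x)) / delta

# Two-sided (central) difference: slope of the bigger triangle from x - delta to x + delta
two_sided = (f(x + delta) - f(x - delta)) / (2 * delta)

print(f"one-sided: {one_sided:.6f}  (error ≈ {abs(one_sided - true_slope):.6f})")
print(f"two-sided: {two_sided:.6f}  (error ≈ {abs(two_sided - true_slope):.6f})")
# one-sided: 3.003001  (error ≈ 0.003001)
# two-sided: 3.000001  (error ≈ 0.000001)
```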
Hence, we can infer that defining the slope in this manner helps us calculate the slope of a function more accurately. This demonstration shows how to improve the gradient computation and hence optimize gradient descent.
One thing to note is that although this two-sided approach computes gradients more accurately, it requires extra function evaluations and therefore increases the time needed to calculate the gradients.
Translated from: https://towardsdatascience.com/new-ways-for-optimizing-gradient-descent-42ce313fccae