02. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization - W1. Practical Aspects of Deep Learning (Assignments: Initialization + Regularization + Gradient Checking)
Table of Contents
- Assignment 1: Initialization
- 1. Neural network model
- 2. Zero initialization
- 3. Random initialization
- 4. He initialization
- Assignment 2: Regularization
- 1. Model without regularization
- 2. L2 regularization
- 3. Dropout regularization
- 3.1 Forward propagation with dropout
- 3.2 Backward propagation with dropout
- 3.3 Running the model
- Assignment 3: Gradient Checking
- 1. 1-dimensional gradient checking
- 2. N-dimensional gradient checking
Quiz: see the companion blog post
Notes: 02. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization - W1. Practical Aspects of Deep Learning
Assignment 1: Initialization
A good initialization can:
- speed up the convergence of gradient descent
- increase the odds of gradient descent converging to a lower training (and generalization) error
Import the data:
```python
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
from init_utils import sigmoid, relu, compute_loss, forward_propagation, backward_propagation
from init_utils import update_parameters, predict, load_dataset, plot_decision_boundary, predict_dec

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# load image dataset: blue/red dots in circles
train_X, train_Y, test_X, test_Y = load_dataset()
```
Our task: classify the two classes of points.
1. Neural network model
We use a 3-layer neural network that is already implemented:
```python
def model(X, Y, learning_rate = 0.01, num_iterations = 15000, print_cost = True, initialization = "he"):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
    learning_rate -- learning rate for gradient descent
    num_iterations -- number of iterations to run gradient descent
    print_cost -- if True, print the cost every 1000 iterations
    initialization -- flag to choose which initialization to use ("zeros","random" or "he")

    Returns:
    parameters -- parameters learnt by the model
    """

    grads = {}
    costs = []            # to keep track of the loss
    m = X.shape[1]        # number of examples
    layers_dims = [X.shape[0], 10, 5, 1]

    # Initialize parameters dictionary.
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)

        # Loss
        cost = compute_loss(a3, Y)

        # Backward propagation.
        grads = backward_propagation(X, Y, cache)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the loss
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
```

2. Zero initialization
```python
# GRADED FUNCTION: initialize_parameters_zeros

def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}
    L = len(layers_dims)    # number of layers in the network

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
    return parameters
```

Run the following code to train:
```python
parameters = model(train_X, train_Y, initialization = "zeros")
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```

Result:
```
Cost after iteration 0: 0.6931471805599453
Cost after iteration 1000: 0.6931471805599453
Cost after iteration 2000: 0.6931471805599453
Cost after iteration 3000: 0.6931471805599453
Cost after iteration 4000: 0.6931471805599453
Cost after iteration 5000: 0.6931471805599453
Cost after iteration 6000: 0.6931471805599453
Cost after iteration 7000: 0.6931471805599453
Cost after iteration 8000: 0.6931471805599453
Cost after iteration 9000: 0.6931471805599453
Cost after iteration 10000: 0.6931471805599455
Cost after iteration 11000: 0.6931471805599453
Cost after iteration 12000: 0.6931471805599453
Cost after iteration 13000: 0.6931471805599453
Cost after iteration 14000: 0.6931471805599453
On the train set:
Accuracy: 0.5
On the test set:
Accuracy: 0.5
```

- Very poor performance: the cost barely decreases.
All predictions are 0:
```
predictions_train = [[0 0 0 ... 0 0 0]]   (every training prediction is 0)
predictions_test  = [[0 0 0 ... 0 0 0]]   (every test prediction is 0)
```

```python
plt.title("Model with Zeros initialization")
axes = plt.gca()
axes.set_xlim([-1.5,1.5])
axes.set_ylim([-1.5,1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```
Conclusions:
- In a neural network, do not initialize the weights to zero; otherwise the model cannot break symmetry and every neuron keeps learning the same thing.
- Initialize the weights randomly and the biases to zero.
3. Random initialization
- np.random.randn(layers_dims[l], layers_dims[l-1]) * 10: the * 10 factor initializes the weights with very large random values (a sketch of the full initializer is given below).
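For reference, a minimal sketch of initialize_parameters_random, following the same structure as initialize_parameters_zeros above (numpy is imported as np above; the random seed is an assumption, used only for reproducibility):

```python
def initialize_parameters_random(layers_dims):
    """Initialize weights with large random values (scaled by * 10) and biases with zeros."""
    np.random.seed(3)                 # assumed seed, only to make runs reproducible
    parameters = {}
    L = len(layers_dims)              # number of layers in the network

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```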
Run the following code to train:
```python
parameters = model(train_X, train_Y, initialization = "random")
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```

Result:
```
Cost after iteration 0: inf
Cost after iteration 1000: 0.6239567039908781
Cost after iteration 2000: 0.5978043872838292
Cost after iteration 3000: 0.563595830364618
Cost after iteration 4000: 0.5500816882570866
Cost after iteration 5000: 0.5443417928662615
Cost after iteration 6000: 0.5373553777823036
Cost after iteration 7000: 0.4700141958024487
Cost after iteration 8000: 0.3976617665785177
Cost after iteration 9000: 0.39344405717719166
Cost after iteration 10000: 0.39201765232720626
Cost after iteration 11000: 0.38910685278803786
Cost after iteration 12000: 0.38612995897697244
Cost after iteration 13000: 0.3849735792031832
Cost after iteration 14000: 0.38275100578285265
On the train set:
Accuracy: 0.83
On the test set:
Accuracy: 0.86
```

Decision boundary:
plt.title("Model with large random initialization") axes = plt.gca() axes.set_xlim([-1.5,1.5]) axes.set_ylim([-1.5,1.5]) plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
Changing * 10 to * 1:
Changing * 10 to * 0.1:
- Choosing an appropriate scale for the initial weights is very important!
- Poor initialization can cause vanishing/exploding gradients, which slows down learning.
4. He initialization
It is named after the first author of He et al., 2015.
- If using the ReLU activation (the most common choice), scale the randomly initialized weights by $\sqrt{\frac{2}{n^{[l-1]}}}$, i.e. * np.sqrt(2. / layers_dims[l-1]).
- If using the tanh activation, use $\sqrt{\frac{1}{n^{[l-1]}}}$ (Xavier initialization) or $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$ instead. A sketch of the He initializer is given below.
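A minimal sketch of initialize_parameters_he, assuming the same interface as the other initializers above:

```python
def initialize_parameters_he(layers_dims):
    """He initialization: weights scaled by sqrt(2 / fan_in), biases set to zero."""
    parameters = {}
    L = len(layers_dims)              # number of layers in the network

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```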
| Model | Train accuracy | Problem/Comment |
| --- | --- | --- |
| 3-layer NN with zeros initialization | 50% | fails to break symmetry |
| 3-layer NN with large random initialization | 83% | too large weights |
| 3-layer NN with He initialization | 99% | recommended method |
Assignment 2: Regularization
Overfitting is a serious problem: the model performs well on the training set but generalizes poorly to new data.
```python
# import packages
import numpy as np
import matplotlib.pyplot as plt
from reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
from reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
import sklearn
import sklearn.datasets
import scipy.io
from testCases import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
```

Problem setup:
The French goalkeeper kicks the ball: to which positions should he send it so that his teammates can head it?
The goalkeeper kicks from the left side; blue dots are positions where French players headed the ball, red dots are positions where the opposing players did.
By eye, the two classes look roughly separable by a line at about 45°.
1. Model without regularization
```python
def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.

    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """

    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]

    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        # Backward propagation.
        assert(lambd == 0 or keep_prob == 1)   # it is possible to use both L2 regularization and dropout,
                                               # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters


parameters = model(train_X, train_Y)
print ("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```

- Training process without regularization:
- The unregularized model overfits the training set; it even fits some noisy points.
2. L2 regularization
- Note that a regularization term is added to the cost function.
Cost function without regularization:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)$$
Cost function with the L2 regularization term:
$$J_{regularized} = \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)}_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}}_\text{L2 regularization cost}$$
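A minimal sketch of the regularized cost for the 3-layer network above, matching the compute_cost_with_regularization call used in the model function (it reuses compute_cost from reg_utils for the cross-entropy part):

```python
def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 penalty (lambd / (2m)) * sum of squared weights."""
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)      # cross-entropy part of the cost

    L2_regularization_cost = lambd / (2 * m) * (np.sum(np.square(W1)) +
                                                np.sum(np.square(W2)) +
                                                np.sum(np.square(W3)))
    return cross_entropy_cost + L2_regularization_cost
```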
- During backpropagation, the gradients must also be computed from this new cost function.
Each $dW$ gains the extra term $\frac{d}{dW}\left(\frac{1}{2}\frac{\lambda}{m}W^2\right) = \frac{\lambda}{m}W$ (see the sketch below).
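A minimal sketch of the corresponding backward pass, assuming the cache layout produced by forward_propagation in reg_utils; each dW simply gets the extra (lambd / m) * W term:

```python
def backward_propagation_with_regularization(X, Y, cache, lambd):
    """Backward pass for the 3-layer net where every dW includes the L2 term (lambd/m) * W."""
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T) + (lambd / m) * W3
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))      # derivative of ReLU
    dW2 = 1./m * np.dot(dZ2, A1.T) + (lambd / m) * W2
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))      # derivative of ReLU
    dW1 = 1./m * np.dot(dZ1, X.T) + (lambd / m) * W1
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
```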
- Run the model with L2 regularization ($\lambda = 0.7$), using the two functions above to compute the cost and the gradients; a usage sketch follows.
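A usage sketch of that run, assuming the model function defined earlier in this assignment:

```python
parameters = model(train_X, train_Y, lambd=0.7)
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```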
The model no longer overfits.
L2 regularization performs weight decay. It rests on the assumption that a model with small weights is simpler: large weights are penalized, and small weights make the output change smoothly rather than abruptly, so the decision boundary does not become complex enough to overfit.
Varying $\lambda$ for comparison:
$\lambda = 0.3$
$\lambda = 0.1$
```
On the train set:
Accuracy: 0.9383886255924171
On the test set:
Accuracy: 0.95
```

$\lambda = 0.01$ (regularization is very weak)
```
On the train set:
Accuracy: 0.9289099526066351
On the test set:
Accuracy: 0.915
```

(slightly overfitting)
$\lambda = 1$
```
On the train set:
Accuracy: 0.9241706161137441
On the test set:
Accuracy: 0.93
```
$\lambda = 5$
- When $\lambda$ is too large, the regularization is too strong: W is squeezed to very small values and the decision boundary becomes overly smooth (almost a straight line), which causes high bias.
3. Dropout regularization
Dropout randomly shuts down some neurons at every iteration.
A shut-down neuron contributes nothing to the forward or backward pass of that iteration.
The idea behind dropout is that at each iteration you train a different model that uses only a subset of the neurons, so each neuron becomes less sensitive to the activation of any particular other neuron, because that other neuron might be shut down at any time.
3.1 Forward propagation with dropout
Apply dropout to the 3-layer network only on hidden layers 1 and 2, not on the input or output layer.
- Use np.random.rand() to create a matrix $D^{[1]} = [d^{[1](1)} d^{[1](2)} \dots d^{[1](m)}]$ with the same dimensions as $A^{[1]}$.
- Set each entry of $D^{[1]}$ to 0 with probability (1 - keep_prob) and to 1 with probability keep_prob, via the thresholding trick X = (X < keep_prob).
- Shut down the corresponding neurons: $A^{[1]} = A^{[1]} * D^{[1]}$.
- Divide by the keep probability, $A^{[1]} /= \text{keep\_prob}$; this keeps the expected value of the activations (and hence of the cost) the same as without dropout (inverted dropout). A sketch of the full forward pass is given below.
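A minimal sketch of forward_propagation_with_dropout following the four steps above, for the 3-layer architecture used in this assignment (relu and sigmoid come from the reg_utils imports; the seed is an assumption for reproducibility):

```python
def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """LINEAR->RELU (+dropout) -> LINEAR->RELU (+dropout) -> LINEAR->SIGMOID."""
    np.random.seed(1)                                # assumed seed, so the masks are reproducible
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    W3, b3 = parameters["W3"], parameters["b3"]

    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    D1 = np.random.rand(A1.shape[0], A1.shape[1])    # Step 1: random matrix with A1's shape
    D1 = (D1 < keep_prob)                            # Step 2: threshold to a 0/1 mask
    A1 = A1 * D1                                     # Step 3: shut down some neurons
    A1 = A1 / keep_prob                              # Step 4: inverted dropout scaling

    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])
    D2 = (D2 < keep_prob)
    A2 = A2 * D2
    A2 = A2 / keep_prob

    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    return A3, cache
```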
3.2 Backward propagation with dropout
In the forward pass we shut neurons down with the masks $D^{[1]}$ and $D^{[2]}$.
- Apply the same mask $D^{[1]}$ to shut down the corresponding entries of dA1.
- Divide dA1 by keep_prob as well, so the derivatives carry the same scaling as the activations did in the forward pass. A sketch is given below.
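A minimal sketch of backward_propagation_with_dropout, assuming the cache layout produced by the forward sketch above (masks D1 and D2 stored alongside the usual values):

```python
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """Backward pass for the 3-layer net, re-applying the dropout masks stored in the cache."""
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dA2 = dA2 * D2                                # apply the mask used in the forward pass
    dA2 = dA2 / keep_prob                         # scale, as in the forward pass
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))      # derivative of ReLU
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dA1 = dA1 * D1
    dA1 = dA1 / keep_prob
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
```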
3.3 Running the model
Parameters: keep_prob = 0.86; the forward and backward passes use the two dropout functions above. A usage sketch follows.
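A usage sketch of that run, assuming the model function defined earlier:

```python
parameters = model(train_X, train_Y, keep_prob=0.86, learning_rate=0.3)
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```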
```
On the train set:
Accuracy: 0.9289099526066351
On the test set:
Accuracy: 0.95
```

The model does not overfit, and the test-set accuracy reaches 95%.
Note:
- Use dropout only during training; do not use it at test time.
- Apply it in both the forward and backward passes.
| Model | Train accuracy | Test accuracy |
| --- | --- | --- |
| 3-layer NN without regularization | 95% | 91.5% |
| 3-layer NN with L2-regularization | 94% | 93% |
| 3-layer NN with dropout | 93% | 95% |
Regularization limits overfitting on the training set: training accuracy drops a little, but test accuracy goes up, which is what we want.
Assignment 3: Gradient Checking
Gradient checking verifies that backward propagation is implemented correctly, with no bugs.
1. 1-dimensional gradient checking
$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$$
- Compute the numerical approximation gradapprox of the gradient with the formula above, using a small $\varepsilon$.
- Run backward propagation to compute the analytical gradient grad.
- Compare the two with the relative difference

$$difference = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2}$$

- Check that this difference is small enough (on the order of $10^{-7}$). A sketch of the 1-D check follows.
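A minimal sketch of the 1-D check on the toy cost $J(\theta) = \theta x$ used in the assignment (the forward pass returns $\theta x$ and the analytical gradient is $x$); numpy is assumed to be imported as np as above:

```python
def gradient_check(x, theta, epsilon=1e-7):
    """1-D gradient check for the toy cost J(theta) = theta * x."""
    # Numerical approximation of dJ/dtheta with the centered-difference formula
    J_plus = (theta + epsilon) * x
    J_minus = (theta - epsilon) * x
    gradapprox = (J_plus - J_minus) / (2 * epsilon)

    # Analytical gradient: dJ/dtheta = x
    grad = x

    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator / denominator

    if difference < 1e-7:
        print("The gradient is correct! difference = " + str(difference))
    else:
        print("The gradient is wrong! difference = " + str(difference))
    return difference

gradient_check(x=2, theta=4)   # should report a difference far below 1e-7
```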
2. N-dimensional gradient checking
```python
def forward_propagation_n(X, Y, parameters):
    """
    Implements the forward propagation (and computes the cost) presented in Figure 3.

    Arguments:
    X -- training set for m examples
    Y -- labels for m examples
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (5, 4)
                    b1 -- bias vector of shape (5, 1)
                    W2 -- weight matrix of shape (3, 5)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)

    Returns:
    cost -- the cost function (logistic cost for one example)
    """

    # retrieve parameters
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Cost
    logprobs = np.multiply(-np.log(A3), Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1./m * np.sum(logprobs)

    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)

    return cost, cache


def backward_propagation_n(X, Y, cache):
    """
    Implement the backward propagation presented in figure 2.

    Arguments:
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    cache -- cache output from forward_propagation_n()

    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter,
                 activation and pre-activation variables.
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T) * 2
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients


# GRADED FUNCTION: gradient_check_n

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):

        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function outputs two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                         # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                    # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))    # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                        # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                                  # Step 2
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))  # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(gradapprox - grad)                      # Step 1'
    denominator = np.linalg.norm(gradapprox) + np.linalg.norm(grad)    # Step 2'
    difference = numerator / denominator                               # Step 3'
    ### END CODE HERE ###

    if difference > 1e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference
```

- The backward_propagation_n function provided by the instructor contains deliberate mistakes; try to find them.
- Finding the bugs:
Change db1 to: db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
Change dW2 to: dW2 = 1./m * np.dot(dZ2, A1.T)
The difference drops, but it is still slightly above $10^{-7}$, so the check still prints an error message; this small residual is not a real problem.
```
There is a mistake in the backward propagation! difference = 1.1890913023330276e-07
```

Note:
- Gradient checking is very slow and computationally expensive, so we do not run it during training; we only run it a few times to verify that the gradients are correct.
- Turn off dropout when running gradient checking.
My CSDN blog: https://michael.blog.csdn.net/
Long-press or scan the QR code to follow my WeChat official account (Michael阿明), and let's keep learning and improving together!