Implementing a Backpropagation Artificial Neural Network
Why implement a BP neural network?
“What I cannot create, I do not understand”
— Richard Feynman, February 1988
The seven steps to implementing a BP neural network:

- Choose the network architecture
- Randomly initialize the weights
- Implement forward propagation
- Implement the cost function
- Implement backpropagation to compute the partial derivatives $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$
- Use gradient checking to verify the implementation
- Train the network with an optimization routine and apply it to new data
The Dataset
We will use the well-known iris dataset as the subject for our network. First, let's take a look at the data.
```python
In[2]:
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# shuffle the samples so the classes are mixed
perm = np.random.permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]
print(iris.data.shape)
print(iris.target.shape)
np.unique(iris.target)

Out[2]:
(150, 4)
(150,)
array([0, 1, 2])
```
As you can see, the iris dataset has four features, three classes, and 150 samples in total.
Choosing the Network Architecture
The four features give us four input units and the three classes give us three output units; we use a single hidden layer with six units, matching the sizes set before training below.
Randomly Initializing the Weights
The network weights are initialized randomly: each $\Theta_{ij}^{(l)}$ is set to a random value in $[-\epsilon, \epsilon]$.
The initialization function can be defined as follows:
```python
In[3]:
def random_init(input_layer_size, hidden_layer_size, classes, init_epsilon=0.12):
    # each weight is drawn uniformly from [-init_epsilon, init_epsilon]
    Theta_1 = np.random.rand(hidden_layer_size, input_layer_size + 1) * 2 * init_epsilon - init_epsilon
    Theta_2 = np.random.rand(classes, hidden_layer_size + 1) * 2 * init_epsilon - init_epsilon
    return (Theta_1, Theta_2)
```
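For instance, a quick check of the shapes this produces for the 4-6-3 network used below (an illustrative call, not part of the original pipeline):

```python
# the "+ 1" in each shape accounts for the bias column
T1, T2 = random_init(4, 6, 3)
print(T1.shape)  # (6, 5): hidden_layer_size x (input_layer_size + 1)
print(T2.shape)  # (3, 7): classes x (hidden_layer_size + 1)
```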
Implementing Forward Propagation
Each layer computes the next layer with the following formula:
$$
g_{\Theta^{(l)}}(X) = \frac{1}{1 + e^{-X(\Theta^{(l)})^T}}
$$
Note that a bias unit must be added at each layer.
The forward propagation function is defined as follows:
```python
In[4]:
# first define the sigmoid function
def g(theta, X):
    return 1 / (1 + np.e ** (-np.dot(X, np.transpose(theta))))

# forward propagation: add a bias column, then apply each layer in turn
def predict(Theta_1, Theta_2, X):
    h = g(Theta_1, np.hstack((np.ones((X.shape[0], 1)), X)))  # hidden-layer activations
    o = g(Theta_2, np.hstack((np.ones((h.shape[0], 1)), h)))  # output-layer activations
    return o
```
Implementing the Cost Function
The cost function in mathematical form:
$$
J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log\bigl(h_\Theta(x^{(i)})\bigr)_k + \bigl(1 - y_k^{(i)}\bigr)\log\bigl(1 - h_\Theta(x^{(i)})\bigr)_k\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\bigl(\Theta_{ji}^{(l)}\bigr)^2
$$
Here $K$ is the number of output units, $L$ is the number of layers (three in our case), and $s_l$ is the number of units in layer $l$.
```python
In[5]:
def costfunction(nn_param, input_layer_size, hidden_layer_size, classes, X, y, lamb_da):
    # restore the two weight matrices from the flat parameter vector
    Theta_1 = nn_param[:hidden_layer_size * (input_layer_size + 1)].reshape((hidden_layer_size, input_layer_size + 1))
    Theta_2 = nn_param[hidden_layer_size * (input_layer_size + 1):].reshape((classes, hidden_layer_size + 1))
    m = X.shape[0]
    # recode y as one-hot row vectors
    y_recoded = np.eye(classes)[y, :]
    # regularization term (bias columns excluded, matching the formula above and the gradient below)
    regularator = (np.sum(Theta_1[:, 1:] ** 2) + np.sum(Theta_2[:, 1:] ** 2)) * (lamb_da / (2 * m))
    a3 = predict(Theta_1, Theta_2, X)
    J = 1.0 / m * np.sum(-1 * y_recoded * np.log(a3) - (1 - y_recoded) * np.log(1 - a3)) + regularator
    return J
```
Because the optimization routines in scipy expect the parameters as a single vector rather than matrices, our functions must be able to unroll the weight matrices into a vector and restore them at any time.
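As a minimal sketch of that unroll/restore round trip (the shapes assume the 4-6-3 architecture used later; the snippet is illustrative and not part of the training pipeline):

```python
import numpy as np

Theta_1 = np.zeros((6, 4 + 1))  # hidden_layer_size x (input_layer_size + 1)
Theta_2 = np.zeros((3, 6 + 1))  # classes x (hidden_layer_size + 1)

# unroll both matrices into one flat vector for scipy.optimize
nn_param = np.concatenate((Theta_1.ravel(), Theta_2.ravel()))

# restore the matrices inside the cost and gradient functions
T1 = nn_param[:6 * (4 + 1)].reshape((6, 4 + 1))
T2 = nn_param[6 * (4 + 1):].reshape((3, 6 + 1))
assert T1.shape == Theta_1.shape and T2.shape == Theta_2.shape
```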
Implementing Backpropagation
- Forward propagate to compute the network's output
- Work backwards through the layers to compute the error $\delta^{(l)}$ of each layer
- Accumulate the errors to obtain the partial derivatives $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$ of the cost function
Again, because this function is handed to the optimizers in scipy.optimize, it must unroll its matrix parameters into a vector and restore them as needed.
For the detailed implementation, see Andrew Ng's lectures on Coursera; for the mathematical derivation, search the Coursera ML course forum.
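Concretely, for this three-layer sigmoid network the layer errors used in the code below follow the standard backpropagation recursion (stated here without derivation):

$$
\delta^{(3)} = a^{(3)} - y, \qquad
\delta^{(2)} = \delta^{(3)}\Theta^{(2)} \odot a^{(2)} \odot \bigl(1 - a^{(2)}\bigr)
$$

The bias component of $\delta^{(2)}$ is then discarded, each sample adds $(\delta^{(l+1)})^T a^{(l)}$ to the accumulated gradient, and the accumulated gradients are averaged over the $m$ samples and regularized.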
```python
In[6]:
def Gradient(nn_param, input_layer_size, hidden_layer_size, classes, X, y, lamb_da):
    # restore the weight matrices from the flat parameter vector
    Theta_1 = nn_param[:hidden_layer_size * (input_layer_size + 1)].reshape((hidden_layer_size, input_layer_size + 1))
    Theta_2 = nn_param[hidden_layer_size * (input_layer_size + 1):].reshape((classes, hidden_layer_size + 1))
    m = X.shape[0]
    Theta1_grad = np.zeros(Theta_1.shape)
    Theta2_grad = np.zeros(Theta_2.shape)
    for t in range(m):
        # input layer (l=1): add a bias unit and forward propagate
        a1 = np.hstack((np.ones((X[t:t+1, :].shape[0], 1)), X[t:t+1, :]))
        # hidden layer (l=2)
        a2 = g(Theta_1, a1)
        a2 = np.hstack((np.ones((a2.shape[0], 1)), a2))
        # output layer (l=3)
        a3 = g(Theta_2, a2)
        # one-hot encode the label of sample t
        yy = np.zeros((1, classes))
        yy[0][y[t]] = 1
        # output-layer error
        delta_3 = a3 - yy
        # hidden-layer error; drop the bias entry
        delta_2 = delta_3.dot(Theta_2) * (a2 * (1 - a2))
        delta_2 = delta_2[0][1:]
        delta_2 = delta_2.reshape(1, np.size(delta_2))
        # delta_1 is not calculated because we do not associate error with the input
        # accumulate the gradients ("big delta" update)
        Theta1_grad = Theta1_grad + np.transpose(delta_2).dot(a1)
        Theta2_grad = Theta2_grad + np.transpose(delta_3).dot(a2)
    # average over the samples and add regularization (bias columns excluded)
    Theta1_grad = 1.0 / m * Theta1_grad + lamb_da / m * np.hstack((np.zeros((Theta_1.shape[0], 1)), Theta_1[:, 1:]))
    Theta2_grad = 1.0 / m * Theta2_grad + lamb_da / m * np.hstack((np.zeros((Theta_2.shape[0], 1)), Theta_2[:, 1:]))
    # unroll the gradient matrices into a single vector for scipy
    Grad = np.concatenate((Theta1_grad.ravel(), Theta2_grad.ravel()), axis=0)
    return Grad
```
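The per-sample loop can also be vectorized. The sketch below is not part of the original post; it computes the same gradients for all m samples at once, using only the g function and shape conventions already defined above:

```python
def Gradient_vectorized(nn_param, input_layer_size, hidden_layer_size, classes, X, y, lamb_da):
    # restore the weight matrices from the flat parameter vector
    Theta_1 = nn_param[:hidden_layer_size * (input_layer_size + 1)].reshape((hidden_layer_size, input_layer_size + 1))
    Theta_2 = nn_param[hidden_layer_size * (input_layer_size + 1):].reshape((classes, hidden_layer_size + 1))
    m = X.shape[0]
    # forward pass for all samples at once
    a1 = np.hstack((np.ones((m, 1)), X))               # m x (input_layer_size + 1)
    a2 = np.hstack((np.ones((m, 1)), g(Theta_1, a1)))  # m x (hidden_layer_size + 1)
    a3 = g(Theta_2, a2)                                # m x classes
    # layer errors
    y_recoded = np.eye(classes)[y, :]
    delta_3 = a3 - y_recoded
    delta_2 = (delta_3.dot(Theta_2) * a2 * (1 - a2))[:, 1:]  # bias column dropped
    # averaged gradients plus regularization (bias columns excluded)
    Theta1_grad = delta_2.T.dot(a1) / m + lamb_da / m * np.hstack((np.zeros((Theta_1.shape[0], 1)), Theta_1[:, 1:]))
    Theta2_grad = delta_3.T.dot(a2) / m + lamb_da / m * np.hstack((np.zeros((Theta_2.shape[0], 1)), Theta_2[:, 1:]))
    return np.concatenate((Theta1_grad.ravel(), Theta2_grad.ravel()))
```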
Gradient Checking
“But back prop as an algorithm has a lot of details and, you know, can be a
little bit tricky to implement.”
– so says Andrew Ng
- Even if the cost function appears to keep decreasing, the network may still be buggy.
- Gradient checking lets you confirm that your implementation really is working correctly.
The Idea Behind Gradient Checking
It is a simple finite-difference approximation of the derivative, but it is far more expensive to compute than backpropagation.
$$
\frac{\partial}{\partial\theta_{i}}J(\theta) \approx
\frac{J(\theta_{i}+\epsilon) - J(\theta_{i}-\epsilon)}{2\epsilon}
$$
This numerical gradient can be implemented as follows:
```python
In[7]:
def compute_grad(theta, input_layer_size, hidden_layer_size, classes, X, y, lamb_da, epsilon=1e-4):
    n = np.size(theta)
    gradaproxy = np.zeros(n)
    for i in range(n):
        # np.copy matters: a plain assignment would just alias the same array
        theta_plus = np.copy(theta)
        theta_plus[i] = theta_plus[i] + epsilon
        theta_minus = np.copy(theta)
        theta_minus[i] = theta_minus[i] - epsilon
        gradaproxy[i] = (costfunction(theta_plus, input_layer_size, hidden_layer_size, classes, X, y, lamb_da)
                         - costfunction(theta_minus, input_layer_size, hidden_layer_size, classes, X, y, lamb_da)) / (2.0 * epsilon)
    return gradaproxy
```
Putting It All Together
First, define the training function:
```python
In[8]:
from scipy import optimize

def train(input_layer_size, hidden_layer_size, classes, X, y, lamb_da):
    Theta_1, Theta_2 = random_init(input_layer_size, hidden_layer_size, classes, init_epsilon=0.12)
    # unroll the initial weights into one parameter vector for fmin_cg
    nn_param = np.concatenate((Theta_1.ravel(), Theta_2.ravel()))
    final_nn = optimize.fmin_cg(costfunction, nn_param, fprime=Gradient,
                                args=(input_layer_size, hidden_layer_size, classes, X, y, lamb_da))
    return final_nn
```
Training on the Training Set
I use 100 samples of the iris dataset as the training set and the remaining 50 as the test set.
```python
In[9]:
X = iris.data[:100]
y = iris.target[:100]
lamb_da = 1.0  # must be a float
input_layer_size = 4
hidden_layer_size = 6
classes = 3
final_nn = train(input_layer_size, hidden_layer_size, classes, X, y, lamb_da)

Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 0.878999
         Iterations: 48
         Function evaluations: 152
         Gradient evaluations: 140
```
Checking That the Two Gradient Computations Agree
```python
In[10]:
# gradient checking: compare the numerical gradient with the backprop gradient
grad_aprox = compute_grad(final_nn, input_layer_size, hidden_layer_size, classes, X, y, lamb_da)
grad_bp = Gradient(final_nn, input_layer_size, hidden_layer_size, classes, X, y, lamb_da)
# note: comparing np.abs(grad_aprox - grad_bp) would be the stricter check
print((grad_aprox - grad_bp) < 1e-1)

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True]
```
Applying the Trained Parameters to the Test Set
As you can see, 3 of the 50 test samples are misclassified and the other 47 are classified correctly.
```python
In[11]:
def test(final_nn, input_layer_size, hidden_layer_size, classes, X, y, lamb_da):
    Theta_1 = final_nn[:hidden_layer_size * (input_layer_size + 1)].reshape((hidden_layer_size, input_layer_size + 1))
    Theta_2 = final_nn[hidden_layer_size * (input_layer_size + 1):].reshape((classes, hidden_layer_size + 1))
    # number of samples whose predicted class matches the label
    nsuccess = np.sum(np.argmax(predict(Theta_1, Theta_2, X), axis=1) == y)
    return nsuccess

n = test(final_nn, input_layer_size, hidden_layer_size, classes, iris.data[100:], iris.target[100:], lamb_da)
print(n)
n = test(final_nn, input_layer_size, hidden_layer_size, classes, iris.data[:100], iris.target[:100], lamb_da)
print(n)

Out[11]:
47
99
```
Another Example: the Handwritten Digits Dataset
Finally, an experiment on the equally well-known handwritten digits dataset.
Load and inspect the dataset:
```python
In[12]:
digits = datasets.load_digits()
# shuffle the samples
perm = np.random.permutation(digits.target.size)
digits.data = digits.data[perm]
digits.target = digits.target[perm]
print(digits.data.shape)
print(digits.target.shape)
print(np.unique(digits.target))

Out[12]:
(1797, 64)
(1797,)
[0 1 2 3 4 5 6 7 8 9]
```
Choosing the Network Parameters
Take 1000 samples as the training set and leave the rest as the test set.
```python
In[13]:
X = digits.data[:1000]
y = digits.target[:1000]
lamb_da = 1.0  # must be a float
input_layer_size = 64
hidden_layer_size = 10
classes = 10
final_nn = train(input_layer_size, hidden_layer_size, classes, X, y, lamb_da)

Out[13]:
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 0.594474
         Iterations: 965
         Function evaluations: 2210
         Gradient evaluations: 2189
```
Running Gradient Checking
```python
In[14]:
# gradient checking
grad_aprox = compute_grad(final_nn, input_layer_size, hidden_layer_size, classes, X, y, lamb_da)
grad_bp = Gradient(final_nn, input_layer_size, hidden_layer_size, classes, X, y, lamb_da)
print(np.all((grad_aprox - grad_bp) < 1e-1))

Out[14]:
True
```
Applying the Trained Parameters to the Test Set
```python
In[15]:
n = test(final_nn, input_layer_size, hidden_layer_size, classes, digits.data[1000:], digits.target[1000:], lamb_da)
print(n)
n = test(final_nn, input_layer_size, hidden_layer_size, classes, digits.data[:1000], digits.target[:1000], lamb_da)
print(n)

Out[15]:
722
991
```
That is 722 of the 797 held-out digits (about 90.6%) and 991 of the 1000 training digits classified correctly.
By reverland
Source: http://reverland.org/machine-learning/2014/06/02/總結