Andrew Ng's "Machine Learning" Study Notes 14: Advice for Applying Machine Learning (Implementing and Improving a Machine Learning Model)

Published: 2024/7/23


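I. Task overview
II. Code implementation
- 1. Preparing the data
- 2. Cost function
- 3. Gradient computation
- 4. Regularized cost function and gradient
- 5. Fitting the data
- 6. Creating polynomial features
- 7. Preparing the polynomial regression data
- 8. Plotting learning curves
  - 𝜆=0
  - 𝜆=1
  - 𝜆=100
- 9. Finding the best 𝜆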

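The previous notes described how to go about implementing a machine learning model in practice: first build a fairly simple model quickly, then plot learning curves to diagnose what is wrong with the current model and work out how to improve it. This note implements that whole process.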

Dataset: https://pan.baidu.com/s/1Er82YunOOmyTJIW-0H40mQ
Extraction code: 41rj

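I. Task overview

In this note we implement regularized linear regression and use it to study models with different bias-variance properties. We first use regularized linear regression to predict the amount of water flowing out of a dam (flow) from the change in water level (water_level). We then plot learning curves to diagnose the learning algorithm, determine whether it suffers from high bias or high variance, and decide how to adjust it.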

II. Code implementation

1. Preparing the data

    import numpy as np
    import scipy.io as sio
    import scipy.optimize as opt
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns


    def load_data():
        """Load ex5data1.mat.

        d['X'] has shape (12, 1); pandas has trouble building a DataFrame
        from this 2-D ndarray, so every array is flattened with np.ravel.
        """
        d = sio.loadmat('ex5data1.mat')
        return map(np.ravel, [d['X'], d['y'], d['Xval'], d['yval'], d['Xtest'], d['ytest']])

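This reads in the training, validation, and test sets. Now check their dimensions:

    X, y, Xval, yval, Xtest, ytest = load_data()
    print('X:', X.shape)
    print('Xval:', Xval.shape)
    print('Xtest:', Xtest.shape)

The training, validation, and test sets contain 12, 21, and 21 examples respectively.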

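Next, visualize the training data to get an intuitive feel for it:

    df = pd.DataFrame({'water_level': X, 'flow': y})
    # Recent seaborn releases expect x and y as keywords and use `height` instead of `size`.
    sns.lmplot(x='water_level', y='flow', data=df, fit_reg=False, height=7)
    plt.show()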

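Finally, insert a bias column into X for all three datasets. This is the constant term of the linear regression hypothesis, the term that is multiplied by the parameter theta0. Then check the dimensions again:

    X, Xval, Xtest = [np.insert(x.reshape(x.shape[0], 1), 0, np.ones(x.shape[0]), axis=1)
                      for x in (X, Xval, Xtest)]
    print('X:', X.shape)
    print('Xval:', Xval.shape)
    print('Xtest:', Xtest.shape)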

2. Cost function

The cost function for linear regression (a figure in the original post):
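Reconstructed from the `cost` code below rather than copied from the original figure, the unregularized cost can be written as follows, where m is the number of examples and the hypothesis is linear in theta:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,
\qquad h_\theta(x) = \theta^{T} x

% vectorized form, matching the code
J(\theta) = \frac{1}{2m} (X\theta - y)^{T} (X\theta - y)
```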

The corresponding code is as follows:

    def cost(theta, X, y):
        # INPUT: parameters theta, data X, labels y
        # OUTPUT: value of the cost function at theta
        # STEP 1: get the number of examples
        m = X.shape[0]
        # STEP 2: compute the cost from the residuals
        inner = X @ theta - y
        square_sum = inner.T @ inner
        cost = square_sum / (2 * m)
        return cost

Try it with an initial parameter vector; note the dimension of theta:

    theta = np.ones(X.shape[1])
    cost(theta, X, y)
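As a sanity check on the vectorized form, the same quantity can be computed with an explicit loop over examples. The sketch below is self-contained and uses made-up toy data (not the dam dataset); the names `cost_vectorized`, `cost_loop`, and the `*_toy` arrays are illustrative assumptions.

```python
import numpy as np


def cost_vectorized(theta, X, y):
    """Squared-error cost, computed with matrix operations."""
    m = X.shape[0]
    residual = X @ theta - y
    return residual @ residual / (2 * m)


def cost_loop(theta, X, y):
    """The same cost, written as an explicit sum over examples."""
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        prediction = sum(theta[j] * X[i, j] for j in range(X.shape[1]))
        total += (prediction - y[i]) ** 2
    return total / (2 * m)


# Toy data (made up): a bias column plus one feature, three examples.
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
theta_toy = np.ones(2)

# Residuals are 0, -1, -2, so both versions give 5 / 6.
print(cost_vectorized(theta_toy, X_toy, y_toy))
print(cost_loop(theta_toy, X_toy, y_toy))
```

Note that theta has one entry per column of X (including the bias column), which is why the notes initialize it with `X.shape[1]`.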

3. Gradient computation

The gradient of the linear regression cost is as follows:
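Reconstructed from the `gradient` code below rather than copied from the original post, the partial derivative with respect to each parameter can be written as:

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

% vectorized form, matching the code
\nabla_\theta J(\theta) = \frac{1}{m} X^{T} (X\theta - y)
```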

The code is as follows:

    def gradient(theta, X, y):
        # INPUT: parameters theta, data X, labels y
        # OUTPUT: gradient of the cost at theta
        # STEP 1: get the number of examples
        m = X.shape[0]
        # STEP 2: compute the gradient
        grad = (X.T @ (X @ theta - y)) / m
        return grad


    gradient(theta, X, y)
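One way to gain confidence in an analytic gradient is to compare it with a finite-difference approximation. The sketch below is not part of the original notes: it repeats `cost` and `gradient` so that it runs on its own, and `numerical_gradient` plus the random demo data are illustrative assumptions.

```python
import numpy as np


def cost(theta, X, y):
    m = X.shape[0]
    inner = X @ theta - y
    return inner @ inner / (2 * m)


def gradient(theta, X, y):
    m = X.shape[0]
    return X.T @ (X @ theta - y) / m


def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta, dtype=float)
    for j in range(theta.size):
        step = np.zeros_like(theta, dtype=float)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad


# Random demo data (made up): 12 examples, a bias column plus one feature.
rng = np.random.default_rng(0)
X_demo = np.insert(rng.normal(size=(12, 1)), 0, 1.0, axis=1)
y_demo = rng.normal(size=12)
theta_demo = np.ones(2)

analytic = gradient(theta_demo, X_demo, y_demo)
numeric = numerical_gradient(lambda t: cost(t, X_demo, y_demo), theta_demo)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)  # should be very close to zero
```

If the two disagree by more than rounding error, the usual suspects are a missing 1/m factor or a sign error in the residual.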

4. Regularized cost function and gradient

The regularized gradient (a figure in the original post):
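Reconstructed from the `regularized_gradient` code below rather than copied from the original figure, the regularized gradient leaves the intercept term unpenalized and adds a term proportional to each remaining parameter:

```latex
\frac{\partial J(\theta)}{\partial \theta_0}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
    + \frac{\lambda}{m} \theta_j
  \qquad (j \geq 1)
```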

    def regularized_gradient(theta, X, y, l=1):
        # INPUT: parameters theta, data X, labels y, regularization parameter l
        # OUTPUT: regularized gradient at theta
        # STEP 1: get the number of examples
        m = X.shape[0]
        # STEP 2: compute the regularization term
        regularized_term = theta.copy()  # same shape as theta
        regularized_term[0] = 0  # don't regularize the intercept theta
        regularized_term = (l / m) * regularized_term
        return gradient(theta, X, y) + regularized_term


    regularized_gradient(theta, X, y)

The regularized cost function is as follows:
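Reconstructed from the `regularized_cost` code below rather than copied from the original post, the regularized cost adds a penalty on every parameter except the intercept theta_0 (n is the number of features excluding the bias column):

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
          + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
```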

    def regularized_cost(theta, X, y, l=1):
        m = X.shape[0]
        # penalize every parameter except the intercept theta[0]
        regularized_term = (l / (2 * m)) * np.power(theta[1:], 2).sum()
        return cost(theta, X, y) + regularized_term

5. Fitting the data

    def linear_regression_np(X, y, l=1):
        # INPUT: data X, labels y, regularization parameter l
        # OUTPUT: the scipy optimization result; the fitted parameters are in res.x
        # STEP 1: initialize the parameters
        theta = np.ones(X.shape[1])
        # STEP 2: call the optimizer to fit the parameters
        res = opt.minimize(fun=regularized_cost,
                           x0=theta,
                           args=(X, y, l),
                           method='TNC',
                           jac=regularized_gradient,
                           options={'disp': True})
        return res

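Call the fitting function to optimize the parameters:

    # linear_regression_np initializes theta itself, so no initial value is needed here
    final_theta = linear_regression_np(X, y, l=0).get('x')
    print(final_theta)

These are theta0 and theta1 after fitting the training data. Next we visualize the model given by these parameters.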
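As a cross-check on the optimizer (not part of the original notes), regularized linear regression also has a closed-form solution: setting the regularized gradient to zero gives (XᵀX + 𝜆L)θ = Xᵀy, where L is the identity matrix with its first diagonal entry set to zero so that the intercept is not penalized. The sketch below uses made-up data on an exact line; `normal_equation` and the `*_demo` names are illustrative assumptions.

```python
import numpy as np


def normal_equation(X, y, l=0.0):
    """Closed-form solution of regularized linear regression.

    Solves (X^T X + l * L) theta = X^T y, where L is the identity matrix
    with L[0, 0] = 0 so the intercept is left unregularized.
    """
    n = X.shape[1]
    penalty = l * np.eye(n)
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


# Made-up data lying exactly on the line y = 3 + 2x.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])
y_demo = 3.0 + 2.0 * x_demo

theta_unreg = normal_equation(X_demo, y_demo, l=0.0)  # recovers intercept 3, slope 2
theta_reg = normal_equation(X_demo, y_demo, l=10.0)   # slope is shrunk toward 0
print(theta_unreg)
print(theta_reg)
```

With the dam data, comparing `normal_equation(X, y, l=0)` against `final_theta` would be one way to confirm that the TNC optimizer has converged.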

    b = final_theta[0]  # intercept
    m = final_theta[1]  # slope

    plt.scatter(X[:, 1], y, label="Training data")
    plt.plot(X[:, 1], X[:, 1] * m + b, label="Prediction")
    plt.legend(loc=2)
    plt.show()

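The model cannot fit even the training data well, so it is clearly underfitting. This example is very obvious, though; in other cases the plot alone may not be enough to tell, so it is better to plot learning curves. That is, fit the model on subsets of the training data, use the resulting parameters to compute the error on the validation set as well, and, as the number of training examples grows, compute the training and validation errors and plot the corresponding curves:

1. Fit the model on a subset of the training set

2. Do not use regularization when computing the training cost and the cross-validation cost

3. Remember to compute the training cost on the same training subset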

    training_cost, cv_cost = [], []

    # STEP 1: get the number of examples and loop over training subset sizes
    m = X.shape[0]
    for i in range(1, m + 1):
        # STEP 2: fit on the first i examples, then compute both costs
        res = linear_regression_np(X[:i, :], y[:i], l=0)
        tc = regularized_cost(res.x, X[:i, :], y[:i], l=0)
        cv = regularized_cost(res.x, Xval, yval, l=0)
        # STEP 3: store the results in training_cost and cv_cost
        training_cost.append(tc)
        cv_cost.append(cv)

    plt.plot(np.arange(1, m + 1), training_cost, label='training cost')
    plt.plot(np.arange(1, m + 1), cv_cost, label='cv cost')
    plt.legend(loc=1)
    plt.show()

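The plot shows that both the training error and the validation error are fairly high, which indicates underfitting.

6. Creating polynomial features

Because the model underfits, a simple linear function is not enough; we should add polynomial features to increase the model's complexity.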

    def prepare_poly_data(*args, power):
        """
        args: keep feeding in X, Xval, or Xtest
            will return in the same order
        """
        def prepare(x):
            # feature mapping
            df = poly_features(x, power=power)
            # normalization (as_matrix() was removed from pandas; to_numpy() replaces it)
            ndarr = normalize_feature(df).to_numpy()
            # add the bias column
            return np.insert(ndarr, 0, np.ones(ndarr.shape[0]), axis=1)

        return [prepare(x) for x in args]

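Feature mapping was covered in an earlier note, so here is the code directly:

    def poly_features(x, power, as_ndarray=False):
        # feature mapping: columns f1..f{power} hold x**1 .. x**power
        data = {'f{}'.format(i): np.power(x, i) for i in range(1, power + 1)}
        df = pd.DataFrame(data)
        return df.to_numpy() if as_ndarray else df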

Try the code above by building polynomial features up to degree 3:

    X, y, Xval, yval, Xtest, ytest = load_data()
    poly_features(X, power=3)
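To make the mapping concrete, here is a small worked example on made-up inputs. It is a NumPy-only variant of the idea, not the `poly_features` function above; the name `poly_features_np` and the demo array are illustrative assumptions.

```python
import numpy as np


def poly_features_np(x, power):
    """Return a 2-D array whose columns are x**1, x**2, ..., x**power."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** i for i in range(1, power + 1)])


# Made-up inputs: three water-level values.
x_demo = np.array([-2.0, 1.0, 3.0])
features = poly_features_np(x_demo, power=3)

# Rows are [-2, 4, -8], [1, 1, 1], [3, 9, 27]: one row per example,
# one column per power.
print(features)
```

The columns grow quickly in magnitude as the power increases, which is why the features are normalized before fitting.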

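7. Preparing the polynomial regression data

Expand the features up to degree 8, or whatever degree you need

Use normalization to cope with the very different scales of the x^n terms

Don't forget to add the bias column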

    def normalize_feature(df):
        """Applies function along input axis(default 0) of DataFrame."""
        return df.apply(lambda column: (column - column.mean()) / column.std())


    X_poly, Xval_poly, Xtest_poly = prepare_poly_data(X, Xval, Xtest, power=8)
    X_poly[:3, :]
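Note that `prepare_poly_data` calls `normalize_feature` separately on each dataset, so the validation and test sets are scaled by their own means and standard deviations. A common alternative is to compute the statistics on the training set only and reuse them for the other sets. The sketch below illustrates that variant on made-up numbers; `fit_scaler` and `apply_scaler` are hypothetical helper names, not part of the original code.

```python
import numpy as np


def fit_scaler(features):
    """Compute per-column mean and standard deviation from the training features."""
    mean = features.mean(axis=0)
    std = features.std(axis=0, ddof=1)  # ddof=1 matches the default of pandas .std()
    return mean, std


def apply_scaler(features, mean, std):
    """Scale any dataset with statistics computed elsewhere (e.g. on the training set)."""
    return (features - mean) / std


# Made-up polynomial features (x and x**2) for a training and a validation set.
train = np.array([[1.0, 1.0], [2.0, 4.0], [3.0, 9.0]])
val = np.array([[2.0, 4.0], [4.0, 16.0]])

mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
val_scaled = apply_scaler(val, mean, std)  # reuses the training statistics

print(train_scaled)
print(val_scaled)
```

With this variant the same raw input always maps to the same scaled value, regardless of which dataset it appears in.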

8. Plotting learning curves

𝜆=0

First, without regularization, so 𝜆=0.

    def plot_learning_curve(X, y, Xval, yval, l=0):
        # INPUT: training set X, y; cross-validation set Xval, yval; regularization parameter l
        # OUTPUT: none; draws the training and cross-validation cost curves
        # STEP 1: initialize, get the number of examples, and start looping
        training_cost, cv_cost = [], []
        m = X.shape[0]

        for i in range(1, m + 1):
            # STEP 2: fit on the first i examples with the fitting function written earlier
            res = linear_regression_np(X[:i, :], y[:i], l=l)
            # STEP 3: compute the (unregularized) costs
            tc = cost(res.x, X[:i, :], y[:i])
            cv = cost(res.x, Xval, yval)
            # STEP 4: store the results in training_cost and cv_cost
            training_cost.append(tc)
            cv_cost.append(cv)

        plt.plot(np.arange(1, m + 1), training_cost, label='training cost')
        plt.plot(np.arange(1, m + 1), cv_cost, label='cv cost')
        plt.legend(loc=1)


    plot_learning_curve(X_poly, y, Xval_poly, yval, l=0)
    plt.show()

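In this learning curve the training error is too low while the validation error is not low: this is overfitting. (With lambda=0 nothing restrains overfitting, and the polynomial degree is fairly high, so overfitting is to be expected.)

𝜆=1

    plot_learning_curve(X_poly, y, Xval_poly, yval, l=1)
    plt.show()

The training error increases slightly and the validation error drops to a fairly low level, which is a reasonably good result.

𝜆=100

    plot_learning_curve(X_poly, y, Xval_poly, yval, l=100)
    plt.show()

Too much regularization; the model now underfits.

9. Finding the best 𝜆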

    l_candidate = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
    training_cost, cv_cost = [], []

    for l in l_candidate:
        res = linear_regression_np(X_poly, y, l)
        tc = cost(res.x, X_poly, y)
        cv = cost(res.x, Xval_poly, yval)
        training_cost.append(tc)
        cv_cost.append(cv)

    plt.plot(l_candidate, training_cost, label='training')
    plt.plot(l_candidate, cv_cost, label='cross validation')
    plt.legend(loc=2)
    plt.xlabel('lambda')
    plt.ylabel('cost')
    plt.show()

Find the best 𝜆, that is, the 𝜆 with the smallest validation error:

    # best cv I got from all those candidates
    l_candidate[np.argmin(cv_cost)]

Use the test set to compute the test error for each of these 𝜆 values; the ultimate goal is a low test error:

    # use test data to compute the cost
    for l in l_candidate:
        theta = linear_regression_np(X_poly, y, l).x
        print('test cost(l={}) = {}'.format(l, cost(theta, Xtest_poly, ytest)))

After tuning, 𝜆=0.3 is the best choice, since it gives the lowest test cost. The test error for 𝜆=1, which we picked above, is very close to the error of the best choice. The steps above are the general workflow for building and improving a model.
