Developing a Learning Curve to Detect Bugs in a Machine Learning Algorithm
Machine Learning
The learning curve is very useful for determining how to improve the performance of an algorithm. It helps you detect whether an algorithm is suffering from bias (underfitting), variance (overfitting), or a bit of both.
If your machine learning algorithm is not working as expected, what should you do next? There are several options: get more training data, try a smaller set of features, add more features, add polynomial features, or adjust the regularization parameter.
So, which one should you try next? It is not a good idea to start trying things at random, because you may end up spending a lot of time on something that does not help. You need to detect the problem first and then act accordingly. A learning curve helps you detect the problem easily, which saves a lot of time.
How a Learning Curve Works
The learning curve is a plot of the cost function. Plotting the cost function for the training data and the cost function for the cross-validation data in the same figure gives important insights about the algorithm. As a reminder, here is the formula for the cost function:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
In other words, it is the sum of the squared differences between the predicted output and the original output, divided by twice the number of training examples. To draw the learning curve, we need to plot these cost functions as a function of the number of training examples (m). Instead of using all the training data, we will train on progressively larger subsets of the training data.
Have a look at the picture below:
Here is the concept. If we train on too few examples, the algorithm will fit the training data perfectly and the cost function will return 0. The picture above shows clearly that when we train on only one, two, or three examples, the algorithm can learn those few examples very well, and the training cost comes out to be zero or close to zero. But this type of model cannot perform well on other data: when you evaluate it on the cross-validation data, the probability is very high that it will perform poorly, so the cost function for the cross-validation data will return a very high value.

On the other hand, as we use more and more data to train the algorithm, it will no longer fit the training data perfectly, so the training cost becomes higher. At the same time, because the algorithm is trained on a lot of data, it performs better on the cross-validation data, and the cost function for the cross-validation data returns a lower value. Here is how to develop a learning curve.
Develop a Learning Algorithm
I will demonstrate how to draw a learning curve step by step. To draw a learning curve, we first need a machine learning algorithm. For simplicity, I will work with a linear regression algorithm. I will move a bit faster here and not explain every step, since I am assuming you are familiar with machine learning algorithm development. If you need a refresher on how to develop a linear regression algorithm, please check this article first:
First, import the packages and the dataset. The dataset I am using here is taken from Andrew Ng's machine learning course on Coursera. In this dataset, the X values and y values are organized in separate sheets of an Excel file. The X and y values of the cross-validation data are organized in two other sheets of the same Excel file. I provided the link to the dataset at the end of this article. Please feel free to download the dataset and practice yourself.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the training X values from the 'Xval' sheet
file = pd.ExcelFile('dataset.xlsx')
df = pd.read_excel(file, 'Xval', header=None)
df.head()
In the same way, import the y-values for the training set:
y = pd.read_excel(file, 'yval', header=None)
y.head()
Let's quickly develop the linear regression algorithm. Define the hypothesis and the cost function.
def hypothesis(theta, X):
    return theta[0] + theta[1]*X

def cost_calc(theta, X, y):
    # Mean squared error cost: (1 / (2m)) * sum((h(x) - y)^2),
    # where m is the number of examples in the set being evaluated
    return (1 / (2*len(y))) * np.sum((hypothesis(theta, X) - y)**2)
Now, we will define gradient descent to optimize the parameters.
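As a reminder, these are the standard gradient descent update rules the code below implements (my summary, with m the number of training examples used), applied once per epoch:

\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

\theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}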
def gradient_descent(theta, X, y, epoch, alpha):
    cost = []
    m = len(y)  # number of examples in this training set
    i = 0
    while i < epoch:
        # Update both parameters using this epoch's predictions
        hx = hypothesis(theta, X)
        theta[0] -= alpha * np.sum(hx - y) / m
        theta[1] -= alpha * np.sum((hx - y) * X) / m
        cost.append(cost_calc(theta, X, y))
        i += 1
    return theta, cost
The linear regression algorithm is done. We need a method to predict the output:
def predict(theta, X, y, epoch, alpha):
    theta, cost = gradient_descent(theta, X, y, epoch, alpha)
    return hypothesis(theta, X), cost, theta
Now, initialize the parameters as zeros and use the predict function to predict the output variable.
theta = [0, 0]
y_predict, cost, theta = predict(theta, df[0], y[0], 1400, 0.001)
The updated theta values are: [10.724868115832654, 0.3294833798797125]
Now, plot the predicted output and the original output against df in the same figure:
plt.figure()
plt.scatter(df, y)
plt.scatter(df, y_predict)
Looks like the algorithm is working well.
Draw a Learning Curve
Now we can draw a learning curve. First, let's import the X and y values for our cross-validation dataset. As I mentioned earlier, they are organized in separate sheets of the same Excel file.
# Load the cross-validation X and y values
file = pd.ExcelFile('dataset.xlsx')
cross_val = pd.read_excel(file, 'X', header=None)
cross_val.head()

cross_y = pd.read_excel(file, 'y', header=None)
cross_y.head()
For this purpose, I want to modify the gradient_descent function a little bit. In our previous gradient_descent function, we calculated the cost in each iteration. I did that because it is good practice in traditional machine learning algorithm development. But for the learning curve, we do not need the cost in each iteration. So, to save running time, I will skip the cost calculation in each epoch here and return only the updated parameters.
def grad_descent(theta, X, y, epoch, alpha):
    # Same as gradient_descent, but without tracking the cost per epoch
    m = len(y)
    i = 0
    while i < epoch:
        hx = hypothesis(theta, X)
        theta[0] -= alpha * np.sum(hx - y) / m
        theta[1] -= alpha * np.sum((hx - y) * X) / m
        i += 1
    return theta
As I discussed earlier, to develop a learning curve we need to train the learning algorithm on different subsets of the training data. Our training dataset has 21 examples. I will train the algorithm using just one example, then two, then three, all the way up to 21. So, we will train the algorithm 21 times, on 21 subsets of the training data. We will also keep track of the cost function for each subset. Please look closely at the code; it will make this clearer.
j_tr = []
theta_list = []
for i in range(1, len(df) + 1):
    theta = [0, 0]
    # Train on the first i examples only, and record the learned parameters
    theta_list.append(grad_descent(theta, df[0][:i], y[0][:i], 1400, 0.001))
    # Record the training cost on that same subset
    j_tr.append(cost_calc(theta, df[0][:i], y[0][:i]))
theta_list
Here are the training parameters for each subset of training data:
Here is the cost for each training subset:
Look at the cost for each subset. When the training set had only 1 or 2 examples, the cost was zero or almost zero. As we kept increasing the training data, the cost went up, which was expected. Now, use the parameters above for all the subsets of training data to calculate the cost on the cross-validation data:
j_val = []
for theta in theta_list:
    # Evaluate each trained parameter set on the full cross-validation data
    j_val.append(cost_calc(theta, cross_val[0], cross_y[0]))
j_val
In the beginning, the cost was very high because the parameters came from too few training examples. But as the parameters improved with more training data, the cross-validation error kept going down. Let's plot the training error and the cross-validation error in the same figure:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure()
# x-axis: number of training examples; y-axis: cost
plt.scatter(range(1, len(df) + 1), j_tr)
plt.scatter(range(1, len(df) + 1), j_val)
This is our learning curve.
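For comparison, here is a minimal sketch (my addition, not from the original article) of producing an equivalent curve with scikit-learn's ready-made learning_curve helper, assuming df and y are loaded as above:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# scikit-learn maximizes scores, so MSE comes back negated
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(),
    df[0].to_numpy().reshape(-1, 1), y[0].to_numpy(),
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
    scoring='neg_mean_squared_error')

plt.scatter(sizes, -train_scores.mean(axis=1))
plt.scatter(sizes, -val_scores.mean(axis=1))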
Drawing Decisions From the Learning Curve
The learning curve above looks nice, and it behaves the way we expected. In the beginning, the training error was too small and the validation error was too high. Slowly, they came to overlap completely. That is perfect! But in real life, it does not happen very often. Most machine learning algorithms do not work perfectly the first time. Almost always, they suffer from some problem that we need to fix. Here I will discuss some of those issues.
We may find our learning curve looks like this:
If there is a significant gap between the training error and the validation error, that indicates a high variance problem, also called overfitting. Getting more training data, selecting a smaller set of features, or both may fix this problem.
If a learning curve looks like this, it means the training error starts too small and the validation error starts too high. Slowly, the training error goes higher and the validation error goes lower, but at some point they become parallel. You can see from the picture that past that point, even with more training data, the cross-validation error does not go down anymore. In this case, getting more training data will not improve the machine learning algorithm. This indicates that the learning algorithm is suffering from a high bias problem. In this case, adding more features may help.
Fixing a Learning Algorithm
Assume we are implementing linear regression, but the algorithm is not working as expected. What should we do now?
First, draw a learning curve as I demonstrated here. If you detect a high variance problem, select a smaller set of features based on the importance of the features (see the sketch below). If that helps, it will save some time. If not, try getting more training data.
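As a minimal sketch of what feature selection could look like (this example and its synthetic data are my additions, since this article's dataset has only one feature), scikit-learn's SelectKBest can keep only the most predictive columns:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 5))  # 5 candidate features
# Only the first two features actually drive the target
y_full = 3 * X_full[:, 0] - 2 * X_full[:, 1] + rng.normal(scale=0.1, size=100)

selector = SelectKBest(score_func=f_regression, k=2)  # keep the 2 best features
X_reduced = selector.fit_transform(X_full, y_full)
print(X_reduced.shape)  # (100, 2)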
If you detect a high bias problem from the learning curve, you already know that adding features is a possible solution. You may even try adding some polynomial features. That often helps and saves a lot of time.
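For example, here is a quick sketch (my addition, not from the original) of generating polynomial features with scikit-learn:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 6).reshape(-1, 1)  # a single feature column
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)      # columns: x, x^2
print(X_poly)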
If you are implementing an algorithm with a regularization term lambda, try decreasing lambda if the algorithm is suffering from high bias, and try increasing lambda if it is suffering from a high variance problem. Here is an article that explains the relationship of the regularization term with bias and variance in detail:
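To make the effect of lambda concrete, here is a minimal sketch of how it could enter this article's own cost function (the name cost_calc_regularized and the lam parameter are my additions; conventionally the intercept theta[0] is not penalized):

def cost_calc_regularized(theta, X, y, lam):
    # MSE cost plus an L2 penalty on the slope parameter
    m = len(y)
    mse = (1 / (2*m)) * np.sum((hypothesis(theta, X) - y)**2)
    penalty = (lam / (2*m)) * theta[1]**2
    return mse + penalty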
We may also come across this bias or variance problem with a neural network. For a high bias or underfitting problem, we need to increase the number of neurons or the number of hidden layers. To address a high variance or overfitting problem, we should decrease the number of neurons or the number of hidden layers. We can even draw a learning curve using different numbers of neurons, as sketched below.
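Here is a hedged sketch (my addition, with synthetic data) of comparing validation error across hidden-layer sizes using scikit-learn's MLPRegressor:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for n_neurons in (2, 8, 32):
    model = MLPRegressor(hidden_layer_sizes=(n_neurons,), max_iter=5000, random_state=0)
    model.fit(X_tr, y_tr)
    # Lower validation MSE suggests a better bias/variance trade-off
    print(n_neurons, mean_squared_error(y_val, model.predict(X_val)))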
Thank you so much for reading this article. I hope this was helpful.
Here is the dataset used in this article:
Translated from: https://medium.com/towards-artificial-intelligence/learning-curve-to-detect-the-bug-in-a-machine-learning-algorithm-8cc6ade95ac8