Building and Evaluating Classification ML Models
There are different types of problems in machine learning. Some fall under regression (having continuous targets) while others fall under classification (having discrete targets). Some might not have a target at all; there, you are simply trying to learn the characteristics of the data by distinguishing data points based on their inherent features and grouping them into clusters.
However, this article is not about the different areas of machine learning but about a small yet important thing which, if not tended to carefully, can wreak "who knows what" on your operationalized classification models and, eventually, the business. So the next time someone at work tells you their model is giving ~93.23% accuracy, do not fall for it before asking the right questions.
So, how do we know what are the right questions?
That’s a good question. Let us try to answer it by studying how to build and evaluate a classification model the right way. Everyone who has been studying machine learning is aware of the frequently used classification metrics, but only a few know which one is right for evaluating the performance of their classification model.
So, to enable you to ask the right questions, we will go through the following concepts in detail (for a classification model):
Data Distribution (Training, Validation, Test)
Handling Class Imbalance
The Right Choice of Metric for Model Evaluation
Data Distribution
While splitting data into training, validation, and test sets, always keep in mind that all three must be representative of the same population. For example, take the MNIST digit classification dataset, in which all the digit images are greyscale (black and white): you train on it and achieve a validation accuracy of almost 90%, but your test data contains digit images in various colors, not just black and white. Now you have a problem. No matter what you do, there will always be some data bias; you cannot get rid of it totally, but what you can do is maintain uniformity across your validation and test sets. This is an example of a difference in the distribution of the validation and test sets.
The Right Strategy to Split Data
Your test dataset MUST always represent the real-world data distribution. For example, in a binary classification problem where you are supposed to detect patients positive for a rare disease (class 1), and 6% of the entire dataset consists of positive cases, your test data should have almost the same proportion. Make sure you follow the same distribution. This is not just the case for classification models; it holds for every type of ML modeling problem.
The Right Sequence for Training, Validation & Test Split
The test dataset should be extracted first, without any data leakage into the leftover data; the validation data must then follow the distribution of the test data, and whatever remains after these two splits goes into training. Hence, the right sequence for dividing the entire dataset into training, validation, and test sets is to carve out the test, validation, and training sets in exactly that order.
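A sketch of this test-first sequence, using a hypothetical pure-Python helper (`stratified_split` is not a library function, just an illustration): carve the test set out first with stratified sampling so it mirrors the population, then take the validation set from the remainder the same way; whatever is left becomes training data.

```python
import random

def stratified_split(labels, fraction, seed=0):
    """Split indices into (selected, remaining), preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    selected, remaining = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = round(len(idxs) * fraction)
        selected.extend(idxs[:k])
        remaining.extend(idxs[k:])
    return selected, remaining

# 6% positives, mirroring the rare-disease example (60 positive, 940 negative)
labels = [1] * 60 + [0] * 940

# 1) Test set first (10% here), 2) validation from what is left, 3) train = the rest
test_idx, rest = stratified_split(labels, 0.10, seed=1)
val_sel, train_sel = stratified_split([labels[i] for i in rest], 0.10, seed=2)
val_idx = [rest[i] for i in val_sel]
train_idx = [rest[i] for i in train_sel]

# The test set keeps the 6% positive rate of the full dataset
print(sum(labels[i] for i in test_idx) / len(test_idx))  # 0.06
```

In practice you would use a library routine (for example, scikit-learn's `train_test_split` with its `stratify` argument) rather than hand-rolling this, but the order of operations stays the same.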
The Right Proportion
A 70–20–10 split is the convention in the machine learning community, but only when you have an average amount of data. If you are working on, say, an image classification problem with ~10 million images, a 70–20–10 split would be a bad idea: the dataset is so huge that even 1–2% of it is enough to validate your model. Hence, a 96–2–2 split is preferable, because you do not want to add unnecessary overhead to validation and test by increasing their size when the same representation of the distribution can be achieved with 2% of the total data. Also, make sure you do not sample with replacement while making the splits.
Handling Class Imbalance
In any classification problem, what affects the performance of a model most is how much loss each class contributes to the total cost. The more examples a class has, the larger its total loss contribution: a class's contribution to the total cost is directly proportional to the number of examples belonging to that class. As a result, the classifier concentrates on correctly classifying the instances that contribute most to the total cost of the loss function, i.e. the instances of the majority class.
Following are the ways in which we can tackle class imbalance:
Weighted Loss
Resampling
Weighted Loss
In binary cross-entropy loss, we have the following loss function:
Binary cross-entropy loss: L(p, y) = -[ y*log(p) + (1 - y)*log(1 - p) ]

The model outputs the probability p that the given example belongs to the positive (y=1) class. Based on the loss function above, a loss value is computed per example, and the total cost is the average loss across all examples. Let us run a simple simulation to understand this better with a short Python script. We generate 100 ground-truth labels, 25 of which belong to the positive (y=1) class and the rest negative (y=0), to build class imbalance into our tiny experiment. We also generate, for every example, a random predicted probability of it belonging to the positive class.
import numpy as np
import random

# Generating ground-truth labels and predicted probabilities
truth, probs = [], []
for i in range(100):
    # To maintain class imbalance
    if i < 25:
        truth.append(1)
    else:
        truth.append(0)
    probs.append(round(random.random(), 2))

print("Total Positive Example Count: ", sum(truth))
print("Total Negative Example Count: ", len(truth) - sum(truth))
print("Predicted Probability Values: ", probs)

Output:

Total Positive Example Count: 25
Total Negative Example Count: 75
Predicted Probability Values: [0.84, 0.65, 0.11, 0.21, 0.31, 0.05, 0.44, 0.83, 0.19, 0.61, 0.28, 0.36, 0.46, 0.79, 0.74, 0.58, 0.65, 0.8, 0.05, 0.39, 0.08, 0.45, 0.4, 0.03, 0.41, 0.75, 0.46, 0.49, 0.94, 0.57, 0.38, 0.7, 0.07, 0.91, 0.85, 0.91, 0.72, 0.28, 0.0, 0.55, 0.61, 0.55, 0.81, 0.98, 0.9, 0.36, 0.65, 0.91, 0.26, 0.1, 0.99, 0.48, 0.34, 0.96, 0.68, 0.21, 0.28, 0.37, 0.8, 0.27, 0.87, 0.93, 0.03, 0.95, 0.25, 0.63, 0.2, 0.45, 0.05, 0.7, 0.91, 0.85, 0.56, 0.61, 0.4, 0.35, 0.6, 0.27, 0.08, 0.85, 0.14, 0.82, 0.22, 0.41, 0.85, 0.72, 0.91, 0.5, 0.55, 0.89, 0.39, 0.92, 0.24, 0.07, 0.52, 0.88, 0.01, 0.01, 0.01, 0.31]
Now that we have ground-truth labels and predicted probabilities, we can use the loss function above to compute the total loss contribution of each class. A really small number (1e-7) is added to the predicted probabilities before taking the log to avoid an error due to an undefined value [log(0) is undefined].
# Calculating plain binary cross-entropy loss
pos_loss, neg_loss = 0, 0
for i in range(len(truth)):
    # Applying the binary cross-entropy loss function
    if truth[i] == 1:
        pos_loss += -1 * np.log(probs[i] + 1e-7)
    else:
        neg_loss += -1 * np.log(1 - probs[i] + 1e-7)

print("Positive Class Loss: ", round(pos_loss, 2))
print("Negative Class Loss: ", round(neg_loss, 2))

Output:

Positive Class Loss: 29.08
Negative Class Loss: 83.96
As we can see, the total losses of the two classes differ hugely, with the negative class leading the race of loss contribution, so while minimizing the loss the algorithm will technically focus more on the negative class to decrease it radically. This is where we rebalance what the model sees by weighting each class's contribution to the total loss, using the following weighted loss function:
Weighted binary cross-entropy loss: L(p, y) = -[ Wp*y*log(p) + Wn*(1 - y)*log(1 - p) ]

Here, Wp and Wn are the weights assigned to the positive- and negative-class losses respectively, and can be calculated as follows:
Wp = (total number of negative (y=0) examples) / (total number of examples)

Wn = (total number of positive (y=1) examples) / (total number of examples)
Now, let us calculate weighted loss by adding the weights to the calculation:
# Calculating weighted binary cross-entropy loss
pos_loss, neg_loss = 0, 0

# Wp (weight for the positive class)
wp = (len(truth) - sum(truth)) / len(truth)
# Wn (weight for the negative class)
wn = sum(truth) / len(truth)

for i in range(len(truth)):
    # Applying the same function with class weights
    if truth[i] == 1:
        pos_loss += -wp * np.log(probs[i] + 1e-7)
    else:
        neg_loss += -wn * np.log(1 - probs[i] + 1e-7)

print("Positive Class Loss: ", round(pos_loss, 2))
print("Negative Class Loss: ", round(neg_loss, 2))

Output:

Positive Class Loss: 21.81
Negative Class Loss: 20.99
Amazing, isn’t it? We managed to significantly reduce the difference in loss contribution between the two classes by assigning the right weights.
Resampling
This is yet another technique with which you can counter class imbalance, but it should not be the first one you reach for. Resampling can be done in three ways:
Either by oversampling the minority class
Or by undersampling the majority class
Or both, by the right amount
Oversampling can be achieved either by randomly sampling the minority class with replacement or by synthetically generating more examples using techniques such as SMOTE. Oversampling helps only up to a limit because, beyond a certain amount, you are merely duplicating the information contained in the data: it might give you ideal loss contributions from both classes, but the model will fail at validation and test time. If, on the other hand, you have a massive amount of data along with the imbalance, you should undersample the majority class without replacement.
Sometimes, when there is an average amount of data and the class imbalance is not huge, people use both techniques at once: they oversample the minority class and undersample the majority class by a calculated amount to achieve balance.
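Both resampling directions can be sketched with the standard library alone (random oversampling with replacement, random undersampling without); this is illustration only, and in practice a library such as imbalanced-learn handles the synthetic variants like SMOTE:

```python
import random

rng = random.Random(42)

# Imbalanced toy data: 25 positives, 75 negatives (as in the loss experiment)
minority = [(1, i) for i in range(25)]
majority = [(0, i) for i in range(75)]

# Oversampling: draw minority examples WITH replacement up to the majority size
oversampled = minority + rng.choices(minority, k=len(majority) - len(minority))
balanced_over = majority + oversampled
print(sum(1 for y, _ in balanced_over if y == 1))  # 75 positives, 75 negatives

# Undersampling: draw a majority subset WITHOUT replacement down to the minority size
undersampled = rng.sample(majority, k=len(minority))
balanced_under = minority + undersampled
print(len(balanced_under))  # 50 examples, 25 per class
```

Note the asymmetry: oversampling duplicates minority rows (the information-duplication caveat above), while undersampling discards majority rows, which is only affordable when data is plentiful.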
Now you understand: when somebody comes to you and says "I have got ~93.23% accuracy", you should think and ask about the class proportions in the data and the type of loss function used. Then you should wonder whether measuring accuracy alone is the right way to go, or whether there is something more!
The Right Choice of Metric
There is always something more, at least when you are working on a machine learning model, but knowing when you want more is only possible when you have something to compare against. A benchmark! Once you have a benchmark, you know how much improvement you want.
But to improve performance, you need to know which metric is the right indicator of performance for the business problem you are trying to solve. Suppose you are working on a tumor detection problem where the objective is to determine whether a tumor is malignant (y=1) or benign (y=0). In the real world, benign cases outnumber malignant ones, so when you get the data you will have a good amount of class imbalance (unless, of course, you are really lucky), and accuracy as a metric is out of the question. The next question is: what matters more, detecting that a patient has a malignant tumor or a benign one? This is a business decision, and you should always consult domain experts (in this case, expert doctors) to understand the business problem by asking such questions. If we care most about detecting malignant tumors effectively, tolerating a few false positives (ground truth: benign, model prediction: malignant) but keeping false negatives (ground truth: malignant, model prediction: benign) to a minimum, then recall should be our metric of choice; if it is the other way around (which it could never be in this particular case), precision should be our choice.
Sometimes, in a business setting, you need effective classification on both classes, and hence you want to optimize the F1 score. To handle this trade-off, we should work on maximizing the area under the precision–recall curve as much as possible.
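To make these metrics concrete, here is a small illustration computing precision, recall, F1, and accuracy from raw confusion-matrix counts (the numbers are hypothetical, not from a real model):

```python
# Hypothetical confusion-matrix counts for the malignant-tumor example
tp, fp, fn, tn = 40, 10, 5, 945

precision = tp / (tp + fp)  # of predicted malignant, how many truly are
recall = tp / (tp + fn)     # of truly malignant, how many we caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# 0.8 0.889 0.842 0.985
```

Accuracy looks great (~0.985) purely because negatives dominate; recall (~0.889) is the number that reveals 5 malignant tumors were missed, which is exactly why the metric must match the business question.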
Also, results must be conveyed as confidence intervals, with upper and lower bounds on the metric, to get a fair idea of the model's behavior on the population from all the experiments conducted over various samples.
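One common way to obtain such an interval is the bootstrap: resample the test-set predictions with replacement many times and take percentiles of the metric. A sketch with illustrative per-example outcomes (not real model results):

```python
import random

rng = random.Random(0)

# Illustrative per-example correctness flags on a test set (1 = correct prediction)
outcomes = [1] * 90 + [0] * 10  # point-estimate accuracy: 0.90

boot_accs = []
for _ in range(2000):
    sample = rng.choices(outcomes, k=len(outcomes))  # resample WITH replacement
    boot_accs.append(sum(sample) / len(sample))
boot_accs.sort()

# 95% confidence interval from the 2.5th and 97.5th percentiles
lower = boot_accs[int(0.025 * len(boot_accs))]
upper = boot_accs[int(0.975 * len(boot_accs))]
print(f"accuracy ~= 0.90, 95% CI ~= [{lower:.2f}, {upper:.2f}]")
```

Reporting "90% accuracy, 95% CI roughly [0.84, 0.95]" says far more about how the model will behave on the population than the point estimate alone.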
To summarize it all, the following are the major takeaways from this article:
- Data distribution is crucial when building a classification model; always start by getting the test distribution right first, then validation, then training, in that order.
- Class imbalance should be handled properly to avoid really bad results on live data.
- Only when you select the right metric for evaluating your model can you assess its performance correctly. Many factors, ranging from business expertise to the technicalities of the model itself, help us decide on the right metric.
Thank you for reading the article.
Translated from: https://towardsdatascience.com/building-and-evaluating-classification-ml-models-9c3f45038ef4