Are Outliers Ruining Your Machine Learning Predictions? Search for an Optimal Solution
Inside AI
In the world of data, we all love the Gaussian distribution (also known as the normal distribution). In real life, we seldom get normally distributed data. It is usually skewed, has missing data points, or contains outliers.
As I mentioned in my earlier article, the strength of Scikit-learn inadvertently works to its disadvantage. Machine learning developers, especially those with relatively little experience, often apply an inappropriate algorithm for prediction without grasping that algorithm's salient features and limitations. We saw earlier why we should not use the decision tree regression algorithm for a prediction that involves extrapolating beyond the training data.
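To make the extrapolation point concrete, here is a minimal sketch on a toy straight line. The data, the slope of 3 and the query point x = 20 are my own illustrative assumptions and are not taken from the earlier article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy training data on the line y = 3x + 2, with x between 0 and 9.5 (assumed values).
X_train = np.arange(0, 10, 0.5).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + 2.0

# A query point well outside the training range.
X_new = np.array([[20.0]])

tree_pred = DecisionTreeRegressor(random_state=0).fit(X_train, y_train).predict(X_new)[0]
line_pred = LinearRegression().fit(X_train, y_train).predict(X_new)[0]

print("Decision tree:", tree_pred)
print("Linear regression:", line_pred)
```

A tree can only return values stored in its leaves, so its prediction cannot exceed the largest target seen in training, while the linear model continues the trend.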
The success of any machine learning model starts with understanding the dataset on which the model will be trained. It is imperative to understand the data well before starting any modelling. I would even go so far as to say that the prediction accuracy of a model grows with how well we know the data.
Objective
In this article, we will see the effect of outliers on various regression algorithms available in Scikit-learn and learn which regression algorithms are most appropriate in such a situation. We will start with a few techniques to understand the data and then train several Scikit-learn algorithms on it. Finally, we will compare the coefficients learned by each algorithm and identify the potentially best algorithms to apply when the data contains outliers.
Training Dataset
The training data consists of 200,000 records with 3 features (independent variables) and 1 target value (dependent variable). The true coefficients of feature 1, feature 2 and feature 3 are 77.74, 23.34 and 7.63 respectively.
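The Excel file itself is not reproduced here, so if you want to follow along you could generate a stand-in dataset. The sketch below is only an assumption about how such data might look: the column names, the noise level, the way outliers are injected and the number of missing values are all hypothetical. Only the three coefficients and the record count come from the description above.

```python
import numpy as np
import pandas as pd

def make_outlier_dataset(n_rows=200_000, outlier_frac=0.01, n_missing=60, seed=0):
    """Build a hypothetical stand-in for the article's training data."""
    rng = np.random.default_rng(seed)
    true_coef = np.array([77.74, 23.34, 7.63])  # coefficients quoted in the article

    X = rng.normal(size=(n_rows, 3))
    y = X @ true_coef + rng.normal(scale=1.0, size=n_rows)

    # Assumed outlier mechanism: push a small share of rows to extreme feature values
    # after the target has been computed, so those rows no longer follow the true line.
    n_out = int(n_rows * outlier_frac)
    out_idx = rng.choice(n_rows, size=n_out, replace=False)
    X[out_idx] += rng.normal(loc=0.0, scale=20.0, size=(n_out, 3))

    data = pd.DataFrame(X, columns=["Feature1", "Feature2", "Feature3"])

    # Blank out a few values per feature to mimic missing entries.
    for col in data.columns:
        miss_idx = rng.choice(n_rows, size=n_missing, replace=False)
        data.loc[miss_idx, col] = np.nan

    data["Target"] = y
    return data

# A small sample keeps the demo fast; use the defaults for 200,000 rows.
RawData = make_outlier_dataset(n_rows=2_000, n_missing=5)
print(RawData.shape)
print(RawData.isna().sum())
```

Because the true coefficients are known by construction, any estimator trained on this frame can be judged by how close its coefficients come to them.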
Training Data: 3 Independent Variables and 1 Dependent Variable
Step 1- First, we will import the packages required for data analysis and regression.
We will be comparing HuberRegressor, LinearRegression, Ridge, SGDRegressor, ElasticNet, PassiveAggressiveRegressor and Linear Support Vector Regression (LinearSVR), hence we will import the respective classes.
Most of the time, a few data points are missing in the training data. If a particular feature has a high proportion of null values, it may be better not to use that feature at all. Otherwise, if only a few data points are missing for a feature, we can either drop those records from the training data or replace the missing values with the mean, the median or a constant. We will import SimpleImputer to fill the missing values.
We will import the variance inflation factor function to measure the severity of multicollinearity among the features. We will need Matplotlib and seaborn to draw various plots for analysis.
from sklearn.linear_model import HuberRegressor, LinearRegression, Ridge, SGDRegressor, ElasticNet, PassiveAggressiveRegressor
from sklearn.svm import LinearSVR
import pandas as pd
from sklearn.impute import SimpleImputer
from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step 2- In the code below, the training data containing 200,000 records is read from an Excel file into a pandas DataFrame called "RawData". The independent variables are saved into a new DataFrame called "Data".
RawData = pd.read_excel("Outlier Regression.xlsx")
Data = RawData.drop(["Target"], axis=1)
Step 3- Now we will start by getting a sense of the training data. In my opinion, a heatmap is a good way to understand the relationships between the different features.
sns.heatmap(Data.corr(), cmap="YlGnBu", annot=True)
plt.show()
It shows that none of the independent variables (features) is closely correlated with another. If you would like to learn more about the approach and selection criteria of independent variables for regression algorithms, please read my earlier article on the topic.
How to identify the right independent variables for Machine Learning Supervised Algorithms?
Step 4- After getting a sense of the correlation among the features in the training data, we will next look at the minimum, maximum, median and other summary values of each feature. This will help us ascertain whether there are any outliers in the training data and how extreme they are. The code below draws boxplots for all the features.
sns.boxplot(data=Data, orient="h", palette="Set2")
plt.show()
If you do not know how to read a box plot, please refer to the Wikipedia article on it. The feature values are spread across a wide range, with many points lying far from the median. This confirms the presence of outliers in the training dataset.
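A box plot gives a visual impression, and a number can back it up. The sketch below counts points outside the usual box plot whiskers, i.e. beyond 1.5 times the interquartile range from the quartiles. The helper name count_iqr_outliers and the toy frame are hypothetical and only illustrate the rule.

```python
import pandas as pd

def count_iqr_outliers(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()

# Toy data: Feature1 carries one obvious outlier, Feature2 carries none.
demo = pd.DataFrame({
    "Feature1": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "Feature2": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})
outlier_counts = count_iqr_outliers(demo)
print(outlier_counts)
```

On the real data you would call the helper on the feature DataFrame and compare the counts with what the box plot suggests.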
Step 5- We will check whether there are any null values in the training data and take any action required before going anywhere near modelling.
Data.info()
Here we can see that there are 200,000 records in total in the training data and that all three features have a few values missing. For example, feature 1 has 60 values (200,000 - 199,940) missing.
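Instead of subtracting the non-null counts by hand, the missing values can also be counted directly. This is a small sketch on a made-up frame; the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps.
demo = pd.DataFrame({
    "Feature1": [1.0, np.nan, 3.0, 4.0],
    "Feature2": [np.nan, np.nan, 7.0, 8.0],
})

missing = demo.isna().sum()        # number of missing values per column
missing_share = demo.isna().mean() # fraction of missing values per column

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

The share column helps with the earlier decision of whether to drop a feature entirely or to impute its gaps.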
Step 6- We use SimpleImputer to fill the missing values of a feature with the mean of its other records. In the code below, we use strategy="mean" for this. Scikit-learn provides several strategies to replace missing values, viz. mean, median, most frequent and constant. I suggest you explore the effect of each strategy on the trained model as a learning exercise.
In the code below, we create an instance of SimpleImputer with the strategy "mean" and fit it to the training data to calculate the mean of each feature. The transform method is then used to fill the missing values with that mean.
imputer = SimpleImputer(strategy="mean")
imputer.fit(Data)
TransformData = imputer.transform(Data)
X=pd.DataFrame(TransformData, columns=Data.columns)
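As a starting point for the suggested exercise, the sketch below fills one missing value in a tiny column under each of the four strategies. The column values and the fill_value of 0.0 are my own assumptions. The column deliberately contains one outlier so that the difference between the strategies becomes visible.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with an outlier (100.0) and one missing value in the last row.
column = np.array([[1.0], [2.0], [2.0], [3.0], [100.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    # Only the "constant" strategy needs an explicit fill value.
    kwargs = {"fill_value": 0.0} if strategy == "constant" else {}
    imputer_demo = SimpleImputer(strategy=strategy, **kwargs)
    filled[strategy] = float(imputer_demo.fit_transform(column)[-1, 0])
    print(strategy, "->", filled[strategy])
```

The mean is pulled towards the outlier while the median is not, which is worth keeping in mind when the data is known to contain extreme values.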
Step 7- It is good practice to check the features once more after replacing the missing values to ensure that no null (blank) values remain in our training dataset.
X.info()
We can see that all the features now have non-null, i.e. non-blank, values for all 200,000 records.
Step 8- Before we start training the algorithms, let us check the variance inflation factor (VIF) of the independent variables. VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. I encourage you to read the Wikipedia page on the variance inflation factor to gain a good understanding of it.
vif = pd.DataFrame()
vif["features"] = X.columns
vif["vif_Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
In the above code, we calculate the VIF of each independent variable and print it. As a rule of thumb, we should aim for a VIF below 10 for each independent variable. We saw earlier in the heatmap that none of the variables is highly correlated, and the VIF values of the features reflect the same.
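To see what the VIF number means, it can help to compute it by hand. For feature j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R2 obtained by regressing feature j on the remaining features. The sketch below does this with NumPy only. The function name vif_by_hand and the synthetic data are hypothetical. Note that this version always includes an intercept in the auxiliary regression. As far as I understand, the statsmodels function uses exactly the columns it is given and does not add a constant itself, so its numbers can differ. Please verify that against the statsmodels documentation.

```python
import numpy as np

def vif_by_hand(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        # With an intercept the residuals have zero mean, so this ratio is SS_res / SS_tot.
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)

# Three independent features: all VIF values should be close to 1.
independent = rng.normal(size=(5_000, 3))
vif_independent = vif_by_hand(independent)

# Make the third feature almost a copy of the first: their VIF values become large.
collinear = independent.copy()
collinear[:, 2] = collinear[:, 0] + 0.1 * rng.normal(size=5_000)
vif_collinear = vif_by_hand(collinear)

print(np.round(vif_independent, 2))
print(np.round(vif_collinear, 2))
```

The second case shows why a value above roughly 10 is treated as a warning sign: the coefficient of a nearly duplicated feature can no longer be estimated reliably.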
Step 9- We will extract the target, i.e. the dependent variable values, from the RawData DataFrame and save them in a pandas Series.
y = RawData["Target"].copy()
Step 10- We will evaluate the performance of various regressors, viz. HuberRegressor, LinearRegression, Ridge and others, on the outlier dataset. In the code below, we create instances of the various regressors.
Huber = HuberRegressor()
Linear = LinearRegression()
SGD = SGDRegressor()
RidgeReg = Ridge()  # named RidgeReg so that the Ridge class is not overwritten
SVR = LinearSVR()
Elastic = ElasticNet(random_state=0)
PassiveAggressive = PassiveAggressiveRegressor()  # avoids shadowing the class name
Step 11- We declare a list with the regressor instances so that we can pass them in sequence in a for loop later.
estimators = [Linear, SGD, SVR, Huber, RidgeReg, Elastic, PassiveAggressive]
Step 12- Finally, we will train the models in sequence on the training dataset and print the feature coefficients calculated by each model.
for estimator in estimators:
    estimator.fit(X, y)
    print(str(estimator) + " Coefficients:", np.round(estimator.coef_, 2))
    print("**************************")
We can observe a wide range of coefficients calculated by the different models, depending on their optimisation and regularisation. The calculated coefficient of feature 1 varies from 29.31 to 76.88.
Because of a few outliers in the training dataset, some models, such as linear and ridge regression, estimated coefficients nowhere near the true coefficients. The Huber regressor is quite robust to outliers: its loss function is not heavily influenced by them, yet, unlike TheilSenRegressor and RANSACRegressor, it does not ignore their effect completely. Linear SVR also offers more options in the choice of penalty and loss function, and it performed better than the other models.
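The contrast between ordinary least squares and the Huber regressor can be reproduced on a small synthetic example. Everything in the sketch below apart from the three true coefficients is an assumption of mine: the sample size, the noise, and in particular the way the outliers are created (the 5% of rows with the largest feature 1 values get their target shifted by -500). It is not the article's dataset, so the numbers will differ from those quoted above.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(42)
true_coef = np.array([77.74, 23.34, 7.63])  # coefficients quoted in the article

X_demo = rng.normal(size=(2_000, 3))
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=2_000)

# Assumed contamination: shift the target of the 100 rows with the largest feature 1.
outlier_rows = np.argsort(X_demo[:, 0])[-100:]
y_demo[outlier_rows] -= 500.0

ols = LinearRegression().fit(X_demo, y_demo)
huber = HuberRegressor(max_iter=1000).fit(X_demo, y_demo)

print("True  :", true_coef)
print("OLS   :", np.round(ols.coef_, 2))
print("Huber :", np.round(huber.coef_, 2))
```

The squared loss lets the shifted rows dominate the fit, whereas the Huber loss grows only linearly for large residuals, so the same rows pull much less on the coefficients.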
Learning action for you- We trained different models on a training dataset containing outliers and then compared the estimated coefficients with the actual coefficients. I encourage you to follow the same approach and compare the prediction metrics, viz. R2 score, mean squared error (MSE) and root mean squared error (RMSE), of the different models trained on the outlier dataset.
Hint: You may be surprised by the R2 (coefficient of determination) scores of the models when you compare them with the coefficient accuracy we have seen in this article. If you get stuck at any point, feel free to reach out to me.
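For the exercise, a small helper along the following lines could collect the three metrics for any fitted model. The function name regression_report and the clean synthetic data are hypothetical. On the real data you would pass each trained estimator together with the feature matrix and the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(model, X, y):
    """Return R2, MSE and RMSE of an already fitted model on (X, y)."""
    predictions = model.predict(X)
    mse = mean_squared_error(y, predictions)
    return {"r2": r2_score(y, predictions), "mse": mse, "rmse": float(np.sqrt(mse))}

# Clean synthetic data, used here only to show the helper in action.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([77.74, 23.34, 7.63]) + rng.normal(size=500)

report = regression_report(LinearRegression().fit(X_demo, y_demo), X_demo, y_demo)
print(report)
```

Keep in mind that these scores are computed on data that still contains the outliers, so a model with accurate coefficients is not guaranteed to obtain the best score.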
Key Takeaway
As mentioned in my earlier article, I keep stressing that the main focus for us machine learning practitioners should be to consider the data, the prediction objective, and the strengths and limitations of the algorithms before starting the modelling. Every additional minute we spend understanding the training data translates into better prediction accuracy with the right algorithm. We don't want to use a hammer to loosen a screw or a screwdriver to drive a nail into the wall.
If you want to learn more about a structured approach to identifying the right independent variables for supervised machine learning algorithms, please refer to my article on this topic.
"""Full Code"""
from sklearn.linear_model import HuberRegressor, LinearRegression, Ridge, SGDRegressor, ElasticNet, PassiveAggressiveRegressor
from sklearn.svm import LinearSVR
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np

# Read the training data and separate the features from the target
RawData = pd.read_excel("Outlier Regression.xlsx")
Data = RawData.drop(["Target"], axis=1)

# Correlation heatmap and boxplots of the features
sns.heatmap(Data.corr(), cmap="YlGnBu", annot=True)
plt.show()
sns.boxplot(data=Data, orient="h", palette="Set2")
plt.show()

# Check for missing values and summary statistics
Data.info()
print(Data.describe())

# Fill the missing values with the mean of each feature
imputer = SimpleImputer(strategy="mean")
imputer.fit(Data)
TransformData = imputer.transform(Data)
X = pd.DataFrame(TransformData, columns=Data.columns)
X.info()

# Variance inflation factor of each feature
vif = pd.DataFrame()
vif["features"] = X.columns
vif["vif_Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)

y = RawData["Target"].copy()

# Create the regressors (instance names differ from the class names to avoid shadowing)
Huber = HuberRegressor()
Linear = LinearRegression()
SGD = SGDRegressor()
RidgeReg = Ridge()
SVR = LinearSVR()
Elastic = ElasticNet(random_state=0)
PassiveAggressive = PassiveAggressiveRegressor()
estimators = [Linear, SGD, SVR, Huber, RidgeReg, Elastic, PassiveAggressive]

# Train each model and print its coefficients
for estimator in estimators:
    estimator.fit(X, y)
    print(str(estimator) + " Coefficients:", np.round(estimator.coef_, 2))
    print("**************************")
Translated from: https://towardsdatascience.com/are-outliers-ruining-your-machine-learning-predictions-search-for-an-optimal-solution-c81313e994ca