Linear Regression in Python: Sklearn vs Excel
Inside AI
Around 13 years ago, David Cournapeau started Scikit-learn as a Google Summer of Code project. Over time, Scikit-learn became one of the best-known machine learning libraries in Python. It offers several classification, regression and clustering algorithms, and its key strength, in my opinion, is seamless integration with NumPy, Pandas and SciPy.
In this article, I will compare the prediction accuracy of multiple linear regression in Scikit-learn with that of Excel. Scikit-learn offers many parameters (known as the hyper-parameters of an estimator) to fine-tune the training of a model and increase the accuracy of prediction. Excel offers little to tune in its regression algorithm. For a fair comparison, I will train the sklearn regression model with default parameters.
Objective
This comparison aims to assess the prediction accuracy of linear regression in Excel and Scikit-learn. I will also touch briefly on the process of performing linear regression in Excel.
Sample Data File
For the comparison, we will use 100,000 historical readings of precipitation, minimum temperature, maximum temperature and wind speed, measured several times a day over 8 years.
We will use the precipitation, minimum temperature and maximum temperature to predict the wind speed. Hence, wind speed is the dependent variable, and the other three are the independent variables.
Figure: Training data (sample)
We will first build a linear regression model in Excel and use it to predict the wind speed. Then we will do the same exercise with Scikit-learn, and finally, we will compare the predicted results.
Figure: Excel ribbon screenshot
To perform linear regression in Excel, we open the sample data file and click the “Data” tab in the Excel ribbon. In the “Data” tab, select the “Data Analysis” option.
Tip: If you do not see the “Data Analysis” option, click File > Options > Add-ins. Select “Analysis ToolPak” and click the “Go” button, as shown below.
Figure: Excel options screenshot
Clicking the “Data Analysis” option opens a pop-up window showing the analysis tools available in Excel. We will select “Regression” and then click “OK”.
Another pop-up window then asks for the dependent and independent value ranges. Enter the cell reference of wind speed (the dependent variable) in the “Input Y Range” field. In “Input X Range”, provide the cell reference for the independent variables, i.e. precipitation, minimum temperature and maximum temperature.
We need to select the “Labels” checkbox, as the first row in our sample data contains the variable names.
On clicking the “OK” button after specifying the data, Excel builds a linear regression model. You can think of it as the training step (the fit method) in Scikit-learn.
Excel does the calculations and presents the results in a readable format. In our example, Excel fitted the linear regression model with an R Square of 0.953. With 100,000 records in the training dataset, Excel performed the linear regression in less than 7 seconds. Along with other statistical information, it also shows the intercept and the coefficients of the independent variables.
Based on the Excel linear regression output, we can put together the following mathematical relationship.
Wind Speed = 2.438 + (Precipitation * 0.026) + (MinTemp * 0.393) + (MaxTemp * 0.395)
We will use this formula to predict the wind speed for the test data set, which the Excel regression model has not seen before.
For example, for the first record of the test data set, Wind Speed = 2.438 + (0.51 * 0.026) + (17.78 * 0.393) + (25.56 * 0.395) = 19.55. (With the rounded coefficients shown in the formula, the sum works out to about 19.54; the small gap presumably comes from rounding the coefficients.)
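The formula above can be applied to any test record with a few lines of Python. This is a minimal sketch that assumes the rounded intercept and coefficients quoted in the article; the function name predict_wind_speed is hypothetical and only serves the illustration.

```python
# Intercept and coefficients as quoted (rounded) from the Excel regression output.
INTERCEPT = 2.438
COEF_PRECIPITATION = 0.026
COEF_MIN_TEMP = 0.393
COEF_MAX_TEMP = 0.395


def predict_wind_speed(precipitation, min_temp, max_temp):
    """Apply the Excel regression equation to one record (hypothetical helper)."""
    return (
        INTERCEPT
        + precipitation * COEF_PRECIPITATION
        + min_temp * COEF_MIN_TEMP
        + max_temp * COEF_MAX_TEMP
    )


# First record of the test data set, using the values from the worked example.
first_prediction = predict_wind_speed(0.51, 17.78, 25.56)
print(first_prediction)
```

In a spreadsheet, the same calculation is a single cell formula copied down the test rows; the Python version is handy when the predictions need to be compared programmatically with those of another model.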
Further, we calculated the residuals of the predictions and plotted them to understand their trend. We can see that in nearly all cases the predicted wind speed is lower than the actual value, and the higher the wind speed, the larger the prediction error.
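The residual check described above can be sketched as follows. The numbers are hypothetical and chosen only to mimic the pattern reported in the article; the sign convention (actual minus predicted) is also an assumption, since the article does not state which one it uses.

```python
import numpy as np

# Hypothetical actual and predicted wind speeds, for illustration only.
actual = np.array([20.0, 25.0, 31.0, 40.0])
predicted = np.array([19.5, 23.8, 28.9, 36.2])

# Assumed convention: residual = actual - predicted,
# so a positive residual means the model under-predicted.
residuals = actual - predicted

# Share of records where the prediction is below the actual value.
share_under_predicted = np.mean(residuals > 0)

# Correlation between actual wind speed and residual:
# a positive value means the error grows with the wind speed.
trend = np.corrcoef(actual, residuals)[0, 1]

print("Residuals:", residuals)
print("Share under-predicted:", share_under_predicted)
print("Correlation of residual with actual wind speed:", trend)
```

Plotting the residuals against the actual values, as the article does, shows the same information visually; the two summary numbers are simply a quick way to confirm the trend without a chart.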
Figure: Actual wind speed vs Excel linear regression residuals (scatter plot)
Let us now delve into linear regression in Scikit-learn.
Step 1- We import the packages we are going to use for our analysis. The values of the individual independent variables span different ranges and do not follow a standard normal distribution, hence we use StandardScaler to standardize the independent variables.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2- Read the training and test data from the Excel files into the Pandas DataFrames Training_data and Test_data respectively.
Training_data = pd.read_excel("Weather.xlsx", sheet_name="Sheet1")
Test_data = pd.read_excel("Weather Test.xlsx", sheet_name="Sheet1")
In this article, I will not focus on preliminary data quality checks such as blank values and outliers, or on the respective correction approaches; I assume the data series contain no such discrepancies.
Please refer to “How to identify the right independent variables for Machine Learning Supervised Algorithms?” for independent variable selection criteria and correlation analysis.
Step 3- In the code below, we declare all columns except “WindSpeed” as the independent variables and only “WindSpeed” as the dependent variable, for both the training and the test data. Please note that we will not use “SourceData_test_dependent” to fit the linear regression but only to compare the predicted values with it.
SourceData_train_independent = Training_data.drop(["WindSpeed"], axis=1)  # Drop the dependent variable from the training dataset
SourceData_train_dependent = Training_data["WindSpeed"].copy()  # Series with only the dependent variable values of the training dataset
SourceData_test_independent = Test_data.drop(["WindSpeed"], axis=1)  # Independent variables of the test dataset
SourceData_test_dependent = Test_data["WindSpeed"].copy()  # Actual wind speed of the test dataset, used only for comparison
Step 4- As the ranges of the independent variables are quite disparate, we scale them to avoid the unintended influence of any one variable. In the code below, the independent training and test variables are scaled and saved to X_train and X_test respectively. We do not need to scale the dependent variable values for either training or testing. The dependent training variable is saved to y_train without scaling.
sc_X = StandardScaler()
X_train = sc_X.fit_transform(SourceData_train_independent.values)  # scale the independent variables
y_train = SourceData_train_dependent  # scaling is not required for the dependent variable
X_test = sc_X.transform(SourceData_test_independent.values)  # reuse the scaling fitted on the training data
y_test = SourceData_test_dependent
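To make the scaling step concrete, here is a small sketch of what StandardScaler does. The readings are hypothetical (three made-up training rows and one test row, in the order precipitation, minimum temperature, maximum temperature) and are not taken from the article's weather file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical readings: precipitation, minimum temperature, maximum temperature.
train = np.array([
    [0.0, 10.0, 20.0],
    [1.0, 15.0, 25.0],
    [2.0, 20.0, 30.0],
])
test = np.array([[1.0, 15.0, 35.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from the training data
test_scaled = scaler.transform(test)        # applies the same mean_ and scale_ to the test data

# The same computation by hand: (x - training mean) / training standard deviation.
manual_test_scaled = (test - train.mean(axis=0)) / train.std(axis=0)

print("Column means after scaling:", train_scaled.mean(axis=0))
print("Column standard deviations after scaling:", train_scaled.std(axis=0))
print("Scaled test row:", test_scaled)
```

Note that the test data is transformed with the mean and standard deviation learned from the training data, which is why the code above calls fit_transform only on the training set and transform on the test set.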
Step 5- Now we feed the independent and dependent training data, i.e. X_train and y_train respectively, to train the linear regression model. We fit the model with default parameters for the reasons mentioned at the start of the article.
reg = LinearRegression().fit(X_train, y_train)
print("The Linear regression score on training data is ", round(reg.score(X_train, y_train), 2))
The linear regression score (R²) on the training data is the same as the one we observed with Excel.
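Because the sklearn model is trained on scaled inputs, its intercept_ and coef_ are not directly comparable with the coefficients in the Excel equation, which are expressed in the original units. The sketch below shows one way to convert them back. It uses synthetic data as a stand-in for the weather file (an assumption), and the variable names raw_coefficients and raw_intercept are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the weather file (not the article's data).
rng = np.random.default_rng(0)
X = rng.uniform(low=[0, 5, 15], high=[5, 20, 35], size=(200, 3))
y = 2.0 + X @ np.array([0.03, 0.4, 0.4]) + rng.normal(scale=0.5, size=200)

scaler = StandardScaler()
reg = LinearRegression().fit(scaler.fit_transform(X), y)

# The fitted model is y = b0 + sum_j b_j * (x_j - mean_j) / scale_j.
# Expanding it gives the coefficients and intercept in the original units.
raw_coefficients = reg.coef_ / scaler.scale_
raw_intercept = reg.intercept_ - np.sum(reg.coef_ * scaler.mean_ / scaler.scale_)

# For comparison: the same model fitted directly on the unscaled data.
reg_unscaled = LinearRegression().fit(X, y)

print("Intercept in original units:", raw_intercept)
print("Coefficients in original units:", raw_coefficients)
```

In this sketch the converted values match those of a model fitted on the unscaled data, so the resulting equation can be read in the same form as the Excel output.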
Step 6- Finally, we predict the wind speed from the independent variables of the test data set.
predict = reg.predict(X_test)
Based on the predicted wind speed values and the residual scatter plot, we can see that the Sklearn predictions are closer to the actual values.
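The closeness of the predictions to the actual values can also be quantified, for example with r2_score, which is imported in Step 1. The following is a minimal sketch on synthetic data (an assumption, not the article's weather file); the RMSE calculation is an additional illustrative metric that the article itself does not report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic training and test sets standing in for the scaled weather data.
rng = np.random.default_rng(1)
true_coefficients = np.array([0.1, 1.5, 1.5])
X_train = rng.normal(size=(300, 3))
X_test = rng.normal(size=(100, 3))
y_train = 20 + X_train @ true_coefficients + rng.normal(scale=0.5, size=300)
y_test = 20 + X_test @ true_coefficients + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(X_train, y_train)
predict = reg.predict(X_test)

# R2 on data the model has not seen, plus the residuals and their RMSE.
test_r2 = r2_score(y_test, predict)
residuals = y_test - predict
rmse = np.sqrt(np.mean(residuals ** 2))

print("R2 on test data:", round(test_r2, 3))
print("RMSE on test data:", round(rmse, 3))
```

Computing the same metrics for the Excel predictions on the same test rows would give a like-for-like numeric comparison alongside the residual plots.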
Figure: Actual wind speed vs Sklearn linear regression residuals (scatter plot)
On comparing the Sklearn and Excel residuals side by side, we can see that both models deviate more from the actual values as the wind speed increases, but Sklearn did better than Excel.
On a different note, Excel did predict wind speeds in a value range similar to Sklearn's. If an approximate linear regression model is good enough for your business case, Excel is a very good option for predicting values quickly.
Figure: Actual wind speed vs residuals (scatter plot)
The takeaway of this exercise is not that Excel can perform linear regression prediction at the same accuracy level as Sklearn. We can improve the Sklearn linear regression prediction accuracy substantially by fine-tuning the parameters, and Sklearn is better equipped to handle complex models. For quick, approximate predictions, Excel is a very good alternative with acceptable accuracy.
Translated from: https://towardsdatascience.com/linear-regression-in-python-sklearn-vs-excel-6790187dc9ca