Predictive Modeling Using Linear Regression
Regression Analysis
Regression analysis is a predictive modeling technique that estimates the relationship between two or more variables. Recall that a correlation analysis makes no assumption about the causal relationship between two variables. Regression analysis focuses on the relationship between a dependent (target) variable and one or more independent variables (predictors). Here, the dependent variable is assumed to be the effect of the independent variable(s). The values of the predictors are used to estimate or predict the likely value of the target variable.
For example, to describe the relationship between diesel consumption and industrial production: if it is assumed that “diesel consumption” is the effect of “industrial production”, we can do a regression analysis to predict the value of “diesel consumption” for some specific value of “industrial production”.
STEPS TO PERFORM LINEAR REGRESSION
STEP 1: Assume a mathematical relationship between the target and the predictor(s). The relationship can be a straight line (linear regression), a polynomial curve (polynomial regression), or a non-linear relationship (non-linear regression).
STEP 2: Create a scatter plot of the target variable and the predictor variable (the simplest and most popular way).
STEP 3: Find the most likely values of the coefficients in the mathematical formula.
Regression analysis comprises the entire process of identifying the target and predictors, finding the relationship, estimating the coefficients, finding the predicted values of the target, and finally evaluating the accuracy of the fitted relationship.
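The three steps above can be sketched in a few lines of Python. This is a minimal illustration with made-up data; `numpy.polyfit` stands in for the coefficient estimation of Step 3:

```python
import numpy as np

# STEP 1: assume a linear relationship y = b0 + b1 * x
# STEP 2: in practice, scatter-plot (x, y) first to eyeball linearity
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly y = 2x

# STEP 3: estimate the coefficients by least squares
b1, b0 = np.polyfit(x, y, deg=1)          # returns slope first, then intercept
print(round(b1, 2), round(b0, 2))         # → 1.95 0.15
```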
Why Do We Use Regression Analysis?
Regression analysis estimates the relationship between two or more variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.
For example, we want to estimate the credit card spend of customers in the next quarter. For each customer, we have demographic and transaction-related data which indicate that credit card spend is a function of age, credit limit and total outstanding balance on their loans. Using this insight, we can predict future sales of the company based on current and past information.
Benefits of Using Regression Analysis
1. Regression identifies significant relationships between the dependent variable and the independent variables
2. Indicates the strength of impact of multiple independent variables on a dependent variable
3. Allows us to compare the effects of variables measured on different scales, and can consider nominal, interval, or categorical variables for analysis.
The equation with one dependent and one independent variable is defined by the formula:
y = c + b * x
where y = estimated dependent score
c = constant
b = regression coefficient
x = independent variable.
Types of Regression Techniques
For predictions, there are many regression techniques available. The type of regression technique to be used is mostly driven by three metrics:
1. Number of independent variables
2. Type of the dependent variable
3. Shape of the regression line
Linear Regression
Linear regression is one of the most commonly used predictive modelling techniques. It is represented by the equation Y = a + bX + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of a target variable based on given predictor variable(s).
Logistic Regression
Logistic regression is used to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
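As a hedged sketch of the idea (the data here is invented, and scikit-learn's `LogisticRegression` is just one common implementation), a binary target can be modelled from a numeric predictor like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary target: e.g. whether a customer responded to an offer,
# predicted from a single numeric feature (values are illustrative)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
preds = model.predict([[1.5], [7.5]])
print(preds)  # → [0 1]
```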
Polynomial Regression
A regression equation is a polynomial regression equation if the power of the independent variable is greater than 1. The equation below represents a polynomial equation: Y = a + bX + cX². In this regression technique, the best-fit line is not a straight line; it is rather a curve that fits the data points.
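A minimal sketch of fitting such a quadratic with `numpy.polyfit` (the data is generated from a known curve purely for illustration):

```python
import numpy as np

# Fit Y = a + bX + cX^2 to data generated from a known quadratic
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 0.5 * x ** 2          # exact quadratic, no noise

# polyfit returns coefficients highest power first: [c, b, a]
c, b, a = np.polyfit(x, y, deg=2)
print(round(a, 2), round(b, 2), round(c, 2))  # → 1.0 2.0 0.5
```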
Ridge Regression
Ridge regression is suitable for analyzing multiple regression data that suffers from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. It is hoped that the net effect will be to give estimates that are more reliable.
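A sketch using the closed-form ridge estimate β̂ = (XᵀX + λI)⁻¹Xᵀy (the data and variable names are illustrative; with λ = 0 this reduces to ordinary least squares):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Two nearly collinear predictors: x2 is x1 plus tiny noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=50)

beta_ols = ridge_fit(X, y, lam=0.0)    # unbiased but high-variance
beta_ridge = ridge_fit(X, y, lam=1.0)  # biased but more stable
# Ridge shrinks the coefficient vector relative to OLS
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # → True
```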
Determining the Best-Fitting Line
Suppose we have a random sample of 20 students with their height (x) and weight (y), and we need to establish a relationship between the two. One of the first and most basic approaches to fit a line through the data points is to create a scatter plot of (x, y) and draw a straight line that fits the experimental data.
Since there can be multiple lines that fit the data, the challenge arises in choosing the one that best fits. As we already know, the best-fit line can be represented as ŷᵢ = b0 + b1·xᵢ, where:
- yᵢ denotes the observed response for experimental unit i
- xᵢ denotes the predictor value for experimental unit i
- ŷᵢ is the predicted response (or fitted value) for experimental unit i
When we predict height using the above equation, the predicted value wouldn’t be perfectly accurate. It has some “prediction error” (or “residual error”), which can be represented as eᵢ = yᵢ − ŷᵢ.
A line that fits the data best will be one for which the n (i = 1 to n) prediction errors, one for each observed data point, are as small as possible in some overall sense.
One way to achieve this goal is to invoke the “least squares criterion,” which says to “minimize the sum of the squared prediction errors.”
The equation of the best-fitting line is ŷᵢ = b0 + b1·xᵢ.
We need to find the values of b0 and b1 that make the sum of the squared prediction errors the smallest, i.e. minimize Q = Σᵢ (yᵢ − ŷᵢ)².
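The minimizing values have a well-known closed form: b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b0 = ȳ − b1·x̄. A small sketch with illustrative data, checking that the least-squares line indeed beats an arbitrary alternative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])

# Closed-form least squares estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Q = sum of squared prediction errors; smallest for the fitted line
q_best = np.sum((y - (b0 + b1 * x)) ** 2)
q_other = np.sum((y - (1.0 + 1.5 * x)) ** 2)   # some other line
print(round(b1, 2), round(b0, 2), q_best < q_other)  # → 2.01 0.01 True
```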
What Does the Equation Mean?
The equation above has a physical interpretation for each of its coefficients, and hence it is very important to understand what the regression equation means.
The coefficient b0, or the intercept, is the expected value of Y when X = 0.
The coefficient b1, or the slope, is the expected change in Y when X is increased by one unit.
The following figure explains the interpretations clearly.
Linear Regression: Factors Affecting Credit Card Sales
An analyst wants to understand what factors (or independent variables) affect credit card sales. Here, the dependent variable is credit card sales for each customer, and the independent variables are income, age, current balance, socio-economic status, current spend, last month’s spend, loan outstanding balance, revolving credit balance, number of existing credit cards and credit limit. In order to understand what factors affect credit card sales, the analyst needs to build a linear regression model.
Learn & Apply a Simple Linear Regression Model
The trainee is exposed to a sample dataset comprising telecom customer accounts with their annual income and age, along with their average monthly revenue (dependent variable). The trainee is expected to apply the linear regression model using annual income as the single predictor variable.
Evaluating a Linear Regression Model
Once we fit a linear regression model, we need to evaluate the accuracy of the model. In the following sections, we will discuss the various methods used to evaluate the accuracy of the model with respect to its predictive power.
F-Statistic and p-value
The F-test indicates whether a linear regression model provides a better fit to the data than a model that contains no independent variables. It involves a null and an alternative hypothesis, and the test statistic helps us decide whether to reject the null hypothesis.
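A sketch of the F-test for simple linear regression, assuming the standard F = (SSreg/p) / (SSres/(n−p−1)) form with p predictors (the data is illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.9, 4.2, 5.8, 8.1, 9.9, 12.2])

n, p = len(x), 1                               # observations, predictors
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ss_reg = np.sum((y_hat - y.mean()) ** 2)       # explained sum of squares
ss_res = np.sum((y - y_hat) ** 2)              # residual sum of squares

f_stat = (ss_reg / p) / (ss_res / (n - p - 1))
p_value = stats.f.sf(f_stat, p, n - p - 1)     # upper-tail probability
print(p_value < 0.05)  # → True: the model beats an intercept-only model
```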
Coefficient of Determination
The R-squared value of the model is also called the “coefficient of determination”. This statistic calculates the percentage of variation in the target variable explained by the model.
R-squared is calculated using the following formula: R² = 1 − (SSres / SStot), where SSres = Σᵢ (yᵢ − ŷᵢ)² is the residual sum of squares and SStot = Σᵢ (yᵢ − ȳ)² is the total sum of squares.
R-squared is always between 0 and 100%. As a guideline, the higher the R-squared, the better the model. The objective, however, is not to maximize R-squared, since the stability and applicability of the model are equally important.
Next, check the adjusted R-squared value. Ideally, the R-squared and adjusted R-squared values need to be in close proximity of each other. If this is not the case, then the analyst may have overfitted the model and may need to remove the insignificant variables from it.
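Both statistics are easy to compute by hand. A minimal sketch on invented predictions, using the usual adjustment formula 1 − (1 − R²)(n − 1)/(n − p − 1) with n observations and p predictors:

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjust for the number of predictors p given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y     = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([3.2, 4.8, 7.1, 8.9, 11.0])  # predictions from some model

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n=5, p=1)
print(round(r2, 4), round(adj_r2, 4))  # → 0.9975 0.9967
```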
Learn & Apply the Concept of R-Squared
The trainee is exposed to a sample dataset capturing telecom customer accounts with their annual income and age, along with their average monthly revenue (dependent variable). The dataset also contains predicted values of “average monthly revenue” from a regression model. The trainee is expected to apply the calculation of the coefficient of determination.
Check the p-value of the Parameter Estimates
The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (<0.05) indicates that we can reject the null hypothesis. In other words, a predictor that has a low p-value can be included in the model because changes in the predictor’s value are related to changes in the response variable.
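For simple linear regression, the slope's p-value can be computed from its t-statistic, t = b1 / SE(b1), with n − 2 degrees of freedom. A sketch on invented data (in practice a library such as statsmodels reports this directly):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 4.1, 5.8, 8.3, 9.9, 12.1, 14.2, 15.9])

n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
s2 = np.sum(residuals ** 2) / (n - 2)              # residual variance
se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of slope

t_stat = b1 / se_b1                                 # tests H0: b1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)     # two-sided p-value
print(p_value < 0.05)  # → True: the predictor is significant here
```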
Build a Multivariate Linear Regression Model and Evaluate Parameter Significance
The trainee is exposed to a sample dataset capturing the status of flights along with their arrival delay, and various possible predictor variables such as departure delay, distance, air time, etc. The learner is expected to build a multiple regression model in which all the variables are significant.
Residual Analysis
We can also evaluate a regression model based on various summary statistics of the errors, or residuals.
Some of them are:
Root Mean Square Error (RMSE): the square root of the average of the squared residuals, as per the formula RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² ).
Mean Absolute Percentage Error (MAPE): the average percentage deviation, as per the formula MAPE = (100/n) Σᵢ |yᵢ − ŷᵢ| / |yᵢ|.
Rank Ordering
Observations are grouped based on predicted values of the target variable. The averages of the actual vs. predicted values of the target variable across the groups are then examined to see whether they move in the same direction (increasing or decreasing) from group to group. This is called the rank ordering check.
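One plausible way to code this check (the grouping scheme and data here are illustrative, not a fixed standard): sort observations by predicted value, split into equal-sized groups, and test whether the group-wise means of the actuals are monotone.

```python
import numpy as np

def rank_ordering_check(y_actual, y_pred, n_groups=4):
    """Group by predicted value and check whether the group-wise
    means of the actuals increase across the groups."""
    order = np.argsort(y_pred)
    groups = np.array_split(y_actual[order], n_groups)
    group_means = [g.mean() for g in groups]
    increasing = all(a <= b for a, b in zip(group_means, group_means[1:]))
    return group_means, increasing

rng = np.random.default_rng(1)
y_pred = np.linspace(10, 100, 40)
y_actual = y_pred + rng.normal(scale=2.0, size=40)  # predictions track actuals

means, ok = rank_ordering_check(y_actual, y_pred)
print(ok)  # → True: actual means rise across the predicted-value groups
```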
Assumptions of Linear Regression
There are some basic but strong underlying assumptions behind linear regression model estimation. After fitting a regression model, we should also test the validity of each of these assumptions.
- There must be a causal relationship between the dependent and the independent variable(s) which can be expressed as a linear function. A scatter plot of the target variable vs. each predictor variable can help us validate this.
- The error term of one observation is independent of that of the others. Otherwise, we say the data has an autocorrelation problem.
- The mean (or expected value) of the errors is zero.
- The variance of the errors does not depend on the value of any predictor variable. This means the errors have a constant variance along the regression line.
- The errors follow a normal distribution. We can use a normality test on the residuals here.
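Two of these assumptions can be checked directly from the fitted residuals, as a sketch (simulated data; `scipy.stats.shapiro` is one of several possible normality tests):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=100)

# Fit the line, then examine the residuals
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# When an intercept is fitted by least squares, residuals sum to ~0
print(abs(residuals.mean()) < 1e-6)  # → True

# Shapiro-Wilk normality test on the residuals:
# a large p-value means no evidence against normality
w_stat, p_norm = stats.shapiro(residuals)
print(round(w_stat, 3), p_norm)
```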
Learn & Apply Concepts of Variable Selection & Over-fitting
The trainee is expected to select the significant variables for the model first and then check whether there is any overfitting. If there is, the trainee should remove the offending variable(s) and iterate through the variable selection process.
Translated from: https://medium.com/swlh/predictive-modelling-using-linear-regression-e0e399dc4745