Part 22: Vector Autoregression (VAR) Model Forecasting in Python
I. An Introduction to the Vector Autoregression Model
Classical regression models all share one limitation: they impose a one-way relationship in which the dependent variable is influenced by the explanatory variables but not the other way around. In many situations, however, all variables influence each other. The vector autoregression (VAR) model allows this kind of two-way feedback: all variables are treated equally, i.e. every variable is endogenous and the variables influence one another on an equal footing. The VAR model extends the idea of univariate autoregression to multivariate time series regression and is the generalization of the univariate autoregressive model. It consists of one equation per variable in the system, and the right-hand side of each equation contains a constant term and lags of all variables in the system. The general form of a two-variable VAR model of order $p$ is

$$y_{1,t} = c_1 + \sum_{i=1}^{p}\left(\phi_{11,i}\,y_{1,t-i} + \phi_{12,i}\,y_{2,t-i}\right) + e_{1,t}$$

$$y_{2,t} = c_2 + \sum_{i=1}^{p}\left(\phi_{21,i}\,y_{1,t-i} + \phi_{22,i}\,y_{2,t-i}\right) + e_{2,t}$$

where $e_{1,t}$ and $e_{2,t}$ are white-noise error terms.
If the series are stationary, we fit the VAR model directly to the data (a "level VAR"); if they are non-stationary, we difference the data until they become stationary and fit the VAR model on that basis (a "differenced VAR"). In both cases the model is estimated equation by equation using least squares: for each equation, the parameters are estimated by minimizing the sum of squared residuals. A small sketch of this equation-by-equation equivalence follows.
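To make the point concrete, here is a minimal sketch on hypothetical simulated two-variable data (the names y1 and y2 and the random series are illustrative only, not from this article's dataset): each equation of a VAR fitted by statsmodels can be reproduced by an ordinary OLS regression of one variable on a constant and the first lags of all variables.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import VAR

# Simulate two arbitrary series just to illustrate the estimation mechanics
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((200, 2)), columns=['y1', 'y2'])

# Fit a VAR(1) to both series at once
var_res = VAR(data).fit(1)

# Reproduce the y1 equation by hand: regress y1_t on a constant and the
# first lags of y1 and y2
lagged = data.shift(1).dropna()
y = data['y1'].iloc[1:]
X = sm.add_constant(lagged.values)
ols_res = sm.OLS(y, X).fit()

print(np.asarray(var_res.params)[:, 0])  # VAR coefficients of the y1 equation
print(ols_res.params.values)             # matches: VAR is per-equation OLS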
II. An Example: Short-Term Forecasting with a VAR Model
(i) Data source
The case study for this installment uses the time-series dataset from Yash P. Mehra's 1994 article "Wage Growth and the Inflation Process: An Empirical Approach."
The original reference:
Mehra, Yash P. "Wage growth and the inflation process: an empirical approach." Cointegration. Palgrave Macmillan, London, 1994. 147-159.
The dataset in CSV format can be downloaded from:
https://raw.githubusercontent.com/selva86/datasets/master/Raotbl6.csv

(ii) Variables

The dataset is a quarterly time series of 8 variables:
1. rgnp: Real GNP.
2. pgnp: Potential real GNP.
3. ulc: Unit labor cost.
4. gdfco: Fixed weight deflator for personal consumption expenditure excluding food and energy.
5. gdf: Fixed weight GNP deflator.
6. gdfim: Fixed weight import deflator.
7. gdfcf: Fixed weight deflator for food in personal consumption expenditure.
8. gdfce: Fixed weight deflator for energy in personal consumption expenditure.
III. Computation Steps and Python Code
# (1) Load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Import statsmodels
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse, aic
# (2) Import the data
filepath = 'https://raw.githubusercontent.com/selva86/datasets/master/Raotbl6.csv'
df = pd.read_csv(filepath, parse_dates=['date'], index_col='date')
print(df.shape)  # (123, 8)
df.tail()
# (3) Visualize the time series
# Plot
fig, axes = plt.subplots(nrows=4, ncols=2, dpi=120, figsize=(10, 6))
for i, ax in enumerate(axes.flatten()):
    data = df[df.columns[i]]
    ax.plot(data, color='red', linewidth=1)
    # Decorations
    ax.set_title(df.columns[i])
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    ax.spines["top"].set_alpha(0)
    ax.tick_params(labelsize=6)
plt.tight_layout()
# (4) Granger causality tests
from statsmodels.tsa.stattools import grangercausalitytests

maxlag = 12
test = 'ssr_chi2test'

def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):
    """Check Granger causality of all possible combinations of the time series.
    The rows are the response variables, the columns are the predictors. The
    values in the table are the p-values. A p-value below the significance level
    (0.05) means the null hypothesis (the coefficients of the corresponding past
    values are zero, i.e. X does not Granger-cause Y) can be rejected.

    data      : pandas DataFrame containing the time series variables
    variables : list containing the names of the time series variables
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1], 4) for i in range(maxlag)]
            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df

grangers_causation_matrix(df, variables=df.columns)
In the resulting table, the rows are the response variables (Y) and the columns are the predictors (X). For example, the p-value of 0.0003 at (row 1, column 2) is below the 0.05 significance level, so pgnp_x Granger-causes rgnp_y. The other cells are interpreted analogously.

Next we test for cointegration among the variables.
# (5) Cointegration test
from statsmodels.tsa.vector_ar.vecm import coint_johansen

def cointegration_test(df, alpha=0.05):
    """Perform Johansen's cointegration test and report a summary."""
    out = coint_johansen(df, -1, 5)
    d = {'0.90': 0, '0.95': 1, '0.99': 2}
    traces = out.lr1
    cvts = out.cvt[:, d[str(1-alpha)]]
    def adjust(val, length=6): return str(val).ljust(length)
    # Summary
    print('Name   ::  Test Stat > C(95%)    =>   Signif  \n', '--'*20)
    for col, trace, cvt in zip(df.columns, traces, cvts):
        print(adjust(col), ':: ', adjust(round(trace, 2), 9), ">", adjust(cvt, 8), ' =>  ', trace > cvt)

cointegration_test(df)
The results are as follows:
Next, split the dataset into training and test data. The VAR model will be fitted on df_train and then used to forecast the next 4 observations. These forecasts will be compared with the actual values in the test data; for the comparison we will use several forecast accuracy metrics (see step (14) at the end).
# (6) Split the dataset into training and test data
nobs = 4
df_train, df_test = df[0:-nobs], df[-nobs:]

# Check size
print(df_train.shape)  # (119, 8)
print(df_test.shape)  # (4, 8)
# (7) Test the series for stationarity
The three most commonly used tests are the ADF, KPSS, and PP tests. This walkthrough uses the ADF test; a brief KPSS sketch is given below for reference.
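A KPSS cross-check can be written with statsmodels' kpss function (the kpss_test helper below is our illustrative addition, not from the original post; note that the KPSS null hypothesis is that the series IS stationary, the opposite of the ADF null):

from statsmodels.tsa.stattools import kpss

def kpss_test(series, name=''):
    """Run the KPSS test and print the statistic and p-value."""
    # regression='c' tests stationarity around a constant level
    statistic, p_value, n_lags, crit = kpss(series, regression='c', nlags='auto')
    print(f'KPSS test on "{name}": statistic = {statistic:.4f}, p-value = {p_value:.4f}')
    # A p-value below 0.05 rejects the null, i.e. suggests NON-stationarity

kpss_test(df_train['rgnp'], name='rgnp')

The ADF helper used in the rest of this post is defined next.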
def adfuller_test(series, signif=0.05, name='', verbose=False):
    """Perform the ADF test for stationarity of the given series and print a report."""
    r = adfuller(series, autolag='AIC')
    output = {'test_statistic': round(r[0], 4), 'pvalue': round(r[1], 4), 'n_lags': round(r[2], 4), 'n_obs': r[3]}
    p_value = output['pvalue']
    def adjust(val, length=6): return str(val).ljust(length)
    # Print summary
    print(f'    Augmented Dickey-Fuller Test on "{name}"', "\n   ", '-'*47)
    print(f' Null Hypothesis: Data has unit root. Non-Stationary.')
    print(f' Significance Level    = {signif}')
    print(f' Test Statistic        = {output["test_statistic"]}')
    print(f' No. Lags Chosen       = {output["n_lags"]}')
    for key, val in r[4].items():
        print(f' Critical value {adjust(key)} = {round(val, 3)}')
    if p_value <= signif:
        print(f" => P-Value = {p_value}. Rejecting Null Hypothesis.")
        print(f" => Series is Stationary.")
    else:
        print(f" => P-Value = {p_value}. Weak evidence to reject the Null Hypothesis.")
        print(f" => Series is Non-Stationary.")

# ADF test on each column
for name, column in df_train.items():  # .iteritems() was removed in pandas 2.0
    adfuller_test(column, name=column.name)
    print('\n')
The results are as follows:
The ADF test confirms that none of the series is stationary in levels, so each series is differenced and tested again. After the first difference some series are still not stationary, so a second difference is taken; the twice-differenced series are all stationary (only two of the differenced test results were shown).
# 1st difference
df_differenced = df_train.diff().dropna()

# 2nd difference: needed because some series remain non-stationary after one difference
df_differenced = df_differenced.diff().dropna()

# ADF test on each column of the differenced DataFrame
for name, column in df_differenced.items():
    adfuller_test(column, name=column.name)
    print('\n')
# (8) Select the order (p) of the VAR model
The four most commonly used information criteria are AIC, BIC, FPE, and HQIC.
model = VAR(df_differenced)
for i in [1, 2, 3, 4, 5, 6, 7, 8, 9]:
    result = model.fit(i)
    print('Lag Order =', i)
    print('AIC : ', result.aic)
    print('BIC : ', result.bic)
    print('FPE : ', result.fpe)
    print('HQIC: ', result.hqic, '\n')
The results are as follows:
In the output above, AIC reaches a local minimum at lag 4, rises at lag 5, and then declines again at longer lags; to keep the model parsimonious we use the lag-4 model.
Another way to select the order (p) of a VAR model is the model.select_order(maxlags) method. The selected order (p) is the one that yields the lowest 'AIC', 'BIC', 'FPE', and 'HQIC' values.
x = model.select_order(maxlags=12)
x.summary()
The results are as follows:
According to FPE and HQIC, the optimal lag order is 3.
# (9) Fit the VAR model
model_fitted = model.fit(4)
model_fitted.summary()
... (output omitted)
# (10) Check residual serial correlation with the Durbin-Watson statistic
The Durbin-Watson statistic used to check the residuals (errors) for serial correlation is

$$DW = \frac{\sum_{t=2}^{T}\left(e_t - e_{t-1}\right)^2}{\sum_{t=1}^{T} e_t^{2}}$$

The DW value ranges between 0 and 4. The closer it is to 2, the less evidence there is of serial correlation; values near 0 indicate positive serial correlation, while values near 4 indicate negative serial correlation.
from statsmodels.stats.stattools import durbin_watson

out = durbin_watson(model_fitted.resid)

def adjust(val, length=6): return str(val).ljust(length)  # same helper as in the earlier steps

for col, val in zip(df.columns, out):
    print(adjust(col), ':', round(val, 2))
The results are as follows:
rgnp : 2.09
pgnp : 2.02
ulc : 2.17
gdfco : 2.05
gdf : 2.25
gdfim : 1.99
gdfcf : 2.2
gdfce : 2.17
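As a quick sanity check, the statistic for a single equation can also be computed directly from the formula above; a minimal sketch for the first equation (rgnp):

# Verify the Durbin-Watson formula by hand for the first equation (rgnp)
resid = np.asarray(model_fitted.resid)     # residual matrix, one column per equation
e = resid[:, 0]                            # rgnp equation residuals
dw = np.sum(np.diff(e)**2) / np.sum(e**2)  # the formula given above
print(round(dw, 2))                        # should match the rgnp value in the table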
# (11) Forecast with the VAR model
# Get the lag order
lag_order = model_fitted.k_ar
print(lag_order)  #> 4

# Input data for forecasting
forecast_input = df_differenced.values[-lag_order:]
forecast_input

# Forecast
fc = model_fitted.forecast(y=forecast_input, steps=nobs)
df_forecast = pd.DataFrame(fc, index=df.index[-nobs:], columns=df.columns + '_2d')
df_forecast
The forecasts are generated, but they are on the scale of the differenced training data the model was fitted on. To bring them back to the original scale, we must undo the differencing as many times as it was applied to the input data; in this case, twice.
# (12) Invert the transformation to recover the real forecast values
def invert_transformation(df_train, df_forecast, second_diff=False):
    """Revert the differencing to bring the forecast back to the original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    for col in columns:
        # Roll back 2nd difference
        if second_diff:
            df_fc[str(col)+'_1d'] = (df_train[col].iloc[-1] - df_train[col].iloc[-2]) + df_fc[str(col)+'_2d'].cumsum()
        # Roll back 1st difference
        df_fc[str(col)+'_forecast'] = df_train[col].iloc[-1] + df_fc[str(col)+'_1d'].cumsum()
    return df_fc

df_results = invert_transformation(df_train, df_forecast, second_diff=True)
df_results.loc[:, ['rgnp_forecast', 'pgnp_forecast', 'ulc_forecast', 'gdfco_forecast',
                   'gdf_forecast', 'gdfim_forecast', 'gdfcf_forecast', 'gdfce_forecast']]
Results:
# (13) Plot forecasts vs actuals
fig, axes = plt.subplots(nrows=int(len(df.columns)/2), ncols=2, dpi=150, figsize=(10, 10))
for i, (col, ax) in enumerate(zip(df.columns, axes.flatten())):
    df_results[col+'_forecast'].plot(legend=True, ax=ax).autoscale(axis='x', tight=True)
    df_test[col][-nobs:].plot(legend=True, ax=ax)
    ax.set_title(col + ": Forecast vs Actuals")
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    ax.spines["top"].set_alpha(0)
    ax.tick_params(labelsize=6)
plt.tight_layout();
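Step (6) promised to compare the forecasts with the test data using several accuracy metrics, but the corresponding code was not shown above. The following is a minimal sketch (our addition, not from the original post) of a few common metrics; the rmse helper was already imported in step (1), and rgnp serves as the example variable:

# (14) Evaluate forecast accuracy (illustrative sketch)
def forecast_accuracy(forecast, actual):
    mape = np.mean(np.abs(forecast - actual) / np.abs(actual))  # mean absolute percentage error
    me = np.mean(forecast - actual)                             # mean error
    mae = np.mean(np.abs(forecast - actual))                    # mean absolute error
    return {'mape': mape, 'me': me, 'mae': mae, 'rmse': rmse(actual, forecast)}

print('rgnp :', forecast_accuracy(df_results['rgnp_forecast'].values, df_test['rgnp'].values))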
Finally, special thanks for this installment go to Dr. Selva Prabhakaran, who provided the Python code for the VAR model; see:
https://www.machinelearningplus.com/time-series/vector-autoregression-examples-python/