當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

SARIMA时间序列模型预测城市房价数据

發布時間：2023/12/20 编程问答 37 豆豆

生活随笔收集整理的這篇文章主要介紹了 SARIMA时间序列模型预测城市房价数据小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

SARIMA時間序列模型預測城市房價數據

數據清洗

文件中含有大量城市的房價數據，考慮到此次為學習性質的練習，為了節省數據處理的繁瑣步驟。我截取了北京的2010-2021房價數據作為樣例，并將價格的數據格式改為數值，去除多余的逗號

數據導入

#導入數據 df = pd.read_csv('Beijingpricedata.csv', encoding='utf-8', index_col='Timing') #將日期轉化為標準時間格式 i=0; for str in df.index:df.index.values[i]=datetime.datetime.strptime(str, '%y-%b').strftime('%Y-%m')i+=1; df.index = pd.to_datetime(df.index) # 將字符串索引轉換成時間索引TravelDatets = df['北京'] # 生成pd.Series對象

數據平穩性檢測

因為采用ARIMA模型訓練數據，要求數據是穩定的，所以需要對數據進行平穩性檢測

畫出原始數據圖表

可以看到價格數據具有逐步上升的趨勢

#原始數據圖 def draw_ts(timeseries):f = plt.figure(facecolor='white')timeseries.plot(color='blue')plt.title('housing price changes')plt.show() draw_ts(ts)

移動平均圖

我們可以發現數據的移動平均值有越來越大的趨勢，是不穩定的。

def draw_trend(timeseries, size):plt.rcParams['font.sans-serif'] = [u'SimHei']plt.rcParams['axes.unicode_minus'] = Falsef = plt.figure(facecolor='white')# 對size個數據進行移動平均rol_mean = timeseries.rolling(window=size).mean()# 對size個數據移動平均的方差rol_std = timeseries.rolling(window=size).std()timeseries.plot(color='blue', label='Original')rol_mean.plot(color='red', label='Rolling Mean')rol_std.plot(color='black', label='Rolling standard deviation')plt.legend(loc='best')plt.title('Rolling Mean & Standard Deviation')plt.show() draw_trend(ts,12)

平穩性檢測

此時p值為0.623007，不能拒絕原假設，即數據存在單位根,數據是非平穩序列。

def TestStationaryAdfuller(ts, cutoff = 0.01):ts_test = adfuller(ts, autolag = 'AIC')ts_test_output = pd.Series(ts_test[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])for key,value in ts_test[4].items():ts_test_output['Critical Value (%s)'%key] = valueprint(ts_test_output)if ts_test[1] <= cutoff:print(u"拒絕原假設，即數據沒有單位根,序列是平穩的。")else:print(u"不能拒絕原假設，即數據存在單位根,數據是非平穩序列。") # 平穩性檢測: def teststationarity(ts):dftest = adfuller(ts)# 對上述函數求得的值進行語義描述dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])for key, value in dftest[4].items():dfoutput['Critical Value (%s)' % key] = valuereturn dfoutput print(teststationarity(ts))

數據分解

將時序數據分離成不同的成分：趨勢部分，季節性部分，殘留部分。

可以看到，數據具有上升的趨勢，且具有季節性的特點。周期=12（/月）

所以考慮采用季節ARIMA模型（SRIMA）方法模型擬合數據

def decompose(timeseries):# 返回包含三個部分 trend（趨勢部分）， seasonal（季節性部分）和residual (殘留部分)decomposition = seasonal_decompose(timeseries)trend = decomposition.trendseasonal = decomposition.seasonalresidual = decomposition.residplt.subplot(411)plt.plot(ts, label='Original')plt.legend(loc='best')plt.subplot(412)plt.plot(trend, label='Trend')plt.legend(loc='best')plt.subplot(413)plt.plot(seasonal, label='Seasonality')plt.legend(loc='best')plt.subplot(414)plt.plot(residual, label='Residuals')plt.legend(loc='best')plt.tight_layout()plt.show()return trend, seasonal, residualtrend, seasonal, residual = decompose(ts) residual.dropna(inplace=True) draw_trend(residual, 12) print(teststationarity(residual))

SARIMA模型

模型定階

用圖解法求解ARIMA模型的最優參數并非易事，主觀性很大，而且耗費時間。所以本文進一步考慮利用網格搜索的方法系統地選擇最優的參數值

#SARIMA模型參數的選擇 p = d = q = range(0, 2) pdq = list(itertools.product(p, d, q)) pdq_x_PDQs = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))] a=[] b=[] c=[] wf=pd.DataFrame() for param in pdq:for seasonal_param in pdq_x_PDQs:try:mod = sm.tsa.statespace.SARIMAX(ts,order=param,seasonal_order=seasonal_param,enforce_stationarity=False,enforce_invertibility=False)results = mod.fit()print('ARIMA{}x{} - AIC:{}'.format(param, seasonal_param, results.aic))a.append(param)b.append(seasonal_param)c.append(results.aic)except:continue wf['pdq']=a wf['pdq_x_PDQs']=b wf['aic']=c print(wf[wf['aic']==wf['aic'].min()])mod = sm.tsa.statespace.SARIMAX(ts,order=(1,1,1),seasonal_order=(1,1,1,12),enforce_stationarity=False,enforce_invertibility=False) results = mod.fit() print(results.summary())

可以看到，模型參數的最佳組合為(1,1,1)(1,1,1,12),aic最小值為1475.5377

模型檢驗

# 模型檢驗 mod = sm.tsa.statespace.SARIMAX(ts,order=(1,1,1),seasonal_order=(1,1,1,12),enforce_stationarity=False,enforce_invertibility=False) results = mod.fit() print(results.summary()) # 模型診斷 results.plot_diagnostics(figsize=(15, 12)) plt.show() # LB檢驗 r, q, p = sm.tsa.acf(results.resid.values.squeeze(), qstat=True) data = np.c_[range(1, 22), r[1:], q, p] table = pd.DataFrame(data, columns=['lag', "AC", "Q", "Prob(>Q)"]) print(table.set_index('lag'))

coef列顯示每個變量的權重(即重要性)，以及每個變量如何影響時間序列。P>|z|列是對每個變量系數的檢驗。每個變量的P值均小于0.05，所以在0.05的顯著性水平下，拒絕加假設，模型中每個變量的系數通過顯著性檢驗，可以認為擬合的模型中包含這些變量是合理的。

由右上角的正態分布圖可知，模型的殘差是正態分布的。紅色的KDE線緊隨著N(0，1)線。其中，N(0,1)是均值為0，標準差為1的標準正態分布。這很好地說明了殘差是正態分布的。而圖中左下角的正太Q—Q圖也說明了殘差服從正態分布。

圖中右下角殘差的自相關圖顯示殘差在初始位置偏差較大，其他位置不存在自相關，說明殘差序列是白噪聲序列。

由此，可以得出結論，SARIMAX(1，1，1)x(0，1，1，12)模型的擬合效果還不錯，可以幫助我們理解原始的時間序列數據并對未來的數值做出預測。

模型預測

1）靜態預測

#靜態預測 predict_ts = results.predict() predict_ts.plot(color='blue', label='Predict') ts.plot(color='red', label='Original') plt.legend(loc='best') plt.title('RMSE: %.4f'% np.sqrt(sum((predict_ts-ts)**2)/ts.size)) plt.show()

可以看到，模型在2011左右預測效果不太理想，不過之后預測效果很不錯

動態預測

#動態預測 pred_dynamic = results.get_prediction(start=pd.to_datetime('2014-01-1'), dynamic=True, full_results=True) pred_dynamic_ci = pred_dynamic.conf_int() ts_forecast = pred_dynamic.predicted_mean ts_orginal =ts['2014-01-01':] ts_forecast.plot(color='blue', label='Predict') ts_orginal.plot(color='red', label='Original') plt.legend(loc='best') plt.title('RMSE: %.4f'% np.sqrt(sum((ts_forecast-ts_orginal)**2)/ts.size)) plt.show()

可以看到，魔性的動態預測效果很差，不適合使用該模型

預測未來5年的房價

#預測未來5年的數據 forecast = results.get_forecast(steps= 60) # 得到預測的置信區間 forecast_ci = forecast.conf_int() ts_forecast = forecast.predicted_mean ts_pred_concat = pd.concat([ts_forecast,forecast_ci],axis=1) ts_pred_concat.columns = [u'預測值',u'下限',u'上限'] print(ts_pred_concat.head(60))#繪制時間序列圖 ax = ts.plot(label='origin', figsize=(20, 15)) forecast.predicted_mean.plot(ax=ax, label='predict') ax.fill_between(forecast_ci.index,forecast_ci.iloc[:, 0],forecast_ci.iloc[:, 1], color='g', alpha=.4) plt.xticks(fontsize = 36) plt.yticks(fontsize = 36) ax.set_xlabel('時間(年)',fontsize=36) ax.set_ylabel('房價\元',fontsize=36) plt.legend(loc = 'upper left',fontsize=20) plt.show()

問題

模型在初始采樣時間（2011年左右）擬合效果不佳

參考

https://zhuanlan.zhihu.com/p/127032260

https://blog.csdn.net/u012735708/article/details/82460962

總結

以上是生活随笔為你收集整理的SARIMA时间序列模型预测城市房价数据的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：【干货】同步与互斥的失败例子
下一篇： 10个高效Linux技巧及Vim命令对比