

Time Series Forecasting (1): Data Preprocessing

Published: 2024/5/15



The original post is on my personal blog: https://xkw168.github.io/2019/05/20/時間序列預測-一-數據預處理.html

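I have been working on time series forecasting recently, so here is a short summary and review. It is meant for my own future reference, and I hope it helps others too. Questions and discussion are welcome.
This is a series of posts that approaches the topic mainly from the code side (the aim is code snippets you can pick up and use as-is). It does not go deep into model theory, though I will explain my own understanding where I can. The series covers data preprocessing and some basic models for time series analysis and forecasting: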

(1) Data preprocessing

(2) AR model (autoregressive model)

(3) XGBoost model

(4) LSTM model

(5) Prophet model (additive model)

Data Preprocessing

Preprocessing plays an important role in data analysis. This post covers several common preprocessing methods.

1. Normalization (and denormalization)

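Normalization is one of the most commonly used preprocessing methods. If the value ranges of different features differ too much during training, the model can easily assign weights poorly, so normalization is often indispensable.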

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def normalize(data, method="MinMax", feature_range=(0, 1)):
    """
    normalize the data
    :param data: list of data
    :param method: "MinMax" scaler or "Z-Score" scaler
    :param feature_range: target range, only used by the MinMax scaler
    :return: normalized data (1-D array), fitted scaler
    """
    data = np.array(data)
    if len(data.shape) == 1 or data.shape[1] != 1:
        # reshape(-1, 1) --> one column, n rows (-1 lets numpy work out the row count)
        data = data.reshape(-1, 1)
    if method == "MinMax":
        scaler = MinMaxScaler(feature_range=feature_range)
    elif method == "Z-Score":
        scaler = StandardScaler()
    else:
        raise ValueError("only support MinMax scaler and Z-Score scaler")
    scaler.fit(data)
    # the scaler transforms each column separately, which is why
    # 1-D data has to be reshaped into an n x 1 matrix first
    return scaler.transform(data).reshape(-1), scaler


def denormalize(data, scaler):
    """
    denormalize data with a fitted scaler
    :param data: normalized data
    :param scaler: the scaler returned by normalize()
    :return: denormalized data (1-D array)
    """
    data = np.array(data)
    if len(data.shape) == 1 or data.shape[1] != 1:
        data = data.reshape(-1, 1)
    # max, min, mean and variance are all stored in the scaler,
    # so it is needed to perform the inverse transform
    return scaler.inverse_transform(data).reshape(-1)
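To make the arithmetic behind the two scalers concrete, here is a minimal pure-Python sketch that does not use scikit-learn. The helper names (`min_max_scale`, `min_max_inverse`, `z_score_scale`) and the sample series are made up for this illustration. The z-score uses the population standard deviation, and a constant series is simply rejected here.

```python
from statistics import fmean, pstdev


def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Map values linearly so that min(values) -> low and max(values) -> high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("a constant series cannot be min-max scaled in this sketch")
    scaled = [low + (v - v_min) / span * (high - low) for v in values]
    # (v_min, v_max) play the role of the fitted scaler: keep them for the inverse
    return scaled, (v_min, v_max)


def min_max_inverse(scaled, params, feature_range=(0.0, 1.0)):
    """Undo min_max_scale using the stored (v_min, v_max)."""
    low, high = feature_range
    v_min, v_max = params
    return [v_min + (s - low) / (high - low) * (v_max - v_min) for s in scaled]


def z_score_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        raise ValueError("a constant series cannot be z-scored in this sketch")
    return [(v - mean) / std for v in values], (mean, std)


series = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled, params = min_max_scale(series)
restored = min_max_inverse(scaled, params)
z_scores, (mean, std) = z_score_scale(series)
print(scaled)    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(restored)
print(z_scores)
```

The point to take away is that the inverse transform needs the statistics of the data the scaler was fitted on, which is why the fitted scaler has to be kept around.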

2. Resampling

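Sometimes the raw data is sampled at too high a frequency, which makes it noisy. Resampling to a lower frequency reduces the noise to some extent, and for large datasets it also shrinks the data and speeds up model iteration.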

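import pandas as pd


def resample(data, period="W"):
    """
    resample the original data to reduce noise
    :param data: DataFrame with a date column named 'ds'
    :param period: target period, e.g. B - business day, D - calendar day, W - weekly, Y - yearly
        (reference: pandas DateOffset objects,
        http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
    :return: resampled DataFrame, 'ds' holds the right edge of each period
    """
    # use the dates as the index and drop the original column, so that
    # reset_index() can put 'ds' back without a name clash
    data = data.set_index(pd.DatetimeIndex(data["ds"])).drop(columns="ds")
    return data.resample(period, label="right").mean().reset_index()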
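As a rough illustration of what weekly down-sampling by mean does, the following pure-Python sketch groups daily values into weeks that end on Sunday and labels each group with that Sunday, which is the idea behind resampling to a weekly period with right-edge labels. The helper `weekly_mean` and the sample data are assumptions made for this example, and unlike pandas it silently skips weeks that contain no data.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_mean(rows):
    """rows: iterable of (date, value). Returns [(week_end, mean)] sorted by date."""
    buckets = defaultdict(list)
    for day, value in rows:
        # weekday(): Monday = 0 ... Sunday = 6, so this moves forward to the Sunday
        week_end = day + timedelta(days=6 - day.weekday())
        buckets[week_end].append(value)
    return [(end, sum(vals) / len(vals)) for end, vals in sorted(buckets.items())]


start = date(2024, 1, 1)  # a Monday
daily = [(start + timedelta(days=i), float(i + 1)) for i in range(10)]
weekly = weekly_mean(daily)
for week_end, mean_value in weekly:
    print(week_end, mean_value)
# 2024-01-07 4.0
# 2024-01-14 9.0
```

Ten daily points collapse into two weekly points: the first is the mean of a full week, the second the mean of the three days observed so far.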

3. Data splitting

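This splits the original data into a training set, a validation (evaluation) set and a test set. The proportions can be adjusted as needed.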

def split_data(observed_data, split_ratio=(8, 2, 1)):
    """
    split the observed data into three parts: train, evaluation, test
    :param observed_data: time-ordered data
    :param split_ratio: relative proportion among train, evaluation, test
    :return: train, evaluation, test data
    """
    total = split_ratio[0] + split_ratio[1] + split_ratio[2]
    length = len(observed_data)
    train_cnt = int((split_ratio[0] / total) * length)
    test_cnt = int((split_ratio[2] / total) * length)
    # slice with an explicit start index: observed_data[-test_cnt:] would
    # return the whole series when test_cnt is 0
    test_start = length - test_cnt
    return observed_data[:train_cnt], observed_data[train_cnt:test_start], observed_data[test_start:]
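The sketch below shows a chronological split on a toy series and why negative-index slicing is fragile when a part turns out to be empty. The function name `chronological_split` is made up for this example, and it uses integer arithmetic for the counts, which is a choice made here rather than something taken from the code above.

```python
def chronological_split(series, ratio=(8, 2, 1)):
    """Split a time-ordered list into train / evaluation / test without shuffling."""
    total = sum(ratio)
    n = len(series)
    train_end = ratio[0] * n // total          # integer arithmetic, no float rounding
    test_start = n - ratio[2] * n // total     # explicit index, safe when the test part is empty
    return series[:train_end], series[train_end:test_start], series[test_start:]


series = list(range(110))
train, evaluation, test = chronological_split(series)
print(len(train), len(evaluation), len(test))  # 80 20 10

# With only 5 points the test share rounds down to 0 samples.
short = [1, 2, 3, 4, 5]
s_train, s_eval, s_test = chronological_split(short)
print(s_train, s_eval, s_test)  # [1, 2, 3] [4, 5] []

# The pitfall: short[-0:] is the same as short[0:], i.e. the whole list.
print(short[-0:])  # [1, 2, 3, 4, 5]
```

Time series are split in order rather than at random, so that the model is always evaluated on data that comes after the data it was trained on.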

4. Filters (median, mean, Butterworth)

These are mainly used in signal processing; general data analysis may not need them often.

import numpy as np
from scipy import signal


def median_filter(datas, length=3):
    """
    median filter, length must be an odd number
    :param datas: 1-D data
    :param length: window size
    :return: filtered data (list)
    """
    return signal.medfilt(datas, length).tolist()


def average_filter(datas, length=3):
    """
    moving-average filter, length should not be greater than 5
    :param datas: 1-D data
    :param length: window size (2, 3 or 4; any other value is treated as 5)
    :return: filtered data (list), shorter than the input by window - 1
    """
    if isinstance(datas, np.ndarray):
        datas = datas.tolist()
    window = length if length in (2, 3, 4) else 5
    # average every run of `window` consecutive samples
    return [np.average(datas[i:i + window]) for i in range(len(datas) - window + 1)]


# noinspection PyTupleAssignmentBalance
def butter_filter(datas, hc=0.1):
    """
    5th-order low-pass Butterworth filter, applied forward and backward
    :param datas: 1-D data
    :param hc: cutoff frequency, as a fraction of the Nyquist frequency
    :return: filtered data (array)
    """
    b, a = signal.butter(5, hc, btype="low")
    return signal.filtfilt(b, a, datas)
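To see how a mean filter and a median filter differ, here is a small pure-Python sketch on a series with a single spike. The helper `sliding` is made up for this example and only evaluates full windows, so the output is shorter than the input (SciPy's `medfilt`, by contrast, zero-pads and keeps the length).

```python
from statistics import fmean, median


def sliding(values, length, reducer):
    """Apply `reducer` to every run of `length` consecutive values."""
    return [reducer(values[i:i + length]) for i in range(len(values) - length + 1)]


noisy = [1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0]   # one spike in a flat signal
mean_filtered = sliding(noisy, 3, fmean)
median_filtered = sliding(noisy, 3, median)
print(mean_filtered)    # [1.0, 4.0, 4.0, 4.0, 1.0]
print(median_filtered)  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

The mean filter smears the spike across three outputs, while the median filter removes it completely, which is why the median is often preferred for impulse-like noise.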

5. Data visualization

Visualization is also an important preprocessing tool; it helps inspect the various properties of the data.

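Line chart: shows the trend of the data over time
Box plot: shows outliers in the data
Distribution plot: shows whether the data follows a normal distribution
Scatter plot: shows how the data is distributed (linearity, clustering, etc.)
Heat map: shows the correlation coefficients between several series (note: it needs more than one series)
Bar chart: shows the relative size or share of values
Pie chart: shows the share of each value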
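For the box plot in particular, the points it marks as outliers are by default those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule directly; the helper `iqr_outliers` and the sample readings are assumptions made for this example.

```python
from statistics import quantiles


def iqr_outliers(values, whisker=1.5):
    """Return the values outside [Q1 - whisker * IQR, Q3 + whisker * IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 40, 12, 10, 13]
outliers = iqr_outliers(readings)
print(outliers)  # [40]
```

Here Q1 is 11 and Q3 is 12.75, so anything above 15.375 or below 8.375 is flagged, and only the reading of 40 qualifies.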

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def plot_data(y_val, x_val=None, legend="", x_label="", y_label="", title="", file_name=""):
    """line chart: use it to see the trend of the data"""
    if x_val is None:
        x_val = range(len(y_val))
    plt.figure()
    plt.plot(x_val, y_val, label=legend)
    if legend:
        # show legend (line label)
        plt.legend()
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    # fix the layout before saving, so the saved file matches what is shown
    plt.tight_layout()
    if file_name:
        plt.savefig('./result/%s.png' % file_name, bbox_inches='tight')
    plt.show()


def box_plot(data, file_name=""):
    """
    box plot: use it to see the distribution (and outliers) of the data
    :param data: list of data
    :param file_name: file name of the plot figure
    :return: None
    """
    plt.boxplot(np.array(data), sym="o")
    if file_name:
        plt.savefig('./result/%s.png' % file_name, bbox_inches='tight')
    plt.show()


def distribution_plot(data, file_name=""):
    """
    distribution plot: use it to see the distribution of the data
    :param data: list of data
    :param file_name: file name of the plot figure
    :return: None
    """
    data = np.array(data)
    plt.figure(figsize=(8, 5))
    sns.set_style('whitegrid')
    # the original used sns.distplot(data, rug=True, color='b'), which is
    # deprecated since seaborn 0.11; histplot + rugplot are the replacement
    sns.histplot(data, kde=True, stat="density", color='b')
    sns.rugplot(data, color='b')
    plt.title("distribution")
    if file_name:
        plt.savefig("./result/%s.png" % file_name, bbox_inches='tight')
    plt.show()


def scatter_plot(y_val, x_val=None, x_label="", y_label="", title="", file_name=""):
    """
    scatter plot: use it to see the distribution of the data
    :param y_val: y value
    :param x_val: x value, if None will use range(len(y_val))
    :param x_label: label of the x axis
    :param y_label: label of the y axis
    :param title: title of the figure
    :param file_name: file name of the plot figure
    :return: None
    """
    if x_val is None:
        x_val = range(len(y_val))
    plt.scatter(x_val, y_val, marker='o', color='black', s=10)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    if file_name:
        plt.savefig('./result/%s.png' % file_name, bbox_inches='tight')
    plt.show()


def heatmap(data, file_name="", method="pearson"):
    """
    draw the heat map of the correlations between several series
    :param data: DataFrame format, one column per series
    :param file_name: file name of the plot figure
    :param method: method used to calculate the correlation
        pearson - range -1 ~ 1 (±1 only when the two variables are in a perfect linear relation)
        spearman - range -1 ~ 1 (±1 when the relation between the two variables
        can be described by a monotonic function)
        rule of thumb for the absolute value:
        < 0.1: no relation
        0.10 ~ 0.29: weak relation
        0.30 ~ 0.49: medium relation
        >= 0.5: strong relation
    :return: None
    """
    # compute the correlation matrix once and reuse it
    corr = data.corr(method=method)
    sns.heatmap(corr,
                xticklabels=corr.columns,
                yticklabels=corr.columns,
                annot=True, annot_kws={'weight': 'bold'},
                vmin=-0.5, vmax=1, cmap="YlGnBu")
    plt.tight_layout()
    if file_name:
        file_name = file_name.split(".")[0]
        plt.savefig("./result/heatmap/%s.png" % file_name)
    plt.show()
    plt.close()


def bar_plot(y_val, x_val=None, x_label=None, horizontal=False, file_name=""):
    """bar chart: use it to compare the relative size of the values"""
    if x_val is None:
        x_val = range(len(y_val))
    if x_label is None:
        x_label = x_val
    if horizontal:
        plt.barh(y=x_val, width=y_val, tick_label=x_label)
        # write each value next to the end of its bar
        for a, b in zip(y_val, x_val):
            plt.text(a + 0.01, b, '%.0f' % a, ha='left', va='center', fontsize=11)
    else:
        plt.bar(x=x_val, height=y_val, tick_label=x_label)
        # write each value on top of its bar
        for a, b in zip(x_val, y_val):
            plt.text(a, b + 0.01, '%.0f' % b, ha='center', va='bottom', fontsize=11)
    plt.tight_layout()
    # only save when a file name is given (the original saved unconditionally)
    if file_name:
        plt.savefig('./result/%s.png' % file_name, bbox_inches='tight')
    plt.show()


def pie_plot(data, labels=None):
    """
    draw data in a pie figure
    :param data: list (pure data) or dict (keys are used as labels, values as data)
    :param labels: labels for list data, ignored if data is a dict
    :return: None
    """
    if isinstance(data, dict):
        labels = list(data.keys())
        X = list(data.values())
    else:
        # the original left X empty for list input, so nothing was drawn
        X = list(data)
    if labels is None:
        raise ValueError("labels should be specified")
    # pie chart (data, label for each value, percentages with two decimals)
    plt.pie(X, labels=labels, autopct='%1.2f%%')
    plt.title("Pie chart")
    plt.show()
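The difference between the Pearson and Spearman coefficients used for the heat map can be seen on a relation that is monotonic but not linear. This is a minimal pure-Python sketch; the helpers `pearson`, `ranks` and `spearman` are written for this example, and the rank helper assumes there are no tied values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]          # monotonic, but clearly not linear
r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(round(r_pearson, 4))   # 0.9431
print(round(r_spearman, 4))  # 1.0
```

Spearman reaches 1 because `ys` always increases with `xs`, while Pearson stays below 1 because the points do not lie on a straight line.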

Evaluating Model Results

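When several models are available, we need a metric to judge how well each one performs. For (continuous) time series, the mean squared error (MSE) and the root mean squared error (RMSE) are common choices.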

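import numpy as np


def rmse(predictions, targets):
    """
    root-mean-square error
    :param predictions: predicted values
    :param targets: observed values, same length as predictions
    :return: RMSE, in the same unit as the data
    """
    predictions = np.array(predictions)
    targets = np.array(targets)
    return np.sqrt(((predictions - targets) ** 2).mean())


def mse(predictions, targets):
    """
    mean-square error
    :param predictions: predicted values
    :param targets: observed values, same length as predictions
    :return: MSE, in the squared unit of the data
    """
    predictions = np.array(predictions)
    targets = np.array(targets)
    return ((predictions - targets) ** 2).mean()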
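A small worked example makes the two metrics easy to check by hand. This pure-Python sketch uses made-up numbers; the function names `mse_plain` and `rmse_plain` are chosen here only to keep the example self-contained.

```python
from math import sqrt


def mse_plain(predictions, targets):
    """Mean of the squared differences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def rmse_plain(predictions, targets):
    """Square root of the MSE, which brings the error back to the data's unit."""
    return sqrt(mse_plain(predictions, targets))


predictions = [2.0, 4.0, 6.0]
targets = [1.0, 4.0, 8.0]
# errors: 1, 0, -2  ->  squared: 1, 0, 4  ->  mean: 5 / 3
mse_value = mse_plain(predictions, targets)
rmse_value = rmse_plain(predictions, targets)
print(round(mse_value, 4))   # 1.6667
print(round(rmse_value, 4))  # 1.291
```

Because the errors are squared, the single error of 2 contributes four times as much as the error of 1, so both metrics penalize large misses more heavily than small ones.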
