6 - Pandas Time Series Processing: Resampling and Frequency Conversion (Upsampling and Downsampling, resample(), OHLC, groupby() Resampling)

Published: 2023/12/15

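Resampling is the process of converting a time series from one frequency to another:

Converting from a higher frequency to a lower one is called downsampling.
Converting from a lower frequency to a higher one is called upsampling.

1. Resampling with the resample() method

Example: given a time series ts indexed by calendar date, resample it to monthly frequency and compute the mean of each month.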

>>> import numpy as np
>>> import pandas as pd
>>> ts = pd.Series(np.random.randint(0,10,5))
>>> ts.index = pd.date_range('2020-7-30',periods=5)
>>> ts
2020-07-30    4
2020-07-31    4
2020-08-01    8
2020-08-02    6
2020-08-03    7
Freq: D, dtype: int32

>>> ts.resample('M')
DatetimeIndexResampler [freq=<MonthEnd>, axis=0, closed=right, label=right, convention=start, base=0]
>>> ts.resample('M').mean()
2020-07-31    4.0
2020-08-31    7.0
Freq: M, dtype: float64
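The monthly mean above can be reproduced with fixed values instead of random ones. This is a minimal sketch, assuming the five values printed earlier. The try/except around the frequency alias is an addition that is not in the original: pandas 2.2 and later spell the month-end alias "ME", while older releases only accept "M".

```python
import pandas as pd

# Same five daily values as the printed example, fixed so the result is reproducible.
ts = pd.Series([4, 4, 8, 6, 7], index=pd.date_range("2020-07-30", periods=5))

# "ME" is the month-end alias in pandas 2.2+; older releases reject it
# with a ValueError and expect "M" instead.
try:
    monthly = ts.resample("ME").mean()
except ValueError:
    monthly = ts.resample("M").mean()

print(monthly)
```

The result is labeled with the last day of each month, and because it is a mean it comes back as float64 even though the input is integer.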

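2. Downsampling

When downsampling with resample(), two things need to be decided:

Which side of each interval is closed (the closed parameter: right closes the right edge, left closes the left edge)
How each aggregated bin is labeled, with the start of the interval or with its end (the label parameter)

Example: aggregate minute-level data into 2-minute bins by summing. Here ts is a minute-frequency series holding the values 0 to 4, starting at 2020-08-01 00:00:00. Passing closed='right' closes the right edge of each bin; passing closed='left' closes the left edge.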

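>>> # ts redefined to match the outputs below (inferred; the original omits this step)
>>> ts = pd.Series(np.arange(5), index=pd.date_range('2020-08-01', periods=5, freq='min'))
>>> ts.resample('2min',closed='right').sum()
2020-07-31 23:58:00    0
2020-08-01 00:00:00    3
2020-08-01 00:02:00    7
Freq: 2T, dtype: int32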

>>> ts.resample('2min',closed='left').sum()
2020-08-01 00:00:00    1
2020-08-01 00:02:00    5
2020-08-01 00:04:00    4
Freq: 2T, dtype: int32
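The second question, how each bin is labeled, is controlled by the label parameter, which the examples above leave at its default. The following is a small sketch, assuming a minute-frequency series with the values 0 to 4. It keeps closed='right' and only switches the label from the left edge to the right edge.

```python
import pandas as pd

# Five one-minute observations: 00:00 -> 0, 00:01 -> 1, ..., 00:04 -> 4.
ts = pd.Series(range(5), index=pd.date_range("2020-08-01", periods=5, freq="min"))

# Bins are (23:58, 00:00], (00:00, 00:02], (00:02, 00:04].
# For a "2min" frequency the default label is the left edge of each bin.
left_labeled = ts.resample("2min", closed="right").sum()

# Same bins and same sums, but each bin is now named after its right edge.
right_labeled = ts.resample("2min", closed="right", label="right").sum()

print(left_labeled)
print(right_labeled)
```

The sums are identical in both results; only the timestamps attached to them move by one bin width.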

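The loffset argument shifts the result index: pass it a string or an offset object and the labels move by that amount. Note that loffset was deprecated in pandas 1.1 and removed in pandas 2.0, so the two calls below only run on older versions.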

>>> ts.resample('T',loffset='-1s').sum()
2020-07-31 23:59:59    0
2020-08-01 00:00:59    1
2020-08-01 00:01:59    2
2020-08-01 00:02:59    3
2020-08-01 00:03:59    4
Freq: T, dtype: int32

>>> ts.resample('2T',loffset='-1s').sum()
2020-07-31 23:59:59    1
2020-08-01 00:01:59    5
2020-08-01 00:03:59    4
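On pandas versions where loffset is no longer available, the same shift can be applied to the index after resampling. This is a sketch under the assumption that ts is the minute-frequency series with the values 0 to 4; the variable name shifted is only illustrative.

```python
import pandas as pd

ts = pd.Series(range(5), index=pd.date_range("2020-08-01", periods=5, freq="min"))

# Resample first, then move every label back by one second.
shifted = ts.resample("2min").sum()
shifted.index = shifted.index + pd.Timedelta(seconds=-1)

print(shifted)
```

This gives the same labels and sums as the loffset='-1s' call on the 2-minute bins above.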

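3. OHLC resampling

In finance there is a common way of aggregating a time series, OHLC, which computes four values for each bin:

O: open, the first value
H: high, the maximum
L: low, the minimum
C: close, the last value

The same aggregation is useful outside finance as well: O is the initial value, H the maximum, L the minimum, and C the final value.

Passing how='ohlc' returns a DataFrame containing these four aggregates, but that syntax is deprecated, as the warning below shows. Call .ohlc() on the resampler instead: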

>>> ts.resample('2T',closed = 'right',how = 'ohlc')
__main__:1: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).ohlc()
                     open  high  low  close
2020-07-31 23:58:00     0     0    0      0
2020-08-01 00:00:00     1     2    1      2
2020-08-01 00:02:00     3     4    3      4

>>> ts.resample('2T',closed = 'right').ohlc()
                     open  high  low  close
2020-07-31 23:58:00     0     0    0      0
2020-08-01 00:00:00     1     2    1      2
2020-08-01 00:02:00     3     4    3      4
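As a rough cross-check of what the four columns mean, the sketch below rebuilds the OHLC table from first, max, min and last on the same resampler. The series values (0 to 4 at one-minute spacing) are an assumption chosen to match the outputs above.

```python
import pandas as pd

ts = pd.Series(range(5), index=pd.date_range("2020-08-01", periods=5, freq="min"))
resampler = ts.resample("2min", closed="right")

# Built-in OHLC aggregation.
ohlc = resampler.ohlc()

# The same four values computed one aggregate at a time.
manual = resampler.agg(["first", "max", "min", "last"])
manual.columns = ["open", "high", "low", "close"]

print(ohlc)
print(manual)
```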

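3. groupby resampling

This repeats the example from part four of another post, "Pandas Time Series Processing (Date Ranges with pd.date_range(), Frequencies (Base Frequency Table) and Shifting (shift(), rollforward(), rollback()))".

Example: given a time series, how can the mean of each month's data be computed?

There is no need to roll dates with rollback. Just pass groupby a function that reads the month field of each entry in the ts index; the result is then labeled by month number.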

>>> rng = pd.date_range('2020-1-14',periods=100,freq='D')
>>> ts = pd.Series(np.random.randint(0,10,100),index=rng)
>>> ts.groupby(lambda x:x.month).mean()
1    3.722222
2    5.068966
3    4.290323
4    4.863636
dtype: float64

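To group the same series by day of the week and take the mean of each group, pass a function that reads the weekday field of the ts index (0 is Monday, 6 is Sunday).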

>>> ts.groupby(lambda x:x.weekday).mean()
0    4.714286
1    5.333333
2    4.800000
3    3.214286
4    4.142857
5    4.785714
6    4.714286
dtype: float64
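An alternative to the lambda is to group directly on attributes of the DatetimeIndex. The sketch below assumes a deterministic series (the values 0 to 99) so that the group means can be checked by hand. The third grouping, by calendar month via to_period('M'), is an addition that keeps the year in the label.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2020-01-14", periods=100, freq="D")
ts = pd.Series(np.arange(100), index=rng)

# Group by month number (1-12); months from different years would be merged.
by_month_number = ts.groupby(ts.index.month).mean()

# Group by day of week (0 = Monday, ..., 6 = Sunday).
by_weekday = ts.groupby(ts.index.dayofweek).mean()

# Group by calendar month, so the label keeps the year (2020-01, 2020-02, ...).
by_calendar_month = ts.groupby(ts.index.to_period("M")).mean()

print(by_month_number)
print(by_weekday)
print(by_calendar_month)
```

Grouping on the month number is fine for a series that spans less than a year, as here; for longer series the period-based grouping avoids mixing January 2020 with January 2021.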

4. Upsampling

Converting data from a lower frequency to a higher one requires no aggregation.

>>> data = pd.DataFrame(np.random.randint(0,10,size=(2,4)))
>>> data.index = pd.date_range('2020-1-14',periods = 2,freq='W-WED')
>>> data.columns = ['one','two','three','four']
>>> data
            one  two  three  four
2020-01-15    6    9      8     6
2020-01-22    5    6      7     6

Resampling data to daily frequency introduces missing values by default.

>>> data.resample('D').mean()
            one  two  three  four
2020-01-15  6.0  9.0    8.0   6.0
2020-01-16  NaN  NaN    NaN   NaN
2020-01-17  NaN  NaN    NaN   NaN
2020-01-18  NaN  NaN    NaN   NaN
2020-01-19  NaN  NaN    NaN   NaN
2020-01-20  NaN  NaN    NaN   NaN
2020-01-21  NaN  NaN    NaN   NaN
2020-01-22  5.0  6.0    7.0   6.0

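To fill the gaps with the preceding value, use ffill(); bfill() fills with the following value instead. For more on the fill options, see part two of another post, "Pandas Data Exploration: Handling Missing Values and Dropping Data (fillna(), drop(), drop_duplicates(), dropna())".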

>>> data.resample('D').ffill()
            one  two  three  four
2020-01-15    6    9      8     6
2020-01-16    6    9      8     6
2020-01-17    6    9      8     6
2020-01-18    6    9      8     6
2020-01-19    6    9      8     6
2020-01-20    6    9      8     6
2020-01-21    6    9      8     6
2020-01-22    5    6      7     6

>>> data.resample('D').bfill()
            one  two  three  four
2020-01-15    6    9      8     6
2020-01-16    5    6      7     6
2020-01-17    5    6      7     6
2020-01-18    5    6      7     6
2020-01-19    5    6      7     6
2020-01-20    5    6      7     6
2020-01-21    5    6      7     6
2020-01-22    5    6      7     6

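You can also fill only a limited number of periods, which caps how far an observation is carried forward: limit=2 means each observation fills at most the next two rows. The original called pad(), an alias of ffill() that recent pandas versions no longer provide on the resampler, so ffill() is used here.

>>> data.resample('D').ffill(limit=2)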
            one  two  three  four
2020-01-15  6.0  9.0    8.0   6.0
2020-01-16  6.0  9.0    8.0   6.0
2020-01-17  6.0  9.0    8.0   6.0
2020-01-18  NaN  NaN    NaN   NaN
2020-01-19  NaN  NaN    NaN   NaN
2020-01-20  NaN  NaN    NaN   NaN
2020-01-21  NaN  NaN    NaN   NaN
2020-01-22  5.0  6.0    7.0   6.0
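The upsampling steps above can be sketched with fixed values. This example assumes a two-column frame (the column names and values are taken from the first two columns of the printed data) and uses asfreq() to expose the gaps, which is one of several ways to upsample without aggregating.

```python
import pandas as pd

# Two weekly observations, on Wednesday 2020-01-15 and Wednesday 2020-01-22.
data = pd.DataFrame(
    {"one": [6, 5], "two": [9, 6]},
    index=pd.date_range("2020-01-14", periods=2, freq="W-WED"),
)

# Upsample to daily frequency: the six days in between become NaN.
daily = data.resample("D").asfreq()

# Carry each observation forward by at most two rows.
filled = data.resample("D").ffill(limit=2)

print(daily)
print(filled)
```

With limit=2, 2020-01-16 and 2020-01-17 take the values from 2020-01-15, while 2020-01-18 through 2020-01-21 stay missing.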

5. Resampling with periods

Resampling data indexed by periods is fairly straightforward. First create an object:

>>> df = pd.DataFrame(np.random.randn(24,4))
>>> df.index = pd.period_range('2020-1',periods=24,freq='M')
>>> df.columns = ['one','two','three','four']
>>> df
              one       two     three      four
2020-01 -0.773347  0.121962  0.688172 -0.128935
2020-02  1.260893  0.949058  0.617078 -1.444115
2020-03  0.470896  2.678574 -0.789855 -0.788634
2020-04 -1.011997 -0.743128  1.118954 -0.643499
2020-05  0.139304  0.119937  0.386177 -0.395788
2020-06 -1.264226 -0.647303  0.484827  0.986434
2020-07  0.430877 -0.007752  0.484699 -0.494257
2020-08  2.734575  0.850000  1.020758  0.078646
2020-09 -0.038556  0.168716 -1.301591  0.874963
2020-10 -1.061978  0.329240  0.372740 -0.474351
2020-11 -1.744309  0.050698 -1.261978  1.312718
2020-12  0.518119 -0.062940  0.765845  1.788449
2021-01 -0.876448  0.449906  0.927772 -0.044937
2021-02 -0.515143  1.594102  0.470797  0.377561
2021-03  0.857145  0.488788  0.346126  0.588185
2021-04 -0.467256  0.338766  0.307865 -0.713797
2021-05  1.674114 -0.730812  0.486691  0.059144
2021-06  0.746407 -0.542054  0.047589 -0.616221
2021-07  0.205364 -0.865091 -0.450592  0.736776
2021-08  1.123738  0.091906  1.039720  0.776065
2021-09  1.869627  1.688411 -2.790112 -0.116390
2021-10 -1.315471 -0.085058  0.729701  0.848654
2021-11  2.065949  0.297769 -0.398484 -1.197251
2021-12 -0.466184 -0.084250  0.700341 -1.764270

Pass 'A-DEC' to downsample to annual frequency, with each year ending in December.

>>> df.resample('A-DEC').mean()
           one       two     three      four
2020 -0.028312  0.317255  0.215486  0.055969
2021  0.408487  0.220199  0.118118 -0.088873

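Pass 'A-JUN' to downsample to fiscal years that end in June. The 24 months now fall into three fiscal years, each labeled by the calendar year in which it ends.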

>>> df.resample('A-JUN').mean()
           one       two     three      four
2020 -0.196413  0.413183  0.417559 -0.402423
2021  0.188129  0.243888  0.222276  0.228009
2022  0.580504  0.173948 -0.194904 -0.119403
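To see which months land in which June-ending fiscal year, one option is to map each month to its fiscal year by hand. This sketch does not go through resample(); it uses the quarterly alias 'Q-JUN' and the qyear attribute instead, and the series of ones is a placeholder whose only purpose is to count months.

```python
import pandas as pd

# The same 24 monthly periods as the example: 2020-01 through 2021-12.
months = pd.period_range("2020-01", periods=24, freq="M")

# Convert each month to its quarter in a June-ending fiscal year,
# then read off the fiscal year that quarter belongs to.
fiscal_year = months.asfreq("Q-JUN").qyear

# Count how many months fall into each fiscal year.
counts = pd.Series(1, index=months).groupby(fiscal_year).sum()

print(counts)
```

Fiscal year 2020 only receives January to June 2020, fiscal year 2021 receives a full twelve months, and fiscal year 2022 receives July to December 2021, which is why the 'A-JUN' result has three rows.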

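6. Upsampling with periods

You have to decide at which end of each interval in the new frequency the original value is placed. The convention parameter defaults to start and can be set to end.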

>>> annu_df = df.resample('A-DEC').mean()
>>> annu_df
           one       two     three      four
2020 -0.028312  0.317255  0.215486  0.055969
2021  0.408487  0.220199  0.118118 -0.088873

>>> annu_df.resample('Q-DEC').ffill()
             one       two     three      four
2020Q1 -0.028312  0.317255  0.215486  0.055969
2020Q2 -0.028312  0.317255  0.215486  0.055969
2020Q3 -0.028312  0.317255  0.215486  0.055969
2020Q4 -0.028312  0.317255  0.215486  0.055969
2021Q1  0.408487  0.220199  0.118118 -0.088873
2021Q2  0.408487  0.220199  0.118118 -0.088873
2021Q3  0.408487  0.220199  0.118118 -0.088873
2021Q4  0.408487  0.220199  0.118118 -0.088873
>>> annu_df.resample('Q-DEC',convention = 'end').ffill()
             one       two     three      four
2020Q4 -0.028312  0.317255  0.215486  0.055969
2021Q1 -0.028312  0.317255  0.215486  0.055969
2021Q2 -0.028312  0.317255  0.215486  0.055969
2021Q3 -0.028312  0.317255  0.215486  0.055969
2021Q4  0.408487  0.220199  0.118118 -0.088873
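The start/end choice can be illustrated on a single period. The sketch below is an analogy rather than the resample() call itself: it uses Period.asfreq, whose how argument plays a similar role, and it converts a quarter to months instead of a year to quarters.

```python
import pandas as pd

quarter = pd.Period("2020Q1", freq="Q-DEC")

# Anchor the quarter at its first month or at its last month.
first_month = quarter.asfreq("M", how="start")
last_month = quarter.asfreq("M", how="end")

print(first_month)
print(last_month)
```

In the same way, convention='start' places the 2020 annual value at 2020Q1, while convention='end' places it at 2020Q4, which is why the second result above begins at 2020Q4.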
