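I have time series data measured at varying frequencies. I would like to convert it to a single frequency, roughly one observation every few days. The resulting time series may be irregular.
For example, I have this time series: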
Date        Value
2017-02-16  26.17000
2017-02-27  26.28000
2017-03-13  26.30000
2017-03-29  26.23000
2017-04-14  26.19000
2017-04-26  26.06000
2017-05-13  26.06000
2017-05-27  25.65000
2017-06-16  25.29000
2017-07-05  25.25000
2017-07-14  25.48000
2017-07-26  25.57000
2017-08-17  25.16000
2017-08-28  25.33000
2017-09-12  25.68235
2017-09-13  25.83799
2017-09-14  25.76669
2017-09-15  25.85253
2017-09-16  25.82017
2017-09-17  25.78362
2017-09-18  25.88422
2017-09-19  25.89594
2017-09-20  25.85522
2017-09-21  25.83583
2017-09-22  25.80082
2017-09-23  25.80076
2017-09-24  25.79209
2017-09-25  25.80632
2017-09-26  25.77773
2017-09-27  25.76311
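We can build an intermediate working DataFrame that contains both the reindexed rows and the original rows, which makes it easy to copy dates from the old index onto nearby dates in the new index. Then we keep only the rows of the new index and copy the dates over.
Step 1: Build a DataFrame containing both the reindexed rows and the original rows:
We can use Index.union to get the union of the reindexed index and the original index, as follows: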
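import datetime
import numpy as np
import pandas as pd

idx_new = serie.asfreq('14D').index   # regular 14-day grid
idx_old = serie.index                 # original, irregular dates
idx_all = idx_new.union(idx_old)
tolerance = 3                         # in days
serie_all = serie.reindex(index=idx_all, method='nearest', tolerance=datetime.timedelta(tolerance))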
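To make the effect of method='nearest' together with tolerance more concrete, here is a minimal sketch on a made-up three-point series (the dates and values are hypothetical, not taken from the question). It assumes the regular grid simply starts at the first date and steps by 14 days.

```python
import pandas as pd

# Hypothetical toy series, not the data from the question.
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2020-01-01', '2020-01-12', '2020-02-05']),
)

# Regular 14-day grid starting at the first date: 01-01, 01-15, 01-29.
grid = pd.date_range(s.index[0], s.index[-1], freq='14D')

# Sorted union of the grid dates and the original dates.
both = grid.union(s.index)

# Each date takes the value of the closest original date, but only if
# that date is at most 3 days away; otherwise the value is NaN.
out = s.reindex(both, method='nearest', tolerance=pd.Timedelta(days=3))
print(out)
```

Here 2020-01-15 picks up the value from 2020-01-12 (3 days away), while 2020-01-29 has no original date within 3 days and stays NaN.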
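Step 2: Filter the rows of the selected index and copy the dates:
Let's use numpy.select() to filter on multiple conditions. Then keep only the rows whose index is not NaN/NaT, using .loc:
Filter conditions:
- For dates that are not in the new index, mask them as NaT so they get discarded
- For dates whose previous date entry is in the original index and has the same Value, and where the two dates differ by at most the tolerance (3 days) ==> change the index date to the previous date entry
- Apply the same check to the immediately following date entry ==> change the index date to the following date entry
- Otherwise, keep the date from the new index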
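The order of these conditions matters because numpy.select() returns, for each element, the choice belonging to the first condition that is True. A minimal sketch with made-up numbers (unrelated to the question's data) shows that behavior:

```python
import numpy as np

values = np.array([1, 5, 10, 20])

# Conditions are checked in order; the first True one wins per element.
condlist = [values < 3, values < 15]
choicelist = [values * 0, values * 100]

# Elements that match no condition receive the default value.
result = np.select(condlist, choicelist, default=-1)
print(result.tolist())  # [0, 500, 1000, -1]
```

The first element satisfies both conditions but takes the first choice. In the same way, putting the NaT mask first ensures it takes precedence over the two date-copying rules.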
condlist = [~ serie_all.index.isin(idx_new),
            serie_all.index.to_series().shift().isin(idx_old) & serie_all['Value'].eq(serie_all['Value'].shift()) & serie_all.index.to_series().diff().dt.days.le(tolerance),
            serie_all.index.to_series().shift(-1).isin(idx_old) & serie_all['Value'].eq(serie_all['Value'].shift(-1)) & serie_all.index.to_series().diff(-1).dt.days.abs().le(tolerance),
            True
           ]
choicelist = [pd.NaT,
              serie_all.index.to_series().shift(),
              serie_all.index.to_series().shift(-1),
              serie_all.index,
             ]
# Change date index values based on conditions
serie_all.index = pd.to_datetime(np.select(condlist, choicelist))
# Keep only non-NaT rows
serie_final = serie_all.loc[serie_all.index.notna()].rename_axis(index='Date')
Result:
print(serie_final)
               Value
Date
2017-02-16  26.17000
2017-02-27  26.28000
2017-03-13  26.30000
2017-03-29  26.23000
2017-04-14  26.19000
2017-04-26  26.06000
2017-05-13  26.06000
2017-05-27  25.65000
2017-06-08       NaN
2017-06-22       NaN
2017-07-05  25.25000
2017-07-20       NaN
2017-08-03       NaN
2017-08-17  25.16000
2017-08-28  25.33000
2017-09-14  25.76669
Data setup
data = {'Value': {pd.Timestamp('2017-02-16 00:00:00'): 26.17,
                  pd.Timestamp('2017-02-27 00:00:00'): 26.28,
                  pd.Timestamp('2017-03-13 00:00:00'): 26.3,
                  pd.Timestamp('2017-03-29 00:00:00'): 26.23,
                  pd.Timestamp('2017-04-14 00:00:00'): 26.19,
                  pd.Timestamp('2017-04-26 00:00:00'): 26.06,
                  pd.Timestamp('2017-05-13 00:00:00'): 26.06,
                  pd.Timestamp('2017-05-27 00:00:00'): 25.65,
                  pd.Timestamp('2017-06-16 00:00:00'): 25.29,
                  pd.Timestamp('2017-07-05 00:00:00'): 25.25,
                  pd.Timestamp('2017-07-14 00:00:00'): 25.48,
                  pd.Timestamp('2017-07-26 00:00:00'): 25.57,
                  pd.Timestamp('2017-08-17 00:00:00'): 25.16,
                  pd.Timestamp('2017-08-28 00:00:00'): 25.33,
                  pd.Timestamp('2017-09-12 00:00:00'): 25.68235,
                  pd.Timestamp('2017-09-13 00:00:00'): 25.83799,
                  pd.Timestamp('2017-09-14 00:00:00'): 25.76669,
                  pd.Timestamp('2017-09-15 00:00:00'): 25.85253,
                  pd.Timestamp('2017-09-16 00:00:00'): 25.82017,
                  pd.Timestamp('2017-09-17 00:00:00'): 25.78362,
                  pd.Timestamp('2017-09-18 00:00:00'): 25.88422,
                  pd.Timestamp('2017-09-19 00:00:00'): 25.89594,
                  pd.Timestamp('2017-09-20 00:00:00'): 25.85522,
                  pd.Timestamp('2017-09-21 00:00:00'): 25.83583,
                  pd.Timestamp('2017-09-22 00:00:00'): 25.80082,
                  pd.Timestamp('2017-09-23 00:00:00'): 25.80076,
                  pd.Timestamp('2017-09-24 00:00:00'): 25.79209,
                  pd.Timestamp('2017-09-25 00:00:00'): 25.80632,
                  pd.Timestamp('2017-09-26 00:00:00'): 25.77773,
                  pd.Timestamp('2017-09-27 00:00:00'): 25.76311}}
serie = pd.DataFrame(data).rename_axis(index='Date')
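As a possible cross-check, the same snapping idea can be sketched more directly with Index.get_indexer, which looks up the nearest original date for each grid date. This is an alternative formulation, not the approach used above. The helper name snap_to_grid and the toy data are hypothetical, and the sketch assumes the tolerance is less than half the grid step so that no two grid dates snap to the same original date.

```python
import numpy as np
import pandas as pd

def snap_to_grid(s, freq='14D', tolerance=pd.Timedelta(days=3)):
    """Hypothetical helper: resample s onto a regular grid, replacing each
    grid date by the nearest original date when one lies within tolerance."""
    grid = pd.date_range(s.index[0], s.index[-1], freq=freq)
    # Position of the nearest original date per grid date, -1 if none in range.
    pos = s.index.get_indexer(grid, method='nearest', tolerance=tolerance)
    found = pos >= 0
    # Keep the grid date where nothing was found, else use the original date.
    new_index = grid.where(~found, s.index[pos])
    # Matching value where found, NaN otherwise.
    values = np.where(found, s.to_numpy()[pos], np.nan)
    return pd.Series(values, index=new_index, name=s.name)

# Hypothetical toy series, not the data from the question.
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2020-01-01', '2020-01-12', '2020-02-05']),
)
snapped = snap_to_grid(s)
print(snapped)
```

On the toy data the grid date 2020-01-15 is replaced by the original date 2020-01-12, and 2020-01-29 is kept with a NaN value because no original date lies within 3 days of it.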