I have this data:
import pandas as pd

index = pd.DatetimeIndex(['2000-01-01 08:00:00', '2000-01-01 08:00:01',
'2000-01-01 08:00:03', '2000-01-01 08:00:04',
'2000-01-01 08:00:05', '2000-01-01 08:00:06',
'2000-01-01 08:00:09', '2000-01-01 08:00:10',
'2000-01-01 08:00:11', '2000-01-01 08:00:12',
'2000-01-01 08:00:15', '2000-01-01 08:00:16',
'2000-01-01 08:00:18', '2000-01-01 08:00:19',
'2000-01-01 08:00:20', '2000-01-01 08:00:22',
'2000-01-01 08:00:23', '2000-01-01 08:00:24',
'2000-01-01 08:00:25', '2000-01-01 08:00:26',
'2000-01-01 08:00:27', '2000-01-01 08:00:28'])
df = pd.DataFrame({"A": [4,3,8,15,2,3,14,17,15,20,13,2,5,15,10,10,13,20,1,3,10,11]},
index=index)
df
I found that I can get the max within each fixed time interval like this:
df2 = df.groupby(pd.Grouper(freq='5s', base=2)).rolling('5s', min_periods=1).max()
df2
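For each record, I want the rolling max over the last two fixed 5s intervals, where the most recent interval is the current, still unfinished one (it only runs up to that record).
I found that the rolling max over two complete fixed 5s intervals can be obtained with: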
df3 = df.resample('5s', base=2).max().rolling(2).max()
But for the record at 2000-01-01 08:00:22, for example, the result should be 15, not 20. I can't figure out how to combine the two approaches.
The result should be:
A last-2x-5s-intervals-max
2000-01-01 07:59:57 2000-01-01 08:00:00 4 NaN (not enough data)
2000-01-01 08:00:01 3 NaN (not enough data)
2000-01-01 08:00:02 2000-01-01 08:00:03 8 8.0 (max from previous 5s to actual record/to actual unfinished 5s time interval)
2000-01-01 08:00:04 15 15.0 (max from 2000-01-01 07:59:57 to 2000-01-01 08:00:04)
2000-01-01 08:00:05 2 15.0 (max from 2000-01-01 07:59:57 to 2000-01-01 08:00:05)
2000-01-01 08:00:06 3 15.0 (max from 2000-01-01 07:59:57 to 2000-01-01 08:00:06)
2000-01-01 08:00:07 2000-01-01 08:00:09 14 15.0 (max from 2000-01-01 08:00:02 to 2000-01-01 08:00:09)
2000-01-01 08:00:10 17 17.0 (max from 2000-01-01 08:00:02 to 2000-01-01 08:00:10)
2000-01-01 08:00:11 15 17.0 (max from 2000-01-01 08:00:02 to 2000-01-01 08:00:11)
2000-01-01 08:00:12 2000-01-01 08:00:12 20 20.0 (max from 2000-01-01 08:00:07 to 2000-01-01 08:00:12)
2000-01-01 08:00:15 13 20.0 (etc...)
2000-01-01 08:00:16 2 20.0
2000-01-01 08:00:17 2000-01-01 08:00:18 5 20.0
2000-01-01 08:00:19 15 20.0
2000-01-01 08:00:20 10 20.0
2000-01-01 08:00:22 2000-01-01 08:00:22 10 15.0
2000-01-01 08:00:23 13 15.0
2000-01-01 08:00:24 20 20.0
2000-01-01 08:00:25 1 20.0
2000-01-01 08:00:26 3 20.0
2000-01-01 08:00:27 2000-01-01 08:00:27 10 20.0
2000-01-01 08:00:28 11 20.0
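To pin down the rule behind this table, here is a minimal brute-force sketch in plain Python. It is only a reference for checking results, not a pandas solution. The helper name `last_two_bins_max` and its parameters are made up for illustration, timestamps are written as seconds after 08:00:00, and it assumes the records are sorted by time and that a missing previous bin means "not enough data".

```python
import math

# Seconds after 08:00:00 and the matching values of column A (from the question).
seconds = [0, 1, 3, 4, 5, 6, 9, 10, 11, 12, 15, 16,
           18, 19, 20, 22, 23, 24, 25, 26, 27, 28]
values = [4, 3, 8, 15, 2, 3, 14, 17, 15, 20, 13, 2,
          5, 15, 10, 10, 13, 20, 1, 3, 10, 11]


def last_two_bins_max(seconds, values, width=5, offset=2):
    """Hypothetical helper: max over the previous bin plus the current bin so far."""
    # Fixed bins start at offset, offset + width, ... (07:59:57, 08:00:02, ...).
    bins = [(s - offset) // width for s in seconds]
    result = []
    for i, b in enumerate(bins):
        previous = [v for j, v in enumerate(values) if bins[j] == b - 1]
        if not previous:
            # Assumption: no data in the previous bin means "not enough data".
            result.append(math.nan)
            continue
        # Only the records of the current bin up to and including record i.
        current = [v for j, v in enumerate(values[: i + 1]) if bins[j] == b]
        result.append(float(max(previous + current)))
    return result


reference = last_two_bins_max(seconds, values)
print(reference[15])  # the 08:00:22 record: 15.0, not 20.0
```

The 08:00:22 record gets 15 because the previous bin (08:00:17 to 08:00:21) peaks at 15 and the current bin has only seen the value 10 so far. The 20 at 08:00:24 has not happened yet.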
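One way is to take the cummax of each group on the first index level of df2 to get the running max within the current group. Then assign, as a second column, the per-group max shifted by one to get the previous group's max, realigned with loc on df2's first index level so each value lands on the right rows. Finally, take the max across the two columns: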
df2gr = df2.groupby(level=0)
# Parentheses are needed so the chained calls can span several lines.
df2['last_2x'] = (
    df2gr.cummax()
    .assign(l=df2gr.max().shift()
            .loc[df2.index.get_level_values(0)]
            .to_numpy())
    .max(axis=1, skipna=False)
)
print(df2)
A last_2x
2000-01-01 07:59:57 2000-01-01 08:00:00 4.0 NaN
2000-01-01 08:00:01 4.0 NaN
2000-01-01 08:00:02 2000-01-01 08:00:03 8.0 8.0
2000-01-01 08:00:04 15.0 15.0
2000-01-01 08:00:05 15.0 15.0
2000-01-01 08:00:06 15.0 15.0
2000-01-01 08:00:07 2000-01-01 08:00:09 14.0 15.0
2000-01-01 08:00:10 17.0 17.0
2000-01-01 08:00:11 17.0 17.0
2000-01-01 08:00:12 2000-01-01 08:00:12 20.0 20.0
2000-01-01 08:00:15 20.0 20.0
2000-01-01 08:00:16 20.0 20.0
2000-01-01 08:00:17 2000-01-01 08:00:18 5.0 20.0
2000-01-01 08:00:19 15.0 20.0
2000-01-01 08:00:20 15.0 20.0
2000-01-01 08:00:22 2000-01-01 08:00:22 10.0 15.0
2000-01-01 08:00:23 13.0 15.0
2000-01-01 08:00:24 20.0 20.0
2000-01-01 08:00:25 20.0 20.0
2000-01-01 08:00:26 20.0 20.0
2000-01-01 08:00:27 2000-01-01 08:00:27 10.0 20.0
2000-01-01 08:00:28 11.0 20.0
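On recent pandas versions the `base` argument of `pd.Grouper` and `resample` is no longer available (`offset='2s'` is the replacement), so the snippets above may need adjusting there. Below is a minimal self-contained sketch of the same idea that avoids both `base` and `groupby(...).rolling(...)` by computing the bin start with `floor`. The column names `bin_start`, `current` and `previous` are my own choices, and the sketch assumes the index is sorted. Like the answer above, "previous" means the previous non-empty bin.

```python
import pandas as pd

seconds = [0, 1, 3, 4, 5, 6, 9, 10, 11, 12, 15, 16,
           18, 19, 20, 22, 23, 24, 25, 26, 27, 28]
index = pd.Timestamp("2000-01-01 08:00:00") + pd.to_timedelta(seconds, unit="s")
df = pd.DataFrame({"A": [4, 3, 8, 15, 2, 3, 14, 17, 15, 20, 13, 2,
                         5, 15, 10, 10, 13, 20, 1, 3, 10, 11]}, index=index)

# Start of the fixed 5s bin for every row, with bin edges shifted by 2s
# (..., 07:59:57, 08:00:02, 08:00:07, ...), like freq='5s' with base=2.
shift = pd.Timedelta("2s")
bins = (df.index - shift).floor("5s") + shift

labeled = df.assign(bin_start=bins)
grouped = labeled.groupby("bin_start")["A"]

current = grouped.cummax()        # running max inside the unfinished bin
previous = grouped.max().shift()  # full max of the previous bin, one row per bin

# Broadcast the per-bin value back to the rows, then take the row-wise max.
# skipna=False keeps NaN where there is no previous bin yet.
labeled["last_2x"] = pd.DataFrame(
    {"current": current, "previous": previous.reindex(bins).to_numpy()}
).max(axis=1, skipna=False)
print(labeled)
```

Since every bin spans less than 5s, the per-group `rolling('5s').max()` in the original code is equivalent to a `cummax` inside each bin, which is why `cummax` alone is enough here.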
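After some more research and with @Ben.T's help, this is what I was trying to achieve. The higher-timeframe frequency and the number of periods are configurable: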
HIGHER_TF_FREQ = '5s'
HIGHER_TF_PERIOD = 4
df2gr = df.groupby(pd.Grouper(freq=HIGHER_TF_FREQ, base=2))
df2 = df2gr.rolling(HIGHER_TF_FREQ, min_periods=1).max().dropna()
df2['last_2x'] = df2.assign(
    l=df2gr.max().dropna().rolling(HIGHER_TF_PERIOD-1).max().shift()
          .loc[df2.index.get_level_values(0)].to_numpy()
).max(axis=1, skipna=False)
df2
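For reference, the same generalization can be wrapped in a function. This is only a sketch: the name `last_n_bins_max` and its parameters are hypothetical, it derives the bins with `floor` instead of `pd.Grouper(..., base=2)` so it does not depend on the `base` argument, and it assumes a sorted `DatetimeIndex`. Empty bins are skipped, so "preceding bins" means preceding non-empty bins.

```python
import pandas as pd


def last_n_bins_max(series, freq="5s", offset="2s", periods=2):
    """Hypothetical helper: max over the current unfinished bin plus the
    (periods - 1) preceding non-empty bins; NaN while history is too short."""
    if periods < 2:
        raise ValueError("periods must be at least 2")
    shift = pd.Timedelta(offset)
    bins = (series.index - shift).floor(freq) + shift
    frame = series.to_frame("value").assign(bin_start=bins)
    grouped = frame.groupby("bin_start")["value"]
    current = grouped.cummax()  # running max inside the unfinished bin
    # Max over the (periods - 1) completed bins before the current one.
    previous = grouped.max().rolling(periods - 1).max().shift()
    combined = pd.DataFrame(
        {"current": current, "previous": previous.reindex(bins).to_numpy()}
    )
    return combined.max(axis=1, skipna=False)


seconds = [0, 1, 3, 4, 5, 6, 9, 10, 11, 12, 15, 16,
           18, 19, 20, 22, 23, 24, 25, 26, 27, 28]
index = pd.Timestamp("2000-01-01 08:00:00") + pd.to_timedelta(seconds, unit="s")
a = pd.Series([4, 3, 8, 15, 2, 3, 14, 17, 15, 20, 13, 2,
               5, 15, 10, 10, 13, 20, 1, 3, 10, 11], index=index)

last_2x = last_n_bins_max(a, periods=2)  # same column as in the answer above
last_4x = last_n_bins_max(a, periods=4)  # corresponds to HIGHER_TF_PERIOD = 4
print(pd.DataFrame({"A": a, "last_2x": last_2x, "last_4x": last_4x}))
```

With `periods=4` the first three bins come out as NaN, because three completed bins are required before the current one.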