时间间隔到均匀间隔的时间序列



我需要为机器学习准备具有时间间隔的数据,我需要在时间戳之间获得相等的间隔。例如,对于3小时间隔,我希望有以下时间戳:00:00,03:00,6:00,9:00,12:00,15:00…例如:

df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
df[col] = df[col].apply(pd.to_datetime)
print(df)

输出:

Start                 End Val
0 2022-07-01 11:30:00 2022-07-01 18:30:00   a
1 2022-07-01 22:30:00 2022-07-02 03:30:00   b

我尝试获取时间戳:

df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
df['Datetime'] = df['Datetime'].dt.round('H')
print(df[['Datetime', 'Val']])

输出:

Datetime Val
0 2022-07-01 12:00:00   a
0 2022-07-01 14:00:00   a
0 2022-07-01 18:00:00   a
1 2022-07-01 22:00:00   b
1 2022-07-02 02:00:00   b

可以看到,这些时间戳的间隔不是相等的。我的预期结果:

Datetime  Val
4 2022-07-01 12:00:00    a
5 2022-07-01 15:00:00    a
6 2022-07-01 18:00:00    a
7 2022-07-01 21:00:00  NaN
8 2022-07-02 00:00:00    b
9 2022-07-02 03:00:00    b

我们可以使用函数merge_asof:

df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
date_min, date_max = df['Datetime'].dt.date.min(), df['Datetime'].dt.date.max() + pd.Timedelta('1D')
time_range = pd.date_range(date_min, date_max, freq='3H').to_series(name='Datetime')
df = pd.merge_asof(time_range, df, tolerance=pd.Timedelta('3H'))
df.truncate(df['Val'].first_valid_index(), df['Val'].last_valid_index())

输出:

Datetime  Val
4 2022-07-01 12:00:00    a
5 2022-07-01 15:00:00    a
6 2022-07-01 18:00:00    a
7 2022-07-01 21:00:00  NaN
8 2022-07-02 00:00:00    b
9 2022-07-02 03:00:00    b

带注释的代码

# Find min and max date of interval
s, e = df['Start'].min(), df['End'].max()
# Create a date range with freq=3H
# Create a output dataframe by assigning daterange to datetime column
df_out = pd.DataFrame({'datetime': pd.date_range(s.ceil('H'), e, freq='3H')}) 
# Create interval index from start and end date
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')
# Set the index of df to interval index and select Val column to create mapping series
# Then use this mapping series to substitute values in output dataframe
df_out['Val'] = df_out['datetime'].map(df.set_index(idx)['Val'])
结果

datetime  Val
0 2022-07-01 12:00:00    a
1 2022-07-01 15:00:00    a
2 2022-07-01 18:00:00    a
3 2022-07-01 21:00:00  NaN
4 2022-07-02 00:00:00    b
5 2022-07-02 03:00:00    b

对于这个问题,我真的很喜欢使用pd.DataFrame.reindex。特别是,您可以指定method='nearesttolerance='90m',以确保在需要的地方留下空白。

您可以使用.floor('3H').ceil('3H')方法分别使用开始和结束参数使用pd.date_range创建您的规则间隔时间序列。

import pandas as pd
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
df[col] = df[col].apply(pd.to_datetime)
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
result = pd.DataFrame()
for name, group in df.groupby('Val'):
group = group.set_index('Datetime')
group.index = group.index.ceil('1H')
idx = pd.date_range(group.index.min().floor('3H'), group.index.max().ceil('3H'), freq='3H')
group = group.reindex(idx, tolerance = '90m', method='nearest')
result = pd.concat([result, group])
result = result.sort_index()

返回:

Val
2022-07-01 12:00:00 a
2022-07-01 15:00:00 a
2022-07-01 18:00:00 a
2022-07-01 21:00:00 
2022-07-02 00:00:00 b
2022-07-02 03:00:00 b

另一种方法是简单地将timehours添加到循环的开始时间。

from datetime import datetime,timedelta
#taking start time as current time just for example
start = datetime.now()
#taking end time as current time + 15 hours just for example
end = datetime.now() + timedelta(hours = 15)

times = []
while end>start:
start = start+timedelta(hours = 3)
print(start)
times.append(start)
df = pd.Dataframe(columns = ['Times'])
df['Times'] = times

输出
Times
0   2022-07-15 01:28:56.912013
1   2022-07-15 04:28:56.912013
2   2022-07-15 07:28:56.912013
3   2022-07-15 10:28:56.912013
4   2022-07-15 13:28:56.912013

最新更新