I have a dataset of forecast data that is updated at irregular intervals over a 42-hour window. Here is a sample:
import pandas as pd
df_old = pd.DataFrame({'IssueDatetime': ['2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00','2010-01-01 09:00:00','2010-01-01 09:00:00'],
'endtime':['2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00'],
'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH','NORTHEAST COAST','NORTHEAST COAST'],
'forecastTime': ['2010-01-01 09:00:00','2010-01-01 15:00:00','2010-01-01 19:00:00','2010-01-01 09:00:00','2010-01-01 12:00:00'],
'forecast_Dir':[150,180,45,45,45],
'windSpeed':[20,90,35,45,15]})
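The problem is the time gaps between df['forecastTime'] and df['endtime']. I tried to group and resample the data with my limited pandas knowledge, but because the dates are duplicated I could not get a usable DatetimeIndex.
Ultimately, my goal is to expand the DataFrame so that every hour between the original forecast hours gets its own row, up to the end time.
Example of the desired output: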
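The duplicate-date problem can be reproduced on a small frame. The sketch below uses made-up, shortened region names (an assumption for illustration, not the real data) and shows that 'forecastTime' is not unique across the whole frame but is unique within each region in this sample:

```python
import pandas as pd

# Hypothetical miniature of df_old: two regions sharing the 09:00 forecastTime.
df = pd.DataFrame({
    'Regions': ['EAST', 'EAST', 'NORTH', 'NORTH'],
    'forecastTime': pd.to_datetime([
        '2010-01-01 09:00:00', '2010-01-01 15:00:00',
        '2010-01-01 09:00:00', '2010-01-01 12:00:00',
    ]),
    'windSpeed': [20, 90, 45, 15],
})

# Across the whole frame 09:00 appears twice, so the index is not unique.
whole_index_unique = df.set_index('forecastTime').index.is_unique

# Within each region the timestamps are unique, so resampling per group can work.
per_region_unique = df.groupby('Regions')['forecastTime'].apply(lambda s: s.is_unique)

print(whole_index_unique)
print(per_region_unique)
```

If the per-region check also reports duplicates on the full dataset, the grouping key probably needs another column, such as 'IssueDatetime'.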
df_new = pd.DataFrame({'IssueDatetime': [ '2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00','2010-01-01 09:00:00'],
'endtime':['2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00'],
'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH'],
'forecastTime': ['2010-01-01 09:00:00','2010-01-01 10:00:00','2010-01-01 11:00:00','2010-01-01 12:00:00','2010-01-01 13:00:00','2010-01-01 14:00:00','2010-01-01 15:00:00'],
'forecast_Dir':[150,150,150,150,150,150,180],
'windSpeed':[20,20,20,20,20,20,90]})
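Note that for the first region, each hour between df['forecastTime'] = '2010-01-01 09:00:00' and df['forecastTime'] = '2010-01-01 15:00:00' should get its own row. Essentially, I want to upsample the data to fill in the missing hours.
Edit: the original DataFrame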
IssueDatetime endtime
0 2013-01-01 09:00:00 2013-01-03 03:00:00
1 2013-01-01 09:00:00 2013-01-03 03:00:00
2 2013-01-01 09:00:00 2013-01-03 03:00:00
3 2013-01-01 09:00:00 2013-01-03 03:00:00
4 2013-01-01 09:00:00 2013-01-03 03:00:00
... ... ...
53585 2016-12-30 09:00:00 2017-01-01 03:00:00
53586 2016-12-30 09:00:00 2017-01-01 03:00:00
53587 2016-12-30 09:00:00 2017-01-01 03:00:00
53588 2016-12-30 09:00:00 2017-01-01 03:00:00
53589 2016-12-30 09:00:00 2017-01-01 03:00:00
Regions forecastTime
0 SOUTH COAST 2013-01-01 09:00:00
1 SOUTH COAST 2013-01-01 18:00:00
2 SOUTH COAST 2013-01-02 06:00:00
3 SOUTH COAST 2013-01-02 13:00:00
4 EAST COAST-CAPE ST FRANCIS AND SOUTH 2013-01-01 09:00:00
... ... ...
53585 SOUTHWESTERN GRAND BANKS 2016-12-30 18:00:00
53586 SOUTHWESTERN GRAND BANKS 2016-12-31 09:00:00
53587 SOUTHWESTERN GRAND BANKS 2016-12-31 15:00:00
53588 SOUTHWESTERN GRAND BANKS 2016-12-31 18:00:00
53589 SOUTHWESTERN GRAND BANKS 2017-01-01 00:00:00
forecastHour forecast_Dir forecast_WindSpeed_low
0 0.0 270 35
1 9.0 270 25
2 21.0 225 15
3 28.0 270 35
4 0.0 270 35
... ... ... ...
53585 9.0 135 40
53586 24.0 135 40
53587 30.0 135 40
53588 33.0 315 25
53589 39.0 315 25
forecast_WindSpeed_gust forecast_WindSpeed_high
0 None None
1 None None
2 None None
3 None None
4 None None
... ... ...
53585 None 50
53586 None 50
53587 None 50
53588 None 35
53589 None None
forecast_WindSpeed_exception_1_type forecast_Dir_exception_1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
53585 NaN NaN
53586 OVER NORTHWESTERN SECTIONS 315
53587 NaN NaN
53588 NaN NaN
53589 NaN NaN
forecast_WindSpeed_low_exception_1 forecast_WindSpeed_high_exception_1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
53585 NaN NaN
53586 25 None
53587 NaN NaN
53588 NaN NaN
53589 NaN NaN
IIUC, you first need to convert the 'forecastTime' column to datetime, then set 'forecastTime' as the index so it can be resampled, then group by 'Regions', resample hourly, and fill the NaN values with ffill:
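# Convert to datetime so 'forecastTime' can serve as a DatetimeIndex
df_old['forecastTime'] = pd.to_datetime(df_old['forecastTime'])
# The outer parentheses let the method chain span several lines
df_new = (df_old.set_index('forecastTime')
          .groupby('Regions', as_index=False)
          .resample('H').ffill().droplevel(0).reset_index())
print(df_new.head())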
To avoid possible errors caused by duplicate dates (a non-unique index), you can try the following:
# Set the index inside each group, so it only has to be unique per region
df_new = (df_old.groupby('Regions', as_index=False)
          .apply(lambda x: x.set_index('forecastTime')
                            .resample('H').ffill())
          .droplevel(0).reset_index())
Output:
forecastTime IssueDatetime endtime Regions forecast_Dir windSpeed
0 2010-01-01 09:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
1 2010-01-01 10:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
2 2010-01-01 11:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
3 2010-01-01 12:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
4 2010-01-01 13:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
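I have partially answered my own question. I was still getting a MultiIndex error, so I added one millisecond per repeated row to each forecast time to make them unique: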
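# Convert first, so the offset is added to real timestamps rather than strings
df_old['forecastTime'] = pd.to_datetime(df_old['forecastTime'])
# Add 0, 1, 2, ... ms to repeated timestamps so every index value is unique
df_old['forecastTime'] = df_old['forecastTime'] + pd.to_timedelta(df_old.groupby('forecastTime').cumcount(), unit='ms')
df_new = (df_old.set_index('forecastTime')
          .groupby('Regions', as_index=False)
          .resample('H').ffill().droplevel(0).reset_index())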
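A possible alternative to shifting timestamps by milliseconds, assuming the duplicates come from the same region appearing in several forecast issues, is to group by both 'IssueDatetime' and 'Regions'. The data below is invented for illustration, and the explicit loop is just one way to write it; a pd.Timedelta is passed as the resampling rule so the sketch does not depend on the 'H' or 'h' alias spelling:

```python
import pandas as pd

HOUR = pd.Timedelta(hours=1)  # hourly rule, independent of the alias spelling

# Hypothetical data: one region forecast in two issues, so 'forecastTime'
# repeats within the region (15:00) but not within a single issue.
df = pd.DataFrame({
    'IssueDatetime': pd.to_datetime(['2010-01-01 09:00:00'] * 2
                                    + ['2010-01-01 15:00:00'] * 2),
    'Regions': ['NORTHEAST COAST'] * 4,
    'forecastTime': pd.to_datetime([
        '2010-01-01 09:00:00', '2010-01-01 15:00:00',
        '2010-01-01 15:00:00', '2010-01-01 18:00:00',
    ]),
    'windSpeed': [20, 90, 35, 45],
})

pieces = []
for _, group in df.groupby(['IssueDatetime', 'Regions']):
    # Inside one (issue, region) group the timestamps are unique,
    # so they can be used as the index and upsampled to hourly rows.
    hourly = group.set_index('forecastTime').resample(HOUR).ffill()
    pieces.append(hourly.reset_index())

df_hourly = pd.concat(pieces, ignore_index=True)
print(df_hourly)
```

This keeps the original timestamps intact, at the cost of a Python-level loop over the groups.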
Now I need to add an end time. Any suggestions for making the .ffill() stop 42 hours after each forecast's issue time?
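One way this might be approached, sketched here under assumptions rather than offered as a tested answer for the full dataset, is to reindex each group onto an hourly range that runs from its first 'forecastTime' to its 'endtime'. In the sample data 'endtime' is exactly 42 hours after 'IssueDatetime'. The helper name extend_to_end and the two-row frame are hypothetical:

```python
import pandas as pd

HOUR = pd.Timedelta(hours=1)

# Hypothetical two-row group modelled on the NORTHEAST COAST sample rows.
df = pd.DataFrame({
    'IssueDatetime': pd.to_datetime(['2010-01-01 09:00:00'] * 2),
    'endtime': pd.to_datetime(['2010-01-03 03:00:00'] * 2),
    'Regions': ['NORTHEAST COAST'] * 2,
    'forecastTime': pd.to_datetime(['2010-01-01 09:00:00', '2010-01-01 12:00:00']),
    'windSpeed': [45, 15],
})

def extend_to_end(group):
    """Return hourly rows from the first forecastTime through endtime."""
    # reindex(method='ffill') needs a sorted, unique index.
    indexed = group.set_index('forecastTime').sort_index()
    hours = pd.date_range(indexed.index.min(), group['endtime'].iloc[0], freq=HOUR)
    filled = indexed.reindex(hours, method='ffill')
    return filled.rename_axis('forecastTime').reset_index()

df_full = pd.concat(
    [extend_to_end(g) for _, g in df.groupby(['IssueDatetime', 'Regions'])],
    ignore_index=True,
)
print(df_full.tail())
```

If an 'endtime' column were not available, the upper bound could instead be computed as the group's 'IssueDatetime' plus pd.Timedelta(hours=42).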