I have a DataFrame like this:
gpi_data[['sig','hourtime']]
Out[28]:
sig hourtime
datetime_doy
2007-01-02 -8.963545 2007-01-02 09:20:11.249998
2007-01-03 -8.671357 2007-01-03 10:39:31.874991
2007-01-03 -8.996480 2007-01-03 20:22:59.999006
2007-01-04 -8.835958 2007-01-04 10:18:56.249024
2007-01-05 -8.785034 2007-01-05 21:21:39.374002
... ...
2019-12-30 -8.529724 2019-12-30 20:23:01.874996
2019-12-30 -8.563781 2019-12-30 20:48:28.125016
2019-12-30 -8.504211 2019-12-30 21:23:44.999996
2019-12-31 -8.460620 2019-12-31 09:39:31.873999
2019-12-31 -8.230092 2019-12-31 10:18:58.125014
[7983 rows x 2 columns]
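For each day, I want to compute the mean of sig in the morning and in the evening, based on hourtime. Morning means values recorded around 10:00:00, and evening means values recorded around 22:00:00. If a day has no morning or evening value, that slot should be filled with np.nan.

For example, on 2007-01-01 there are no sig values at all, so both slots are filled with np.nan. On 2007-01-02 there is only a morning value, so the evening value for 2007-01-02 is np.nan. On 2019-12-30 there are three evening values, at 2019-12-30 20:23:01.874996, 2019-12-30 20:48:28.125016 and 2019-12-30 21:23:44.999996, so we need the mean of -8.529724, -8.563781 and -8.504211. The same applies to the last two data points, on the morning of 2019-12-31: we take their mean and fill np.nan for the evening of 2019-12-31.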
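As a quick sanity check of the two averages described above, the values can be computed by hand. This is only an illustrative sketch using the numbers shown in the question; the variable names are made up for the example.

```python
# Values copied from the question's sample output.
evening_2019_12_30 = [-8.529724, -8.563781, -8.504211]
morning_2019_12_31 = [-8.460620, -8.230092]

# Plain arithmetic mean of each group.
evening_mean = sum(evening_2019_12_30) / len(evening_2019_12_30)
morning_mean = sum(morning_2019_12_31) / len(morning_2019_12_31)

print(round(evening_mean, 6))  # -8.532572
print(round(morning_mean, 6))  # -8.345356
```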
So the ideal final result is:
gpi_data[['sig','hourtime']]
Out[28]:
sig hourtime
datetime_doy
2007-01-01 nan 2007-01-01 10:00:00
2007-01-01 nan 2007-01-01 22:00:00
2007-01-02 -8.963545 2007-01-02 09:20:11.249998
2007-01-02 nan 2007-01-02 22:00:00
2007-01-03 -8.671357 2007-01-03 10:39:31.874991
2007-01-03 -8.996480 2007-01-03 20:22:59.999006
2007-01-04 -8.835958 2007-01-04 10:18:56.249024
2007-01-04 nan 2007-01-04 22:00:00
2007-01-05 nan 2007-01-05 10:00:00
2007-01-05 -8.785034 2007-01-05 21:21:39.374002
... ...
2019-12-30 -8.532572 2019-12-30 22:00:00
2019-12-31 -8.345356 2019-12-31 09:39:31.873999
2019-12-31 nan 2019-12-31 22:00:00
Or, with every hourtime rounded to 10:00:00 or 22:00:00, like this:
gpi_data[['sig','hourtime']]
Out[28]:
sig hourtime
datetime_doy
2007-01-01 nan 2007-01-01 10:00:00
2007-01-01 nan 2007-01-01 22:00:00
2007-01-02 -8.963545 2007-01-02 10:00:00
2007-01-02 nan 2007-01-02 22:00:00
2007-01-03 -8.671357 2007-01-03 10:00:00
2007-01-03 -8.996480 2007-01-03 22:00:00
2007-01-04 -8.835958 2007-01-04 10:00:00
2007-01-04 nan 2007-01-04 22:00:00
2007-01-05 nan 2007-01-05 10:00:00
2007-01-05 -8.785034 2007-01-05 22:00:00
... ...
2019-12-30 -8.532572 2019-12-30 22:00:00
2019-12-31 -8.345356 2019-12-31 10:00:00
2019-12-31 nan 2019-12-31 22:00:00
How can I do this? Can anyone help? Thanks!
Use cut with some thresholds to map each hourtime to 10 or 22; here the thresholds are hours 12 and 23.
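To see how cut assigns hours to the two groups, here is a small sketch on made-up hour values (the hours Series is hypothetical, not taken from the question). It also shows one edge case worth knowing: with bins=[0, 12, 23] the intervals are right-closed, so hour 0 falls outside the first bin unless include_lowest=True is passed.

```python
import pandas as pd

# Hypothetical sample hours, including the edge case hour 0.
hours = pd.Series([0, 9, 10, 12, 13, 20, 21, 23])

# Default behaviour: intervals are (0, 12] and (12, 23], so hour 0 gets NaN.
default = pd.cut(hours, bins=[0, 12, 23], labels=[10, 22])

# include_lowest=True closes the first interval on the left: [0, 12].
closed = pd.cut(hours, bins=[0, 12, 23], labels=[10, 22], include_lowest=True)

print(pd.DataFrame({"hour": hours, "default": default, "include_lowest": closed}))
```

If the data never contains readings between 00:00 and 00:59, both variants give the same result.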
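Create a MultiIndex with MultiIndex.from_product covering every day from the start of the minimum year to the end of the maximum year, aggregate with mean, add the missing combinations with Series.reindex, and finally create the hourtime column: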
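import pandas as pd

# Hours 0-12 map to the 10:00 slot, hours 13-23 to the 22:00 slot;
# include_lowest=True keeps hour 0 inside the first bin.
df['hourtime'] = pd.cut(df['hourtime'].dt.hour, bins=[0,12,23], labels=[10,22],
                        include_lowest=True)
start = pd.Timestamp(year=df.index.year.min(), month=1, day=1)
end = pd.Timestamp(year=df.index.year.max(), month=12, day=31)
# Every calendar day in the covered years, crossed with both slots.
mux = pd.MultiIndex.from_product([pd.date_range(start, end), [10,22]],
                                 names=['datetime_doy','h'])
# Mean per (day, slot); reindex adds the missing combinations as NaN.
df = df.groupby([df.index, 'hourtime'])['sig'].mean().reindex(mux).reset_index(level=1)
# Lowercase 'h' is used because the 'H' unit alias is deprecated in recent pandas.
df['hourtime'] = df.index + pd.to_timedelta(df.pop('h'), unit='h')
print(df)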
sig hourtime
datetime_doy
2007-01-01 NaN 2007-01-01 10:00:00
2007-01-01 NaN 2007-01-01 22:00:00
2007-01-02 -8.963545 2007-01-02 10:00:00
2007-01-02 NaN 2007-01-02 22:00:00
2007-01-03 -8.671357 2007-01-03 10:00:00
... ...
2019-12-29 NaN 2019-12-29 22:00:00
2019-12-30 NaN 2019-12-30 10:00:00
2019-12-30 -8.532572 2019-12-30 22:00:00
2019-12-31 -8.345356 2019-12-31 10:00:00
2019-12-31 NaN 2019-12-31 22:00:00
[9496 rows x 2 columns]
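For anyone who wants to try the approach without the full dataset, here is a self-contained sketch that runs the same steps on the last five rows shown in the question. The helper name half_day_means and its start/end parameters are made up for illustration, and the date range is limited to three days rather than whole years to keep the output short.

```python
import pandas as pd


def half_day_means(df, start, end):
    """Hypothetical helper: mean of sig per day and slot (10 or 22)."""
    # Hours 0-12 go to slot 10, hours 13-23 to slot 22.
    slot = pd.cut(df["hourtime"].dt.hour, bins=[0, 12, 23],
                  labels=[10, 22], include_lowest=True).astype(int)
    tmp = df.assign(h=slot)
    # Every day in the requested range, crossed with both slots.
    mux = pd.MultiIndex.from_product([pd.date_range(start, end), [10, 22]],
                                     names=["datetime_doy", "h"])
    # Mean per (day, slot); reindex inserts NaN for missing combinations.
    out = tmp.groupby([tmp.index, "h"])["sig"].mean().reindex(mux).reset_index(level=1)
    # Rebuild hourtime as the day plus 10 or 22 hours.
    out["hourtime"] = out.index + pd.to_timedelta(out.pop("h"), unit="h")
    return out


# The last five rows from the question's sample output.
sample = pd.DataFrame({
    "sig": [-8.529724, -8.563781, -8.504211, -8.460620, -8.230092],
    "hourtime": pd.to_datetime([
        "2019-12-30 20:23:01.874996",
        "2019-12-30 20:48:28.125016",
        "2019-12-30 21:23:44.999996",
        "2019-12-31 09:39:31.873999",
        "2019-12-31 10:18:58.125014",
    ]),
})
sample.index = pd.DatetimeIndex(sample["hourtime"].dt.normalize(), name="datetime_doy")

result = half_day_means(sample, "2019-12-29", "2019-12-31")
print(result)
```

The result has six rows: both slots of 2019-12-29 and the morning of 2019-12-30 are NaN, the evening of 2019-12-30 is the mean of the three readings, the morning of 2019-12-31 is the mean of the two readings, and the evening of 2019-12-31 is NaN.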