Pandas: calculating the morning mean or evening mean



I have a DataFrame like this:

gpi_data[['sig','hourtime']]
Out[28]: 
sig                   hourtime
datetime_doy                                     
2007-01-02   -8.963545 2007-01-02 09:20:11.249998
2007-01-03   -8.671357 2007-01-03 10:39:31.874991
2007-01-03   -8.996480 2007-01-03 20:22:59.999006
2007-01-04   -8.835958 2007-01-04 10:18:56.249024
2007-01-05   -8.785034 2007-01-05 21:21:39.374002
...                        ...
2019-12-30   -8.529724 2019-12-30 20:23:01.874996
2019-12-30   -8.563781 2019-12-30 20:48:28.125016
2019-12-30   -8.504211 2019-12-30 21:23:44.999996
2019-12-31   -8.460620 2019-12-31 09:39:31.873999
2019-12-31   -8.230092 2019-12-31 10:18:58.125014
[7983 rows x 2 columns]

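I want to calculate the morning and evening mean of sig for each day, based on hourtime. Morning means the data around 10:00:00, and evening means the data around 22:00:00. If a day has no morning or evening value, it should be filled with np.nan. For example, on 2007-01-01 we have no morning or evening values of sig, so we fill both with np.nan. On 2007-01-02 we only have a morning value, so we fill the evening value for 2007-01-02 with np.nan. For 2019-12-30 we have 3 evening values, at 2019-12-30 20:23:01.874996, 2019-12-30 20:48:28.125016 and 2019-12-30 21:23:44.999996, so we need the mean of -8.529724, -8.563781 and -8.504211. The same goes for the last two data points on the morning of 2019-12-31: we take their mean and fill np.nan for the evening of 2019-12-31.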

So the ideal final result is:

gpi_data[['sig','hourtime']]
Out[28]: 
sig                   hourtime
datetime_doy                                     
2007-01-01   nan       2007-01-01 10:00:00
2007-01-01   nan       2007-01-01 22:00:00
2007-01-02   -8.963545 2007-01-02 09:20:11.249998
2007-01-02   nan       2007-01-02 22:00:00
2007-01-03   -8.671357 2007-01-03 10:39:31.874991
2007-01-03   -8.996480 2007-01-03 20:22:59.999006
2007-01-04   -8.835958 2007-01-04 10:18:56.249024
2007-01-04   nan       2007-01-04 22:00:00
2007-01-05   nan       2007-01-05 10:00:00
2007-01-05   -8.785034 2007-01-05 21:21:39.374002
...                        ...
2019-12-30   -8.532572 2019-12-30 22:00:00
2019-12-31   -8.345356 2019-12-31 09:39:31.873999
2019-12-31   nan       2019-12-31 22:00:00

If we round all hourtime values to 10:00:00 or 22:00:00, it looks like this:

gpi_data[['sig','hourtime']]
Out[28]: 
sig                   hourtime
datetime_doy                                     
2007-01-01   nan       2007-01-01 10:00:00
2007-01-01   nan       2007-01-01 22:00:00
2007-01-02   -8.963545 2007-01-02 10:00:00
2007-01-02   nan       2007-01-02 22:00:00
2007-01-03   -8.671357 2007-01-03 10:00:00
2007-01-03   -8.996480 2007-01-03 22:00:00
2007-01-04   -8.835958 2007-01-04 10:00:00
2007-01-04   nan       2007-01-04 22:00:00
2007-01-05   nan       2007-01-05 10:00:00
2007-01-05   -8.785034 2007-01-05 22:00:00
...                        ...
2019-12-30   -8.532572 2019-12-30 22:00:00
2019-12-31   -8.345356 2019-12-31 10:00:00
2019-12-31   nan       2019-12-31 22:00:00

How can I do this? Can anyone help? Thanks!

Use cut to define the 10 and 22 labels by some threshold; here hours 12 and 23 are used as the bin edges.
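To make the binning concrete, here is a minimal sketch on a hypothetical series of hours (the `hours`, `labels`, `slot` and `without_lowest` names are illustrative, not from the question). It assumes the bin edges 0, 12 and 23; because `pd.cut` uses right-closed intervals by default, hour 0 only falls into the first bin when `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical hours of the day, including both edge cases (0 and 23)
hours = pd.Series([0, 9, 10, 12, 13, 20, 23])

# bins=[0, 12, 23] gives the intervals (0, 12] and (12, 23];
# include_lowest=True closes the first interval on the left so hour 0 is kept
labels = pd.cut(hours, bins=[0, 12, 23], labels=[10, 22], include_lowest=True)
slot = labels.astype(int)
print(slot.tolist())

# Without include_lowest, hour 0 falls outside every bin and becomes NaN
without_lowest = pd.cut(hours, bins=[0, 12, 23], labels=[10, 22])
print(without_lowest.isna().tolist())
```

Whether 12:00 is the right cut-off between "around 10:00" and "around 22:00" depends on the data; it is only the threshold chosen here.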

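Create a MultiIndex with MultiIndex.from_product, spanning the first day of the minimum year to the last day of the maximum year, aggregate with mean, add the missing combinations with Series.reindex, and finally create the hourtime column: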

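import pandas as pd

# Label each row 10 (hours 0-12) or 22 (hours 13-23); include_lowest keeps hour 0
df['hourtime'] = pd.cut(df['hourtime'].dt.hour, bins=[0,12,23], labels=[10,22], include_lowest=True)
# Every calendar day from the first to the last year, paired with both labels
start = pd.Timestamp(year=df.index.year.min(), month=1, day=1)
end = pd.Timestamp(year=df.index.year.max(), month=12, day=31)
mux = pd.MultiIndex.from_product([pd.date_range(start, end), [10,22]],
                                 names=['datetime_doy','h'])
# Mean per (day, label); reindex adds the missing combinations as NaN
df = df.groupby([df.index, 'hourtime'])['sig'].mean().reindex(mux).reset_index(level=1)
# Turn the label back into a timestamp at 10:00 or 22:00
df['hourtime'] = df.index + pd.to_timedelta(df.pop('h'), unit='h')
print(df)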
sig            hourtime
datetime_doy                              
2007-01-01         NaN 2007-01-01 10:00:00
2007-01-01         NaN 2007-01-01 22:00:00
2007-01-02   -8.963545 2007-01-02 10:00:00
2007-01-02         NaN 2007-01-02 22:00:00
2007-01-03   -8.671357 2007-01-03 10:00:00
...                 ...
2019-12-29         NaN 2019-12-29 22:00:00
2019-12-30         NaN 2019-12-30 10:00:00
2019-12-30   -8.532572 2019-12-30 22:00:00
2019-12-31   -8.345356 2019-12-31 10:00:00
2019-12-31         NaN 2019-12-31 22:00:00
[9496 rows x 2 columns]
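
For anyone who wants to try the idea without the full dataset, the following is a self-contained sketch on a hypothetical six-row sample shaped like the question's data (the `sig` values are made up so the means are easy to check). It follows the same groupby-mean-then-reindex pattern, with two assumed simplifications: `np.where` replaces `pd.cut` for the 10/22 label, and the date range covers only the sample's own span rather than whole years.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: timestamps in the style of the question, made-up sig values
times = pd.to_datetime([
    "2007-01-02 09:20:11", "2007-01-03 10:39:31", "2007-01-03 20:22:59",
    "2007-01-04 20:23:01", "2007-01-04 20:48:28", "2007-01-04 21:23:44",
])
sample = pd.DataFrame(
    {"sig": [-8.0, -9.0, -7.0, -8.0, -9.0, -10.0], "hourtime": times},
    index=times.normalize().rename("datetime_doy"),
)

# Slot 10 for hours before 13:00, slot 22 otherwise (plain integers)
sample["h"] = np.where(sample["hourtime"].dt.hour < 13, 10, 22)

# Mean of sig per (day, slot)
means = sample.groupby([sample.index, "h"])["sig"].mean()

# Every day in the sample's span paired with both slots; missing pairs become NaN
days = pd.date_range(sample.index.min(), sample.index.max(), name="datetime_doy")
full = pd.MultiIndex.from_product([days, [10, 22]], names=["datetime_doy", "h"])
result = means.reindex(full).reset_index(level="h")

# Rebuild a timestamp at 10:00 or 22:00 from the day and the slot
result["hourtime"] = result.index + pd.to_timedelta(result.pop("h"), unit="h")
print(result)
```

In this sample the three evening readings on 2007-01-04 collapse into a single mean, while the evening of 2007-01-02 and the morning of 2007-01-04 come out as NaN.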
