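Given a DataFrame indexed by timestamps, I am able to group rows by period and keep track of the group label in the original DataFrame, thanks to groupby, pd.Grouper and a for loop.
For example, consider the following DataFrame, with a period of 2 hours: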
import pandas as pd
df1 = pd.DataFrame({'humidity': [0.3, 0.8, 0.9],
                    'pressure': [1e5, 1.1e5, 0.95e5],
                    'location': ['Paris', 'Paris', 'Milan']},
                   index=[pd.Timestamp('2020/01/02 01:59:00'),
                          pd.Timestamp('2020/01/02 03:59:00'),
                          pd.Timestamp('2020/01/02 02:59:00')])
grps = df1.groupby(pd.Grouper(freq='2H', origin='start_day'))
for gr in grps:
    df1.loc[gr[1].index, 'grp'] = gr[0]
The result is:
df1
Out[23]:
                     humidity  pressure location                 grp
2020-01-02 01:59:00       0.3  100000.0    Paris 2020-01-02 00:00:00
2020-01-02 03:59:00       0.8  110000.0    Paris 2020-01-02 02:00:00
2020-01-02 02:59:00       0.9   95000.0    Milan 2020-01-02 02:00:00
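Since I intend to work with large datasets, I am wondering whether there is a way to get rid of this for loop. Is there a function or parameter in groupby that returns the original DataFrame with just a new column containing the group labels?
Thanks for your help. Best,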
Use GroupBy.transform on any column. Inside the function, x.name is the label of the current group:
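grps = df1.groupby(pd.Grouper(freq='2H', origin='start_day'))
# original loop, kept for comparison with the 'new' column
for gr in grps:
    print(gr)
    df1.loc[gr[1].index, 'grp'] = gr[0]
# loop-free version: broadcast each group's label to all of its rows
df1['new'] = grps['humidity'].transform(lambda x: x.name)
print(df1)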
                     humidity  pressure location                 grp
2020-01-02 01:59:00       0.3  100000.0    Paris 2020-01-02 00:00:00
2020-01-02 03:59:00       0.8  110000.0    Paris 2020-01-02 02:00:00
2020-01-02 02:59:00       0.9   95000.0    Milan 2020-01-02 02:00:00

                                    new
2020-01-02 01:59:00 2020-01-02 00:00:00
2020-01-02 03:59:00 2020-01-02 02:00:00
2020-01-02 02:59:00 2020-01-02 02:00:00
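
If only the label column is needed and no aggregation, a possible alternative that skips groupby altogether is to floor the index to the bin width. This is a minimal sketch and not part of the approach above. It assumes the bins should start at midnight, as with origin='start_day', and that the frequency divides a day evenly, which holds for 2 hours. The column name grp_floor is made up for the example, and the lowercase alias '2h' is used because recent pandas releases prefer it over '2H'.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'humidity': [0.3, 0.8, 0.9],
     'pressure': [1e5, 1.1e5, 0.95e5],
     'location': ['Paris', 'Paris', 'Milan']},
    index=pd.to_datetime(['2020-01-02 01:59:00',
                          '2020-01-02 03:59:00',
                          '2020-01-02 02:59:00']))

# Round every timestamp down to the start of its 2-hour bin.
# 'grp_floor' is a hypothetical column name chosen for this sketch.
df1['grp_floor'] = df1.index.floor('2h')
print(df1)
```

For this sample the labels come out the same as in the grp and new columns. For a frequency that does not divide 24 hours evenly, or a different origin, the floored values may no longer line up with the pd.Grouper bins, so the transform version is the safer choice there.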