Pandas groupby sum does not include the timedelta column when a numeric column is also present



Prepare the DataFrame:

import pandas as pd
data = {'Desc': ['abc', 'abc', 'abcd'], 'Duration': ['01:30:00', '02:00:00', '00:30:00']}
df = pd.DataFrame(data)
df['Duration'] = pd.to_timedelta(df['Duration'])
print(df)
#    Desc Duration
# 0   abc 01:30:00
# 1   abc 02:00:00
# 2  abcd 00:30:00

Grouping and summing works fine with a single column:

print(df.groupby('Desc').sum())
#      Duration
# Desc         
# abc  03:30:00
# abcd 00:30:00

Add a third column:

df['Duration (%)'] = df['Duration'] * 100 / df['Duration'].sum()
print(df)
#    Desc Duration  Duration (%)
# 0   abc 01:30:00          37.5
# 1   abc 02:00:00          50.0
# 2  abcd 00:30:00          12.5

Try the same groupby:

print(df.groupby('Desc').sum())
# Only 'Duration (%)' column was included:
#       Duration (%)
# Desc              
# abc           87.5
# abcd          12.5

What I expected (a DataFrame with 2 columns):

#   Desc Duration  Duration (%)
#    abc 03:30:00          87.5
#   abcd 00:30:00          12.5

Additional test: replace the timedelta values with plain numbers:

df['Duration'] = [1.5, 2, 0.5]
print(df.groupby('Desc').sum())
#       Duration  Duration (%)
# Desc                        
# abc        3.5          87.5
# abcd       0.5          12.5

(Now both columns are shown, as expected.)

Is there a reason why a column of dtype timedelta64[ns] is only included in the groupby sum when there are no other numeric columns?

You can do this:

sum_df = df.groupby(['Desc']).agg({'Duration': 'sum', 'Duration (%)': 'sum'})

and get:

     Duration  Duration (%)
Desc                       
abc  03:30:00          87.5
abcd 00:30:00          12.5
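
For reference, here is a self-contained sketch of the approach above, rebuilt from the question's data. It also shows a second option, passing numeric_only=False to sum(). That option is an assumption on my part rather than something from the original answer: it relies on your pandas version accepting that keyword on GroupBy.sum, and the default for it has differed between pandas releases, which may be why the timedelta column is dropped on some versions and not on others.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    'Desc': ['abc', 'abc', 'abcd'],
    'Duration': pd.to_timedelta(['01:30:00', '02:00:00', '00:30:00']),
})
df['Duration (%)'] = df['Duration'] * 100 / df['Duration'].sum()

# Option 1: name each column and its aggregation explicitly.
by_agg = df.groupby('Desc').agg({'Duration': 'sum', 'Duration (%)': 'sum'})

# Option 2 (assumption: GroupBy.sum accepts numeric_only in your version):
# tell sum() not to restrict itself to numeric columns.
by_sum = df.groupby('Desc').sum(numeric_only=False)

print(by_agg)
print(by_sum)
```

Option 1 is the more explicit of the two, since the result does not depend on which columns pandas considers numeric.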
