Preparing the dataframe:
import pandas as pd
data = {'Desc': ['abc', 'abc', 'abcd'], 'Duration': ['01:30:00', '02:00:00', '00:30:00']}
df = pd.DataFrame(data)
df['Duration'] = pd.to_timedelta(df['Duration'])
print(df)
# Desc Duration
# 0 abc 01:30:00
# 1 abc 02:00:00
# 2 abcd 00:30:00
Grouping by one column and summing works fine:
print(df.groupby('Desc').sum())
# Duration
# Desc
# abc 03:30:00
# abcd 00:30:00
Adding a third column:
df['Duration (%)'] = df['Duration'] * 100 / df['Duration'].sum()
print(df)
# Desc Duration Duration (%)
# 0 abc 01:30:00 37.5
# 1 abc 02:00:00 50.0
# 2 abcd 00:30:00 12.5
Trying the same grouping again:
print(df.groupby('Desc').sum())
# Only 'Duration (%)' column was included:
# Duration (%)
# Desc
# abc 87.5
# abcd 12.5
What I expected (a dataframe with 2 columns):
# Desc Duration Duration (%)
# abc 03:30:00 87.5
# abcd 00:30:00 12.5
Additional test: replacing the timedelta values with plain numbers:
df['Duration'] = [1.5, 2, 0.5]
print(df.groupby('Desc').sum())
# Duration Duration (%)
# Desc
# abc 3.5 87.5
# abcd 0.5 12.5
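(Now both columns are shown, as expected.)
Is there a reason why a column with dtype timedelta64[ns] is only included in the groupby sum when there are no other numeric columns?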
You can name the aggregation for each column explicitly:
sum_df = df.groupby(['Desc']).agg({'Duration': 'sum', 'Duration (%)': 'sum'})
which gives:
Duration Duration (%)
Desc
abc 03:30:00 87.5
abcd 00:30:00 12.5
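As an alternative sketch, `sum()` on a groupby accepts a `numeric_only` parameter. Older pandas releases appear to have treated timedelta64[ns] columns as non-numeric and dropped them whenever a true numeric column was present, which would explain the behaviour in the question. Passing `numeric_only=False` asks pandas to keep every column it can sum. This assumes a pandas version whose groupby `sum` supports that parameter; in recent releases it is reportedly already the default, so the plain `sum()` may work there as well.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'Desc': ['abc', 'abc', 'abcd'],
    'Duration': pd.to_timedelta(['01:30:00', '02:00:00', '00:30:00']),
})
df['Duration (%)'] = df['Duration'] * 100 / df['Duration'].sum()

# numeric_only=False keeps the timedelta column alongside the float column.
result = df.groupby('Desc').sum(numeric_only=False)
print(result)
```

The `agg` form above stays the more explicit choice when different columns need different aggregations; `numeric_only=False` is shorter when every column should simply be summed.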