在pandas中,用group by占总数的百分比对每行进行注释



我有一个数据框架,看起来像这样:

Company       Speciality      Payment
AcmeCorp      Roofing         50.00
AcmeCorp      Grounding       50.00
LolCorp       Roofing         106.00
LolCorp       Grounding       94.00

我想添加一个百分比列,像这样:

Company       Speciality      Payment     Percent of Total Payment
AcmeCorp      Roofing         50.00       50
AcmeCorp      Grounding       50.00       50
LolCorp       Roofing         106.00      53
LolCorp       Grounding       94.00       47

最好的方法是什么?

我可以这样做:

df_m = df.groupby('Company').sum()
final_df = pd.merge(df, df_m, on='Company', suffixes=('Raw', 'Total))
final_df['Percent of Total Payment] = final_df['Payment Raw'] / final_df['Payment_Total']

但是我想知道是否有更有效的方法。

使用groupby/transform生成与原始DataFrame相同长度的列。这样可以避免调用pd.merge

import numpy as np
import pandas as pd
df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
 'Payment': [50.0, 50.0, 106, 94.00],
 'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
print(df)
收益率

    Company  Payment Speciality  percent
0  AcmeCorp     50.0    Roofing     0.50
1  AcmeCorp     50.0  Grounding     0.50
2   LolCorp    106.0    Roofing     0.53
3   LolCorp     94.0  Grounding     0.47

虽然
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total

可以简化为一行,

df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())

因为像.transform('sum')这样的内置操作比那些自定义函数(例如.transform(lambda x: x/x.sum()))要快,所以两行版本更快(特别是对于大的dataframe)

当然,两行的版本也可以写成

df['percent'] = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')

在没有速度损失的情况下,少了一个命名变量,但可能更难读。


下面是100k行DataFrame的基准测试:

In [53]: %timeit using_transform(df)
100 loops, best of 3: 8.5 ms per loop
In [54]: %timeit using_one_liner(df)
10 loops, best of 3: 20.2 ms per loop
In [55]: %timeit orig(df)
10 loops, best of 3: 30.2 ms per loop

这是执行基准测试的设置。

import numpy as np
import pandas as pd
N = 10**5
df = pd.DataFrame({'Company': np.random.choice(list('ABCD'), size=N),
    'Payment': np.random.randint(10, size=N),
    'Speciality': np.random.choice(list('XYZ'), size=N)})
def using_transform(df):
    total = df.groupby('Company')['Payment'].transform('sum')
    df['percent'] = df['Payment']/total
    return df
def using_one_liner(df):
    df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
    return df
def orig(df):
    df_m = df.groupby('Company').sum()
    final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
    final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
    return final_df

最新更新