Calculating a conditional (CASE WHEN) count within a group by using Pandas



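I'm a beginner with Python, and I'm trying to calculate an open rate (the ratio between two different counts) in a single line of code. My DataFrame looks like this: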

import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])

I want to calculate the open rate while grouping by the column Tool_used with Pandas. In SQL it would look like this:

select
    Tool_used,
    count(distinct case when event = 'open' then sent_mail_id end)
        / count(distinct case when event = 'delivered' then sent_mail_id end)
from df
group by 1

Note that I need a distinct count of sent_mail_id, because the counts must be unique. Thanks.

See if this is what you need: it gives an open rate ratio column with one row per group.

df1 = ((df.loc[df['event'] == 'open'].groupby('Tool_used')['event'].count() 
/ 
df.loc[df['event'] == 'delivered'].groupby('Tool_used')['event'].count())
.to_frame(name='open rate ratio')
).reset_index()

Result:

print(df1)

  Tool_used  open rate ratio
0    Mobile              3.0
1       Web              0.5
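Note that count() above counts rows, while the question asks for distinct sent_mail_id values. A minimal sketch of a distinct version, assuming the same sample DataFrame as in the question; the names distinct_counts and distinct_ratio are only illustrative:

```python
import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])

# Number of distinct sent_mail_id per (Tool_used, event) pair,
# reshaped so that each event becomes its own column.
distinct_counts = (
    df.groupby(['Tool_used', 'event'])['sent_mail_id']
      .nunique()
      .unstack('event')
)

# Distinct opens divided by distinct deliveries, one row per Tool_used.
distinct_ratio = (
    distinct_counts['open']
    .div(distinct_counts['delivered'])
    .reset_index(name='open rate ratio')
)
print(distinct_ratio)
```

On the sample data this gives 2.0 for Mobile instead of 3.0, because mail 142 was opened twice on Mobile and is counted once. Web stays at 0.5.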

Use crosstab, so you only need to divide the open column by the delivered column and then call Series.reset_index:

df1 = pd.crosstab(df['Tool_used'], df['event'])
print (df1)
event      delivered  open
Tool_used                 
Mobile             1     3
Web                2     1
df2 = df1['open'].div(df1['delivered']).reset_index(name='open rate ratio')
print (df2)
  Tool_used  open rate ratio
0    Mobile              3.0
1       Web              0.5
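If the counts have to be distinct per sent_mail_id, as the question requests, crosstab can probably be adapted by passing values and aggfunc. This is only a sketch on the question's sample data; the names ct and ratio are illustrative:

```python
import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])

# Cross-tabulate Tool_used against event, aggregating sent_mail_id
# with nunique so repeated mails are counted once per cell.
ct = pd.crosstab(
    df['Tool_used'],
    df['event'],
    values=df['sent_mail_id'],
    aggfunc='nunique',
)

ratio = ct['open'].div(ct['delivered']).reset_index(name='open rate ratio')
print(ratio)
```

A Tool_used / event combination that never occurs would show up as NaN in ct, so the ratio for that group would also be NaN.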

If you need groupby, compare the values and aggregate with sum, although in my opinion this is more complicated:

a = (df['event'] == 'open').groupby(df['Tool_used']).sum()
b = (df['event'] == 'delivered').groupby(df['Tool_used']).sum()
df2 = a.div(b).reset_index(name='open rate ratio')
print (df2)
  Tool_used  open rate ratio
0    Mobile              3.0
1       Web              0.5
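Summing a boolean mask counts every matching row. One possible way to get the distinct count from the question is to drop duplicate (Tool_used, event, sent_mail_id) rows first and then apply the same comparison. A sketch under that assumption; unique_events, opened and delivered are illustrative names:

```python
import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])

# Keep one row per (Tool_used, event, sent_mail_id) so that each mail
# contributes at most once to each count.
unique_events = df.drop_duplicates(subset=['Tool_used', 'event', 'sent_mail_id'])

# True sums as 1, so the grouped sum of each mask is a row count.
opened = (unique_events['event'] == 'open').groupby(unique_events['Tool_used']).sum()
delivered = (unique_events['event'] == 'delivered').groupby(unique_events['Tool_used']).sum()

ratio = opened.div(delivered).reset_index(name='open rate ratio')
print(ratio)
```

After deduplication, the crosstab and custom function variants should give the distinct ratio as well, since they all count rows.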

Solution with a custom function (slower on large data):

def f(x):
    # x is the 'event' column of one Tool_used group
    return (x == 'open').sum() / (x == 'delivered').sum()

df2 = df.groupby('Tool_used')['event'].agg(f).reset_index(name='open rate ratio')
print (df2)
  Tool_used  open rate ratio
0    Mobile              3.0
1       Web              0.5
