根据多列上聚合函数的条件结果计算唯一记录



我的数据如下:

df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5],
'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B',
'B', 'B', 'B', 'B', 'B', 'B'],
'attempts': [0, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 0, 0, 1, 0],
'successes': [1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 1, 1, 0, 1],
'score': [None, 5, 5, 4, 5, 4, 5, None, 1, 5,
0, 1, None, None, 1, None]})
## df output
ID group attempts successes score
0   1     A        0         1  None
1   1     A        1         0     5
2   1     A        1         0     5
3   1     A        1         0     4
4   2     A        1         0     5
5   2     A        1         0     4
6   3     A        1         0     5
7   3     A        0         1  None
8   3     A        1         0     1
9   4     B        1         0     5
10  4     B        1         0     0
11  4     B        1         0     1
12  4     B        0         1  None
13  5     B        0         1  None
14  5     B        1         0     1
15  5     B        0         1  None

我试图按两列(groupscore(进行分组,并在首先确定(groupID(的哪些组在所有score值中至少有1个successes计数之后,计算唯一ID的数量。换句话说,我只想在聚合中对ID计数一次(唯一(,如果它至少有一个相关的成功。我也只想对每个(groupID(对只计算唯一ID,而不管它包含的attempt_counts的数量(即,如果有5个成功计数的总和,我只想包括1个(。

successesattempts列是二进制的(只有1或0(。例如,对于ID=1,group=A,至少有1个成功。因此,当计数每个(groupscore(的唯一ID的数量时,我将包括该ID

我希望最终输出看起来像这样,这样我就可以计算每个(groupscore(组合的唯一成功与唯一尝试的比率。

group score successes_count attempts_counts ratio
A     5              2                3  0.67
4              1                2  0.50
1              1                1   1.0              
0              0                0   inf
B     5              1                1   1.0
4              0                0   inf
1              2                2   1.0
0              1                1   1.0

到目前为止,我已经能够运行一个透视表来计算每个(groupID(的总和,以识别那些至少成功1次的ID。然而,我不确定用这个来达到我想要的最终状态的最佳方式。

p = pd.pivot_table(data=df_new,
values=['ID'],
index=['group', 'ID'],
columns=['successes', 'attempts'],
aggfunc={'ID': 'count'})
# p output
ID     
successes    0    1
attempts     1    0
group ID           
A     1    3.0  1.0
2    2.0  NaN
3    2.0  1.0
B     4    3.0  1.0
5    1.0  2.0

让我们试试这样的东西:

import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5],
'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B',
'B', 'B', 'B', 'B', 'B', 'B'],
'attempts': [0, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 0, 0, 1, 0],
'successes': [1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 1, 1, 0, 1],
'score': [None, 5, 5, 4, 5, 4, 5, None, 1, 5,
0, 1, None, None, 1, None]})
# Groups With At least 1 Success
m = df.groupby('group')['successes'].transform('max').astype(bool)
# Filter Out
df = df[m]
# Replace 0 successes with NaNs
df['successes'] = df['successes'].replace(0, np.nan)
# FFill BFill each group so that any success will fill the group
df['successes'] = df.groupby(['ID', 'group'])['successes'] 
.apply(lambda s: s.ffill().bfill())
# Pivot then stack to make sure each group has all score values
# Sort and reset index
# Rename Columns
# fix types
p = df.drop_duplicates() 
.pivot_table(index='group',
columns='score',
values=['attempts', 'successes'],
aggfunc='sum',
fill_value=0) 
.stack() 
.sort_values(['group', 'score'], ascending=[True, False]) 
.reset_index() 
.rename(columns={'attempts': 'attempts_counts',
'successes': 'successes_count'}) 
.convert_dtypes()
# Calculate Ratio
p['ratio'] = p['successes_count'] / p['attempts_counts']
print(p)

输出:

group  score  attempts_counts  successes_count     ratio
0     A      5                3                2  0.666667
1     A      4                2                1       0.5
2     A      1                1                1       1.0
3     A      0                0                0       NaN
4     B      5                1                1       1.0
5     B      4                0                0       NaN
6     B      1                2                2       1.0
7     B      0                1                1       1.0

相关内容

  • 没有找到相关文章

最新更新