如何在pandas中groupby后在同一列上过滤两次



我在解决以下问题时遇到了麻烦:

我有一个熊猫DataFrame类似于:

col1 col2 col3
A    P    S
A    C    W
A    C    W
A    C    M
B    P    S
B    C    M
B    C    M
D    P    W
D    C    M
D    C    S

我需要回答以下问题:

在同一组(来自col1(中,有多少个C(在col2中(是W(来自col3(,并且有多少P(在col2中(是S(来自col3(?然后,我需要把每组的所有计数加起来。

更简单地说,有多少对C-W在同一个col1组中,其中有一对P-S?然后,将所有计数相加。

到目前为止,我试图解决这个问题,如下所示:

pd_h = pd_h
.groupby(["col1", "col2", "col3"])
.filter(
lambda x: ((x["col2"].iloc[0] == 'C') & (x["col3"].iloc[0] == "W")
if ((x["col2"].iloc[0] == 'P') & (x["col3"].iloc[0] == "S")) else False)
).describe(["count"])  # here describe, because I may decide to put also "mean", "std", etc.

但是剩余的对CCD_ 13可以属于其中不存在对P-S的组。

我该如何解决这个问题?

提前感谢!

使用:

print (df)
col1 col2 col3
0    A    P    S
1    A    C    W
2    A    C    W
3    A    C    M
4    B    P    S
5    B    C    M
6    B    C    M
7    D    P    W
8    D    C    W <- changed to W for see not match D because not P, S
9    D    C    S
#first condition, second condition
m1 = (df["col2"] == 'C') & (df["col3"] == "W")
m2 = (df["col2"] == 'P') & (df["col3"] == "S")
#check if at least one True in m2 per groups
m3 = df['col1'].isin(df.loc[m2, 'col1'])
#details
print (df.assign(m1 = m1, m2=m2, m3=m3, 
m1_m3 = (m1 & m3), num = (m1 & m3).view('i1')))
col1 col2 col3     m1     m2     m3  m1_m3  num
0    A    P    S  False   True   True  False    0
1    A    C    W   True  False   True   True    1
2    A    C    W   True  False   True   True    1
3    A    C    M  False  False   True  False    0
4    B    P    S  False   True   True  False    0
5    B    C    M  False  False   True  False    0
6    B    C    M  False  False   True  False    0
7    D    P    W  False  False  False  False    0
8    D    C    W   True  False  False  False    0
9    D    C    S  False  False  False  False    0

#filter only groups if m1 and m3 with aggregate sum for count
df = (m1 & m3).view('i1').groupby(df['col1']).sum().reset_index(name='count')
print (df)
col1  count
0    A      2
1    B      0
2    D      0
df['is_CW']=[(x=='C' and y=='W') for x, y in zip(df.col2, df.col3)]
df['is_PS']=[(x=='P' and y=='S') for x, y in zip(df.col2, df.col3)]
pv=df.pivot_table(values=['is_CW','is_PS'], index = 'col1', aggfunc=['sum'])
pv.columns=['is_CW','is_PS']
pv['answer'] = pv['is_CW'] * np.sign(pv['is_PS'])
print (pv.index, pv.answer)
A    2.0
B    0.0
D    0.0

最新更新