如何在给定的数据框架中找到考虑特定子组的项目频率?



我有两个数据框:df1

groups = {'Items': ["a", "b", "c", "c", "d", "d", "x", "y", "z", "p", "q", "m", "a", "b", "c", "d", "y", "z", "o", "d"],
'Groups': ["c11d", "c11d", "c22c", "c22c", "c22c", "c22c", "dp1", "dp1", "dp3", "dp3", "dp1", "dp1", "lp1",
"lp2", "lp2", "lp2", "lp1", "lp1", "rk1", "rk1"],
'Tags': ["banana", "banana", "banana", "banana", "banana", "banana", "orange", "orange", "orange", "orange",
"orange", "orange", "lemon", "lemon", "lemon", "lemon", "lemon", "lemon", "apple", "apple"]
}
df1 = pd.DataFrame(groups)

df1

Items   Groups  Tags
0   a       c11d    banana
1   b       c11d    banana
2   c       c22c    banana
3   c       c22c    banana
4   d       c22c    banana
5   d       c22c    banana
6   x       dp1     orange
7   y       dp1     orange
8   z       dp3     orange
9   p       dp3     orange
10  q       dp1     orange
11  m       dp1     orange
12  a       lp1     lemon
13  b       lp2     lemon
14  c       lp2     lemon
15  d       lp2     lemon
16  y       lp1     lemon
17  z       lp1     lemon
18  o       rk1     apple
19  d       rk1     apple

df2

terms = {'Groups': ["banana", "apple", "orange", "lemon"],
'1C': ["a,b,c,d", "b,d,z,y", "c,x,y,p", "p,q"],
'2C': ["m,n,a", "o,c,d", "p,r,d", ""]
}
df2 = pd.DataFrame(terms)

df2

Groups  1C        2C
0   banana  a,b,c,d   m,n,a
1   apple   b,d,z,y   o,c,d
2   orange  c,x,y,p   p,r,d
3   lemon   p,q 

我想将基于Groupsdf2的每列的项目映射到另一个数据框df1,并根据df2中按分组细分的项目找到标签的频率。

期望输出为:

Groups  1C      1C.Freq         2C      2C.Freq
banana  a,b,c,d c11d=2, c22c=4  m,n,a   c11d=1, c22c=0
apple   b,d,z,y rk1=1           o,c,d   rk1=2
orange  c,x,y,p dp1=2, dp3=1    p,r,d   dp1=0, dp3=1
lemon   p,q,z   lp1=1,lp2=0             lp1=0,lp2=0

对不起我的语言。感谢任何帮助。谢谢:)真正的数据框架有更多的标签和列。

合并两个数据框,将第一个数据框中的Tag和Groups列分别重命名为Groups和SubGroups,然后过滤掉第一个数据框并进行分组和聚合,对第二个数据框进行同样的操作,最后将它们连接起来。

from collections import Counter
merged=df1.rename(columns={'Tags':'Groups', 'Groups':'SubGroups'}).merge(df2)
first=(merged[merged.apply(lambda x: x['Items'] in x['1C'], axis=1)].
groupby('Groups', sort=False).
agg({'1C':'first', 'SubGroups':Counter}).
rename(columns={'SubGroups':'1C.Freq'})
)
second=(merged[merged.apply(lambda x: x['Items'] in x['2C'], axis=1)].
groupby('Groups', sort=False).
agg({'2C':'first', 'SubGroups':Counter}).
rename(columns={'SubGroups':'2C.Freq'}
)
pd.concat([first, second], axis=1)

输出:


1C         1C.Freq              2C     2C.Freq
Groups              
banana  a,b,c,d {'c11d': 2, 'c22c': 4}  m,n,a   {'c11d': 1}
orange  c,x,y,p {'dp1': 2, 'dp3': 1}    p,r,d   {'dp3': 1}
apple   b,d,z,y {'rk1': 1}              o,c,d   {'rk1': 2}

PS:我建议你对频率列使用dictionary,因为你以后可以很容易地使用它。

下面是计算给定列(1C2C)频率的函数:

def freq(col):
df = df1.merge(df2[['Groups', col]], how='left', left_on='Tags', right_on='Groups')

# convert comma-separated values -> lists -> rows
df[col] = df[col].str.split(',')
df = df.explode(col)

# mark and count the matches
df['match'] = df['Items'] == df[col]
df = df.groupby(['Tags', 'Groups_x'])['match'].sum().reset_index()
# format the matches as 'key=value'
df[f'{col}.Freq'] = df.pop('Groups_x') + '=' + df.pop('match').astype(str)

# combine the matches per group as 'key1=value2,key2=value2,...'
return df.groupby('Tags').agg(','.join).reindex(df2['Groups'])

将频率列添加到df2的末尾:

df2['1C.Freq'] = freq('1C').values
df2['2C.Freq'] = freq('2C').values
#    Groups       1C     2C        1C.Freq        2C.Freq
# 0  banana  a,b,c,d  m,n,a  c11d=2,c22c=4  c11d=1,c22c=0
# 1   apple  b,d,z,y  o,c,d          rk1=1          rk1=2
# 2  orange  c,x,y,p  p,r,d    dp1=2,dp3=1    dp1=0,dp3=1
# 3   lemon      p,q           lp1=0,lp2=0    lp1=0,lp2=0

或者使用DataFrame.insert将它们按照OP的顺序排列:

df2.insert(2, '1C.F', freq('1C').values)
df2.insert(4, '2C.F', freq('2C').values)
#    Groups       1C           1C.F     2C           2C.F
# 0  banana  a,b,c,d  c11d=2,c22c=4  m,n,a  c11d=1,c22c=0
# 1   apple  b,d,z,y          rk1=1  o,c,d          rk1=2
# 2  orange  c,x,y,p    dp1=2,dp3=1  p,r,d    dp1=0,dp3=1
# 3   lemon      p,q    lp1=0,lp2=0           lp1=0,lp2=0

最新更新