我有两个数据框:df1
groups = {'Items': ["a", "b", "c", "c", "d", "d", "x", "y", "z", "p", "q", "m", "a", "b", "c", "d", "y", "z", "o", "d"],
'Groups': ["c11d", "c11d", "c22c", "c22c", "c22c", "c22c", "dp1", "dp1", "dp3", "dp3", "dp1", "dp1", "lp1",
"lp2", "lp2", "lp2", "lp1", "lp1", "rk1", "rk1"],
'Tags': ["banana", "banana", "banana", "banana", "banana", "banana", "orange", "orange", "orange", "orange",
"orange", "orange", "lemon", "lemon", "lemon", "lemon", "lemon", "lemon", "apple", "apple"]
}
df1 = pd.DataFrame(groups)
df1
Items Groups Tags
0 a c11d banana
1 b c11d banana
2 c c22c banana
3 c c22c banana
4 d c22c banana
5 d c22c banana
6 x dp1 orange
7 y dp1 orange
8 z dp3 orange
9 p dp3 orange
10 q dp1 orange
11 m dp1 orange
12 a lp1 lemon
13 b lp2 lemon
14 c lp2 lemon
15 d lp2 lemon
16 y lp1 lemon
17 z lp1 lemon
18 o rk1 apple
19 d rk1 apple
则df2
terms = {'Groups': ["banana", "apple", "orange", "lemon"],
'1C': ["a,b,c,d", "b,d,z,y", "c,x,y,p", "p,q"],
'2C': ["m,n,a", "o,c,d", "p,r,d", ""]
}
df2 = pd.DataFrame(terms)
df2
Groups 1C 2C
0 banana a,b,c,d m,n,a
1 apple b,d,z,y o,c,d
2 orange c,x,y,p p,r,d
3 lemon p,q
我想将基于Groups
的df2
的每列的项目映射到另一个数据框df1
,并根据df2
中按分组细分的项目找到标签的频率。
期望输出为:
Groups 1C 1C.Freq 2C 2C.Freq
banana a,b,c,d c11d=2, c22c=4 m,n,a c11d=1, c22c=0
apple b,d,z,y rk1=1 o,c,d rk1=2
orange c,x,y,p dp1=2, dp3=1 p,r,d dp1=0, dp3=1
lemon p,q,z lp1=1,lp2=0 lp1=0,lp2=0
对不起我的语言。感谢任何帮助。谢谢:)真正的数据框架有更多的标签和列。
合并两个数据框,将第一个数据框中的Tag和Groups列分别重命名为Groups和SubGroups,然后过滤掉第一个数据框并进行分组和聚合,对第二个数据框进行同样的操作,最后将它们连接起来。
from collections import Counter
merged=df1.rename(columns={'Tags':'Groups', 'Groups':'SubGroups'}).merge(df2)
first=(merged[merged.apply(lambda x: x['Items'] in x['1C'], axis=1)].
groupby('Groups', sort=False).
agg({'1C':'first', 'SubGroups':Counter}).
rename(columns={'SubGroups':'1C.Freq'})
)
second=(merged[merged.apply(lambda x: x['Items'] in x['2C'], axis=1)].
groupby('Groups', sort=False).
agg({'2C':'first', 'SubGroups':Counter}).
rename(columns={'SubGroups':'2C.Freq'}
)
pd.concat([first, second], axis=1)
输出:
1C 1C.Freq 2C 2C.Freq
Groups
banana a,b,c,d {'c11d': 2, 'c22c': 4} m,n,a {'c11d': 1}
orange c,x,y,p {'dp1': 2, 'dp3': 1} p,r,d {'dp3': 1}
apple b,d,z,y {'rk1': 1} o,c,d {'rk1': 2}
PS:我建议你对频率列使用dictionary,因为你以后可以很容易地使用它。
下面是计算给定列(1C
或2C
)频率的函数:
def freq(col):
df = df1.merge(df2[['Groups', col]], how='left', left_on='Tags', right_on='Groups')
# convert comma-separated values -> lists -> rows
df[col] = df[col].str.split(',')
df = df.explode(col)
# mark and count the matches
df['match'] = df['Items'] == df[col]
df = df.groupby(['Tags', 'Groups_x'])['match'].sum().reset_index()
# format the matches as 'key=value'
df[f'{col}.Freq'] = df.pop('Groups_x') + '=' + df.pop('match').astype(str)
# combine the matches per group as 'key1=value2,key2=value2,...'
return df.groupby('Tags').agg(','.join).reindex(df2['Groups'])
将频率列添加到df2
的末尾:
df2['1C.Freq'] = freq('1C').values
df2['2C.Freq'] = freq('2C').values
# Groups 1C 2C 1C.Freq 2C.Freq
# 0 banana a,b,c,d m,n,a c11d=2,c22c=4 c11d=1,c22c=0
# 1 apple b,d,z,y o,c,d rk1=1 rk1=2
# 2 orange c,x,y,p p,r,d dp1=2,dp3=1 dp1=0,dp3=1
# 3 lemon p,q lp1=0,lp2=0 lp1=0,lp2=0
或者使用DataFrame.insert
将它们按照OP的顺序排列:
df2.insert(2, '1C.F', freq('1C').values)
df2.insert(4, '2C.F', freq('2C').values)
# Groups 1C 1C.F 2C 2C.F
# 0 banana a,b,c,d c11d=2,c22c=4 m,n,a c11d=1,c22c=0
# 1 apple b,d,z,y rk1=1 o,c,d rk1=2
# 2 orange c,x,y,p dp1=2,dp3=1 p,r,d dp1=0,dp3=1
# 3 lemon p,q lp1=0,lp2=0 lp1=0,lp2=0