I have a pandas DataFrame df with a string column Posts, which looks like this:
df['Posts']
0 this is an example sentence
1 this too is an example too is an example sentence
2 yup, still an example sentence
I also have another DataFrame df1, which has a column Phrases containing a list of tags, like this:
df1['Phrases']
0 example
1 example sentence
2 is an
3 is an example
4 yup
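I need a DataFrame that gives, for each entry in df1's Phrases, the number of rows in df's Posts in which it appears (each post counted once), like this: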
Phrases Count
0 example 3
1 example sentence 3
2 is an 2
3 is an example 2
4 yup 1
Use str.extract, then check for non-missing values and count the occurrences with sum, since True values are treated like 1s:
df1['Count'] = [df['Posts'].str.extract('(' + x + ')', expand=False).notnull().sum()
for x in df1['Phrases']]
print(df1)
Phrases Count
0 example 3
1 example sentence 3
2 is an 2
3 is an example 2
4 yup 1
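For reference, the approach above can be run end to end on the sample data from the question. This is only a minimal sketch: the helper name count_posts_containing is hypothetical, and the call to re.escape is an addition not in the original answer, on the assumption that the phrases are literal text rather than regular expressions.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'Posts': [
    'this is an example sentence',
    'this too is an example too is an example sentence',
    'yup, still an example sentence',
]})
df1 = pd.DataFrame({'Phrases': [
    'example', 'example sentence', 'is an', 'is an example', 'yup',
]})


def count_posts_containing(posts, phrase):
    """Hypothetical helper: number of posts in which `phrase` occurs at least once."""
    # Assumption: phrases are literal text, so regex metacharacters are escaped.
    pattern = '(' + re.escape(phrase) + ')'
    # str.extract yields NaN for rows without a match; notnull() marks the matches,
    # and sum() counts them because True is treated as 1.
    return int(posts.str.extract(pattern, expand=False).notnull().sum())


df1['Count'] = [count_posts_containing(df['Posts'], x) for x in df1['Phrases']]
print(df1)
```

Each post contributes at most 1 to a phrase's count, even when the phrase occurs several times in the same post, because str.extract only returns the first match per row.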
Edit:
To avoid counting partial matches, use word boundaries:
df1['Count'] = [df['Posts'].str.extract(r'(\b' + x + r'\b)', expand=False).notnull().sum()
for x in df1['Phrases']]
print(df1)
Phrases Count
0 example 3
1 example sentence 3
2 is an 2
3 is an example 2
4 yup 1
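The sample data gives the same counts with and without word boundaries, so the difference is easier to see on a small made-up case. The sketch below uses hypothetical posts and uses str.contains rather than str.extract (a swapped-in but equivalent way to get one True/False per row); re.escape is again an assumption that the phrase is literal text.

```python
import re

import pandas as pd

# Hypothetical posts, chosen so that one of them contains only a partial match.
posts = pd.Series([
    'an example sentence',
    'several examples here',
    'yup',
])
phrase = 'example'

# Without word boundaries, 'examples' also counts as a match.
loose = int(posts.str.contains(re.escape(phrase), regex=True).sum())

# With \b on both sides, the phrase must appear as a whole word.
strict = int(posts.str.contains(r'\b' + re.escape(phrase) + r'\b', regex=True).sum())

print(loose, strict)
```

Here loose is 2 and strict is 1, because 'examples' contains the phrase but not as a whole word.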