Count unique matches of phrases in a DataFrame column



I have a pandas DataFrame df with a string column Posts that looks like this:

df['Posts']
0       this is an example sentence
1       this too is an example too is an example sentence
2       yup, still an example sentence

I also have another DataFrame df1 with a column Phrases that holds a list of tags, like this:

df1['Phrases']
0       example
1       example sentence
2       is an
3       is an example
4       yup

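I need a DataFrame that gives, for each phrase in df1['Phrases'], the number of rows in df['Posts'] that contain it (each post counted at most once), like this: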

Phrases             Count   
0       example               3 
1       example sentence      3
2       is an                 2
3       is an example         2
4       yup                   1

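Use str.extract, then test for non-missing values and count the matches with sum, since True values are treated like 1s. str.extract returns only the first match per row (or NaN when there is none), so each post is counted at most once: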

df1['Count'] = [df['Posts'].str.extract('(' + x + ')', expand=False).notnull().sum()
for x in df1['Phrases']]
print (df1)
            Phrases  Count
0           example      3
1  example sentence      3
2             is an      2
3     is an example      2
4               yup      1
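
For a reproducible check, the same idea can be sketched end to end. The two frames are rebuilt from the samples in the question. Note that `str.contains` is swapped in here for `str.extract` as an assumption of this sketch, not the method used above: it returns booleans directly, and `regex=False` treats each phrase as a literal substring.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({'Posts': [
    'this is an example sentence',
    'this too is an example too is an example sentence',
    'yup, still an example sentence',
]})
df1 = pd.DataFrame({'Phrases': [
    'example', 'example sentence', 'is an', 'is an example', 'yup',
]})

# contains() yields one boolean per post, so a post that repeats a
# phrase still counts once; summing booleans counts the True values.
df1['Count'] = [int(df['Posts'].str.contains(p, regex=False).sum())
                for p in df1['Phrases']]
print(df1)
```

Because the phrases are matched literally, this variant is not affected by regex metacharacters such as `.` or `(` in a phrase, but it still counts partial-word matches.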

EDIT:

To avoid counting partial matches, use word boundaries (\b):

df1['Count'] = [df['Posts'].str.extract(r'(\b' + x + r'\b)', expand=False).notnull().sum()
for x in df1['Phrases']]
print (df1)
            Phrases  Count
0           example      3
1  example sentence      3
2             is an      2
3     is an example      2
4               yup      1
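
The sample phrases give the same counts with or without word boundaries, so the effect is easier to see with an extra phrase. Below is a minimal sketch, assuming a hypothetical phrase `exam` (not in the original data) and a hypothetical helper `count_posts`. It also passes each phrase through `re.escape`, an addition of this sketch, so that regex metacharacters in a phrase are matched literally.

```python
import re
import pandas as pd

# Sample posts rebuilt from the question.
df = pd.DataFrame({'Posts': [
    'this is an example sentence',
    'this too is an example too is an example sentence',
    'yup, still an example sentence',
]})

# 'exam' is a hypothetical phrase added only to show a partial match.
phrases = ['example', 'is an', 'yup', 'exam']

def count_posts(posts, phrase):
    """Count the posts that contain `phrase` as whole words (hypothetical helper)."""
    # \b anchors the match at word boundaries; re.escape makes the
    # phrase itself literal inside the regex.
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return int(posts.str.contains(pattern, regex=True).sum())

counts = [count_posts(df['Posts'], p) for p in phrases]
print(dict(zip(phrases, counts)))
```

Here `exam` is counted 0 times, because it only occurs inside `example`; without the boundaries it would be counted in all three posts. One caveat: `\b` only behaves as expected when the phrase starts and ends with a word character.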
