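I have two datasets (dataframes): one contains text, and the other contains the words I am searching for. I want to know whether those words appear in any of the texts, and tag the texts accordingly.
The approach I have in mind is to add a new row to dataframe 2 for each word in a text that matches a value in dataframe 1.
An example:
Dataframe 1: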
word id
'sushi' 1
'pizza' 2
'burger' 3
'plaza' 4
'park' 5
'mountain' 6
Dataframe 2, the one to be searched:
Note: dataframe 2 has more columns, but they are not relevant to the problem.
text
'I eat pizza in the park'
'I eat sushi'
'She eats sushi with pizza in the plaza'
'He eats'
The desired output is the following:
text contained_word_id
'I eat pizza in the park' 2
'I eat pizza in the park' 5
'I eat sushi' 1
'She eats sushi with pizza in the plaza' 1
'She eats sushi with pizza in the plaza' 2
'She eats sushi with pizza in the plaza' 4
'He eats' NaN
We can use findall first, then explode and map:
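# Build one alternation pattern from the search words and collect every match per text (a list per row)
df2['word'] = df2.text.str.findall('|'.join(df1.word.tolist()))
# One row per matched word; texts with no match get a single NaN row
df2 = df2.explode('word')
# Look up the id of each matched word in df1
df2['id'] = df2.word.map(df1.set_index('word')['id'])
df2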
Out[443]:
text word id
0 'I eat pizza in the park' pizza 2.0
0 'I eat pizza in the park' park 5.0
1 'I eat sushi' sushi 1.0
2 'She eats sushi with pizza in the plaza' sushi 1.0
2 'She eats sushi with pizza in the plaza' pizza 2.0
2 'She eats sushi with pizza in the plaza' plaza 4.0
3 'He eats' NaN NaN
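
For reference, here is a self-contained sketch of the same findall / explode / map idea that rebuilds the sample data from the question. Two details are assumptions of mine rather than part of the answer above: each word is passed through re.escape, and the pattern is wrapped in \b word boundaries so that, for example, 'park' would not match inside 'parking'. The variable names pattern and out are illustrative, and the id column is renamed to contained_word_id only to match the desired output.

```python
import re

import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "word": ["sushi", "pizza", "burger", "plaza", "park", "mountain"],
    "id": [1, 2, 3, 4, 5, 6],
})
df2 = pd.DataFrame({
    "text": [
        "I eat pizza in the park",
        "I eat sushi",
        "She eats sushi with pizza in the plaza",
        "He eats",
    ]
})

# Assumption: only whole-word matches are wanted, so each word is escaped
# and the alternation is wrapped in \b word boundaries.
pattern = r"\b(?:" + "|".join(map(re.escape, df1["word"])) + r")\b"

# findall gives a list of matched words per text; explode turns each list
# element into its own row (an empty list becomes a single NaN row).
out = df2.assign(word=df2["text"].str.findall(pattern)).explode("word")

# Map each matched word to its id; rows without a match stay NaN.
out["contained_word_id"] = out["word"].map(df1.set_index("word")["id"])

# Drop the helper column and renumber the rows.
out = out.drop(columns="word").reset_index(drop=True)
print(out)
```

Because the unmatched row holds NaN, contained_word_id comes out as a float column. If a word occurs twice in the same text, this sketch produces two rows for it; whether that is wanted depends on the use case.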