List to regex, including a leading space

My list:

mylist = ['apple', 'banana', 'grape']

df
text
I love banana
apple is delicious
I eat pineapple
hate whitegrape

To find which list items appear in the text, I do the following.

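import re

# The text is lowercased below, so an inline (?i) flag is not needed;
# placing it mid-pattern raises an error on Python 3.11+.
mylist = [re.escape(k.lower()) for k in mylist]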
extracted = df['text'].str.lower().str.findall(f'({"|".join(mylist)})').apply(set)
df['matching'] = extracted.str.join(',')

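The matching has a problem: because nothing requires a space before the list items, the "apple" I am looking for is contained in "pineapple", so it matches.

As another example, I am looking for "grape", but "whitegrape" contains "grape", so that is counted as well.

How can I require a leading space before each item in the list?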

Current result:
text                 matching
I love banana        banana
apple is delicious   apple
I eat pineapple      apple
hate whitegrape      grape

The result I want:

text                 matching
I love banana        banana
apple is delicious   apple
I eat pineapple  
hate whitegrape

You can split the text first and then keep only the tokens that are in the list:

df.text.str.lower().str.split().apply(lambda x : [y for y in x if y in mylist]).str[0]
Out[227]: 
0    banana
1     apple
2       NaN
3       NaN
Name: text, dtype: object
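
As a rough sketch of what the split approach does per row, the plain-Python version below keeps every token that equals a keyword exactly. The names `keywords` and `whole_token_matches` are made up for this illustration. It also shows one limitation to be aware of: a token with attached punctuation, such as "banana,", is not equal to "banana" and is therefore skipped.

```python
# Hypothetical names, for illustration only.
keywords = {"apple", "banana", "grape"}

def whole_token_matches(text):
    # Split on whitespace and keep tokens that equal a keyword exactly.
    return [token for token in text.lower().split() if token in keywords]

print(whole_token_matches("I love banana and apple"))  # both tokens are kept
print(whole_token_matches("I eat pineapple"))          # "pineapple" != "apple"
print(whole_token_matches("banana, please"))           # "banana," != "banana"
```

If punctuation can be attached to words in your data, a regex with word boundaries is probably the safer choice.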

Update: using str.findall with word boundaries

df.text.str.lower().str.findall(r'\b({0})\b'.format('|'.join(mylist)))
Out[248]: 
0    [banana]
1     [apple]
2          []
3          []
Name: text, dtype: object

You can use:

df.text.str.extract(fr"(?i)\b({'|'.join(mylist)})\b")
0
0  banana
1   apple
2     NaN
3     NaN

Of course, you can change extract to findall, as in your example.
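
For reference, here is a minimal, standard-library-only sketch of the word-boundary idea that produces the comma-joined output asked for in the question. The names `keywords`, `texts`, `pattern_string` and `matching_keywords` are hypothetical, and sorting the matches is an added assumption to make the output order deterministic.

```python
import re

# Hypothetical sample data mirroring the question.
keywords = ["apple", "banana", "grape"]
texts = ["I love banana", "apple is delicious", "I eat pineapple", "hate whitegrape"]

# Escape each keyword so regex metacharacters are matched literally,
# then wrap the alternation in \b...\b so only whole words match.
pattern_string = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
pattern = re.compile(pattern_string, flags=re.IGNORECASE)

def matching_keywords(text):
    """Return the distinct whole-word keywords found in text, comma-joined."""
    found = {m.lower() for m in pattern.findall(text)}
    return ",".join(sorted(found))

results = [matching_keywords(t) for t in texts]
print(results)
```

With pandas, the same `pattern_string` can be passed to `str.findall` together with `flags=re.IGNORECASE`. Note that `\b` treats any non-word character as a boundary, not only a space, so "apple," and "(apple)" would also match.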
