I have a set of words:
words = {'thanks giving', 'cat', 'instead of', ...}
I need to search for exactly these words in the table column 'Description':
ID | Description
---|-------------------------
1  | having fun thanks giving
2  | cat eats all the food
3  | instead you can come
def matched_words(x, words):
    match_words = []
    for word in words:
        if word in x:
            match_words.append(word)
    return match_words

df['new_col'] = df['description'].apply(lambda x: matched_words(x, words))
Desired output:
ID | Description              | matched words
---|--------------------------|------------------
1  | having fun thanks giving | ['thanks giving']
2  | cat eats all the food    | ['cat']
3  | instead you can come     | []
I am only getting matches such as ['cat'].
The following code should give you the desired result:
import re

words = {'thanks', 'cat', 'instead of'}
phrases = [
    [1, "having fun at thanksgiving"],
    [2, "cater the food"],
    [3, "instead you can come"],
    [4, "instead of pizza"],
    [5, "thanks for all the fish"]
]
matched_words = []
matched_pairs = []
for word in words:
    for phrase in phrases:
        # \b: word boundary before the term; \W: a non-word character must follow it
        result = re.search(r'\b' + word + r'\W', phrase[1])
        if result:
            matched_words.append(result.group(0))
            matched_pairs.append([result.group(0), phrase])
print()
print(matched_words)
print(matched_pairs)
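The relevant part, i.e. the regex bit re.search(r'\b' + word + r'\W', phrase[1]), searches for cases where our search string is found starting at a word boundary (\b, which matches the empty string at the edge of a word) and ending with a non-word character (\W). This should ensure that we only find whole-word matches. There is no need to do anything else to the text you are searching. One caveat: \W has to consume a character, so a term at the very end of the text is not matched, and the string returned by group(0) includes that trailing character.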
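To make the difference between the two endings concrete, here is a minimal sketch that compares a trailing \W with a trailing \b. The helper names has_whole_word and has_word_then_nonword are made up for this illustration, and re.escape is added on the assumption that the search terms should be treated as literal text rather than as regex syntax.

```python
import re

def has_whole_word(word, text):
    """Return True if `word` occurs in `text` with a word boundary on both sides."""
    # re.escape keeps regex metacharacters in the search term literal.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text) is not None

def has_word_then_nonword(word, text):
    """Variant with \\W at the end: a non-word character must follow the term."""
    pattern = r'\b' + re.escape(word) + r'\W'
    return re.search(pattern, text) is not None

print(has_whole_word('cat', 'cater the food'))                             # False
print(has_whole_word('thanks giving', 'having fun thanks giving'))         # True
print(has_word_then_nonword('thanks giving', 'having fun thanks giving'))  # False
```

The last call returns False only because the term sits at the very end of the text, where there is no character left for \W to match.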
Of course, you can use whatever names you like instead of words, phrases, matched_words and matched_pairs.
Hope this helps!
import re

words = {'thanks', 'cat', 'instead of'}
samples = [
    (1, 'having fun at thanksgiving'),
    (2, 'cater the food'),
    (3, 'instead you can come'),
    (4, 'instead of you can come'),
]
for sample_id, description in samples:
    for word in words:
        # \b on both sides matches the term only as a whole word or phrase
        if re.search(r'\b' + word + r'\b', description):
            print("'%s' in '%s'" % (word, description))
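Applied to the data from the question, the same word-boundary check might look like the sketch below. It reuses the question's function name matched_words, works on a plain list of rows instead of a DataFrame so that it runs without pandas, and sorts the words only to make the output order predictable; all of that is an assumption for illustration, not part of either answer.

```python
import re

def matched_words(text, words):
    """Return the entries of `words` that occur in `text` as whole words."""
    return [
        word for word in sorted(words)
        # re.escape treats each search term as literal text
        if re.search(r'\b' + re.escape(word) + r'\b', text)
    ]

words = {'thanks giving', 'cat', 'instead of'}
rows = [
    (1, 'having fun thanks giving'),
    (2, 'cat eats all the food'),
    (3, 'instead you can come'),
]
for row_id, description in rows:
    print(row_id, matched_words(description, words))
# 1 ['thanks giving']
# 2 ['cat']
# 3 []
```

With a DataFrame, this function can be passed to apply in the same way as the function in the question.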