How do I check whether a word or group of words from a list of strings appears in a given string, and how do I extract it?

I have a list of strings as follows:

list_of_words = ['all saints church','churchill college', "great saint mary's church", 'holy trinity church', "little saint mary's church", 'emmanuel college']

I also have a list of dictionaries, each with 'text' as the key and a sentence as the value:

"dict_sentences": [
{
"text": "Can you help me book a taxi going from emmanuel college to churchill college?"
},
{
"text": "Yes, I could! What time would you like to depart from Emmanuel College?"
},
{
"text": "I want a taxi to holy trinity church"
},
{
"text": "Alright! I have a yellow Lexus booked to pick you up. The Contact number is 07543493643. Anything else I can help with?"
},
{
"text": "No, that is everything I needed. Thank you!"
},
{
"text": "Thank you! Have a great day!"
}
]

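For each sentence in dict_sentences, I want to check whether any of the phrases in list_of_words occurs in that sentence and, if so, store the matches in another dictionary (since I have to process them further).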

For example, in the first sentence of dict_sentences, "Can you help me book a taxi going from emmanuel college to churchill college?", the substrings 'churchill college' and 'emmanuel college' are both in list_of_words, so I want to store them in another dictionary, like { sent1 : ['churchill college', 'emmanuel college'] }

So the expected output would be:

{  sent1 : ['churchill college', 'emmanuel college'] ,
sent2 : [ 'emmanuel college' ],
sent3 : [ 'holy trinity church' ]
} # ignore the rest of sentences as no word from list_of_words exist in them

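The main problem is checking whether a given sentence contains a word or group of words (such as "holy trinity church", which is 3 words) and, if so, extracting it. I looked at other answers, which suggest the following code to check whether a word from the list appears in the sentence: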

if any(word in sentence for word in list_of_words):
    pass

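However, this only tells me whether some phrase from list_of_words is present in the sentence; to extract the phrases I would have to run a for loop. I would rather avoid for loops because I need a very time-efficient solution: I have about 300 documents, each with 10-15 (or more) sentences, and list_of_words is also large, around 300 strings. So I need a time-efficient way to check for and extract the phrases from list_of_words that occur in a given sentence.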

You can use re.findall, which avoids the nested loop.

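import re

output = {}
# data is assumed to be the dictionary that holds "dict_sentences"
# re.escape keeps any regex metacharacters in the phrases literal
find_words = re.compile('|'.join(map(re.escape, list_of_words))).findall
for i, (s,) in enumerate(map(dict.values, data['dict_sentences']), 1):
    words = find_words(s.lower())
    if words:
        output[f"sent{i}"] = words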

{'sent1': ['emmanuel college', 'churchill college'],
'sent2': ['emmanuel college'],
'sent3': ['holy trinity church']}

This can also be done as a dict comprehension using the walrus operator in Python 3.8+, though that may be overkill:

find_sent = re.compile('|'.join(list_of_words)).findall
iter_sent = enumerate(map(dict.values, data['dict_sentences']), 1)
output = {f"sent{i}": words for i, (s,) in iter_sent if (words := find_sent(s.lower()))}
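
One caveat with a plain alternation is that it matches anywhere in the text, including inside longer words, and with overlapping phrases the first alternative listed wins. A minimal sketch of a more defensive pattern follows; the names `phrases`, `pattern`, and `find_phrases` are my own, and it assumes every entry in list_of_words is already lowercase and starts and ends with a word character.

```python
import re

list_of_words = ['all saints church', 'churchill college', "great saint mary's church",
                 'holy trinity church', "little saint mary's church", 'emmanuel college']

# Longest phrases first, so a longer phrase is preferred over a shorter
# phrase that starts at the same position.
phrases = sorted(list_of_words, key=len, reverse=True)

# re.escape keeps metacharacters literal; \b stops 'emmanuel college'
# from matching inside 'emmanuel colleges'.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, phrases)) + r')\b')

def find_phrases(sentence):
    """Return the known phrases found in sentence, in order of appearance."""
    return pattern.findall(sentence.lower())

print(find_phrases("Can you help me book a taxi going from emmanuel college to churchill college?"))
```

Compiling the pattern once and reusing it for every sentence is what keeps this cheap when there are many documents.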

There may be a more efficient way to do this, for example with itertools, but I'm not very familiar with it.

test = {"dict_sentences":...} # I'm assuming it's a section of a json or a larger dictionary.
output = {}
j = 1
for sent in test["dict_sentences"]:
    addition = []
    for i in list_of_words:
        # case-insensitive substring check
        if i.upper() in sent["text"].upper():
            addition.append(i)
    if addition:
        output[f"sent{j}"] = addition
    j += 1  # advance for every sentence so keys follow sentence positions
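
A small variation on the loop above, as a sketch rather than a benchmarked improvement: if the phrases in list_of_words are already lowercase (an assumption that holds for the sample data), the sentence can be lowercased once instead of once per phrase, and enumerate can replace the manual counter. The shortened sample data and the name `found` are mine.

```python
list_of_words = ['all saints church', 'churchill college', 'holy trinity church',
                 'emmanuel college']

dict_sentences = [
    {"text": "Can you help me book a taxi going from emmanuel college to churchill college?"},
    {"text": "Yes, I could! What time would you like to depart from Emmanuel College?"},
    {"text": "I want a taxi to holy trinity church"},
    {"text": "Thank you! Have a great day!"},
]

output = {}
for j, sent in enumerate(dict_sentences, 1):
    text = sent["text"].lower()  # lowercase once per sentence
    # keep the phrases that occur in this sentence, in list_of_words order
    found = [phrase for phrase in list_of_words if phrase in text]
    if found:
        output[f"sent{j}"] = found

print(output)
```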

You can use a nested dictionary comprehension and compare the contents after converting both to lowercase, for example:


output = {
    f"sent{i+1}": [
        phrase for phrase in list_of_words if phrase.lower() in sentence['text'].lower()
    ] for i, sentence in enumerate(dict_sentences)
}
output_without_empty_matches = {k: v for k, v in output.items() if v}
print(output_without_empty_matches)
# {'sent1': ['churchill college', 'emmanuel college'], 'sent2': ['emmanuel college'], 'sent3': ['holy trinity church']}

new_list = []
new_dict = {}
for index, subdict in enumerate(dict_sentences):
    for word in list_of_words:
        if word in subdict['text'].lower():
            key = "sent" + str(index + 1)
            new_list.append(word)
            new_dict[key] = new_list
    new_list = []  # start a fresh list for the next sentence
print(new_dict)
