How can I check whether a word or group of words exists in a given list of strings, and how can I extract that word?



I have a list of strings, as shown below:

list_of_words = ['all saints church','churchill college', "great saint mary's church", 'holy trinity church', "little saint mary's church", 'emmanuel college']

I also have a list of dictionaries, each with 'text' as the key and a sentence as the value. It looks like this:

"dict_sentences": [
    {
        "text": "Can you help me book a taxi going from emmanuel college to churchill college?"
    },
    {
        "text": "Yes, I could! What time would you like to depart from Emmanuel College?"
    },
    {
        "text": "I want a taxi to holy trinity church"
    },
    {
        "text": "Alright! I have a yellow Lexus booked to pick you up. The Contact number is 07543493643. Anything else I can help with?"
    },
    {
        "text": "No, that is everything I needed. Thank you!"
    },
    {
        "text": "Thank you! Have a great day!"
    }
]

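For each sentence in dict_sentences, I want to check whether any of the phrases in list_of_words occurs in that sentence, and if so, store it in another dictionary (because I need to process it further).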

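For example, the first sentence in dict_sentences, "Can you help me book a taxi going from emmanuel college to churchill college?", contains the substrings 'churchill college' and 'emmanuel college', both of which are in list_of_words, so I want to store those two phrases in another dictionary, like { sent1 : ['churchill college', 'emmanuel college'] }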

So the expected output would be:

{  sent1 : ['churchill college', 'emmanuel college'] ,
sent2 : [ 'emmanuel college' ],
sent3 : [ 'holy trinity church' ]
} # ignore the rest of sentences as no word from list_of_words exist in them

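The main problem here is checking whether a given sentence contains a word or group of words (such as "holy trinity church", which is 3 words) and, if so, extracting it. I looked at other answers, which suggest the following code for checking whether a word from the list appears in the sentence: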

if any(word in sentence for word in list_of_words):
    pass

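However, this only tells me whether some word from list_of_words occurs in the sentence; to extract the words I would have to run a for loop. I would rather avoid a for loop because I need a very time-efficient solution: I have about 300 documents, each consisting of 10-15 (or more) sentences, and list_of_words is also large, about 300 strings. So I need a time-efficient way to check for and extract the words from list_of_words that occur in a given sentence.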

You can use re.findall, so that there is no nested loop.

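import re

# data is assumed to be the dictionary that holds the "dict_sentences" list
output = {}
# one alternation pattern over all phrases, compiled once
find_words = re.compile('|'.join(list_of_words)).findall
for i, (s,) in enumerate(map(dict.values, data['dict_sentences']), 1):
    words = find_words(s.lower())
    if words:
        output[f"sent{i}"] = words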

{'sent1': ['emmanuel college', 'churchill college'],
'sent2': ['emmanuel college'],
'sent3': ['holy trinity church']}
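
One caveat with joining the phrases directly: any regex metacharacters in a phrase are treated as pattern syntax, and a shorter phrase listed first can shadow a longer one that contains it. The following is a minimal sketch that guards against both and also avoids matching inside a longer word. It assumes the same list_of_words as in the question; the helper name build_finder and the abbreviated sentences list are hypothetical, introduced only for illustration.

```python
import re

list_of_words = ['all saints church', 'churchill college', "great saint mary's church",
                 'holy trinity church', "little saint mary's church", 'emmanuel college']

def build_finder(phrases):
    # Longest phrases first, so a longer phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape makes each phrase match literally; the lookarounds require that
    # the match is not glued to a neighbouring word character.
    pattern = r'(?<!\w)(?:' + '|'.join(map(re.escape, ordered)) + r')(?!\w)'
    return re.compile(pattern, re.IGNORECASE).findall

find_words = build_finder(list_of_words)

# Abbreviated sample data (assumption: plain strings rather than the dict list).
sentences = [
    "Can you help me book a taxi going from emmanuel college to churchill college?",
    "Yes, I could! What time would you like to depart from Emmanuel College?",
    "I want a taxi to holy trinity church",
    "Thank you! Have a great day!",
]

output = {}
for i, text in enumerate(sentences, 1):
    # Lowercase the hits so they compare equal to the entries in list_of_words.
    words = [w.lower() for w in find_words(text)]
    if words:
        output[f"sent{i}"] = words

print(output)
```

Because the pattern is compiled once, the same find_words callable can be reused across all documents.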

This can also be done in a dict comprehension using the walrus operator in Python 3.8+, although that may be overkill:

find_sent = re.compile('|'.join(list_of_words)).findall
iter_sent = enumerate(map(dict.values, data['dict_sentences']), 1)
output = {f"sent{i}": words for i, (s,) in iter_sent if (words := find_sent(s.lower()))}

There may be a more efficient way to do this, such as with itertools, but I'm not very familiar with it.

test = {"dict_sentences":...} # I'm assuming it's a section of a json or a larger dictionary.
output = {}
j = 1
for sent in test["dict_sentences"]:
    addition = []
    for i in list_of_words:
        if i.upper() in sent["text"].upper():
            addition.append(i)
    if addition:
        output[f"sent{j}"] = addition
    j += 1
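
The loop above calls sent["text"].upper() once per phrase. A small variation, sketched here on the assumption that the data has the shape shown in the question, normalises each sentence once and pre-normalises the phrases outside the loop. It uses str.casefold in place of upper, and the sample data is abbreviated; the name folded_phrases is hypothetical.

```python
list_of_words = ['churchill college', 'holy trinity church', 'emmanuel college']

# Abbreviated sample data in the same shape as the question's dict_sentences.
test = {"dict_sentences": [
    {"text": "Can you help me book a taxi going from emmanuel college to churchill college?"},
    {"text": "Yes, I could! What time would you like to depart from Emmanuel College?"},
    {"text": "I want a taxi to holy trinity church"},
    {"text": "Thank you! Have a great day!"},
]}

# Normalise the phrases once, outside the loop over sentences.
folded_phrases = [(phrase, phrase.casefold()) for phrase in list_of_words]

output = {}
for j, sent in enumerate(test["dict_sentences"], 1):
    text = sent["text"].casefold()  # normalise each sentence only once
    addition = [phrase for phrase, folded in folded_phrases if folded in text]
    if addition:
        output[f"sent{j}"] = addition

print(output)
```

This is still a plain substring test, so a phrase can match inside a longer word; the matches come back in list_of_words order, not in sentence order.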

You can use a nested dict comprehension and compare the contents after converting both to lowercase, for example:


output = {
    f"sent{i+1}": [
        phrase for phrase in list_of_words if phrase.lower() in sentence['text'].lower()
    ] for i, sentence in enumerate(dict_sentences)
}
output_without_empty_matches = {k: v for k, v in output.items() if v}
print(output_without_empty_matches)
>>> {'sent1': ['churchill college', 'emmanuel college'], 'sent2': ['emmanuel college'], 'sent3': ['holy trinity church']}

new_list = []
new_dict = {}
for index, subdict in enumerate(dict_sentences):
    for word in list_of_words:
        if word in subdict['text'].lower():
            key = "sent" + str(index + 1)
            new_list.append(word)
            new_dict[key] = new_list
    new_list = []
print(new_dict)
