Most efficient way to compare words in a list / dict in Python



I have the following sentence and dictionary:

sentence = "I love Obama and David Card, two great people. I live in a boat"
dico = {
'dict1':['is','the','boat','tree'],
'dict2':['apple','blue','red'],
'dict3':['why','Obama','Card','two'],
}

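I want to count, for each key of the dictionary, how many of its words appear in the sentence. The heavy-handed way is to do the following: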

from collections import Counter

classe_sentence = []
text_splited = sentence.split(" ")
dic_keys = dico.keys()
# Record the key once for every one of its words found in the sentence
for key_dics in dic_keys:
    for values in dico[key_dics]:
        if values in text_splited:
            classe_sentence.append(key_dics)
Counter(classe_sentence)

The output is:

Counter({'dict3': 2, 'dict1': 1})

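However, because of the two nested loops and the naive comparison (each `in` test scans the whole word list), this is not efficient at all. I would like to know whether there is a faster way to do this, perhaps with an itertools object. Any ideas?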

Thanks in advance!

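You can use the set data type for all the comparisons, and the set.intersection method to get the number of matches.

This improves the algorithmic efficiency, but it counts each word only once, even if it appears in several places in the sentence.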

sentence = set("I love Obama and David Card, two great people. I live in a boat".split())
dico = {
'dict1':{'is','the','boat','tree'},
'dict2':{'apple','blue','red'},
'dict3':{'why','Obama','Card','two'}
}

results = {}
for key, words in dico.items():
    results[key] = len(words.intersection(sentence))
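
As a small variation on the set approach above, here is a minimal sketch that also strips punctuation before intersecting. The function name count_unique_matches and the use of re.findall with the pattern \w+ are my own assumptions, not part of the original answer. With a plain split(), "Card," keeps its comma and does not match 'Card', so dict3 would get 2 instead of 3.

```python
import re

def count_unique_matches(sentence, dico):
    """Count, per key, how many distinct listed words occur in the sentence."""
    # \w+ keeps runs of letters, digits and underscores, so "Card," becomes "Card"
    tokens = set(re.findall(r"\w+", sentence))
    # set(words) accepts lists or sets; & is set intersection
    return {key: len(tokens & set(words)) for key, words in dico.items()}

sentence = "I love Obama and David Card, two great people. I live in a boat"
dico = {
    'dict1': ['is', 'the', 'boat', 'tree'],
    'dict2': ['apple', 'blue', 'red'],
    'dict3': ['why', 'Obama', 'Card', 'two'],
}
print(count_unique_matches(sentence, dico))
# {'dict1': 1, 'dict2': 0, 'dict3': 3}
```

Like the code above, this still counts a word only once however often it is repeated in the sentence.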

Assuming you want case-sensitive matching:

from collections import defaultdict

# `sentence` is the original string here, not a set of words
# Count how many times each word occurs in the sentence
sentence_words = defaultdict(int)
for word in sentence.split(' '):
    # strip off any trailing or leading punctuation
    word = word.strip('\'";.,!?')
    sentence_words[word] += 1
# Sum the occurrence counts of the words listed under each key
for name, words in dico.items():
    count = 0
    for x in words:
        count += sentence_words.get(x, 0)
    print('Dictionary [%s] has [%d] matches!' % (name, count,))
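
The same idea can be sketched with collections.Counter, returning the counts instead of printing them. The function name count_matches and the case_sensitive flag are hypothetical additions of mine, as is the \w+ tokenisation. The flag only shows how a case-insensitive variant might look if you need one.

```python
import re
from collections import Counter

def count_matches(sentence, dico, case_sensitive=True):
    """Count, per key, the total occurrences of its words in the sentence."""
    if not case_sensitive:
        sentence = sentence.lower()
    # Counter maps each token to the number of times it appears
    occurrences = Counter(re.findall(r"\w+", sentence))
    result = {}
    for key, words in dico.items():
        # Deduplicate so a word listed twice under one key is not counted twice
        unique = {w if case_sensitive else w.lower() for w in words}
        # A Counter returns 0 for tokens it has never seen
        result[key] = sum(occurrences[w] for w in unique)
    return result

text = "The boat hit the other boat"
print(count_matches(text, {'dict1': ['the', 'boat']}))
# {'dict1': 3}
print(count_matches(text, {'dict1': ['the', 'boat']}, case_sensitive=False))
# {'dict1': 4}
```

Unlike the set intersection, this counts repeated words: 'boat' contributes 2 in both calls, and 'The' is only matched by 'the' when case is ignored.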
