Reducing "for loops for big data" and improving the code



I am trying to make this code (which I wrote) as fast as possible. First, here is the code:

# lemmas is a list of about 20,000 words.
# That is, lemmas = ['apple', 'dog', ... ]
# new_sents is a list of about 12,000 lists, each representing a sentence.
# That is, new_sents = [ ['Hello', 'I', 'am', 'a', 'boy'], ['Hello', 'I', 'am', 'a', 'girl'], ... ]
for x in lemmas:
    for y in lemmas:
        # prevent a zero denominator
        x_count = 0.00001
        y_count = 0.00001
        xy_count = 0
        # Dice denominator
        for i in new_sents:
            x_count += i.count(x)
            y_count += i.count(y)
            if x in i and y in i:
                xy_count += 1
        sim_score = float(xy_count) / (x_count + y_count)

As you can see, there are a huge number of iterations, roughly 20,000 * 20,000 * 12,000, which is far too many. sim_score is the Dice coefficient of the two words. That is, xy_count is the number of sentences in which word x and word y appear together, and x_count and y_count are the total numbers of times words x and y appear in new_sents, respectively.

The code I wrote is too slow. Is there a better way?

Thanks in advance.

You are computing everything twice. Your score is symmetric in x and y, so you can get a 2x speedup by doing this:

for x, y in itertools.combinations(lemmas, 2):

I assume you do not want to compare lemmas[0] with itself, otherwise you could use combinations_with_replacement.
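For example, here is a minimal sketch of how the question's loops could be restructured around combinations (toy data, the same smoothing constants as the question, and an illustrative scores dictionary to hold the results):

import itertools

lemmas = ['apple', 'dog', 'boy']  # toy data for illustration
new_sents = [['Hello', 'I', 'am', 'a', 'boy'],
             ['Hello', 'I', 'am', 'a', 'girl']]

scores = {}
# each unordered pair (x, y) is visited exactly once
for x, y in itertools.combinations(lemmas, 2):
    x_count, y_count, xy_count = 0.00001, 0.00001, 0  # smoothing as in the question
    for i in new_sents:
        x_count += i.count(x)
        y_count += i.count(y)
        if x in i and y in i:
            xy_count += 1
    scores[(x, y)] = float(xy_count) / (x_count + y_count)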

The implementation will also be faster if you look the lemmas up in a set rather than a list.

But you are still counting the same things many times. You can take each lemma, count it in new_sents once, and store the result.
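A minimal sketch of that idea (lemma_set and word_counts are illustrative names): build the per-word totals in a single pass over new_sents and reuse them, instead of recounting inside every pair iteration.

from collections import Counter

lemmas = ['apple', 'dog', 'boy']  # toy data for illustration
new_sents = [['Hello', 'I', 'am', 'a', 'boy'],
             ['Hello', 'I', 'am', 'a', 'girl']]

lemma_set = set(lemmas)  # set membership tests are much faster than list scans
word_counts = Counter()
for sentence in new_sents:
    # count every lemma occurrence once, up front
    word_counts.update(w for w in sentence if w in lemma_set)

# later, word_counts[x] replaces the repeated i.count(x) passes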

Here is an approach that iterates over the sentences, extracts the word combinations, and then counts them relative to the individual word occurrences. This is more efficient because it is number_of_sentences * number_of_words_per_sentence^2:

lemmas = ['apple', 'dog', 'foo', 'bar', 'Hello', 'I', 'am', 'a', 'boy', 'girl']
new_sents = [['Hello', 'I', 'am', 'a', 'boy'], ['Hello', 'I', 'am', 'a', 'girl']]

import itertools
# a Counter is an auto-updating dictionary for counting
from collections import Counter

# we initialize the counter with 1 for smoothing (avoiding 0)
lemmas = Counter({k: 1 for k in lemmas})
# this is where we count the co-occurrences of words
coocurrs = Counter()
# we iterate over the sentences, not the dictionary
for sentence in new_sents:
    # create all the word combinations in the sentence
    combos = (tuple(sorted(pair)) for pair in itertools.combinations(sentence, 2))
    # update the count of each word in the sentence
    lemmas.update(sentence)
    # update the count of each word combination
    coocurrs.update(combos)

probabilities = {}
# convert to "probabilities"
for xy, score in coocurrs.items():
    probabilities[xy] = score / float(lemmas[xy[0]] + lemmas[xy[1]])
print(probabilities)

When you have an n x n matrix over the same set of members, the number of unique, non-repeating combinations is:

(n² - n) / 2

In your case n = 20,000, which works out to (20,000² - 20,000) / 2 = 199,990,000 iterations, a little under 200,000,000. However, the way your code is written now, there are 400,000,000 possibilities:

for x in lemmas:
    for y in lemmas:

In other words, in addition to the case x == 'orange' and y == 'apple', you are also selecting the case x == 'apple' and y == 'orange'. Presumably only one of these is needed.

Finding a way to exclude those unnecessary 200,000,000 iterations will improve the speed.

Beyond that, my suggestion is to convert new_sents into a dictionary and remove this loop entirely:

for i in new_sents

Doing both of these should speed things up significantly. The total number of iterations then stays at around 200 million, and the final lookups use a dictionary, which is much faster than scanning a list. The fast lookups come at the cost of memory, but with a length of 12,000 that should not be a problem.
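One way to read that suggestion, as a hedged sketch (sentence_ids and totals are hypothetical names): map each word to the set of sentence indices containing it, so the co-occurrence count becomes a set intersection instead of a scan over all 12,000 sentences.

import itertools
from collections import defaultdict

lemmas = ['apple', 'dog', 'boy']  # toy data for illustration
new_sents = [['Hello', 'I', 'am', 'a', 'boy'],
             ['Hello', 'I', 'am', 'a', 'girl']]

sentence_ids = defaultdict(set)  # word -> indices of sentences containing it
totals = defaultdict(int)        # word -> total number of occurrences
for idx, sentence in enumerate(new_sents):
    for word in sentence:
        sentence_ids[word].add(idx)
        totals[word] += 1

scores = {}
# unordered pairs only, as discussed above
for x, y in itertools.combinations(lemmas, 2):
    xy_count = len(sentence_ids[x] & sentence_ids[y])
    # 0.00002 mirrors the question's two 0.00001 smoothing terms
    scores[(x, y)] = xy_count / (totals[x] + totals[y] + 0.00002)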
