How can I make element comparison between two lists more efficient? (TF-IDF)



I have been learning how to calculate TF-IDF and trying to program it myself. I have solved the problem, but I think there must be a more efficient way to compute the answer. I tried testing it on a fairly large dataset, and this function appears to be the main bottleneck:

import math

def calc_idf(doc_length, normalized_list, terms_with_freqs):  # TODO there must be a way to make this more efficient
    dict_idf = {}
    for i in range(len(normalized_list)):
        dict_calc_idf = []
        for word in normalized_list[i]:
            for x in range(len(terms_with_freqs)):
                print(terms_with_freqs[x][0])
                if word[0] == terms_with_freqs[x][0]:
                    dict_calc_idf.append(
                        (word[0], math.log(doc_length / terms_with_freqs[x][1], 10)))
        dict_idf[i] = dict_calc_idf
    return dict_idf

normalized_list = {0: [('tom', 0.5), ('earns', 0.5)], 1: [('tom', 0.5), ('castaignede', 0.5)], 2: [('aussie', 0.5), ('mcgrath', 0.5)], 3: [('european', 0.5), ('medal', 0.5)]}
terms_with_freqs = [('tom', 2), ('earns', 1), ('castaignede', 1), ('aussie', 1), ('mcgrath', 1), ('european', 1), ('medal', 1)]
doc_length = len(normalized_list)
dict = calc_idf(doc_length, normalized_list, terms_with_freqs)
print(dict)

Getting out of the C++ mindset and Pythonifying all the range(len(...)) constructs roughly halves the runtime:

def calc_idf2(doc_length, normalized_list, terms_with_freqs):  # TODO there must be a way to make this more efficient
    dict_idf = {}
    for key, words in normalized_list.items():
        dict_calc_idf = []
        for word, discarded_number in words:
            for term, freq in terms_with_freqs:
                #print(term)
                if word == term:
                    dict_calc_idf.append( (term, math.log(doc_length / freq, 10)) )
        dict_idf[key] = dict_calc_idf
    return dict_idf

%timeit from IPython (after commenting out the print call):

In [10]: timeit calc_idf(doc_length, normalized_list, terms_with_freqs)                                                                                                            
15.8 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: timeit calc_idf2(doc_length, normalized_list, terms_with_freqs)                                                                                                           
8.15 µs ± 67.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The next optimization is to avoid the double loop for matching word against term. This is computationally equivalent to a simple dictionary lookup. So, represent terms_with_freqs as a dict from the start, and then:

def calc_idf2b(doc_length, normalized_list, terms_with_freqs):
    dict_idf = {}
    for key, words in normalized_list.items():
        dict_calc_idf = []
        for word, discarded_number in words:
            freq = terms_with_freqs[ word ]
            dict_calc_idf.append( (word, math.log(doc_length / freq, 10)) )
        dict_idf[key] = dict_calc_idf
    return dict_idf

Test:

In [31]: terms_with_freqs_dict = dict(terms_with_freqs)                                                                                                                            
In [32]: timeit calc_idf2b(doc_length, normalized_list, terms_with_freqs_dict)                                                                                                     
5.61 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This should be more than just a proportional speedup. Since we have replaced an unnecessary nested loop with a dictionary lookup of lower computational complexity, the algorithm should now scale much better to larger dataset sizes.
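To see that scaling difference in practice, one can time both variants on a larger synthetic corpus. The sketch below restates the two functions side by side and times them with the standard-library timeit module; the corpus and term names are made up purely for illustration:

```python
import math
import timeit

def calc_idf_nested(doc_length, normalized_list, terms_with_freqs):
    # List version: scans the entire term list for every word.
    dict_idf = {}
    for key, words in normalized_list.items():
        dict_calc_idf = []
        for word, _ in words:
            for term, freq in terms_with_freqs:
                if word == term:
                    dict_calc_idf.append((term, math.log(doc_length / freq, 10)))
        dict_idf[key] = dict_calc_idf
    return dict_idf

def calc_idf_lookup(doc_length, normalized_list, terms_with_freqs):
    # Dict version: one O(1) lookup per word.
    dict_idf = {}
    for key, words in normalized_list.items():
        dict_calc_idf = []
        for word, _ in words:
            dict_calc_idf.append(
                (word, math.log(doc_length / terms_with_freqs[word], 10)))
        dict_idf[key] = dict_calc_idf
    return dict_idf

# Synthetic corpus: 500 documents over a 500-term vocabulary.
vocab = ["term%d" % i for i in range(500)]
normalized_list = {i: [(vocab[i], 0.5), (vocab[(i + 1) % 500], 0.5)]
                   for i in range(500)}
terms_list = [(t, 2) for t in vocab]   # each term occurs in 2 documents
terms_dict = dict(terms_list)
n = len(normalized_list)

t_loop = timeit.timeit(lambda: calc_idf_nested(n, normalized_list, terms_list), number=5)
t_dict = timeit.timeit(lambda: calc_idf_lookup(n, normalized_list, terms_dict), number=5)
print("nested loops: %.4fs  dict lookup: %.4fs" % (t_loop, t_dict))
```

The gap between the two timings grows with the vocabulary size, since the nested-loop version does work proportional to the vocabulary for every single word.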

Finally, you will want to construct terms_with_freqs_dict as a dictionary from scratch rather than building it as a list and then converting. You can also precompute math.log(freq, 10) for every entry in that dictionary instead of storing the raw frequencies, and precompute math.log(doc_length, 10), since it does not change during the loop. Then, inside the loop, simply subtract one from the other (subtracting logs is the same as dividing):

def calc_idf3(normalized_list, terms_with_logged_freqs):
    ldl = math.log(len(normalized_list), 10)
    return { key : [
            ( word, ldl - terms_with_logged_freqs[word] )
            for word, discarded_number in words
        ] for key, words in normalized_list.items() }

Assuming the preprocessing:

terms_with_logged_freqs = { term : math.log(freq, 10) for term, freq in terms_with_freqs }

Test:

In [43]: timeit calc_idf3(normalized_list, terms_with_logged_freqs)                                                                                                                
3.62 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
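The log identity this relies on (taking the log of a quotient is the same as subtracting two logs) is easy to sanity-check numerically:

```python
import math

n, freq = 4, 2
divided = math.log(n / freq, 10)                   # log of the quotient
subtracted = math.log(n, 10) - math.log(freq, 10)  # difference of precomputed logs
print(divided, subtracted)  # both are approximately 0.30103
```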

As noted in the comments, your choice of data types is not ideal.

normalized_list is currently a dict containing lists, but it really should be a list of lists. terms_with_freqs would be expressed more efficiently as a dict.

By doing this, you can eliminate at least one nested loop. You also get rid of many of the range(len(...)) calls.

The code below gives the same result as your original code:

import math

def calc_idf(doc_length, normalized_list, terms_with_freqs):  # TODO there must be a way to make this more efficient
    dict_idf = {}
    for i, word_list in enumerate(normalized_list):
        dict_calc_idf = []
        for word, word_freq in word_list:
            dict_calc_idf.append(
                (word, math.log(doc_length / terms_with_freqs[word], 10)))
        dict_idf[i] = dict_calc_idf
    return dict_idf

normalized_list = [[('tom', 0.5), ('earns', 0.5)], [('tom', 0.5), ('castaignede', 0.5)], [('aussie', 0.5), ('mcgrath', 0.5)], [('european', 0.5), ('medal', 0.5)]]
terms_with_freqs = {'tom': 2, 'earns': 1, 'castaignede': 1, 'aussie': 1, 'mcgrath': 1, 'european': 1, 'medal': 1}
doc_length = len(normalized_list)
dict = calc_idf(doc_length, normalized_list, terms_with_freqs)
print(dict)

However, you are already looping over the sentences to generate normalized_list. Can't you compute the IDF within that loop? And if you need to generate terms_with_freqs, you can use collections.Counter, which already gives you a dict. Its use looks something like:

from collections import Counter
terms_with_freqs = Counter(your_text.split())
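Putting both suggestions together, here is a sketch of computing document frequencies with Counter and the IDF values in the same pass that tokenizes the text. The documents list is a hypothetical stand-in for your actual sentences:

```python
import math
from collections import Counter

# Hypothetical input; in practice these would be your sentences.
documents = ["tom earns", "tom castaignede", "aussie mcgrath", "european medal"]

# Count document frequencies: a term counts once per document it appears in.
doc_freqs = Counter()
tokenized = []
for doc in documents:
    words = doc.split()
    tokenized.append(words)
    doc_freqs.update(set(words))  # set() so repeats within a doc count once

# IDF per document, computed directly from the tokenization loop's output.
n = len(documents)
idf = {i: [(w, math.log(n / doc_freqs[w], 10)) for w in words]
       for i, words in enumerate(tokenized)}
print(idf)
```

Note that document frequency (the number of documents containing a term, as counted here via set()) is what IDF needs, rather than the total number of occurrences of the term.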
