Fastest way to count a list of words in an article using Python

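I want to find how many times the words from a bag of words occur in an article. I am not interested in the frequency of each individual word, only in the total number of times any of them is found in the article. I have to analyse hundreds of articles as I retrieve them from the internet, and my algorithm takes a long time because each article is about 800 words long.

Here is what I do (amount is the number of times the words were found in a single article, article is a string containing all the words of the article content, and I use NLTK for tokenization):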

bag_of_words = tokenize(bag_of_words)
tokenized_article = tokenize(article)
occurrences = [word for word in tokenized_article
               if word in bag_of_words]
amount = len(occurrences)

where tokenized_article looks like this:

[u'sarajevo', u'bosnia', u'herzegovi', u'war', ...]

bag_of_words has the same form.

I am wondering whether there is a more efficient/faster way of doing this, for instance using NLTK or lambda functions.

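I suggest using a set for the words you are counting: a set has constant-time membership tests (on average) and is therefore faster than a list, which has linear-time membership tests.

For example: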

bag_of_words_set = set(bag_of_words)  # build the set once, not on every iteration
occurrences = [word for word in tokenized_article
               if word in bag_of_words_set]
amount = len(occurrences)

Some timing tests (using an artificially created list, repeated ten times):

In [4]: words = s.split(' ') * 10
In [5]: len(words)
Out[5]: 1060
In [6]: to_match = ['NTLK', 'all', 'long', 'I']
In [9]: def f():
...:     return len([word for word in words if word in to_match])
In [13]: timeit(f, number = 10000)
Out[13]: 1.0613768100738525
In [14]: set_match = set(to_match)
In [15]: def g():
...:     return len([word for word in words if word in set_match])
In [18]: timeit(g, number = 10000)
Out[18]: 0.6921310424804688
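
For a reproducible version of the comparison above, here is a minimal sketch. The sample text and the words to match are made up for illustration (the original `s` is not shown), and absolute timings depend on the machine, so no numbers are quoted here.

```python
import timeit

# Hypothetical sample data; the text used in the session above is not shown.
text = "I have a long article and I want to count all the words in it"
words = text.split(" ") * 10
to_match = ["all", "long", "I", "words"]
set_match = set(to_match)  # built once, outside the timed functions


def count_with_list():
    # Each `in` test scans the list: O(len(to_match)) per word.
    return sum(1 for word in words if word in to_match)


def count_with_set():
    # Each `in` test is a hash lookup: O(1) on average per word.
    return sum(1 for word in words if word in set_match)


print("list:", timeit.timeit(count_with_list, number=1000))
print("set: ", timeit.timeit(count_with_set, number=1000))
```

With only a handful of words to match the gap is modest; it should widen as the bag of words grows, since the list scan gets longer while the set lookup does not.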

Some other tests:

In [22]: p = re.compile('|'.join(set_match))
In [23]: p
Out[23]: re.compile(r'I|all|NTLK|long')
In [24]: p = re.compile('|'.join(set_match))
In [28]: def h():
...:     return len(filter(p.match, words))
In [29]: timeit(h, number = 10000)
Out[29]: 2.2606470584869385
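
One caveat about the regex variant: `p.match` only anchors at the start of each token, so a pattern such as `I|all` also accepts a token like `allow`, and `len(filter(...))` only works on Python 2, where `filter` returns a list. The following is a hedged sketch of an exact-match version for Python 3; the sample tokens are made up for illustration.

```python
import re

set_match = {"NLTK", "all", "long", "I"}
words = ["I", "allow", "all", "longer", "long", "NLTK"]  # hypothetical tokens

# re.escape guards against regex metacharacters inside the words.
pattern = re.compile("|".join(re.escape(w) for w in set_match))

# fullmatch requires the whole token to match one of the alternatives.
exact = sum(1 for w in words if pattern.fullmatch(w))

# match only anchors at the start, so tokens with a matching prefix count too.
prefix = sum(1 for w in words if pattern.match(w))

print(exact, prefix)
```

Even with this fix, a plain set lookup is the simpler tool when the tokens are already split.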

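Use a set for the membership test.

Another approach is to count the occurrences of each distinct word first and add a word's count to the total only if the word is in the bag. This pays off when the article contains repeated words and is not very short. Say an article contains "the" 10 times: we now test membership once instead of 10 times.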

from collections import Counter

def f():
    # presumably check is the tokenized article and words the bag of words
    return sum(c for word, c in Counter(check).items() if word in words)
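
A self-contained sketch of the same idea follows. The function name `total_matches` and the sample data are assumptions made for illustration; they are not from the answer above.

```python
from collections import Counter

bag_of_words = {"war", "bosnia"}  # hypothetical bag of words, stored as a set
tokenized_article = ["war", "in", "bosnia", "the", "war", "the", "the"]


def total_matches(tokens, wanted):
    """Sum the counts of every distinct token that is in `wanted`."""
    counts = Counter(tokens)  # one pass over the article
    # One membership test per distinct word rather than per token.
    return sum(c for word, c in counts.items() if word in wanted)


amount = total_matches(tokenized_article, bag_of_words)
print(amount)
```

Here "the" occurs three times but is tested against the set only once.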

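If you don't need counts, it is no longer a "bag of words" but a set of words. If that is really the case, convert your document to a set.

Avoid for loops and lambda functions, nested ones in particular. They require a lot of interpreter work and are slow. Instead, try to use optimized calls such as intersection (for performance, libraries such as numpy are also very good, because they do the work in low-level C/Fortran/Cython code).

i.e.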

count = len(bag_of_words_set.intersection(set(tokenized_article)))

where bag_of_words_set is the words you are interested in, as a set.
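
Note that the intersection counts each matching word once, however often it appears in the article. The small sketch below, using made-up tokens, contrasts that with the total number of occurrences the question asks for.

```python
bag_of_words_set = {"war", "bosnia"}  # hypothetical words of interest
tokenized_article = ["war", "in", "bosnia", "war"]

# Number of distinct wanted words that appear at least once.
distinct = len(bag_of_words_set.intersection(tokenized_article))

# Total number of occurrences of wanted words, repeats included.
total = sum(1 for w in tokenized_article if w in bag_of_words_set)

print(distinct, total)
```

`set.intersection` accepts any iterable, so the explicit `set(...)` around the article is optional.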

If you want a classic word count instead, use collections.Counter:

from collections import Counter
counter = Counter()
...
counter.update(tokenized_article)

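This counts all words, though, including those not in your list. You can try the following instead, but it may turn out slower because of the loop: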

bag_of_words_set = set(bag_of_words)
...
for w in tokenized_article:
    if w in bag_of_words_set:  # use a set, not a list!
        counter[w] += 1

Slightly more complex, but possibly faster, is to use two Counters: one for the overall total and one for the current document.

doc_counter.clear()
doc_counter.update(tokenized_article)
for w in list(doc_counter):  # copy the keys; deleting while iterating fails on Python 3
    if w not in bag_of_words_set:
        del doc_counter[w]
counter.update(doc_counter)  # untested.

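Using a counter for the document is beneficial if you have many repeated unwanted words, because it saves lookups. It is also better suited to multithreaded operation (easier to synchronize).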
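
The two-counter idea can be sketched end to end as follows. The list `articles` and its contents are made up for illustration, and a dict comprehension replaces the in-place deletion, which is a variation on the approach rather than the code above.

```python
from collections import Counter

bag_of_words_set = {"war", "bosnia"}  # hypothetical words of interest
articles = [  # hypothetical, already tokenized articles
    ["war", "in", "bosnia", "war"],
    ["the", "war", "ended"],
]

counter = Counter()  # running total across all articles
for tokenized_article in articles:
    doc_counter = Counter(tokenized_article)  # counts for this article only
    # Keep only wanted words: one set lookup per distinct word.
    wanted = {w: c for w, c in doc_counter.items() if w in bag_of_words_set}
    counter.update(wanted)  # adds the per-article counts to the total

amount = sum(counter.values())
print(counter, amount)
```

Because each article gets its own counter, the per-article work could be done independently and only the final `update` needs to be synchronized.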
