How to reduce the time taken to filter an article dataset

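I am trying to filter a dataset containing nearly 50K articles. I want to remove stop words and punctuation from every article, but the process takes a long time. Filtering this dataset took 6 hours, and I now have another dataset of 300K articles to filter.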

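I am using Python in an Anaconda environment. PC configuration: 7th-generation Core i5, 8GB RAM, and an NVIDIA 940MX GPU. To filter the dataset, I wrote code that takes each article, tokenizes it into words, and then removes stop words, punctuation, and digits.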

def sentence_to_wordlist(sentence, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n?,।!\u200d.\'0123456789০১২৩৪৫৬৭৮৯‘\u200c–“”…‘'):
    # Map every filtered character to a space, then split on whitespace.
    translate_dict = dict((c, ' ') for c in filters)
    translate_map = str.maketrans(translate_dict)
    wordlist = sentence.translate(translate_map).split()
    # stops is the stop-word collection, defined elsewhere.
    return list(filter(lambda x: x not in stops, wordlist))

Now I want to reduce the time this process takes. Is there any way to optimize it?

I have tried to optimize your process:

from multiprocessing.pool import Pool

from nltk.corpus import stopwords

# A set gives constant-time membership tests.
cachedStopWords = set(stopwords.words("english"))
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n?,।!\u200d.\'0123456789০১২৩৪৫৬৭৮৯‘\u200c–“”…‘'
# Build the translation table once instead of on every call.
translate_table = str.maketrans('', '', filters)

def sentence_to_wordlist(sentence):
    wordlist = sentence.translate(translate_table).split()
    return [w for w in wordlist if w not in cachedStopWords]

if __name__ == "__main__":
    # The guard is needed where worker processes are spawned (e.g. Windows).
    with Pool(10) as p:
        results = p.map(sentence_to_wordlist, data)

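Here, data is the list containing your articles.
I used the stop words from nltk, but you can use your own. Make sure your stop words are a set rather than a list, because checking whether an element is in a set takes O(1) time on average, while for a list it takes O(n).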
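
To make the set-versus-list point concrete, here is a minimal sketch that times the same filtering step with both container types. The stop words and the tokenized article are made up for illustration, and the timings will differ from machine to machine.

```python
import timeit

# Hypothetical stop-word collection: 1,000 made-up words.
stop_list = [f"stop{i}" for i in range(1000)]
stop_set = set(stop_list)

# Hypothetical tokenized article. Most words are not stop words, which is the
# slowest case for a list because every element has to be compared.
words = ["article", "text", "stop999", "more", "words"] * 200

def filter_with(stops):
    """Drop every word that appears in the given stop-word container."""
    return [w for w in words if w not in stops]

list_time = timeit.timeit(lambda: filter_with(stop_list), number=20)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=20)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both calls return the same words; only the lookup cost differs, and the gap grows with the size of the stop-word collection.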

I tested this with a list of 100k articles, each about 2k characters long, and it took less than 9 seconds.
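
One difference to be aware of: the code in the question maps each filtered character to a space, while a table built with str.maketrans('', '', filters) deletes it, so words separated only by punctuation get merged. Below is a small sketch of the two behaviours. It uses a shortened, hypothetical filter string and a made-up sentence, not the full filter list above.

```python
# Shortened filter string for illustration only (an assumption, not the full
# list used in the answer).
filters = ',.!?0123456789'

# Variant 1: delete every filtered character (three-argument maketrans).
delete_table = str.maketrans('', '', filters)

# Variant 2: replace every filtered character with a space (dict form).
space_table = str.maketrans({c: ' ' for c in filters})

sentence = "fast,cheap and good.Pick 2!"

deleted = sentence.translate(delete_table).split()
spaced = sentence.translate(space_table).split()

print(deleted)  # ['fastcheap', 'and', 'goodPick']
print(spaced)   # ['fast', 'cheap', 'and', 'good', 'Pick']
```

If your articles often have no space after punctuation, the dict form is probably the safer choice. Either table can be built once, outside the function.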
