Speed up scraping with Python multiprocessing?



I am writing a script that scrapes comments from Reddit using the praw module and applies sentiment analysis to the scraped content with nltk.sentiment.SentimentIntensityAnalyzer. Here is my function:

def analyze_keyword_reddit(keywords):
    results = reddit.subreddit("all").search(keywords, sort="comments", limit=None)
    all_posts = ""
    all_comments = 0
    all_upvotes = 0
    for post in results:
        all_comments += post.num_comments
        all_upvotes += post.score
        # flatten the comment tree and collect every comment body for this keyword
        submission = reddit.submission(id=post.id)
        submission.comments.replace_more(limit=0, threshold=10)
        posts = " ".join([comment.body.lower() for comment in submission.comments.list()])
        all_posts = all_posts + " " + posts
    polarity_scores = sia.polarity_scores(all_posts)
    return {**{'all_comments': all_comments, 'all_upvotes': all_upvotes}, **polarity_scores}
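
For reference, reddit and sia are module-level objects, set up roughly as sketched below (the credentials are placeholders, not my actual values, and the vader_lexicon resource has to be downloaded once for nltk):

import praw
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon') once

reddit = praw.Reddit(client_id="...",          # placeholder credentials
                     client_secret="...",
                     user_agent="comment-sentiment-scraper")
sia = SentimentIntensityAnalyzer()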

When I profile this function, more than 90% of the runtime is spent in the step that flattens the comment tree (this line: submission.comments.replace_more(limit=0, threshold=5)). Unfortunately, I have not found a way to significantly improve the runtime of that call (other than setting the threshold parameter and thereby limiting the number of comments retrieved), so I started exploring multiprocessing. Using the function above, I ran the following:

import multiprocessing
import time

if __name__ == '__main__':
    tic = time.time()
    with multiprocessing.Pool() as p:
        print(p.starmap(analyze_keyword_reddit, ['$u unity stock','$fsly fastly stock','$ttcf tattooed chef stock']))
    toc = time.time()

Then, for comparison, I also ran the following version without multiprocessing:

tic = time.time()
for t in ['$u unity stock','$fsly fastly stock','$ttcf tattooed chef stock']:
    print(analyze_keyword_reddit(t))
toc = time.time()
print(f"Scraping without multiprocessing: {round(toc-tic,2)} seconds.")

The plain version took 128 seconds. The multiprocessing version, however, appears to spawn four worker processes that compute exactly the same input four times in parallel (this became obvious when I added a simple print(keywords) to analyze_keyword_reddit()). Those four runs took 708, 727, 729 and 731 seconds respectively. On top of that, the multiprocessing snippet seems to be stuck in an endless loop: it does not stop after the third keyword.

Where did I go wrong with my implementation? My goal is to speed up the scraping; should I go for an entirely different implementation instead?


EDIT: Following Ron Serruya's great reply below, I updated the code so that it now uses map() instead of starmap():

from multiprocessing import Pool

if __name__ == '__main__':
    tic = time.time()
    with Pool() as p:
        print(p.map(analyze_keyword_reddit, ['$u unity stock','$fsly fastly stock','$ttcf tattooed chef stock']))
    toc = time.time()
    print(f"Scraping with multiprocessing: {round(toc-tic,2)} seconds.")

This time, however, the process does not even seem to start. The function analyze_keyword_reddit apparently never gets evaluated: I added a print at its very beginning and no output appears in the terminal at all.
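
For what it is worth, one pattern I am experimenting with is to build the praw client and the analyzer once per worker process via a Pool initializer, so that every worker has its own reddit and sia objects. This is only a sketch: the credentials are placeholders and analyze_keyword_reddit is assumed to be defined as above.

import multiprocessing

import praw
from nltk.sentiment import SentimentIntensityAnalyzer

reddit = None
sia = None

def init_worker():
    # runs once in every worker process of the pool
    global reddit, sia
    reddit = praw.Reddit(client_id="...",        # placeholder credentials
                         client_secret="...",
                         user_agent="comment-sentiment-scraper")
    sia = SentimentIntensityAnalyzer()

if __name__ == '__main__':
    # analyze_keyword_reddit as defined above
    keywords = ['$u unity stock', '$fsly fastly stock', '$ttcf tattooed chef stock']
    with multiprocessing.Pool(initializer=init_worker) as p:
        print(p.map(analyze_keyword_reddit, keywords))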

Try using map instead of starmap

starmap expects the arguments to be iterables, which it then unpacks into the function call.

So p.starmap(analyze_keyword_reddit, ["hello", "world"]) will call analyze_keyword_reddit('h', 'e', 'l', 'l', 'o') and analyze_keyword_reddit('w', 'o', 'r', 'l', 'd').

I can only assume this makes the Reddit library work a lot harder than it needs to.

In [1]: from multiprocessing import Pool
In [2]: with Pool() as p:
...:     p.starmap(print, ["hello", "world"])
...:
h e l l o
w o r l d
In [3]: with Pool() as p:
...:     p.map(print, ["hello", "world"])
...:
hello
world
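
To make the distinction concrete, here is a small sketch (the scrape helper is hypothetical, not from the question): starmap is for work items that are tuples of positional arguments, while map passes each item to the function as a single argument.

from multiprocessing import Pool

# hypothetical two-argument helper, used only to illustrate the calling conventions
def scrape(keyword, limit):
    return f"{keyword}:{limit}"

if __name__ == '__main__':
    with Pool() as p:
        # map: each item is passed as one argument, e.g. len('$u unity stock')
        lengths = p.map(len, ['$u unity stock', '$fsly fastly stock'])
        # starmap: each tuple is unpacked into positional arguments,
        # i.e. scrape('$u unity stock', 10) and scrape('$fsly fastly stock', 10)
        pairs = p.starmap(scrape, [('$u unity stock', 10), ('$fsly fastly stock', 10)])
    print(lengths)  # [14, 18]
    print(pairs)    # ['$u unity stock:10', '$fsly fastly stock:10']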
