Efficient way to scrape Reddit with PRAW?

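I'm working on a project that involves collecting comments and then running sentiment analysis over a very large (10k+) set of search terms. The number of comments to scrape per term is not large: I only want comments from the last week (or at most the last month). However, I find the speed quite disappointing. For example, even the very simple snippet below takes more than 3 minutes to run. At that rate, with 14,000 terms in total, my code would have to run continuously for about a month to finish the job!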

import praw

# `reddit` is assumed to be an authenticated praw.Reddit instance.
term = "fastly stock fsly"
results = reddit.subreddit("all").search(term, sort="comments", limit=None)
for submission in results:
    for top_level_comment in submission.comments:
        if not isinstance(top_level_comment, praw.models.MoreComments):
            if all(word for word in term.lower().split() if word in top_level_comment.body.lower()):
                print(top_level_comment.body)

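Is it possible to cut the processing time substantially? I know a double for loop is a terrible construct, but I'm not sure it can be avoided here. I also realize the inner for loop may iterate over a very long list even though I most likely only need the first 10-20 comments, and it isn't clear to me whether subreddit.search() can be restricted to a specific time period. I didn't see any such parameter in the documentation.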

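To save some time, you can compute term.lower().split() once and store it in a list, instead of splitting the string again on every pass through the for loop (which wastes processing time).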

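If you want more speed, a try/except block is preferable to isinstance here: it follows Python's "easier to ask forgiveness than permission" idiom, and a try block costs almost nothing when no exception is raised.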

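Even with this adjustment you still have three nested loops (submissions, comments, and the words iterated inside all()), so the work grows with the product of those three sizes, roughly N^3 if each is of size N.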

term = "fastly stock fsly"
terms = term.lower().split()  # split once, outside the loops
results = reddit.subreddit("all").search(term, sort="comments", limit=None)

for submission in results:
    for top_level_comment in submission.comments:
        try:
            if all(word for word in terms if word in top_level_comment.body.lower()):
                print(top_level_comment.body)
        except AttributeError:
            # MoreComments placeholders have no .body attribute; skip them.
            pass
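
On the question of limiting the search to a time period: PRAW's Subreddit.search() accepts a time_filter argument (for example "week" or "month"). The following is only a rough sketch of how that could be combined with a cap on the number of comments inspected. The function name recent_top_level_comments and the max_submissions and max_comments parameters are made up for illustration, and reddit is assumed to be an authenticated praw.Reddit instance passed in by the caller.

```python
from itertools import islice


def recent_top_level_comments(reddit, term, time_filter="week",
                              max_submissions=25, max_comments=20):
    """Yield bodies of top-level comments from recent search results.

    `reddit` is assumed to be an authenticated praw.Reddit instance;
    the function name and the two `max_*` parameters are hypothetical.
    """
    results = reddit.subreddit("all").search(
        term, sort="comments", time_filter=time_filter, limit=max_submissions
    )
    for submission in results:
        # Reading submission.comments typically makes PRAW fetch the comment
        # tree, so capping the number of submissions also caps the requests.
        for comment in islice(submission.comments, max_comments):
            # MoreComments placeholders have no .body; skip them.
            body = getattr(comment, "body", None)
            if body is not None:
                yield body
```

Calling list(recent_top_level_comments(reddit, "fastly stock fsly", time_filter="month")) would then collect at most 20 entries per submission from the last month. Since most of the run time is likely spent waiting on Reddit rather than in the Python loops, fetching fewer submissions is probably where the largest saving is.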

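As for your if all() condition: the generator only yields the words that match, and a non-empty string is truthy, so the condition always evaluates to true (even when nothing matches, because all() of an empty iterable is True). I'm not sure that is what you intended. If a single matching word is enough, there are ways to end the search sooner, namely as soon as the first word from your list matches.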
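
To make the difference concrete, here is a small sketch of the two likely intentions. The helper names contains_all and contains_any are hypothetical, and both use plain substring matching, as the original snippet does.

```python
def contains_all(terms, text):
    """True only if every term occurs in text (case-insensitive)."""
    body = text.lower()
    return all(word in body for word in terms)


def contains_any(terms, text):
    """True as soon as one term occurs in text; any() short-circuits."""
    body = text.lower()
    return any(word in body for word in terms)


terms = "fastly stock fsly".lower().split()

# The original form filters the words and then tests their truthiness,
# so it is True even for a comment that matches nothing.
unrelated = "I like turtles"
original_form = all(word for word in terms if word in unrelated.lower())

print(original_form)                                      # True
print(contains_all(terms, unrelated))                     # False
print(contains_any(terms, "FSLY looks cheap"))            # True
print(contains_all(terms, "Fastly stock (FSLY) is up"))   # True
```

Which helper fits depends on whether a comment has to mention every word of the term or just one of them.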
