How to get stats values after CrawlerProcess finishes, i.e. on the line after process.start()

I use this code somewhere inside my spider:

raise scrapy.exceptions.CloseSpider('you_need_to_rerun')

So when this exception is raised, my spider eventually closes, and this entry shows up in the stats printed to the console:

'finish_reason': 'you_need_to_rerun',

But how can I get it from code? I want to rerun the spider in a loop based on this stat, something like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import spaida.spiders.spaida_spider
import spaida.settings

you_need_to_rerun = True
while you_need_to_rerun:
    process = CrawlerProcess(get_project_settings())
    process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
    process.start(stop_after_crawl=False)  # the script will block here until the crawling is finished
    finish_reason = 'and here I get somehow finish_reason from stats'  # <- how??
    if finish_reason == 'finished':
        print("everything ok, I don't need to rerun this")
        you_need_to_rerun = False

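I found this in the docs but couldn't get it to work: "The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider domain name." https://doc.scrapy.org/en/latest/topics/stats.html#scrapy.statscollectors.MemoryStatsCollector.spider_stats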

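PS: When I use process.start() I also get the error twisted.internet.error.ReactorNotRestartable, along with the suggestion to use process.start(stop_after_crawl=False). With that, the spider just stops and does nothing, but that is a separate problem...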

You need to access the stats object through the Crawler object:

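process = CrawlerProcess(get_project_settings())
# process.crawlers is a set that is emptied once a crawl ends, so keep your own reference
crawler = process.create_crawler(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
process.crawl(crawler)
process.start()  # blocks until the crawl is finished
reason = crawler.stats.get_value('finish_reason')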
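
For the rerun loop in the question, one possible approach is to run each crawl in its own child process, which sidesteps ReactorNotRestartable because every run starts with a fresh Twisted reactor. The following is only a minimal sketch under that assumption: the helper names `_crawl_in_child`, `crawl_once`, and `run_until_finished`, as well as the `max_attempts` cap, are hypothetical and not part of Scrapy. The stats are still read from the Crawler object, as described in the answer.

```python
from concurrent.futures import ProcessPoolExecutor


def _crawl_in_child(spider_cls):
    """Run one crawl and return its finish_reason (meant to run in a child process)."""
    # Imported inside the function so the import happens in the child process.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    # Keep our own reference to the Crawler so its stats stay reachable after the crawl.
    crawler = process.create_crawler(spider_cls)
    process.crawl(crawler)
    process.start()  # blocks until the crawl is finished
    return crawler.stats.get_value("finish_reason")


def crawl_once(spider_cls):
    """Run a single crawl in a fresh process, so each run gets a new reactor."""
    with ProcessPoolExecutor(max_workers=1) as pool:
        return pool.submit(_crawl_in_child, spider_cls).result()


def run_until_finished(run_once, max_attempts=5):
    """Call run_once() until it returns 'finished' or max_attempts is reached.

    Returns a tuple (number_of_attempts, last_finish_reason).
    """
    reason = None
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        reason = run_once()
        if reason == "finished":
            break
    return attempts, reason
```

With the project from the question, this might be called as `run_until_finished(lambda: crawl_once(spaida.spiders.spaida_spider.SpaidaSpiderSpider))`, ideally under an `if __name__ == "__main__":` guard so it also works on platforms that spawn child processes instead of forking. The attempt cap is there so that a spider which always raises CloseSpider('you_need_to_rerun') cannot loop forever.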
