Why does Scrapy CrawlerProcess hand control back to the lines before it, causing it and the preceding line to run 2-3 times?



I'll do my best to explain what is happening, because I'm completely lost.

print("check")
process = CrawlerProcess(get_project_settings())
process.crawl('BoxOfficeSpider', df=movie_data)    
process.start()

So I've narrowed the problem down to this specific block of code. Here is what happens: with process = CrawlerProcess() it executes as expected, line by line, top to bottom, and then finishes. However, when I pass in get_project_settings(), for some reason it runs the second line and then re-runs the first line! For example, it prints check twice. Using the debugger, I've confirmed that it actually moves the execution head (probably not the right term, but I mean the line currently being executed) back exactly one executable line (meaning comments don't affect it).
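One quick way to check whether the whole file is being executed more than once is a one-liner at the very top of the script (just a sketch; on a direct run it prints '__main__', and if the file gets imported again it prints a second time with the module's import name instead):

# Sketch: place at the very top of the script that calls CrawlerProcess.
# If this prints twice, the file is being loaded twice, once as __main__
# and once as an ordinary module import.
print("module loaded as:", __name__)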

Here is my settings file:

#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'extractData'
SPIDER_MODULES = ['extractData.spiders']
NEWSPIDER_MODULE = 'extractData.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 4
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 1
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 3
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# LOG_LEVEL = 'INFO'
LOG_ENABLED = True

I've done a lot of debugging on this, but I believe it is specifically tied to CrawlerProcess and get_project_settings(), because removing get_project_settings() completely fixes the problem (except that I need the project-wide settings). Diving into the documentation hasn't revealed why. Any help is greatly appreciated.
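For reference, the "remove get_project_settings()" workaround I mentioned looks roughly like this: the handful of values from settings.py can be passed to CrawlerProcess directly as a dict (a minimal sketch; movie_data and BoxOfficeSpider are the same objects as in the snippet above):

from scrapy.crawler import CrawlerProcess

# Minimal sketch: feed the project settings to CrawlerProcess as a plain dict
# instead of calling get_project_settings(). SPIDER_MODULES is included so
# process.crawl() can still look the spider up by its name string.
process = CrawlerProcess(settings={
    'BOT_NAME': 'extractData',
    'SPIDER_MODULES': ['extractData.spiders'],
    'NEWSPIDER_MODULE': 'extractData.spiders',
    'ROBOTSTXT_OBEY': True,
    'CONCURRENT_REQUESTS': 4,
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 1,
    'AUTOTHROTTLE_MAX_DELAY': 3,
    'LOG_ENABLED': True,
})
process.crawl('BoxOfficeSpider', df=movie_data)  # movie_data is built earlier in the script
process.start()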

In case it provides more information: there are 2 spiders in my spiders folder, but only 1 is actually called. The other one is set up correctly settings-wise, so it shouldn't interfere; I just thought I should add it in case it somehow affects things.

import scrapy

class FilmRatingsSpider(scrapy.Spider):
    name = "FilmRatingsSpider"
    allowed_domains = ["filmratings.com"]
    start_urls = ['filmratings.com']
    custom_settings = {
        'LOG_FILE': 'film_ratings_spider.log',
        'ITEM_PIPELINES': {'extractData.pipelines.FilmRatingsPipeline': 400}
    }

    def parse(self, response, tconst):
        pass

    def __init__(self, df):
        pass
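For completeness, BoxOfficeSpider (the spider that actually gets called) isn't shown here; roughly, it receives the DataFrame through the keyword argument passed to process.crawl(). This is only a hypothetical sketch of that part, not the real spider:

import scrapy

class BoxOfficeSpider(scrapy.Spider):
    # Hypothetical sketch: the real BoxOfficeSpider is not shown in this post.
    name = 'BoxOfficeSpider'

    def __init__(self, df=None, *args, **kwargs):
        # Extra keyword arguments given to process.crawl() (df=movie_data above)
        # are forwarded by Scrapy to the spider's __init__.
        super().__init__(*args, **kwargs)
        self.df = df

    def parse(self, response):
        pass  # scraping logic omitted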

Since the problem happens before it even reaches the third line, I think it has to be related to CrawlerProcess, or more likely settings.py; I just don't know how or why.

I figured it out, and I feel pretty dumb. For a reason I still haven't fully worked out, when I ran get_project_settings() the entire Python file was executed a second time. It is probably because it imports the __init__.py file (which I was calling get_project_settings() from), which caused the script to run again. Adding a simple if __name__ == '__main__': guard and putting all the code under it fixed it completely.
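In code, the fix is roughly this: the same snippet as at the top of the question, just wrapped in the guard so it only runs when the file is executed directly, not when it is imported again.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    # Runs only on direct execution, so a re-import triggered while the
    # project is being loaded no longer re-runs this block.
    print("check")
    process = CrawlerProcess(get_project_settings())
    process.crawl('BoxOfficeSpider', df=movie_data)  # movie_data is built earlier in this script
    process.start()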
