Why does running multiple scraping spiders through CrawlerProcess make the spider_idle signal handler fail?

I need to make thousands of requests that require a session token for authorization.

Queueing all the requests at once causes thousands of them to fail, because the session token expires before the later requests are issued.

So instead I issue a batch of requests small enough to reliably complete before the session token expires.

When a batch of requests completes, the spider_idle signal fires.

If further requests are needed, the signal handler requests a new session token to use for the next batch of requests.
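
For context, the batching logic looks conceptually like the sketch below. It is a minimal illustration rather than the repository's code: get_session_token, pending_urls, and batch_size are placeholder names I am assuming for the token request and the work queue.

import scrapy
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher


class BatchingSpider(scrapy.Spider):
    name = 'batching_example'
    batch_size = 100  # sized so one batch reliably finishes before the token expires

    def __init__(self):
        dispatcher.connect(self.spider_idle, signals.spider_idle)
        self.pending_urls = []  # in reality, thousands of URLs to fetch

    def get_session_token(self):
        # Placeholder: the real spider requests a fresh token from the auth service.
        return 'fresh-token'

    def spider_idle(self, spider):
        # Fires once the current batch has drained; queue the next batch.
        if self.pending_urls:
            token = self.get_session_token()
            batch = self.pending_urls[:self.batch_size]
            self.pending_urls = self.pending_urls[self.batch_size:]
            for url in batch:
                spider.crawler.engine.crawl(
                    Request(url,
                            headers={'Authorization': token},
                            dont_filter=True),
                    spider)
            raise DontCloseSpider('More batches pending')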

This works when running one spider normally, and also when running a single spider through CrawlerProcess.

However, when multiple spiders run through CrawlerProcess, the spider_idle signal handler fails.

One spider runs its spider_idle handler as expected, but the others fail with the following exception:

2019-06-14 10:41:22 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_idle of <SpideIdleTest None at 0x7f514b33c550>>
Traceback (most recent call last):
  File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "fails_with_multiple_spiders.py", line 25, in spider_idle
    spider)
  File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 209, in crawl
    "Spider %r not opened when crawling: %s" % (spider.name, request)

I created a repository showing that spider_idle works as expected with a single spider but fails when multiple spiders are used:

https://github.com/loren-magnuson/scrapy_spider_idle_test

Here is the version that fails:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher


class SpiderIdleTest(scrapy.Spider):
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }

    def __init__(self):
        # Connect the handler through the global pydispatch dispatcher.
        dispatcher.connect(self.spider_idle, signals.spider_idle)
        self.idle_retries = 0

    def spider_idle(self, spider):
        self.idle_retries += 1
        if self.idle_retries < 3:
            # Schedule the next request on this instance's engine.
            self.crawler.engine.crawl(
                Request('https://www.google.com',
                        self.parse,
                        dont_filter=True),
                spider)
            raise DontCloseSpider("Stayin' alive")

    def start_requests(self):
        yield Request('https://www.google.com', self.parse)

    def parse(self, response):
        print(response.css('title::text').extract_first())


process = CrawlerProcess()
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.start()

I tried billiard as an alternative way to run multiple spiders simultaneously.

After using billiard processes to run the spiders simultaneously, the spider_idle signal still failed, but with a different exception:

Traceback (most recent call last):
  File "/home/louis_powersports/.virtualenv/spider_idle_test/lib/python3.6/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/home/louis_powersports/.virtualenv/spider_idle_test/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "test_with_billiard_process.py", line 25, in spider_idle
    self.crawler.engine.crawl(
AttributeError: 'SpiderIdleTest' object has no attribute 'crawler'

This led me to try changing:

self.crawler.engine.crawl(
    Request('https://www.google.com',
            self.parse,
            dont_filter=True),
    spider)

to:

spider.crawler.engine.crawl(
    Request('https://www.google.com',
            self.parse,
            dont_filter=True),
    spider)

That worked. Presumably this is because the handlers are connected through the global pydispatch dispatcher, so each spider's handler also receives idle signals fired by the other spiders: self.crawler may then belong to a different spider than the one passed in as the spider argument, while spider.crawler always refers to the crawler whose engine actually went idle.

Billiard turned out to be unnecessary: the original attempt, based on the Scrapy docs, works once the change above is made.

The original version, now working:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher


class SpiderIdleTest(scrapy.Spider):
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }

    def __init__(self):
        dispatcher.connect(self.spider_idle, signals.spider_idle)
        self.idle_retries = 0

    def spider_idle(self, spider):
        self.idle_retries += 1
        if self.idle_retries < 3:
            # Use the crawler of the spider that actually went idle.
            spider.crawler.engine.crawl(
                Request('https://www.google.com',
                        self.parse,
                        dont_filter=True),
                spider)
            raise DontCloseSpider("Stayin' alive")

    def start_requests(self):
        yield Request('https://www.google.com', self.parse)

    def parse(self, response):
        print(response.css('title::text').extract_first())


process = CrawlerProcess()
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.start()
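
As an aside, a variant that I would expect to sidestep the cross-spider signal delivery entirely (my own sketch, not code from the repository) is to connect the handler through each crawler's own signal manager in from_crawler rather than through the global pydispatch dispatcher. crawler.signals only delivers that crawler's signals, so each handler is called with its own spider, and the deprecated scrapy.xlib.pydispatch import goes away:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider


class SpiderIdleTest(scrapy.Spider):
    name = 'spider_idle_test'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SpiderIdleTest, cls).from_crawler(crawler, *args, **kwargs)
        # Per-crawler signal manager: this handler only ever sees its own spider.
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(SpiderIdleTest, self).__init__(*args, **kwargs)
        self.idle_retries = 0

    def spider_idle(self, spider):
        self.idle_retries += 1
        if self.idle_retries < 3:
            spider.crawler.engine.crawl(
                Request('https://www.google.com',
                        self.parse,
                        dont_filter=True),
                spider)
            raise DontCloseSpider("Stayin' alive")

    def start_requests(self):
        yield Request('https://www.google.com', self.parse)

    def parse(self, response):
        print(response.css('title::text').extract_first())


process = CrawlerProcess()
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.start()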
