In Scrapy, I'm trying to write a downloader middleware that filters out responses with 401, 403 and 410 status codes and sends some new requests for those URLs. I get an error saying that process_response must return a Response or Request object. That's because I yield 10 requests, to make sure each failed URL is retried enough times. How can I fix it? Thank you.
Here is my middleware code, activated in settings.py:
class NegativeResponsesDownloaderMiddlerware(Spider):

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        print("---(NegativeResponsesDownloaderMiddlerware)")
        filtered_status_list = ['401', '403', '410']
        adaptoz = FailedRequestsItem()
        if response.status in filtered_status_list:
            adaptoz['error_code'][response.url] = response.status
            print("---(process_response) => Sending URL back to DOWNLOADER: URL =>", response.url)
            for i in range(self.settings.get('ERROR_HANDLING_ATTACK_RATE')):
                yield Request(response.url, self.check_retrial_result, headers=self.headers)
            raise IgnoreRequest(f"URL taken out from first flow. Error Code: ", adaptoz['error_code'], " => URL = ", resp)
        else:
            return response
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest

    def check_retrial_result(self, response):
        if response.status == 200:
            x = XxxSpider()
            x.parse_event(response)
        else:
            return None
Unfortunately, Scrapy doesn't know how to handle the return value of a middleware method once you turn it into a generator; in other words, you cannot use yield in any of the middleware interface methods.
What you can do instead is generate the sequence of requests yourself and feed them back to the Scrapy engine, so that they get parsed by the spider just as if they had been included in the start_urls or start_requests method.
You can do this by submitting each created request to the spider.crawler.engine.crawl method if it passes your filter test, and raising IgnoreRequest once the loop is complete.
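The root cause is visible in plain Python, without any Scrapy at all: a function whose body contains yield anywhere is a generator function, so calling it returns a generator object immediately instead of ever hitting the return response branch. That is why Scrapy complains that process_response did not return a Response or Request:

```python
def process_response_like(status):
    # Because `yield` appears somewhere in the body, this is a
    # generator function: calling it returns a generator object
    # without running any of the code below.
    if status in (401, 403, 410):
        for _ in range(10):
            yield "new request"
    else:
        return  # a bare return just ends the generator early

result = process_response_like(200)
print(type(result).__name__)  # -> generator, never a plain return value
```

Scrapy type-checks the value returned from process_response, sees a generator rather than a Response or Request, and raises the error you quoted.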
def process_response(self, request, response, spider):
    # Note: response.status is an int, so the filter list must hold ints.
    filtered_status_list = [401, 403, 410]
    adaptoz = FailedRequestsItem()
    if response.status in filtered_status_list:
        adaptoz['error_code'][response.url] = response.status
        for i in range(spider.crawler.settings.get('ERROR_HANDLING_ATTACK_RATE')):
            request = scrapy.Request(response.url, callback=callback_method, headers=self.headers)
            spider.crawler.engine.crawl(request, spider)
        raise IgnoreRequest(
            f"URL taken out from first flow. Error codes: {adaptoz['error_code']} => URL = {response.url}"
        )
    return response
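One pitfall worth calling out separately: Scrapy's response.status is an int, so the string list from the original question ('401', '403', '410') never matches any response and the whole branch is silently skipped. A quick check in plain Python:

```python
# Scrapy's response.status is an int, e.g. 403.
status = 403

# Comparing an int against a list of strings silently never matches...
assert status not in ['401', '403', '410']

# ...while a list of ints works as intended.
assert status in [401, 403, 410]
```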
If I understand correctly, what you want to achieve can be done with settings alone:
RETRY_TIMES = 10  # Default is 2
RETRY_HTTP_CODES = [401, 403, 410]  # Default: [500, 502, 503, 504, 522, 524, 408, 429]
The documentation is here.
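If you only want this retry behavior for one spider rather than the whole project, the same two settings can also live in that spider's custom_settings class attribute instead of settings.py (a sketch reusing the spider class name from the question):

```python
import scrapy

class XxxSpider(scrapy.Spider):
    name = "xxx"  # hypothetical spider name
    # Per-spider overrides of the built-in RetryMiddleware settings.
    custom_settings = {
        "RETRY_TIMES": 10,
        "RETRY_HTTP_CODES": [401, 403, 410],
    }
```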