Nested callbacks with Scrapy



I want to scrape the product pages of different brands, starting from a page that lists all the brands I'm interested in. So I basically wrote a Scrapy scraper whose parser extracts the URL of each brand it finds, then calls another parser to extract the URLs of that brand's products. However, this does not seem to be the right way to do nested callbacks. It returns:

2020-11-27 15:03:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.sephora.fr/marques-de-a-a-z/> (referer: None)
2020-11-27 15:03:19 [sephora] DEBUG: parse: I just visited: https://www.sephora.fr/marques-de-a-a-z/
url:  https://www.sephora.fr/ABSOL-HubPage.html
2020-11-27 15:03:19 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.sephora.fr/marques-de-a-a-z/> (referer: None)
Traceback (most recent call last):
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapyutilsdefer.py", line 120, in iter_errback
yield next(it)
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapyutilspython.py", line 353, in __next__
return next(self.data)
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapyutilspython.py", line 353, in __next__
return next(self.data)
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapycorespidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapyspidermiddlewaresoffsite.py", line 29, in process_spider_output
for x in result:
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapycorespidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapyspidermiddlewaresreferer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapycorespidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapyspidermiddlewaresurllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapycorespidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapyspidermiddlewaresdepth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapycorespidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:UsersantoiDocumentsProgrammingLearningDataSciencenosetime_scrapernosetime_scraperspiderssephora.py", line 23, in parse
yield scrapy.Request(url=base_url + url, callback=self.parse_brand(response))
File "c:usersantoidocumentsprogramminglearningdatasciencescr_envlibsite-packagesscrapyhttprequest__init__.py", line 32, in __init__
raise TypeError(f'callback must be a callable, got {type(callback).__name__}')
TypeError: callback must be a callable, got generator
2020-11-27 15:03:19 [scrapy.core.engine] INFO: Closing spider (finished)
2020-11-27 15:03:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 317,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 65639,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.416377,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 11, 27, 14, 3, 19, 346887),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2020, 11, 27, 14, 3, 17, 930510)}
2020-11-27 15:03:19 [scrapy.core.engine] INFO: Spider closed (finished)

Here is my spider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import json


class SephoraSpider(scrapy.Spider):
    name = 'sephora'
    allowed_domains = ['sephora.fr']
    start_urls = ['https://www.sephora.fr/marques-de-a-a-z/']

    # rules = (
    #     Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    # )

    def parse(self, response):
        base_url = 'https://www.sephora.fr'
        self.log("parse: I just visited: " + response.url)
        urls = response.css('a.sub-category-link::attr(href)').extract()
        if urls:
            for url in urls:
                yield scrapy.Request(url=base_url + url, callback=self.parse_brand(response))

    def parse_brand(self, response):
        self.log("parse_brand: I just visited: " + response.url)
        for d in response.css('div.product-tile::attr(data-tcproduct)').extract():
            d = json.loads(d)
            yield scrapy.Request(url=d['product_url_page'], callback=self.parse_item(response))

    def parse_item(self, response):
        self.log("I just visited: " + response.url)
        # item = {}
        # #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # #item['name'] = response.xpath('//div[@id="name"]').get()
        # #item['description'] = response.xpath('//div[@id="description"]').get()
        # return item

You should pass a reference to the callback function, not the result of calling it. Writing callback=self.parse_brand(response) calls the generator function immediately and hands Scrapy the generator object it returns; a generator is not callable, which is exactly what TypeError: callback must be a callable, got generator is complaining about.

So, to fix your problem, use the pattern below everywhere you build a Request, with the appropriate callback:

yield scrapy.Request(url=base_url + url, callback=self.parse_brand)
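For completeness, here is a minimal sketch of the two corrected methods (all identifiers taken from your spider above). Scrapy downloads each request and invokes the callback itself, passing in the new response:

def parse(self, response):
    base_url = 'https://www.sephora.fr'
    self.log("parse: I just visited: " + response.url)
    for url in response.css('a.sub-category-link::attr(href)').extract():
        # Pass the method itself; Scrapy will call it with the brand page response.
        yield scrapy.Request(url=base_url + url, callback=self.parse_brand)

def parse_brand(self, response):
    self.log("parse_brand: I just visited: " + response.url)
    for d in response.css('div.product-tile::attr(data-tcproduct)').extract():
        d = json.loads(d)
        # Same fix here: reference parse_item instead of calling it.
        yield scrapy.Request(url=d['product_url_page'], callback=self.parse_item)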

See https://docs.scrapy.org/en/latest/intro/tutorial.html#our-first-spider for an example.
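As an aside: if the reason you were passing response into the callback was to forward data from one callback to the next, a sketch of the usual mechanism is Request's cb_kwargs argument (available since Scrapy 1.7); the key name brand_page below is just an illustration, not something from your code:

def parse(self, response):
    base_url = 'https://www.sephora.fr'
    for url in response.css('a.sub-category-link::attr(href)').extract():
        yield scrapy.Request(
            url=base_url + url,
            callback=self.parse_brand,
            # Each cb_kwargs entry becomes a keyword argument of the callback.
            cb_kwargs={'brand_page': response.url},  # 'brand_page' is a made-up example key
        )

def parse_brand(self, response, brand_page):
    self.log("reached " + response.url + " via " + brand_page)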
