Scrapy callback is not executed when using Playwright for JavaScript rendering



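I am using Scrapy with the Playwright plugin (scrapy-playwright) to scrape a website that relies on JavaScript rendering. My spider has two async functions, parse_categories and parse_product_page.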

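The parse_categories function checks the page at the URL for categories and sends another request to the parse_categories callback, repeating until it reaches a product page, which is when no categories are found. When no categories are found, it should send a request to the parse_product_page callback.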

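However, when execution reaches the else block in parse_categories, the request to parse_product_page never seems to be made. I have confirmed that the code enters the else block, but the print statement in parse_product_page is never reached.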

Here is my minimal reproducible example:

import scrapy
from scrapy_playwright.page import PageMethod


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['quotes.toscrape.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://quotes.toscrape.com/js/',
            callback=self.parse_urls,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'body > div > nav > ul > li > a'),
                ],
            ),
        )

    async def parse_urls(self, response):
        # Close the Playwright page as soon as the rendered response is available.
        page = response.meta['playwright_page']
        await page.close()

        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page_url:
            print("Inside if block")
            url = 'https://quotes.toscrape.com' + next_page_url
            yield scrapy.Request(
                url=url,
                callback=self.parse_urls,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'body > div > div.quote'),
                    ],
                ),
            )
        else:
            print("Next page link not found")
            # Requests the URL that was just crawled, with a different callback.
            yield scrapy.Request(
                url=response.request.url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'body > div > div.quote'),
                    ],
                ),
            )

    async def parse(self, response):
        page = response.meta['playwright_page']
        await page.close()
        print("Function has been called, because next page link not found")

Here is the log output from the spider:

Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Next page link not found
2023-04-11 09:47:04 [root] WARNING: spider quotes finished crawling

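This issue was fixed by adding dont_filter=True to the scrapy.Request yielded in the else block. Scrapy's duplicate filter drops requests for URLs it has already seen, and the else block re-requests response.request.url, which had just been crawled, so the request was discarded before the parse callback could run.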

else:
    yield scrapy.Request(
        url=response.request.url,
        callback=self.parse,
        dont_filter=True,
        meta=dict(
            playwright=True,
            playwright_include_page=True,
            playwright_page_methods=[
                PageMethod('wait_for_selector', 'body > div > div.quote'),
            ],
        ),
    )
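
To illustrate why dont_filter matters here, the following is a toy model of a scheduler with a duplicate filter. It is only a sketch and not Scrapy's actual implementation: ToyRequest, ToyScheduler, and the page URL are hypothetical names chosen for illustration, and the real filter compares request fingerprints rather than raw URL strings. The point it shows is that the callback plays no part in duplicate detection, so a second request for the same URL is dropped even though it targets a different callback.

```python
# Toy model of duplicate filtering; NOT Scrapy's real scheduler or dupefilter.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRequest:
    url: str
    callback: str  # callback name as a string, for illustration only
    dont_filter: bool = False


class ToyScheduler:
    def __init__(self):
        self.seen_urls = set()
        self.queue = []

    def enqueue(self, request):
        """Return True if the request was queued, False if dropped as a duplicate."""
        # The callback is ignored: only the URL decides whether this is a duplicate.
        if not request.dont_filter and request.url in self.seen_urls:
            return False
        self.seen_urls.add(request.url)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
last_page = "https://quotes.toscrape.com/js/page/10/"  # illustrative URL

first = scheduler.enqueue(ToyRequest(last_page, "parse_urls"))
repeat = scheduler.enqueue(ToyRequest(last_page, "parse"))
forced = scheduler.enqueue(ToyRequest(last_page, "parse", dont_filter=True))
print(first, repeat, forced)
```

In a real project, setting DUPEFILTER_DEBUG = True in the Scrapy settings makes the default duplicate filter log every dropped request, which can help spot this kind of problem, provided the log level is low enough to show DEBUG messages.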
