Scrapy pagination follows only 2 pages, but it should follow more



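The pagination returns results for page_1 and page_2 only, but it should keep going, up to 10 pages. I tried switching the next_page selector from .css to .xpath, but that did not work for me.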

import scrapy


class YellSpider(scrapy.Spider):
    name = 'yell'
    base_url = 'https://www.yell.com{}'
    start_urls = ['https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=770796459&keywords=hospitals&location=united+kingdom']

    def parse(self, response):
        for data in response.css('div.row.businessCapsule--mainRow'):
            title = data.css('.text-h2::text').get()
            avg_rating = response.css('span.starRating--average::text').get()
            business_url = data.css('a.businessCapsule--title::attr(href)').get()
            final_url = self.base_url.format(business_url)
            yield scrapy.Request(final_url, callback=self.parse_site, cb_kwargs={"title": title, "avg_rating": avg_rating})

        next_page = response.urljoin(response.css('a.pagination--next::attr(href)').extract_first())
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_site(self, response, title, avg_rating):
        opening_hours = response.css('strong::text').get()
        opening_hours = opening_hours.strip() if opening_hours else ""
        items = {
            'Title': title,
            'Average Rating': avg_rating,
            'Hours': opening_hours
        }
        yield items

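I ran the script just now and it works fine. If you see it scraping content from the first page only, check the link manually in a browser to find out whether you have been rate-limited. If the manual visit shows a captcha page, take a break of about half an hour and then run the script again.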
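If rate limiting is the cause, slowing the crawl down may help. Below is a minimal sketch of throttling options using standard Scrapy setting names. The dictionary name `THROTTLE_SETTINGS` and every value in it are assumptions for illustration and have not been tuned against yell.com.

```python
# Hypothetical throttling configuration; the values are guesses, not tested limits.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 5,                  # base delay in seconds between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the delay around DOWNLOAD_DELAY
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to observed latency
    "AUTOTHROTTLE_START_DELAY": 5,        # initial AutoThrottle delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 60,         # upper bound for the adaptive delay
}
```

Assigning this dictionary to the spider's `custom_settings` class attribute applies it to that spider only. A slower crawl is not guaranteed to avoid the captcha page.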

import scrapy


class YellSpider(scrapy.Spider):
    name = 'yell'
    base_url = 'https://www.yell.com{}'
    start_urls = ['https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=770796459&keywords=hospitals&location=united+kingdom']

    def parse(self, response):
        # One request per business listing on the current results page.
        for data in response.css('div.row.businessCapsule--mainRow'):
            title = data.css('.text-h2::text').get()
            # Note: this queries the whole response, not the current listing,
            # so every item gets the first rating found on the page.
            avg_rating = response.css('span.starRating--average::text').get()
            business_url = data.css('a.businessCapsule--title::attr(href)').get()
            final_url = self.base_url.format(business_url)
            yield scrapy.Request(final_url, callback=self.parse_site, cb_kwargs={"title": title, "avg_rating": avg_rating})

        # Follow the "next" link; .get() returns None when there is no such link.
        next_page = response.css('a.pagination--next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_site(self, response, title, avg_rating):
        # The first <strong> text on the detail page is treated as the opening hours.
        opening_hours = response.css('strong::text').get()
        opening_hours = opening_hours.strip() if opening_hours else ""
        items = {
            'Title': title,
            'Average Rating': avg_rating,
            'Hours': opening_hours
        }
        yield items
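
The only functional difference from the spider in the question is the `next_page` lookup. As far as I can tell, `response.urljoin` delegates to `urllib.parse.urljoin`, which returns the base URL unchanged when the second argument is `None`. The question's `if next_page is not None` check is therefore always true, and on the last page the spider re-requests the current URL, which Scrapy's duplicate filter normally drops. This is probably not what stops the crawl at page 2, but it is worth cleaning up. The sketch below reproduces the behaviour with the standard library alone; the `example.com` URLs and both helper function names are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical current page URL, standing in for response.url.
PAGE_URL = "https://example.com/search?page=2"


def next_page_original(page_url, href):
    """Mimics response.urljoin(href): never returns None."""
    return urljoin(page_url, href)


def next_page_checked(page_url, href):
    """Returns None when the page has no 'next' link."""
    return urljoin(page_url, href) if href else None


# With no "next" link, the original approach falls back to the current URL.
print(next_page_original(PAGE_URL, None))
# The checked version reports that there is nothing left to follow.
print(next_page_checked(PAGE_URL, None))
# With a real link, both resolve it against the current page.
print(next_page_checked(PAGE_URL, "/search?page=3"))
```

In the spider itself no helper is needed: `response.follow` accepts a relative URL, so passing the raw `href` from `.get()` and testing it with `if next_page:` is enough.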
