I'm getting empty arrays in the data crawled from a website — what could be the problem?
import scrapy
from scrapy.loader import ItemLoader
from jumia.items import JumiaItem


class LaptopsSpider(scrapy.Spider):
    name = "laptops"
    start_urls = [
        'https://www.jumia.co.ke/laptops/'
    ]

    def parse(self, response):
        for laptops in response.xpath("//div[contains(@class, '-gallery')]"):
            loader = ItemLoader(item=JumiaItem(), selector=laptops, response=response)
            loader.add_xpath('brand', ".//span[contains(@class, 'brand')]/text()")
            loader.add_xpath('name', ".//span[@class='name']/text()")
            loader.add_xpath('price', ".//span[@class='price-box ri']/span[contains(@class, 'price')][1]/span[@dir='ltr']/text()")
            loader.add_xpath('link', ".//a[@class='link']/@href")
            yield loader.load_item()

        next_page = response.xpath("//a[@title='Next']/@href").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
I checked with scrapy shell, and it seems some of the blocks don't carry the information you need. Look at these results:
In [2]: len(response.xpath("//div[contains(@class, '-gallery')]").extract())
Out[2]: 48
In [3]: len(response.xpath("//div[contains(@class, '-gallery')]//span[contains(@class, 'brand')]").extract())
Out[3]: 40
So there are 48 blocks, but only 40 of them are valid. I'd therefore suggest adding a small check inside your for loop for the data you need (e.g. check the name or brand), and if it's missing, simply continue.
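That skip logic can be sketched as follows. This is a minimal, self-contained model: plain dicts stand in for the extracted '-gallery' selectors, and `has_required_fields` is a hypothetical helper — inside your real parse() you would instead test the result of `laptops.xpath(...).get()` before building the ItemLoader:

    def has_required_fields(block):
        """Skip gallery cards that carry no product data (e.g. ad/placeholder tiles)."""
        return bool(block.get("brand")) and bool(block.get("name"))

    # Stand-ins for the extracted '-gallery' blocks: one is a placeholder.
    blocks = [
        {"brand": "HP", "name": "EliteBook 840"},
        {"brand": "Lenovo", "name": "ThinkPad T14"},
        {"brand": None, "name": None},  # placeholder card with no product info
    ]

    valid = []
    for block in blocks:
        if not has_required_fields(block):
            continue  # the same 'continue' suggested above
        valid.append(block)

    print(len(valid))  # 2

With the guard in place, only blocks that actually contain a brand and name reach the loader, so the empty items disappear from your output.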