Handling KeyError in Scrapy when using ItemLoader, with missing values defaulting to None



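Working from bits and pieces of tutorials, I have started using item loaders to collect data. The data comes from two places: a predefined dictionary that I load from JSON, and the product pages that the spider follows afterwards.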

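The problem I'm running into is that the dictionary sometimes lacks a key (such as "salePrice"), which raises a KeyError during the crawl and stops execution completely. I'd like to know whether there is a clean way to handle KeyErrors for that field in items.py, where input_processors and output_processors are specified for each field.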

Any suggestions or examples would be greatly appreciated!

import json
import re
import time

import scrapy
from scrapy.loader import ItemLoader

from tutorial.items import Product


class SephoraSpider(scrapy.Spider):

    name = 'sephora-shelf'
    start_urls = [
        'https://www.sephora.com/shop/moisturizing-cream-oils-mists/?currentPage=1'
    ]
    next_page_number = 1
    base_url = 'https://www.sephora.com'

    def parse(self, response):
        json_xpath = '//script[@type="text/json" and @id="linkSPA"]/text()'
        product_container = json.loads(response.xpath(json_xpath).extract()[0])
        product_container = product_container['NthCategory']['props']['products']
        start_time = round(time.time())
        print("starting loop")
        for _product in product_container:
            loader = ItemLoader(item=Product(), response=response)
            loader.add_value('list_price', _product['currentSku']['listPrice'])
            # This lookup raises KeyError when 'salePrice' is missing.
            loader.add_value('sale_price', _product['currentSku']['salePrice'])
            loader.add_value('sku_id', _product['currentSku']['skuId'])
            loader.add_value('product_key', _product['productId'])
            loader.add_value('product_name', _product['displayName'])
            loader.add_value('brand_name', _product['brandName'])
            loader.add_value('product_id', _product['productId'])

            _product_url = self.base_url + _product['targetUrl']
            loader.add_value('product_url', _product_url)
            loader.add_value('status', None)
            print("finished loading product")

            # TODO: add a check to see if it was on the previous run's data
            #       to determine if it is product status: added / deleted.
            #       Only collect product data if the product is newly added.
            yield response.follow(_product_url, callback=self.parse_product,
                                  meta={'item': loader.load_item()})

        next_page_xpath = '//button[@type="button" and @aria-label="Next"]'
        next_page_button = response.xpath(next_page_xpath)
        print(f'next_page_button: {next_page_button}')

        if next_page_button:
            print("Inside next_page_button")
            SephoraSpider.next_page_number += 1
            # '?' is a regex metacharacter, so it must be escaped in the pattern.
            next_page = re.sub(r'\?currentPage=[0-9]*',
                               '?currentPage=' +
                               str(SephoraSpider.next_page_number),
                               response.request.url)
            print(f"Next Page: {next_page}")
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        loader = ItemLoader(item=response.meta['item'],
                            response=response)
        loader.add_xpath('item_id', '//div[@data-at="sku_size"]')
        # Note: time.sleep blocks Scrapy's event loop; the DOWNLOAD_DELAY
        # setting is the usual way to throttle requests.
        time.sleep(3)
        yield loader.load_item()
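A simple workaround is to use the dictionary's .get() method and default to None when the key is missing. I'm still not convinced that this is the right way to handle this kind of error.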

Before: loader.add_value('sale_price', _product['currentSku']['salePrice'])

After: loader.add_value('sale_price', _product.get('currentSku').get('salePrice', None))
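
One caveat with the chained .get() call: if 'currentSku' itself is missing, the first .get() returns None and the second call raises AttributeError instead of KeyError. Also, the KeyError is raised while the argument is being evaluated, before add_value is ever called, so input processors in items.py never see it. Below is a minimal sketch of a small helper that walks nested keys safely. The name get_nested and the sample records are hypothetical, made up for illustration; they are not part of Scrapy or of the original spider.

```python
from collections.abc import Mapping
from typing import Any


def get_nested(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested mappings key by key; return `default` if any step is missing."""
    current = data
    for key in keys:
        # Stop as soon as the current level is not a mapping or lacks the key.
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical records shaped like the entries in product_container.
with_sale = {"currentSku": {"listPrice": "$30.00", "salePrice": "$24.00"}}
without_sale = {"currentSku": {"listPrice": "$30.00"}}
without_sku = {"productId": "P123"}

for record in (with_sale, without_sale, without_sku):
    print(get_nested(record, "currentSku", "salePrice"))
```

In the spider this would be used as loader.add_value('sale_price', get_nested(_product, 'currentSku', 'salePrice')). As far as I can tell, ItemLoader skips None values passed to add_value, so the field would be left unset rather than stored as None; that is worth checking against the Scrapy version in use.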
