Scrapy CSV export puts all the extracted data in one cell



I am currently building my first Scrapy project, and right now I am trying to extract data from an HTML table. Here is my crawl spider so far:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from digikey.items import DigikeyItem
from scrapy.selector import Selector

class DigikeySpider(CrawlSpider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = [
        'https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1',
        'https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/4?stock=1',
    ]

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1', ), deny=('subsection.php', ))),
    )

    def parse_item(self, response):
        item = DigikeyItem()
        item['partnumber'] = response.xpath('//td[@class="tr-mfgPartNumber"]/a/span[@itemprop="name"]/text()').extract()
        item['manufacturer'] = response.xpath('//td[6]/span/a/span/text()').extract()
        item['description'] = response.xpath('//td[@class="tr-description"]/text()').extract()
        item['quanity'] = response.xpath('//td[@class="tr-qtyAvailable ptable-param"]//text()').extract()
        item['price'] = response.xpath('//td[@class="tr-unitPrice ptable-param"]/text()').extract()
        item['minimumquanity'] = response.xpath('//td[@class="tr-minQty ptable-param"]/text()').extract()
        yield item

    parse_start_url = parse_item

It scrapes the table at www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/4?stock=1 and then exports everything to a digikey.csv file, but all of the scraped data ends up in a single cell. (Screenshot: CSV file with the scraped data in one cell.)
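The single-cell output can be reproduced outside Scrapy. Each field above is assigned the whole-column list returned by extract(), and Scrapy's CsvItemExporter joins multivalued (list) fields with commas, so an entire column lands in one CSV cell. A minimal stand-in using only the stdlib csv module (the sample values here are made up for illustration):

```python
import csv
import io

# One item whose fields hold whole-column lists, as the spider above yields.
item = {"partnumber": ["P1", "P2", "P3"], "price": ["$0.10", "$0.20", "$0.30"]}

# Mimic CsvItemExporter's behavior of joining list values with commas:
# the joined string contains commas, so csv quotes it into a single cell.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["partnumber", "price"])
writer.writeheader()
writer.writerow({key: ",".join(values) for key, values in item.items()})
out = buf.getvalue()
print(out)
# The partnumber cell is the quoted string "P1,P2,P3" -- every part in one cell.
```

Yielding one item per table row (as the answer below does) avoids this entirely, because each field is then a single string rather than a list.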

settings.py

BOT_NAME = 'digikey'
SPIDER_MODULES = ['digikey.spiders']
NEWSPIDER_MODULE = 'digikey.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'digikey ("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36")'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

I would like each row of scraped information on its own line of the CSV, with the corresponding information for that part number.

items.py

import scrapy

class DigikeyItem(scrapy.Item):
    partnumber = scrapy.Field()
    manufacturer = scrapy.Field()
    description = scrapy.Field()
    quanity = scrapy.Field()
    minimumquanity = scrapy.Field()
    price = scrapy.Field()

Any help is greatly appreciated!

The problem is that you are loading an entire column into every field of a single item. I believe what you want is something like:

for row in response.css('table#productTable tbody tr'):
    item = DigikeyItem()
    item['partnumber'] = (row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first() or '').strip()
    item['manufacturer'] = (row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first() or '').strip()
    item['description'] = (row.css('.tr-description::text').extract_first() or '').strip()
    item['quanity'] = (row.css('.tr-qtyAvailable::text').extract_first() or '').strip()
    item['price'] = (row.css('.tr-unitPrice::text').extract_first() or '').strip()
    item['minimumquanity'] = (row.css('.tr-minQty::text').extract_first() or '').strip()
    yield item

I changed the selectors a bit to try to make them shorter. By the way, avoid the manual extract_first() / strip() repetition I used here (it is only for testing purposes) and consider using Item Loaders instead, which make it much easier to take the first match and strip/format the output as needed.
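The Item Loader suggestion can be sketched without Scrapy: the combination of MapCompose(str.strip) as an input processor and TakeFirst() as an output processor (both exist in Scrapy's processors module) boils down to "strip every extracted string, then keep the first non-empty one". A plain-Python stand-in for that behavior, not the Scrapy API itself:

```python
def take_first_stripped(values):
    """Mimic an ItemLoader configured with MapCompose(str.strip) input
    processing and a TakeFirst() output processor: strip each extracted
    string and return the first non-empty result."""
    for value in values:
        value = value.strip()
        if value:
            return value
    return ""

# The raw .extract() output for one cell often carries whitespace noise:
raw = ["\n        ", "  ABC123-ND  ", "ignored"]
print(take_first_stripped(raw))  # ABC123-ND
```

With an actual ItemLoader, this logic is declared once per field (or as a default processor) instead of being repeated on every assignment line, which is what makes the loader version shorter.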
