Bypassing useless crawl lists with a for loop or str format

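I am looking for a solution so that my code crawls each item only once. Since I added the last loop, I receive every item three times. How can I run the last loop only once, or is there a way to filter out all the duplicate crawls?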

import scrapy
from ..items import TopartItem


class LinkSpider(scrapy.Spider):
    name = "link"
    allowed_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    custom_settings = {'FEED_EXPORT_FIELDS': ['title','links','ItemSKU','ItemEAN','Delivery_Status', 'Attribute', 'Values'] }

    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')

        for a in card:
            items = TopartItem()
            link = a.xpath('@href')
            items['title'] = a.xpath('.//div[@class="sn_p01_desc h4 col-12 pl-0 pl-sm-3 pull-left"]/text()').get().strip()
            items['links'] = link.get()
            items['ItemSKU'] = a.xpath('.//span[@class="sn_p01_pno"]/text()').get().strip()
            items['Delivery_Status'] = a.xpath('.//div[@class="availabilitydeliverytime"]/text()').get().strip().replace('/','')
            yield response.follow(url=link.get(),callback=self.parse_item, meta={'items':items})
        last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
        last_page_number = int(last_pagination_link.split('=')[-1])
        for i in range(2,last_page_number+1):
            url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
            yield response.follow(url=url, callback=self.parse)

    def parse_item(self,response):
        table = response.xpath('//div[@class="productcustomattrdesc word-break col-6"]')
        for a in table:
            items = TopartItem()
            items = response.meta['items']
            items['ItemEAN'] = response.xpath('//div[@class="productean"]/text()').get().strip()
            items['Attribute'] = response.xpath('//div[@class="productcustomattrdesc word-break col-6"]/text()').getall()
            items['Values'] = response.xpath('//div[@class="col-6"]/text()').getall()
            yield items

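I expect only 51 items, but I receive 153.

You get three of each item because you wrapped a for loop around the table, which I don't think is necessary. Happy to be proven wrong if the data doesn't make sense, though.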

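Additions

A small addition to the code at the top. I did this to specify how the columns should appear when the CSV file is created. Normally, with items, you cannot get the columns in the order you want. Here we fix the order by having Scrapy pick up these settings. We have to add Attribute and Values to the list so that they are included when the CSV document is created.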

custom_settings = {'FEED_EXPORT_FIELDS': ['title','links','ItemSKU','ItemEAN','Delivery_Status','Attribute','Values'] }
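To make the idea of a fixed column order concrete, here is a small standard-library sketch. It does not use Scrapy at all: the fieldnames argument of csv.DictWriter plays a role similar to FEED_EXPORT_FIELDS only by analogy, and the sample row and the helper name to_csv are made up for illustration.

```python
import csv
import io

# Same order as the FEED_EXPORT_FIELDS list above.
FIELDS = ['title', 'links', 'ItemSKU', 'ItemEAN',
          'Delivery_Status', 'Attribute', 'Values']

# Hypothetical scraped item; the keys are deliberately in a different order.
row = {
    'Values': '30 cm',
    'Attribute': 'Height',
    'ItemEAN': '0000000000000',
    'title': 'Sample title',
    'Delivery_Status': 'in stock',
    'ItemSKU': 'SKU-1',
    'links': '/de/sample',
}


def to_csv(rows, fields):
    """Write rows to CSV text with the columns in the order given by fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


output = to_csv([row], FIELDS)
header = output.splitlines()[0]
print(header)
```

The header follows the order of FIELDS, not the order in which the keys were set on the row.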

Corrections to the code

def parse_item(self,response):
    items = response.meta['items']
    items['ItemEAN'] = response.xpath('//div[@class="productean"]/text()').get().strip()
    items['Attribute'] = response.xpath('//div[@class="productcustomattrdesc word-break col-6"]/text()').getall()
    items['Values'] = response.xpath('//div[@class="col-6"]/text()').getall()
    yield items

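Explanation

There is no need to instantiate TopartItem() in parse_item, because it has already been instantiated in the parse function.
There is no need for a for loop; just use response to get the details.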
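The effect of the extra loop can be reproduced without Scrapy. The sketch below is plain Python, not the spider itself; it assumes that every product page has three matching table rows (which would account for 153 = 3 x 51), and both function names are hypothetical.

```python
def parse_item_with_loop(item, table_rows):
    """Mimics the original parse_item: the same item is yielded once per row."""
    for _row in table_rows:
        yield item


def parse_item_without_loop(item, table_rows):
    """Mimics the corrected parse_item: the item is yielded exactly once."""
    yield item


# 51 hypothetical products, each with 3 attribute rows on its detail page.
products = [{'title': f'product {n}'} for n in range(51)]
rows = ['row 1', 'row 2', 'row 3']

with_loop = [out for p in products for out in parse_item_with_loop(p, rows)]
without_loop = [out for p in products for out in parse_item_without_loop(p, rows)]

print(len(with_loop), len(without_loop))
```

Because the loop body never uses the row it iterates over, each pass yields an identical copy of the item.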

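Tips

If you really do need a for loop around a table, or around any XPath selector that gives you a list, remember that your XPath selector should be a.xpath('.//div etc....) rather than response.xpath('//...'). This is because you want to query a, not response or table, and you have to use .// because you want a relative XPath, NOT //, which searches the whole document.

By relative path, I mean that you are telling Scrapy to start from the selector held in table: with .//XPATH_SELECTOR, the table's own XPath is in effect prepended to whatever follows the .//. This is a neat way to avoid writing very long XPath strings. It is not optional, though, when you loop over an XPath selector that has already produced a list of selectors.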

Example

This is not code to include, but an example of how to use a for loop when the table XPath selector gives you a list.

table = response.xpath('//div[@class="productcustomattrdesc word-break col-6"]')
for a in table:
    items = response.meta['items']
    items['ItemEAN'] = a.xpath('.//div[@class="productean"]/text()').get().strip()
    items['Attribute'] = a.xpath('.//div[@class="productcustomattrdesc word-break col-6"]/text()').getall()
    items['Values'] = a.xpath('.//div[@class="col-6"]/text()').getall()
    yield items

We used a rather than table or response, and we specifically used .// rather than //
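The difference between searching from the whole document and searching relative to the current row can be tried out with the standard library. This sketch uses xml.etree.ElementTree as a stand-in for Scrapy selectors, which is an assumption made only to keep the example self-contained; the markup and class names are invented.

```python
import xml.etree.ElementTree as ET

# Invented markup: two rows, each with a name cell and a value cell.
markup = """
<table>
  <tr><td class="name">Height</td><td class="value">30 cm</td></tr>
  <tr><td class="name">Colour</td><td class="value">green</td></tr>
</table>
"""

root = ET.fromstring(markup)
rows = root.findall('.//tr')

# Querying the whole document on every pass (like response.xpath('//...'))
# returns every name cell each time, whichever row the loop is on.
from_document = [
    [td.text for td in root.findall(".//td[@class='name']")]
    for _row in rows
]

# Querying relative to the current row (like a.xpath('.//...'))
# returns only the cells that belong to that row.
from_row = [
    [td.text for td in row.findall(".//td[@class='name']")]
    for row in rows
]

print(from_document)
print(from_row)
```

Only the second form ties each result to the row currently being processed.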

Update based on the comments

The next problem needs some string and list manipulation.

Changes to the code

For the code below to work, you need to change custom_settings

custom_settings = {'FEED_EXPORT_FIELDS': ['title','links','ItemSKU','ItemEAN','Delivery_Status','Values'] }

You also need to remove the following from items.py

Attributes = scrapy.Field()

Updated parse_item code

def parse_item(self,response):
    items = response.meta['items']
    attribute = response.xpath('//div[@class="productcustomattrdesc word-break col-6"]/text()').getall()
    values = response.xpath('//div[@class="col-6"]/text()').getall()
    combined = []
    for i,j in zip(attribute,values):
        # Assumed: the last replace strips apostrophes from the value.
        combined.append(i.strip().replace('.','').replace(':',': ') + j.strip().replace("'",''))
    items['ItemEAN'] = response.xpath('//div[@class="productean"]/text()').get().strip()
    items['Values'] = ', '.join(combined)
    yield items

解释

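We define the variables attribute and values. We do not add them to the items dictionary yet, because we want to do some manipulation first.

The line that builds combined is long, but it is easy to follow.

We have two lists, attribute and values, and we want to pair up the entries of the two lists: the first entry in attribute with the first entry in values, and so on. This can be done with the zip function.

Here is an abstract example to show what zip is doing.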

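Suppose we have two lists, num = ['1','2','3'] and letter = ['a','b','c']. list(zip(num,letter)) will create [('1','a'),('2','b'),('3','c')]. zip creates a tuple from each pair of corresponding list entries, and list() puts those tuples into a list.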
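The pairing can be checked in a Python 3 shell. The sketch below also shows that zip stops at the shorter input, which matters if the two XPath queries ever return lists of different lengths; the second list in that case is made up for the demonstration.

```python
num = ['1', '2', '3']
letter = ['a', 'b', 'c']

# In Python 3, zip returns an iterator, so wrap it in list() to see the pairs.
pairs = list(zip(num, letter))

# zip stops at the shorter input: the unmatched '3' is silently dropped.
short = list(zip(num, ['a', 'b']))

print(pairs)
print(short)
```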

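Now the goal is to combine all the entries of this list into a single string.

We can loop over each pair produced by zip(num,letter) like this

combined = []
for i,j in zip(num,letter):
    combined.append(i + j)

This creates combined = ['1a','2b','3c']

We then use ', '.join(combined), a standard way of turning a list into a string, to merge all of these into one string.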

That is what this code does, except that I also use the strip() method and replace a few characters in each i or j to tidy things up.
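Put together, the clean-up can be sketched as a standalone function. This is a simplified illustration rather than the spider code: the function name combine and the sample strings are hypothetical, and only the strip, replace and join steps described above are reproduced.

```python
def combine(attributes, values):
    """Pair each attribute with its value and join the pairs into one string."""
    combined = []
    for name, value in zip(attributes, values):
        # Drop surrounding whitespace and dots, then put a space after the colon.
        label = name.strip().replace('.', '').replace(':', ': ')
        combined.append(label + value.strip())
    return ', '.join(combined)


# Hypothetical scraped text, with the stray whitespace and dots such pages often have.
attributes = [' Height:', ' Diam.:']
values = [' 30 cm ', ' 5 cm ']

result = combine(attributes, values)
print(result)
```

Storing one joined string instead of two parallel lists keeps each product on a single readable CSV cell.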
