I'm having trouble scraping the second page from the links on Amazon's first page



I'm experimenting with scraping information from Amazon: starting from the links on the first page, I want to follow each link and download some information from it. But I keep running into a problem, and it happens so often that I can't make progress. I'd really appreciate your help.

shuju.py

import scrapy
from AmazonsPro.items import AmazonsproItem

class ShujuSpider(scrapy.Spider):
    name = 'shuju'
    # allowed_domains = ['www.amazon.com']
    start_urls = ['https://www.amazon.com/Best-Sellers-Office-Products-Woodcase-Lead-Pencils/zgbs/office-products/490674011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1']

    def parse(self, response):
        li_list = response.xpath('//ol[@id="zg-ordered-list"]/li')
        link_list = []
        for li in li_list:
            # get the link
            link = 'www.amazon.com' + li.xpath('.//span[@class="aok-inline-block zg-item"]/a/@href').extract_first()
            # get the rank
            rank = li.xpath('.//span[@class="a-size-small aok-float-left zg-badge-body zg-badge-color"]/span/text()').extract_first()
            link_list.append(link)
        for link in link_list:
            print(link)
            yield scrapy.Request(url=link, callback=self.sec_parse)

    # parse the second (detail) page
    def sec_parse(self, response):
        item = AmazonsproItem()
        print('start second page')
        title = response.xpath('.//*[@id="productTitle"]/text()').extract_first()
        brand = response.xpath('.//*[@id="bylineInfo"]/text()').extract_first()
        item['title'] = title
        item['brand'] = brand
        print('done')
        yield item

settings.py


ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
'AmazonsPro.pipelines.AmazonsproPipeline': 300,
}
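A side note: Amazon frequently blocks requests sent with Scrapy's default User-Agent, which can also make a spider appear to stop working. The settings below are an illustrative sketch, not part of the original post; the User-Agent string and delay values are assumptions you would tune yourself.

```python
# Illustrative additions to settings.py (not from the original post):
# a browser-like User-Agent and a small delay reduce the chance of being blocked.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0 Safari/537.36')
DOWNLOAD_DELAY = 1       # seconds to wait between requests
CONCURRENT_REQUESTS = 8  # keep concurrency modest
```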

pipelines.py

class AmazonsproPipeline:
    fp = None

    def open_spider(self, spider):
        print('start scrapy')
        self.fp = open('./Asin.txt', 'w')

    def process_item(self, item, spider):
        self.fp.write(item['rank'] + "+" + item['brand'] + "+" + item['title'] + "+" + item['star'])
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('over')

items.py

import scrapy

class AmazonsproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    rank = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    brand = scrapy.Field()

No error is raised, but the sec_parse() function only runs once (it should actually run 50 times, once per link).

Got it: I didn't add https:// to the links, so the requests didn't work. Thanks!
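For reference, resolving each href against the page URL avoids this class of bug entirely: Scrapy's response.urljoin (a thin wrapper over the standard library's urljoin) keeps the scheme and host from the page you scraped. A small stdlib sketch, using a hypothetical product path:

```python
from urllib.parse import urljoin

# Building links as 'www.amazon.com' + href produces a URL without a scheme,
# which Scrapy cannot request. Resolving against the page URL fixes that.
page_url = 'https://www.amazon.com/Best-Sellers-Office-Products/zgbs/office-products/'
href = '/Example-Product/dp/B000000000'  # hypothetical href from the listing

print(urljoin(page_url, href))
# -> https://www.amazon.com/Example-Product/dp/B000000000
```

Inside a spider callback, `response.urljoin(href)` gives the same result without importing anything.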
