How to loop over multiple URLs from a CSV file for scraping in Scrapy?



My code for scraping data from the Alibaba website:

import scrapy


class IndiamartSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/showroom/acrylic-wine-box_4.html']

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()
        for item in zip(Title, Price, Min_order, Response_rate):
            scraped_info = {
                'Title': item[0],
                'Price': item[1],
                'Min_order': item[2],
                'Response_rate': item[3],
            }
            yield scraped_info

Notice the start URL: the spider only scrapes that single URL, but I want this code to scrape all the URLs present in my csv file, which contains a large number of them. Sample data from the .csv file:

'https://www.alibaba.com/showroom/shock-absorber.html',
'https://www.alibaba.com/showroom/shock-wheel.html',
'https://www.alibaba.com/showroom/shoes-fastener.html',
'https://www.alibaba.com/showroom/shoes-women.html',
'https://www.alibaba.com/showroom/shoes.html',
'https://www.alibaba.com/showroom/shoulder-long-strip-bag.html',
'https://www.alibaba.com/showroom/shower-hair-band.html',
...........

How can I feed all the links from the csv file into the code at once?

To iterate over the file correctly without loading it all into memory, you should use a generator, since both file objects in Python and the start_requests method in Scrapy are generators:

from scrapy import Spider, Request


class MySpider(Spider):
    name = 'csv'

    def start_requests(self):
        with open('file.csv') as f:
            for line in f:
                if not line.strip():
                    continue
                yield Request(line.strip())
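
If, as in the sample in the question, each line is wrapped in quotes and ends with a trailing comma, the raw line is not a valid URL yet. A minimal variation of the same idea (the file name file.csv is just a placeholder) that strips that punctuation first:

from scrapy import Spider, Request


class MySpider(Spider):
    name = 'csv'

    def start_requests(self):
        with open('file.csv') as f:
            for line in f:
                # Remove whitespace, a trailing comma, and surrounding quotes, e.g.
                # "'https://example.com/page.html'," -> "https://example.com/page.html"
                url = line.strip().strip(',').strip('\'"')
                if url:
                    yield Request(url)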

Further explanation: the Scrapy engine uses start_requests to generate requests as it goes, and it keeps pulling new requests until the concurrent-request limit is full (governed by settings such as CONCURRENT_REQUESTS).
Also worth noting: by default Scrapy crawls depth-first, so newer requests take priority, which means the start_requests loop is the last thing to finish.
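
These are ordinary Scrapy settings, so if you want to tune the concurrency or switch to breadth-first order, they can also be set per spider via custom_settings. A sketch with illustrative values only:

from scrapy import Spider


class MySpider(Spider):
    name = 'csv'

    custom_settings = {
        # Illustrative value; raise or lower it for your target site.
        'CONCURRENT_REQUESTS': 16,
        # The three settings below are the documented recipe for breadth-first
        # crawling, so earlier requests from start_requests are handled first.
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }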

You are almost there. The only change needed is to start_urls, which you want to be "all the URLs in the *.csv file". The following code makes that change straightforward.

with open('data.csv') as file:
    start_urls = [line.strip() for line in file]
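
For context, here is a sketch of how that snippet can sit at class level in the original spider (the file name data.csv is assumed, and the class body is otherwise unchanged):

import scrapy


class IndiamartSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']

    # Executed once when the class is defined; every non-empty line becomes a start URL.
    with open('data.csv') as file:
        start_urls = [line.strip() for line in file if line.strip()]

    def parse(self, response):
        # ... same parse logic as in the question ...
        pass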

Suppose you have already stored the list of URLs as a dataframe and you want to loop over every URL in it. The approach below worked for me.

import pandas as pd
import scrapy


class IndiamartSpider(scrapy.Spider):
    name = 'alibot'
    #allowed_domains = ['alibaba.com']
    #start_urls = ['https://www.alibaba.com/showroom/acrylic-wine-box_4.html']

    def start_requests(self):
        # fileContainingUrls.csv is a csv file with a column named 'URLS'
        # that contains all the urls you want to loop over.
        df = pd.read_csv('fileContainingUrls.csv')
        urlList = df['URLS'].to_list()
        for i in urlList:
            yield scrapy.Request(url=i, callback=self.parse)

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()

        for item in zip(Title, Price, Min_order, Response_rate):
            scraped_info = {
                'Title': item[0],
                'Price': item[1],
                'Min_order': item[2],
                'Response_rate': item[3],
            }
            yield scraped_info
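
If you also want the yielded items written straight to a CSV file, one option (assuming Scrapy 2.1 or newer, where the FEEDS setting is available; older versions use FEED_URI / FEED_FORMAT) is to declare a feed export in the spider's custom_settings:

import scrapy


class IndiamartSpider(scrapy.Spider):
    name = 'alibot'

    # Assumes Scrapy >= 2.1; output.csv is just an example file name.
    custom_settings = {
        'FEEDS': {
            'output.csv': {'format': 'csv'},
        },
    }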
