How to pass Scrapy item data into the images pipeline



I have a spider that downloads jpgs from a particular site. In the past I parsed response.url in the images pipeline so the files were renamed as they were downloaded. The problem is that the site's directory structure is odd, so parsing image_urls to rename the destination files doesn't work. As a workaround, I just use the original image names for the files.

I would like to use data from the actual Scrapy item itself, but I can't seem to pass variables from the spider into the images pipeline. In the code below I want to parse the url in the spider and pass it as a variable to otImagesPipeline in the pipeline, but nothing has worked. I have looked through the Scrapy documentation but could not find how to do this.

Is this possible in Scrapy?

Here is my spider code:

settings.py:

BOT_NAME = 'bid'
MEDIA_ALLOW_REDIRECTS = True
SPIDER_MODULES = ['bid.spiders']
NEWSPIDER_MODULE = 'bid.spiders'
ITEM_PIPELINES = {'bid.pipelines.otImagesPipeline': 1}
IMAGES_STORE = r'C:\temp\images\filenametest'  # raw string so backslashes are kept literally
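A quick aside on the Windows path above: in a plain Python string, \t is a tab escape, so a non-raw 'C:\temp\...' silently corrupts the path. A minimal demonstration:

```python
# In a plain string, \t becomes a tab character; in a raw string
# every character is kept literally, which a Windows path needs.
plain = 'C:\temp\images'
raw = r'C:\temp\images'
print('\t' in plain)  # True: the path is corrupted
print('\t' in raw)    # False: the raw string is intact
```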

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline  # scrapy.contrib.* is gone in modern Scrapy

class otImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # Name the file after the last segment of the image URL.
        targetfile = request.url.split('/')[-1]
        return targetfile

items.py

import scrapy

class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    caption = scrapy.Field()
    image_urls = scrapy.Field()

getbid.py (the spider)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bid.items import BidItem
from urllib import parse as urlparse

class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for sel in response.xpath('//a'):
            link = str(sel.xpath('@href').extract()[0])
            if link.endswith('.jpg'):
                href = BidItem()
                href['url'] = response.url
                href['title'] = response.css("h1.entry-title::text").extract_first()
                href['caption'] = response.css("p.wp-caption-text::text").extract()
                href['image_urls'] = [link]
                yield href
                yield scrapy.Request(urlparse.urljoin('http://www.example.com/', link), callback=self.parse_item)
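The spider above leans on urljoin to turn relative hrefs into absolute URLs before re-requesting them; its behavior can be checked standalone (the link values below are made up):

```python
from urllib import parse as urlparse

# A relative link is resolved against the base URL.
print(urlparse.urljoin('http://www.example.com/', 'photos/cat.jpg'))
# prints http://www.example.com/photos/cat.jpg

# An already-absolute link is returned unchanged.
print(urlparse.urljoin('http://www.example.com/', 'http://other.example/a.jpg'))
# prints http://other.example/a.jpg
```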

Update

Thanks to Umair's help I was able to fix it exactly the way I needed. Here is the revised code (note that items.py also needs an images = scrapy.Field() entry, since the spider now populates href['images']):

getbid.py

def parse_item(self, response):
    for sel in response.xpath('//a'):
        link = str(sel.xpath('@href').extract()[0])
        if link.endswith('.jpg'):
            href = BidItem()
            href['url'] = response.url
            href['title'] = response.css("h1.entry-title::text").extract_first()
            href['caption'] = response.css("p.wp-caption-text::text").extract()
            future_dir = href['url'].split("/")[-2]
            href['images'] = {link: future_dir}
            yield href
            yield scrapy.Request(urlparse.urljoin('http://www.example.com/', link), callback=self.parse_item)

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class otImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        if 'images' in item:
            for image_url, img_dir in item['images'].items():
                # Carry the target directory along on the request itself.
                request = scrapy.Request(url=image_url)
                request.meta['img_dir'] = img_dir
                yield request

    def file_path(self, request, response=None, info=None):
        # Build a path (relative to IMAGES_STORE) from the meta value.
        filename = request.url.split('/')[-1]
        filedir = request.meta['img_dir']
        filepath = filedir + "/" + filename
        return filepath
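Since the revised file_path only does string work on the URL and the meta value, its naming scheme can be sanity-checked with plain Python outside any crawl (the sample URL and directory are hypothetical):

```python
def build_file_path(url, img_dir):
    # Mirrors the pipeline's file_path: last URL segment becomes the
    # filename, prefixed by the directory carried in request.meta.
    filename = url.split('/')[-1]
    return img_dir + "/" + filename

print(build_file_path('http://www.example.com/lot/42/photo1.jpg', '42'))
# prints 42/photo1.jpg
```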

Set IMAGES_STORE in your spider class so that you can access it later in the ImagesPipeline's file_path method:

class GetbidSpider(CrawlSpider):
    name = 'getbid'
    IMAGE_DIR = r'C:\temp\images\filenametest'  # raw string so backslashes are kept literally
    custom_settings = {
        "IMAGES_STORE": IMAGE_DIR
    }
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for sel in response.xpath('//a'):
            link = str(sel.xpath('@href').extract()[0])
            if link.endswith('.jpg'):
                href = BidItem()
                href['url'] = response.url
                href['title'] = response.css("h1.entry-title::text").extract_first()
                href['caption'] = response.css("p.wp-caption-text::text").extract()
                href['images'] = {link: href['title']}
                yield href
                yield scrapy.Request(urlparse.urljoin('http://www.example.com/', link), callback=self.parse_item)

Then in your ImagesPipeline:

import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class CustomImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        if 'images' in item:
            for image_url, img_name in item['images'].items():  # iteritems() is Python 2 only
                request = scrapy.Request(url=image_url)
                request.meta['img_name'] = img_name
                yield request

    def file_path(self, request, response=None, info=None):
        return os.path.join(info.spider.IMAGE_DIR, request.meta['img_name'])
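The meta hand-off in this answer can also be exercised without running a crawler: below, a plain dict stands in for request.meta, and the pipeline's two steps (attach the name, then read it back when building the path) become ordinary function calls. All names are illustrative:

```python
import os

IMAGE_DIR = 'images'  # stand-in for the spider's IMAGE_DIR setting

def make_request(image_url, img_name):
    # Mimics get_media_requests: stash the target name in meta.
    return {'url': image_url, 'meta': {'img_name': img_name}}

def file_path(request):
    # Mimics file_path: recover the name from meta and join it
    # onto the spider's image directory.
    return os.path.join(IMAGE_DIR, request['meta']['img_name'])

req = make_request('http://www.example.com/p/1.jpg', 'My Title')
print(file_path(req))
```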
