Python + Scrapy: renaming downloaded images



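Important note: all the answers currently available on Stack Overflow apply to earlier versions of Scrapy and do not work with the latest version, Scrapy 1.4.

I am completely new to Scrapy and Python. I am trying to scrape some pages and download images. The images are being downloaded, but they keep their original SHA-1 names as file names. I can't figure out how to rename the files; every one of them ends up with a SHA-1 file name.

I tried renaming them to "test". When I run scrapy crawl rambopics, "test" does appear in the output along with the URL data, but the files in the destination folder are not renamed. Here is a sample of the output: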

> 2017-06-11 00:27:06 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.theurl.com/> {'image_urls':
> ['https://www.theurl.com/-a4Bj-ENjHOY/VyE1mGuJyUI/EAAAAAAAHMk/mw1_H-mEAc0QQEwp9UkTipxNCVR-xdbcgCLcB/s1600/Image%2B%25286%2525.jpg'],
> 'image_name': ['test'], 'title': ['test'], 'filename': ['test'],
> 'images': [{'url':
> 'https://www.theurl.com/-a4Bj-ENjHOY/VyE1mGuJyUI/EAAAAAAAHMk/mw1_H-mEAc0QQEwp9UkTipxNCVR-xdbcgCLcB/s1600/Image%2B%25286%2525.jpg',
> 'path': 'full/fcbec9bf940b48c248213abe5cd2fa1c690cb879.jpg',
> 'checksum': '7be30d939a7250cc318e6ef18a6b0981'}]}
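
For context, the `path` value in the output above looks like the default naming scheme of `ImagesPipeline`, which, as far as I understand it, hashes the request URL with SHA-1 and stores the result under `full/`. The snippet below is only a stdlib approximation of that behaviour, not Scrapy's actual code; the function name `default_image_path` and the example URL are made up for illustration.

```python
import hashlib


def default_image_path(url):
    """Approximate the default image path: 'full/<SHA-1 of the URL>.jpg'."""
    # Hash the URL string itself, not the downloaded image bytes.
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(image_guid)


if __name__ == "__main__":
    # Hypothetical URL, used only for illustration.
    print(default_image_path("https://www.example.com/images/photo.jpg"))
```

Because the name is derived from the URL alone, extra item fields such as `image_name` or `title` have no effect on it unless a custom pipeline overrides how the path is built.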

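So far I have tried many different solutions posted on Stack Overflow. There is no clear answer to this question for the latest version of Scrapy as of 2017, and nearly all of the proposed solutions appear to be outdated. I am using Scrapy 1.4 and Python 3.6.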

scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html
[settings]
default = rambopics.settings
[deploy]
#url = http://localhost:6800/
project = rambopics

items.py

import scrapy


class RambopicsItem(scrapy.Item):
    # defining items:
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()
    title = scrapy.Field()
    # pass -- don't really understand what pass is for

settings.py

BOT_NAME = 'rambopics'
SPIDER_MODULES = ['rambopics.spiders']
NEWSPIDER_MODULE = 'rambopics.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "W:/scraped/"

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class RambopicsPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        img_url = item['img_url']
        meta = {
            'filename': item['title'],
            'title': item['image_name']
        }
        yield Request(url=img_url, meta=meta)

rambopics.py (spider)

from rambopics.items import RambopicsItem
from scrapy.selector import Selector
import scrapy


class RambopicsSpider(scrapy.Spider):
    name = 'rambopics'
    allowed_domains = ['theurl.com']
    start_urls = ['http://www.theurl.com/']

    def parse(self, response):
        for sel in response.xpath('/html'):
            #img_name = sel.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            img_name = 'test'
            #img_title = sel.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            img_title = 'test'
            for elem in response.xpath("//div[contains(@class, 'entry-content')]"):
                img_url = elem.xpath("a/@href").extract_first()

                yield {
                    'image_urls': [img_url],
                    'image_name': [img_name],
                    'title': [img_title],
                    'filename': [img_name]
                }

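Note that I don't know which meta key is the right one for the final downloaded file name (I'm not sure whether it should be filename, image_name, or title).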

Use the file_path method to change the image name, like this:

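import scrapy
from scrapy.pipelines.files import FilesPipeline


class SaveImagesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        # Number the images of one item starting at 1.
        for i, image_url in enumerate(item['image_urls'], 1):
            # 'image_name' is the field name used by the item in the question.
            filename = '{}_{}.jpg'.format(item['image_name'], i)
            # Pass the chosen name along with the request so file_path can read it.
            yield scrapy.Request(image_url, meta={'filename': filename})

    def file_path(self, request, response=None, info=None):
        # Return the name chosen above instead of the default hashed path.
        return request.meta['filename']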
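
One detail to watch with the pipeline above: the spider in the question wraps every field value in a list (`'image_name': [img_name]`), so formatting the field directly would put the list's text, brackets included, into the file name. Here is a minimal stdlib-only sketch of a helper that could build the name instead. The function `make_image_filename` and its parameters are hypothetical, not part of Scrapy, and the sanitizing rule is just one reasonable choice.

```python
import os
import re
from urllib.parse import urlparse


def make_image_filename(name, index, url, default_ext=".jpg"):
    """Build a filesystem-safe name such as 'test_1.jpg'.

    `name` may be a string or a list, because the spider in the question
    wraps each field value in a list; only the first element is used.
    """
    if isinstance(name, (list, tuple)):
        name = name[0] if name else ""
    # Collapse every run of characters that are not word characters or '-'
    # into a single '_', then trim underscores from both ends.
    safe = re.sub(r"[^\w\-]+", "_", str(name)).strip("_") or "image"
    # Reuse the extension found in the URL path, if there is one.
    ext = os.path.splitext(urlparse(url).path)[1].lower() or default_ext
    return "{}_{}{}".format(safe, index, ext)


if __name__ == "__main__":
    # Hypothetical URL, used only for illustration.
    print(make_image_filename(["test"], 1, "https://www.example.com/pics/photo.jpg"))
```

Inside `get_media_requests` this would be called as `make_image_filename(item['image_name'], i, image_url)` and the result passed in `meta={'filename': ...}`. The key name in `meta` is arbitrary: it only has to match what `file_path` reads back. The custom class also has to be enabled in `ITEM_PIPELINES` in place of the stock `scrapy.pipelines.images.ImagesPipeline`, for example as `'rambopics.pipelines.SaveImagesPipeline': 1` (the module path is assumed from the project layout in the question). Since this class subclasses `FilesPipeline`, it presumably needs `FILES_STORE` set rather than `IMAGES_STORE`. If you subclass `ImagesPipeline` instead, images are re-encoded as JPEG as far as I know, so a fixed `.jpg` extension would be the safer choice there.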
