Scrapy - how to save downloaded images to both the filesystem and S3?



I'm using Scrapy to crawl pages and download images. Besides the filesystem, I also want to save the files to Amazon S3.

I have no problem configuring either one on its own, but does anyone know a way to configure both at the same time, so that Scrapy saves the files to a local folder and to AWS S3?

I came up with the following solution, which saves the same file twice from a single GET request. In `settings.py` I used these entries:

    ITEM_PIPELINES = {
        'project.pipelines.MyItemsPipeline': 300,
        'project.pipelines.DualSaveImagesPipeline': 310,
    }
    IMAGES_STORE = 's3://xxxxxxxxxxxxxxxx/'
    AWS_ENDPOINT_URL = 'https://xxx.xxxxxxxxxx.xxxxxxxx.com'
    AWS_ACCESS_KEY_ID = 'xxxxxxxxxxxx'
    AWS_SECRET_ACCESS_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
    IMAGES_STORE_SECONDARY = '/some/path/to/folder/'

Then comes my custom pipeline in `pipelines.py`, where I override a single method:

    from scrapy.utils.project import get_project_settings
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.utils.misc import md5sum

    project_settings = get_project_settings()

    class DualSaveImagesPipeline(ImagesPipeline):
        def __init__(self, store_uri, download_func=None, settings=None):
            super().__init__(store_uri, settings=settings,
                             download_func=download_func)
            # build a second store (local folder) next to the primary S3 store
            self.store_secondary = self._get_store(
                project_settings.get('IMAGES_STORE_SECONDARY'))

        def image_downloaded(self, response, request, info):
            checksum = None
            for path, image, buf in self.get_images(response, request, info):
                if checksum is None:
                    buf.seek(0)
                    checksum = md5sum(buf)
                width, height = image.size
                # persist once to the primary store (S3) ...
                self.store.persist_file(
                    path, buf, info,
                    meta={'width': width, 'height': height},
                    headers={'Content-Type': 'image/jpeg'})
                # ... and once more to the secondary store (local folder)
                self.store_secondary.persist_file(
                    path, buf, info,
                    meta={'width': width, 'height': height},
                    headers={'Content-Type': 'image/jpeg'})
            return checksum

That did the trick for me. I'm sharing it in case anyone comes across the same requirement in their project.
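For completeness, nothing changes on the spider side: the pipeline is driven by items carrying an `image_urls` list (the field name is configurable via `IMAGES_URLS_FIELD`). A minimal sketch:

    import scrapy

    class ImagesSpider(scrapy.Spider):
        name = 'images'
        start_urls = ['http://books.toscrape.com/']  # practice site, also used in the answer below

        def parse(self, response):
            for url in response.css('img::attr(src)').extract():
                # every entry in image_urls is downloaded once and persisted twice
                yield {'image_urls': [response.urljoin(url)]}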

I tested it only with two local folders, `first` and `second`, but it should work with other storage locations too.

Normally you can set only one value in `IMAGES_STORE`, so I created two pipeline classes with different settings. In `__init__` I replace `store_uri` to send each one to a different place.

    from scrapy.pipelines.images import ImagesPipeline

    class FirstImagesPipeline(ImagesPipeline):
        def __init__(self, store_uri, download_func=None, settings=None):
            store_uri = 'first'   # local folder which has to exist
            super().__init__(store_uri, download_func, settings)

    class SecondImagesPipeline(ImagesPipeline):
        def __init__(self, store_uri, download_func=None, settings=None):
            store_uri = 'second'  # local folder which has to exist
            #store_uri = 's3://bucket/images'
            super().__init__(store_uri, download_func, settings)

and use both of them at the same time:

    'ITEM_PIPELINES': {
        'FirstImagesPipeline': 1,
        'SecondImagesPipeline': 2,
    }

They save the same images into the two local folders, `first/full` and `second/full`.
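If one of the two copies should go to S3 rather than a local folder, the commented-out `s3://` URI from above should be all that changes, plus the usual AWS credential settings. A minimal sketch with a placeholder bucket and keys (Scrapy's S3 store needs botocore installed):

    from scrapy.crawler import CrawlerProcess
    from scrapy.pipelines.images import ImagesPipeline

    class SecondImagesPipeline(ImagesPipeline):
        def __init__(self, store_uri, download_func=None, settings=None):
            # send this pipeline's copy to S3 instead of the local `second` folder
            store_uri = 's3://bucket/images'  # placeholder bucket/prefix
            super().__init__(store_uri, download_func, settings)

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # placeholder credentials, picked up by Scrapy's S3 files store
        'AWS_ACCESS_KEY_ID': 'xxxxxxxxxxxx',
        'AWS_SECRET_ACCESS_KEY': 'xxxxxxxxxxxxxxxx',
        'ITEM_PIPELINES': {
            '__main__.SecondImagesPipeline': 1,
            # add '__main__.FirstImagesPipeline' as above for the local copy
        },
    })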


BTW: in the usage example in the documentation I found that I can set different settings for different pipelines by using the pipeline name as a prefix, `FIRSTIMAGESPIPELINE_` and `SECONDIMAGESPIPELINE_`:

    FIRSTIMAGESPIPELINE_IMAGES_URLS_FIELD = ...
    SECONDIMAGESPIPELINE_IMAGES_URLS_FIELD = ...
    FIRSTIMAGESPIPELINE_FILES_EXPIRES = ...
    SECONDIMAGESPIPELINE_FILES_EXPIRES = ...

but this doesn't seem to work for `IMAGES_STORE`.
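One possible workaround, a sketch only: give each pipeline its own store by overriding `from_settings` and reading an invented setting name (`FIRST_IMAGES_STORE` below is not a real Scrapy setting), falling back to the shared `IMAGES_STORE`. This relies on Scrapy versions where media pipelines are still instantiated through `from_settings`:

    from scrapy.pipelines.images import ImagesPipeline

    class FirstImagesPipeline(ImagesPipeline):

        @classmethod
        def from_settings(cls, settings):
            # FIRST_IMAGES_STORE is a made-up per-pipeline setting name;
            # use the global IMAGES_STORE when it is not defined
            store_uri = settings.get('FIRST_IMAGES_STORE') or settings['IMAGES_STORE']
            return cls(store_uri, settings=settings)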


Minimal working code which you can put in a single file and run as `python script.py`.

It downloads images from http://books.toscrape.com/, a page created by Scrapy's authors as a playground for learning web scraping.

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline

    class MySpider(scrapy.Spider):
        name = 'myspider'

        # see the pages created for scraping practice: http://toscrape.com/
        start_urls = ['http://books.toscrape.com/'] #'http://quotes.toscrape.com']

        def parse(self, response):
            print('url:', response.url)
            # download images and convert them to JPG (even if they are already JPG)
            for url in response.css('img::attr(src)').extract():
                url = response.urljoin(url)
                yield {'image_urls': [url], 'session_path': 'hello_world'}

    class FirstImagesPipeline(ImagesPipeline):
        def __init__(self, store_uri, download_func=None, settings=None):
            #print('FirstImagesPipeline:', store_uri)
            print('FirstImagesPipeline:', settings)
            store_uri = 'first'
            super().__init__(store_uri, download_func, settings)

    class SecondImagesPipeline(ImagesPipeline):
        def __init__(self, store_uri, download_func=None, settings=None):
            #print('SecondImagesPipeline:', store_uri)
            store_uri = 'second'
            #store_uri = 's3://bucket/images'
            super().__init__(store_uri, download_func, settings)

    # --- run without a project ---
    from scrapy.crawler import CrawlerProcess

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # download images to `<store>/full` (standard subfolder) and convert them to JPG
        # it needs `yield {'image_urls': [url]}` in `parse()` plus ITEM_PIPELINES to work
        'ITEM_PIPELINES': {
            '__main__.FirstImagesPipeline': 1,   # pipelines defined in this file (hence `__main__.`)
            '__main__.SecondImagesPipeline': 2,
        },
        #'IMAGES_STORE': 'test',  # normally required; the folder has to exist before downloading
    })
    c.crawl(MySpider)
    c.start()
EDIT: You can always use the standard `ImagesPipeline` instead of one of the modified ones:
    'ITEM_PIPELINES': {
        'scrapy.pipelines.images.ImagesPipeline': 1,  # standard ImagesPipeline (full import path)
        '__main__.SecondImagesPipeline': 2,           # modified ImagesPipeline
    }

    IMAGES_STORE = 'first'  # setting used by the standard ImagesPipeline
