Scrapy: enabling the files pipeline for both absolute and relative paths?

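Question: What is missing from my code (see the "Current code" section below) that would let me use Scrapy to download files from both absolute and relative paths? I would appreciate any help. I'm lost as to how all of these components work together and how to get the behavior I want.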

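Background: I have combed through the Scrapy documentation, looked for similar examples on GitHub, and trawled StackOverflow for answers, but I can't get the Scrapy files pipeline to work the way I want. The target websites are fairly basic and contain many files, mostly PDFs and JPGs, linked as absolute or relative paths under a href or img src selectors. I want to download all of these files. My understanding is that response.follow will follow both relative and absolute paths, but I'm not sure whether it always produces a path that the files pipeline can download. I figured out how to crawl both absolute and relative paths thanks to the answers to my previous question.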

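Problems encountered: There are two main problems. First, I can't seem to get the spider to follow both absolute and relative paths. Second, I can't seem to get the files pipeline to actually download files. This is most likely because I don't understand how the four .py files work together. If someone could offer some basic observations and guidance, I'm sure I can get past this basic go/no-go point and start layering on more sophisticated functionality.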

Current code: Below are the relevant contents of myspider.py, items.py, pipelines.py and settings.py.

myspider.py: Note that the parse_items function is incomplete, but I don't understand what the function should contain.

from scrapy import Spider
from ..items import MyspiderItem

# Using response.follow for different xpaths
class MySpider(Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # Standard link extractor
    def parse_all(self, response):
        # follow <a href> selector
        for href in response.xpath('//a/@href'):
            yield response.follow(href, self.parse_items)
        # follow <img src> selector
        for img in response.xpath('//img/@src'):
            yield response.follow(img, self.parse_items)

    # This is where I get lost
    def parse_items(self, response):
        # trying to define item for items pipeline
        MyspiderItem.item['file_urls']=[]

items.py

import scrapy

class MyspiderItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

settings.py: Below is the relevant part that enables the files pipeline.

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/home/me/Scraping/myspider/Downloads'

pipelines.py:

class MyspiderPipeline(object):
    def process_item(self, item, spider):
        return item

I think something is wrong in your spider, myspider.py.

parse_all() is probably misnamed: since you haven't defined start_requests() in the spider and pointed it at parse_all(), Scrapy falls back to its default callback, which is parse().

I think you should rename parse_all() to parse().

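As for the absolute/relative path problem: one trick is to note the site's asset path. If a link contains that path (probably in the form http://domain/...), it should be an absolute link. For relative paths, you can manually prepend the asset path to them and then handle the download.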
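Rather than concatenating strings by hand, the joining step can be sketched with the standard library's urljoin, which is a swapped-in technique and not necessarily what the answer had in mind. The page URL and the links below are made-up examples, and to_absolute is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlparse


def is_absolute(link):
    """Return True when the link carries its own scheme and host."""
    parts = urlparse(link)
    return bool(parts.scheme and parts.netloc)


def to_absolute(page_url, link):
    """Resolve a link found on page_url; absolute links come back unchanged."""
    return urljoin(page_url, link)


# Hypothetical page and links, for illustration only.
page_url = "http://www.example.com/docs/index.html"
links = [
    "report.pdf",
    "/images/photo.jpg",
    "../files/manual.pdf",
    "http://www.example.com/a/b.pdf",
]
resolved = [to_absolute(page_url, link) for link in links]
```

Inside a Scrapy callback, response.urljoin(link) performs the same resolution against the URL of the current response.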

Another trick for detecting whether a link points to a downloadable file: the URL usually contains the file extension, e.g. .pdf, .jpg ...
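The extension check could be sketched as follows. The FILE_EXTENSIONS set and the looks_like_file name are assumptions for illustration; the check looks only at the path component, so a query string after the file name does not hide the extension.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: the file types of interest, taken from the examples in the question.
FILE_EXTENSIONS = {".pdf", ".jpg"}


def looks_like_file(url):
    """Guess from the URL path's extension whether the link is a downloadable file."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in FILE_EXTENSIONS
```

This is only a heuristic: a file served from a URL without an extension will be missed, so checking the Content-Type of the response may be needed for such sites.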
