How do I combine a custom parsing method with TakeFirst() in Scrapy's ItemLoader?



I am refactoring a spider I wrote to scrape APK download pages such as http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/. This is the spider so far:

DEBUG = True
import scrapy
from scrapy.spiders import SitemapSpider
from apkmirror_scraper.items import ApkmirrorScraperItem, ApkmirrorItemLoader

class ApkmirrorSitemapSpider(SitemapSpider):
    name = 'apkmirror-spider'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]
    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20,
                           'CLOSESPIDER_ERRORCOUNT': 0,
                           'CONCURRENT_REQUESTS': 16,
                           'CONCURRENT_REQUESTS_PER_DOMAIN': 8}
    def parse(self, response):
        l = ApkmirrorItemLoader(item=ApkmirrorScraperItem(), response=response)
        l.add_value('url', response.url)
        l.add_xpath(field_name='title', xpath='//h1[@title]/text()')
        l.add_xpath(field_name='developer', xpath='//h3[@title]/a/text()')
        l.add_xpath(field_name='app', xpath='//*[contains(@data-channel-name, "App Updates")]/@data-channel-name')
        return l.load_item()

I am trying to move the processing and parsing of the item fields into items.py:

import re
import scrapy
import scrapy.loader
from scrapy.loader.processors import MapCompose, TakeFirst
class ApkmirrorScraperItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    developer = scrapy.Field()
    app = scrapy.Field()

def parse_app(data_channel_name):
    '''Parse the name of the app from the "data-channel-name" attribute of the button named "Follow [app_name] Updates".'''
    pattern = re.compile(r'(?P<app>.+) App Updates')
    return pattern.search(data_channel_name).groupdict().get("app")
class ApkmirrorItemLoader(scrapy.loader.ItemLoader):
    url_out = TakeFirst()
    title_in = MapCompose(unicode.strip)
    title_out = TakeFirst()
    developer_in = MapCompose(unicode.strip)
    developer_out = TakeFirst()
    app_out = MapCompose(parse_app)

Currently, when I run the spider, it scrapes items like this:

2017-04-24 19:30:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap5.xml)
2017-04-24 19:30:57 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/>
{'app': [u'Adobe Photoshop Mix'],
 'developer': u'Adobe',
 'title': u'Adobe Photoshop Mix 1.0.333 beta (arm)',
 'url': 'http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/'}

Note that the 'app' field is still a list, to which I would still like to apply Scrapy's TakeFirst() processor. However, if I change the relevant line to

app_out = MapCompose(parse_app, TakeFirst())

I get items that look like this:

2017-04-24 19:44:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap12.xml)
2017-04-24 19:44:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/>
{'app': [u'M'],
 'developer': u'Microsoft Corporation',
 'title': u'Microsoft PowerPoint 16.0.6228.1008 (arm)',
 'url': 'http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/'}

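with 'app' being 'M' instead of 'Microsoft PowerPoint'. In other words, TakeFirst() seems to take the first letter of the string in the list rather than the first item of the list. If I switch the order to MapCompose(TakeFirst(), parse_app), I get errors like
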
2017-04-24 19:49:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/google-inc/google/google-6-8-0-107974459-release/google-6-8-0-107974459-android-4-0-3-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap13.xml)
2017-04-24 19:49:15 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.apkmirror.com/apk/google-inc/google/google-6-8-0-107974459-release/google-6-8-0-107974459-android-4-0-3-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap13.xml)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/kurt/dev/apkmirror_scraper/apkmirror_scraper/spiders/sitemap_spider.py", line 43, in parse
    return l.load_item()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/loader/__init__.py", line 115, in load_item
    value = self.get_output_value(field_name)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/loader/__init__.py", line 128, in get_output_value
    (field_name, self._values[field_name], type(e).__name__, str(e)))
ValueError: Error with output processor: field='app' value=[u'Google+ App Updates'] error='AttributeError: 'NoneType' object has no attribute 'groupdict''

In other words, the parse_app method fails.
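
The AttributeError can be reproduced outside Scrapy. This is a minimal sketch, assuming that MapCompose hands each string to TakeFirst() individually, so that parse_app ends up receiving only the first character ('G') rather than the full attribute value. The input strings are taken from the log above; the variable names are made up for illustration, and the code is written to run under Python 3.

```python
import re


def parse_app(data_channel_name):
    # Same function as in items.py above.
    pattern = re.compile(r'(?P<app>.+) App Updates')
    return pattern.search(data_channel_name).groupdict().get("app")


# The full attribute value matches the pattern.
full_value = parse_app('Google+ App Updates')

# A single character cannot match, so search() returns None and
# calling .groupdict() on None raises AttributeError.
try:
    parse_app('G')
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(full_value, error_name)
```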

How can I incorporate TakeFirst() into the ItemLoader?

I achieved the desired result by using the custom parsing method as the input processor and TakeFirst() as the output processor:

app_in = MapCompose(parse_app)
app_out = TakeFirst()

The scraped items now look like this:

2017-04-24 19:55:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap12.xml)
2017-04-24 19:55:12 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/>
{'app': u'Microsoft Excel',
 'developer': u'Microsoft Corporation',
 'title': u'Microsoft Excel 16.0.6228.1008 (arm)',
 'url': 'http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/'}

with the full name of the app.
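
Why the split into input and output processors helps can be illustrated without Scrapy. The sketch below uses simplified stand-ins, `map_compose` and `take_first`. Both are hypothetical helpers written for this illustration and are not Scrapy's implementation. They only mimic the documented behaviour: MapCompose applies each function to every element of the list, while TakeFirst returns the first non-null, non-empty value of whatever iterable it is given. The code is written to run under Python 3.

```python
import re


def parse_app(data_channel_name):
    # Same function as in items.py above.
    pattern = re.compile(r'(?P<app>.+) App Updates')
    return pattern.search(data_channel_name).groupdict().get("app")


def take_first(values):
    # Stand-in for TakeFirst(): first value that is neither None nor ''.
    for value in values:
        if value is not None and value != '':
            return value


def map_compose(*functions):
    # Stand-in for MapCompose(): each function is applied to every
    # element of the list, and the results form the next list.
    def processor(values):
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                if result is None:
                    continue
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)
                else:
                    next_values.append(result)
            values = next_values
        return values
    return processor


collected = ['Microsoft PowerPoint App Updates']

# take_first runs on each string, so it iterates over its characters.
failed_attempt = map_compose(parse_app, take_first)(collected)

# Input processor per element, output processor on the whole list.
stored = map_compose(parse_app)(collected)
working = take_first(stored)

print(failed_attempt, working)
```

Scrapy also provides a Compose processor, which passes the whole list to the first function instead of mapping over it, so something like `app_out = Compose(TakeFirst(), parse_app)` might be an alternative. That variant was not tried here.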
