I'm refactoring a spider I wrote to scrape APK download pages, for example http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/. So far, this is the spider:
DEBUG = True

import scrapy
from scrapy.spiders import SitemapSpider
from apkmirror_scraper.items import ApkmirrorScraperItem, ApkmirrorItemLoader


class ApkmirrorSitemapSpider(SitemapSpider):
    name = 'apkmirror-spider'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20,
                           'CLOSESPIDER_ERRORCOUNT': 0,
                           'CONCURRENT_REQUESTS': 16,
                           'CONCURRENT_REQUESTS_PER_DOMAIN': 8}

    def parse(self, response):
        l = ApkmirrorItemLoader(item=ApkmirrorScraperItem(), response=response)
        l.add_value('url', response.url)
        l.add_xpath(field_name='title', xpath='//h1[@title]/text()')
        l.add_xpath(field_name='developer', xpath='//h3[@title]/a/text()')
        l.add_xpath(field_name='app', xpath='//*[contains(@data-channel-name, "App Updates")]/@data-channel-name')
        return l.load_item()
I'm trying to move the processing and parsing of the item fields into items.py:
import re
import scrapy
import scrapy.loader
from scrapy.loader.processors import MapCompose, TakeFirst


class ApkmirrorScraperItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    developer = scrapy.Field()
    app = scrapy.Field()


def parse_app(data_channel_name):
    '''Parse the name of the app from the "data-channel-name" attribute of the button named "Follow [app_name] Updates".'''
    pattern = re.compile(r'(?P<app>.+) App Updates')
    return pattern.search(data_channel_name).groupdict().get("app")


class ApkmirrorItemLoader(scrapy.loader.ItemLoader):
    url_out = TakeFirst()
    title_in = MapCompose(unicode.strip)
    title_out = TakeFirst()
    developer_in = MapCompose(unicode.strip)
    developer_out = TakeFirst()
    app_out = MapCompose(parse_app)
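As a quick standalone illustration of what parse_app does with a single attribute value, here is a minimal sketch. The sample strings are assumptions modelled on the "[app name] App Updates" format of the data-channel-name attribute, not values copied from the site. It also shows that the function raises an AttributeError when the input does not contain " App Updates", because search() then returns None.

```python
import re


def parse_app(data_channel_name):
    """Same logic as parse_app in items.py: capture the text before ' App Updates'."""
    pattern = re.compile(r'(?P<app>.+) App Updates')
    return pattern.search(data_channel_name).groupdict().get("app")


# A well-formed attribute value (assumed sample) yields the app name.
name = parse_app('Adobe Photoshop Mix App Updates')

# A value without ' App Updates' makes search() return None,
# so calling .groupdict() on it raises AttributeError.
try:
    parse_app('G')
    raised = False
except AttributeError:
    raised = True
```

So parse_app expects the whole attribute string as its argument, and anything shorter or differently shaped will not match.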
Currently, if I run the spider, it scrapes items like this:
2017-04-24 19:30:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap5.xml)
2017-04-24 19:30:57 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/>
{'app': [u'Adobe Photoshop Mix'],
'developer': u'Adobe',
'title': u'Adobe Photoshop Mix 1.0.333 beta (arm)',
'url': 'http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/'}
Note that the 'app' field is still a list, to which I would still like to apply Scrapy's TakeFirst() processor. However, if I try changing the relevant line to
app_out = MapCompose(parse_app, TakeFirst())
I get items that look like this:
2017-04-24 19:44:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap12.xml)
2017-04-24 19:44:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/>
{'app': [u'M'],
'developer': u'Microsoft Corporation',
'title': u'Microsoft PowerPoint 16.0.6228.1008 (arm)',
'url': 'http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/'}
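Here app is 'M' instead of 'Microsoft PowerPoint'. In other words, TakeFirst() seems to be taking the first letter of the string in the list, rather than the first item of the list. If I try switching the order to MapCompose(TakeFirst(), parse_app), then I get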
2017-04-24 19:49:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/google-inc/google/google-6-8-0-107974459-release/google-6-8-0-107974459-android-4-0-3-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap13.xml)
2017-04-24 19:49:15 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.apkmirror.com/apk/google-inc/google/google-6-8-0-107974459-release/google-6-8-0-107974459-android-4-0-3-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap13.xml)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/kurt/dev/apkmirror_scraper/apkmirror_scraper/spiders/sitemap_spider.py", line 43, in parse
return l.load_item()
File "/usr/local/lib/python2.7/dist-packages/scrapy/loader/__init__.py", line 115, in load_item
value = self.get_output_value(field_name)
File "/usr/local/lib/python2.7/dist-packages/scrapy/loader/__init__.py", line 128, in get_output_value
(field_name, self._values[field_name], type(e).__name__, str(e)))
ValueError: Error with output processor: field='app' value=[u'Google+ App Updates'] error='AttributeError: 'NoneType' object has no attribute 'groupdict''
In other words, the parse_app function fails.
How can I incorporate TakeFirst() into the ItemLoader?
I achieved the desired result by using the custom parsing function as the input processor and TakeFirst() as the output processor:
app_in = MapCompose(parse_app)
app_out = TakeFirst()
The scraped fields now look like
2017-04-24 19:55:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap12.xml)
2017-04-24 19:55:12 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/>
{'app': u'Microsoft Excel',
'developer': u'Microsoft Corporation',
'title': u'Microsoft Excel 16.0.6228.1008 (arm)',
'url': 'http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/'}
with the full name of the app.
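A likely explanation for the earlier results is that MapCompose applies each function to every value in the list individually, while TakeFirst expects to receive the whole list. The sketch below reproduces that behaviour without Scrapy. Note that map_compose and take_first are simplified, hypothetical stand-ins written for this illustration, not Scrapy's actual implementations, and the input strings are assumed sample values.

```python
import re


def take_first(values):
    """Stand-in for TakeFirst: return the first non-None, non-empty element."""
    for value in values:
        if value is not None and value != '':
            return value


def map_compose(*functions):
    """Stand-in for MapCompose: run each function on every value, one at a time."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # None results are dropped from the chain
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process


def parse_app(data_channel_name):
    """Same logic as parse_app in items.py."""
    pattern = re.compile(r'(?P<app>.+) App Updates')
    return pattern.search(data_channel_name).groupdict().get("app")


# parse_app first, then take_first on each *string*: iterating a string
# yields characters, so only the first letter survives.
swapped = map_compose(parse_app, take_first)(['Microsoft PowerPoint App Updates'])

# take_first first: parse_app receives a single character, the regex does
# not match, and .groupdict() is called on None.
try:
    map_compose(take_first, parse_app)(['Google+ App Updates'])
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Per-value parsing as the input step, take_first on the whole list as the output step.
fixed = take_first(map_compose(parse_app)(['Microsoft Excel App Updates']))
```

Under these assumptions, swapped is a list holding only the first letter, the reversed order fails inside parse_app, and the split into input and output processors gives the full app name as a plain string.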