I'm new to dynamic scraper and used the following example to learn open_news. I have set everything up, but it keeps showing the same error: dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.
2015-11-20 18:45:11+0000 [article_spider] ERROR: Spider error processing <GET https://en.wikinews.org/wiki/Main_page>
Traceback (most recent call last):
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 645, in _tick
taskObj._oneWorkUnit()
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 491, in _oneWorkUnit
result = next(self._iterator)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
yield next(it)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
for x in result:
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/spiders/django_spider.py", line 378, in parse
rpt = self.scraper.get_rpt_for_scraped_obj_attr(url_elem.scraped_obj_attr)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/models.py", line 98, in get_rpt_for_scraped_obj_attr
return self.requestpagetype_set.get(scraped_obj_attr=soa)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/manager.py", line 127, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/query.py", line 334, in get
self.model._meta.object_name
dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.
This is caused by missing "REQUEST PAGE TYPES". Every entry under "SCRAPER ELEMS" must have its own "REQUEST PAGE TYPE".
To solve this problem, follow these steps:
- Log into the admin page (usually http://localhost:8000/admin/)
- Go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article)
- Click "Add another Request page type" under "REQUEST PAGE TYPES"
- Create one "Request page type" for each of "(base (Article))", "(title (Article))", "(description (Article))" and "(url (Article))"
"请求页面类型"设置
所有"内容类型"均为"HTML"
所有"请求类型"均为"请求"
所有"方法"都是"获取"
对于"页面类型",只需像一样按顺序分配即可
(基础(文章))|主页
(标题(文章))|详细信息第1页
(描述(文章)|详细信息第2页
(url(文章))|详细信息第3页
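If you prefer scripting this over clicking through the admin, the Django shell sketch below creates the same four entries. It is only a sketch: the field names and choice codes ('MP', 'DP1', content_type='H', and so on) are assumptions inferred from the admin labels and the "Calling DP2 URL" log message mentioned below, so verify them against the dynamic_scraper version you have installed.

# Hypothetical sketch for "python manage.py shell"; field names and choice
# codes are assumptions based on the admin labels, not a confirmed API.
from dynamic_scraper.models import Scraper, ScrapedObjAttr, RequestPageType

scraper = Scraper.objects.get(name='Wikinews Scraper (Article)')

# One page type per scraped object attribute, assigned in order.
page_types = {
    'base': 'MP',          # Main Page
    'title': 'DP1',        # Detail Page 1
    'description': 'DP2',  # Detail Page 2
    'url': 'DP3',          # Detail Page 3
}

for attr_name, code in page_types.items():
    soa = ScrapedObjAttr.objects.get(
        name=attr_name, obj_class=scraper.scraped_obj_class)
    RequestPageType.objects.get_or_create(
        scraper=scraper,
        scraped_obj_attr=soa,
        page_type=code,     # assumed choice codes: 'MP', 'DP1', ...
        content_type='H',   # HTML
        request_type='R',   # Request
        method='GET',
    )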
After the steps above, the "DoesNotExist: RequestPageType" error should be fixed.
However, a new error will appear: "ERROR: Mandatory elem title missing!"
To solve this one, I suggest you change all the "Request page type" values in "SCRAPER ELEMS" to "Main Page", including the one for "title (Article)".
Then change the XPaths as follows:
(base (Article))        | //td[@class="l_box"]
(title (Article))       | span[@class="l_title"]/a/@title
(description (Article)) | p/span[@class="l_summary"]/text()
(url (Article))         | span[@class="l_title"]/a/@href
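Before re-running the spider, you can sanity-check these XPaths offline. The snippet below is a minimal sketch using lxml against a made-up HTML fragment that mimics the l_box structure; the real Wikinews markup will differ, so treat it purely as a way to test the XPath mechanics.

# Standalone XPath check; the HTML below is an illustrative fragment,
# not the real Wikinews main page markup.
from lxml import html

snippet = '''
<table><tr><td class="l_box">
  <span class="l_title">
    <a href="/wiki/Some_article" title="Some article">Some article</a>
  </span>
  <p><span class="l_summary">A short summary.</span></p>
</td></tr></table>
'''

tree = html.fromstring(snippet)
for base in tree.xpath('//td[@class="l_box"]'):              # base (Article)
    print(base.xpath('span[@class="l_title"]/a/@title'))     # title
    print(base.xpath('p/span[@class="l_summary"]/text()'))   # description
    print(base.xpath('span[@class="l_title"]/a/@href'))      # url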
After all that, run scrapy crawl article_spider -a id=1 -a do_action=yes at the command prompt. You should now be able to scrape "Articles", and you can view them from Home › Open_News › Articles.
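Besides the admin listing, you can also check the results from the Django shell. This assumes the open_news example app exposes an Article model with a title field; adjust to whatever your models.py actually defines.

# Hypothetical check; assumes open_news defines Article with a title field.
from open_news.models import Article

print(Article.objects.count())
for article in Article.objects.all()[:5]:
    print(article.title)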
Enjoy~
I may be late, but I hope my solution helps anyone who comes across this later.
@alan nala's solution works well. However, it essentially skips detail page scraping.
Here is how to make the most of detail page scraping.
First, go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article) and add it to the "REQUEST PAGE TYPES".
Second, make sure the elements under "SCRAPER ELEMS" are set up like this.
Now you can run the manual scraping command according to the documentation:
scrapy crawl article_spider -a id=1 -a do_action=yes
Well, you might run into the error @alan nala mentioned:
"ERROR: Mandatory elem title missing"
Note from the error screenshot that, in my case, there was a message indicating the script was "Calling DP2 URL".
Finally, go back to "SCRAPER ELEMS" and change the "Request page type" of the element "title (Article)" to "Detail Page 2" instead of "Detail Page 1".
Save your settings, then run the scrapy command again.
Note: your "Detail Page #" might differ.
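To see which page type each elem is actually bound to (and thus which "Detail Page #" applies in your setup), a quick shell sketch like this can help; scraperelem_set and request_page_type are assumed attribute names based on the admin labels.

# Sketch for inspecting elem-to-page-type bindings; attribute names are
# assumptions inferred from the admin UI, so verify against your models.
from dynamic_scraper.models import Scraper

scraper = Scraper.objects.get(name='Wikinews Scraper (Article)')
for elem in scraper.scraperelem_set.all():
    print('%s -> %s' % (elem.scraped_obj_attr, elem.request_page_type))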
By the way, I have also put together a short tutorial hosted on GitHub, in case you need more details on this topic.