Scrapy cannot scrape a second page using ItemLoader



UPDATE: 7/29, 9:29 pm: After reading this post, I updated my code.

UPDATE: 7/28/15, 7:35 pm: Following Martin's suggestion, the message changed, but items are still neither listed nor written to the database.

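ORIGINAL: I can successfully scrape a single page (the base page). Now I am trying to scrape one more item from another URL found on the "base" page, using a Request with a callback, but it does not work. Here is the spider: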

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem
from CAPjobs.items import CAPjobsItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join


class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
        "http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"]

    def parse_subpage(self, response):
        il = response.meta['il']
        il.add_xpath('loc_pj', '//div[@id="extranav"]/div/dl/dd[2]/ul/li/text()')
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')
        for site in sites:
            il = CAPjobsItemLoader(CAPjobsItem(), selector=site)
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)

Scraping now partially works, but the loc_pj field is missing from the item: (updated July 29, 7:35 pm)

2015-07-29 21:28:24 [scrapy] DEBUG: Scraped from <200 http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000>
{'post_date': u'21 days ago',
'title': u'Assistant, Associate, Full (HS Clinical, Clin X) - Anatomic Pathology/Cytopathology (11-000)',
'web_url': u'http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000'}
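
One possible explanation for the missing loc_pj, not confirmed anywhere in this thread: the loader is created with selector=site on the listing page, and add_xpath evaluates against the loader's own selector, so calling it in the callback still queries the listing page rather than the new response. The sketch below is a toy model of that behaviour using only the standard library. ToyLoader, LISTING, DETAIL and the "Boston" value are hypothetical and are not Scrapy code or real page data.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the listing page and the job detail page.
LISTING = ("<html><body><div class='job-details'>"
           "<h3><a href='/job/1'>Pathologist</a></h3></div></body></html>")
DETAIL = ("<html><body><div id='extranav'>"
          "<ul><li>Boston</li></ul></div></body></html>")


class ToyLoader:
    """Toy model of a loader bound to one node; NOT Scrapy's ItemLoader."""

    def __init__(self, item, node):
        self.item = item
        self.node = node  # every query is evaluated against this node

    def add_path(self, field, path):
        values = [el.text for el in self.node.findall(path)]
        if values:  # like a loader, skip fields that collected nothing
            self.item[field] = values

    def load_item(self):
        return self.item


# First page: bind the loader to one job block and fill what is available.
listing = ET.fromstring(LISTING)
job = listing.find(".//div[@class='job-details']")
loader = ToyLoader({}, job)
loader.add_path("title", "h3/a")

# Second page arrives, but the old loader is still bound to the listing node,
# so the detail-page path matches nothing and loc_pj is never set.
detail = ET.fromstring(DETAIL)
loader.add_path("loc_pj", ".//div[@id='extranav']/ul/li")
stale = dict(loader.load_item())

# Rebinding: carry the partial item into a new loader bound to the new page.
fresh = ToyLoader(dict(stale), detail)
fresh.add_path("loc_pj", ".//div[@id='extranav']/ul/li")
print(stale)
print(fresh.load_item())
```

If that is what happens here, the Scrapy equivalent of the rebinding step would presumably be to build a new loader inside parse_subpage from the partially loaded item and response=response, instead of reusing the loader passed through meta.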

You initialize the ItemLoader like this:

il = CAPjobsItemLoader(CAPjobsItem, sites)

The documentation does it like this:

l = ItemLoader(item=Product(), response=response)

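So I think you are missing the parentheses after CAPjobsItem: the loader needs an item instance, not the class itself. Your line should read: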

il = CAPjobsItemLoader(CAPjobsItem(), sites)
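
To see why the parentheses matter, here is a minimal sketch in plain Python. JobItem and ToyLoader are hypothetical stand-ins, not the real CAPjobsItem or Scrapy's ItemLoader, and the exact error Scrapy raises may differ. The point is only that a class object cannot hold field values the way an instance can.

```python
class JobItem(dict):
    """Hypothetical stand-in for CAPjobsItem: a dict-like container."""


class ToyLoader:
    """Hypothetical toy loader; NOT Scrapy's ItemLoader implementation."""

    def __init__(self, item):
        self.item = item

    def add_value(self, field, value):
        # Item assignment needs an instance; a bare class does not support it.
        self.item[field] = value

    def load_item(self):
        return self.item


# Passing an instance works: the value is stored on that one item.
good = ToyLoader(JobItem())
good.add_value("title", "Pathologist")
loaded = good.load_item()

# Passing the class itself fails as soon as a field is assigned.
bad = ToyLoader(JobItem)
try:
    bad.add_value("title", "Pathologist")
    error = None
except TypeError as exc:
    error = exc

print(loaded)
print(type(error).__name__)
```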
