No data scraped after a recursive crawl



I'm trying to recursively scrape job titles from https://iowacity.craigslist.org/search/jjj. That is, I want the spider to scrape all the job titles on page 1, then follow the "next >" link at the bottom to scrape the next page, and so on. I modeled my spider on Michael Herman's tutorial: http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/

Here is my code:
import scrapy
from craig_rec.items import CraigRecItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class CraigslistSpider(CrawlSpider):
    name = "craig_rec"
    allowed_domains = ["https://craigslist.org"]
    start_urls = ["https://iowacity.craigslist.org/search/jjj"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        for sel in response.xpath("//span[@class = 'pl']"):
            item = CraigRecItem()
            item['title'] = sel.xpath("a/text()").extract()
            items.append(item)
        return items

I ran the spider, but no data was scraped. Any help? Thanks!

When you set allowed_domains to "https://craigslist.org", the crawl stops because requests to the subdomain 'iowacity.craigslist.org' get filtered as offsite.

It must be set to:

allowed_domains = ["craigslist.org"]

According to the docs, allowed_domains is a list of strings containing the domains this spider is allowed to crawl. It expects the format domain.com, which lets the spider crawl the domain itself and all of its subdomains.

You can also restrict the crawl to only a few specific subdomains, or allow all requests by leaving the attribute empty, as sketched below.
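For illustration, the variants look like this (the specific subdomain below is just an example value, not from the original post):

# Bare domain: matches craigslist.org and every subdomain of it.
allowed_domains = ["craigslist.org"]

# Specific subdomains only (example value):
allowed_domains = ["iowacity.craigslist.org"]

# Empty (or omitted entirely): no offsite filtering at all.
allowed_domains = []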

Michael Herman's tutorial is great, but it targets an older version of Scrapy. This code avoids some deprecation warnings and turns parse_page into a generator:

import scrapy
from craig_rec.items import CraigRecItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CraiglistSpider(CrawlSpider):
    name = "craiglist"
    # Bare domain, no scheme: matches craigslist.org and all subdomains.
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'https://iowacity.craigslist.org/search/jjj/',
    )
    rules = (
        # Follow the "next >" pagination button and parse each page it leads to.
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_page", follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider applies rule callbacks only to followed links, not to
        # the start URL itself, so parse the first results page explicitly.
        return self.parse_page(response)

    def parse_page(self, response):
        # Each job title is the link text inside a <span class="pl">.
        for sel in response.xpath("//span[@class = 'pl']"):
            item = CraigRecItem()
            item['title'] = sel.xpath(".//a/text()").extract()
            yield item
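The spider imports CraigRecItem from craig_rec.items, a file the post never shows. A minimal sketch of it, assuming only the title field used above, would be:

# craig_rec/items.py -- minimal item definition assumed by the spider above.
import scrapy

class CraigRecItem(scrapy.Item):
    title = scrapy.Field()  # the only field parse_page populates

With that in place, you can run the spider and export the scraped titles through Scrapy's built-in feed exports, e.g. scrapy crawl craiglist -o titles.json.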

This post also has some good tips on scraping Craigslist.
